
How AI could humanise robots

Edward Johns
Director of the Robot Learning Lab at Imperial College London
Key takeaways
  • Large language models (LLMs) and vision-language models will have a major impact on the future of robotics.
  • Robots can now communicate in natural language, break tasks down into steps and reason using images.
  • However, LLMs do not effectively enable robots to manipulate their environment with their hands or interact with a 3D environment.
  • There is potential for developing robotics using generative AI, such as enabling robots to reason in video and in action.

Watching videos released by robotics companies like Tesla and Figure, it could seem like robots will walk into our homes tomorrow, able to execute any command a human gives them thanks to advancements in large language models (LLMs). That may be coming down the pike, but there are some substantial hurdles to overcome first, says Edward Johns, director of the Robot Learning Lab at Imperial College London.

We have seen stratospheric advances in the field of large language models. Is that going to boost robotics forward?

Edward Johns. What has happened with large neural networks like language models and vision-language models will have a big impact on robotics; it's already helping with some of the challenges we've had. But we're certainly not going to see a ChatGPT-like moment in robotics overnight.

LLMs enable operators to use natural language when communicating with the robot, rather than inputting code. That's useful because, ultimately, that's how we want humans to interact with them. More importantly, these models can unlock a new way of reasoning for robots. ChatGPT, for instance, can break tasks down into steps. If you ask it how to make a sandwich, it will say: you need bread, you need to buy bread, you need to find a shop, get your wallet, leave the house, etc. That means robots can learn to break down tasks internally, and we know they perform better when they have a step-by-step guide.
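To make this concrete, here is a minimal sketch of prompt-based task decomposition. The model call is stubbed with a canned reply so the example is self-contained; a real planner would query an LLM API, and the function names and step format here are illustrative assumptions, not a description of any particular company's system.

```python
# Minimal sketch of LLM-style task decomposition for a robot planner.
# The LLM call is stubbed with a canned response so the example runs
# offline; a real system would send the prompt to a hosted model.

def fake_llm(prompt: str) -> str:
    """Stand-in for an LLM API call (hypothetical canned reply)."""
    return ("1. Find the bread\n"
            "2. Pick up two slices\n"
            "3. Spread the filling\n"
            "4. Close the sandwich")

def decompose_task(task: str) -> list[str]:
    """Ask the (stubbed) model for numbered steps and parse them."""
    prompt = f"Break the task '{task}' into numbered steps a robot can follow."
    reply = fake_llm(prompt)
    steps = []
    for line in reply.splitlines():
        # Strip the leading "N. " numbering to keep the bare instruction.
        _, _, text = line.partition(". ")
        steps.append(text.strip())
    return steps

steps = decompose_task("make a sandwich")
print(steps[0])  # prints: Find the bread
```

Each parsed step would then be handed to a lower-level controller, which is exactly the part that remains hard for robots.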

Over the past few months, we've also seen the emergence of so-called "vision-language models" that allow the robot to reason not only in language but with images. That's important because, at some point, robots need to add visual information to their reasoning to navigate their environment.

What, then, is the limit to using LLMs for robots?

While these are interesting models to probe, they are solving some of the easier challenges in robotics. They have not had a huge impact on dexterous manipulation, for instance: manipulation with hands. That's what robotics is still missing, and it is really difficult. Our hands do thousands and thousands of complex tasks every day.

One problem is that these vision-language models are very good semantically, but they won't be able to help the robot interact with a 3D environment, because they are trained on 2D images. For robots to be able to reason on that level, they need a huge amount of robotics data, which just doesn't exist. Some people think this will happen very quickly, like the flashpoint we have had since the emergence of ChatGPT; that's certainly what we're hearing in the startup communities. But in the context of ChatGPT, the data already existed online. It's going to take a long time to compile that robotics data.

The kinds of abilities you see from leading robotics companies like Tesla and Figure are very impressive. For example, Figure has some interesting video demos where somebody is conversing with a robot performing tasks with its hands. But these robots still need to be trained to do specific tasks using machine learning approaches such as reinforcement learning, whereby you give the robot a task and tell it whether it got it right after each attempt, or the now more popular imitation learning, where a human demonstrates a task that the robot needs to imitate.
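The imitation-learning idea described above can be sketched in a few lines. This toy behavioural-cloning example fits a linear policy to synthetic "demonstrations" by least squares; real robot policies are deep networks trained on camera images and far larger demonstration sets, so treat the dimensions, noise level, and linear expert here as illustrative assumptions only.

```python
# Toy behavioural-cloning sketch: fit a linear policy to demonstrated
# state -> action pairs by least squares. The "expert" is a hypothetical
# noisy linear controller; the principle (regress actions from
# demonstrated states) is the same one scaled up in real robot labs.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demonstrations: 100 states (4-D) with expert actions (2-D).
W_expert = rng.normal(size=(4, 2))
states = rng.normal(size=(100, 4))
actions = states @ W_expert + 0.01 * rng.normal(size=(100, 2))

# Behavioural cloning: choose policy weights minimising the squared
# error between predicted and demonstrated actions.
W_policy, *_ = np.linalg.lstsq(states, actions, rcond=None)

# The cloned policy should now imitate the expert on unseen states.
test_state = rng.normal(size=(1, 4))
error = np.abs(test_state @ W_policy - test_state @ W_expert).max()
print(f"max action error: {error:.4f}")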

These companies are likely collecting thousands or possibly millions of demonstrations to train the robots, which is a time-consuming and expensive process. There's not a huge amount of scientific novelty there. It seems very unlikely that those robots will soon be able to perform any task you want from just a language command. And none of these companies are claiming that their robots can do this now. They're saying it will happen in the future. I think it will be years, maybe decades, before that happens.

Wouldn’t the robots be able to gather the data they need and compile it with the information they learn from LLMs? 

I think that's what some people are betting on. Can we let robots collect that data themselves, meaning we leave them in a room overnight with a task and objects, and see what they learned by morning? That's the type of thinking used in reinforcement learning, and the community previously moved away from this approach after it realised it was generating frustrating results that weren't going anywhere. But we could see it swing back in the context of these vision-language models.

There is still scope for scientific discovery in robotics; there's a lot of work to do. For instance, I work on trying to get robots to learn a task within a few minutes and with a non-expert teacher.

Do you think LLMs and vision-language models in robotics will just be a flash in the pan?

I don't think so. It's true that these new approaches have so far had only a minor impact on robotics compared to older methods. However, while classical engineering has reached somewhat of a saturation point, vision-language models will improve over time.

Casting our minds to the future, for instance, we could see generative AI models produce a video predicting the consequences of a robot's actions. If we can get to that point, then the robot can start to reason in video and in action; there's a lot of potential there for robotics.

Interview by Marianne Guenot
