Google DeepMind's Chatbot-Powered Robot Is Part of a Bigger Shift

Google DeepMind said today that, thanks to a major language model upgrade, a tall, slim wheeled robot has been busy serving as a tour guide and informal office assistant in a bustling open-plan office in Mountain View, California. The robot uses the latest version of Google's Gemini large language model to interpret commands and navigate its surroundings.

For example, when a person tells the robot, "Find me somewhere to write," it dutifully leads them to a pristine whiteboard located elsewhere in the building.

Gemini's ability to handle video and text, along with large amounts of information from previously recorded video tours of the office, is what lets the "Google helper" robot understand its surroundings and navigate correctly when given commands that require some common-sense reasoning. The robot combines Gemini with an algorithm that generates specific actions, such as turning, in response to commands and to what it sees in front of it.
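To make the division of labor concrete, here is a minimal, purely illustrative sketch of how a vision-language planner might be paired with a low-level controller: the model turns an instruction plus the robot's observation into a short symbolic plan, and a separate routine maps each step to a motion primitive. None of these names or behaviors come from DeepMind's system; the model call is stubbed out so the sketch runs without any API access.

```python
# Hypothetical sketch: a vision-language "planner" feeding a simple
# low-level controller. All names and plan formats are assumptions.

from dataclasses import dataclass


@dataclass
class Observation:
    """What the robot currently sees, reduced to a text summary for this sketch."""
    scene_description: str  # in a real system, produced by a perception stack


def vlm_plan(instruction: str, observation: Observation, tour_memory: list[str]) -> list[str]:
    """Stand-in for a multimodal model call.

    A real system would send camera frames, the instruction, and context from
    recorded office tours to the model; here the plan is faked so the code runs.
    """
    if "write" in instruction.lower():
        return ["turn_left", "move_forward", "move_forward", "stop_at:whiteboard"]
    return ["stop"]


def execute(step: str) -> None:
    """Low-level controller: map a symbolic plan step to a motion primitive."""
    if step.startswith("stop_at:"):
        print(f"arrived at {step.split(':', 1)[1]}")
    elif step == "turn_left":
        print("rotating 90 degrees counterclockwise")
    elif step == "move_forward":
        print("driving forward 1 m")
    else:
        print("halting")


if __name__ == "__main__":
    obs = Observation(scene_description="open-plan office, corridor ahead")
    memory = ["tour clip 14: whiteboard near the kitchen"]
    for step in vlm_plan("Find me somewhere to write", obs, memory):
        execute(step)
```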

When Gemini was first unveiled in December, Demis Hassabis, the CEO of Google DeepMind, told WIRED that its multimodal capabilities would likely unlock new robot abilities. He added that the company's researchers were hard at work testing the model's robotic potential.

According to a new paper documenting the project's progress, the robot was up to 90 percent reliable at navigating, even when given tricky commands such as "Where did I leave my coaster?" The team notes that DeepMind's system "has greatly increased the robot usability and improved the naturalness of human-robot interaction."

The demonstration neatly illustrates how large language models could reach into the physical world and do useful work. Although chatbots like Gemini are increasingly capable of processing visual and audio input, as Google and OpenAI have recently shown, they still operate mostly within the confines of a web browser or app. In May, Hassabis unveiled an upgraded version of Gemini that could make sense of an office layout as seen through a smartphone camera.

Researchers in academia and industry are racing to figure out how language models can improve robots' capabilities. The May program of the International Conference on Robotics and Automation, a popular gathering for robotics specialists, listed nearly two dozen papers involving vision language models.

Capital is flowing into companies that aim to apply advances in AI to robotics. Several researchers involved in the Google project have since left the company to found Physical Intelligence, a startup that received $70 million in initial funding and aims to give robots general problem-solving capabilities by combining large language models with real-world training. Skild AI, founded by roboticists from Carnegie Mellon University, has a similar objective; it announced $300 million in funding this month.

Until a few years ago, a robot typically needed a map of its surroundings and carefully chosen commands in order to navigate successfully. Large language models contain useful knowledge about the physical world, and their newer iterations, known as vision language models, are trained on images and video as well as text, which lets them answer questions that require perception. Gemini allows Google's robot to parse both visual and spoken instructions, following a sketch on a whiteboard that shows a route to a new destination.
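As a point of reference, below is a minimal sketch of the kind of perception query a vision language model can answer, using the publicly available google-generativeai Python SDK. This is a generic Gemini client call, not the navigation stack described in the article; the model name, image path, and API-key handling are placeholder assumptions.

```python
# Minimal sketch: asking a vision language model a perception question about a
# single photo. The model name, file path, and environment variable are
# illustrative assumptions, not details from DeepMind's robot system.

import os

import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])  # assumes an API key is set

model = genai.GenerativeModel("gemini-1.5-flash")  # any multimodal Gemini model
frame = Image.open("office_snapshot.jpg")          # e.g. a frame from a robot camera

# The model answers from what it sees in the image plus the text prompt.
response = model.generate_content(
    [frame, "Where did I leave my coaster? Answer with the location you see."]
)
print(response.text)
```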

The researchers state in their paper that they plan to test the approach on different kinds of robots. They add that Gemini should be able to make sense of more complex questions, such as "Do they have my favorite drink today?" asked by a user with a desk full of empty Coke cans.
