Google DeepMind has introduced Robotic Transformer 2 (RT-2), a vision-language-action (VLA) model designed to let robots be controlled with plain-language instructions. The goal is to create robots that can navigate human environments with ease, much like the robotic companions of science fiction.
Like the large language models behind ChatGPT, RT-2 is trained on text and images from the web, which lets it generalize and recognize patterns well enough to attempt tasks it was never explicitly trained on. Google demonstrated this by showing RT-2 identifying and discarding trash without having been specifically trained to do so. It also picked out a dinosaur figurine when instructed. These feats are significant because they reduce the need for labor-intensive manual data collection in robot training.
RT-2’s success can be attributed to Google DeepMind’s use of transformer AI models, which have been praised for their generalization abilities. The model is built upon Google’s previous AI innovations, such as the Pathways Language and Image model (PaLI-X) and the Pathways Language model Embodied (PaLM-E). RT-2 was also co-trained using data from its predecessor, RT-1, collected over a span of 17 months.
RT-2 works by fine-tuning a pre-trained vision-language model (VLM) on robotics and web data. This allows the model to process images from robot cameras and predict the actions the robot should take next. Actions are represented as tokens, similar to word fragments, so the same model that generates text can also output robot commands. RT-2 also employs chain-of-thought reasoning, enabling it to make multi-stage decisions based on the situation at hand. In comparative tests on previously unseen scenarios, RT-2 outperformed RT-1, achieving a 62% success rate compared to RT-1’s 32%.
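To make the idea of actions as tokens concrete, here is a minimal sketch in Python of how a continuous robot command could be turned into a short string of integers and back. It is an illustration only, not Google DeepMind's implementation: the bin count of 256, the action range of -1 to 1, and every function name below are assumptions chosen for the example.

```python
from typing import List, Sequence

# Assumption for illustration: each action dimension is split into 256 bins.
NUM_BINS = 256


def action_to_tokens(action: Sequence[float], low: float = -1.0,
                     high: float = 1.0, num_bins: int = NUM_BINS) -> List[int]:
    """Map each continuous action value to an integer bin index."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)       # keep value inside the range
        fraction = (clipped - low) / (high - low)  # scale to [0, 1]
        tokens.append(min(int(fraction * num_bins), num_bins - 1))
    return tokens


def tokens_to_action(tokens: Sequence[int], low: float = -1.0,
                     high: float = 1.0, num_bins: int = NUM_BINS) -> List[float]:
    """Map bin indices back to the center value of each bin."""
    width = (high - low) / num_bins
    return [low + (token + 0.5) * width for token in tokens]


def tokens_to_text(tokens: Sequence[int]) -> str:
    """Write the tokens as a space-separated string, like words in a sentence."""
    return " ".join(str(token) for token in tokens)


def text_to_tokens(text: str) -> List[int]:
    """Parse a space-separated token string back into integers."""
    return [int(part) for part in text.split()]


if __name__ == "__main__":
    command = [-1.0, 0.0, 1.0]  # hypothetical three-dimensional action
    text = tokens_to_text(action_to_tokens(command))
    print(text)
    print(tokens_to_action(text_to_tokens(text)))
```

Because each value is rounded to a bin, the recovered action differs slightly from the original; with more bins the error shrinks, at the cost of a larger token vocabulary.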
However, RT-2 does have its limitations. While web data enhances generalization, it does not grant the robot new physical skills that it has not directly practiced. Google recognizes these constraints but remains optimistic, viewing RT-2 as a significant step toward the development of general-purpose robots.