Is 2024 The Year of Advanced Robotics?

Are we on the brink of debunking Moravec’s Paradox? In the 1980s, AI and robotics researcher Hans Moravec highlighted a counterintuitive aspect of AI: tasks requiring high-level reasoning — like chess or Go — are easier for AI to master than basic sensory and motor skills — such as walking or identifying your mom’s face — which humans find instinctive. Adding complexity, these “simpler” skills actually demand much more computational power. This insight sheds light on the complexity of replicating human-like perception and dexterity, outcomes of millions of years of evolution, as opposed to logical reasoning, a more recent development. In today’s AI and ML landscape, this paradox underscores the challenges in creating robots and AI systems capable of seamlessly navigating and interacting with the physical world.

However, last week, Bernt Bornich, CEO and founder of 1X, a humanoid robotics company, wrote, “New progress update on the droids dropping in 4 weeks, looks like Moravec’s paradox might be debunked, and we just didn’t have the data.” I suspect that this has something to do with the advancements in foundation models. Originally known for their ability to perform a wide range of tasks based on a single type of data (like text for language models), these models become “multimodal” when they integrate and interpret information across different sensory inputs, more closely mirroring human-like understanding.

Could the embodiment of AI, with all its sensory inputs, plus reasoning-imitation algorithms like LLMs, be the pool of data that disproves Moravec’s paradox?

Another intriguing development caught my attention. Jensen Huang, Nvidia’s CEO, responded to a question from Wired about what current development could change everything. Huang replied, “There are a couple of things. One doesn’t really have a name, but it’s part of the work we’re doing in foundational robotics. If you can generate text and images, can you also generate motion? The answer is probably yes. And if you can generate motion, you can understand the intent and generate a generalized version of articulation. Therefore, humanoid robotics should be right around the corner.”

Something to observe in the coming weeks!

In related news from the robotics universe, Figure AI, a humanoid robotics startup, made headlines by raising approximately $675 million in funding. What’s more impressive is the list of backers: Amazon, NVIDIA, Microsoft, OpenAI, Intel, LG, and Samsung. This indicates a strong belief in the potential of humanoid robotics to disrupt various sectors.

Yet, there are skeptical voices. Rodney Brooks, who coined the term Nouvelle AI, posted last week: “Tele-op robots presented as autonomous, like the Tesla Optimus humanoid folding a shirt, and 1X humanoid robots, are misrepresentations of what robots are actually doing, which can also be called LIES. Note that the Stanford robot cooking and cleaning videos are also tele-operated.”

If 2023 was the year of LLMs, are we ready to move on to embodied AI and make 2024 the year of robots?

🎁 Bonus: The freshest research papers from the week of Feb 19 — Feb 25

Enhancing Large Language Models (LLMs)

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens: Expands the processing capability of LLMs to handle over 2 million tokens, pushing the boundaries of context window sizes for more comprehensive understanding and generation tasks. Read the paper.
OmniPred: Language Models as Universal Regressors: Demonstrates the versatility of LLMs in performing numerical regression tasks, suggesting their potential as universal tools for predictive modeling across a variety of domains. Read the paper.
Divide-or-Conquer? Which Part Should You Distill Your LLM?: Investigates efficient strategies for distilling large models into smaller, more manageable ones for reasoning tasks, suggesting that the problem-decomposition step can be distilled into a smaller model more readily than the problem-solving step. Read the paper.
ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition: Proposes a method to enhance the efficiency of self-attention mechanisms in LLMs, crucial for improving performance and reducing resource consumption. Read the paper.
USER-LLM: Efficient LLM Contextualization with User Embeddings: Introduces a framework for personalizing LLM interactions using user embeddings, enhancing the model’s responsiveness to individual user preferences and histories.

Multimodal and Multi-Agent Systems

World Model on Million-Length Video and Language with RingAttention: Explores integrating video and language for advanced AI understanding and interaction, leveraging a novel RingAttention mechanism for efficient multimodal learning. Read the paper.
AgentScope: A Flexible yet Robust Multi-Agent Platform: Develops a multi-agent platform that enhances cooperation and flexibility among agents, addressing the complexity of multi-agent systems and their practical applications. Read the paper.
TinyLLaVA: A Framework of Small-scale Large Multimodal Models: Focuses on the design and analysis of small-scale multimodal models, showing that with strategic optimizations, smaller models can match or surpass the performance of larger counterparts. Read the paper.
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling: Introduces a versatile multimodal language model that processes speech, text, images, and music, demonstrating the power of discrete representations in unifying various data modalities within a single framework. This approach simplifies the integration of new modalities without needing to modify the underlying architecture or training methodologies. Read the paper.
A Touch, Vision, and Language Dataset for Multimodal Alignment: Presents a novel dataset that enhances multimodal understanding by incorporating touch with vision and language, aiming to advance touch-vision-language alignment and understanding through a tactile encoder and text generation model. Read the paper.

Advancements in Specific Domains

YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information: Introduces a novel object detection model that leverages programmable gradient information for enhanced accuracy and efficiency in learning. Read the paper.
Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping: Demonstrates the application of Transformers in complex planning tasks, offering a method that surpasses traditional search algorithms in efficiency and effectiveness. Read the paper.

Developer Tools and APIs

API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs: Presents a vast dataset designed for training LLMs to interact with APIs, addressing the challenge of creating effective models for API usage and integration. Read the paper.
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement: Develops an open-source system for code generation, execution, and refinement, facilitated by a dataset of multi-turn interactions, aiming to bridge the gap between code generation models and practical coding tasks. Read the paper.

Security and Adversarial Research

Coercing LLMs to do and reveal (almost) anything: Explores the susceptibility of LLMs to a wide range of adversarial attacks, highlighting the need for comprehensive security measures to protect against unintended behaviors and data extraction. Read the paper.

Model Efficiency and Quantization

OneBit: Towards Extremely Low-bit Large Language Models: Discusses a novel framework for quantizing LLM weight matrices to 1-bit to drastically reduce storage and computational demands while maintaining performance, enabling efficient deployment of LLMs on resource-constrained devices. Read the paper.
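
For a feel of what 1-bit weights mean in practice, here is a minimal sketch of generic sign-plus-scale quantization. It is not the OneBit method itself, which the paper describes in full; the function names and the toy matrix are assumptions made up for illustration. Each row keeps only the sign of every weight plus a single floating-point scale.

```python
def quantize_row(row):
    """Reduce one weight row to a list of +1/-1 signs and one scale."""
    # Assumed scale choice: the mean absolute value of the row, which
    # minimizes squared error when every weight becomes +scale or -scale.
    scale = sum(abs(w) for w in row) / len(row)
    signs = [1 if w >= 0 else -1 for w in row]
    return signs, scale


def dequantize_row(signs, scale):
    """Rebuild an approximate row from the signs and the scale."""
    return [s * scale for s in signs]


# Hypothetical toy weight matrix, two rows of three weights each.
weights = [[0.5, -1.5, 1.0], [-0.25, 0.25, -0.25]]

quantized = [quantize_row(row) for row in weights]
restored = [dequantize_row(signs, scale) for signs, scale in quantized]

for original, approx in zip(weights, restored):
    print(original, "->", approx)
```

The first row loses its magnitudes (every entry becomes plus or minus the row average), while the second row, whose weights all share one magnitude, is recovered exactly. That gap is the accuracy cost that low-bit methods try to shrink.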

Instruction Tuning and Data Quality

Reformatted Alignment: Introduces REALIGN, a method for refining instruction data quality for LLMs to better align with human values, emphasizing the importance of instruction data quality in model alignment and suggesting areas for further exploration in LLM science. Read the paper.
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models: Proposes a novel method for instruction tuning that generates synthetic instruction data across all disciplines, showcasing a scalable and customizable approach to instruction tuning without relying on specific training data. Read the paper.
Instruction-tuned Language Models are Better Knowledge Learners: Proposes a pre-instruction-tuning method to enhance LLMs’ knowledge updating capabilities, demonstrating significant improvements in factual knowledge absorption and cross-domain generalization. Read the paper.
