New memory system helps robots interact and work side-by-side with humans

MIT researchers built a robot memory system that links objects, places, and time so machines can answer real-world questions later.

Written By: Shy Cohen
Edited By: Joshua Shavit

MIT’s DAAAM system helps robots remember objects, places, and timing in large environments using language-rich 3D maps. (CREDIT: Wikimedia / CC BY-SA 4.0)

A robot on a factory floor can carry parts, scan shelves, and move around people with growing skill. What it still struggles to do is something a human worker handles almost without thinking: remember where an unfinished item was left yesterday, and retrieve it when asked.

That gap is what MIT researchers are trying to close with a new memory system for robots called DAAAM, short for Describe Anything, Anywhere, At Any Moment. The framework is built to help machines remember not just what is in an environment, but also where it is, when it appeared, and how to retrieve that information later using ordinary language.

The idea is simple to state and hard to pull off. A robot working in a large building, on a campus, or across a factory needs a memory that can connect place, object, and time. For instance, it should be able to answer questions such as where it last saw a red screwdriver, how long it stayed inside a space, or which bike outside a building had a flat tire.

MIT researchers have developed a long-term memory framework for robots that combines advanced map representations with rich descriptions of the environment. Here, a moving robot attaches detailed descriptions to the bicycles it sees as it explores. (CREDIT: MIT Researchers)

Working side by side with robots

“If we want robots to work side-by-side with humans and interact better with humans, they must speak the same language,” said Luca Carlone, an associate professor in MIT’s Department of Aeronautics and Astronautics, principal investigator in the Laboratory for Information and Decision Systems, and director of the MIT SPARK Laboratory.

“The robot must be able to reason about time and space the same way humans do. That is essentially what our method is doing. It is turning a traditional map into a language-based map that is easier for the robot to think about and access using language,” he continued.

Carlone worked on the project with lead author Nicolas Gorlo, an MIT graduate student, and Lukas Schmid, a former MIT research scientist who is now a professor at the University of Technology Nuremberg in Germany. The work was recently presented at the Conference on Computer Vision and Pattern Recognition.

A map that remembers more than walls

The problem sits at the crossroads of computer vision and robotic mapping. Vision systems can often describe a scene in rich detail, but they usually process one image or one object at a time. Meanwhile, robotic mapping systems can build 3D maps of large spaces. However, they often lack detailed language descriptions or require too much computation to run quickly.

DAAAM tries to bridge that divide.

Using DAAAM, a robot can quickly access its memory to answer complex queries about its environment in plain language. Here, to answer a query, the robot searches its memory using the word "sculpture" to recall artworks it saw on campus. (CREDIT: MIT Researchers)

As a robot moves through an environment, the system attaches natural-language descriptions to what it sees. A building might be identified as the Stata Center, along with a note about its architecture. A bike rack might be described as holding five bicycles, with one red bike showing a flat tire. Those descriptions are then linked to a 3D map so the machine can connect the object to a specific place.

That matters because memory is not just about storing visual snapshots. It is about organizing them in a way that lets a machine later answer real questions. For example, a robot using DAAAM could potentially remember that the damaged red bicycle was outside a particular building. This is more useful than merely recalling that it once appeared in a certain camera frame.

Speed became the real obstacle

Rich description comes at a cost. Existing systems that produce detailed annotations can take several seconds to label only a handful of objects. For a robot moving through a cluttered real-world space, that is far too slow.

“The faster the robot can form this spatial memory, the more efficient it will be performing actions in the environment,” Carlone said.

To reduce that bottleneck, the MIT team designed DAAAM to group nearby objects and choose only the most useful camera views for description. The system selects key frames that offer the clearest view of several objects at once. Then, it annotates them in batches rather than one by one.

That step speeds the process by about an order of magnitude, according to the research. Instead of repeatedly describing the same object from many angles, the robot labels each object once and stores the result inside its map.
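
The frame-selection idea can be pictured as a coverage problem: pick a few views that together show every object, then describe each object once. The sketch below is an illustration only, not the researchers' algorithm. The function name `select_key_frames`, the input format, and the greedy rule are all assumptions made for the example.

```python
from typing import Dict, List, Set


def select_key_frames(visible: Dict[int, Set[str]]) -> List[int]:
    """Greedily pick frames until every object is covered once.

    `visible` maps a frame id to the ids of objects clearly seen in it.
    Both the input format and the greedy rule are assumptions for this sketch.
    """
    uncovered: Set[str] = set().union(*visible.values())
    chosen: List[int] = []
    while uncovered:
        # Pick the frame showing the most not-yet-described objects;
        # ties go to the lowest frame id so the result is deterministic.
        best = max(sorted(visible), key=lambda f: len(visible[f] & uncovered))
        chosen.append(best)
        uncovered -= visible[best]
    return chosen


# Hypothetical visibility data: four camera frames, five objects.
frames = {
    0: {"bike_1", "bike_2"},
    1: {"bike_2", "bike_3", "rack"},
    2: {"bike_1", "rack"},
    3: {"sign"},
}
print(select_key_frames(frames))  # [1, 0, 3]
```

In this toy case, three of the four frames are enough to cover all five objects, so the fourth never needs to be sent to the description model.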

We present Describe Anything, Anywhere, at Any Moment (DAAAM), a real-time, large-scale, spatio-temporal memory for embodied question answering and 4D reasoning. (CREDIT: arXiv)

“We annotate every object only once, so our framework can run in very large-scale environments in real time,” Gorlo said. “And by clustering objects into regions, it can answer a wide range of queries about objects and locations in the environment.”

The result is a memory system that remains geographically grounded. Objects are not stored as isolated text entries or loose image captions. Instead, they are tied to a structured four-dimensional scene graph, essentially a map that includes both 3D location and changes over time.
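
One way to picture a memory entry that ties an object to both place and time is a small record holding a description, a 3D position, a region, and first-seen and last-seen timestamps. The sketch below is a simplified assumption, not the structure used in DAAAM. The names `ObjectNode` and `SceneMemory`, and all field names, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class ObjectNode:
    description: str                      # natural-language label, written once
    position: Tuple[float, float, float]  # 3D location in the map frame (meters)
    region: str                           # cluster of nearby objects, e.g. a room
    first_seen: float                     # timestamps in seconds
    last_seen: float


class SceneMemory:
    """Toy store of object nodes keyed by a hypothetical object id."""

    def __init__(self) -> None:
        self.nodes: Dict[str, ObjectNode] = {}

    def observe(self, obj_id: str, description: str,
                position: Tuple[float, float, float],
                region: str, t: float) -> None:
        node = self.nodes.get(obj_id)
        if node is None:
            # First sighting: store the description together with place and time.
            self.nodes[obj_id] = ObjectNode(description, position, region, t, t)
        else:
            # Already described: keep the text, refresh position and time only.
            node.position = position
            node.last_seen = t


memory = SceneMemory()
memory.observe("bike_3", "red bicycle with a flat tire", (12.0, 4.5, 0.0), "bike rack", 60.0)
memory.observe("bike_3", "bicycle", (12.0, 4.5, 0.0), "bike rack", 300.0)
node = memory.nodes["bike_3"]
print(node.description, node.last_seen - node.first_seen)  # red bicycle with a flat tire 240.0
```

Because the description is written only on the first sighting, later observations update where and when the object was seen without paying for a new annotation.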

Better answers over longer stretches

Once that memory is built, the next challenge is retrieval. A robot may need to search through a huge number of objects, descriptions, and time stamps to answer a single question. Therefore, DAAAM uses a language model with specialized retrieval tools to pull out the relevant details while reducing the risk of hallucinations.

If someone asks about a sculpture near a campus building, for example, the system can search by the word “sculpture,” by the building’s location, or by both.
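
A search by word, by location, or by both can be sketched as a filter over stored entries. The example below is a minimal illustration and does not reproduce DAAAM's retrieval tools or its language model. The function `search`, the 25-meter default radius, and the sample entries are assumptions.

```python
import math
from typing import List, Optional, Tuple

# (description, (x, y) position in meters); a simplified, hypothetical record.
Entry = Tuple[str, Tuple[float, float]]


def search(entries: List[Entry],
           keyword: Optional[str] = None,
           near: Optional[Tuple[float, float]] = None,
           radius: float = 25.0) -> List[str]:
    """Return descriptions matching the keyword, the location, or both."""
    results = []
    for description, position in entries:
        if keyword is not None and keyword.lower() not in description.lower():
            continue  # fails the word filter
        if near is not None and math.dist(position, near) > radius:
            continue  # fails the location filter
        results.append(description)
    return results


entries = [
    ("bronze sculpture of a seated figure", (10.0, 5.0)),
    ("steel sculpture with curved arcs", (200.0, 40.0)),
    ("bike rack holding five bicycles", (12.0, 8.0)),
]
building = (0.0, 0.0)  # hypothetical building location

print(search(entries, keyword="sculpture"))
print(search(entries, near=building))
print(search(entries, keyword="sculpture", near=building))
```

Combining the two filters narrows the three entries to the single sculpture that stands within the radius of the building.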

In tests on spatiotemporal question answering, DAAAM outperformed competing methods. On the original NaVQA benchmark, the system’s descriptive question accuracy reached 0.672, ahead of other listed approaches. On the team’s revised object-centric version of that benchmark, DAAAM reached 0.711 question accuracy. This compares with 0.463 for one ReMEmbR variant and 0.299 for ConceptGraphs.

An overview of the proposed approach. Given an RGB-D video stream, we first segment the scene into fragments and track them over time in image space using a lightweight tracker. (CREDIT: arXiv)

The system also showed stronger performance on long sequences and temporal reasoning. In the object-centric benchmark, it reported a positional error of 41.75 meters and a temporal error of 1.792 minutes. The researchers also tested it on sequential task grounding. In this test, a robot must connect language instructions to actions in 3D space. There, DAAAM posted a task accuracy of 11.22 percent, ahead of the methods listed for comparison.

The team says the framework can run at the sensor rate of 10 hertz on the CODa dataset while handling large-scale environments. It also scaled to sequences longer than 35 minutes and distances over 1.5 kilometers.

Useful, but not finished

The system still has limits. The model used to generate detailed descriptions can miss unusual features or hallucinate toward more common ones. The paper gives one example: elevator doors incorrectly described as having handles.

The annotation speed may also be too slow for faster-moving machines such as aerial robots or some virtual reality systems. On average, a single worker thread could annotate about 5.2 new fragments per second on a desktop GPU. The authors say that is enough for a mobile ground robot, but not necessarily for every platform.

There are also longer-term memory questions. DAAAM keeps a history of descriptions for dynamic objects, and the researchers note that this record may not scale indefinitely without better summarization.

Even with those constraints, the work points toward a more practical kind of robotic memory, one that is not just visual, but situated.

Mock-example for the frame selection heuristic. (CREDIT: arXiv)

“We want to design a new type of memory, a spatiotemporal memory, that enables an AI-powered robot to remember real interactions and sensor observations,” Carlone said. “Like ChatGPT, but grounded in the real world and capable of answering any question about the environment, like ‘Where did I leave my wallet?’”

Practical implications of the research

This work could make robots more useful in places where people expect context, not just movement. In factories, that could mean sending a robot to retrieve a partly assembled component left in a specific bin the night before.

When it comes to maintenance, an augmented reality system could flag changes or anomalies based on what it observed earlier. In navigation, the same kind of memory could help commuters or workers get directions tied to landmarks, events, and timing.

The broader value is not just that a robot can see a scene, but that it can remember it in a way people naturally ask about it.

Research findings are available online on the preprint server arXiv.

The original story "New memory system helps robots interact and work side-by-side with humans" is published in The Brighter Side of News.


