Universal Navigation for Every Moving Robot

Why aren’t we yet seeing a ‘Waymo’ for every robot in every domain? Humans can navigate land, sea, and air in unseen environments, yet today’s navigation stacks are tailored to individual platforms and rely heavily on domain-specific assumptions that don’t generalize.
Mobile robots are marketed as fully autonomous, but achieving reliable autonomy in dynamic real-world environments remains an industry-wide challenge. The gap between promise and reality keeps widening, and it shows up across the autonomy delivery cycle as slow development, deployments, and rollouts. Today’s navigation stacks perform well in constrained settings with carefully calibrated sensors and well-defined operating conditions. Indoor AMRs and autonomous vehicles impress in demos and controlled pilots, but scaling them consistently requires significant operational support, extended rollout cycles, and constrained workflows. When conditions deviate even slightly from anticipated scenarios, autonomy degrades sharply. This highlights the need for solutions that are more robust and quicker to deploy.
This issue is most evident in unstructured and RF-degraded environments. In Ukraine, drones have to be flown manually through fiber-optic tethers because signal interference has rendered GPS — a critical dependency for most mobile robots — unreliable. Systems branded as “autonomous” are exposed for what they are: partially human-operated machines prone to failure when autonomy is most needed.
This is not a deployment or maturity issue, but a core dependency failure. Robotics remains anchored to decades-old classical computer vision techniques with rigid assumptions, expensive sensors, and custom integrations. Although their failures tend to be predictable, these approaches were never designed to adapt to the long-tailed scenarios of the real world.
In this post, we’ll dive into some of the approaches used in autonomous navigation today, and what Tera is building towards: software-only, domain-agnostic navigation. By leveraging spatial foundation models compatible with any camera, Tera enables every form of mobile robot to navigate, even in scenarios where existing solutions fail. The result is autonomy that scales without brittle dependencies, excessive hardware, or platform-specific customization.
Limitations of Traditional Approaches
Humans rely mostly on vision to navigate, but most mobile robots use vision only partially, because the software is fragile and lacks spatial intelligence. This translates into restrictive operating conditions and a need for additional sensors and external signals to work reliably. Another problem is efficiency. On today’s memory- and compute-restricted mobile robots, the onboard perception layer mostly uses simple 2D/3D geometry and preloaded maps, with limited ability to form and update semantically meaningful maps in real time, a capability central to how humans navigate effortlessly in new places. By closely coupling outputs from sophisticated sensors with basic but clever perception algorithms that track features (e.g. roads, trees), traditional visual navigation is tailored to each use case and relies on human assumptions about the data that don’t transfer across domains. Below, we cover a few widely-used approaches and their limitations.
Relative Positioning
Visual odometry (e.g. optical flow): Visual odometry (VO) estimates camera motion from visual inputs by measuring pixel displacement between frames. This makes it inherently fragile in everyday scenarios where texture-poor scenes, changing illumination, motion blur, vibration, camera artifacts, and dynamic objects can all degrade accuracy. It is especially problematic for vehicles that lack other sensors to reconcile scale: monocular VO carries no inherent knowledge of absolute geometry and scale, causing scale drift and thus increasingly large errors. Agile aerial robots like quadcopters flying at low altitude are harder still, because large inter-frame motion and unstable visual statistics can drive sudden increases in prediction error. In practice, monocular VO accumulates error from scale ambiguity so quickly that it is never used by itself, only in conjunction with other sensors or maps.
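The compounding effect of scale drift can be illustrated with a toy calculation. This is a simplified sketch, and the 1% per-frame scale bias is an illustrative assumption, not a measured figure:

```python
# Toy illustration of monocular VO scale drift: each frame's translation
# is recovered only up to scale, so a small consistent scale bias
# compounds multiplicatively along the trajectory.

def compounded_scale_error(per_frame_bias: float, n_frames: int) -> float:
    """Return the ratio of estimated to true distance after n_frames."""
    scale = 1.0
    for _ in range(n_frames):
        scale *= (1.0 + per_frame_bias)  # bias accumulates multiplicatively
    return scale

# A mere 1% optimistic scale bias per frame, assumed constant:
ratio = compounded_scale_error(0.01, 100)
# After 100 frames the estimated distance is ~2.7x the true distance.
```

The multiplicative accumulation is why even a tiny systematic bias quickly becomes useless for long trajectories, and why VO is always fused with other sources of scale.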
Hardware-heavy sensors are costly and take effort to deploy: IMUs, stereo cameras, and LiDAR help resolve scale, but impose hard constraints on cost and on how robots operate. Navigation-grade IMUs are expensive. Stereo vision depends on sufficient baseline between viewpoints, distance to objects, and texture to resolve depth; in outdoor and aerial robotics, depth accuracy degrades rapidly in low-texture scenes, at long distances, or under adverse lighting. LiDAR extends range, but adds weight, power draw, and cost, introducing platform-level tradeoffs that reduce endurance, payload capacity, and scalability. Defense applications are especially reluctant to adopt active sensors such as LiDAR, because active emissions can be detected.
Both sensing modalities degrade in dust, rain, fog, and smoke, while their effective range often falls well short of what high-speed drones or ground robots need to plan safely at operational velocities. The result is a paradox: systems built around stereo and LiDAR become more expensive, heavier, and more complex, yet remain bound by environmental and operational constraints. These constraints make sensor-centric navigation brittle and economically unscalable, especially for outdoor platforms at scale and in contested environments.
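The range limits of stereo follow directly from triangulation geometry: depth is Z = f·B/d, so a fixed disparity uncertainty maps to a depth error that grows quadratically with distance. A minimal sketch, where the focal length, baseline, and matching noise are assumed plausible values rather than figures from any specific sensor:

```python
def stereo_depth_error(z_m: float, focal_px: float, baseline_m: float,
                       disparity_noise_px: float) -> float:
    """First-order depth uncertainty of a stereo pair at range z_m.

    From Z = f*B/d: |dZ/dd| = Z^2 / (f*B), so sigma_Z ~= Z^2/(f*B) * sigma_d.
    """
    return (z_m ** 2) / (focal_px * baseline_m) * disparity_noise_px

# Assumed rig: 700 px focal length, 12 cm baseline, 0.25 px matching noise.
err_5m = stereo_depth_error(5.0, 700.0, 0.12, 0.25)    # ~7 cm
err_50m = stereo_depth_error(50.0, 700.0, 0.12, 0.25)  # ~7.4 m
# 10x the range means 100x the depth error under the same matching noise.
```

This quadratic scaling is the core reason compact stereo rigs cannot provide the planning range that fast-moving outdoor platforms need, regardless of how good the matching algorithm is.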
Absolute Positioning with Classical Map Matching
Most forms of navigation with real economic value require persistent memory over medium to long range, not just moving around an office or a home. This usually requires absolute positioning relative to a global frame of reference (like latitude and longitude). Without reliable location-beaming RF signals (like RTK/GPS), most will point to preloaded maps as a prerequisite. Preloaded maps coupled with classical map matching are useful in controlled or slow-changing environments. In the real world, traditional map-matching approaches often struggle because they assume a largely static environment and rely on sparse, often human-designed visual features (e.g., roads). These approaches align features extracted from the robot’s live camera images with those of preloaded maps. On its own, this is very brittle: deviations such as seasonal change, damage, construction, camouflage, or simply outdated maps can degrade performance. This is why simultaneous localization and mapping (SLAM) is rarely implemented successfully today without platform-specific setups. While deep learning has improved feature matching and robustness in modern SLAM, it still relies on pre-defined feature sets and remains inherently brittle when moving zero-shot between dramatically different visual domains (e.g., from urban to sub-surface). In the real world, warehouses change their layouts. Restaurants reconfigure tables. Data centers change during build-out. These seem like easy problems for humans, but they have been extremely difficult to replicate in software.
Additionally, the computational burden of dense feature extraction, correspondence search, and global optimization further constrains responsiveness, forcing tradeoffs between accuracy and latency that are unacceptable for fast, low-altitude vehicles. Map matching treats perception as a lookup problem rather than a learning problem, and it can only recognize what is available in preloaded maps. In dynamic or RF-degraded environments, this rigidity turns prior knowledge into a liability, producing systems that work in demos but fail in scaled deployments.
Finally, unlike humans, who can intelligently locate themselves on a 2D map by projecting a completely different first-person view onto it, machines fail even when the viewpoint and data modality (RGB to RGB) are almost identical, particularly when the camera’s FOV is narrow.
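The brittleness of appearance-based matching is easy to reproduce. The sketch below uses synthetic descriptors and hypothetical noise levels to mimic classical nearest-neighbor matching with Lowe’s ratio test: it works when the live view closely resembles the map, and collapses when the scene has changed:

```python
import math
import random

random.seed(0)
DIM, N = 128, 20  # descriptor dimension, number of map features

def gauss_vec():
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ratio_test_matches(live, mapped, ratio=0.8):
    """Count live descriptors whose nearest map descriptor is distinctly
    closer than the second-nearest (Lowe's ratio test)."""
    matches = 0
    for d in live:
        d1, d2 = sorted(dist(d, m) for m in mapped)[:2]
        if d1 < ratio * d2:
            matches += 1
    return matches

map_desc = [gauss_vec() for _ in range(N)]  # preloaded map features
# Same scene: small appearance noise. Changed scene: large appearance noise
# (standing in for seasonal change, damage, or a reconfigured layout).
live_same = [[x + random.gauss(0, 0.05) for x in v] for v in map_desc]
live_changed = [[x + random.gauss(0, 2.0) for x in v] for v in map_desc]

good_matches = ratio_test_matches(live_same, map_desc)     # nearly all match
bad_matches = ratio_test_matches(live_changed, map_desc)   # matching collapses
```

The lookup only recognizes what already exists in the map in roughly the same appearance, which is exactly the failure mode described above: prior knowledge becomes a liability once the world drifts away from it.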
Path to Human-Level Navigation
Our Philosophy to Scaling Human-Level Spatial Reasoning
Humans are adept at navigating new places on any vehicle, whether we’re walking, driving, flying, or cruising. We can read a 2D map when we’re lost and find our way. We remember places we visited decades ago, forming a cognitive map with rich, semantically meaningful long-term memory. Humans also improve with data (for example, pilots get better with flight hours). Replicating this in software requires AI with the generality of LLMs, attuned to spatial reasoning.
Tera approaches navigation with a similar philosophy: by focusing on methods that can scale with data to achieve precision and generalization across all domains. Our goal is to give low-cost mobile robots with existing cameras the ability to reason spatially by creating an environment representation that is not tailored to any specific scenario. The perception stack must be free of human-engineered, domain-specific assumptions. A system with these attributes must reason about long-horizon spatial contexts and build a rich cognitive map from a single monocular camera, operating in real-time.
In scenarios without maps, the system should be able to leverage a cheap optical camera to produce a metrically accurate spatial representation that enables successful navigation afterwards. As with humans, we believe over-reliance on external signals like GPS severely limits economic value, since most valuable activities do not have reliable access to GPS (indoor robotics, mining and construction, RF-denied areas, space).
Even in scenarios where a map already exists, today’s software can benefit from improved generality. We also see a path to solving this with foundation models. A transformer-based approach to map matching offers a more robust solution under sparse environments, seasonal variation, and extreme viewpoint differences. Dense matching more closely resembles how humans navigate: it relies less on sharp contrast boundaries (e.g. roads, trees) or geometric shapes (e.g. a crater with a unique shape). A concrete example is using preloaded top-down maps to render what a low-elevation view of the same environment should look like. This removes the need for the robot to duplicate the point of view of satellite imagery, and yields a solution that can be applied across different platforms even when map data is sparse (e.g. an indoor floor plan but no 3D point clouds).
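The rendering idea can be pictured with basic pinhole geometry: ground points taken from a top-down map are projected into a hypothetical forward-facing camera a few meters above the ground. The intrinsics and camera height below are illustrative assumptions, not parameters of any real system:

```python
def project_ground_point(x: float, y: float, cam_height: float = 2.0,
                         f: float = 500.0, cx: float = 320.0,
                         cy: float = 240.0):
    """Project a ground-plane point (x right, y forward, z = 0) from a
    top-down map into a level, forward-facing pinhole camera mounted at
    cam_height meters. Returns (u, v) pixel coordinates, or None if the
    point lies behind the camera."""
    if y <= 0:
        return None  # behind the camera, not visible
    # Camera frame: x right, y down, z forward. A ground point sits
    # cam_height below the optical center, y meters ahead.
    x_cam, y_cam, z_cam = x, cam_height, y
    return (f * x_cam / z_cam + cx, f * y_cam / z_cam + cy)

near = project_ground_point(0.0, 10.0)  # ground point 10 m ahead
far = project_ground_point(0.0, 40.0)   # ground point 40 m ahead
# Farther ground points project closer to the horizon line (v approaches cy).
```

Dense matching then compares the rendered expectation against the live image, rather than hunting for a handful of hand-picked landmarks that may no longer exist.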
Importantly, Tera is not just a research lab building models and then searching for problems to solve. We’re not building 1% solutions across every domain. Instead, we’re hyper-focused on closing the capability gaps of the best existing navigation stacks, at a cost point viable for those platforms and use cases. Given the early state of approaches in this space, we believe this is the quickest path to adoption.
Transitioning from a hardware-centric architecture to a software-defined autonomy stack enables advanced distributed multi-agent coordination. Software updates can deploy decentralized control, local decision-making, and autonomous task allocation without reliance on persistent communications, which are unreliable in contested environments.
This goes beyond a hive mind towards distributed autonomy: global objectives emerge from local agent interactions. The layered architecture improves fault tolerance, supports graceful degradation under node loss, and increases mission assurance through system-level redundancy.
Challenges with Applying Existing Foundation Models to Long-Range Navigation
It seems like every day we see a new demo of a VLM folding laundry or putting away dishes. Why have these general-purpose models not been successfully adapted to navigate real robots over long range? Many researchers, including ourselves, who have thought about this problem for decades advocate for a ‘world model’ with a more persistent representation of spatial concepts. We see three challenges:
- The inductive biases of existing architectures make it hard to scale to spatial problems reliably
- Existing datasets are not structured for models to learn such capabilities
- Memory and latency constraints kill the ability to run on-edge for real use cases that demand <$1k compute
The right inductive biases: our team was involved in the development of the original diffusion model, which powers all visual generative AI today. Diffusion has recently been extended to 3D generation, enabling the creation of 3D assets and scenes from text, images, or multi-view inputs. Most current approaches build on 2D diffusion backbones and couple them with differentiable 3D representations, such as radiance fields or Gaussian splatting, to produce geometry and appearance that can be rendered from novel viewpoints. While these systems are effective for asset creation and view-consistent scene synthesis, their connection to real-world autonomy remains limited. Today’s 3D methods are specialized for static consistency rather than dynamic scene understanding and persistent spatial memory with explicit multi-agent shared reconstruction. Autonomy applications (e.g., robotics or embodied agents) require temporally consistent, long-horizon representations that support dynamic state estimation, memory, and interaction under uncertainty. As a result, current 3D diffusion-based methods are well suited for content generation, but they do not yet provide the robust, temporally persistent spatial representations required for real-world autonomous systems. Transformers, meanwhile, have inductive biases that exploit the sequential structure of data like language, training on autoregressively generated outputs. We have empirically determined that 3D spatial understanding from 2D imagery cannot be directly formulated under the same next-token prediction paradigm.
We have spent the last 2 years at the frontiers of AI research developing new ways to train and architect a unified system that can robustly use and produce spatially consistent representations of the physical world, even over long distances. This involves a novel system architecture that jointly solves the local metric estimation problem and long-range topological mapping, using a sparse, memory-efficient spatial graph to maintain global consistency. This enables long-range vision-only navigation across a variety of existing domains and platforms (land and air).
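One way to picture the local-metric-plus-topological split is a sparse graph whose nodes are keyframes or places and whose edges carry locally estimated metric traversal costs, with long-range routes planned over the graph rather than over a dense global map. This is a generic illustration of the idea, not Tera’s actual architecture; the node names and costs are invented:

```python
import heapq

# Sparse spatial graph: nodes are keyframes/places, edges carry locally
# estimated metric traversal costs (meters). Global consistency lives in
# the graph topology; precise geometry is only needed along the chosen route.
graph = {
    "dock": {"hall": 12.0, "yard": 30.0},
    "hall": {"dock": 12.0, "lab": 8.0},
    "yard": {"dock": 30.0, "lab": 25.0},
    "lab":  {"hall": 8.0, "yard": 25.0},
}

def plan_route(graph, start, goal):
    """Dijkstra over the topological graph; returns (cost, node path)."""
    frontier = [(0.0, start, [start])]
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, edge_cost in graph[node].items():
            if nxt not in visited:
                heapq.heappush(frontier, (cost + edge_cost, nxt, path + [nxt]))
    return float("inf"), []

cost, path = plan_route(graph, "dock", "lab")
# dock -> hall -> lab (20 m) beats dock -> yard -> lab (55 m).
```

Because the graph is sparse, memory grows with the number of places visited rather than with the volume of space mapped, which is what makes long-range consistency tractable on edge compute.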
Data: SLAM has not historically been posed successfully as an end-to-end learning problem, so datasets do not exist in a structure that makes it possible to train these systems at scale by writing just a few preprocessors. As a result, the models that exist today are mostly research-grade: trained on narrow datasets, with limited adaptability to real-world applications. For example, a vision-only system needs to resolve scale in any environment. This is a hard problem with vision alone, and a main reason software that works on one robot usually does not translate to another. And unlike sequence-based problems like text, performance does not necessarily scale with more data, since other problems still stand in the way of scaling.
Memory & latency: even if large models can predict accurately offline, fast-moving autonomy requires on-edge, real-time inference to reliably navigate. A year ago, we started an engineering effort to run billion-parameter models on <$1k compute at the frequency required for real-world navigation. This required substantial effort in unifying architectures to reduce latency. Over the last year, we have partnered with leading cross-platform OEMs to test our models on platforms with limited onboard compute, conducting extensive cross-domain, on-device evaluations. We’re excited to share that we have achieved an internal breakthrough, unifying our architecture across multiple systems and successfully consolidating them into a single model that meets the latency requirements of fast-moving robots. With this unified architecture, we unlock real-time spatial inference, enabling full autonomous navigation across platforms and domains.
What’s next
Over the last year, we’ve been field testing our software on a range of COTS products across land and air that have a path to volume. A key constraint is to use the existing onboard compute and camera on day 1, without requiring our partners to adapt their hardware setups. A generalized visual navigation solution powered by intelligent software rather than hardware dramatically simplifies integration. By prioritizing advanced software over specialized hardware, the system adapts to dynamic environments and can be deployed across more customers, eliminating lengthy third-party integration cycles. The result is faster deployments, growing customer value, and reduced BOM costs that increase competitiveness.
In the coming weeks, we’ll share a series of results demonstrating state-of-the-art performance on ultra-sparse terrains across multiple platforms and cameras, targeting operational gaps of existing solutions that rely on traditional software and sensors. These were conducted zero-shot on existing robotic platforms.
Platforms should be future-proofed by architecting them to accept software updates that enable greater intelligence. This means including more powerful compute in the platform to offset fewer, lower-cost sensors. With the recent launch of Nvidia Thor, robot vendors now have access to a significant advancement in edge GPU computing.
If you’re building mobile robots for RF-degraded environments where off-the-shelf vision-based approaches struggle and BOM cost matters, let’s explore the future of your navigation stack together. Our solution provides a future-proof, unified software stack that delivers mission assurance in contested or unstructured environments, drastically accelerating your time-to-deployment and expanding your operational envelope.