For much of the past decade, progress in robotics and autonomous systems has been framed largely through vision. Cameras, LiDAR, and computer vision algorithms have dominated discussions around perception, mapping, and situational awareness.
Sound, by contrast, has often been treated as a secondary input – useful for wake words and basic commands, but rarely central to machine intelligence.
Yet hearing is arguably one of the most fundamental senses for understanding the world, particularly when it comes to language and interaction. Humans begin responding to sound before birth, learning rhythm, tone, and voice long before their eyes ever open.
Long after infancy, sound remains essential for navigating shared spaces, identifying who is speaking, and understanding intent in noisy, unpredictable environments – conditions that robots increasingly face in homes, factories, vehicles, and public spaces.
In this Q&A, Dani Cherkassky, CEO at Kardome, outlines why spatial hearing may be a missing piece in the evolution of so-called “physical AI”.
Kardome’s work focuses on enabling machines to process sound in a way that more closely resembles human hearing, rather than treating audio as a flat signal to be sent to the cloud for interpretation.
Spatial Hearing AI, in simple terms, allows a robot or device to understand where sounds are coming from – separating voices, background noise, and reflections within a three-dimensional space.
Cognition AI, meanwhile, operates at a higher level, interpreting meaning, intent, and conversational context from those clean audio inputs so machines can respond appropriately and in real time.
Together, these approaches aim to move voice interaction beyond rigid commands and wake words, toward something closer to natural conversation – even in busy, acoustically complex environments.
As robotics companies increasingly talk about “physical AI” and embodied intelligence, the ability to hear – reliably, spatially, and contextually – may prove just as important as the ability to see.
The conversation below explores how Kardome approaches this challenge, why edge-based audio intelligence matters, and where spatial hearing could fit into the next generation of autonomous machines.
Interview with Dani Cherkassky
Robotics & Automation News: Kardome is positioning Spatial Hearing AI as the next major capability for human-robot collaboration. What technical problem are you solving that today’s speech recognition, beamforming, or noise-cancellation systems cannot?

Dani Cherkassky: Every robot you use today is essentially deaf and brainless until you scream its wake word, and even then, it doesn't understand what you said or what you want. It acts like a brilliant but unreliable consultant who shows up late to every meeting, hasn't read any of your emails, and expects to solve your problems in 30 seconds.
This isn’t a small UX problem. It’s the reason voice interfaces have failed to take over, despite a decade of hype and billions in investment.
The cloud-centric LLM architecture that powers the Voice AI of essentially every robot today – models like ChatGPT or Gemini – has three fatal flaws:
They’re too expensive to keep awake. Running GPT-4 continuously for every robot device would bankrupt the manufacturer in a week. So these systems sit dormant until triggered, missing all the context that matters.
They’re too slow for conversation. Humans pause 200 milliseconds between turns. Cloud LLMs take 1-3 seconds. That’s not conversation; that’s walkie-talkie chat.
They’re solving the wrong problem. You don’t need GPT-4 to turn off the lights or set a timer. But because everything goes to the cloud, even trivial requests get the full heavyweight treatment.
The breakthrough comes from mimicking how human cognition actually works. Our brain doesn’t route every decision through its most computationally expensive region.
You don’t engage your full cognitive powers to catch a ball, recognize your friend’s voice, or understand when someone’s talking to you versus talking past you.
Psychologists call this “System 1 thinking” – fast, intuitive, always-on. Only for complex problems do you engage “System 2” – slow, deliberate, energy-intensive reasoning. Voice AI needs the same dual architecture.
Building an edge-based System 1 requires solving problems that cloud-first architectures never had to address.
The Cocktail Party Problem: In any real environment, there are multiple sound sources – people talking, music playing, traffic noise, and the dishwasher running.
Cloud LLMs receive this as a single mixed audio stream and do their best with garbage data. Edge AI has to solve this before any language processing happens.
Spatial Hearing AI is a multidimensional soundscape analysis that gives robots a sense of spatial awareness, similar to human hearing. Instead of treating all incoming sound as a single mixed signal, the system extracts spatial cues, analyzes reflection patterns, and isolates individual voices.
This creates a foundation for natural human-robot collaboration, where speech becomes a reliable input modality even in acoustically complex environments such as a noisy kitchen with three people talking, music playing, and cooking sounds.
While Spatial Hearing AI localizes and separates the speakers, Cognition AI determines what each person actually means. It identifies who is speaking, interprets intent from natural phrasing, and maintains short-term conversational context so the robot can follow multi-step instructions without rigid command structures.
The Spatial Hearing AI is trained for each robot's specific context and handles simple functions using an on-device SLM (small language model). Only when the edge system determines "this needs deeper reasoning" does it activate the cloud LLM – with full context already established.
By combining Spatial Hearing AI’s physical awareness with Cognition AI’s semantic understanding, robots gain a human-like ability to engage in real-time, contextually grounded dialogue, even in noisy multi-user environments.
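To make this two-tier flow concrete, the sketch below shows one way an always-on edge layer could route utterances: high-confidence simple intents are executed locally, while open-ended requests are escalated to a cloud LLM with the short-term context already attached. The class names, thresholds, and the cloud_llm call are illustrative assumptions, not Kardome's actual APIs.

```python
# Illustrative sketch only: a two-tier "System 1 / System 2" voice router.
# All names, thresholds, and the cloud_llm call are hypothetical assumptions,
# not Kardome's implementation.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Utterance:
    text: str             # transcript of one separated speaker
    speaker_id: str       # who said it (from spatial separation)
    direction_deg: float  # estimated direction of arrival


@dataclass
class EdgeRouter:
    """Always-on edge layer: handles simple intents locally and escalates
    open-ended requests to a cloud LLM with context already established."""
    cloud_llm: Callable[[str], str]                    # injected cloud call (assumed)
    confidence_threshold: float = 0.8
    context: List[str] = field(default_factory=list)   # short-term dialogue memory

    def classify_intent(self, utt: Utterance) -> Tuple[str, float]:
        # Stand-in for an on-device SLM intent classifier.
        simple = {"lights off": "lights_off", "set a timer": "set_timer"}
        for phrase, intent in simple.items():
            if phrase in utt.text.lower():
                return intent, 0.95
        return "open_ended", 0.3

    def handle(self, utt: Utterance) -> str:
        self.context.append(f"{utt.speaker_id}: {utt.text}")
        intent, confidence = self.classify_intent(utt)
        if confidence >= self.confidence_threshold:
            # "System 1": fast, local, cheap.
            return f"[edge] executing '{intent}' for {utt.speaker_id}"
        # "System 2": escalate with the context gathered on-device.
        prompt = "\n".join(self.context[-5:])
        return self.cloud_llm(prompt)


if __name__ == "__main__":
    router = EdgeRouter(cloud_llm=lambda p: f"[cloud] reasoning over:\n{p}")
    print(router.handle(Utterance("Please set a timer for the pasta", "speaker_1", 42.0)))
    print(router.handle(Utterance("What should I cook with what's in the fridge?", "speaker_2", -15.0)))
```

In a real system the local classifier would be a trained SLM rather than keyword matching, but the escalation pattern – cheap, always-on handling first, expensive reasoning only on demand – is the same.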
R&AN: Robotics and autonomous systems today largely rely on vision and LiDAR for perception and mapping. How does ultrasound-based spatial hearing complement – or potentially replace – optical sensors in real-world environments?
DC: Currently, our models are focused solely on processing audio within the human hearing range. Moving forward, however, combining audio and video processing will likely be essential to achieve true human-level situational and context awareness.
R&AN: Can you explain, in practical terms, how your system localizes and separates voices in noisy industrial settings such as factories or warehouses? What are the key latency, range, and accuracy benchmarks?
DC: Kardome Spatial Hearing AI is a multidimensional soundscape analysis that runs continuously on-device. It maps the 3D environment and separates the sound sources within it using the following procedure:
- Extracts spatial cues: Where is each sound source relative to the device? Not just left/right, but distance and environment geometry.
- Analyzes reflection patterns: Every sound source creates a unique pattern of reflections off walls, furniture, and surfaces. These patterns act like acoustic fingerprints.
- Separates sources: By understanding these spatial signatures, the system can isolate individual voices – even in a noisy kitchen with three people talking, music playing, and cooking sounds.
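The first step above, extracting spatial cues, can be illustrated with a classical signal-processing building block: estimating the time difference of arrival between two microphones with GCC-PHAT and converting it to a direction of arrival. The sketch below is a textbook approximation under assumed microphone spacing and sample rate, not Kardome's proprietary pipeline, which also models reflections and separates overlapping sources.

```python
# Illustrative sketch: classical GCC-PHAT direction-of-arrival estimation
# for a two-microphone array. A textbook building block for spatial cue
# extraction, not Kardome's algorithm; all parameters are assumed values.
import numpy as np

FS = 16_000           # sample rate (Hz), assumed
MIC_SPACING = 0.10    # distance between the two microphones (m), assumed
SPEED_OF_SOUND = 343.0


def gcc_phat(sig: np.ndarray, ref: np.ndarray, fs: int) -> float:
    """Return the estimated time delay (s) of `sig` relative to `ref`."""
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting: keep phase only
    corr = np.fft.irfft(cross, n=n)
    max_shift = int(fs * MIC_SPACING / SPEED_OF_SOUND)
    corr = np.concatenate((corr[-max_shift:], corr[: max_shift + 1]))
    shift = np.argmax(np.abs(corr)) - max_shift
    return shift / fs


def direction_of_arrival(tdoa: float) -> float:
    """Convert a time difference of arrival to an angle in degrees."""
    sin_theta = np.clip(tdoa * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))


if __name__ == "__main__":
    # Simulate a broadband source arriving at mic 2 a few samples late.
    rng = np.random.default_rng(0)
    src = rng.standard_normal(FS)             # 1 s of broadband "speech"
    delay_samples = 3                         # ~0.19 ms, i.e. roughly 40 degrees off-axis
    mic1 = src
    mic2 = np.concatenate((np.zeros(delay_samples), src[:-delay_samples]))
    tdoa = gcc_phat(mic2, mic1, FS)
    print(f"Estimated TDOA: {tdoa * 1e6:.0f} us -> DOA approx. {direction_of_arrival(tdoa):.1f} deg")
```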
R&AN: Many AI robotics platforms are moving compute to the edge for real-time safety. What is Kardome’s processing model – edge, on-device, hybrid – and why? What are the energy and hardware requirements?
DC: The Kardome Voice AI model is built from two submodels: Spatial Hearing AI and Cognition AI. Both operate entirely on the Edge, with the Cognition AI having the capability to activate a third-party Cloud Large Language Model (for example, ChatGPT) for more complex, deeper reasoning.
R&AN: Beyond robotics, what commercial deployments or pilots can you share? Are there adoption metrics – customers, deployments, partnerships, or revenue growth indicators – that illustrate traction?
DC: Kardome has successfully transitioned from R&D to commercial scale by securing deep integrations with OEMs and Tier-1 manufacturers, most notably establishing a stronghold in the Asian electronics and automotive markets.
Kardome's proprietary "Spatial Hearing" technology is now deployed in over 11 million devices globally, a figure driven by strategic partnerships with industry giants such as LG Electronics (integrating the technology into Smart TVs and appliances), KT Corporation (powering voice control for the Genie TV set-top box), and SK Intellix (deployed in the "NAMUHX" air-purifier robot, showcasing the technology's ability to handle motor noise and movement while maintaining voice accuracy).
In the automotive sector, Kardome has achieved critical validation through its integration into Panasonic Automotive’s SkipGen2 infotainment system and Nvidia’s Drive AGX platform, while also collaborating with the Renault Group on the H1st Vision concept car.
This commercial traction is backed by significant financial and operational growth, including a $10 million Series A round led by Korea Investment Partners with strategic backing from the Hyundai Motor Group, and the establishment of a regional headquarters in Seoul to directly support its primary supply chain partners.
R&AN: We see Nvidia, Tesla, Figure AI and others discussing “physical AI” as the next computing platform. Where does Spatial Hearing AI fit within that emerging ecosystem, and how do you expect voice-driven interaction with machines to evolve over the next 3-5 years?
DC: Kardome’s technology is highly relevant to Physical AI because it addresses a fundamental challenge: providing machines with a crucial “sense” – hearing – that is accurate, spatially aware, and contextual, even in complex, real-world environments.
Spatial Hearing AI: The ‘Ears’ for Physical AI
Physical AI relies on the perception of the environment. Kardome’s Spatial Hearing AI gives devices the auditory equivalent of human sight, making it a critical sensor for systems that need to operate in the physical world.
Cognition AI: Contextual Understanding for Action
Physical AI systems must not just hear, but understand what they hear within the context of the physical world around them. The Cognition AI takes the clean audio input and couples it with the spatial information to interpret intent. This allows for fluid, natural interaction, which is a hallmark of true Physical AI.
Kardome bridges the gap between raw audio signals and intelligent physical action. By providing human-level spatial hearing and contextual understanding on the Edge, Kardome enables Physical AI devices to transition from being simple voice-controlled gadgets to truly aware and responsive agents in the physical world.
