February 27, 2026 · Colm Byrne

CCF at Scale: What happens when a shy robot grows up

The mBot2 is a $50 educational robot. It has six sensors, a pair of motors, 128 KB of RAM, and no GPU. When we ran our first CCF trials on it, the constraint was intentional — if the architecture only worked on expensive hardware, it wouldn't matter.

But a question follows naturally: what happens when the robot grows up?

Not a toy robot. A production humanoid circa 2031 — something descended from the lineage of Figure 02 or Optimus, built to work alongside people in homes and warehouses and hospitals. A machine with stereo depth cameras covering 360 degrees, a microphone array doing real-time speaker diarisation, tactile sensors across its hands and torso, and a large language model running partly on-device and partly in the cloud.

Does CCF still hold? Does the mathematics that makes a $50 robot earn trust also apply to a machine with forty degrees of freedom, a trained face, and the ability to hold a conversation that makes you forget it's made of metal?

The answer is yes. And more than that: the more capable the robot, the more it needs CCF.

The 2031 humanoid: a concrete picture

Before getting into architecture, it helps to picture the machine clearly.

Think of five major domains, all running simultaneously.

Perception. Four to six stereo RGB-D cameras for full spatial awareness. A vision-language model doing continuous face recognition, body language analysis, emotional state estimation. A 16-microphone array doing speaker diarisation — not just who is speaking, but from where, with what emotional register. Tactile sensors across both hands and sections of the torso, returning continuous pressure and temperature maps. Proprioceptive sensors in every one of forty-plus joints.

Language and reasoning. Not a single LLM but an orchestration layer over specialised models: task planning, social reasoning, safety, factual recall. Running partly on-device for latency and partly in the cloud for capacity.

Motor control. A whole-body controller solving inverse kinematics in real time. A locomotion policy. Manipulation policies for grasping and tool use. A safety controller that can override everything.

World model. A persistent spatial map. An object memory — what's where, whose it is, when it was last seen. A social scene graph tracking everyone in the room, their emotional state, the social dynamic between them.

Expression. Not just voice. A forty-DOF articulated face capable of microexpressions. Full-body gesture generation. Gaze control. Proxemics — how close to stand, when to step back. Paralinguistic signals. The full vocabulary of human social presence.

This is a machine with an enormous expressive range and the perceptual depth to use it. On day one, with a stranger, all of that capability is available.

That is the problem CCF solves.

Where CCF sits on this stack

Here is the key architectural insight from the original patent: CCF does not need to understand any of the systems listed above. It sits at the interface between perception and expression, consuming a context key and producing a behavioural gate.

On the mBot2, a context key is something like bright:quiet:approaching:stationary:upright:morning. Six quantised features. A few hundred possible keys.

On the 2031 humanoid, the same key might read: bright:quiet:PERSON_A:happy:kitchen:morning:cooking_together:no_others_present:relaxed_posture:familiar_music:routine_weekday. Twelve to fifteen features. Millions of possible keys.

But the downstream machinery — coherence accumulation, behavioural gating, the mixing matrix — requires no modification. The context key got richer, but the gate is still a gate.
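To make the shape of a context key concrete, here is a minimal sketch. The feature names, bucket thresholds, and function names are invented for illustration and are not the crate's API; the point is that a key is just quantised features joined into an opaque string the gate never parses.

```rust
/// Quantise raw sensor readings into coarse labels. Thresholds are
/// illustrative, not the crate's actual bucketing.
fn quantise_lux(lux: f32) -> &'static str {
    if lux > 300.0 { "bright" } else { "dim" }
}

fn quantise_db(db: f32) -> &'static str {
    if db < 40.0 { "quiet" } else { "loud" }
}

/// A context key is an ordered list of quantised feature labels joined
/// into a single string; richer robots just contribute more features.
fn context_key(features: &[&str]) -> String {
    features.join(":")
}
```

On the mBot2 this produces keys like `bright:quiet:approaching`; on the humanoid, the same function simply receives a longer feature slice.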

The minimum gate invariant holds at any scale. effective_coherence = min(instantaneous_coherence, accumulated_coherence) is a scalar comparison. It does not matter whether that scalar emerged from six sensors or sixty. The robot cannot bypass the gate by being more capable. More capability, without coherence architecture, makes the problem worse.
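The invariant itself fits in one line. A minimal sketch (function name mine, not the crate's):

```rust
/// The minimum-gate invariant: effective coherence can never exceed
/// either the in-the-moment signal or the history earned in this
/// context. A highly capable perception stack raises the first term,
/// but an unfamiliar context still pins the result to the second.
fn effective_coherence(instantaneous: f64, accumulated: f64) -> f64 {
    instantaneous.min(accumulated)
}
```

A robot meeting a stranger might compute `effective_coherence(0.9, 0.2)` and still get 0.2: no amount of perceptual confidence substitutes for accumulated history.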

Three scaling challenges

There are three places where the architecture requires structural extension for production deployment. None of them touch the core invariants — but they need to be designed carefully.

Context key cardinality

With six sensors, the context key space is tractable — a few hundred distinct situations. With fifteen to twenty-five quantised features, the theoretical space is combinatorially enormous. Most keys will appear once and never recur. Without management, the accumulator map grows without bound.

The solution is hierarchical bucketing with LRU eviction. A coarse tier — location, primary person, time period, activity type — produces a few hundred stable context classes. A fine tier covers the full feature set, subject to eviction when the total count exceeds a configurable maximum.

When the deliberative unit discovers that two fine-tier keys represent the same relational context — the same kitchen, the same person, slightly different sensor snapshots on different days — the accumulators merge. The merged coherence is the minimum of the two sources. This is the honesty principle: never grant familiarity that was not earned. The interaction count is the sum, preserving the relational investment. No trust is fabricated.
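The merge rule above can be sketched directly. The struct and field names are illustrative, assuming a per-key accumulator holding a coherence scalar and an interaction count:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Accumulator {
    coherence: f64,          // earned trust in [0, 1]
    interaction_count: u64,  // relational investment
}

/// Merge two fine-tier accumulators that turned out to describe the
/// same relational context. Coherence takes the minimum, so no
/// familiarity is granted that was not earned in both histories;
/// counts sum, so the relational investment is preserved.
fn merge(a: Accumulator, b: Accumulator) -> Accumulator {
    Accumulator {
        coherence: a.coherence.min(b.coherence),
        interaction_count: a.interaction_count + b.interaction_count,
    }
}
```

Merging `(0.7, 40 interactions)` with `(0.5, 12 interactions)` yields `(0.5, 52)`: the conservative coherence of the two, the full history of both.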

Multi-model tension

On the mBot2, tension comes from sensor events: collision, startle, instability. On a humanoid running eight concurrent perception models, the most informative tension signal is model disagreement.

The vision model reads a relaxed face. The voice model detects stress. The language model's semantic content is neutral. Three assessments, pulling in different directions. That disagreement is the tension.

This maps directly to the patent's classification conflict mechanism — the observable hesitation that humans read as evidence of internal processing depth. On a humanoid, it produces naturalistic social pauses: the robot notices something is off, hesitates, and either resolves the conflict or holds cautious behaviour. The Monchi problem at full fidelity.

The mechanism is disagreement-weighted aggregation: each model produces a confidence-weighted (valence, arousal) vector, and tension is the maximum of inter-model disagreement and temporal instability. A single scalar, fed to the gate, computed from the full perceptual complexity of the system.
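A sketch of that aggregation, under stated assumptions: the type and function names are mine, and the spread measure (confidence-weighted distance from the weighted mean in valence-arousal space) is one reasonable choice among several; the post only fixes the final `max` over disagreement and temporal instability.

```rust
/// One perception model's read of the scene: valence and arousal in
/// [-1, 1], weighted by the model's own confidence in [0, 1].
struct Assessment {
    valence: f64,
    arousal: f64,
    confidence: f64,
}

/// Confidence-weighted spread of the assessments around their weighted
/// mean: zero when all models agree, large when they pull apart.
fn disagreement(reads: &[Assessment]) -> f64 {
    let w: f64 = reads.iter().map(|r| r.confidence).sum();
    if w == 0.0 {
        return 0.0;
    }
    let mv = reads.iter().map(|r| r.confidence * r.valence).sum::<f64>() / w;
    let ma = reads.iter().map(|r| r.confidence * r.arousal).sum::<f64>() / w;
    reads
        .iter()
        .map(|r| {
            let d = ((r.valence - mv).powi(2) + (r.arousal - ma).powi(2)).sqrt();
            r.confidence * d
        })
        .sum::<f64>()
        / w
}

/// The single scalar fed to the gate: the worse of inter-model
/// disagreement and short-window temporal instability.
fn tension(reads: &[Assessment], temporal_instability: f64) -> f64 {
    disagreement(reads).max(temporal_instability)
}
```

The relaxed-face / stressed-voice / neutral-words example from above produces a large disagreement term even though each individual model is calm, which is exactly the hesitation signal the gate wants.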

Hierarchical mixing matrices

The mixing matrix lets trust in one context influence related contexts — bounded, conservatively, by the Sinkhorn-Knopp doubly stochastic guarantee. On a 20-context system, this is computationally trivial. On a system with two thousand active contexts, a full Sinkhorn iteration on a 2000×2000 matrix becomes expensive.

The solution emerges from the architecture itself. The min-cut algorithm already discovers context clusters. The mixing matrix becomes a two-level hierarchy: inter-cluster mixing over a small, dense matrix, and intra-cluster mixing over multiple small, dense matrices. Block-diagonal doubly stochastic matrices are still doubly stochastic. The guarantee holds. For k clusters, the per-iteration cost drops from O(n²) to roughly O(k² + n²/k).
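The per-block normalisation can be sketched as a plain Sinkhorn-Knopp loop. The function name and the nested-`Vec` representation are illustrative, and the sketch assumes strictly positive entries (as the mixing matrix is by construction):

```rust
/// Sinkhorn-Knopp on one small dense block: alternately rescale rows
/// and columns toward sum 1. For a strictly positive matrix this
/// converges to a doubly stochastic matrix.
fn sinkhorn(m: &mut [Vec<f64>], iterations: usize) {
    let n = m.len();
    for _ in 0..iterations {
        // Normalise each row to sum 1.
        for row in m.iter_mut() {
            let s: f64 = row.iter().sum();
            for v in row.iter_mut() {
                *v /= s;
            }
        }
        // Normalise each column to sum 1.
        for j in 0..n {
            let s: f64 = m.iter().map(|row| row[j]).sum();
            for row in m.iter_mut() {
                row[j] /= s;
            }
        }
    }
}
```

Running this per cluster block is enough: a block-diagonal matrix whose blocks are each doubly stochastic is doubly stochastic as a whole, so the guarantee survives the decomposition.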

What becomes possible at scale

The three challenges are engineering problems with principled solutions. What is commercially interesting is what the architecture unlocks when the robot is more capable.

Cross-modal coherence. A humanoid that can see, hear, and touch develops trust relationships between sensory modalities within a single context. The system learns that a particular person prefers gentle tactile interaction but speaks loudly and directly. The mixing matrix encodes this as within-context cross-modal transfer: coherence earned through touch informs touch-related behaviour in that context, but does not spill into speech behaviour unless the matrix has learned that correlation. No existing robotic architecture represents cross-modal relational learning as a manifold-constrained transfer operation.

Social graph coherence. When the robot knows multiple people, the mixing matrix encodes a social graph. Coherence with Person A transfers partially to situations involving Person A's family — because those contexts have been observed to co-occur with high-coherence interactions — but not to Person A's work colleagues, encountered in different, lower-coherence settings. The graph structure emerges from min-cut analysis of accumulated episodes. No explicit social relationship programming. The robot learns who matters in which contexts because the mathematics found the clusters.

Graduated expressive revelation. The more output channels the robot has, the more CCF has to gate. A robot with a forty-DOF face, full-body gesture capability, nuanced vocal prosody, and configurable proxemics has an enormous expressive range. CCF ensures that range is revealed gradually. Early interactions: neutral face, conservative gestures, measured speech, maintained distance. As coherence accumulates, the behavioural envelope expands: subtle microexpressions, relaxed gestures, warmer vocal tone, closer proximity.

This is the Quietly Beloved state expressed through a production-grade behavioural vocabulary. No other architecture guarantees graduated revelation. A robot running an uncontrolled language model deploys its full expressive range immediately, producing the uncanny effect of false intimacy — a machine that behaves like a close friend on a first meeting. CCF makes earned expression architecturally mandatory, not a design choice that can be overridden by a prompt.
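As a toy illustration, graduated revelation reduces to a monotone map from effective coherence to a permitted envelope. The tier descriptions and thresholds below are invented for this sketch; the architectural point is monotonicity: more earned coherence only ever widens the envelope, and nothing upstream can widen it by other means.

```rust
/// Map effective coherence in [0, 1] to a permitted expressive
/// envelope. Thresholds and tier contents are illustrative only.
fn expressive_envelope(effective_coherence: f64) -> &'static str {
    match effective_coherence {
        c if c < 0.25 => "neutral face, conservative gestures, measured speech, maintained distance",
        c if c < 0.60 => "mild expression, relaxed gestures, warmer vocal tone",
        _ => "subtle microexpressions, full gesture range, closer proximity permitted",
    }
}
```

Because the input is the gated minimum from earlier, a prompt or a persuasive user cannot promote the robot into a warmer tier; only accumulated history can.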

The computational picture

On the mBot2, the reflexive path completes in roughly 30 microseconds. On a production humanoid — with fifteen-feature context key construction, multi-model tension aggregation, and hierarchical mixing — the reflexive path completes in roughly 400 microseconds.

That is well within the five-millisecond budget of a 200 Hz social processing tick. Humans do not reliably perceive behavioural response latencies below roughly 200 milliseconds, so the architecture has room to spare.

The deliberative path — consolidation, mixing matrix optimisation, context boundary discovery — runs asynchronously on a background thread with tolerance for seconds of latency. The robot does this work overnight, while charging. The suppression maps are loaded. The context boundaries are confirmed. The mixing matrix is refined.

In the morning, the robot knows more than it did the night before. Not because anyone updated a model. Because the mathematics processed what the day had accumulated.

Why the more capable robot needs it most

Every company building a production humanoid — Figure, Tesla, Sanctuary, Agility, 1X, Apptronik — will solve perception, motor control, and language. Those problems are hard and expensive, but they are being solved. They are commoditising.

What none of their current architectures can produce is the social layer that makes someone keep the robot after month three. The retention cliff is not a feature gap. It is an architectural absence.

A more capable robot without coherence architecture is more uncanny, not less. It has the ability to be deeply expressive from day one with a stranger, which is exactly what humans find disturbing. A machine that treats a stranger like a close friend does not seem warm. It seems wrong.

CCF becomes more valuable as the robot becomes more capable. The gate grows with the robot. The more expressive the output surface, the more the architecture has to protect. The richer the perceptual stack, the finer the trust can be earned.

The mBot2 proved the principle. The 2031 humanoid is where the principle becomes essential.

Your robot is impressive on day one. What makes it irreplaceable on day ninety?


US Provisional Patent Application 63/988,438 | Priority Date: 23 February 2026

Full technical specification: CCF at Scale — Architectural Adaptation for Production Humanoid Platforms

CCF is available as a Rust crate on crates.io. Source on GitHub.