
This research addresses the need for robust proprioceptive methods that capture the continuous deformations of soft robots without relying on multiple sensors that hinder compliance. We propose a vision-based deformation learning strategy called latent proprioception, which anchors the robot's overall deformation state to a single internal reference frame tracked by a miniature onboard camera. Through a multi-modal neural network trained on simulated and real data, we unify motion, force, and shape measurements into a shared latent representation, inferring unseen states from readily measured signals. Our experimental results show that this approach accurately reconstructs full-body deformations and forces from minimal sensing data, enabling soft robots to adapt to complex object-manipulation and safe human-interaction tasks. The proposed framework exemplifies how vision-based deformation learning can inform and enhance robotics by reducing sensor complexity while preserving mechanical flexibility. We anticipate that such hybrid system codesign will advance robotic capabilities, deepen our understanding of natural movement, and potentially translate back into healthcare and wearable technologies. This work paves the way for soft robots endowed with greater autonomy and resilience.
This study introduces an anchor-based approach that leverages a single internal reference frame to infer the full proprioceptive state of a soft robot, encompassing motion, force, and shape. We developed Proprioceptive Soft Robots (ProSoRo), an integrated system that combines soft materials with embedded sensing. Each ProSoRo features a metastructure mounted between a top and a bottom frame, with a marker affixed to the top frame serving as the anchor frame. A miniature monocular camera embedded in the bottom frame tracks the marker's movement in real time. More details on construction can be found in the hardware guide.
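For a concrete picture of the anchored observation, the sketch below shows one plausible way to recover the marker's 6-DoF pose from the onboard camera. It is not the authors' exact pipeline: it assumes an ArUco fiducial, OpenCV's newer aruco API (4.7+), and a pre-calibrated camera whose intrinsics `camera_matrix` and distortion `dist_coeffs` are supplied by the caller.

```python
# Minimal sketch: track the anchor marker's 6-DoF pose with a monocular camera.
# Assumes an ArUco marker of known side length and a calibrated camera.
import cv2
import numpy as np

MARKER_SIDE = 0.01  # marker side length in meters (assumed)

# 3D corners of the marker in its own frame (top-left, top-right,
# bottom-right, bottom-left), matching ArUco's corner ordering.
obj_pts = np.array([[-1, 1, 0], [1, 1, 0], [1, -1, 0], [-1, -1, 0]],
                   dtype=np.float32) * (MARKER_SIDE / 2)

detector = cv2.aruco.ArucoDetector(
    cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50),
    cv2.aruco.DetectorParameters())

def track_anchor(frame, camera_matrix, dist_coeffs):
    """Return the anchor pose [Dx, Dy, Dz, Rx, Ry, Rz], or None if lost."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    corners, ids, _ = detector.detectMarkers(gray)
    if ids is None:
        return None
    # Solve for the marker pose from its four detected image corners.
    ok, rvec, tvec = cv2.solvePnP(obj_pts, corners[0][0],
                                  camera_matrix, dist_coeffs)
    if not ok:
        return None
    return np.concatenate([tvec.ravel(), rvec.ravel()])
```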
To harness the full potential of this anchor-based approach, we developed a multi-modal proprioception learning framework built on a multi-modal variational autoencoder (MVAE), which aligns the motion, force, and shape of ProSoRos into a unified representation based on an anchored observation. The framework involves three stages:
Stage 1: Material identification. Recognizing the impracticality of collecting extensive physical datasets for soft robots, we leveraged finite element analysis (FEA) simulations to generate high-quality training data. We began by measuring the material's stress-strain curve with a standard uniaxial tension test to obtain the best-fitting material model. We then applied an evolution strategy (EVOMIA) to optimize the material parameters by comparing the force computed by FEA against the ground truth measured in a physical experiment under the same anchor-point motion, as sketched below.
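The sketch below illustrates the Stage 1 fitting loop with a plain (mu, lambda) evolution strategy; the specific EVOMIA update rules are not reproduced here, and `run_fea_force` is a hypothetical callable standing in for a batch FEA evaluation.

```python
# Sketch of material-parameter fitting by evolution strategy.
import numpy as np

def fit_material(run_fea_force, motions, measured_forces, theta0,
                 sigma=0.1, pop=16, elites=4, generations=50):
    """Fit material parameters by matching FEA forces to measured forces.

    run_fea_force(theta, motions) is a user-supplied (here hypothetical)
    callable that runs FEA with material parameters `theta` and returns
    the anchor-point forces under the given anchor motions.
    """
    theta = np.asarray(theta0, dtype=float)
    rng = np.random.default_rng(0)
    for _ in range(generations):
        # Sample a population of candidates around the current estimate.
        candidates = theta + sigma * rng.standard_normal((pop, theta.size))
        losses = [np.mean((run_fea_force(c, motions) - measured_forces) ** 2)
                  for c in candidates]
        # Recombine the best candidates and anneal the step size.
        theta = candidates[np.argsort(losses)[:elites]].mean(axis=0)
        sigma *= 0.95
    return theta
```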
Stage 2: Latent proprioceptive learning. Using the optimized material parameters, we generated a simulation dataset providing motion $[D_x, D_y, D_z, R_x, R_y, R_z]^\mathrm{T}$, force $[F_x, F_y, F_z, T_x, T_y, T_z]^\mathrm{T}$, and shape, represented by the displacements $[n_x, n_y, n_z]^\mathrm{T}$ of all $n$ mesh nodes (a $3n$-vector), as training inputs. To learn these modalities for explicit proprioception, the MVAE encodes the ProSoRo's proprioception via latent codes. Three modal latent codes are produced by modality-specific motion, force, and shape encoders, and a shared code fuses information from all three modalities by minimizing the discrepancies among the three codes. The shared code thus provides explicit proprioception in the latent space, which we denote latent proprioception, and can be decoded back into any of the three modalities by modality-specific decoders. A compact sketch of this architecture follows.
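The PyTorch sketch below captures the MVAE structure described above: three modality-specific encoders and decoders around a 32-dimensional latent space, with a loss that fuses and aligns the three modal codes. Layer widths, the fusion rule (a simple mean), and the loss weights are illustrative assumptions rather than the published hyperparameters.

```python
# Sketch of a multi-modal VAE aligning motion, force, and shape codes.
import torch
import torch.nn as nn

LATENT = 32  # 32-dimensional latent code, z_1 .. z_32

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 128), nn.ReLU(),
                         nn.Linear(128, out))

class MVAE(nn.Module):
    def __init__(self, n_nodes):
        super().__init__()
        dims = {"motion": 6, "force": 6, "shape": 3 * n_nodes}
        # Each encoder outputs the mean and log-variance of its latent code.
        self.enc = nn.ModuleDict({k: mlp(d, 2 * LATENT) for k, d in dims.items()})
        self.dec = nn.ModuleDict({k: mlp(LATENT, d) for k, d in dims.items()})

    def encode(self, name, x):
        mu, logvar = self.enc[name](x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

    def loss(self, motion, force, shape):
        batch = {"motion": motion, "force": force, "shape": shape}
        zs, recon, kl, align = {}, 0.0, 0.0, 0.0
        for name, x in batch.items():
            z, mu, logvar = self.encode(name, x)
            zs[name] = z
            kl += -0.5 * torch.mean(1 + logvar - mu ** 2 - logvar.exp())
        # Shared code: fuse the three modal codes, then align each to it
        # and reconstruct every modality from the shared code.
        z_shared = torch.stack(list(zs.values())).mean(dim=0)
        for name, x in batch.items():
            align += torch.mean((zs[name] - z_shared) ** 2)
            recon += torch.mean((self.dec[name](z_shared) - x) ** 2)
        return recon + 1e-3 * kl + align
```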
Stage 3: Cross-modal inference. In real-world deployments, a modality such as shape can be estimated from latent proprioception instead of being measured directly, which is usually infeasible during real-time interaction. At this stage, we visually capture the ProSoRo's anchor point as the MVAE's input and estimate the force and shape modalities from the latent knowledge learned on simulation data, as in the usage sketch below. We found the proposed latent proprioception framework to be a versatile solution for soft robotic interaction.
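Continuing the MVAE sketch above, Stage 3 usage might look as follows: only the anchor motion is measured, and force and shape are decoded from its latent code. The motion values here are dummies and the node count is an assumption.

```python
# Sketch of cross-modal inference from the camera-tracked anchor motion.
import torch

model = MVAE(n_nodes=1000)  # node count is an illustrative assumption
model.eval()
with torch.no_grad():
    motion = torch.tensor([[0.0, 0.0, -2.0, 0.0, 0.1, 0.0]])  # [Dx..Rz]
    _, mu, _ = model.encode("motion", motion)  # use the mean at inference
    force = model.dec["force"](mu)   # estimated [Fx, Fy, Fz, Tx, Ty, Tz]
    shape = model.dec["shape"](mu)   # estimated 3n node displacements
```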
Following the above stages, we realized proprioception on six ProSoRos of different shapes; the prototypes are shown below.
To delve deeper into the latent code's structure, we analyzed the correlations among the latent codes $\mathbf{z}_{m}$, $\mathbf{z}_{f}$, and $\mathbf{z}_{s}$ generated from motion, force, and shape inputs. Six components of the latent code exhibited high correlation coefficients across the modalities, identifying them as key morphing primitives that encode the robot's deformation behaviors and are instrumental in cross-modal inference. We visualized the relationships between the explicit proprioceptive modalities and the latent components $z_{1}$ to $z_{32}$ using a chord diagram, in which $z_{8}$, $z_{19}$, $z_{20}$, $z_{22}$, $z_{31}$, and $z_{32}$ dominate with the largest widths, confirming them as the key morphing primitives. A sketch of this correlation analysis follows.
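The sketch below shows one straightforward way to carry out such an analysis: for each latent dimension, average the absolute Pearson correlations between the modal codes across a dataset and keep the top six. The arrays `z_m`, `z_f`, and `z_s` are assumed to be pre-collected (N, 32) modal latent codes.

```python
# Sketch: rank latent dimensions by cross-modal correlation.
import numpy as np

def key_primitives(z_m, z_f, z_s, top_k=6):
    scores = []
    for j in range(z_m.shape[1]):
        pairs = [(z_m[:, j], z_f[:, j]), (z_m[:, j], z_s[:, j]),
                 (z_f[:, j], z_s[:, j])]
        # Average absolute Pearson correlation over the three modal pairs.
        scores.append(np.mean([abs(np.corrcoef(a, b)[0, 1])
                               for a, b in pairs]))
    return np.argsort(scores)[::-1][:top_k]  # indices of top-k dimensions
```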
By systematically varying these six latent components, we generated a series of deformation modes for ProSoRos. Each key morphing primitive influences the shape along a distinct manifold, providing intuitive control handles for manipulating complex deformations; see the latent-sweep sketch below.
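A latent sweep of this kind can be sketched as follows, continuing the MVAE example: one key primitive is varied around a base code while the others are held fixed, and the shape decoder renders each variant. The sweep range is an assumption.

```python
# Sketch: decode a deformation mode by sweeping one latent component.
import torch

def traverse(model, z_base, dim, values=torch.linspace(-3, 3, 7)):
    shapes = []
    with torch.no_grad():
        for v in values:
            z = z_base.clone()
            z[0, dim] = v                  # perturb one latent component
            shapes.append(model.dec["shape"](z))
    return torch.stack(shapes)             # one decoded shape per value
```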
To evaluate the MVAE's real-world applicability, we conducted experiments using physical ProSoRo prototypes.
We compared the shapes estimated by the MVAE with ground truth measured by a 3D scanner under 30 different poses for each type of ProSoRo, and we compared the estimated forces with those measured by a force/torque sensor during 30 seconds of manual movement. The results verify that the latent proprioception learned from simulation data transfers to reliable state estimation on physical ProSoRos, confirming the model's effectiveness in cross-modal inference.
The capacity to replicate complex motions across different platforms is a powerful tool for advancing soft robotic applications, such as teleoperation and coordinated multi-robot systems. We demonstrated this capability by digitizing the motion of a manually operated ProSoRo and synthesizing it on an active ProSoRo mounted on a tendon-driven platform.
We evaluated motion-replication fidelity with a motion-capture (MoCap) system by tracking trajectories for various motion patterns, including circular, square, four-leaved rose, and spiral paths. The results show that the MVAE's latent proprioception enables accurate and robust digitizing and synthesis of ProSoRo's omnidirectional motion.
Understanding contact interactions is crucial for soft robots engaged in manipulation tasks, especially in unstructured or dynamic environments. We explored the MVAE's capability to infer contact states during a pivoting manipulation of a wine glass using a soft-rigid hybrid arm. The hybrid arm incorporated omni-neck ProSoRos as joints and dome-shaped ProSoRos as gripper fingers.
Initially, the arm hung naturally downward with the gripper open. The gripper then closed to grasp the wine glass lying on the plane, compressing the two dome ProSoRos to a contact extent that allowed the wine glass to rotate in hand. In the visualization, colors represent the displacement of each node, supporting the operator's intuitive judgment. As the arm rolled up, the gripper dragged the wine glass along the plane. Once the wine glass was lifted to an appropriate height, it was rotated to a standing pose in the gripper. Finally, the gripper opened and the two dome ProSoRos relaxed.
Throughout this sequential contact process, the two dome ProSoRos on the gripper deformed on contact with the wine glass, and the motion captured by the inner camera was encoded into latent codes $z_{1}$ to $z_{32}$ by the MVAE. Analyzing the temporal evolution of these codes revealed distinct patterns corresponding to different contact states. Clustering the latent codes with the k-means algorithm revealed four primary interaction phases: relaxed, compressed, sliding, and transforming, which we visualized with t-distributed stochastic neighbor embedding (t-SNE), as sketched below. The results demonstrate the contact-reasoning capability of latent proprioception during sequential contact manipulation, providing strong support for identifying contact states and incorporating them into control strategies.
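The sketch below shows a minimal version of this analysis with scikit-learn: k-means assigns each recorded latent code to one of four phases, and t-SNE embeds the codes in 2D for visualization. `Z` is an assumed (N, 32) array of latent codes logged over the manipulation.

```python
# Sketch: cluster logged latent codes into contact phases and embed in 2D.
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def contact_phases(Z, n_phases=4):
    # Assign each time step's latent code to one of n_phases clusters.
    labels = KMeans(n_clusters=n_phases, n_init=10,
                    random_state=0).fit_predict(Z)
    # 2D embedding of the latent codes for visual inspection.
    embedding = TSNE(n_components=2, random_state=0).fit_transform(Z)
    return labels, embedding  # phase label and 2D point per time step
```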
@article{han2025anchoring,
  title={Anchoring Morphological Representations Unlocks Latent Proprioception in Soft Robots},
  author={Han, Xudong and Guo, Ning and Xu, Ronghan and Wan, Fang and Song, Chaoyang},
  journal={Advanced Intelligent Systems},
  volume={0},
  pages={0-0},
  year={2025}
}