Abstract

Generative robot policies such as Flow Matching offer flexible, multi-modal policy learning but are sample-inefficient. Although object-centric policies improve sample efficiency, they do not resolve this limitation. In this work, we propose Multi-Stream Generative Policy (MSG), an inference-time composition framework that trains multiple object-centric policies and combines them at inference to improve generalization and sample efficiency. MSG is model-agnostic and inference-only, and hence widely applicable to various generative policies and training paradigms. We perform extensive experiments both in simulation and on a real robot, demonstrating that our approach learns high-quality generative policies from as few as five demonstrations, a 95% reduction in demonstrations, and improves policy performance by 89% compared to single-stream approaches. Furthermore, we present comprehensive ablation studies on various composition strategies and provide practical recommendations for deployment. Finally, MSG enables zero-shot object instance transfer.

Overview

Multi-Stream Generative Policy (MSG) is the first multi-stream object-centric generative policy. It learns multiple object-centric policies, for example, one policy in the end-effector frame and one in the object frame. At inference time, the policies are composed using the object poses. This explicit use of scene geometry enables sample-efficient learning from as few as five to ten demonstrations. In contrast, naive policy learning in the world frame requires on the order of one hundred demonstrations to achieve similar performance. The modularity of MSG further enables zero-shot transfer to novel and cluttered scenes by using a DINO-based keypoint encoder. Because it is inference-only, MSG is widely applicable to different generative policy approaches, such as Diffusion and Flow Matching.
Overview of our approach
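As a rough illustration of how the object poses enter the composition, the following PyTorch sketch maps one stream's local-frame predictions into the world frame. It assumes, purely for illustration, that each stream predicts end-effector poses as 4x4 homogeneous transforms in its own reference frame; the actual action parameterization in MSG may differ.

```python
import torch

def to_world_frame(local_poses: torch.Tensor, frame_in_world: torch.Tensor) -> torch.Tensor:
    """Map a stream's predicted end-effector poses from its local frame to the world frame.

    local_poses:    (T, 4, 4) homogeneous poses predicted in the local frame.
    frame_in_world: (4, 4) pose of that local frame in the world frame,
                    e.g. an estimated object pose or the current end-effector pose.
    """
    return frame_in_world @ local_poses  # broadcasts over the trajectory dimension T

# Hypothetical usage with two streams:
#   world_traj_obj = to_world_frame(obj_stream_prediction, T_world_object)
#   world_traj_ee  = to_world_frame(ee_stream_prediction,  T_world_end_effector)
# The two world-frame trajectories are then combined as described below.
```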

Composition Approach

We investigate the efficacy of two composition strategies: ensembling and flow composition.

Ensembling combines the final predictions of the local models. It is simple to implement and does not interfere with techniques such as data normalization. However, it only works reliably for unimodal data distributions.

Flow composition iteratively combines the predicted flow fields, which is slightly more complex to implement. The iterative composition encourages convergence to a common mode. Both strategies are sketched below.
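The difference between the two strategies is easiest to see in code. The following is a minimal PyTorch-style sketch rather than the exact MSG implementation: it assumes each stream exposes a hypothetical velocity(x, t) method returning its flow field already expressed in a common frame, that the weights are fixed scalars summing to one, and that the flow ODE is integrated with plain Euler steps.

```python
import torch

def ensemble(streams, weights, x0, n_steps: int = 10) -> torch.Tensor:
    """Ensembling: integrate each stream's flow to completion, then average the final actions."""
    finals = []
    for stream in streams:
        x = x0.clone()
        for i in range(n_steps):
            t = torch.full((x.shape[0],), i / n_steps)
            x = x + stream.velocity(x, t) / n_steps  # Euler step of this stream's flow
        finals.append(x)
    return sum(w * x for w, x in zip(weights, finals))

def flow_composition(streams, weights, x0, n_steps: int = 10) -> torch.Tensor:
    """Flow composition: blend the predicted flow fields at every integration step,
    so that all streams steer one shared sample toward a common mode."""
    x = x0.clone()
    for i in range(n_steps):
        t = torch.full((x.shape[0],), i / n_steps)
        v = sum(w * stream.velocity(x, t) for w, stream in zip(weights, streams))
        x = x + v / n_steps
    return x
```

Because flow composition blends the fields before every step, the streams negotiate a single mode along the way instead of having their final, possibly different, modes averaged after the fact.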


Weighting Strategies

We investigate three different strategies to weight the relative contribution of the local models.

Progress-based schedules are a simple and surprisingly effective approach. However, they require privileged information about which local models matter when and where.
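As a concrete but purely illustrative example, a progress-based schedule for a two-stream setup could look like the following; which stream should dominate at which stage is exactly the task-specific privileged knowledge mentioned above.

```python
def progress_schedule(progress: float) -> tuple[float, float]:
    """Hypothetical linear schedule for two streams (end-effector frame, object frame).

    `progress` in [0, 1] is privileged information, e.g. derived from the remaining
    distance to the object or from the timestep within the episode. Here the weight
    shifts from the end-effector stream toward the object stream as the task proceeds;
    the direction of the blend is task-specific.
    """
    w_object = min(max(progress, 0.0), 1.0)
    return 1.0 - w_object, w_object  # (w_end_effector, w_object), sums to one
```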

Variance-based weighting is more flexible and does not require privileged information. It works by training the models to predict their own uncertainty alongside the action and using this uncertainty to weight the models.

Parallel sampling also weights the models by their uncertainty, but does not explicitly train them to predict it. Instead, it draws multiple particles from each model and uses their empirical variance as an uncertainty estimate.
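Both uncertainty-based strategies reduce to inverse-variance weighting and differ only in where the variance estimate comes from: a learned uncertainty head for variance-based weighting, and the spread of multiple samples for parallel sampling. Below is a minimal sketch, assuming a hypothetical sample_fn that draws one complete action sample from a stream.

```python
import torch

def inverse_variance_weights(variances: list[torch.Tensor], eps: float = 1e-6) -> torch.Tensor:
    """Turn per-stream variances into normalized weights: lower variance -> higher weight."""
    w = 1.0 / (torch.stack(variances) + eps)
    return w / w.sum(dim=0)

def empirical_variance(sample_fn, n_particles: int = 8):
    """Parallel sampling: draw several action samples from one stream (e.g. from different
    initial noise) and use their empirical mean and variance as an uncertainty estimate."""
    particles = torch.stack([sample_fn() for _ in range(n_particles)])
    return particles.mean(dim=0), particles.var(dim=0)

# For variance-based weighting, the variance instead comes directly from each stream's
# learned uncertainty head (predicted alongside the action) and is plugged into
# inverse_variance_weights in exactly the same way.
```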


Video

Code

For academic usage, a software implementation of this project based on PyTorch will soon be released in our GitHub repository under the GPLv3 license. For any commercial purpose, please contact the authors.

Model downloads will soon be available below.

Publications

If you find our work useful, please consider citing our paper:

Jan Ole von Hartz, Lukas Schweizer, Joschka Boedecker, Abhinav Valada,

MSG: Multi-Stream Generative Policies for Sample-Efficient Robotic Manipulation
Under review for publication, 2025.
(PDF) (BibTeX)

Authors

Jan Ole von Hartz
University of Freiburg

Lukas Schweizer
University of Freiburg

Joschka Boedecker
University of Freiburg

Abhinav Valada
University of Freiburg