Imagine if every time you needed to complete a complex physical task, like building a bicycle, fixing a broken water heater, or cooking risotto for the first time, you had a world-class expert standing over your shoulder and guiding you through the process. In addition to telling you the steps to follow, this expert would also tune the instructions to your skill set, deliver them with the right timing, and adapt to any mistakes, confusions, or distractions that might arise along the way.
What would it take to build an interactive AI system that could assist you with any task in the physical world, just as a real-time expert would? To begin exploring the core competencies that such a system would require, we developed and released the Situated Interactive Guidance, Monitoring, and Assistance (SIGMA) system, an open-source research platform and testbed prototype for studying mixed-reality task assistance. SIGMA provides a basis for researchers to explore, understand, and develop the capabilities required to enable in-stream task assistance in the physical world.
Recent advances in generative AI and large language, vision, and multimodal models can provide a foundation of open-domain knowledge, inference, and generation capabilities to help enable such open-ended task assistance scenarios. However, building AI systems that collaborate with people in the physical world—including not just mixed-reality task assistants but also interactive robots, smart factory floors, autonomous vehicles, and so on—requires going beyond the ability to generate relevant instructions and content. To be effective, these systems also require physical and social intelligence.
Physical and social intelligence
For AI systems to fluidly collaborate with people in the physical world, they must continuously perceive and reason multimodally, in stream, about their surrounding environment. This requirement goes beyond just detecting and tracking objects. Effective collaboration in the physical world necessitates an understanding of which objects are relevant for the task at hand, what their possible uses may be, how they relate to each other, what spatial constraints are in play, and how all these aspects evolve over time.
Just as important as reasoning about the physical environment, these systems also need to reason about people. This reasoning should include not only lower-level inferences about body pose, speech, and actions, but also higher-level inferences about cognitive states and the social norms of real-time collaborative behavior. For example, the AI assistant envisioned above would need to consider questions such as: Is the user confused or frustrated? Are they about to make a mistake? What’s their level of expertise? Are they still pursuing the current task, or have they started doing something else in parallel? Is it a good time to interrupt them or provide the next instruction? And so forth.
Situated Interactive Guidance, Monitoring, and Assistance
We developed SIGMA as a platform to investigate these challenges and evaluate progress in developing new solutions.
SIGMA is an interactive application that currently runs on a HoloLens 2 device and combines a variety of mixed-reality and AI technologies, including large language and vision models, to guide a user through procedural tasks. Tasks are structured as a sequence of steps, which can either be predefined manually in a task library or generated on the fly using a large language model like GPT-4. Throughout the interaction, SIGMA can leverage large language models to answer open-ended questions that a user might have along the way. Additionally, SIGMA can use vision models like Detic and SEEM to detect and track task-relevant objects in the environment and point them out to the user as appropriate. This video provides a first-person view of someone using SIGMA to perform a couple of example procedural tasks.
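To make the task representation concrete, here is a minimal sketch, in Python, of how a procedural task could be modeled as a sequence of steps generated on the fly by a language model. SIGMA itself is implemented differently (as a .NET application on Platform for Situated Intelligence), and names such as TaskStep and generate_task_steps are hypothetical illustrations, not SIGMA’s API.

```python
# Minimal sketch (not SIGMA's actual implementation): a procedural task as a
# sequence of steps, generated on the fly with a large language model.
from dataclasses import dataclass
from openai import OpenAI  # assumes the openai Python package is installed


@dataclass
class TaskStep:
    index: int
    instruction: str


def generate_task_steps(task_name: str, model: str = "gpt-4") -> list[TaskStep]:
    """Ask the model for a list of steps and parse each line into a TaskStep."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = (
        f"List the steps needed to complete the task '{task_name}'. "
        "Return one short, imperative instruction per line."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    lines = [ln.strip() for ln in response.choices[0].message.content.splitlines() if ln.strip()]
    # Strip any leading numbering ("1.", "2)") the model may have added.
    return [TaskStep(index=i, instruction=ln.lstrip("0123456789.) ")) for i, ln in enumerate(lines)]


# Example usage: steps = generate_task_steps("make pour-over coffee")
```

A predefined task from a library would simply populate the same step structure from stored data instead of a model call, which is what lets the rest of the guidance logic treat both cases uniformly.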
Enabling research at the intersection of AI and mixed reality
SIGMA was designed to serve as a research platform. Our goal in open-sourcing the system is to help other researchers leapfrog the basic engineering challenges of putting together a full-stack interactive application and allow them to directly focus on the interesting research challenges ahead.
Several design choices support these research goals. For example, the system is implemented as a client-server architecture: a lightweight client application running on the HoloLens 2 device (configured in Research Mode) captures and sends a variety of multimodal data streams—including RGB (red-green-blue), depth, audio, head, hand, and gaze tracking information—live to a more powerful desktop server. The desktop server implements the core functionality of the application and streams back to the client app information and commands specifying what to render on the device. This architecture lets researchers bypass the current compute limitations of the headset and creates opportunities for porting the application to other mixed-reality devices.
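The sketch below is purely illustrative of this client/server exchange. The real client is a HoloLens 2 application and the server is built on Platform for Situated Intelligence (.NET), so the Python code, message format, server address, and helper names here are assumptions made for illustration only.

```python
# Illustrative sketch of the client/server exchange: the client sends one
# multimodal "frame" of sensor data and receives rendering commands back.
import json
import socket

SERVER_ADDRESS = ("192.168.1.10", 5000)  # hypothetical desktop server address


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes (recv may return partial reads)."""
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("socket closed before message was fully received")
        data += chunk
    return data


def send_frame_and_get_render_commands(sock: socket.socket, frame: dict) -> dict:
    """Send a length-prefixed JSON frame and receive render commands the same way."""
    payload = json.dumps(frame).encode("utf-8")
    sock.sendall(len(payload).to_bytes(4, "big") + payload)
    length = int.from_bytes(recv_exact(sock, 4), "big")
    return json.loads(recv_exact(sock, length).decode("utf-8"))


# Example usage (requires a server listening at SERVER_ADDRESS):
# with socket.create_connection(SERVER_ADDRESS) as sock:
#     frame = {"rgb": "<jpeg bytes>", "depth": "<depth map>", "gaze": [0.1, 0.2, 0.9]}
#     commands = send_frame_and_get_render_commands(sock, frame)
#     # e.g., commands == {"highlight_object": "screwdriver", "show_step": 3}
```

Keeping the on-device client this thin is what makes it plausible to retarget the rendering side to other headsets while all perception and reasoning stays on the server.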
SIGMA is built on top of Platform for Situated Intelligence (also known as psi), an open-source framework that provides the fabric, tools, and components for developing and researching multimodal integrative-AI systems. The underlying psi framework enables fast prototyping and provides a performant streaming and logging infrastructure. It also provides infrastructure for data replay, enabling data-driven development and tuning at the application level. Finally, Platform for Situated Intelligence Studio offers extensive support for visualization, debugging, tuning, and maintenance.
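As a rough illustration of what this streaming fabric enables, the following Python sketch mimics two of psi’s core ideas: synchronizing timestamped streams for multimodal fusion, and replaying logged data through the same logic used live. The function join_nearest and the toy log data are hypothetical and are not part of the actual psi (.NET) API.

```python
# Sketch of stream fusion and replay over timestamped messages, in the spirit
# of psi's synchronization and data-replay infrastructure (not its real API).
from bisect import bisect_left


def join_nearest(stream_a, stream_b):
    """Pair each (time, value) message in stream_a with the nearest-in-time message in stream_b."""
    times_b = [t for t, _ in stream_b]
    joined = []
    for t_a, v_a in stream_a:
        i = bisect_left(times_b, t_a)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times_b)]
        j = min(candidates, key=lambda k: abs(times_b[k] - t_a))
        joined.append((t_a, v_a, stream_b[j][1]))
    return joined


# Replaying logged streams through the same fusion logic used at runtime:
gaze_log = [(0.00, (0.1, 0.2)), (0.10, (0.1, 0.3))]          # (time, gaze direction)
objects_log = [(0.02, ["mug"]), (0.11, ["mug", "kettle"])]   # (time, detected objects)
fused = join_nearest(gaze_log, objects_log)
```

Because logged and live streams share the same shape, the same application code can be tuned offline against recorded sessions and then deployed unchanged, which is the data-driven development loop mentioned above.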
SIGMA’s current functionality is relatively simple, but the system provides an important starting point for discovering and exploring research challenges at the intersection of mixed reality and AI. From computer vision to speech recognition, many research problems, especially in perception, can be and have been investigated using collected datasets. The recent surge of interest in egocentric data and its associated challenges provides important fuel for advancing the state of the art. Yet numerous problems involving interaction and real-time collaboration surface only in real-time, end-to-end systems and are best studied and understood in an interactive context with actual users.
As a testament to Microsoft’s continued commitment to the space, SIGMA provides a research platform and reflects just one part of the company’s work to explore new AI and mixed-reality technologies. Microsoft also offers an enterprise-ready, mixed-reality solution for frontline workers: Dynamics 365 Guides. With Copilot in Dynamics 365 Guides, which is currently being used by customers in private preview, AI and mixed reality together empower frontline workers with step-by-step procedural guidance and relevant information in the flow of work. Dynamics 365 Guides is a richly featured product for enterprise customers, geared toward frontline workers who perform complex tasks. In comparison, SIGMA is an open-source testbed for exploratory research purposes only.
We hope that SIGMA can provide a solid foundation for researchers to build on. Although the system targets the specific scenario of mixed-reality task assistance, it can help illuminate the challenges of developing social and physical intelligence that arise for any computing systems that are designed to operate in the physical world and interact with people, from virtual agents to physical robots and devices.
If you are interested in learning more and using SIGMA in your own research, check it out at https://aka.ms/psi-sigma. We are excited to collaborate with and work alongside the open-source research community to make faster progress in this exciting and challenging space.
Acknowledgements / Contributors
Ishani Chakraborty, Neel Joshi, Ann Paradiso, Mahdi Rad, Nick Saw, Vibhav Vineet, Xin Wang.
Responsible AI considerations
SIGMA was designed as an experimental prototype for research purposes only and is not intended for use in developing commercial applications. The primary use case is as a research tool that enables academic and industry researchers to push the state of the art in procedural task assistance at the intersection of mixed reality and AI. As such, the system has been open-sourced under a research-only license. Researchers who wish to use SIGMA in their own work should first familiarize themselves with the system, its limitations, and the risks involved in using it in a user-study context, and should undergo a full IRB or ethics board review as appropriate for their institution. Limitations, risks, and additional considerations for using the system are described in a Transparency Note available in SIGMA’s open-source repository.