Socially Aware Robot Assistant

SARA is a Socially-Aware Robot Assistant that interacts with people in a whole new way, personalizing the interaction and improving task performance by relying on information about the relationship between the human user and the virtual assistant. Rather than taking the place of people, SARA is programmed to collaborate with her human users. Rather than ignoring the socio-emotional bonds that form the fabric of society, SARA depends on those bonds to improve her collaboration skills.


SARA at the World Economic Forum

SARA was presented at the World Economic Forum (WEF) Annual Meeting in Davos (January 17-20, 2017). The SARA booth was located right in the middle of the main corridor of the Davos Congress Center and was, in fact, the only demo in the Congress Center.

SARA had access to the WEF database of sessions being presented, participants attending, demos being shown in the Loft across the street, and places to get food in the Congress Center (she also knew about some private parties, information she was willing to share if asked nicely!). SARA was programmed to use this information to act as a virtual personal assistant. She assisted the global leaders attending Davos by finding out about their interests and goals in attending the WEF and then recommending sessions and people relevant to those interests and goals. In so doing, SARA showed what it means to have socially-aware Artificial Intelligence. That is, SARA used the conversation to build a relationship with the person talking to her, and then used that relationship to obtain better information about that person's interests and goals. In turn, that allowed her to do a better job of recommending sessions and people.

Computational Model

SARA is designed to build and manage interpersonal closeness, or rapport, over the course of a conversation by understanding and generating visual, vocal, and verbal behaviors. The ArticuLab always begins by studying human-human interaction, using that as the basis for our design of artificial intelligence systems. Leveraging our prior work on the dynamics of rapport in human-human conversation, the SARA system includes the following components:

  1. The computational model of rapport: The computational model is the first to explain how humans in dyadic interactions build, maintain, and destroy rapport through the use of specific conversational strategies that function to fulfill specific social goals, and that are instantiated in particular verbal and nonverbal behaviors.
  2. Conversational strategy classification: The conversational strategy classifier recognizes high-level language strategies closely associated with social goals. It is trained on linguistic features associated with those conversational strategies in a training set.
  3. Rapport level estimation: The rapport estimator estimates the current rapport level between the user and the agent using temporal association rules.
  4. Social and task reasoning: The social reasoner outputs the conversational strategy that the system must adopt in the current turn. The reasoner is modeled as a spreading activation network.
  5. Natural language and nonverbal behavior generation: The natural language generation module expresses conversational strategies in specific language and associated nonverbal behaviors, which are performed by a virtual human.
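
To make the social reasoner in item 4 more concrete, the following is a minimal sketch of generic spreading activation. It is not SARA's actual reasoner: the node names, edge weights, decay factor, and the helpers `spread_activation` and `select_strategy` are all illustrative assumptions. Activation flows from nodes describing the current conversational state to candidate strategy nodes, and the most activated strategy is chosen for the turn.

```python
from collections import defaultdict


def spread_activation(edges, initial, decay=0.5, steps=1):
    """Propagate activation along weighted directed edges for `steps` rounds."""
    activation = dict(initial)
    for _ in range(steps):
        incoming = defaultdict(float)
        for (src, dst), weight in edges.items():
            incoming[dst] += weight * activation.get(src, 0.0)
        nodes = set(activation) | set(incoming)
        # Each node keeps a decayed share of its old activation plus what flowed in.
        activation = {
            n: decay * activation.get(n, 0.0) + incoming.get(n, 0.0) for n in nodes
        }
    return activation


def select_strategy(activation, strategies):
    """Return the candidate strategy with the highest activation."""
    return max(strategies, key=lambda s: activation.get(s, 0.0))


# Hypothetical toy network: state nodes on the left, strategy nodes on the right.
EDGES = {
    ("low_rapport", "self_disclosure"): 0.8,
    ("low_rapport", "praise"): 0.4,
    ("user_self_disclosed", "self_disclosure"): 0.6,
    ("high_rapport", "shared_experience"): 0.9,
}
STRATEGIES = ["self_disclosure", "praise", "shared_experience"]

# Assumed conversational state: rapport is low and the user just self-disclosed.
state = {"low_rapport": 1.0, "user_self_disclosed": 1.0}
activation = spread_activation(EDGES, state)
chosen = select_strategy(activation, STRATEGIES)
print(chosen)  # self_disclosure
```

In this toy network, two active state nodes both feed `self_disclosure`, so it collects more activation than `praise` and is selected. A fuller model would also include links between strategies and from the interaction history.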


SARA Computational Architecture Combining Social and Task Pipelines [Matsuyama et al. 2016]
The system’s architecture is organized around a task-pipeline and a social-pipeline [Matsuyama et al. 2016]. The task-pipeline consists of a task-oriented Natural Language Understanding (NLU) module, which extracts the user’s intention from their speech, and a Task Reasoner, which selects SARA’s next intention based on the NLU’s output. The social-pipeline consists of three modules: the Conversational Strategy Classifier detects the user’s conversational strategy from the user’s multimodal cues [Zhao et al. 2016a]; the Rapport Estimator relies on these conversational strategies, as well as visual and acoustic features, to predict the current level of rapport during the interaction [Zhao et al. 2016b]; and the Social Reasoner selects SARA’s next conversational strategy based on the history of the interaction [Romero et al. 2017]. Given the task and social intentions chosen by the Task and Social Reasoners, a Natural Language Generator (NLG) and a Nonverbal Behavior Generator translate these intentions into a sentence and nonverbal behavior plans, which are rendered by SARA’s character animation realizer and Text-to-Speech (TTS). The system also had access to the recommendation database, user authentication, and messenger applications of the online collaboration platform.
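
The data flow through the two pipelines can be sketched for a single turn as follows. This is a toy illustration under stated assumptions, not the SARA implementation. Every function name, the keyword rules, the 0-to-1 rapport scale, and the sentence templates are hypothetical stand-ins for the real modules, and the rapport stub ignores the visual and acoustic features the real estimator uses.

```python
from dataclasses import dataclass


@dataclass
class SystemTurn:
    task_intent: str   # chosen by the task-pipeline
    strategy: str      # chosen by the social-pipeline
    rapport: float     # estimated rapport on an assumed 0..1 scale
    sentence: str      # surface form produced by the NLG stub


def understand(utterance: str) -> str:
    """Task NLU stub: map an utterance to a user intent via keywords."""
    text = utterance.lower()
    if "session" in text:
        return "request_session"
    if "meet" in text or "person" in text:
        return "request_person"
    return "other"


def task_reasoner(user_intent: str) -> str:
    """Task Reasoner stub: pick the system's next task intention."""
    return {
        "request_session": "recommend_session",
        "request_person": "recommend_person",
    }.get(user_intent, "ask_interests")


def classify_strategy(utterance: str) -> str:
    """Strategy classifier stub: treat first-person statements as self-disclosure."""
    return "self_disclosure" if utterance.lower().startswith(("i feel", "i am", "i'm")) else "none"


def estimate_rapport(previous: float, user_strategy: str) -> float:
    """Rapport estimator stub: nudge rapport up when the user self-discloses."""
    bump = 0.1 if user_strategy == "self_disclosure" else 0.0
    return min(1.0, previous + bump)


def social_reasoner(rapport: float) -> str:
    """Social Reasoner stub: reciprocate disclosure while rapport is still low."""
    return "self_disclosure" if rapport < 0.5 else "praise"


def generate(task_intent: str, strategy: str) -> str:
    """NLG stub: combine a social opener with a task sentence."""
    opener = {
        "self_disclosure": "I find the programme overwhelming too.",
        "praise": "Great question.",
    }[strategy]
    body = {
        "recommend_session": "Here is a session you might like.",
        "recommend_person": "Here is someone you might want to meet.",
        "ask_interests": "What brings you to the conference?",
    }[task_intent]
    return f"{opener} {body}"


def run_turn(utterance: str, rapport: float) -> SystemTurn:
    """Run both pipelines on one user utterance and merge them in the NLG."""
    task_intent = task_reasoner(understand(utterance))
    new_rapport = estimate_rapport(rapport, classify_strategy(utterance))
    strategy = social_reasoner(new_rapport)
    return SystemTurn(task_intent, strategy, new_rapport, generate(task_intent, strategy))


turn = run_turn("I'm tired. Which session should I attend?", rapport=0.3)
print(turn.sentence)
```

The point of the sketch is the separation of concerns: the task and social pipelines each produce an intention independently, and only the generation step combines them into one utterance.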

Analysis of the Field Studies

Research Question: “How does the task performance of a personal assistant affect the dynamics of rapport over the course of an interaction?”

Participants interacted with SARA during the conference, receiving recommendations about sessions to attend and/or people to meet. After the attendees entered the booth, SARA first introduced herself and asked several questions about the attendees’ current feelings and mood. Then, the attendees were asked about their occupation as well as their interests and goals for attending the conference. SARA would then cycle through several rounds of people and/or session recommendations, showing information about each recommendation on the virtual board behind her. The attendees could request as many recommendations as they wished and could leave the booth at any time. Finally, SARA proposed taking a “selfie” with the attendees before saying farewell.

During each interaction, the attendees’ video and audio were recorded using a camera and a microphone. SARA’s animations, for their part, were recorded separately in a log file. The audio recordings were used to obtain text transcriptions of both the attendees’ and SARA’s utterances from a third-party transcription service. These transcriptions contained turn-taking information such as speaker ID and the starting and ending timestamps of each turn. Because rapport is a dyadic phenomenon, we reconstructed the interactions so that both the attendee and SARA appeared in the same video before annotating them. Our corpus contains data from 69 of these interactions, including video, audio, and textual speech transcriptions for both the attendees and SARA, which together account for more than 5 hours of interaction (total time = 21055 seconds, mean session duration = 305.15 seconds, SD = 65.00 seconds). Of these 69 attendees, 29 were women and 40 were men. We did not gather any information about the attendees’ age or nationality.
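
As a quick consistency check on the corpus figures above, the short script below recomputes the mean session duration and the total duration in hours from the reported total and session count. The variable names are ours; the numbers are the ones quoted in the text.

```python
# Figures quoted in the corpus description.
total_seconds = 21055
n_sessions = 69
reported_mean = 305.15
women, men = 29, 40

mean_seconds = total_seconds / n_sessions   # about 305.14 s
total_hours = total_seconds / 3600          # about 5.85 h

print(f"mean session duration: {mean_seconds:.2f} s")
print(f"total duration: {total_hours:.2f} h")
print(f"attendees: {women + men}")
```

The recomputed mean differs from the reported 305.15 seconds by less than 0.01 seconds, which would be consistent with the total having been rounded to whole seconds. The total of roughly 5.85 hours matches the "more than 5 hours" stated above.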

An Excerpt of an Interaction [Pecune et al. 2018]
For the details of the data analysis, please refer to [Pecune et al. 2018].


  • [Matsuyama et al. 2016] Matsuyama, Y., Bhardwaj, A., Zhao, R., Romero, O., Akoju, S., & Cassell, J. (2016, September). Socially-Aware Animated Intelligent Personal Assistant Agent, 17th Annual SIGDIAL Meeting on Discourse and Dialogue.
  • [Goel et al. 2018] Pranav Goel, Yoichi Matsuyama, Michael Madaio and Justine Cassell, “I think it might help if we multiply, and not add”: Detecting Indirectness in Conversation, International Workshop on Spoken Dialog System Technology (IWSDS 2018).
  • [Pecune et al. 2018] Florian Pecune, Jingya Chen, Yoichi Matsuyama and Justine Cassell, Field Study Analysis of a Socially Aware Robot Assistant, Proceedings of the special track Socially Interactive Agents (SIA) at the 17th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2018).
  • [Zhao et al. 2016a] Zhao, R., Sinha, T., Black, A., & Cassell, J. (2016, September). Automatic Recognition of Conversational Strategies in the Service of a Socially-Aware Dialog System, 17th Annual SIGDIAL Meeting on Discourse and Dialogue.
  • [Zhao et al. 2016b] Zhao, R., Sinha, T., Black, A., & Cassell, J. (2016, September). Socially-Aware Virtual Agents: Automatically Assessing Dyadic Rapport from Temporal Patterns of Behavior, 16th International Conference on Intelligent Virtual Agents (IVA) [*Best Student Paper]
  • [Romero et al. 2017] Oscar Romero, Ran Zhao, and Justine Cassell. Cognitive-inspired Conversational-Strategy Reasoner for Socially-Aware Agents, In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 3807-3813. AAAI Press, 2017.


This work was supported in part by generous funding from Microsoft, LivePerson, Google, and the IT R&D program of MSIP/IITP [2017-0-00255, Autonomous Digital Companion Development].


