My long-term research mission is “Designing Socially Expressive Conversational AI Media to Assist and Entertain Human Lives.” As an independent researcher, I scientifically investigate the nature of human conversations by designing conversational AI media that have a tangible impact on the society of this century. This post describes the theoretical framework behind the design process.

Theoretical Framework of the Research Program

Conversation requires different modes of information processing than conventional media. In Shannon’s “communication model” [Shannon 2001], the sender’s intention is encoded into a specific code and decoded by the receiver via a noisy transmission path. In a conversation, by contrast, many social signals, such as facial expressions and gestures, are sent and received simultaneously, and their meanings are interpreted among the participants. This process is possible because the participants share “conversational protocols” containing multiple message types [Kobayashi et al., 2013]. However, the timespans of these messages vary (e.g., a head nod occurs more frequently than a propositional utterance, while a topic-shifting strategy typically spans a large chunk of discourse [Cassell 2001]). Moreover, such protocols differ remarkably across cultural contexts, making them far more complex than those of conventional media. In my theoretical framework, each communication message consists of the following three functions.

Three-Function Conversational Protocol Model – a conversational protocol that enables communication between a transmitter and a receiver through 1) a semantic function, 2) an interactional function, and 3) an interpersonal function. The timespans of the messages realizing these functions vary.
  1. Semantic Function carries the central message: semantic information that is explicitly articulated in a given context. In virtual assistants (e.g., Google Assistant and Siri), the user’s and agent’s request-answer intentions, such as “please order two pizzas,” fall into this function. Most current NLP research focuses on this aspect, using state-of-the-art machine learning techniques, and some believe that improving semantic models alone will eventually yield natural conversational experiences. However, efficient conversations simultaneously require the other functions.
  2. Interactional Function renders an information exchange robust and efficient. For example, turn-taking (the skill of knowing when to start and finish a turn in a conversation) and barge-in (handling another participant’s interjections while speaking) have traditionally been investigated in the spoken dialogue system and human-robot interaction fields.
  3. Interpersonal Function concerns maintaining interpersonal relationships. While talking, we achieve task goals by sending specific semantic information and securing message transmission, but we also pursue social goals, such as building relationships with others, which are typically implied in utterances and conversational context. The interpersonal function has generally been studied in sociolinguistics and has gradually begun to inspire conversational AI research.
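The three-function model above can be sketched as a data structure. The following is a minimal, hypothetical representation (the class and field names are my own illustration, not from any of the systems described here): a single message may simultaneously realize any subset of the three functions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ConversationalMessage:
    """One message in the Three-Function Conversational Protocol Model.

    Each field is an optional annotation on one of the three channels;
    the example values are hypothetical labels.
    """
    semantic: Optional[str] = None        # e.g., intent "order_pizza(quantity=2)"
    interactional: Optional[str] = None   # e.g., "take_turn", "backchannel_nod"
    interpersonal: Optional[str] = None   # e.g., rapport strategy "self_disclosure"

    def active_functions(self):
        """Return which of the three functions this message realizes."""
        return [name for name, value in
                [("semantic", self.semantic),
                 ("interactional", self.interactional),
                 ("interpersonal", self.interpersonal)]
                if value is not None]

# A head nod is purely interactional; a pizza order is semantic but can
# also carry an interpersonal strategy at the same time.
nod = ConversationalMessage(interactional="backchannel_nod")
order = ConversationalMessage(semantic="order_pizza(quantity=2)",
                              interpersonal="praise")
print(nod.active_functions())    # ['interactional']
print(order.active_functions())  # ['semantic', 'interpersonal']
```

The point of the sketch is that the functions are orthogonal annotations on one message, not three separate message streams, which is why a system must balance them simultaneously.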

Based on this multi-function framework, my primary research question is “how can conversational AI media computationally balance these functions simultaneously to achieve the goals of a given application?” To begin answering this broad question, I have particularly focused on the interactional and interpersonal functions in two notable projects: SCHEMA, a multiparty conversation facilitation robot that can regulate a group conversation, particularly the control of initiative and engagement density, in a physically situated context; and the Socially-Aware Robot Assistant (SARA), a virtual agent that builds interpersonal relationships, called rapport, with its users while performing conference session recommendation and participant matchmaking.

Types of Conversational AI Media

Examples I have designed on smartphones, robots, large touch displays, and VR/AR devices over the last decade.

Conversational Robots

SCHEMA

Mobile Personal Assistants

InMind

Full-size Virtual Assistants

SARA

VR/AR Agents

HARP

 

Research on Human Conversations Through Design

Here, I discuss my approach to designing conversational AI media. In the HCI community, the roles and purposes of “research” and “design” activities have long been debated. Historically, research aims at general knowledge, whereas design results in particular artifacts, so their directions appear to contradict each other. Recently, the “Research through Design (RtD)” approach has gained prominence, in which researchers employ the methods and practices of design with the intention of generating new knowledge [Zimmerman et al. 2014]. Over the last decade, I have repeated and refined the design process of conversational AI media across various applications and devices, synthesizing “conversationally smart” AI systems that fulfill the multiple functions above in order to understand the nature of human conversations. Inspired by these intellectual giants’ concepts, I hypothesized and now follow my own RtD process for conversational AI media, composed of the following five steps:

  1. Domain/Application Discovery: The task domain is important when investigating a particular aspect of human behavior. It also matters for the endeavor toward artificial general intelligence (AGI) [Goertzel 2007]: DeepMind’s AlphaGo succeeded in part because the team first adopted a domain with a clear value system [Silver et al., 2016]. Moreover, conversation is by nature a complex, multi-purpose activity. Hence, selecting an appropriate task domain from the first stage of conversational AI media design is essential for addressing real-world problems. In practice, I have committed myself to observing the actual phenomenon, particularly in the task-domain selection phase, by considering the values and goals of a conversation and its social impact. For example, I designed a party-game assistant robot for elderly people in Japan as my very first conversational AI media project, during my master’s degree. As a first step, I conducted an ethnomethodological observation by participating in daily activities at an elderly care facility and attempted to determine the community’s fundamental needs. The result was a computational model of group conversation facilitation that could actually entertain groups of people and contribute to the research community [Matsuyama et al., 2009b; Matsuyama et al., 2010a; Matsuyama et al., 2010b].
  2. Data Collection Ecosystem Design: Data are necessary at every stage of conversational AI media investigation. Generally, the most difficult part of AI and machine learning is gathering quality data in large quantities; in most cases, once a problem is clearly defined, determining a reasonable set of machine learning algorithms is straightforward. Therefore, AI designers must develop the data collection ecosystem itself: an environment in which data can be sustainably gathered and hypotheses can be iteratively tested at scale for a specific purpose. Although SARA was originally a single-user demo system, it has recently been extended into a web-based large-scale user study framework that collects interaction data via Amazon Mechanical Turk [expected to be published in 2019]. However, such a crowdsourcing approach is still insufficient with regard to economic sustainability. The overall point is to consider where human conversations will occur in the coming decades, and how to realistically collect such conversational data and train models with sustainable techniques.
  3. Conversational Process Modeling: Explainable AI (XAI) [Gunning 2017] models should become the foundation of trust [Cassell et al., 2003] between humans and machines. Social scientific frameworks (e.g., rapport-building theories) provide human-interpretable symbolic representations; indeed, one purpose of science is to provide explanations of nature. However, a crucial problem in the AI era is that there is no obvious way to combine such theoretical interpretations with data-driven computational representations, particularly deep neural networks. How can top-down and bottom-up approaches be combined into trustworthy XAI models of human-AI interaction? One previous attempt was our socially-aware user simulator for a reinforcement learning (RL) based conversational agent [Jain and Matsuyama et al., 2018], in which we constructed user simulators from sociolinguistic theories and a small amount of collected data, allowing the RL agent to learn its optimal policies in theory- and data-mixed simulated environments.
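The theory-grounded simulator idea can be illustrated with a toy sketch. The following is not the architecture of [Jain and Matsuyama et al., 2018]; it is my minimal illustration of the general pattern, with one hand-coded sociolinguistic assumption (that agent self-disclosure tends to raise rapport, per reciprocity-style theories) serving as the simulated user, and tabular Q-learning standing in for the RL agent. All names and numbers are hypothetical.

```python
import random

# Two hypothetical agent actions: a pure task move vs. a rapport strategy.
ACTIONS = ["task_move", "self_disclosure"]

def simulated_user(rapport, action):
    """Rule-based user simulator encoding the theory: self-disclosure
    raises rapport (capped at level 3); a task-only move leaves it flat.
    Returns (new_rapport, reward), rewarding any rapport gain."""
    new_rapport = min(rapport + 1, 3) if action == "self_disclosure" else rapport
    reward = 1.0 if new_rapport > rapport else 0.0
    return new_rapport, reward

# Tabular Q-learning over rapport levels 0..3.
Q = {(s, a): 0.0 for s in range(4) for a in ACTIONS}
alpha, gamma, eps = 0.1, 0.9, 0.2
random.seed(0)
for _ in range(2000):
    s = 0
    for _ in range(5):  # short dialogue episodes
        a = random.choice(ACTIONS) if random.random() < eps else \
            max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, r = simulated_user(s, a)
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, x)] for x in ACTIONS) - Q[(s, a)])
        s = s2

# At low rapport, the learned policy should prefer the rapport strategy.
print(max(ACTIONS, key=lambda a: Q[(0, a)]))  # self_disclosure
```

The value of the pattern is that the agent can learn a sensible policy before any large-scale data collection, because the theory supplies the environment; collected data then refines or replaces the hand-coded rules.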
  4. Holistic System Design: Conversational AI media require a sufficient degree of intelligent module integration. In most cases, they require holistic systems, from speech and facial expression recognition to text-to-speech and nonverbal behavior realization via a virtual and/or physical embodiment, because a human conversation is a multimodally immersive, holistic process – an argument that has long been fundamental in cognitive science and human-computer interaction [Newell 1983; 1994]. In both the SCHEMA [Matsuyama et al., 2014a] and SARA [Pecune and Matsuyama et al., 2018] projects, I assessed experimental hypotheses using fully integrated systems with various levels of autonomy (semi-automatic and fully automated), depending on the research questions. Recent advances in machine learning techniques, which can be regarded as AI design materials, also enable iterative, rapid-prototyping approaches toward holistic design.
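The module integration described above can be sketched as a pipeline. This is a hypothetical skeleton, not the SCHEMA or SARA codebase: multimodal perception is fused into one observation, a decision module updates shared dialogue state, and realization maps each system move onto coordinated verbal and nonverbal behavior.

```python
def perceive(recognized_text, nod_detected):
    """Fuse multimodal input (speech recognition + vision) into one observation."""
    return {"utterance": recognized_text, "nod": nod_detected}

def decide(obs, state):
    """Update dialogue state and pick a system move (toy rule-based policy)."""
    state["rapport"] += 1 if obs["nod"] else 0
    move = "recommend_session" if "recommend" in obs["utterance"] else "acknowledge"
    return move, state

def realize(move):
    """Map one move onto coordinated speech and gesture for the embodiment."""
    behaviors = {"recommend_session": ("How about session A?", "point_gesture"),
                 "acknowledge": ("I see.", "head_nod")}
    return behaviors[move]

# One pass through the full loop: a recommendation request with a user nod.
state = {"rapport": 0}
obs = perceive("please recommend a session", nod_detected=True)
move, state = decide(obs, state)
speech, gesture = realize(move)
print(speech, gesture)  # How about session A? point_gesture
```

The design point is that state and behavior are shared across modules, so a hypothesis about, say, nonverbal rapport signals can only be tested if the whole loop is integrated.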
  5. Analysis and Repeat: The RtD process for conversational AI media is inherently iterative. The most exciting moments in my research journey have come when a hypothesis or theoretical proposition was rejected even though the system was properly implemented according to theory. For example, when I implemented a theoretical turn-taking model on a robot in a group setting, I sensed an incongruity in the conversation even though the robot behaved exactly as expected. I then found that the lack of a turn-initiative function violated the group’s social norms. Thus, I modified the model with a mixed-initiative mechanism and a procedural decision-making process, which significantly reduced the incongruity [Matsuyama et al., 2014a].
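The mixed-initiative fix in the anecdote above can be caricatured in a few lines. This is a hypothetical sketch (the thresholds and function names are illustrative, not from the published model): instead of speaking only when addressed, the facilitator may also claim the turn when the floor goes quiet or group engagement drops.

```python
def next_speaker(addressed_to_robot, silence_ms, engagement):
    """Toy mixed-initiative turn-allocation rule.

    addressed_to_robot: was the last utterance directed at the robot?
    silence_ms: duration of the current silence in milliseconds.
    engagement: estimated group engagement in [0, 1].
    """
    if addressed_to_robot:
        return "robot"             # reactive: answer when addressed
    if silence_ms > 1500 or engagement < 0.3:
        return "robot_initiative"  # proactive: claim the turn to facilitate
    return "user"                  # otherwise, leave the floor to the group

print(next_speaker(False, 2000, 0.8))  # robot_initiative
```

A purely reactive rule would return "user" in the second case, leaving long silences unrepaired; that is exactly the social-norm violation the iteration uncovered.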

References

  • [Cassell 2001] Justine Cassell, Embodied Conversational Agents: Representation and Intelligence in User Interfaces, AI Magazine 22, no. 4, pp. 67-83, 2001.
  • [Goertzel 2007] Ben Goertzel, Artificial general intelligence, Edited by Cassio Pennachin. Vol. 2. New York: Springer, 2007.
  • [Gunning 2017] David Gunning, Explainable Artificial Intelligence (XAI), Defense Advanced Research Projects Agency (DARPA), 2017.
  • [Jain and Matsuyama et al., 2018] Alankar Jain, Florian Pecune, Yoichi Matsuyama and Justine Cassell, A Social User Simulator Architecture for Socially-Aware Conversational Agents, 18th ACM International Conference on Intelligent Virtual Agents (IVA 2018).
  • [Kobayashi et al., 2013] Tetsunori Kobayashi and Shinya Fujie, Conversational Robots: An Approach to Conversation Protocol Issues that Utilizes the Paralinguistic Information Available in a Robot-Human Setting. Acoustical Science and Technology, 34(2):64–72, 2013.
  • [Matsuyama et al., 2014a] Yoichi Matsuyama, Iwao Akiba, Shinya Fujie and Tetsunori Kobayashi, Four-Participant Group Conversation: A Facilitation Robot Controlling Engagement Density As the Fourth Participant, Journal of Computer Speech and Language, 2014. (DOI:10.1016/j.csl.2014.12.001).
  • [Matsuyama et al., 2014b] Yoichi Matsuyama, Akihiro Saito, Shinya Fujie and Tetsunori Kobayashi, Automatic Expressive Opinion Sentence Generation for Enjoyable Conversational Systems, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014. (DOI:10.1109/TASLP.2014.2363589).
  • [Newell 1983] Allen Newell, The psychology of human-computer interaction, 1983.
  • [Newell 1994] Allen Newell, Unified Theories of Cognition. Harvard University Press, 1994.
  • [Pecune and Matsuyama et al., 2018] Florian Pecune, Jingya Chen, Yoichi Matsuyama and Justine Cassell, Field Study Analysis of a Socially Aware Robot Assistant, Proceedings of the special track Socially Interactive Agents (SIA) at the 17th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2018), July 2018.
  • [Shannon 2001] Claude Elwood Shannon, A Mathematical Theory of Communication, ACM SIGMOBILE mobile computing and communications review 5, no. 1 (2001): 3-55.
  • [Silver et al., 2016] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser et al. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature 529, no. 7587, 484 – 489, 2016.
  • [Zimmerman et al., 2014] John Zimmerman and Jodi Forlizzi, Research through design in HCI, In Ways of Knowing in HCI, pp. 167-189. Springer, New York, NY, 2014.
