Corpora and datasets for discourse and dialogue research

Parts of this list were extracted from the papers using ChatGPT, so they may contain errors. If you find any, please open a GitHub issue or pull request. If you don’t have a GitHub account, please email mikio.nakano (at) c4a.jp.

Parts of this list have been adapted from A Survey of Available Corpora for Building Data-Driven Dialogue Systems, with permission; see the survey website for reference and please cite the paper if useful.

Name Language Modalities Data Types Task/Domain Participants Size Ave. # of Turns Brief Description Paper
Let’s go & DSTC1 English Speech Audio Bus schedules Human-System 171K dialogues N/A Telephone conversations between real users and bus information systems Raux et al. 2006
DSTC2 English Speech Transcripts and ASR results Restaurant search Human-System 15K dialogues, 3.7M words 7.88 Telephone conversations between hired users and restaurant search system Henderson et al, 2014
MultiWOZ 2.0 English Text Text Multiple domains (restaurant, hotel, etc.) Human-Woz 8.5K dialogues, 115K turns, 1.5M tokens 13.18 A fully-labeled collection of human-human written conversations spanning over multiple domains and topics Budzianowski et al., 2018
HCRC MapTask Corpus English Face-to-face Audio, video (not available) direction giving Human-Human 128 dialogues, 174K words, 18hrs A set of 128 dialogues that has been recorded, transcribed, and annotated for a wide range of behaviours, and has been released for research purposes. Anderson et al., 1991
AMI Corpus English face-to-face close-talking and far-field microphones, individual and room-view video cameras, projection, a whiteboard, individual pens. Face-to-face meetings Multi-party human 175 dialogues, 900K words, 100hrs A multi-modal data set consisting of 100 hours of meeting recordings Carletta et al, 2005
Ubuntu Dialogue Corpus English IRC chat text Chat on Ubuntu Human-Human 930K dialogues, 100M words 7.71 Dialogues extracted from Ubuntu chat stream on IRC Lowe et al., 2015
DailyDialog Dataset English Text Text Daily communication Human-Human 13K dialogues, 1.5M words 7.9 DailyDialog is a high-quality multi-turn dialogue dataset that covers conversations about daily life. It is manually labeled with communication intention and emotion information, making it useful for training and evaluating dialogue systems. Li et al. 2017
Persona Chat English Chat text Text Open domain Human-Human 11K dialogues, 162K utterances A chit-chat dataset where paired Turkers are given assigned personas and chat to try to get to know each other. Zhang et al., 2018
Schema-Guided Dialogue Dataset English Text Text 16 domains Human-System 16K dialogues, 330K turns The dataset consists of conversations between a virtual assistant and a user ranging over a variety of domains including Travel, Events, Payment, Media, Restaurants, Weather etc. Annotations for natural language understanding, dialogue state tracking, policy learning, natural language generation and user simulation learning are also included. Rastogi et al., 2020
EmoWOZ English Text Text Multiple domains (restaurant, hotel, etc.) Human-Woz More than 11K dialogues 14.63 A large-scale open-source dataset for emotion recognition in task-oriented dialogues with 83K emotion annotations of user utterances Feng et al. 2022
Ubuntu Dialogue Corpus English text text Technical support for Ubuntu-related problems Human-Human 930,000 dialogues, 7,100,000 utterances, 100,000,000 words 7.71 A dataset containing almost one million multi-turn dialogues extracted from the Ubuntu chat logs, used for research in unstructured multi-turn dialogue systems. It facilitates the development of dialogue managers based on neural language models that can utilize large amounts of unlabeled data. Lowe et al., 2015
Schema-Guided Dialogue (SGD) English text text 26 services across 16 domains including alarms, banks, buses, calendar events, flights, homes, hotels, media, movies, music, payment, rental cars, restaurants, ridesharing, services, trains, travel, messaging, and weather Simulated user-system interactions Over 16,000 dialogues, 329,964 turns 20.44 The SGD dataset is designed to support the development of conversational interfaces that can handle multiple domains and services, particularly in scenarios with zero-shot learning where models encounter unseen services or APIs. It uses a schema-guided approach where intents and slots are dynamically provided, facilitating easier integration of new services without retraining. Rastogi et al., 2020
Internet Argument Corpus 2.0 English text text Online forums and debates on social and political topics Human-Human 24,000 posts, 11,079 threads, 3452 authors, 56M tokens Varies, data includes multiple posts per thread The IAC 2.0 is an expanded dataset designed to support research on many different aspects of social language and dialogue structure, particularly in online forums on social and political topics. It features an SQL schema for organizing dialogues from several platforms into a structured database format. Abbott et al., 2016
The Settlers of Catan Corpus English text text Game strategy and conversation Human-Human 21 games annotated, ca. 2000 dialogue turns, ca. 40 games collected Includes ‘a few dozen self-contained bargaining conversations’ per game A corpus of online chats between agents playing The Settlers of Catan, a competitive win–lose game involving negotiations. The corpus aligns players’ conversations with the state of the game, focusing on negotiation dialogues and strategic interactions. Afantenos et al., 2012
Let’s Go Public corpus English speech audio Public transportation Human-System 627 dialogues, 9162 turns 14.6 The corpus contains dialogues from the Let’s Go Public spoken dialog system, which provides bus schedule information during off-peak hours. It includes transcribed calls from the general public, featuring interactions influenced by various user attitudes and environmental conditions. Raux et al., 2005
Dialog State Tracking Challenge English speech text Bus timetable information Human-System 15K transcribed and labeled human-computer dialogs Varies by dataset; e.g., TRAIN1A: 14.7, TEST4: 10.9 A corpus of 15,000 human-computer dialogue interactions used for evaluating dialogue systems, specifically focusing on the task of dialog state tracking. The corpus contains dialogs from various dialog systems interacting with real users, collected under the Spoken Dialog Challenge hosted by Carnegie Mellon University. Williams et al., 2013
Carnegie Mellon Communicator English speech audio Travel planning (air transportation, hotel reservations, car rentals) Human-System N/A N/A The Carnegie Mellon Communicator system assists users in creating complex travel itineraries through a conversational interface. It utilizes schemas to manage dialogues, aiming to support problem-solving activities by providing information, proposing solutions, and highlighting potential constraint violations. Rudnicky et al., 1999
ATIS Spoken Language Systems Pilot Corpus English speech audio, text Air travel information Human-Woz 41 sessions, 1041 utterances 25.4 utterances per session The ATIS corpus is designed for developing and evaluating speech systems that understand spontaneous speech, focused on air travel information. Hemphill et al, 1990
RITEL Corpus French speech audio Open-domain Human-System 582 dialogs, 5360 user queries, 6 hours of user speech 9 The RITEL Corpus is a Human-Computer open-domain question answering spoken dialog corpus that includes orthographically transcribed and annotated dialogues focusing on specific entities and topics. It involves a real interaction system rather than a Wizard-of-Oz setup. Rosset and Petel, 2006
Tutorial Dialogs on Mathematical Theorem Proving German (Translated to English for publication) text text, audio, video Mathematics (Proofs in naive set theory) Human-Woz 66 sets of dialog session logs, 1115 total turns, 393 student sentences 12 A corpus of dialog session logs from a Wizard-of-Oz experiment focused on teaching proofs in naive set theory, with audio and video logs also collected. Wolska et al., 2004
The MATCH corpus English speech audio Healthcare, appointment scheduling Human-Human 447 dialogues, 6237 turns 14.0 The MATCH corpus is a linguistically annotated corpus collected to study the interaction between older and younger users with simulated spoken dialogue systems. It focuses on the effects of cognitive ageing on users’ interactions and was designed to develop technologies to help older users live independently. Georgila et al, 2010
Frames English text text Travel Human-Human 1369 dialogues, 19986 turns 15 Frames is a corpus of human-human dialogues collected in a Wizard-of-Oz setting to study complex dialogue flows and decision-making behaviour. The dialogues involve users trying to book travel packages with constraints, exploring options and making selections, facilitated by assistants who manage these requests. El Asri et al., 2017
Multi-Domain In-Car Assistant Dialogue Dataset English text text Calendar scheduling, weather information retrieval, point-of-interest navigation Human-Woz 3,031 dialogues; 2,425 training, 302 validation, 304 test dialogues 5.25 This dataset contains dialogues across three domains relevant to in-car personal assistant tasks. Each dialogue is grounded in a knowledge base, making it suitable for developing architectures that reason over world knowledge. Eric et al., 2017
The Walking Around Corpus English speech audio Pedestrian navigation and spatial cognition Human-Human 36 dialogues, detailed transcripts Multiple tasks involved The corpus consists of experimentally parameterized collection of spontaneous spoken dialogues, focusing on lexical choice and variability during direction-giving tasks. It involves participants communicating over mobile phones while one navigates a campus based on directions from a stationary partner. Brennan et al., 2013
Intelligence Squared Debates (IQ2 Debates) English speech text Various (e.g., foreign policy, health, technology) Human-Human 108 debates, average 12,801 words and 117 turns per debate 117 A corpus of transcripts from Oxford-style debates held in the US, covering a wide range of topics with experts debating motions before a live audience. The dataset tracks conversational dynamics and strategies used to sway audience opinions. Zhang et al., 2016
Idiap Wolf Database English multimodal audio, video role-playing game, competitive Human-Human 7.3 hours of recordings, 50 day-phase games, 36 participants N/A The Idiap Wolf Database consists of audio-visual recordings from a competitive role-playing game where players have deceptive and non-deceptive roles. The unique aspect of this corpus is its focus on group behavior and deception in a controlled game setting. Hung and Chittaranjan, 2010
ICSI Meeting Recorder Dialog Act (MRDA) Corpus English speech audio, text natural meetings Human-Human 75 meetings, approx. 72 hours of speech, 180,218 dialog act tags N/A A corpus of hand-annotated dialog acts and adjacency pairs from naturally occurring multi-party meetings recorded at the ICSI. It includes over 180,000 dialog act tags across approximately 72 hours of meetings, focusing on complex discourse phenomena. Shriberg et al., 2004
The Trains 93 Dialogues English speech audio Task-oriented dialogues involving a planning assistant and manufacturing and shipping goods Human-Human 98 dialogues, 5900 turns, 55000 words Approximately 60.2 A corpus of task-oriented dialogues set in the Trains domain where a user collaborates with a planning assistant to accomplish tasks involving manufacturing and shipping goods in a railroad freight system. Includes audio files, time-aligned word and phoneme transcriptions. Heeman and Allen, 1995
ICT Rapport Datasets English multimodal audio, video Narrative task involving retelling events from a sexual harassment awareness video Human-System 131 participants N/A The Rapport Agent is designed to elicit rapport from human participants within a dyadic narrative task. It utilizes real-time analysis of acoustic properties of speech and speaker gestures to generate nonverbal feedback like nods and posture shifts. Gratch et al., 2007
D64 Multimodal Conversational Corpus English multimodal text, audio, video General conversation Human-Human N/A N/A A corpus designed to observe conversational behavior as closely as possible to natural interaction, including elements like gaze, posture, and simultaneous movements. The data, collected in a domestic setting, includes extensive video, audio, and motion-capture records. Oertel et al., 2013
Cardiff Conversation Database (CCDb) English audiovisual audio, video Natural conversations Human-Human 30 conversations, 300 minutes of audio-video data Approximately 10 per conversation (estimated from 5-minute average duration per conversation) A unique 2D audiovisual database containing natural conversations between pairs of people, annotated for speaker activity, facial expressions, head motion, and non-verbal utterances. Aubrey et al., 2013
4D Cardiff Conversation Database (4D CCDb) English multimodal 3D video (4D), audio Natural, dyadic conversations Human-Human 17 minutes, 34 sequences N/A The 4D CCDb is the first 4D (3D Video) audio-visual database containing natural conversations between pairs of people. It includes fully annotated speaker and listener activities such as conversational facial expressions, head motion, and verbal/non-verbal utterances. Vandeventer et al., 2015
Group Affect and Performance (GAP) Corpus English multimodal audio, text Group interaction and decision-making Human-Human 13 group meetings, 104.45 minutes of recordings N/A The GAP corpus contains meeting audio, transcriptions, annotations, decision-making performance, as well as group member influence, post-meeting ratings of satisfaction, and demographics. It is designed to stimulate research on the computational analysis of small group meetings. Braley and Murray, 2018
MULTISIMO Corpus English multimodal text, audio, video Collaborative group interactions in a quiz solving task Human-Human 23 sessions, approximately 4 hours total N/A The MULTISIMO Corpus involves collaborative group interactions where participants work together to solve quiz questions. It includes multimodal data from different cameras and microphones, synchronized and complemented by personality test results and experience assessment surveys. Koutsombogera and Vogel, 2018
Movie-DiC English text text Multiple genres (action, crime, drama, thriller, etc.) Human-Human 132,229 dialogues, 764,146 turns 5.78 A dialogue corpus extracted from movie scripts for studying semantic and pragmatic aspects of human communication in various contexts and styles. Banchs, 2012
Movie-Triples English text text Wide range of movie script topics Human-Human 484 movies, 196,308 triples, Average tokens/triple: 53 3 turns per triple The MovieTriples dataset is developed by expanding and preprocessing the Movie-DiC dataset for generative dialogue modeling. It includes dialogues of three turns between two interlocutors, derived from movie scripts, making it suitable for building dialogue systems that emulate human conversations. Serban et al., 2016
Cornell Movie-Dialogs Corpus English text text Movie scripts Human-Human 220,579 conversational exchanges from 617 unique titles 5 or more exchanges per pair A large set of imagined conversations derived from movie scripts, providing a rich resource for studying linguistic coordination and stylistic convergence in fictional dialogues. Danescu-Niculescu-Mizil and Lee, 2011
Conversation Dialog Corpora from Television and Movie Scripts English text text Television shows and movies Human-Human 1,042,288 dialog pairs (raw), 86,719 dialog pairs (after filtering) N/A This corpus contains conversation pairs extracted from television and movie scripts. The dialogues are filtered to ensure they are between two speakers, using a method called tri-turn filtering and semantic similarity filtering. The final corpus includes 86,719 high-quality query-response pairs. Nio et al., 2014
TVD: a reproducible and multiply aligned TV series dataset English text text, audio, video TV Series (The Big Bang Theory and Game of Thrones) Human-Human 132 episodes of TBBT, 5 episodes of GoT (manual transcripts), 17 TBBT and 10 GoT episodes (subtitles), 17 TBBT and 10 GoT episodes (automatic transcripts), outlines and summaries for multiple episodes N/A The TVD dataset is built around two TV series, The Big Bang Theory and Game of Thrones, and includes multiple tracks such as manual and automatic transcripts, multilingual subtitles, episode outlines, and various metadata. The dataset is designed for tasks like summarization, scene retrieval, and speech retrieval. Roy et al., 2014
Annotated Corpus of Film Dialogue for Learning and Characterizing Character Style English text text Film dialogue from multiple genres (drama, thriller, crime, comedy, action, romance, adventure) Human-Human 862 film scripts, 664,000 lines of dialogue, 9,599,000 tokens N/A A corpus of film dialogue collected from the IMSDb archive, annotated for linguistic structures and character archetypes, used to learn character models of linguistic style. Walker et al., 2012a
SubTle Corpus English, Portuguese text text Horror, Sci-fi, Western, Romance Human-Human SubTle - Portuguese: 2,930,173 I-R pairs; SubTle - English: 3,454,480 I-R pairs Varies by genre, average ranges from 419 to 580 I-R pairs per subtitle file A corpus of Interaction-Response pairs extracted from subtitles files, created to help dialogue systems deal with Out-of-Domain interactions. Ameixa and Coheur, 2013
OPUS Multiple languages (over 90 languages) text text Multiple domains (legislative texts, administrative texts, movie subtitles, software localization, newspaper texts) Human-Human Over 40 billion tokens, 2.7 billion parallel units (aligned sentences and sentence fragments) N/A A growing language resource of freely accessible parallel corpora and related tools, used for various applications including machine translation, translation studies, and cross-linguistic corpus studies. Tiedemann, 2012
NPS Internet Chatroom Conversations English text text General chat, open to any topic Human-Human 10K posts, 45K tokens N/A The corpus consists of online chat dialogues collected from various chat rooms, annotated with lexical, syntactic, and discourse information. It was developed to support natural language processing applications such as author profiling, entity identification, and social network analysis. Forsyth and Martell, 2007
Twitter Conversations Corpus English text text Open-domain (Twitter conversations) Human-Human 1.3 million conversations 2 (majority of conversations have only 2 posts) A large corpus of 1.3 million Twitter conversations, enabling the study of open-domain dialogue acts and structure in a new medium. Ritter et al., 2010
Twitter Triple Corpus English text text Social Media (Twitter) Human-Human 127M triples N/A (Context + Message + Response as triples) A large-scale corpus mined from Twitter, used for training context-sensitive response generation models. The corpus consists of triples representing context, message, and response. Sordoni et al., 2015
NUS SMS Corpus English, Chinese text text General SMS communication Human-Human 57,824 messages N/A A public SMS corpus focusing on English and Mandarin Chinese SMS messages, collected through crowdsourcing methods. Chen and Kan, 2013
Settlers of Catan Strategic Conversation Corpus English text text Game negotiation (Settlers of Catan) Human-Human 21 games annotated with approximately 2000 dialogue turns Varies per game, approximately a few dozen per game A corpus of online chat negotiations during the game The Settlers of Catan, focusing on strategic conversation and negotiation dialogues. Afantenos et al., 2012
Cards corpus English text text Task-oriented (card game in a maze-like environment) Human-Human 744 transcripts, 23,532 utterances, 137,323 words 31.63 The Cards corpus is built from a two-person online video game where players collaborate to complete a task. The game records everything, allowing for detailed study of player utterances, context, and strategies in a simple, controlled environment. Djalali et al., 2012
Agreement and Disagreement in Threaded Discussions English text text Wikipedia discussion forums, LiveJournal weblogs Human-Human 118 unique documents, 810 annotated sentence pairs N/A A corpus of sentence-level agreement and disagreement annotations over threaded discussions on Wikipedia and LiveJournal. Andreas et al., 2012
Agreement by Create Debaters (ABCD) English text text Online discussion forums (e.g., createdebate.com) Human-Human 10K discussions, 200K posts approximately 20 turns per discussion A large corpus derived from the Create Debate website, containing over 10,000 discussions with more than 200,000 posts annotated for agreement, disagreement, or neutrality. Rosenthal and McKeown, 2015
Internet Argument Corpus (IAC) English text text Political debate and discourse Human-Human 390,704 posts in 11,800 discussions N/A A corpus for research on deliberation and debate, containing argumentative discourse from the online debate site 4forums.com. It includes posts on various political and social topics with annotations for topic, stance, and various dialogic and argumentative markers. Walker et al., 2012b
Multi-Party Chat (MPC) Corpus English text text Online chat environments Human-Human 7317 turns, 58175 words Approximately 520 per session A corpus of multi-party online conversations collected in a chat-room environment to model social phenomena such as agenda control, influence, and leadership in online interactions. Shaikh et al., 2010
Ubuntu Chat Corpus Multiple languages (English, Chinese, Russian, Brazilian Portuguese, Spanish, Italian, Polish, Swedish) text text Technical support for Ubuntu OS Human-Human 11 channels, 40M+ messages, 2.9GB (compressed to 0.6GB) Average message length varies across channels (21.7 to 57.6 characters) The Ubuntu Chat Corpus is a large, publicly available corpus consisting of IRC chat logs from various Ubuntu support channels. It includes messages in multiple languages and covers technical discussions related to Ubuntu OS. Uthus and Aha, 2013
The Movie Dialog Dataset English text text Movies Human-Human ∼75k movie entities, ∼3.5M training examples Varies by task A set of four tasks designed to evaluate different prerequisite qualities of end-to-end dialog systems, focusing on the movie domain. These tasks include question-answering, recommendation, QA+recommendation dialog, and Reddit discussion. Dodge et al., 2015
Cooperative Vision-and-Dialog Navigation (CVDN) English multimodal text, image Navigation in simulated, photorealistic home environments Human-Human 2050 dialogues, 7k navigation trajectories 6 A dataset of over 2k embodied, human-human dialogues situated in simulated, photorealistic home environments for studying vision-and-dialog navigation tasks. Thomason et al., 2020
Talk The Walk English multimodal text, audio Navigation in NYC neighborhoods Human-Human 10,310 dialogues 62 Talk The Walk is a large-scale dialogue dataset grounded in action and perception, where a ‘guide’ and a ‘tourist’ communicate to achieve the goal of navigating the tourist to a target location in New York City. De Vries et al., 2018
Japanese Emotion-Tagged Dialogue Corpus Japanese text text Twitter dialogues Human-Human 3,828 dialogues, 13,806 utterances 3.6 A Japanese dialogue corpus annotated with expressed and experienced emotions for each utterance, collected from Twitter. Ide and Kawahara, 2022
MultiWOZ 2.1 English text text Multiple domains (hotel, taxi, restaurant, etc.) Human-Woz 10K dialogues, over 115K turns 11.5 MultiWOZ 2.1 is a multi-domain dialogue dataset with corrections in state annotations and dialogue utterances, building on the original MultiWOZ 2.0. It includes system and user dialogue acts and offers a benchmark for dialogue state tracking models. Eric et al., 2019
MultiWOZ 2.2 English text text Multiple domains (Restaurant, Hotel, Attraction, Taxi, Train, Hospital, Bus, Police) Human-Woz 10K dialogues, 115K turns N/A MultiWOZ 2.2 is an updated version of the MultiWOZ dataset, with corrections to dialogue state annotations, redefined ontology, and additional slot span annotations. It is used as a benchmark for dialogue state tracking in task-oriented dialogues across multiple domains. Zang et al., 2020
MultiWOZ 2.3 English text text Multiple domains (Train, Taxi, Hotel, Restaurant, Attraction, Hospital, Bus, Police) Human-Woz 10K dialogues, 2.5M tokens unknown MultiWOZ 2.3 is a multi-domain task-oriented dialogue dataset with enhanced annotation corrections and co-reference annotation. Han et al., 2021
MultiWOZ 2.4 English text text Multiple domains (e.g., restaurant, hotel, taxi) Human-Woz 2,000 dialogues, 14,000 turns N/A MultiWOZ 2.4 is an updated version of the MultiWOZ 2.1 dataset. It includes refined annotations in the validation set and test set to improve the evaluation of dialogue state tracking models, focusing on task-oriented dialogues across multiple domains. Ye et al., 2022
JMultiWOZ Japanese text text travel-related domains (tourist attractions, accommodation, restaurants, shopping facilities, taxis, weather) Human-Woz 4,246 dialogues, 61,186 turns, 1.1M tokens 14.4 A large-scale Japanese multi-domain task-oriented dialogue dataset focused on travel-related domains. Ohashi et al., 2024
RealPersonaChat (RPC) Japanese text text General chit-chat conversations Human-Human 14K dialogues, 421K utterances, 5.55M tokens 30.09 A large-scale realistic dialogue corpus in Japanese that includes the actual personas and personality traits of the interlocutors. It is the world’s largest corpus of dialogue data that includes personas and personality traits. Yamashita et al., 2023
DIHANA Spanish speech audio Train services (nationwide trains in Spain) Human-Woz 900 dialogues, 6,278 user turns, 9,129 wizard turns, 48,243 words 7.0 Spontaneous speech dialogues for train service queries using the Wizard of Oz technique, focused on information retrieval for nationwide trains in Spain. Benedí et al, 2006
Wizard of Wikipedia English text text Open-domain (various topics including commuting, music festivals, Arnold Schwarzenegger, etc.) Human-Human 22.3K dialogues, 201.9K turns 9.0 Open-domain dialogues grounded with knowledge retrieved from Wikipedia, focusing on conducting knowledgeable discussions. Dinan et al., 2018
FoCus (Call For Customized conversation) English text text Geographical landmarks Human-Machine 14,452 dialogues, 173,424 utterances 11.99 The FoCus dataset contains conversations about geographical landmarks, where the machine provides customized and knowledgeable responses by grounding the dialogue in both Wikipedia knowledge and user persona. Jang et al., 2022
MPCHAT English multimodal text, image Episodic memory-based dialogues sourced from Reddit Human-Human 15K multi-turn dialogues, 42,531 utterances by 25,877 users 2.83 (approx.) A multimodal persona-grounded dialogue dataset where personas reveal speakers’ episodic memories using both text and images. Ahn et al., 2023
DuLeMon Chinese text text Open-domain dialogue with a focus on long-term persona memory Human-Chatbot 27,501 dialogues 16.2 DuLeMon is a dataset designed for studying long-term memory conversation tasks in Chinese. It focuses on the active construction and utilization of the user’s persona in long-term interactions, with explicit annotation of persona-related information in each dialogue. Xu et al., 2022b
MSPD (Multi-Session Personalized Dialogue) Korean text text Personalized conversations, including daily, knowledge-based, empathetic, and personalized dialogues Human-Human-System 13,469 episodes, 53,880 sessions, 601,062 utterances 11.15 A Korean Multi-Session Personalized Dialogue dataset designed to enable models to generate personalized responses grounded on user persona attributes, focusing on natural and engaging conversation across multiple sessions. Kwon et al., 2023
BlendedSkillTalk English text text Multiple domains (personal background, knowledge, empathy) Human-Human 5k conversations, 56k utterances 11.2 BlendedSkillTalk is a dataset designed to evaluate a model’s ability to blend multiple conversational skills—knowledge, empathy, and personal background—within a single conversation. Smith et al., 2020
Empathetic Dialogues English text text Emotional situations in personal conversations Human-Human 25K dialogues, 24,850 conversations 4.31 A dataset of 25k conversations grounded in emotional situations, designed to improve empathetic dialogue generation. Rashkin et al., 2019
PEC (Persona-based Empathetic Conversations) English text text Multiple domains (happy, offmychest) Human-Human 355K conversations Training set has 6 most recent turns per conversation A large-scale, multi-domain dataset for persona-based empathetic conversations collected from Reddit, focusing on the impact of persona on empathetic responses. Zhong et al., 2020
PersonaMinEdit English text text Persona-grounded dialogues Human-Human Multiple human references N/A PERSONAMINEDIT is a dataset designed to evaluate persona-grounded minimal editing, focusing on editing dialogue responses to improve persona consistency while maintaining coherence with the dialogue history. Wu et al., 2021a
Inadequate-Tiny-ConvAI2 (IT-ConvAI2) English text text Dialogue generation domain Human-Human 1,595 conversations N/A IT-ConvAI2 is a dataset that emphasizes the out-of-predefined persona (OOP) problem in personalized dialogue generation. It is built by removing query-related personas from the original ConvAI2 dataset. Liu et al., 2022
LiveChat Chinese text text Live streaming, multi-party conversations Human-Human 1.33M dialogues, 9.4M utterances 7.1 A large-scale personalized dialogue dataset automatically constructed from live streaming videos, containing detailed persona profiles and multi-party conversations. Gao et al., 2023
PER-CHAT English text text Open-domain Human-Human 1.5M dialogues, 300K user profiles Single-turn dialogues PER-CHAT is an open-domain single-turn dialogue dataset consisting of 1.5M conversations and 300k user profiles collected from Reddit. It includes detailed personalization information such as user profiles and comment histories, making it suitable for generating personalized responses in dialogue systems. Wu et al., 2021b
Pchatbot Chinese text text Open-domain (Weibo), Professional domain (Judicial forums) Human-Human 198.88M dialogues, 397.75M utterances 26.21 for PchatbotW, 2.95 for PchatbotL Pchatbot is a large-scale Chinese conversation dataset dedicated to the development of personalized dialogue models, containing two subsets collected from Weibo and Judicial forums respectively. The dataset includes anonymized user IDs and timestamps to enable personalized dialogue modeling. Qian et al, 2021
Multimodal EmotionLines Dataset (MELD) English multimodal text, audio, video emotion recognition in conversations Human-Human 1,433 dialogues, 13,000 utterances 9.6 MELD is a multimodal multi-party conversational emotion recognition dataset that includes text, audio, and visual data from the TV series Friends. It is designed for emotion recognition in conversations. Poria et al., 2019
Multi-Party Dialogue Dataset (MPDD) Chinese text text Social interactions, Interpersonal relationships Human-Human 4,142 dialogues, 25,548 utterances 6.168 MPDD is a Chinese multi-party dialogue dataset annotated with emotion and interpersonal relationship labels on each utterance. The dialogues are sourced from TV series scripts and are designed to facilitate the analysis of emotions and relationships in social dialogues. Chen et al., 2020
RobotSlang Benchmark English text text, audio, video Robot Localization and Navigation Human-Human 169 dialogues, nearly 5k utterances, 1k minutes of robot camera and control streams 28 A benchmark of human-human cooperative trials for controlling a physical robot through natural language dialogues, focusing on localization and navigation tasks. Banerjee et al., 2020
TEACh (Task-driven Embodied Agents that Chat) English multimodal text, actions (environment interactions) Household tasks in a simulated environment Human-Human 3,047 dialogues 13.67 TEACh is a dataset of over 3,000 human-human dialogues where a Commander with oracle task knowledge communicates with a Follower to complete household tasks in a simulated environment. The dataset supports studies on embodied intelligence, including language grounding, dialogue understanding, and task execution. Padmakumar et al., 2021
Minecraft Dialogue Corpus English text text Collaborative building in Minecraft Human-Human 509 dialogues, 15,926 utterances, 113,116 tokens 30.7 A collection of 509 human-human written dialogues and game logs for a collaborative building task in a Minecraft-based environment, where one player instructs another to build a structure. Narayan-Chen et al., 2019
DialFRED English multimodal text, audio, video Household tasks (navigation and object manipulation) Human-Agent 53K task-relevant questions and answers N/A DialFRED is a dialogue-enabled embodied instruction following benchmark that allows an agent to actively ask questions and use the information in the response to better complete household tasks. It is built by augmenting the ALFRED benchmark and includes a human-annotated dataset with 53K task-relevant questions and answers. Gao et al., 2022
Dialog State Tracking Challenge 3 (DSTC3) English speech text, audio Tourist information (restaurants, pubs, coffee shops) Human-System 2,275 dialogs, 17,677 turns N/A The third Dialog State Tracking Challenge (DSTC3) focused on evaluating the ability of trackers to generalize to new entities, such as new slots and values not present in the training data. The challenge involved human-computer dialogs in the tourist information domain, covering restaurants, pubs, and coffee shops in Cambridge, UK. Henderson et al., 2014
Friends TV Show Emotion Corpus English text text TV Show Transcripts Human-Human 12,606 utterances, 897 scenes, 97 episodes 14.05 A corpus comprising transcripts from the TV show Friends, annotated with seven emotions on consecutive utterances in multiparty dialogues. Zahiri and Choi, 2017