Corpora and datasets for discourse and dialogue research
Parts of this list were extracted from the papers using ChatGPT, so they might be wrong. If you find errors, please create a GitHub issue or pull request (edit this file). If you don’t have a GitHub account, please email mikio.nakano (at) c4a.jp.
Parts of this list have been adapted from A Survey of Available Corpora for Building Data-Driven Dialogue Systems, with permission; see the survey website for reference and please cite the paper if useful.
Name | Language | Modalities | Data Types | Task/Domain | Participants | Size | Ave. # of Turns | Brief Description | Paper |
---|---|---|---|---|---|---|---|---|---|
Let’s Go & DSTC1 | English | Speech | Audio | Bus schedules | Human-System | 171K dialogues | N/A | Telephone conversations between real users and bus information systems | Raux et al., 2006 |
DSTC2 | English | Speech | Transcripts and ASR results | Restaurant search | Human-System | 15K dialogues, 3.7M words | 7.88 | Telephone conversations between hired users and a restaurant search system | Henderson et al., 2014 |
MultiWOZ 2.0 | English | Text | Text | Multiple domains (restaurant, hotel, etc.) | Human-Woz | 8.5K dialogues, 115K turns, 1.5M tokens | 13.18 | A fully-labeled collection of human-human written conversations spanning over multiple domains and topics | Budzianowski et al., 2018 |
HCRC MapTask Corpus | English | Face-to-face | Audio, video (not available) | direction giving | Human-Human | 128 dialogues, 174K words, 18hrs | | A set of 128 dialogues that has been recorded, transcribed, and annotated for a wide range of behaviours, and has been released for research purposes. | Anderson et al., 1991 |
AMI Corpus | English | face-to-face | close-talking and far-field microphones, individual and room-view video cameras, projection, a whiteboard, individual pens. | Face-to-face meetings | Multi-party human | 175 dialogues, 900K words, 100hrs | | A multi-modal data set consisting of 100 hours of meeting recordings | Carletta et al., 2005 |
Ubuntu Dialogue Corpus | English | IRC chat | text | Chat on Ubuntu | Human-Human | 930K dialogues, 100M words | 7.71 | Dialogues extracted from Ubuntu chat stream on IRC | Lowe et al., 2015 |
DailyDialog Dataset | English | Text | Text | Daily communication | Human-Human | 13K dialogues, 1.5M words | 7.9 | DailyDialog is a high-quality multi-turn dialogue dataset that covers conversations about daily life. It is manually labeled with communication intention and emotion information, making it useful for training and evaluating dialogue systems. | Li et al., 2017 |
Persona Chat | English | Chat text | Text | Open domain | Human-Human | 11K dialogues, 162K utterances | | A chit-chat dataset where paired Turkers are given assigned personas and chat to try to get to know each other. | Zhang et al., 2018 |
Schema-Guided Dialogue Dataset | English | Text | Text | 16 domains | Human-System | 16K dialogues, 330K turns | | The dataset consists of conversations between a virtual assistant and a user ranging over a variety of domains including Travel, Events, Payment, Media, Restaurants, Weather etc. Annotations for natural language understanding, dialogue state tracking, policy learning, natural language generation and user simulation learning are also included. | Rastogi et al., 2020 |
EmoWOZ | English | Text | Text | Multiple domains (restaurant, hotel, etc.) | Human-Woz | More than 11K dialogues | 14.63 | A large-scale open-source dataset for emotion recognition in task-oriented dialogues with more than 83K emotion annotations of user utterances | Feng et al., 2022 |
Ubuntu Dialogue Corpus | English | text | text | Technical support for Ubuntu-related problems | Human-Human | 930,000 dialogues, 7,100,000 utterances, 100,000,000 words | 7.71 | A dataset containing almost one million multi-turn dialogues extracted from the Ubuntu chat logs, used for research in unstructured multi-turn dialogue systems. It facilitates the development of dialogue managers based on neural language models that can utilize large amounts of unlabeled data. | Lowe et al., 2015 |
Schema-Guided Dialogue (SGD) | English | text | text | 26 services across 16 domains including alarms, banks, buses, calendar events, flights, homes, hotels, media, movies, music, payment, rental cars, restaurants, ridesharing, services, trains, travel, messaging, and weather | Simulated user-system interactions | Over 16,000 dialogues, 329,964 turns | 20.44 | The SGD dataset is designed to support the development of conversational interfaces that can handle multiple domains and services, particularly in scenarios with zero-shot learning where models encounter unseen services or APIs. It uses a schema-guided approach where intents and slots are dynamically provided, facilitating easier integration of new services without retraining. | Rastogi et al., 2020 |
Internet Argument Corpus 2.0 | English | text | text | Online forums and debates on social and political topics | Human-Human | 24,000 posts, 11,079 threads, 3452 authors, 56M tokens | Varies, data includes multiple posts per thread | The IAC 2.0 is an expanded dataset designed to support research on many different aspects of social language and dialogue structure, particularly in online forums on social and political topics. It features an SQL schema for organizing dialogues from several platforms into a structured database format. | Abbott et al., 2016 |
The Settlers of Catan Corpus | English | text | text | Game strategy and conversation | Human-Human | 21 games annotated, ca. 2000 dialogue turns, ca. 40 games collected | Includes ‘a few dozen self-contained bargaining conversations’ per game | A corpus of online chats between agents playing The Settlers of Catan, a competitive win–lose game involving negotiations. The corpus aligns players’ conversations with the state of the game, focusing on negotiation dialogues and strategic interactions. | Afantenos et al., 2012 |
Let’s Go Public corpus | English | speech | audio | Public transportation | Human-System | 627 dialogues, 9162 turns | 14.6 | The corpus contains dialogues from the Let’s Go Public spoken dialog system, which provides bus schedule information during off-peak hours. It includes transcribed calls from the general public, featuring interactions influenced by various user attitudes and environmental conditions. | Raux et al., 2005 |
Dialog State Tracking Challenge | English | speech | text | Bus timetable information | Human-System | 15K transcribed and labeled human-computer dialogs | Varies by dataset; e.g., TRAIN1A: 14.7, TEST4: 10.9 | A corpus of 15,000 human-computer dialogue interactions used for evaluating dialogue systems, specifically focusing on the task of dialog state tracking. The corpus contains dialogs from various dialog systems interacting with real users, collected under the Spoken Dialog Challenge hosted by Carnegie Mellon University. | Williams et al., 2013 |
Carnegie Mellon Communicator | English | speech | audio | Travel planning (air transportation, hotel reservations, car rentals) | Human-System | N/A | N/A | The Carnegie Mellon Communicator system assists users in creating complex travel itineraries through a conversational interface. It utilizes schemas to manage dialogues, aiming to support problem-solving activities by providing information, proposing solutions, and highlighting potential constraint violations. | Rudnicky et al., 1999 |
ATIS Spoken Language Systems Pilot Corpus | English | speech | audio, text | Air travel information | Human-Woz | 41 sessions, 1041 utterances | 25.4 utterances per session | The ATIS corpus is designed for developing and evaluating speech systems that understand spontaneous speech, focused on air travel information. | Hemphill et al., 1990 |
RITEL Corpus | French | speech | audio | Open-domain | Human-System | 582 dialogs, 5360 user queries, 6 hours of user speech | 9 | The RITEL Corpus is a Human-Computer open-domain question answering spoken dialog corpus that includes orthographically transcribed and annotated dialogues focusing on specific entities and topics. It involves a real interaction system rather than a Wizard-of-Oz setup. | Rosset and Petel, 2006 |
Tutorial Dialogs on Mathematical Theorem Proving | German (Translated to English for publication) | text | text, audio, video | Mathematics (Proofs in naive set theory) | Human-Woz | 66 sets of dialog session logs, 1115 total turns, 393 student sentences | 12 | A corpus of dialog session logs from a Wizard-of-Oz experiment focused on teaching proofs in naive set theory, with audio and video logs also collected. | Wolska et al., 2004 |
The MATCH corpus | English | speech | audio | Healthcare, appointment scheduling | Human-Human | 447 dialogues, 6237 turns | 14.0 | The MATCH corpus is a linguistically annotated corpus collected to study the interaction between older and younger users with simulated spoken dialogue systems. It focuses on the effects of cognitive ageing on users’ interactions and was designed to develop technologies to help older users live independently. | Georgila et al., 2010 |
Frames | English | text | text | Travel | Human-Human | 1369 dialogues, 19986 turns | 15 | Frames is a corpus of human-human dialogues collected in a Wizard-of-Oz setting to study complex dialogue flows and decision-making behaviour. The dialogues involve users trying to book travel packages with constraints, exploring options and making selections, facilitated by assistants who manage these requests. | El Asri et al., 2017 |
Multi-Domain In-Car Assistant Dialogue Dataset | English | text | text | Calendar scheduling, weather information retrieval, point-of-interest navigation | Human-Woz | 3,031 dialogues; 2,425 training, 302 validation, 304 test dialogues | 5.25 | This dataset contains dialogues across three domains relevant to in-car personal assistant tasks. Each dialogue is grounded in a knowledge base, making it suitable for developing architectures that reason over world knowledge. | Eric et al., 2017 |
The Walking Around Corpus | English | speech | audio | Pedestrian navigation and spatial cognition | Human-Human | 36 dialogues, detailed transcripts | Multiple tasks involved | The corpus consists of an experimentally parameterized collection of spontaneous spoken dialogues, focusing on lexical choice and variability during direction-giving tasks. It involves participants communicating over mobile phones while one navigates a campus based on directions from a stationary partner. | Brennan et al., 2013 |
Intelligence Squared Debates (IQ2 Debates) | English | speech | text | Various (e.g., foreign policy, health, technology) | Human-Human | 108 debates, average 12,801 words and 117 turns per debate | 117 | A corpus of transcripts from Oxford-style debates held in the US, covering a wide range of topics with experts debating motions before a live audience. The dataset tracks conversational dynamics and strategies used to sway audience opinions. | Zhang et al., 2016 |
Idiap Wolf Database | English | multimodal | audio, video | role-playing game, competitive | Human-Human | 7.3 hours of recordings, 50 day-phase games, 36 participants | N/A | The Idiap Wolf Database consists of audio-visual recordings from a competitive role-playing game where players have deceptive and non-deceptive roles. The unique aspect of this corpus is its focus on group behavior and deception in a controlled game setting. | Hung and Chittaranjan, 2010 |
ICSI Meeting Recorder Dialog Act (MRDA) Corpus | English | speech | audio, text | natural meetings | Human-Human | 75 meetings, approx. 72 hours of speech, 180,218 dialog act tags | N/A | A corpus of hand-annotated dialog acts and adjacency pairs from naturally occurring multi-party meetings recorded at the ICSI. It includes over 180,000 dialog act tags across approximately 72 hours of meetings, focusing on complex discourse phenomena. | Shriberg et al., 2004 |
The Trains 93 Dialogues | English | speech | audio | Task-oriented dialogues involving a planning assistant and manufacturing and shipping goods | Human-Human | 98 dialogues, 5900 turns, 55000 words | Approximately 60.2 | A corpus of task-oriented dialogues set in the Trains domain where a user collaborates with a planning assistant to accomplish tasks involving manufacturing and shipping goods in a railroad freight system. Includes audio files, time-aligned word and phoneme transcriptions. | Heeman and Allen, 1995 |
ICT Rapport Datasets | English | multimodal | audio, video | Narrative task involving retelling events from a sexual harassment awareness video | Human-System | 131 participants | N/A | The Rapport Agent is designed to elicit rapport from human participants within a dyadic narrative task. It utilizes real-time analysis of acoustic properties of speech and speaker gestures to generate nonverbal feedback like nods and posture shifts. | Gratch et al., 2007 |
D64 Multimodal Conversational Corpus | English | multimodal | text, audio, video | General conversation | Human-Human | N/A | N/A | A corpus designed to observe conversational behavior as closely as possible to natural interaction, including elements like gaze, posture, and simultaneous movements. The data, collected in a domestic setting, includes extensive video, audio, and motion-capture records. | Oertel et al., 2013 |
Cardiff Conversation Database (CCDb) | English | audiovisual | audio, video | Natural conversations | Human-Human | 30 conversations, 300 minutes of audio-video data | Approximately 10 per conversation (estimated from 5-minute average duration per conversation) | A unique 2D audiovisual database containing natural conversations between pairs of people, annotated for speaker activity, facial expressions, head motion, and non-verbal utterances. | Aubrey et al., 2013 |
4D Cardiff Conversation Database (4D CCDb) | English | multimodal | 3D video (4D), audio | Natural, dyadic conversations | Human-Human | 17 minutes, 34 sequences | N/A | The 4D CCDb is the first 4D (3D Video) audio-visual database containing natural conversations between pairs of people. It includes fully annotated speaker and listener activities such as conversational facial expressions, head motion, and verbal/non-verbal utterances. | Vandeventer et al., 2015 |
Group Affect and Performance (GAP) Corpus | English | multimodal | audio, text | Group interaction and decision-making | Human-Human | 13 group meetings, 104.45 minutes of recordings | N/A | The GAP corpus contains meeting audio, transcriptions, annotations, decision-making performance, as well as group member influence, post-meeting ratings of satisfaction, and demographics. It is designed to stimulate research on the computational analysis of small group meetings. | Braley and Murray, 2018 |
MULTISIMO Corpus | English | multimodal | text, audio, video | Collaborative group interactions in a quiz solving task | Human-Human | 23 sessions, approximately 4 hours total | N/A | The MULTISIMO Corpus involves collaborative group interactions where participants work together to solve quiz questions. It includes multimodal data from different cameras and microphones, synchronized and complemented by personality test results and experience assessment surveys. | Koutsombogera and Vogel, 2018 |
Movie-DiC | English | text | text | Multiple genres (action, crime, drama, thriller, etc.) | Human-Human | 132,229 dialogues, 764,146 turns | 5.78 | A dialogue corpus extracted from movie scripts for studying semantic and pragmatic aspects of human communication in various contexts and styles. | Banchs, 2012 |
Movie-Triples | English | text | text | Wide range of movie script topics | Human-Human | 484 movies, 196,308 triples, Average tokens/triple: 53 | 3 turns per triple | The MovieTriples dataset is developed by expanding and preprocessing the Movie-DiC dataset for generative dialogue modeling. It includes dialogues of three turns between two interlocutors, derived from movie scripts, making it suitable for building dialogue systems that emulate human conversations. | Serban et al., 2016 |
Cornell Movie-Dialogs Corpus | English | text | text | Movie scripts | Human-Human | 220,579 conversational exchanges from 617 unique titles | 5 or more exchanges per pair | A large set of imagined conversations derived from movie scripts, providing a rich resource for studying linguistic coordination and stylistic convergence in fictional dialogues. | Danescu-Niculescu-Mizil and Lee, 2011 |
Conversation Dialog Corpora from Television and Movie Scripts | English | text | text | Television shows and movies | Human-Human | 1,042,288 dialog pairs (raw), 86,719 dialog pairs (after filtering) | N/A | This corpus contains conversation pairs extracted from television and movie scripts. The dialogues are filtered to ensure they are between two speakers, using a method called tri-turn filtering and semantic similarity filtering. The final corpus includes 86,719 high-quality query-response pairs. | Nio et al., 2014 |
TVD: a reproducible and multiply aligned TV series dataset | English | text | text, audio, video | TV Series (The Big Bang Theory and Game of Thrones) | Human-Human | 132 episodes of TBBT, 5 episodes of GoT (manual transcripts), 17 TBBT and 10 GoT episodes (subtitles), 17 TBBT and 10 GoT episodes (automatic transcripts), outlines and summaries for multiple episodes | N/A | The TVD dataset is built around two TV series, The Big Bang Theory and Game of Thrones, and includes multiple tracks such as manual and automatic transcripts, multilingual subtitles, episode outlines, and various metadata. The dataset is designed for tasks like summarization, scene retrieval, and speech retrieval. | Roy et al., 2014 |
Annotated Corpus of Film Dialogue for Learning and Characterizing Character Style | English | text | text | Film dialogue from multiple genres (drama, thriller, crime, comedy, action, romance, adventure) | Human-Human | 862 film scripts, 664,000 lines of dialogue, 9,599,000 tokens | N/A | A corpus of film dialogue collected from the IMSDb archive, annotated for linguistic structures and character archetypes, used to learn character models of linguistic style. | Walker et al., 2012a |
SubTle Corpus | English, Portuguese | text | text | Horror, Sci-fi, Western, Romance | Human-Human | SubTle - Portuguese: 2,930,173 I-R pairs; SubTle - English: 3,454,480 I-R pairs | Varies by genre, average ranges from 419 to 580 I-R pairs per subtitle file | A corpus of Interaction-Response pairs extracted from subtitles files, created to help dialogue systems deal with Out-of-Domain interactions. | Ameixa and Coheur, 2013 |
OPUS | Multiple languages (over 90 languages) | text | text | Multiple domains (legislative texts, administrative texts, movie subtitles, software localization, newspaper texts) | Human-Human | Over 40 billion tokens, 2.7 billion parallel units (aligned sentences and sentence fragments) | N/A | A growing language resource of freely accessible parallel corpora and related tools, used for various applications including machine translation, translation studies, and cross-linguistic corpus studies. | Tiedemann, 2012 |
NPS Internet Chatroom Conversations | English | text | text | General chat, open to any topic | Human-Human | 10K posts, 45K tokens | N/A | The corpus consists of online chat dialogues collected from various chat rooms, annotated with lexical, syntactic, and discourse information. It was developed to support natural language processing applications such as author profiling, entity identification, and social network analysis. | Forsyth and Martell, 2007 |
Twitter Conversations Corpus | English | text | text | Open-domain (Twitter conversations) | Human-Human | 1.3 million conversations | 2 (majority of conversations have only 2 posts) | A large corpus of 1.3 million Twitter conversations, enabling the study of open-domain dialogue acts and structure in a new medium. | Ritter et al., 2010 |
Twitter Triple Corpus | English | text | text | Social Media (Twitter) | Human-Human | 127M triples | N/A (Context + Message + Response as triples) | A large-scale corpus mined from Twitter, used for training context-sensitive response generation models. The corpus consists of triples representing context, message, and response. | Sordoni et al., 2015 |
NUS SMS Corpus | English, Chinese | text | text | General SMS communication | Human-Human | 57,824 messages | N/A | A public SMS corpus focusing on English and Mandarin Chinese SMS messages, collected through crowdsourcing methods. | Chen and Kan, 2013 |
Settlers of Catan Strategic Conversation Corpus | English | text | text | Game negotiation (Settlers of Catan) | Human-Human | 21 games annotated with approximately 2000 dialogue turns | Varies per game, approximately a few dozen per game | A corpus of online chat negotiations during the game The Settlers of Catan, focusing on strategic conversation and negotiation dialogues. | Afantenos et al., 2012 |
Cards corpus | English | text | text | Task-oriented (card game in a maze-like environment) | Human-Human | 744 transcripts, 23,532 utterances, 137,323 words | 31.63 | The Cards corpus is built from a two-person online video game where players collaborate to complete a task. The game records everything, allowing for detailed study of player utterances, context, and strategies in a simple, controlled environment. | Djalali et al., 2012 |
Agreement and Disagreement in Threaded Discussions | English | text | text | Wikipedia discussion forums, LiveJournal weblogs | Human-Human | 118 unique documents, 810 annotated sentence pairs | N/A | A corpus of sentence-level agreement and disagreement annotations over threaded discussions on Wikipedia and LiveJournal. | Andreas et al., 2012 |
Agreement by Create Debaters (ABCD) | English | text | text | Online discussion forums (e.g., createdebate.com) | Human-Human | 10K discussions, 200K posts | approximately 20 turns per discussion | A large corpus derived from the Create Debate website, containing over 10,000 discussions with more than 200,000 posts annotated for agreement, disagreement, or neutrality. | Rosenthal and McKeown, 2015 |
Internet Argument Corpus (IAC) | English | text | text | Political debate and discourse | Human-Human | 390,704 posts in 11,800 discussions | N/A | A corpus for research on deliberation and debate, containing argumentative discourse from the online debate site 4forums.com. It includes posts on various political and social topics with annotations for topic, stance, and various dialogic and argumentative markers. | Walker et al., 2012b |
Multi-Party Chat (MPC) Corpus | English | text | text | Online chat environments | Human-Human | 7317 turns, 58175 words | Approximately 520 per session | A corpus of multi-party online conversations collected in a chat-room environment to model social phenomena such as agenda control, influence, and leadership in online interactions. | Shaikh et al., 2010 |
Ubuntu Chat Corpus | Multiple languages (English, Chinese, Russian, Brazilian Portuguese, Spanish, Italian, Polish, Swedish) | text | text | Technical support for Ubuntu OS | Human-Human | 11 channels, 40M+ messages, 2.9GB (compressed to 0.6GB) | Average message length varies across channels (21.7 to 57.6 characters) | The Ubuntu Chat Corpus is a large, publicly available corpus consisting of IRC chat logs from various Ubuntu support channels. It includes messages in multiple languages and covers technical discussions related to Ubuntu OS. | Uthus and Aha, 2013 |
The Movie Dialog Dataset | English | text | text | Movies | Human-Human | ∼75k movie entities, ∼3.5M training examples | Varies by task | A set of four tasks designed to evaluate different prerequisite qualities of end-to-end dialog systems, focusing on the movie domain. These tasks include question-answering, recommendation, QA+recommendation dialog, and Reddit discussion. | Dodge et al., 2015 |
Cooperative Vision-and-Dialog Navigation (CVDN) | English | multimodal | text, image | Navigation in simulated, photorealistic home environments | Human-Human | 2050 dialogues, 7k navigation trajectories | 6 | A dataset of over 2k embodied, human-human dialogues situated in simulated, photorealistic home environments for studying vision-and-dialog navigation tasks. | Thomason et al., 2020 |
Talk The Walk | English | multimodal | text, audio | Navigation in NYC neighborhoods | Human-Human | 10,310 dialogues | 62 | Talk The Walk is a large-scale dialogue dataset grounded in action and perception, where a ‘guide’ and a ‘tourist’ communicate to achieve the goal of navigating the tourist to a target location in New York City. | De Vries et al., 2018 |
Japanese Emotion-Tagged Dialogue Corpus | Japanese | text | text | Twitter dialogues | Human-Human | 3,828 dialogues, 13,806 utterances | 3.6 | A Japanese dialogue corpus annotated with expressed and experienced emotions for each utterance, collected from Twitter. | Ide and Kawahara, 2022 |
MultiWOZ 2.1 | English | text | text | Multiple domains (hotel, taxi, restaurant, etc.) | Human-Woz | 10K dialogues, over 115K turns | 11.5 | MultiWOZ 2.1 is a multi-domain dialogue dataset with corrections in state annotations and dialogue utterances, building on the original MultiWOZ 2.0. It includes system and user dialogue acts and offers a benchmark for dialogue state tracking models. | Eric et al., 2019 |
MultiWOZ 2.2 | English | text | text | Multiple domains (Restaurant, Hotel, Attraction, Taxi, Train, Hospital, Bus, Police) | Human-Woz | 10K dialogues, 115K turns | N/A | MultiWOZ 2.2 is an updated version of the MultiWOZ dataset, with corrections to dialogue state annotations, redefined ontology, and additional slot span annotations. It is used as a benchmark for dialogue state tracking in task-oriented dialogues across multiple domains. | Zang et al., 2020 |
MultiWOZ 2.3 | English | text | text | Multiple domains (Train, Taxi, Hotel, Restaurant, Attraction, Hospital, Bus, Police) | Human-Woz | 10K dialogues, 2.5M tokens | N/A | MultiWOZ 2.3 is a multi-domain task-oriented dialogue dataset with enhanced annotation corrections and co-reference annotation. | Han et al., 2021 |
MultiWOZ 2.4 | English | text | text | Multiple domains (e.g., restaurant, hotel, taxi) | Human-Woz | 2,000 dialogues, 14,000 turns | N/A | MultiWOZ 2.4 is an updated version of the MultiWOZ 2.1 dataset. It includes refined annotations in the validation set and test set to improve the evaluation of dialogue state tracking models, focusing on task-oriented dialogues across multiple domains. | Ye et al., 2022 |
JMultiWOZ | Japanese | text | text | travel-related domains (tourist attractions, accommodation, restaurants, shopping facilities, taxis, weather) | Human-Woz | 4,246 dialogues, 61,186 turns, 1.1M tokens | 14.4 | A large-scale Japanese multi-domain task-oriented dialogue dataset focused on travel-related domains. | Ohashi et al., 2024 |
RealPersonaChat (RPC) | Japanese | text | text | General chit-chat conversations | Human-Human | 14K dialogues, 421K utterances, 5.55M tokens | 30.09 | A large-scale realistic dialogue corpus in Japanese that includes the actual personas and personality traits of the interlocutors. It is the world’s largest corpus of dialogue data that includes personas and personality traits. | Yamashita et al., 2023 |
DIHANA | Spanish | speech | audio | Train services (nationwide trains in Spain) | Human-Woz | 900 dialogues, 6,278 user turns, 9,129 wizard turns, 48,243 words | 7.0 | Spontaneous speech dialogues for train service queries using the Wizard of Oz technique, focused on information retrieval for nationwide trains in Spain. | Benedí et al., 2006 |
Wizard of Wikipedia | English | text | text | Open-domain (various topics including commuting, music festivals, Arnold Schwarzenegger, etc.) | Human-Human | 22.3K dialogues, 201.9K turns | 9.0 | Open-domain dialogues grounded with knowledge retrieved from Wikipedia, focusing on conducting knowledgeable discussions. | Dinan et al., 2018 |
FoCus (Call For Customized conversation) | English | text | text | Geographical landmarks | Human-Machine | 14,452 dialogues, 173,424 utterances | 11.99 | The FoCus dataset contains conversations about geographical landmarks, where the machine provides customized and knowledgeable responses by grounding the dialogue in both Wikipedia knowledge and user persona. | Jang et al., 2022 |
MPCHAT | English | multimodal | text, image | Episodic memory-based dialogues sourced from Reddit | Human-Human | 15K multi-turn dialogues, 42,531 utterances by 25,877 users | 2.83 (approx.) | A multimodal persona-grounded dialogue dataset where personas reveal speakers’ episodic memories using both text and images. | Ahn et al., 2023 |
DuLeMon | Chinese | text | text | Open-domain dialogue with a focus on long-term persona memory | Human-Chatbot | 27,501 dialogues | 16.2 | DuLeMon is a dataset designed for studying long-term memory conversation tasks in Chinese. It focuses on the active construction and utilization of the user’s persona in long-term interactions, with explicit annotation of persona-related information in each dialogue. | Xu et al., 2022b |
MSPD (Multi-Session Personalized Dialogue) | Korean | text | text | Personalized conversations, including daily, knowledge-based, empathetic, and personalized dialogues | Human-Human-System | 13,469 episodes, 53,880 sessions, 601,062 utterances | 11.15 | A Korean Multi-Session Personalized Dialogue dataset designed to enable models to generate personalized responses grounded on user persona attributes, focusing on natural and engaging conversation across multiple sessions. | Kwon et al., 2023 |
BlendedSkillTalk | English | text | text | Multiple domains (personal background, knowledge, empathy) | Human-Human | 5k conversations, 56k utterances | 11.2 | BlendedSkillTalk is a dataset designed to evaluate a model’s ability to blend multiple conversational skills—knowledge, empathy, and personal background—within a single conversation. | Smith et al., 2020 |
Empathetic Dialogues | English | text | text | Emotional situations in personal conversations | Human-Human | 25K dialogues, 24,850 conversations | 4.31 | A dataset of 25k conversations grounded in emotional situations, designed to improve empathetic dialogue generation. | Rashkin et al., 2019 |
PEC (Persona-based Empathetic Conversations) | English | text | text | Multiple domains (happy, offmychest) | Human-Human | 355K conversations | Training set has 6 most recent turns per conversation | A large-scale, multi-domain dataset for persona-based empathetic conversations collected from Reddit, focusing on the impact of persona on empathetic responses. | Zhong et al., 2020 |
PersonaMinEdit | English | text | text | Persona-grounded dialogues | Human-Human | Multiple human references | N/A | PERSONAMINEDIT is a dataset designed to evaluate persona-grounded minimal editing, focusing on editing dialogue responses to improve persona consistency while maintaining coherence with the dialogue history. | Wu et al., 2021a |
Inadequate-Tiny-ConvAI2 (IT-ConvAI2) | English | text | text | Dialogue generation domain | Human-Human | 1,595 conversations | N/A | IT-ConvAI2 is a dataset that emphasizes the out-of-predefined persona (OOP) problem in personalized dialogue generation. It is built by removing query-related personas from the original ConvAI2 dataset. | Liu et al., 2022 |
LiveChat | Chinese | text | text | Live streaming, multi-party conversations | Human-Human | 1.33M dialogues, 9.4M utterances | 7.1 | A large-scale personalized dialogue dataset automatically constructed from live streaming videos, containing detailed persona profiles and multi-party conversations. | Gao et al., 2023 |
PER-CHAT | English | text | text | Open-domain | Human-Human | 1.5M dialogues, 300K user profiles | Single-turn dialogues | PER-CHAT is an open-domain single-turn dialogue dataset consisting of 1.5M conversations and 300k user profiles collected from Reddit. It includes detailed personalization information such as user profiles and comment histories, making it suitable for generating personalized responses in dialogue systems. | Wu et al., 2021b |
Pchatbot | Chinese | text | text | Open-domain (Weibo), professional domain (judicial forums) | Human-Human | 198.88M dialogues, 397.75M utterances | 26.21 for PchatbotW, 2.95 for PchatbotL | Pchatbot is a large-scale Chinese conversation dataset dedicated to the development of personalized dialogue models, containing two subsets collected from Weibo and judicial forums, respectively. The dataset includes anonymized user IDs and timestamps to enable personalized dialogue modeling. | Qian et al., 2021 |
Multimodal EmotionLines Dataset (MELD) | English | multimodal | text, audio, video | emotion recognition in conversations | Human-Human | 1,433 dialogues, 13,000 utterances | 9.6 | MELD is a multimodal multi-party conversational emotion recognition dataset that includes text, audio, and visual data from the TV series Friends. It is designed for emotion recognition in conversations. | Poria et al., 2019 |
Multi-Party Dialogue Dataset (MPDD) | Chinese | text | text | Social interactions, Interpersonal relationships | Human-Human | 4,142 dialogues, 25,548 utterances | 6.168 | MPDD is a Chinese multi-party dialogue dataset annotated with emotion and interpersonal relationship labels on each utterance. The dialogues are sourced from TV series scripts and are designed to facilitate the analysis of emotions and relationships in social dialogues. | Chen et al., 2020 |
RobotSlang Benchmark | English | text | text, audio, video | Robot localization and navigation | Human-Human | 169 dialogues, nearly 5K utterances, 1K minutes of robot camera and control streams | 28 | A benchmark of human-human cooperative trials for controlling a physical robot through natural language dialogues, focusing on localization and navigation tasks. | Banerjee et al., 2020 |
TEACh (Task-driven Embodied Agents that Chat) | English | multimodal | text, actions (environment interactions) | Household tasks in a simulated environment | Human-Human | 3,047 dialogues | 13.67 | TEACh is a dataset of over 3,000 human-human dialogues where a Commander with oracle task knowledge communicates with a Follower to complete household tasks in a simulated environment. The dataset supports studies on embodied intelligence, including language grounding, dialogue understanding, and task execution. | Padmakumar et al., 2021 |
Minecraft Dialogue Corpus | English | text | text | Collaborative building in Minecraft | Human-Human | 509 dialogues, 15,926 utterances, 113,116 tokens | 30.7 | A collection of 509 human-human written dialogues and game logs for a collaborative building task in a Minecraft-based environment, where one player instructs another to build a structure. | Narayan-Chen et al., 2019 |
DialFRED | English | multimodal | text, audio, video | Household tasks (navigation and object manipulation) | Human-Agent | 53K task-relevant questions and answers | N/A | DialFRED is a dialogue-enabled embodied instruction following benchmark that allows an agent to actively ask questions and use the information in the response to better complete household tasks. It is built by augmenting the ALFRED benchmark and includes a human-annotated dataset with 53K task-relevant questions and answers. | Gao et al., 2022 |
Dialog State Tracking Challenge 3 (DSTC3) | English | speech | text, audio | Tourist information (restaurants, pubs, coffee shops) | Human-System | 2,275 dialogues, 17,677 turns | N/A | The third Dialog State Tracking Challenge (DSTC3) evaluated how well trackers generalize to new entities, such as slots and values not present in the training data. The data consists of human-computer dialogues in the tourist information domain, covering restaurants, pubs, and coffee shops in Cambridge, UK. | Henderson et al., 2014 |
Friends TV Show Emotion Corpus | English | text | text | TV Show Transcripts | Human-Human | 12,606 utterances, 897 scenes, 97 episodes | 14.05 | A corpus comprising transcripts from the TV show Friends, annotated with seven emotions on consecutive utterances in multiparty dialogues. | Zahiri and Choi, 2017 |