Corpora and Datasets | SIGdial Resource Lists

Corpora and datasets for discourse and dialogue research

Parts of this list were extracted from the papers using ChatGPT, so they may contain errors. If you find errors, please create a GitHub issue or pull request. If you don’t have a GitHub account, please email resources@sigdial.org.

Parts of this list have been adapted from A Survey of Available Corpora for Building Data-Driven Dialogue Systems, with permission; see the survey website for reference and please cite the paper if useful.

We also referred to the survey paper On the Need for Thoughtful Data Collection for Multi-Party Dialogue: A Survey of Available Corpora and Collection Methods. We would like to thank the authors.

We would also like to thank David Traum, who provided the information.

Name	Language	Modalities	Data Types	Task/Domain	Participants	Size	Ave. # of Turns	Brief Description	Paper
Let’s Go & DSTC1	English	Speech	Audio	Bus schedules	Human-System	171K dialogues	N/A	Telephone conversations between real users and bus information systems	Raux et al. 2006
Georgetown University Multilayer corpus (GUM)	English	Mixed (text and speech)	text, markup and transcripts	24 spoken and written genres	Human-Human	~300K tokens	~55 utterances per document	A multilayer English corpus of 24 spoken and written genres annotated for RST and PDTB discourse relations, subtyped coreference and bridging anaphora, entity and proposition salience, multiple summarization, UD syntax and more	Zeldes et al. 2025
Georgetown Chinese Discourse Treebank	Mandarin Chinese	Mixed (text and speech)	text, markup and transcripts	5 spoken and written genres	Human-Human	~63K tokens	~54 utterances per document	A multilayer Chinese corpus of 5 spoken and written genres annotated for RST discourse relations and dependencies, UD syntax and more	Peng et al. 2022
DSTC2	English	Speech	Transcripts and ASR results	Restaurant search	Human-System	15K dialogues, 3.7M words	7.88	Telephone conversations between hired users and a restaurant search system	Henderson et al., 2014
MultiWOZ 2.0	English	Text	Text	Multiple domains (restaurant, hotel, etc.)	Human-Woz	8.5K dialogues, 115K turns, 1.5M tokens	13.18	A fully-labeled collection of human-human written conversations spanning multiple domains and topics	Budzianowski et al., 2018
HCRC MapTask Corpus	English	Face-to-face	Audio, video (not available)	direction giving	Human-Human	128 dialogues, 174K words, 18hrs		A set of 128 dialogues that has been recorded, transcribed, and annotated for a wide range of behaviours, and has been released for research purposes.	Anderson et al., 1991
AMI Corpus	English	face-to-face	close-talking and far-field microphones, individual and room-view video cameras, projection, a whiteboard, individual pens.	Face-to-face meetings	Multi-party human	175 dialogues, 900K words, 100hrs		A multi-modal data set consisting of 100 hours of meeting recordings	Carletta et al., 2005
Ubuntu Dialogue Corpus	English	IRC chat	text	Chat on Ubuntu	Human-Human	930K dialogues, 100M words	7.71	Dialogues extracted from the Ubuntu chat stream on IRC	Lowe et al., 2015
DailyDialog Dataset	English	Text	Text	Daily communication	Human-Human	13K dialogues, 1.5M words	7.9	DailyDialog is a high-quality multi-turn dialogue dataset that covers conversations about daily life. It is manually labeled with communication intention and emotion information, making it useful for training and evaluating dialogue systems.	Li et al. 2017
Persona Chat	English	Chat text	Text	Open domain	Human-Human	11K dialogues, 162K utterances		A chit-chat dataset where paired Turkers are given assigned personas and chat to try to get to know each other.	Zhang et al., 2018
Schema-Guided Dialogue Dataset	English	Text	Text	16 domains	Human-System	16K dialogues, 330K turns		The dataset consists of conversations between a virtual assistant and a user ranging over a variety of domains including Travel, Events, Payment, Media, Restaurants, Weather etc. Annotations for natural language understanding, dialogue state tracking, policy learning, natural language generation and user simulation learning are also included.	Rastogi et al., 2020
EmoWOZ	English	Text	Text	Multiple domains (restaurant, hotel, etc.)	Human-Woz	More than 11K dialogues	14.63	A large-scale open-source dataset for emotion recognition in task-oriented dialogues with 83K emotion annotations of user utterances	Feng et al. 2022
Ubuntu Dialogue Corpus	English	text	text	Technical support for Ubuntu-related problems	Human-Human	930,000 dialogues, 7,100,000 utterances, 100,000,000 words	7.71	A dataset containing almost one million multi-turn dialogues extracted from the Ubuntu chat logs, used for research in unstructured multi-turn dialogue systems. It facilitates the development of dialogue managers based on neural language models that can utilize large amounts of unlabeled data.	Lowe et al., 2015
Schema-Guided Dialogue (SGD)	English	text	text	26 services across 16 domains including alarms, banks, buses, calendar events, flights, homes, hotels, media, movies, music, payment, rental cars, restaurants, ridesharing, services, trains, travel, messaging, and weather	Simulated user-system interactions	Over 16,000 dialogues, 329,964 turns	20.44	The SGD dataset is designed to support the development of conversational interfaces that can handle multiple domains and services, particularly in scenarios with zero-shot learning where models encounter unseen services or APIs. It uses a schema-guided approach where intents and slots are dynamically provided, facilitating easier integration of new services without retraining.	Rastogi et al., 2020
Internet Argument Corpus 2.0	English	text	text	Online forums and debates on social and political topics	Human-Human	24,000 posts, 11,079 threads, 3452 authors, 56M tokens	Varies, data includes multiple posts per thread	The IAC 2.0 is an expanded dataset designed to support research on many different aspects of social language and dialogue structure, particularly in online forums on social and political topics. It features an SQL schema for organizing dialogues from several platforms into a structured database format.	Abbott et al., 2016
The Settlers of Catan Corpus	English	text	text	Game strategy and conversation	Human-Human	21 games annotated, ca. 2000 dialogue turns, ca. 40 games collected	Includes ‘a few dozen self-contained bargaining conversations’ per game	A corpus of online chats between agents playing The Settlers of Catan, a competitive win–lose game involving negotiations. The corpus aligns players’ conversations with the state of the game, focusing on negotiation dialogues and strategic interactions.	Afantenos et al., 2012
Let’s Go Public corpus	English	speech	audio	Public transportation	Human-System	627 dialogues, 9162 turns	14.6	The corpus contains dialogues from the Let’s Go Public spoken dialog system, which provides bus schedule information during off-peak hours. It includes transcribed calls from the general public, featuring interactions influenced by various user attitudes and environmental conditions.	Raux et al., 2005
Dialog State Tracking Challenge	English	speech	text	Bus timetable information	Human-System	15K transcribed and labeled human-computer dialogs	Varies by dataset; e.g., TRAIN1A: 14.7, TEST4: 10.9	A corpus of 15,000 human-computer dialogue interactions used for evaluating dialogue systems, specifically focusing on the task of dialog state tracking. The corpus contains dialogs from various dialog systems interacting with real users, collected under the Spoken Dialog Challenge hosted by Carnegie Mellon University.	Williams et al., 2013
Carnegie Mellon Communicator	English	speech	audio	Travel planning (air transportation, hotel reservations, car rentals)	Human-System	N/A	N/A	The Carnegie Mellon Communicator system assists users in creating complex travel itineraries through a conversational interface. It utilizes schemas to manage dialogues, aiming to support problem-solving activities by providing information, proposing solutions, and highlighting potential constraint violations.	Rudnicky et al., 1999
ATIS Spoken Language Systems Pilot Corpus	English	speech	audio, text	Air travel information	Human-Woz	41 sessions, 1041 utterances	25.4 utterances per session	The ATIS corpus is designed for developing and evaluating speech systems that understand spontaneous speech, focused on air travel information.	Hemphill et al., 1990
RITEL Corpus	French	speech	audio	Open-domain	Human-System	582 dialogs, 5360 user queries, 6 hours of user speech	9	The RITEL Corpus is a Human-Computer open-domain question answering spoken dialog corpus that includes orthographically transcribed and annotated dialogues focusing on specific entities and topics. It involves a real interaction system rather than a Wizard-of-Oz setup.	Rosset and Petel, 2006
Tutorial Dialogs on Mathematical Theorem Proving	German (Translated to English for publication)	text	text, audio, video	Mathematics (Proofs in naive set theory)	Human-Woz	66 sets of dialog session logs, 1115 total turns, 393 student sentences	12	A corpus of dialog session logs from a Wizard-of-Oz experiment focused on teaching proofs in naive set theory, with audio and video logs also collected.	Wolska et al., 2004
The MATCH corpus	English	speech	audio	Healthcare, appointment scheduling	Human-Human	447 dialogues, 6237 turns	14.0	The MATCH corpus is a linguistically annotated corpus collected to study the interaction between older and younger users with simulated spoken dialogue systems. It focuses on the effects of cognitive ageing on users’ interactions and was designed to develop technologies to help older users live independently.	Georgila et al., 2010
Frames	English	text	text	Travel	Human-Human	1369 dialogues, 19986 turns	15	Frames is a corpus of human-human dialogues collected in a Wizard-of-Oz setting to study complex dialogue flows and decision-making behaviour. The dialogues involve users trying to book travel packages with constraints, exploring options and making selections, facilitated by assistants who manage these requests.	El Asri et al., 2017
Multi-Domain In-Car Assistant Dialogue Dataset	English	text	text	Calendar scheduling, weather information retrieval, point-of-interest navigation	Human-Woz	3,031 dialogues; 2,425 training, 302 validation, 304 test dialogues	5.25	This dataset contains dialogues across three domains relevant to in-car personal assistant tasks. Each dialogue is grounded in a knowledge base, making it suitable for developing architectures that reason over world knowledge.	Eric et al., 2017
The Walking Around Corpus	English	speech	audio	Pedestrian navigation and spatial cognition	Human-Human	36 dialogues, detailed transcripts	Multiple tasks involved	The corpus consists of an experimentally parameterized collection of spontaneous spoken dialogues, focusing on lexical choice and variability during direction-giving tasks. It involves participants communicating over mobile phones while one navigates a campus based on directions from a stationary partner.	Brennan et al., 2013
Intelligence Squared Debates (IQ2 Debates)	English	speech	text	Various (e.g., foreign policy, health, technology)	Human-Human	108 debates, average 12,801 words and 117 turns per debate	117	A corpus of transcripts from Oxford-style debates held in the US, covering a wide range of topics with experts debating motions before a live audience. The dataset tracks conversational dynamics and strategies used to sway audience opinions.	Zhang et al., 2016
Idiap Wolf Database	English	multimodal	audio, video	role-playing game, competitive	Human-Human	7.3 hours of recordings, 50 day-phase games, 36 participants	N/A	The Idiap Wolf Database consists of audio-visual recordings from a competitive role-playing game where players have deceptive and non-deceptive roles. The unique aspect of this corpus is its focus on group behavior and deception in a controlled game setting.	Hung and Chittaranjan, 2010
ICSI Meeting Recorder Dialog Act (MRDA) Corpus	English	speech	audio, text	natural meetings	Human-Human	75 meetings, approx. 72 hours of speech, 180,218 dialog act tags	N/A	A corpus of hand-annotated dialog acts and adjacency pairs from naturally occurring multi-party meetings recorded at the ICSI. It includes over 180,000 dialog act tags across approximately 72 hours of meetings, focusing on complex discourse phenomena.	Shriberg et al., 2004
The Trains 93 Dialogues	English	speech	audio	Task-oriented dialogues involving a planning assistant and manufacturing and shipping goods	Human-Human	98 dialogues, 5900 turns, 55000 words	Approximately 60.2	A corpus of task-oriented dialogues set in the Trains domain where a user collaborates with a planning assistant to accomplish tasks involving manufacturing and shipping goods in a railroad freight system. Includes audio files, time-aligned word and phoneme transcriptions.	Heeman and Allen, 1995
ICT Rapport Datasets	English	multimodal	audio, video	Narrative task involving retelling events from a sexual harassment awareness video	Human-System	131 participants	N/A	The Rapport Agent is designed to elicit rapport from human participants within a dyadic narrative task. It utilizes real-time analysis of acoustic properties of speech and speaker gestures to generate nonverbal feedback like nods and posture shifts.	Gratch et al., 2007
D64 Multimodal Conversational Corpus	English	multimodal	text, audio, video	General conversation	Human-Human	N/A	N/A	A corpus designed to observe conversational behavior as closely as possible to natural interaction, including elements like gaze, posture, and simultaneous movements. The data, collected in a domestic setting, includes extensive video, audio, and motion-capture records.	Oertel et al., 2013
Cardiff Conversation Database (CCDb)	English	audiovisual	audio, video	Natural conversations	Human-Human	30 conversations, 300 minutes of audio-video data	Approximately 10 per conversation (estimated from 5-minute average duration per conversation)	A unique 2D audiovisual database containing natural conversations between pairs of people, annotated for speaker activity, facial expressions, head motion, and non-verbal utterances.	Aubrey et al., 2013
4D Cardiff Conversation Database (4D CCDb)	English	multimodal	3D video (4D), audio	Natural, dyadic conversations	Human-Human	17 minutes, 34 sequences	N/A	The 4D CCDb is the first 4D (3D Video) audio-visual database containing natural conversations between pairs of people. It includes fully annotated speaker and listener activities such as conversational facial expressions, head motion, and verbal/non-verbal utterances.	Vandeventer et al., 2015
Group Affect and Performance (GAP) Corpus	English	multimodal	audio, text	Group interaction and decision-making	Human-Human	13 group meetings, 104.45 minutes of recordings	N/A	The GAP corpus contains meeting audio, transcriptions, annotations, decision-making performance, as well as group member influence, post-meeting ratings of satisfaction, and demographics. It is designed to stimulate research on the computational analysis of small group meetings.	Braley and Murray, 2018
MULTISIMO Corpus	English	multimodal	text, audio, video	Collaborative group interactions in a quiz solving task	Human-Human	23 sessions, approximately 4 hours total	N/A	The MULTISIMO Corpus involves collaborative group interactions where participants work together to solve quiz questions. It includes multimodal data from different cameras and microphones, synchronized and complemented by personality test results and experience assessment surveys.	Koutsombogera and Vogel, 2018
Movie-DiC	English	text	text	Multiple genres (action, crime, drama, thriller, etc.)	Human-Human	132,229 dialogues, 764,146 turns	5.78	A dialogue corpus extracted from movie scripts for studying semantic and pragmatic aspects of human communication in various contexts and styles.	Banchs, 2012
Movie-Triples	English	text	text	Wide range of movie script topics	Human-Human	484 movies, 196,308 triples, Average tokens/triple: 53	3 turns per triple	The MovieTriples dataset is developed by expanding and preprocessing the Movie-DiC dataset for generative dialogue modeling. It includes dialogues of three turns between two interlocutors, derived from movie scripts, making it suitable for building dialogue systems that emulate human conversations.	Serban et al., 2016
Cornell Movie-Dialogs Corpus	English	text	text	Movie scripts	Human-Human	220,579 conversational exchanges from 617 unique titles	5 or more exchanges per pair	A large set of imagined conversations derived from movie scripts, providing a rich resource for studying linguistic coordination and stylistic convergence in fictional dialogues.	Danescu-Niculescu-Mizil and Lee, 2011
Conversation Dialog Corpora from Television and Movie Scripts	English	text	text	Television shows and movies	Human-Human	1,042,288 dialog pairs (raw), 86,719 dialog pairs (after filtering)	N/A	This corpus contains conversation pairs extracted from television and movie scripts. The dialogues are filtered to ensure they are between two speakers, using a method called tri-turn filtering and semantic similarity filtering. The final corpus includes 86,719 high-quality query-response pairs.	Nio et al., 2014
TVD: a reproducible and multiply aligned TV series dataset	English	text	text, audio, video	TV Series (The Big Bang Theory and Game of Thrones)	Human-Human	132 episodes of TBBT, 5 episodes of GoT (manual transcripts), 17 TBBT and 10 GoT episodes (subtitles), 17 TBBT and 10 GoT episodes (automatic transcripts), outlines and summaries for multiple episodes	N/A	The TVD dataset is built around two TV series, The Big Bang Theory and Game of Thrones, and includes multiple tracks such as manual and automatic transcripts, multilingual subtitles, episode outlines, and various metadata. The dataset is designed for tasks like summarization, scene retrieval, and speech retrieval.	Roy et al., 2014
Annotated Corpus of Film Dialogue for Learning and Characterizing Character Style	English	text	text	Film dialogue from multiple genres (drama, thriller, crime, comedy, action, romance, adventure)	Human-Human	862 film scripts, 664,000 lines of dialogue, 9,599,000 tokens	N/A	A corpus of film dialogue collected from the IMSDb archive, annotated for linguistic structures and character archetypes, used to learn character models of linguistic style.	Walker et al., 2012a
SubTle Corpus	English, Portuguese	text	text	Horror, Sci-fi, Western, Romance	Human-Human	SubTle - Portuguese: 2,930,173 I-R pairs; SubTle - English: 3,454,480 I-R pairs	Varies by genre, average ranges from 419 to 580 I-R pairs per subtitle file	A corpus of Interaction-Response pairs extracted from subtitle files, created to help dialogue systems deal with Out-of-Domain interactions.	Ameixa and Coheur, 2013
OPUS	Multiple languages (over 90 languages)	text	text	Multiple domains (legislative texts, administrative texts, movie subtitles, software localization, newspaper texts)	Human-Human	Over 40 billion tokens, 2.7 billion parallel units (aligned sentences and sentence fragments)	N/A	A growing language resource of freely accessible parallel corpora and related tools, used for various applications including machine translation, translation studies, and cross-linguistic corpus studies.	Tiedemann, 2012
NPS Internet Chatroom Conversations	English	text	text	General chat, open to any topic	Human-Human	10K posts, 45K tokens	N/A	The corpus consists of online chat dialogues collected from various chat rooms, annotated with lexical, syntactic, and discourse information. It was developed to support natural language processing applications such as author profiling, entity identification, and social network analysis.	Forsyth and Martell, 2007
Twitter Conversations Corpus	English	text	text	Open-domain (Twitter conversations)	Human-Human	1.3 million conversations	2 (majority of conversations have only 2 posts)	A large corpus of 1.3 million Twitter conversations, enabling the study of open-domain dialogue acts and structure in a new medium.	Ritter et al., 2010
Twitter Triple Corpus	English	text	text	Social Media (Twitter)	Human-Human	127M triples	N/A (Context + Message + Response as triples)	A large-scale corpus mined from Twitter, used for training context-sensitive response generation models. The corpus consists of triples representing context, message, and response.	Sordoni et al., 2015
NUS SMS Corpus	English, Chinese	text	text	General SMS communication	Human-Human	57,824 messages	N/A	A public SMS corpus focusing on English and Mandarin Chinese SMS messages, collected through crowdsourcing methods.	Chen and Kan, 2013
Settlers of Catan Strategic Conversation Corpus	English	text	text	Game negotiation (Settlers of Catan)	Human-Human	21 games annotated with approximately 2000 dialogue turns	Varies per game, approximately a few dozen per game	A corpus of online chat negotiations during the game The Settlers of Catan, focusing on strategic conversation and negotiation dialogues.	Afantenos et al., 2012
Cards corpus	English	text	text	Task-oriented (card game in a maze-like environment)	Human-Human	744 transcripts, 23,532 utterances, 137,323 words	31.63	The Cards corpus is built from a two-person online video game where players collaborate to complete a task. The game records everything, allowing for detailed study of player utterances, context, and strategies in a simple, controlled environment.	Djalali et al., 2012
Agreement by Create Debaters (ABCD)	English	text	text	Online discussion forums (e.g., createdebate.com)	Human-Human	10K discussions, 200K posts	approximately 20 turns per discussion	A large corpus derived from the Create Debate website, containing over 10,000 discussions with more than 200,000 posts annotated for agreement, disagreement, or neutrality.	Rosenthal and McKeown, 2015
Internet Argument Corpus (IAC)	English	text	text	Political debate and discourse	Human-Human	390,704 posts in 11,800 discussions	N/A	A corpus for research on deliberation and debate, containing argumentative discourse from the online debate site 4forums.com. It includes posts on various political and social topics with annotations for topic, stance, and various dialogic and argumentative markers.	Walker et al., 2012b
Multi-Party Chat (MPC) Corpus	English	text	text	Online chat environments	Human-Human	7317 turns, 58175 words	Approximately 520 per session	A corpus of multi-party online conversations collected in a chat-room environment to model social phenomena such as agenda control, influence, and leadership in online interactions.	Shaikh et al., 2010
Ubuntu Chat Corpus	Multiple languages (English, Chinese, Russian, Brazilian Portuguese, Spanish, Italian, Polish, Swedish)	text	text	Technical support for Ubuntu OS	Human-Human	11 channels, 40M+ messages, 2.9GB (compressed to 0.6GB)	Average message length varies across channels (21.7 to 57.6 characters)	The Ubuntu Chat Corpus is a large, publicly available corpus consisting of IRC chat logs from various Ubuntu support channels. It includes messages in multiple languages and covers technical discussions related to Ubuntu OS.	Uthus and Aha, 2013
The Movie Dialog Dataset	English	text	text	Movies	Human-Human	∼75k movie entities, ∼3.5M training examples	Varies by task	A set of four tasks designed to evaluate different prerequisite qualities of end-to-end dialog systems, focusing on the movie domain. These tasks include question-answering, recommendation, QA+recommendation dialog, and Reddit discussion.	Dodge et al., 2015
Cooperative Vision-and-Dialog Navigation (CVDN)	English	multimodal	text, image	Navigation in simulated, photorealistic home environments	Human-Human	2050 dialogues, 7k navigation trajectories	6	A dataset of over 2k embodied, human-human dialogues situated in simulated, photorealistic home environments for studying vision-and-dialog navigation tasks.	Thomason et al., 2020
Talk The Walk	English	multimodal	text, audio	Navigation in NYC neighborhoods	Human-Human	10,310 dialogues	62	Talk The Walk is a large-scale dialogue dataset grounded in action and perception, where a ‘guide’ and a ‘tourist’ communicate to achieve the goal of navigating the tourist to a target location in New York City.	De Vries et al., 2018
Japanese Emotion-Tagged Dialogue Corpus	Japanese	text	text	Twitter dialogues	Human-Human	3,828 dialogues, 13,806 utterances	3.6	A Japanese dialogue corpus annotated with expressed and experienced emotions for each utterance, collected from Twitter.	Ide and Kawahara, 2022
MultiWOZ 2.1	English	text	text	Multiple domains (hotel, taxi, restaurant, etc.)	Human-Woz	10K dialogues, over 115K turns	11.5	MultiWOZ 2.1 is a multi-domain dialogue dataset with corrections in state annotations and dialogue utterances, building on the original MultiWOZ 2.0. It includes system and user dialogue acts and offers a benchmark for dialogue state tracking models.	Eric et al., 2019
MultiWOZ 2.2	English	text	text	Multiple domains (Restaurant, Hotel, Attraction, Taxi, Train, Hospital, Bus, Police)	Human-Woz	10K dialogues, 115K turns	N/A	MultiWOZ 2.2 is an updated version of the MultiWOZ dataset, with corrections to dialogue state annotations, redefined ontology, and additional slot span annotations. It is used as a benchmark for dialogue state tracking in task-oriented dialogues across multiple domains.	Zang et al., 2020
MultiWOZ 2.3	English	text	text	Multiple domains (Train, Taxi, Hotel, Restaurant, Attraction, Hospital, Bus, Police)	Human-Woz	10K dialogues, 2.5M tokens	unknown	MultiWOZ 2.3 is a multi-domain task-oriented dialogue dataset with enhanced annotation corrections and co-reference annotation.	Han et al., 2021
MultiWOZ 2.4	English	text	text	Multiple domains (e.g., restaurant, hotel, taxi)	Human-Woz	2,000 dialogues, 14,000 turns	N/A	MultiWOZ 2.4 is an updated version of the MultiWOZ 2.1 dataset. It includes refined annotations in the validation set and test set to improve the evaluation of dialogue state tracking models, focusing on task-oriented dialogues across multiple domains.	Ye et al., 2022
JMultiWOZ	Japanese	text	text	travel-related domains (tourist attractions, accommodation, restaurants, shopping facilities, taxis, weather)	Human-Woz	4,246 dialogues, 61,186 turns, 1.1M tokens	14.4	A large-scale Japanese multi-domain task-oriented dialogue dataset focused on travel-related domains.	Ohashi et al., 2024
RealPersonaChat (RPC)	Japanese	text	text	General chit-chat conversations	Human-Human	14K dialogues, 421K utterances, 5.55M tokens	30.09	A large-scale realistic dialogue corpus in Japanese that includes the actual personas and personality traits of the interlocutors. It is the world’s largest corpus of dialogue data that includes personas and personality traits.	Yamashita et al., 2023
DIHANA	Spanish	speech	audio	Train services (nationwide trains in Spain)	Human-Woz	900 dialogues, 6,278 user turns, 9,129 wizard turns, 48,243 words	7.0	Spontaneous speech dialogues for train service queries using the Wizard of Oz technique, focused on information retrieval for nationwide trains in Spain.	Benedí et al., 2006
Wizard of Wikipedia	English	text	text	Open-domain (various topics including commuting, music festivals, Arnold Schwarzenegger, etc.)	Human-Human	22.3K dialogues, 201.9K turns	9.0	Open-domain dialogues grounded with knowledge retrieved from Wikipedia, focusing on conducting knowledgeable discussions.	Dinan et al., 2018
FoCus (Call For Customized conversation)	English	text	text	Geographical landmarks	Human-Machine	14,452 dialogues, 173,424 utterances	11.99	The FoCus dataset contains conversations about geographical landmarks, where the machine provides customized and knowledgeable responses by grounding the dialogue in both Wikipedia knowledge and user persona.	Jang et al., 2022
MPCHAT	English	multimodal	text, image	Episodic memory-based dialogues sourced from Reddit	Human-Human	15K multi-turn dialogues, 42,531 utterances by 25,877 users	2.83 (approx.)	A multimodal persona-grounded dialogue dataset where personas reveal speakers’ episodic memories using both text and images.	Ahn et al., 2023
DuLeMon	Chinese	text	text	Open-domain dialogue with a focus on long-term persona memory	Human-Chatbot	27,501 dialogues	16.2	DuLeMon is a dataset designed for studying long-term memory conversation tasks in Chinese. It focuses on the active construction and utilization of the user’s persona in long-term interactions, with explicit annotation of persona-related information in each dialogue.	Xu et al., 2022b
MSPD (Multi-Session Personalized Dialogue)	Korean	text	text	Personalized conversations, including daily, knowledge-based, empathetic, and personalized dialogues	Human-Human-System	13,469 episodes, 53,880 sessions, 601,062 utterances	11.15	A Korean Multi-Session Personalized Dialogue dataset designed to enable models to generate personalized responses grounded on user persona attributes, focusing on natural and engaging conversation across multiple sessions.	Kwon et al., 2023
BlendedSkillTalk	English	text	text	Multiple domains (personal background, knowledge, empathy)	Human-Human	5k conversations, 56k utterances	11.2	BlendedSkillTalk is a dataset designed to evaluate a model’s ability to blend multiple conversational skills—knowledge, empathy, and personal background—within a single conversation.	Smith et al., 2020
Empathetic Dialogues	English	text	text	Emotional situations in personal conversations	Human-Human	25K dialogues, 24,850 conversations	4.31	A dataset of 25k conversations grounded in emotional situations, designed to improve empathetic dialogue generation.	Rashkin et al., 2019
PEC (Persona-based Empathetic Conversations)	English	text	text	Multiple domains (happy, offmychest)	Human-Human	355K conversations	Training set has 6 most recent turns per conversation	A large-scale, multi-domain dataset for persona-based empathetic conversations collected from Reddit, focusing on the impact of persona on empathetic responses.	Zhong et al., 2020
PersonaMinEdit	English	text	text	Persona-grounded dialogues	Human-Human	Multiple human references	N/A	PERSONAMINEDIT is a dataset designed to evaluate persona-grounded minimal editing, focusing on editing dialogue responses to improve persona consistency while maintaining coherence with the dialogue history.	Wu et al., 2021a
Inadequate-Tiny-ConvAI2 (IT-ConvAI2)	English	text	text	Dialogue generation domain	Human-Human	1,595 conversations	N/A	IT-ConvAI2 is a dataset that emphasizes the out-of-predefined persona (OOP) problem in personalized dialogue generation. It is built by removing query-related personas from the original ConvAI2 dataset.	Liu et al., 2022
LiveChat	Chinese	text	text	Live streaming, multi-party conversations	Human-Human	1.33M dialogues, 9.4M utterances	7.1	A large-scale personalized dialogue dataset automatically constructed from live streaming videos, containing detailed persona profiles and multi-party conversations.	Gao et al., 2023
PER-CHAT	English	text	text	Open-domain	Human-Human	1.5M dialogues, 300K user profiles	Single-turn dialogues	PER-CHAT is an open-domain single-turn dialogue dataset consisting of 1.5M conversations and 300k user profiles collected from Reddit. It includes detailed personalization information such as user profiles and comment histories, making it suitable for generating personalized responses in dialogue systems.	Wu et al., 2021b
Pchatbot	Chinese	text	text	Open-domain (Weibo), Professional domain (Judicial forums)	Human-Human	198.88M dialogues, 397.75M utterances	26.21 for PchatbotW, 2.95 for PchatbotL	Pchatbot is a large-scale Chinese conversation dataset dedicated to the development of personalized dialogue models, containing two subsets collected from Weibo and Judicial forums respectively. The dataset includes anonymized user IDs and timestamps to enable personalized dialogue modeling.	Qian et al., 2021
Multimodal EmotionLines Dataset (MELD)	English	multimodal	text, audio, video	emotion recognition in conversations	Human-Human	1,433 dialogues, 13,000 utterances	9.6	MELD is a multimodal multi-party conversational emotion recognition dataset that includes text, audio, and visual data from the TV series Friends. It is designed for emotion recognition in conversations.	Poria et al., 2019
Multi-Party Dialogue Dataset (MPDD)	Chinese	text	text	Social interactions, Interpersonal relationships	Human-Human	4,142 dialogues, 25,548 utterances	6.168	MPDD is a Chinese multi-party dialogue dataset annotated with emotion and interpersonal relationship labels on each utterance. The dialogues are sourced from TV series scripts and are designed to facilitate the analysis of emotions and relationships in social dialogues.	Chen et al., 2020
RobotSlang Benchmark	English	text	text, audio, video	Robot Localization and Navigation	Human-Human	169 dialogues, nearly 5k utterances, 1k minutes of robot camera and control streams	28	A benchmark of human-human cooperative trials for controlling a physical robot through natural language dialogues, focusing on localization and navigation tasks.	Banerjee et al., 2020
TEACh (Task-driven Embodied Agents that Chat)	English	multimodal	text, actions (environment interactions)	Household tasks in a simulated environment	Human-Human	3,047 dialogues	13.67	TEACh is a dataset of over 3,000 human-human dialogues where a Commander with oracle task knowledge communicates with a Follower to complete household tasks in a simulated environment. The dataset supports studies on embodied intelligence, including language grounding, dialogue understanding, and task execution.	Padmakumar et al., 2021
Minecraft Dialogue Corpus	English	text	text	Collaborative building in Minecraft	Human-Human	509 dialogues, 15,926 utterances, 113,116 tokens	30.7	A collection of 509 human-human written dialogues and game logs for a collaborative building task in a Minecraft-based environment, where one player instructs another to build a structure.	Narayan-Chen et al., 2019
DialFRED	English	multimodal	text, audio, video	Household tasks (navigation and object manipulation)	Human-Agent	53K task-relevant questions and answers	N/A	DialFRED is a dialogue-enabled embodied instruction following benchmark that allows an agent to actively ask questions and use the information in the response to better complete household tasks. It is built by augmenting the ALFRED benchmark and includes a human-annotated dataset with 53K task-relevant questions and answers.	Gao et al., 2022
Dialog State Tracking Challenge 3 (DSTC3)	English	speech	text, audio	Tourist information (restaurants, pubs, coffee shops)	Human-System	2,275 dialogs, 17,677 turns	N/A	The third Dialog State Tracking Challenge (DSTC3) focused on evaluating the ability of trackers to generalize to new entities, such as new slots and values not present in the training data. The challenge involved human-computer dialogs in the tourist information domain, covering restaurants, pubs, and coffee shops in Cambridge, UK.	Henderson et al., 2014
Friends TV Show Emotion Corpus	English	text	text	TV Show Transcripts	Human-Human	12,606 utterances, 897 scenes, 97 episodes	14.05	A corpus comprising transcripts from the TV show Friends, annotated with seven emotions on consecutive utterances in multiparty dialogues.	Zahiri and Choi, 2017
Hazumi	Japanese	multimodal	text, audio, video, posture, physiological data	chit-chat (food, travel, etc.)	Human-WoZ	214 dialogues (15 to 20 minutes), 18,162 exchanges	84.9	A multimodal dialogue corpus with various manual annotations, including those provided by five third-party annotators as well as those given by the participants themselves. The corpus also includes physiological data.	Komatani and Okada, 2021
KokoroChat	Japanese	Text (role-play)	Text	Psychological counseling	Human-Human (trained counselor role-play)	6,589 dialogues	~91.2 utterances per dialogue	A high-quality, human-collected Japanese psychological counseling dialogue dataset where trained counselors simulate both client and counselor in one-hour text-based sessions, with detailed client feedback per session (20 rating items).	Qi et al., 2025
Switchboard Telephone Speech Corpus (Switchboard-1)	English	Speech (telephone conversations)	Audio, transcripts	Open-domain conversational speech	Human-Human	Approximately 2,400 dialogues (~260 hours of speech; ~3 million words)	N/A (average not given; calls last ~6 minutes each)	Spontaneous two-speaker telephone conversations across roughly 70 topics, fully transcribed and time-aligned, with speaker demographics and call metadata recorded for speech technology and linguistic research	Godfrey et al., 1992
CALLHOME American English Speech (LDC97S42)	English	Speech (telephone conversations)	Audio (2-channel μ-law at 8 kHz), with optional transcripts (LDC97T14)	Open-domain personal telephone conversations	Human-Human	120 dialogues (~30 minutes each; ~60 hours total)	N/A (unspecified average turns)	Unscripted telephone calls between native speakers, mostly family or friends, fully recorded and documented for ASR research.	Canavan et al., 1997
CALLFRIEND American English-Non-Southern Dialect (LDC96S46)	English	Speech (telephone conversations)	Audio (2-channel μ-law at 8 kHz)	Open-domain conversational speech	Human-Human	60 dialogues, each 5–30 minutes	N/A (not specified)	Unscripted telephone conversations between native speakers of non-Southern American English, with metadata such as speaker demographics and call quality, collected for language identification research	Canavan & Zipperlen, 1996
The HUMAINE Database	English/French/German	Multimodal	Video, audio, annotations	Emotional expressions (naturalistic and induced)	Human (spontaneous/emotional behaviors) – Data clips	50 annotated clips	N/A	A curated set of emotional clips captured in multiple modalities and systematically annotated to support affective computing research, with both naturalistic and induced emotion samples labeled at global and frame-level	Douglas-Cowie et al., 2007
Corpus of Spoken Professional American-English (CSPA)	English	Speech transcripts	Text (transcripts)	Professional domain: academic meetings and press conferences	Human-Human (various professional speakers)	~2 million words across two sub-corpora of ~1 million words each (17 files)	N/A	Transcripts of unscripted spoken interactions—mainly faculty council and committee meetings, and White House press conferences—minimally coded to retain hesitations and disfluencies.	Barlow, 2000
COLT – The Bergen Corpus of London Teenage Language	English	Speech (audio recordings with transcripts)	Audio, orthographic and prosodic transcripts, POS tagging	Spontaneous teenage talk (informal, conversational; London)	Human-Human (peer teenage conversations)	~500,000 words from recordings by 31 teenagers	Varies (3 to 39 turns per conversation)	Spontaneous conversations of 13–17-year-old London teenagers, recorded by teenage recruits using Walkman devices, then orthographically transcribed and POS-tagged for sociolinguistic and discourse analyses.	Stenström et al., 2002 (COLT project)
Dependency Dialogue Act Corpus	English	Text (multi-party dialogues)	Text transcripts with dialogue-act annotations (Dependency Dialogue Acts framework)	Classroom discussions, board games, and online game chat (multi-genre)	Human-Human multi-party interactions	33 dialogues, over 9,000 utterance units	N/A (not specified separately)	A dense annotation of multi-party conversational data across four genres (physics and engineering classroom discussions, board game interactions, and online game chat) using the Dependency Dialogue Acts framework, with double annotation and adjudication for high consistency.	Cai et al., 2025
British National Corpus (BNC)	English (British)	Mixed (written text and transcribed speech); not dialogue per se	Text (written samples, transcribed speech)	Multiple domains (e.g., newspapers, fiction, conversations, academic, letters)	Mixed participants (various genres of text and spontaneous spoken contributions)	~100 million words total, of which ~10 million are spoken (including conversation)	N/A	A large-scale balanced corpus of late-20th-century British English, encompassing both written texts and transcribed spoken data (including some conversation), intended for general-purpose linguistic research rather than dialogue research specifically.	Leech et al., 1990s
Idiap Wolf Corpus	English	Multimodal (audio-visual)	Audio, Video	Competitive role-playing game (Werewolf-style group interaction)	Human-Human (multi-party)	Undisclosed exact size (volunteers in role-playing sessions)	Varies (triadic or multi-party conversations in sessions)	Natural conversational data of volunteers engaged in a competitive role-playing game, captured in an audio-visual corpus to explore group behavior	Hung & Chittaranjan, 2010
Teams Corpus	English	Speech	Audio, Text, Video, Questionnaire data	Cooperative board-game conversation	Human-Human (multi-party, 3–4 participants)	Over 47 hours of recordings from 62 teams (213 participants)	Varies per session (game-based multi-party dialogue)	Audio, video, aligned transcripts, and questionnaire data collected from teams playing the cooperative board game Forbidden Island™, designed to study acoustic-prosodic and lexical entrainment in multi-party spoken dialogues	Litman et al., 2016
Critical Role Dungeons and Dragons Dataset (CRD3)	English	Text	Text	Open-ended role-playing game dialogue (Dungeons & Dragons)	Human-Human (multi-party, fixed group of players and a Dungeon Master)	159 episodes; 398,682 turns	High (varies per episode; dataset spans full gameplay episodes)	Transcribed unscripted live-streamed Dungeons & Dragons sessions featuring storytelling through collaborative dialogue; includes abstractive summaries mined from Fandom wiki	Rameshkumar & Bailey, 2020
Michigan Corpus of Academic Spoken English (MICASE)	English (American English)	Speech	Audio, Text	Academic spoken events (lectures, seminars, meetings, advising, study groups)	Human-Human	~1.8 million words (~200 hours across 152 speech events)	Varies by event (unspecified average)	Spoken academic interactions recorded at the University of Michigan across diverse academic contexts and departments, transcribed and annotated for linguistic study	Simpson-Vlach & Leicher, 2006
Canal9 Political Debate Corpus	French	Multimodal (Speech + Video)	Audio, Video, Text annotations	Political debates (public broadcast debates)	Human-Human (multi-party + moderator)	70 debates; ≈43 hours of recordings	Varies by debate (multi-party structure)	Public political debates recorded in broadcast studio settings and richly annotated for social interaction features, including speaker turns, agreement/disagreement, roles, shot segmentation, and speaker identity	Vinciarelli et al., 2009
Interview	English	Text	Text	News interview transcripts (media dialog)	Human-Human	≈ 105K conversations	Varies (not specified; multi-turn interviews)	Transcribed news interview dialogues gathered from media transcripts, annotated with speaker roles for each turn to support conversational modeling	Majumder et al., 2020
MediaSum	English	Text	Text	Media interviews from NPR and CNN	Human-Human	≈ 463.6K transcripts	Varies per interview (not specified)	Transcribed interviews from radio (NPR) and TV (CNN) with associated summaries or topic descriptions, making it a large-scale dataset for dialogue summarization	Zhu et al., 2021
Corpus of American Soap Operas (SOAP)	English	Text (script transcripts)	Text	Soap opera scripts (American television)	Human-Human (scripted dialogues)	~100 million words from over 22,000 transcripts	Varies per episode (not specified)	A vast compilation of transcripts from ten popular American soap operas (early 2000s), offering rich examples of everyday-styled, multi-party scripted dialogue for linguistic study
Serial Speakers	English	Multimodal (Speech + Video)	Audio (speech turns), Text (encrypted turns via subtitles), Video (shots)	TV serials (Breaking Bad, Game of Thrones, House of Cards)	Human-Human (multi-party dialogues in TV series)	155 episodes (exact word/turn counts not specified)	Varies per episode (multi-party scripted dialogues)	Annotated dataset of episodes from three popular American TV serials with speech-turn boundaries, speaker labels, scene and shot boundaries, recurring shots, and interacting speaker annotations; text content encrypted but recoverable via users’ own subtitle files	Bost et al., 2020
MEISD	English	Text, Speech, Vision (multimodal)	Text, Audio, Video	Multiple domains (TV-series dialogues)	Human-Human (multi-party)	1,000 dialogues (from 10 TV series)	Varies (multi-party dialogues; average not specified)	A balanced multimodal dialogue dataset annotated with multiple emotions, emotion intensities, and sentiment per utterance, collected from ten popular TV shows across genres, with textual, audio, and visual modalities for emotion and sentiment analysis.	Firdaus et al., 2020 (COLING)
NPS Chat Corpus	English	Text	Text (chat logs annotated)	Online chat / Internet-mediated communication	Human-Human (chat)	Not specified	Not specified	A chat corpus annotated with lexical (POS), syntactic, and discourse labels (chat dialog-act), intended to support statistical NLP applications like author profiling and entity identification.	Forsyth & Martell, 2007
Molweni	English	Text	Text (chat logs with questions and annotations)	Technical support chats (Ubuntu IRC)	Human-Human (multi-party)	10,000 dialogues, 88,303 utterances	~8.82	A multiparty dialogue-based MRC dataset with discourse dependency annotations (modified SDRT) and both answerable and unanswerable questions, derived from Ubuntu IRC logs.	Li et al., 2020
Pushshift Reddit Dataset	English	Text	Text (Reddit submissions and comments)	Open-domain social media (Reddit)	Human-Human (multi-participant threads)	~651M submissions, ~5.6B comments (2005–2019)	Varies (thread-level discussions; average not specified)	A large, continuously updated repository of Reddit data—historical submissions and comments—provided via dumps and an API for research, archiving, and social media analysis.	Baumgartner et al., 2020
Reddit Domestic Abuse Dataset	English	Text	Text (Reddit posts and comments)	Domestic abuse discussions on social media	Human-Human (submissions and responses)	1,336 abuse posts; 17,020 non-abuse posts	Varies (thread-level posts; average not specified)	A classification dataset of Reddit submissions labeled as abuse (e.g., “domestic-violence”, “survivors-of-abuse”) versus non-abuse (e.g., “advice”, “anger”, “casual-conversation”) to support detection of domestic abuse discourse online.	Schrading et al., 2015 (EMNLP)
ISL Meeting Speech Part 1 (ISL-MC1)	English	Speech (audio recordings of meetings)	Audio (multi-channel WAV files); Transcripts (orthographic text)	Meeting domain (natural and artificial meetings across various scenarios)	Human-Human (multi-participant meetings)	18 meetings, ~10 hours of speech (105 audio files)	Varies—average meeting duration ~34 minutes; participants ~5 per meeting	Multi-channel microphone recordings of real and staged meetings collected at CMU (2000–2001), with orthographic transcriptions, speaker turn timestamps, and annotations of spontaneous speech phenomena and disfluencies.	Burger et al., 2002 (ICSLP)
CoMuMDR: Code-mixed Multi-modal Multi-domain corpus for Discourse Parsing in Conversations	Hindi + English (code-mixed: Hinglish)	Multimodal	Audio, Text (transcriptions)	Multiple customer-support domains (e-commerce, pharmaceutical, stock broker applications, e-marketplace, education)	Human-Human (two-party call-center dialogues)	799 dialogues, 8,811 utterances, ~79,867 words	~11.03 utterances per dialogue	A real-world, code-mixed (Hindi/English) multimodal corpus of customer call-center interactions across multiple domains, annotated at the span level with nine discourse relations, forming directed discourse graphs—reflecting genuine noisy ASR and diarization conditions.	Shukla et al., 2025 (Findings ACL)
KwaiChat	Multiple (multilingual: 4 languages)	Multimodal (video-driven dialogue)	Video, text dialogue content (comments, replies), metadata (domains, topics)	Multimedia discussions: video-based interactions around shared videos	Human-Human (multi-participant dialogues via video comments/replies)	93,209 videos, 246,080 dialogues	N/A	A massive dataset of human-to-human, video-driven multicultural multi-participant dialogues collected via a video-sharing platform, annotated across diverse dialogue types, domains, languages, and topics—designed to support multilingual dialogue generation over rich video context.	Shi et al., 2025