DIHARD participants may use any data to train their system, whether publicly available or not, with the exception of the following previously released LDC corpora, from which portions of the evaluation set are drawn:
Speech samples with diarization and reference speech segmentation will be distributed to registered participants and may be used for any purpose, including system development or training. These samples consist of approximately 19 hours worth of 5-10 minute chunks drawn from the following domains:
Previously unexposed annotation of oral arguments from the 2001 term of the U.S. Supreme Court that were transcribed and manually word-aligned as part of the OYEZ project. The original recordings were made using individual table-mounted microphones, one for each participant, which could be switched on and off by the speakers as appropriate. The outputs of these microphones were summed and recorded on a single-channel reel-to-reel analogue tape recorder. Those tapes were later digitized and made available by Jerry Goldman of OYEZ.
Previously unexposed recordings of Autism Diagnostic Observation Schedule (ADOS) interviews conducted at the Center for Autism Research (CAR) at the Children's Hospital of Philadelphia (CHOP). ADOS is a semi-structured interview in which clinicians attempt to elicit language that differentiates children with Autism Spectrum Disorders from those without (e.g., “What does being a friend mean to you?”). All interviews were conducted by CAR with audio recorded from a video camera mounted on a wall approximately 12 feet from the location inside the room where the interview was conducted.
Note that in order to publish this data, it had to be de-identified by applying a low-pass filter to regions identified as containing personal identifying information (PII). Pitch information in these regions is still recoverable, but the amplitude levels have been reduced relative to the original signal. Filtering was done with a 10th order Butterworth filter with a passband of 0 to 400 Hz. To avoid abrupt transitions in the resulting waveform, the effect of the filter was gradually faded in and out at the beginning and end of the regions using a ramp of 40 ms.
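The fade-in and fade-out described above can be sketched as a per-sample mixing weight between the original and the low-pass filtered signal. This is only an illustration under stated assumptions: the ramp shape (linear), its placement just inside the region boundaries, and the function names `fade_weights` and `blend` are hypothetical and not taken from the actual de-identification tooling. The Butterworth filtering itself is not implemented here; the filtered signal is assumed to have been computed separately.

```python
def fade_weights(n_samples, start, end, sample_rate=16000, ramp_ms=40.0):
    """Return per-sample mixing weights in [0, 1] for one PII region.

    A weight of 0 keeps the original sample; a weight of 1 uses the
    filtered sample. `start` and `end` are sample indices (end exclusive).
    Assumption: linear ramps of `ramp_ms` placed inside the region.
    """
    ramp = int(round(sample_rate * ramp_ms / 1000.0))
    # Never let the two ramps overlap in very short regions.
    ramp = min(ramp, (end - start) // 2)
    weights = [0.0] * n_samples
    for i in range(start, end):
        if ramp > 0:
            rise = (i - start + 1) / ramp   # fade in at the region onset
            fall = (end - i) / ramp         # fade out at the region offset
            weights[i] = min(1.0, rise, fall)
        else:
            weights[i] = 1.0
    return weights


def blend(original, filtered, weights):
    """Mix the original and filtered signals sample by sample."""
    return [(1.0 - w) * x + w * y
            for x, y, w in zip(original, filtered, weights)]
```

Outside the region the weights are zero, so the original samples pass through unchanged; in the interior of the region only the filtered signal is heard.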
Previously unexposed recordings of YouthPoint, a late 1970s radio program run by students at the University of Pennsylvania consisting of student-led interviews with opinion leaders of the era (e.g., Ann Landers, Mark Hamill, Buckminster Fuller, and Isaac Asimov). The recordings were conducted in a studio on open reel tapes and later digitized at LDC.
Previously exposed recordings of subjects involved in map tasks drawn from the DCIEM Map Task Corpus (LDC96S38). Each map task session contains two speakers sitting opposite one another at a table. Each speaker has a map visible only to him and a designated role as either “Leader” or “Follower.” The Leader has a route marked on his map and is tasked with communicating this route to the Follower so that he may precisely reproduce it on his own map. Though each speaker was recorded on a separate channel via a close-talking microphone, these have been mixed together for the DIHARD releases.
Previously exposed recordings of sociolinguistic interviews drawn from the SLX Corpus of Classic Sociolinguistic Interviews (LDC2003T15). These are field recordings conducted during the 1960s and 1970s by Bill Labov and his students in various locations within the Americas and the United Kingdom.
Previously exposed recordings of multiparty (3 to 7 participant) meetings drawn from the 2004 Spring NIST Rich Transcription (RT-04S) dev (LDC2007S11) and eval (LDC2007S12) releases. Meetings were recorded at multiple sites (ICSI, NIST, CMU, and LDC), each with a different microphone setup. For DIHARD, a single channel is distributed for each meeting, corresponding to the RT-04S single distant microphone (SDM) condition. Audio files have been trimmed from the original recordings to the 11 minute scoring regions specified in the RT-04S un-partitioned evaluation map (UEM) files.
**NOTE** In some cases the scoring region onsets/offsets from the original sources were found to bisect a speech segment. In such cases, the onset or offset was adjusted to fall in silence adjacent to the relevant turn.
Previously unexposed single-speaker, amateur recordings of audiobooks selected from LibriVox. In this case, the recordings are unexposed in the sense that while the audio and text these segments were selected from are obviously online and available from LibriVox, they have not previously been released as part of a speech recognition corpus. In particular, care was taken to ensure that the chapters and speakers drawn from were not present in LibriSpeech.
Previously unexposed annotations of web video collected as part of the Video Annotation for Speech Technologies (VAST) project. This domain is expected to be particularly challenging as the videos present a diverse set of topics and recording conditions. Unlike the other sources, which contain only English speech, though not necessarily from native speakers, the VAST selections contain both English and Mandarin speech with half the selections coming from monolingual English videos and half from monolingual Mandarin videos.
All samples will be distributed as 16 kHz, mono-channel FLAC files.
The evaluation set consists of approximately 21 hours worth of 5-10 minute speech samples drawn from the same domains and sources as the development set with the following exceptions:
Instead of SLX, previously exposed sociolinguistic interviews recorded as part of MIXER6 (LDC2013S03) are used. While these recordings have not previously been released with diarization or SAD, the audio data was released as part of LDC2013S03, excerpts of which were used in the NIST SRE10 and SRE12 evaluation sets. The released audio comes from microphone five, a PZM microphone.
For the meeting speech domain, previously unexposed recordings of multiparty (3 to 6 participant) meetings conducted at LDC in the Fall of 2001 as part of ROAR are used. All meetings were recorded in the same room, though with different microphone setups. A single centrally located distant microphone is provided for each meeting.
The evaluation set includes a novel domain, unseen in the development set, consisting of previously unexposed recordings from LDC's Conversations in Restaurants (CIR) collection. These recordings consist of conversations among 3 to 6 speakers, all LDC employees, seated at the same table at a restaurant on the University of Pennsylvania campus. All recordings were made using binaural microphones mounted on either side of one speaker's head, whose outputs were then mixed down to one channel.
The domain from which each sample is drawn will not be provided during the evaluation period, but will be revealed at the conclusion of the evaluation.
Where transcription exists and forced alignment was feasible, initial segment boundaries were produced by refining the human-marked boundaries with forced alignment: turn-initial and turn-final silence was trimmed, and turns were split on pauses longer than 200 ms, where for a given speaker a pause is defined as any segment in which that speaker is not producing a vocalization. This includes breaths, but not coughs, laughs, or lipsmacks. In some cases, non-speech vocal noises were encountered during annotation that could not be accurately assigned to a speaker. All such segments have been omitted. Ideally, this segmentation was then checked and corrected by human annotators using a tool equipped with a spectrogram display. Where forced alignment was not possible, manually assigned segment boundaries were used. The reference speech-activity segmentation (SAD) was then derived from the diarization speaker-segment boundaries by merging overlapping segments and removing speaker identification.
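Two of the steps described above, splitting a speaker's speech on pauses longer than 200 ms and deriving SAD by merging overlapping speaker segments, can be sketched as follows. This is a minimal illustration, not the tooling actually used: the function names, the tuple layouts, and the choice to merge segments that merely touch are all assumptions.

```python
def split_on_pauses(intervals, max_pause=0.2):
    """Group one speaker's vocalization intervals into segments.

    `intervals` is a list of (onset, offset) pairs in seconds, sorted by
    onset (e.g. word intervals from forced alignment; an assumption).
    A new segment starts whenever the gap exceeds `max_pause` seconds.
    """
    segments = []
    for onset, offset in intervals:
        if segments and onset - segments[-1][1] <= max_pause:
            segments[-1][1] = max(segments[-1][1], offset)
        else:
            segments.append([onset, offset])
    return [tuple(seg) for seg in segments]


def derive_sad(turns):
    """Merge overlapping speaker turns into speaker-independent segments.

    `turns` is an iterable of (speaker, onset, offset) triples. Speaker
    labels are discarded. Assumption: turns that touch are also merged.
    """
    merged = []
    for onset, offset in sorted((on, off) for _, on, off in turns):
        if merged and onset <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], offset)
        else:
            merged.append([onset, offset])
    return [tuple(seg) for seg in merged]
```

For example, a turn that lies entirely inside another speaker's turn contributes nothing new to the SAD output, since its interval is absorbed by the enclosing one.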
Because this was an unfunded pilot project, created under time pressure by volunteers, the full three-step workflow (transcription, alignment, checking and correction by human annotators) could not be implemented for all sources. The situation for each source is as follows:
For the selections from “Autism Diagnostic Observation Schedule (ADOS)” interviews, the full workflow was implemented.
For the selections from “Conversations In Restaurants (CIR)”, segments were derived from a careful turn-level transcription, without alignment and checking.
For the selections from the (Canadian) “Defence and Civil Institute of Environmental Medicine (DCIEM)” map task corpus, the full workflow was implemented.
For the selections from LibriVox audiobooks, the full workflow was implemented.
For the selections from sociolinguistic interviews conducted as part of MIXER 6, the full workflow was implemented.
For the selections from meeting data collected at LDC in 2001 as part of the ROAR project, the full workflow was implemented.
For the selections from 2001 U.S. Supreme Court oral arguments, the full workflow was implemented.
For the selections from child language recordings collected as part of the SEEDLingS project, segments were derived from manual segmentation done at LDC (with not entirely consistent guidelines). The evaluation set received an extra QC pass not performed for the development set and, consequently, should be of higher quality, though still imperfect.
For the selections from the “Video Annotation for Speech Technologies (VAST)” project, segments were derived from a careful turn-level transcription performed for that project, without additional alignment and checking.
For the selections from YouthPoint radio interviews, the full workflow was implemented.
For the selections of meeting speech from LDC2007S11 and LDC2007S12, segments were derived from the original releases' RTTM files without any checking. These files have known issues such as overlapping turns, untranscribed speech, and speech that is inaudible on the distant microphones, which were not corrected.
For the selections of sociolinguistic interviews drawn from LDC2003T15, the full workflow was implemented.
We should also note that in cases where recordings from individual microphones were available, transcription and segmentation may have been done separately for each speaker using their individual microphone. This means that the reference RTTM may contain some segments that are inaudible, or nearly so, in the released single-channel FLAC file, which may be taken from a single distant microphone. This affects MIXER6, ROAR, and (in the Dev set) RT-04S.
For each recording, speech segmentation will be provided via an HTK label file listing one segment per line, each line consisting of three space-delimited fields:
0.10 1.41 speech
1.98 3.44 speech
5.0 7.52 speech
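A label file of this kind can be read with a few lines of code. The sketch below assumes that the three fields are onset, offset, and label, with times in seconds, as the example suggests; the function name `parse_lab` is hypothetical.

```python
def parse_lab(text):
    """Parse label-file content into (onset, offset, label) triples.

    Assumption: each non-empty line holds exactly three space-delimited
    fields, with onset and offset given in seconds.
    """
    segments = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        onset, offset, label = line.split()
        segments.append((float(onset), float(offset), label))
    return segments


example = "0.10 1.41 speech\n1.98 3.44 speech\n5.0 7.52 speech\n"
segments = parse_lab(example)
```

A line with more or fewer than three fields raises a `ValueError` at the unpacking step, which makes malformed files easy to spot.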
Following prior NIST RT evaluations, diarization for recordings will be provided using Rich Transcription Time Marked (RTTM) files. RTTM files are space-separated text files containing one turn per line, each line containing ten fields:
SPEAKER CMU_20020319-1400_d01_NONE 1 130.430000 2.350 <NA> <NA> juliet <NA> <NA>
SPEAKER CMU_20020319-1400_d01_NONE 1 157.610000 3.060 <NA> <NA> tbc <NA> <NA>
SPEAKER CMU_20020319-1400_d01_NONE 1 130.490000 0.450 <NA> <NA> chek <NA> <NA>
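Reading such a file amounts to splitting each line on whitespace and picking out the fields of interest. The sketch below assumes the usual RTTM field order (type, file ID, channel, turn onset, turn duration, orthography, speaker type, speaker name, confidence, lookahead) and that the file ID is a single token containing no whitespace; the names `Turn` and `parse_rttm_line` are hypothetical.

```python
from collections import namedtuple

# Only the fields that carry information in the example are kept.
Turn = namedtuple("Turn", "file_id channel onset duration speaker")


def parse_rttm_line(line):
    """Parse one ten-field RTTM line into a Turn record."""
    fields = line.split()
    if len(fields) != 10:
        raise ValueError("expected 10 fields, got %d" % len(fields))
    return Turn(
        file_id=fields[1],
        channel=fields[2],
        onset=float(fields[3]),      # turn onset in seconds
        duration=float(fields[4]),   # turn duration in seconds
        speaker=fields[7],
    )


line = ("SPEAKER CMU_20020319-1400_d01_NONE 1 130.430000 2.350 "
        "<NA> <NA> juliet <NA> <NA>")
turn = parse_rttm_line(line)
```

The turn offset is not stored in the file; it is obtained as `turn.onset + turn.duration`.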
This identifies a (non-exhaustive) list of publicly available corpora suitable for system training.
Corpora containing meeting speech
Conversational telephone speech (CTS) corpora
Proper interpretation of the evaluation results requires thorough documentation of each system. Consequently, at the end of the evaluation researchers must submit a full description of their system with sufficient detail for a fellow researcher to understand the approach and data/computational requirements. An acceptable system description should include the following information:
A short (a few sentences) high-level description of the system.
This section should describe the data used for training including both volumes and sources. For LDC or ELRA corpora, catalog ids should be supplied. For other publicly available corpora (e.g., AMI) a link should be provided. In cases where a non-publicly available corpus is used, it should be described in sufficient detail to get the gist of its composition. If the system is composed of multiple components and different components are trained using different resources, there should be an accompanying description of which resources were used for which components.
Each component of the system should be described in sufficient detail that another researcher would be able to reimplement it. Standard components may be described briefly or omitted entirely (e.g., there is no need to list the standard equations underlying an LSTM or GRU). If hyperparameter tuning was performed, there should be a detailed description of both the tuning process and the final hyperparameters arrived at.
We suggest including subsections for each major phase in the system. Suggested subsections:
System developers should report the hardware requirements at both training and test time:
System execution times to process a single 10-minute recording must be reported.