## The Second DIHARD Speech Diarization Challenge

DIHARD is an annual challenge focusing on “hard” diarization; that is, speech diarization for difficult domains, including, but not limited to:

• clinical interviews
• extended child language acquisition recordings
• "speech in the wild" (e.g., recordings in restaurants)

Because the performance of a diarization system is highly dependent on the quality of the speech activity detection (SAD) system used, the challenge will have several tracks, including diarization from scratch and diarization starting from gold-standard SAD.

Following the success of the First DIHARD Challenge, we are pleased to announce the Second DIHARD Challenge (DIHARD II).

DIHARD II will expand on DIHARD I by improving the data sets and introducing a new multichannel diarization track using data contributed by the organizers of CHiME (Speech Separation and Recognition Challenge).

The challenge will run from February 14, 2019 through July 1, 2019, and results will be presented at a special session at Interspeech 2019 in Graz, Austria.

## Important dates

| Event | Date |
| --- | --- |
| Registration period | January 30 – March 1, 2019 |
| Launch: release of DIHARD II development data + scoring code | February 18, 2019 |
| Release of DIHARD II eval data + baselines + scoring server | February 28, 2019 |
| | March 29, 2019 |
| | April 5, 2019 |
| End of challenge / final Interspeech deadline | July 1, 2019 |
| Interspeech 2019 special session | September 15-19, 2019 |

The deadline for submission of final system outputs is midnight on July 1, 2019.

## Overview

While state-of-the-art diarization systems perform remarkably well for some domains (e.g., conversational telephone speech such as CallHome), as was discovered at the 2017 JSALT Summer Workshop at CMU, this success does not transfer to more challenging corpora such as child language recordings, clinical interviews, speech in reverberant environments, web video, and “speech in the wild” (e.g., recordings from wearables in an outdoor or restaurant setting). In particular, current approaches:

• fare poorly at estimating the number of speakers (e.g., monologues are frequently broken into multiple speakers)
• fail to work for short utterances (<1 second), which is particularly problematic for domains such as clinical interviews, which contain many short segments of high information content
• deal poorly with child speech and pathological speech (e.g., due to neurodegenerative diseases)
• are not robust to materials with large amounts of overlapping speech or dynamic environmental noise with some speechlike characteristics

The DIHARD Challenge series aims to provide a common ground for the development and evaluation of speech diarization routines that are robust to these challenges.

### Tracks

For DIHARD II we are supporting four tracks spanning two input conditions and two SAD conditions. The input conditions are:

• Single channel -- Diarization is performed from a single channel which, depending on the recording source, may be a single channel from a single distant microphone, a single channel from a microphone array, a mix of head-mounted or array microphones, or a mix of binaural microphones.
• Multichannel -- Diarization is performed from multiple channels from a single distant microphone array. Participants are free to use as few or as many of these channels as they wish.

The SAD conditions are:

• Reference SAD -- Participants are provided with a reference speech segmentation, generated by merging speaker turns in the reference diarization.
• System SAD -- Participants are provided with just the raw audio.
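The merge used in the reference SAD condition — collapsing speaker turns from the reference diarization into speech segments — can be sketched as follows (an illustrative snippet; the function name and `collar` parameter are ours, not part of the challenge tooling):

```python
def merge_turns(turns, collar=0.0):
    """Merge (onset, duration) speaker turns into speech segments.

    Overlapping or abutting turns (within `collar` seconds) collapse
    into a single segment, mirroring how a reference SAD may be
    derived from a reference diarization.
    """
    segments = []
    for onset, dur in sorted(turns):
        offset = onset + dur
        if segments and onset <= segments[-1][1] + collar:
            # Turn overlaps (or nearly abuts) the previous segment: extend it.
            segments[-1][1] = max(segments[-1][1], offset)
        else:
            segments.append([onset, offset])
    return [tuple(s) for s in segments]
```

For example, two overlapping turns `(0.0, 1.0)` and `(0.5, 1.0)` merge into a single speech segment `(0.0, 1.5)`.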

These then yield four tracks:

#### Single channel tracks

• Track 1: Diarization from reference SAD using a single channel
• Track 2: Diarization from system SAD using a single channel

#### Multichannel tracks

• Track 3: Diarization from reference SAD using multiple channels
• Track 4: Diarization from system SAD using multiple channels

Tracks 1 and 2 are identical to tracks 1 and 2 in DIHARD I and use the same data, though with improved annotation and additional development audio (see the Data section for more details). These tracks DO NOT contain any CHiME 5 data.

Tracks 3 and 4 are new this year and consist exclusively of multi-person dinner party conversations taken from the CHiME 5 corpus.

All participants MUST register for at least one of Track 1 or Track 3 (diarization from reference SAD). Participation in tracks 2 and 4 is optional.

## 1. Training data

DIHARD participants may use any data to train their system, whether publicly available or not, with the exception of the following previously released LDC corpora, from which portions of the evaluation set are drawn:

• HCRC Map Task Corpus (LDC93S12)
• DCIEM Map Task Corpus (LDC96S38)
• MIXER6 Speech (LDC2013S03)
• First DIHARD Challenge (DIHARD 2018) data sets

Portions of MIXER6 have previously been excerpted for use in the NIST SRE10 and SRE12 evaluation sets, which also may not be used.

All training data should be thoroughly documented in the system description document at the end of the challenge. Please also see the provided list of suggested training corpora.

## 2. Development data

Speech samples with diarization and reference speech segmentation will be distributed to registered participants and may be used for any purpose, including system development or training. These samples consist of approximately 19 hours' worth of 5-10 minute chunks drawn from the following domains:

• Child language acquisition recordings
• Previously unexposed recordings of language acquisition in 6-to-18 month olds. The data was collected in the home using a LENA recording device as part of SEEDLingS.

• Supreme Court oral arguments
• Previously unexposed annotation of oral arguments from the 2001 term of the U.S. Supreme Court that were transcribed and manually word-aligned as part of the OYEZ project. The original recordings were made using individual table-mounted microphones, one for each participant, which could be switched on and off by the speakers as appropriate. The outputs of these microphones were summed and recorded on a single-channel reel-to-reel analogue tape recorder. Those tapes were later digitized and made available by Jerry Goldman of OYEZ.

• Clinical interviews
• Previously unexposed recordings of Autism Diagnostic Observation Schedule (ADOS) interviews conducted at the Center for Autism Research (CAR) at the Children's Hospital of Philadelphia (CHOP). ADOS is a semi-structured interview in which clinicians attempt to elicit language that differentiates children with Autism Spectrum Disorders from those without (e.g., “What does being a friend mean to you?”). All interviews were conducted by CAR with audio recorded from a video camera mounted on a wall approximately 12 feet from the location inside the room where the interview was conducted.

Note that in order to publish this data, it had to be de-identified by applying a low-pass filter to regions identified as containing personal identifying information (PII). Pitch information in these regions is still recoverable, but the amplitude levels have been reduced relative to the original signal. Filtering was done with a 10th order Butterworth filter with a passband of 0 to 400 Hz. To avoid abrupt transitions in the resulting waveform, the effect of the filter was gradually faded in and out at the beginning and end of the regions using a ramp of 40 ms.
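The de-identification procedure described above can be approximated with standard signal-processing tools. The sketch below is illustrative only — the actual filtering was performed by LDC, and the SciPy-based implementation and function name are our assumptions, not the tool that was used:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def deidentify_region(signal, sr, onset, offset):
    """Low-pass filter a PII region (10th-order Butterworth, 0-400 Hz
    passband), fading the filter's effect in and out over 40 ms ramps.

    The region is assumed to be longer than two ramps; audio outside
    the region is left untouched.
    """
    sos = butter(10, 400, btype="low", fs=sr, output="sos")
    filtered = sosfiltfilt(sos, signal)
    i0, i1 = int(onset * sr), int(offset * sr)
    ramp = int(0.040 * sr)  # 40 ms ramp, per the description above
    mix = np.ones(i1 - i0)  # 1.0 = fully filtered
    mix[:ramp] = np.linspace(0.0, 1.0, ramp)   # fade filter in
    mix[-ramp:] = np.linspace(1.0, 0.0, ramp)  # fade filter out
    out = signal.copy()
    out[i0:i1] = mix * filtered[i0:i1] + (1.0 - mix) * signal[i0:i1]
    return out
```

Note that energy above 400 Hz inside the region is strongly attenuated, while pitch (fundamental frequency) information below the cutoff survives, as the text states.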

• Radio interviews
• Previously unexposed recordings of YouthPoint, a late 1970s radio program run by students at the University of Pennsylvania, consisting of student-led interviews with opinion leaders of the era (e.g., Ann Landers, Mark Hamill, Buckminster Fuller, and Isaac Asimov). The recordings were conducted in a studio on open reel tapes and later digitized at LDC.

• Map tasks
• Previously exposed recordings of subjects involved in map tasks drawn from the DCIEM Map Task Corpus (LDC96S38). Each map task session contains two speakers sitting opposite one another at a table. Each speaker has a map visible only to him and a designated role as either “Leader” or “Follower.” The Leader has a route marked on his map and is tasked with communicating this route to the Follower so that he may precisely reproduce it on his own map. Though each speaker was recorded on a separate channel via a close-talking microphone, these have been mixed together for the DIHARD releases.

• Sociolinguistic interviews
• Previously exposed recordings of sociolinguistic interviews drawn from the SLX Corpus of Classic Sociolinguistic Interviews (LDC2003T15). These are field recordings conducted during the 1960s and 1970s by Bill Labov and his students in various locations within the Americas and the United Kingdom.

• Meeting speech
• Previously exposed recordings of multiparty (3 to 7 participant) meetings drawn from the 2004 Spring NIST Rich Transcription (RT-04S) dev (LDC2007S11) and eval (LDC2007S12) releases. Meetings were recorded at multiple sites (ICSI, NIST, CMU, and LDC), each with a different microphone setup. For DIHARD, a single channel is distributed for each meeting, corresponding to the RT-04S single distant microphone (SDM) condition. Audio files have been trimmed from the original recordings to the 11 minute scoring regions specified in the RT-04S un-partitioned evaluation map (UEM) files.

NOTE: In some cases the scoring region onsets/offsets from the original sources were found to bisect a speech segment. In such cases, the onset or offset was adjusted to fall in silence adjacent to the relevant turn.

• Audiobooks
• Previously unexposed single-speaker, amateur recordings of audiobooks selected from LibriVox. In this case, the recordings are unexposed in the sense that while the audio and text these segments were selected from are obviously online and available from LibriVox, they have not previously been released as part of a speech recognition corpus. In particular, care was taken to ensure that the chapters and speakers drawn from were not present in LibriSpeech.

• Web video
• Previously unexposed annotations of web video collected as part of the Video Annotation for Speech Technologies (VAST) project. This domain is expected to be particularly challenging as the videos present a diverse set of topics and recording conditions. Unlike the other sources, which contain only English speech, though not necessarily from native speakers, the VAST selections contain both English and Mandarin speech, with half the selections coming from monolingual English videos and half from monolingual Mandarin videos.

All samples will be distributed as 16 kHz, mono-channel FLAC files.

## 3. Evaluation data

The evaluation set consists of approximately 21 hours' worth of 5-10 minute speech samples drawn from the same domains and sources as the development set, with the following exceptions:

• Sociolinguistic interviews
• Instead of SLX, previously exposed sociolinguistic interviews recorded as part of MIXER6 (LDC2013S03) are used. While these recordings have not previously been released with diarization or SAD, the audio data was released as part of LDC2013S03, excerpts of which were used in the NIST SRE10 and SRE12 evaluation sets. The released audio comes from microphone five, a PZM microphone.

• Meeting speech
• For the meeting speech domain, previously unexposed recordings of multiparty (3 to 6 participant) meetings conducted at LDC in the Fall of 2001 as part of ROAR are used. All meetings were recorded in the same room, though with different microphone setups. A single centrally located distant microphone is provided for each meeting.

• Restaurant conversation
• The evaluation set includes a novel domain, unseen in the development set, consisting of previously unexposed recordings from LDC's Conversations in Restaurants (CIR) collection. These recordings consist of conversations between 3 to 6 speakers, all LDC employees, seated at the same table at a restaurant on the University of Pennsylvania campus. All recordings were conducted using binaural microphones mounted on either side of one speaker's head, whose outputs were then mixed down to one channel.

The domain from which each sample is drawn is not provided in the annotations, and should not be used if known.

## 4. Segmentation

One of DIHARD II’s goals was to improve the quality and consistency of the segmentation, especially for those sources that were identified as problematic in DIHARD I. With the following exceptions, all sources received a QC pass by an LDC annotator who performed segmentation using a spectrogram and the DIHARD II guidelines:

• RT-04S
• For the selections of meeting speech from LDC2007S11 and LDC2007S12, segments were derived from the original releases' RTTM files without any checking. These files have known issues, such as overlapping turns, untranscribed speech, and speech that is inaudible on the distant microphones, which were not corrected.

• CHiME 5
• Due to time constraints, only the CHiME 5 eval set was re-annotated for DIHARD II. All CHiME 5 dev and training segment boundaries are derived from the turn boundaries established during the original CHiME 5 transcription process. Additionally, as all CHiME 5 segment boundaries were established by performing annotation on the binaural microphones, then projecting these back onto the arrays, there may be errors in the segmentation in places where the simple cross-correlation process used to do this projection was unable to handle the drift.

We should also note that in cases where recordings from individual microphones were available, transcription and segmentation may have been done separately for each speaker using their individual microphone. This means that the reference RTTM may contain some segments that are inaudible, or nearly so, in the released single-channel FLAC file, which may be taken from a single distant microphone. This affects MIXER6, ROAR, and (in the dev set) RT-04S.

## 5. File formats

For each recording, speech segmentation will be provided via an HTK label file listing one segment per line, each line consisting of three space-delimited fields:

• segment onset in seconds from beginning of recording
• segment offset in seconds from beginning of recording
• segment label (always “speech”)

For example:


```
0.10  1.41  speech
1.98  3.44  speech
5.0   7.52  speech
```
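A label file in this format can be read with a few lines of Python (an illustrative helper; the function name is ours, not part of the challenge tooling):

```python
def read_label_file(path):
    """Parse an HTK-style label file into (onset, offset, label) tuples.

    Each non-empty line holds three space-delimited fields: onset and
    offset in seconds, then the segment label (always "speech").
    """
    segments = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            onset, offset, label = line.split()
            segments.append((float(onset), float(offset), label))
    return segments
```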


Following prior NIST RT evaluations, diarization for recordings will be provided using Rich Transcription Time Marked (RTTM) files. RTTM files are space-separated text files containing one turn per line, each line containing ten fields:

• Type – segment type; should always be “SPEAKER”

• File ID – file name; basename of the recording minus extension (e.g., “rec1_a”)

• Channel ID – channel (1-indexed) that turn is on; should always be “1”

• Turn Onset – onset of turn in seconds from beginning of recording

• Turn Duration – duration of turn in seconds

• Orthography Field – should always be “<NA>”

• Speaker Type – should always be “<NA>”

• Speaker Name – name of speaker of turn; should be unique within scope of each file

• Confidence Score – system confidence (probability) that information is correct; should always be “<NA>”

• Signal Lookahead Time – should always be “<NA>”

For instance:

```
SPEAKER CMU_20020319-1400_d01_NONE 1 130.430000 2.350 <NA> <NA> juliet <NA> <NA>
SPEAKER CMU_20020319-1400_d01_NONE 1 157.610000 3.060 <NA> <NA> tbc <NA> <NA>
SPEAKER CMU_20020319-1400_d01_NONE 1 130.490000 0.450 <NA> <NA> chek <NA> <NA>
```
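The ten-field SPEAKER records described above are straightforward to read and write programmatically. The helpers below are a minimal sketch (the names `Turn`, `read_rttm`, and `write_rttm` are ours, not official challenge tooling):

```python
from collections import namedtuple

# One diarization turn: file ID, channel, onset/duration in seconds, speaker.
Turn = namedtuple("Turn", ["file_id", "channel", "onset", "duration", "speaker"])

def read_rttm(lines):
    """Parse SPEAKER lines into Turn records, skipping other record types."""
    turns = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] == "SPEAKER":
            # Fields: 1=File ID, 2=Channel, 3=Onset, 4=Duration, 7=Speaker Name.
            turns.append(Turn(fields[1], int(fields[2]), float(fields[3]),
                              float(fields[4]), fields[7]))
    return turns

def write_rttm(turns):
    """Format Turn records back into RTTM SPEAKER lines."""
    return ["SPEAKER {t.file_id} {t.channel} {t.onset:.6f} {t.duration:.3f} "
            "<NA> <NA> {t.speaker} <NA> <NA>".format(t=t) for t in turns]
```

A system submission would simply be the output of `write_rttm` for each evaluation recording, one file per recording.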


## Data Resources for Training

This is a (non-exhaustive) list of publicly available corpora suitable for system training.

### Corpora containing meeting speech

#### LDC corpora

• ICSI Meeting Speech (LDC2004S02)

• ICSI Meeting Transcripts (LDC2004T04)

• ISL Meeting Speech Part 1 (LDC2004S05)

• ISL Meeting Transcripts Part 1 (LDC2004T10)

• NIST Meeting Pilot Corpus Speech (LDC2004S09)

• NIST Meeting Pilot Corpus Transcripts and Metadata (LDC2004T13)

• 2004 Spring NIST Rich Transcription (RT-04S) Development Data (LDC2007S11)

• 2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data (LDC2007S12)

• 2006 NIST Spoken Term Detection Development Set (LDC2011S02)

• 2006 NIST Spoken Term Detection Evaluation Set (LDC2011S03)

• 2005 Spring NIST Rich Transcription (RT-05S) Evaluation Set (LDC2011S06)

#### Non-LDC corpora

### Conversational telephone speech (CTS) corpora

#### LDC corpora

• CALLHOME Mandarin Chinese Speech (LDC96S34)

• CALLHOME Spanish Speech (LDC96S35)

• CALLHOME Japanese Speech (LDC96S37)

• CALLHOME Mandarin Chinese Transcripts (LDC96T16)

• CALLHOME Spanish Transcripts (LDC96T17)

• CALLHOME Japanese Transcripts (LDC96T18)

• CALLHOME American English Speech (LDC97S42)

• CALLHOME German Speech (LDC97S43)

• CALLHOME Egyptian Arabic Speech (LDC97S45)

• CALLHOME American English Transcripts (LDC97T14)

• CALLHOME German Transcripts (LDC97T15)

• CALLHOME Egyptian Arabic Transcripts (LDC97T19)

• CALLHOME Egyptian Arabic Speech Supplement (LDC2002S37)

• CALLHOME Egyptian Arabic Transcripts Supplement (LDC2002T38)

• Switchboard-1 Release 2 (LDC97S62)

• Fisher English Training Speech Part 1 Speech (LDC2004S13)

• Fisher English Training Speech Part 1 Transcripts (LDC2004T19)

• Arabic CTS Levantine Fisher Training Data Set 3, Speech (LDC2005S07)

• Fisher English Training Part 2, Speech (LDC2005S13)

• Arabic CTS Levantine Fisher Training Data Set 3, Transcripts (LDC2005T03)

• Fisher English Training Part 2, Transcripts (LDC2005T19)

• Fisher Levantine Arabic Conversational Telephone Speech (LDC2007S02)

• Fisher Levantine Arabic Conversational Telephone Speech, Transcripts (LDC2007T04)

• Fisher Spanish Speech (LDC2010S01)

• Fisher Spanish - Transcripts (LDC2010T04)

### Other corpora

#### LDC corpora

• Speech in Noisy Environments (SPINE) Training Audio (LDC2000S87)

• Speech in Noisy Environments (SPINE) Evaluation Audio (LDC2000S96)

• Speech in Noisy Environments (SPINE) Training Transcripts (LDC2000T49)

• Speech in Noisy Environments (SPINE) Evaluation Transcripts (LDC2000T54)

• Speech in Noisy Environments (SPINE2) Part 1 Audio (LDC2001S04)

• Speech in Noisy Environments (SPINE2) Part 2 Audio (LDC2001S06)

• Speech in Noisy Environments (SPINE2) Part 3 Audio (LDC2001S08)

• Speech in Noisy Environments (SPINE2) Part 1 Transcripts (LDC2001T05)

• Speech in Noisy Environments (SPINE2) Part 2 Transcripts (LDC2001T07)

• Speech in Noisy Environments (SPINE2) Part 3 Transcripts (LDC2001T09)

• Santa Barbara Corpus of Spoken American English Part I (LDC2000S85)

• Santa Barbara Corpus of Spoken American English Part II (LDC2003S06)

• Santa Barbara Corpus of Spoken American English Part III (LDC2004S10)

• Santa Barbara Corpus of Spoken American English Part IV (LDC2005S25)

• HAVIC Pilot Transcription (LDC2016V01)

#### Non-LDC corpora

## Software

The baseline was prepared by Sriram Ganapathy and Lei Sun for the DIHARD data (tracks 1-2), and by Shinji Watanabe for the CHiME 5 data (tracks 3-4). It is based on JHU's DIHARD I submission (Sell et al., 2018), with updated Kaldi scripts and deep-learning-based denoising applied as a preprocessing step (following Sun et al., 2018). More details to follow.

## Registration

To register for the evaluation, participants should email dihardchallenge@gmail.com with the subject line "REGISTRATION" and the following details:

• Organization – the organization competing (e.g., NIST, BBN, SRI)
• Team name – the name to be displayed on the leaderboard; you must use this same team name when you register with CodaLab (see under "Results submission"; instructions coming by February 14)
• Tracks – the tracks the team will be competing in

One participant from each site must sign the data license agreement and return it to LDC: (1) by email to ldc@ldc.upenn.edu or (2) by facsimile, Attention: Membership Office, fax number (+1) 215-573-2175. That participant will also need to create an LDC Online user account, which will be used to download the dev and eval releases.

Once this process is complete, you will have access to all annotations plus the non-CHiME audio.

Participants in tracks 3 and 4 must apply separately to Sheffield for the CHiME 5 data, regardless of whether they participated in CHiME 5. To apply for the multichannel data, visit https://licensing.sheffield.ac.uk/i/data/chime5.html
Non-profit organizations should sign the non-commercial license. Everyone else, regardless of use case (even if they are only using the data for non-commercial research), should apply for the commercial license.

## Paper submission

Instructions coming soon

## Results submission

Instructions coming soon

## Rules

The 2019 DIHARD challenge is an open evaluation in which the test data is sent to participants, who process the data locally and submit their system outputs to LDC via CodaLab for scoring. As such, participants agree to process the data in accordance with the following rules:

• Investigation of the evaluation data prior to the end of the evaluation is disallowed.
• Automatic identification of the domain of the test utterance is allowed.
• During the evaluation period, each team may make at most two submissions per day per system. Additional submissions past the first two each day will be ignored.
• While most test data is actually, or effectively, unexposed, portions have been exposed in part in the following corpora:
  • HCRC Map Task Corpus (LDC93S12)
  • DCIEM Map Task Corpus (LDC96S38)
  • MIXER6 Speech (LDC2013S03)
  • NIST SRE10 evaluation data
  • NIST SRE12 evaluation data
  • DIHARD 2018 data
  Use of these corpora is prohibited.
• Participants in the 2017 JSALT Summer Workshop would have had access to an earlier version of the following sources:
  • SEEDlingS
  • YouthPoint
  Teams containing members who participated in JSALT will be allowed to submit systems, but their scores will be flagged on the leaderboard and in publications.
• While participants are encouraged to submit papers to the special session at Interspeech 2019, this is not a requirement for participation.

## FAQ

1. Must I participate in all tracks in the challenge?
No, researchers can choose to participate in a single track. However, researchers are strongly encouraged to submit results for at least Track 1 (single channel, using reference SAD).

2. Are there any limitations about the training data?
Participants are free to choose their own training data, whether publicly available or not. The only exception is that you may not use data that overlaps with the evaluation set. Please note that clear descriptions of the data are required in the final technical report.

3. Can I use the development set to do data simulation and augmentation?
Yes, development data is free to be used in any way you see fit, including for tuning your current diarization system or augmenting training data.

4. How can I upload the results?
Submission procedures will be posted on our website soon.

5. Which files should I submit?
Only the result RTTM files are needed.

6. What should I report in the final paper?
Clear documentation of each system is required, providing sufficient detail for a fellow researcher to understand the approach and data/computational requirements. This includes, as mentioned above, explanation of any training data used. Additionally, participants are encouraged to submit papers to the special session at Interspeech 2019, although this is not a requirement for participation.

## System Descriptions

Proper interpretation of the evaluation results requires thorough documentation of each system. Consequently, at the end of the evaluation researchers must submit a full description of their system with sufficient detail for a fellow researcher to understand the approach and data/computational requirements. An acceptable system description should include the following information:

• Abstract
• Data resources
• Detailed description of algorithm
• Hardware requirements

#### Section 1: Abstract

A short (a few sentences) high-level description of the system.

#### Section 2: Data resources

This section should describe the data used for training including both volumes and sources. For LDC or ELRA corpora, catalog ids should be supplied. For other publicly available corpora (e.g., AMI) a link should be provided. In cases where a non-publicly available corpus is used, it should be described in sufficient detail to get the gist of its composition. If the system is composed of multiple components and different components are trained using different resources, there should be an accompanying description of which resources were used for which components.

#### Section 3: Detailed description of algorithm

Each component of the system should be described in sufficient detail that another researcher would be able to reimplement it. You may be brief or omit entirely description of components that are standard (e.g., no need to list the standard equations underlying an LSTM or GRU). If hyperparameter tuning was performed, there should be detailed description both of the tuning process and the final hyperparameters arrived at.

We suggest including subsections for each major phase in the system. Suggested subsections:

• signal processing – e.g., signal enhancement, denoising, source separation

• acoustic features – e.g., MFCCs, PLPs, mel filterbank, PNCCs, RASTA, pitch extraction

• speech activity detection details – relevant for Track 2 only

• segment representation – e.g., i-vectors, d-vectors

• speaker estimation – how number of speakers was estimated if such estimation was performed

• clustering method – e.g., k-means, agglomerative

• resegmentation details

#### Section 4: Hardware requirements

System developers should report the hardware requirements for both training and at test time:

• Total number of CPU cores used

• Description of CPUs used (model, speed, number of cores)

• Total number of GPUs used

• Description of GPUs used (model, single-precision TFLOPS, memory)

• Total available RAM

• Disk storage used

• Machine learning frameworks used (e.g., PyTorch, TensorFlow, CNTK, etc.)

System execution time to process a single 10-minute recording must be reported.