# Overview

State-of-the-art diarization systems perform remarkably well for some domains (e.g., conversational telephone speech such as CallHome). However, as was discovered at the 2017 JSALT Summer Workshop at CMU, this success does not transfer to more challenging corpora such as child language recordings, clinical interviews, speech in reverberant environments, web video, and “speech in the wild” (e.g., recordings from wearables in an outdoor or restaurant setting). In particular, current approaches:

• fare poorly at estimating the number of speakers (e.g., monologues are frequently broken into multiple speakers)
• fail to work for short utterances (<1 second), a particular problem for domains such as clinical interviews, which contain many short segments of high information content
• deal poorly with child speech and pathological speech (e.g., due to neurodegenerative diseases)
• are not robust to materials with large amounts of overlapping speech or dynamic environmental noise with some speechlike characteristics

The goals of the inaugural DIHARD evaluation include:

• to create an evaluation set drawn from a diverse set of challenging domains
• to establish a baseline of performance for existing diarization technologies on this set
• to release the reference data and results after the evaluation to encourage further testing, development, and research

The goal of the challenge is to automatically detect and label all speaker segments in each audio recording. Small pauses of <= 200 ms by a speaker are not considered to be segmentation breaks and should be bridged into a single continuous segment. Vocal noises other than breaths (e.g., laughter, coughs, sneezes, and lipsmacks) are considered to be speech for the purpose of this evaluation; all other sounds are considered non-speech. Because system performance is strongly influenced by the quality of the speech segmentation used, two tracks will be supported:

• Track 1: diarization using gold speech segmentation
• Track 2: diarization from scratch

Systems submitted to the former track should use the provided reference speech segmentation for each file, which will allow for evaluation of the diarization component in isolation from the SAD component. Systems submitted to the latter track will work directly from the audio. All researchers are strongly encouraged to submit results to at least the first track.
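The pause-bridging rule described above (pauses of <= 200 ms within a speaker's turn are not segmentation breaks) can be sketched as follows. This is only an illustration under stated assumptions: the function name `bridge_pauses`, the `(speaker, onset, offset)` tuple layout, and the small floating-point tolerance are hypothetical choices, not part of the evaluation tooling.

```python
def bridge_pauses(segments, max_pause=0.2):
    """Merge same-speaker segments separated by a pause of at most max_pause seconds.

    segments: iterable of (speaker, onset, offset) tuples, times in seconds.
    Returns a new list of tuples ordered by onset.
    """
    merged = []
    last_by_speaker = {}  # speaker -> index in `merged` of that speaker's latest segment
    for speaker, onset, offset in sorted(segments, key=lambda s: (s[1], s[2])):
        idx = last_by_speaker.get(speaker)
        # The 1e-9 tolerance guards against floating-point noise at exactly 200 ms.
        if idx is not None and onset - merged[idx][2] <= max_pause + 1e-9:
            spk, prev_onset, prev_offset = merged[idx]
            merged[idx] = (spk, prev_onset, max(prev_offset, offset))
        else:
            last_by_speaker[speaker] = len(merged)
            merged.append((speaker, onset, offset))
    return merged


if __name__ == "__main__":
    example = [("A", 0.0, 1.0), ("A", 1.15, 2.0), ("B", 1.9, 3.0), ("A", 2.5, 3.0)]
    for seg in bridge_pauses(example):
        print(seg)
```

In this example the first two segments of speaker A are 150 ms apart and are bridged, while the 500 ms gap before A's last segment is kept as a break. Pauses are measured per speaker, so B's overlapping segment does not affect A's merging.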

# Scoring

System output will be scored by comparison to human reference segmentation with performance evaluated by two metrics:

• diarization error rate (DER)
• framewise mutual information (MI)

### 1. Diarization error rate

Diarization error rate (DER), introduced for the NIST Rich Transcription Spring 2003 Evaluation (RT-03S), is the total percentage of reference speaker time that is not correctly attributed to a speaker, where “correctly attributed” is defined in terms of an optimal one-to-one mapping between the reference and system speakers. More concretely, DER is defined as:

$$\textrm{DER} = \frac{\textrm{FA} + \textrm{MISS} + \textrm{ERROR}}{\textrm{TOTAL}}$$

where

• TOTAL is the total reference speaker time; that is, the sum of the durations of all reference speaker segments
• FA is the total system speaker time not attributed to a reference speaker
• MISS is the total reference speaker time not attributed to a system speaker
• ERROR is the total reference speaker time attributed to the wrong speaker
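As a worked illustration of the formula, the minimal sketch below computes DER from the four component durations. The function name and the example durations are hypothetical and chosen only for illustration. The sketch takes the optimal speaker mapping as given, whereas the official scores come from md-eval, which determines that mapping itself.

```python
def diarization_error_rate(fa, miss, error, total):
    """Return DER as a percentage, given component durations in seconds.

    fa:    system speaker time not attributed to a reference speaker
    miss:  reference speaker time not attributed to a system speaker
    error: reference speaker time attributed to the wrong speaker
    total: total reference speaker time
    """
    if total <= 0:
        raise ValueError("total reference speaker time must be positive")
    return 100.0 * (fa + miss + error) / total


if __name__ == "__main__":
    # Hypothetical recording: 600 s of reference speaker time,
    # 30 s false alarm, 45 s missed speech, 15 s speaker confusion.
    print(diarization_error_rate(fa=30.0, miss=45.0, error=15.0, total=600.0))  # 15.0
```

Because FA is not bounded by TOTAL, DER can exceed 100% for a system that produces a large amount of spurious speaker time.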

Contrary to practice in the NIST evaluations, NO forgiveness collar will be applied to the reference segments prior to scoring and overlapping speech WILL be evaluated. For more details please consult section 6 of the RT-09 evaluation plan and the source to the NIST md-eval scoring tool, available as part of the Speech Recognition Scoring Toolkit (SCTK). For DIHARD, we will be using version 22 of md-eval.

### 2. Mutual information

We will also approach system evaluation from the standpoint of clustering evaluation, where both the reference and system segmentations are viewed as assignments of labels to frames of speech and a system's score is the mutual information in bits between its labeling and the reference labeling. More concretely, each segmentation will be converted to a sequence of 10 ms frames, each of which is assigned a single label corresponding to one of the following classes:

• non-speech
• non-overlapping speech by $\textrm{speaker}_i$
• overlapping speech by $n$ speakers $\textrm{speaker}_{i_1}, \ldots, \textrm{speaker}_{i_n}$

where the sets of speakers are assumed disjoint for any pair of files. The contingency matrix between the reference and system labelings is then built and from this the mutual information computed according to:

$$\textrm{MI} = \sum_{i=1}^{R}\sum_{j=1}^{S}\frac{n_{ij}}{N}\log_2{\frac{n_{ij}N}{r_is_j}}$$

where

• $R$ is the number of reference clusters
• $S$ is the number of system clusters
• $n_{ij}$ is the number of frames assigned to the $i$-th reference cluster and $j$-th system cluster
• $r_i$ is the number of frames assigned to the $i$-th reference cluster
• $s_j$ is the number of frames assigned to the $j$-th system cluster
• $N$ is the total number of frames
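The frame labeling and the MI formula above can be sketched in a few lines of Python. This is an illustrative approximation, not the official scorer: the function names are hypothetical, frame boundaries are obtained by simple rounding (an assumption), and each frame's label is represented as the set of active speakers, so the empty set stands for non-speech and a set with several members stands for overlapping speech.

```python
from collections import Counter
from math import log2


def frame_labels(segments, duration, step=0.01):
    """Convert (speaker, onset, offset) segments into one label per 10 ms frame.

    Each label is a frozenset of the speakers active in that frame.
    """
    n_frames = int(round(duration / step))
    active = [set() for _ in range(n_frames)]
    for speaker, onset, offset in segments:
        first = max(0, int(round(onset / step)))
        last = min(n_frames, int(round(offset / step)))
        for i in range(first, last):
            active[i].add(speaker)
    return [frozenset(s) for s in active]


def mutual_information(ref_labels, sys_labels):
    """Mutual information in bits between two equal-length label sequences."""
    if len(ref_labels) != len(sys_labels):
        raise ValueError("label sequences must have the same number of frames")
    n = len(ref_labels)
    if n == 0:
        return 0.0
    joint = Counter(zip(ref_labels, sys_labels))  # n_ij
    ref_counts = Counter(ref_labels)              # r_i
    sys_counts = Counter(sys_labels)              # s_j
    # Empty cells of the contingency matrix contribute nothing (0 * log 0 = 0),
    # so only observed (i, j) pairs are summed.
    return sum(
        (n_ij / n) * log2(n_ij * n / (ref_counts[r] * sys_counts[s]))
        for (r, s), n_ij in joint.items()
    )


if __name__ == "__main__":
    ref = frame_labels([("A", 0.0, 0.02), ("B", 0.02, 0.04)], duration=0.04)
    hyp = frame_labels([("spk1", 0.0, 0.02), ("spk2", 0.02, 0.04)], duration=0.04)
    print(mutual_information(ref, hyp))  # 1.0
```

MI depends only on how frames are grouped, not on the label names, which is why the system labels `spk1` and `spk2` score a full bit against reference speakers `A` and `B` here.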

### 3. Scoring regions

The scoring region for each recording will be the entirety of the recording; that is, for a recording of duration 405.37 seconds, the scoring region will be [0, 405.37]. These regions will be provided to the scoring tool via un-partitioned evaluation map (UEM) files, which are plaintext files containing one scoring region per line, each line consisting of four space-delimited fields:

• File ID -- file name; basename of the recording minus extension (e.g., “rec1_a”)
• Channel ID -- channel (1-indexed) that scoring region is on
• Onset -- onset of scoring region in seconds from beginning of recording
• Offset -- offset of scoring region in seconds from beginning of recording

For instance:

CMU_20020319-1400_d01_NONE 1 125.000000 727.090000
CMU_20020320-1500_d01_NONE 1 111.700000 615.330000
ICSI_20010208-1430_d05_NONE 1 97.440000 697.290000
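A UEM file in the four-field format described above can be read with a few lines of Python. The sketch below is a hypothetical helper, not part of the official scoring tool; the function name and the returned tuple layout are assumptions made for illustration.

```python
def parse_uem(lines):
    """Parse UEM lines into (file_id, channel, onset, offset) tuples.

    Each non-blank line must contain four whitespace-delimited fields:
    file ID, 1-indexed channel, onset in seconds, offset in seconds.
    """
    regions = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(f"line {line_no}: expected 4 fields, got {len(fields)}")
        file_id, channel, onset, offset = fields
        onset, offset = float(onset), float(offset)
        if offset <= onset:
            raise ValueError(f"line {line_no}: offset must be greater than onset")
        regions.append((file_id, int(channel), onset, offset))
    return regions


if __name__ == "__main__":
    example = [
        "CMU_20020319-1400_d01_NONE 1 125.000000 727.090000",
        "CMU_20020320-1500_d01_NONE 1 111.700000 615.330000",
        "ICSI_20010208-1430_d05_NONE 1 97.440000 697.290000",
    ]
    for file_id, channel, onset, offset in parse_uem(example):
        print(file_id, channel, round(offset - onset, 2))
```

To read from disk, pass an open file object, since iterating over it yields lines.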

UEM files for the dev/eval partitions:

### 4. Scoring tool

The official scoring tool is maintained as a GitHub repo: https://github.com/nryant/dscore. To score a set of system output RTTMs sys1.rttm, sys2.rttm, ... against corresponding reference RTTMs ref1.rttm, ref2.rttm, ... using the un-partitioned evaluation map (UEM) dev.uem, the command line would be:

\$ python score.py -u dev.uem -r ref1.rttm ref2.rttm ... -s sys1.rttm sys2.rttm ...

The overall and per-file results for DER and MI (and many other metrics) will be printed to STDOUT as a table. For additional details about scoring tool usage, please consult the documentation in the GitHub repo.