While state-of-the-art diarization systems perform remarkably well for some domains (e.g., conversational telephone speech such as CallHome), as was discovered at the 2017 JSALT Summer Workshop at CMU, this success does not transfer to more challenging corpora such as child language recordings, clinical interviews, speech in reverberant environments, web video, and “speech in the wild” (e.g., recordings from wearables in an outdoor or restaurant setting). In particular, current approaches:
The goals of the inaugural DIHARD evaluation include:
The goal of the challenge is to automatically detect and label all speaker segments in each audio recording. Small pauses of <= 200 ms by a speaker are not considered to be segmentation breaks and should be bridged into a single continuous segment. Vocal noises other than breaths (e.g., laughter, cough, sneeze, and lipsmack), are considered to be speech for the purpose of this evaluation, though all other sounds are considered non-speech. Because system performance is strongly influenced by the quality of the speech segmentation used, two tracks will be supported:
Systems submitted to the former track should use the provided reference speech segmentation for each file, which will allow for evaluation of the diarization component in isolation from the SAD component. Systems submitted to the latter track will work directly from the audio. All researchers are strongly encouraged to submit results to at least the first track.
System output will be scored by comparison to human reference segmentation with performance evaluated by two metrics:
Diarization error rate (DER), introduced for the NIST Rich Transcription Spring 2003 Evaluation (RT-03S), is the total percentage of reference speaker time that is not correctly attributed to a speaker, where “correctly attributed” is defined in terms of an optimal one-to-one mapping between the reference and system speakers. More concretely, DER is defined as:
Contrary to practice in the NIST evaluations, NO forgiveness collar will be applied to the reference segments prior to scoring and overlapping speech WILL be evaluated. For more details please consult section 6 of the RT-09 evaluation plan and the source to the NIST md-eval scoring tool, available as part of the Speech Recognition Scoring Toolkit (SCTK). For DIHARD, we will be using version 22 of md-eval.
We will also approach system evaluation from the standpoint of clustering evaluation, where both the reference and system segmentations are viewed as assignments of labels to frames of speech and a system's score is the mutual information in bits between its labeling and the reference labeling. More concretely, each segmentation will be converted to a sequence of 10 ms frames, each of which is assigned a single label corresponding to one of the following classes:
The scoring region for each recording will be the entirety of the recording; that is, for a recording of duration 405.37 seconds, the scoring region will be [0, 405.37]. These regions will be provided to the scoring tool via un-partitioned evaluation map (UEM) files, which are plaintext files containing one scoring region per line, each line consisting of four space-delimited fields
CMU_20020319-1400_d01_NONE 1 125.000000 727.090000 CMU_20020320-1500_d01_NONE 1 111.700000 615.330000 ICSI_20010208-1430_d05_NONE 1 97.440000 697.290000
UEM files for the dev/eval partitions:
The official scoring tool is maintained as a github repo: https://github.com/nryant/dscore. To score a set of system output RTTMs sys1.rttm, sys2.rttm, ... against corresponding reference RTTMs ref1.rttm, ref2.rttm, ... using the un-partitioned evaluation map (UEM) dev.uem, the command line would be:
The overall and per-file results for DER and MI (and many other metrics) will be printed to STDOUT as a table. For additional details about scoring tool usage, please consult the documentation for the github repo.
$ python score.py -u dev.uem -r ref1.rttm ref2.rttm ... -s sys1.rttm sys2.rttm ...