CN-Celeb

CN-Celeb is a multi-genre dataset covering 11 different genres in the real world,
collected from multiple Chinese open media sources.

3,000
Speakers

CN-Celeb contains speech from Chinese celebrities.

600,000 +
Utterances

CN-Celeb covers multiple genres of speech, including entertainment, interview, singing, play, movie, vlog, live broadcast, speech, drama, recitation and advertisement.

1,200 +
Hours

CN-Celeb presents a complex long-short challenge that matches the scenarios of most
real applications.

Download

The dataset consists of two subsets, CN-Celeb1 and CN-Celeb2. For each subset, we provide audio files and speaker meta-data. There is no overlap between the two subsets. CN-Celeb1 contains more than 125,000 utterances from 997 Chinese celebrities, and CN-Celeb2 contains more than 520,000 utterances from 1,996 Chinese celebrities.

Project

Data

Kaldi | Pytorch

License

All the resources contained in the dataset are free for research institutes and individuals. The copyright remains with the original owners of the audio/video. No commercial usage is permitted.

Publications

Publications based on the dataset are welcome to cite the following papers:

Y.Fan,  J.W.Kang,  L.T.Li,  K.C.Li,  H.L.Chen,  S.T.Cheng,  P.Y.Zhang,  Z.Y.Zhou,  Y.Q.Cai,  D.Wang*

CN-Celeb: A Challenging Chinese Speaker Recognition Dataset, ICASSP, 2020

@inproceedings{fan2020cn,
  title={CN-Celeb: A Challenging Chinese Speaker Recognition Dataset},
  author={Fan, Yue and Kang, JW and Li, LT and Li, KC and Chen, HL and Cheng, ST and Zhang, PY and Zhou, ZY and Cai, YQ and Wang, Dong},
  booktitle={ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={7604--7608},
  year={2020},
  organization={IEEE}
}

L.T.Li,  R.Q.Liu,  J.W.Kang,  Y.Fan,  H.Cui,  Y.Q.Cai,  R.Vipperla,  T.F.Zheng,  D.Wang*

CN-Celeb: multi-genre speaker recognition, Speech Communication, 2022

@article{li2022cn,
  title={CN-Celeb: multi-genre speaker recognition},
  author={Li, Lantian and Liu, Ruiqi and Kang, Jiawen and Fan, Yue and Cui, Hao and Cai, Yunqi and Vipperla, Ravichander and Zheng, Thomas Fang and Wang, Dong},
  journal={Speech Communication},
  year={2022},
  publisher={Elsevier}
}

Challenge

We are hosting the first CN-Celeb Speaker Recognition Challenge (CNSRC) at Odyssey 2022 (The Speaker and Language Recognition Workshop). CNSRC aims to evaluate how well current speaker recognition methods work in real-world scenarios, which typically involve in-the-wild complexity and require real-time processing speed. CNSRC consists of two parts: an evaluation challenge and an accompanying workshop. The challenge and the workshop each have their own website.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (NSFC) under Grants No. 61633013 and No. 62171250.

CN-Celeb-AV

CN-Celeb-AV is a multi-genre audio-visual person recognition dataset covering 11 different genres in the real world,
collected from multiple Chinese open media sources.

1,136
Speakers

CN-Celeb-AV contains speech from Chinese celebrities.

419,000 +
Utterances

CN-Celeb-AV covers multiple genres of speech, including entertainment, interview, singing, play, movie, vlog, live broadcast, speech, drama, recitation and advertisement.

660 +
Hours

CN-Celeb-AV presents both full-modality and partial-modality challenges, which match the scenarios of most real applications.

Dev-F: 689 speakers

A development set with full-modality information, containing both audio and visual information.

Eval-F: 197 speakers

An evaluation set with full-modality information, containing both audio and visual information.

Eval-P: 250 speakers

An evaluation set with partial-modality information, containing some segments whose audio or visual information is corrupted or fully lost.

Download

The dataset consists of three subsets, Dev-F, Eval-F and Eval-P. For each subset, we provide video and audio files and speaker meta-data. There is no overlap among the three subsets. Dev-F contains more than 93,000 segments from 689 Chinese celebrities, Eval-F contains more than 17,000 segments from 197 Chinese celebrities, and Eval-P contains more than 307,900 segments from 250 Chinese celebrities.
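As a quick sanity check, the per-subset figures quoted above can be tallied against the headline statistics. The sketch below is illustrative only: the dictionary layout and the names `SUBSETS` and `tally` are assumptions made for this example, not part of the dataset's meta-data format, and the segment counts are the lower bounds ("more than ...") stated in the text.

```python
# Per-subset figures copied from the description above.
# The segment counts are lower bounds ("more than ..."), not exact totals.
SUBSETS = {
    "Dev-F": {"speakers": 689, "min_segments": 93_000},
    "Eval-F": {"speakers": 197, "min_segments": 17_000},
    "Eval-P": {"speakers": 250, "min_segments": 307_900},
}


def tally(subsets):
    """Sum speakers and lower-bound segment counts across subsets."""
    speakers = sum(s["speakers"] for s in subsets.values())
    segments = sum(s["min_segments"] for s in subsets.values())
    return speakers, segments


total_speakers, total_min_segments = tally(SUBSETS)
print(total_speakers, total_min_segments)
```

The speaker counts add up to the 1,136 speakers reported for the whole dataset, while the segment sum is only a lower bound on the true total.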

Collector

Data

Baseline

License

All the resources contained in the dataset are free for research institutes and individuals. The copyright remains with the original owners of the audio/video.
No commercial usage is permitted.
Please register and log in to the CN-Celeb system, and then submit the data license to request the data.

Publications

Publications based on the dataset are welcome to cite the following papers:

Lantian Li, Xiaolou Li, Haoyu Jiang, Chen Chen, Ruihai Hou, Dong Wang*

CN-Celeb-AV: A Multi-Genre Audio-Visual Dataset for Person Recognition, INTERSPEECH, 2023.

@article{li2023cn,
  title={CN-Celeb-AV: A Multi-Genre Audio-Visual Dataset for Person Recognition},
  author={Li, Lantian and Li, Xiaolou and Jiang, Haoyu and Chen, Chen and Hou, Ruihai and Wang, Dong},
  journal={arXiv preprint arXiv:2305.16049},
  year={2023}
}

Acknowledgements

This work is supported by the National Natural Science Foundation of China (NSFC) under Grant No. 62171250.

CN-CVS

CN-CVS is a large-scale continuous visual-speech dataset in Mandarin Chinese
consisting of short clips collected from TV news and Internet speech shows.

2,500 +
Speakers

CN-CVS contains speech from speakers
ranging in age from teenagers to seniors.

200k +
Utterances

CN-CVS covers complex environmental factors. For instance, different videos are recorded with diverse cameras; in the same video, the angle and distance of the camera may change.

300 +
Hours

CN-CVS involves two parts: CN-CVS/News, which consists of TV news, and CN-CVS/Speech, which consists of speech shows from online media.

Download

CN-CVS involves two parts: CN-CVS/News and CN-CVS/Speech. The former is relatively constrained in speaking style and camera setting while the latter involves more real-life complexity. CN-CVS/News contains 13,016 segments from 28 speakers with 35 hours in total. CN-CVS/Speech contains 193,245 utterances from 2,529 speakers with 273 hours in total.
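The totals above also give a rough sense of clip length. The following sketch divides the reported hours by the reported clip counts. Because the hour figures are rounded in the source, the results are approximations only, and the function name `mean_clip_seconds` is introduced here purely for illustration.

```python
def mean_clip_seconds(total_hours, num_clips):
    """Approximate mean clip duration in seconds from rounded totals."""
    return total_hours * 3600 / num_clips


# Figures copied from the description above (hours are rounded in the source).
news = mean_clip_seconds(35, 13_016)
speech = mean_clip_seconds(273, 193_245)

print(f"CN-CVS/News:   ~{news:.1f} s per segment")
print(f"CN-CVS/Speech: ~{speech:.1f} s per utterance")
```

Under these rounded figures, clips in both parts average well under ten seconds, which is consistent with the description of the dataset as a collection of short clips.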

Project

Data

Collector | Benchmark

License

Publications

Please cite the following if you make use of the dataset.

Chen Chen, Dong Wang*, Thomas Fang Zheng*

CN-CVS: A Mandarin Audio-Visual Dataset for Large Vocabulary Continuous Visual to Speech Synthesis, ICASSP, 2023

@inproceedings{chen2023cn,
  title={CN-CVS: A Mandarin Audio-Visual Dataset for Large Vocabulary Continuous Visual to Speech Synthesis},
  author={Chen, Chen and Wang, Dong and Zheng, Thomas Fang},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}

Acknowledgements

This work is supported by the National Natural Science Foundation of China (NSFC) under Grant No. 62171250.
