Welcome to the third Chinese Continuous Visual Speech Recognition Challenge, CNVSRC 2025!
The challenge aims to probe the performance of Large Vocabulary Continuous Visual Speech Recognition (LVCVSR) and Large Vocabulary Continuous Visual-to-Speech Conversion (LVCVTS).
Compared to CNVSRC 2024, this year we (1) introduce a new Visual-to-Speech (VTS) track and (2) release an additional 1,000 hours of training data to support large-scale models.
The competition will be based on the CN-CVS, CNVSRC, CN-CVS2-P1, and CN-CVS3 datasets, totaling over 1,600 hours of data.
CNVSRC 2025 comprises two tasks: Multi-speaker VSR (T1) and Single-speaker VTS (T2). The former (T1) focuses on the accuracy of content recognition across multiple speakers, while the latter (T2) emphasizes the reconstruction of both content and speaker-specific audio characteristics under a single-speaker scenario.
The organizers provide baseline systems for participants to reference. Final results will be announced and awarded at the NCMMSC 2025 conference.
CNVSRC 2025 defines two tasks: Multi-speaker VSR (T1) and Single-speaker VTS (T2). In Task T1, participants are required to recognize the corresponding Chinese text from silent facial videos, whereas in Task T2, they are expected to reconstruct the original speech audio from the same type of silent facial video input.
The objective of this task is to assess the performance of the VSR system when it is applied to multiple speakers. In this task, the data used for developing and evaluating the VSR system involve the same group of speakers. The group comprises multiple speakers, but each speaker has relatively limited data.
For the T1 task, there are two defined tracks based on the data used for system development:
Fixed Track: ONLY the CN-CVS, CNVSRC (comprising CNVSRC.Single.Dev and CNVSRC.Multi.Dev), CN-CVS2-P1, and CN-CVS3 datasets are allowed for training/tuning the system.
Open Track: ANY data sources can be used for developing the system, with the exception of the T1 evaluation set.
The objective of this task is to evaluate the performance of a Visual-to-Speech (VTS) system in reconstructing speech audio from silent videos of a single speaker.
For the T2 task, there are two defined tracks based on the data used for system development:
Fixed Track: ONLY the CN-CVS, CNVSRC (comprising CNVSRC.Single.Dev and CNVSRC.Multi.Dev), CN-CVS2-P1, and CN-CVS3 datasets are allowed for training/tuning the system.
Open Track: ANY data sources can be used for developing the system, with the exception of the T2 evaluation set.
Specifically, resources that cannot be used in the fixed track include: non-public pre-trained models used as feature extractors, and pre-trained language models that have more than 1B parameters or are non-public.
Tools and resources that can be used include: publicly available pre-processing tools such as face detection, face extraction, lip-area extraction, and contour extraction; publicly available external models and tools, and datasets for data augmentation; and publicly available word lists, pronunciation dictionaries, n-gram language models, and neural language models with fewer than 1B parameters.
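For illustration only, the sketch below shows one such publicly available pre-processing step: a crude lip-region crop built on OpenCV's Haar cascade face detector. The cascade file, crop proportions, and output size are assumptions chosen for demonstration, not a prescribed CNVSRC pipeline.

```python
# Illustrative sketch only: crop a rough lip region from a video frame using
# OpenCV's publicly available Haar cascade face detector. The crop proportions
# and output size below are assumptions, not part of any official pipeline.
import cv2

def crop_lip_region(frame, size=(96, 96)):
    """Detect the largest face and return its lower third as a rough lip ROI."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                      # no face detected in this frame
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep the largest detection
    lip = frame[y + 2 * h // 3 : y + h, x : x + w]       # lower third of the face box
    return cv2.resize(lip, size)
```

In practice, landmark-based lip extraction usually yields tighter and more stable crops than such a bounding-box heuristic.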
Both tasks use Character Error Rate (CER) as the evaluation metric. For Task T1, CER can be directly computed from the predicted text. For Task T2, the generated audio is first transcribed into text using an ASR model, and CER is then calculated based on the transcription. CER is calculated by

$$\mathrm{CER} = \frac{N_{\mathrm{Ins}} + N_{\mathrm{Subs}} + N_{\mathrm{Del}}}{N_{\mathrm{Total}}}$$

where $N_{\mathrm{Ins}}$, $N_{\mathrm{Subs}}$, and $N_{\mathrm{Del}}$ are the numbers of insertion, substitution, and deletion errors, respectively, and $N_{\mathrm{Total}}$ is the total number of characters in the ground-truth text transcription, which contains only Chinese characters.
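As a concrete illustration, the minimal sketch below computes CER via Levenshtein alignment between the ground-truth and hypothesis character sequences. The function name and the character-level tokenization are assumptions for demonstration; this is not the official CNVSRC scoring script.

```python
# Minimal CER sketch: edit distance between reference and hypothesis characters,
# normalized by the reference length. Not the official scoring script.

def cer(reference: str, hypothesis: str) -> float:
    """CER = (insertions + substitutions + deletions) / total reference characters."""
    ref, hyp = list(reference), list(hypothesis)
    m, n = len(ref), len(hyp)
    # dp[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # i deletions
    for j in range(n + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[m][n] / max(m, 1)

# Example: one substitution out of four reference characters -> CER = 0.25
print(cer("今天天气", "今天天汽"))
```

For Task T2, the hypothesis string would be the ASR transcription of the generated audio rather than the system's direct text output.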
Participants can apply for the data by signing the data user agreement and uploading it to the system.
The organizers will review the application. If it is successful, the data download links will be automatically sent to the email address provided during participants' registration.
This challenge is centered around the CN-CVS dataset, which encompasses over 200,000 utterances from 2,557 speakers, with a total duration exceeding 300 hours.
Please click the button below to obtain the data and transcription of CN-CVS.
The organizers provide CN-CVS2-P1, the preview part of the CN-CVS2 dataset, for system development. It encompasses over 160,000 utterances with a total duration of about 200 hours.
Please click the button below to obtain the data and transcription of CN-CVS2-P1.
The organizers have also provided a new dataset, CN-CVS3. This is an additional data source specifically prepared for this year’s competition. It represents a significant expansion, containing over 900,000 utterances with a total duration of approximately 1,000 hours. This large-scale visual-speech dataset, primarily collected from internet media, is intended to serve as a comprehensive resource for system development and training. Please click the button below to access the CN-CVS3 data and transcriptions.
Additionally, for each task, corresponding development sets will be made available to participants.
Two datasets are released: CNVSRC-Multi.Dev and CNVSRC-Multi.Eval.
CNVSRC-Multi comprises two parts: (1) video data recorded in a studio setting from 23 speakers, and (2) video data of 20 speakers collected from the Internet.
Each speaker has approximately one hour of data. Two-thirds of each speaker's data make up CNVSRC-Multi.Dev, while the remaining data make up CNVSRC-Multi.Eval.
For the CNVSRC-Multi.Dev, audio-video recordings and text transcriptions will be provided, while for the CNVSRC-Multi.Eval, only video data will be available.
In the fixed track, ONLY the CN-CVS, CNVSRC (comprising CNVSRC.Single.Dev and CNVSRC.Multi.Dev), CN-CVS2-P1, and CN-CVS3 datasets are allowed for training/tuning the system.
In the open track, ANY data sources and tools may be used for system development, with the exception of CNVSRC-Multi.Eval.
Two datasets are released: CNVSRC-Single.Dev and CNVSRC-Single.Eval.
CNVSRC-Single.Dev contains 25,947 utterances from a single speaker, with a total duration of approximately 94 hours. CNVSRC-Single.Eval contains 300 utterances from the same speaker, approximately 0.87 hours in total.
For the CNVSRC-Single.Dev, audio-video recordings and text transcriptions will be provided, while for the CNVSRC-Single.Eval, only video data will be available.
In the fixed track, ONLY the CN-CVS, CNVSRC (comprising CNVSRC.Single.Dev and CNVSRC.Multi.Dev), CN-CVS2-P1, and CN-CVS3 datasets are allowed for training/tuning the system.
In the open track, ANY data sources and tools may be used for system development, with the exception of CNVSRC-Single.Eval.
Participants must register for a CNVSRC account, through which they can sign the data user agreement and upload their submissions and system description.
Registration is free to all individuals and institutes. It normally takes effect immediately, but the organizers may check the registration information and ask participants to provide additional information to validate the registration.
Once the account has been created, participants can apply for the data by signing the data user agreement and uploading it to the system. The organizers will review the application, and if it is approved, participants will be sent the data download links.
To sign up for an evaluation account, please click Quick Registration.
The organizers have constructed baseline systems for the Multi-speaker VSR task and the Single-speaker VTS task, using the data resources permitted in the fixed tracks. The baselines leverage advanced methods for VSR and VTS and offer reasonable performance, as shown below:
| Task | Multi-speaker VSR | Single-speaker VTS |
|---|---|---|
| CER on Dev Set | 31.91% | 33.15% |
| CER on Eval Set | 31.55% | 31.41% |
Participants can download the source code of the baseline systems from [here].
Participants should submit their results via the submission system. Once a submission is completed, it will be shown on the Leaderboard, where all participants can check their positions. For each task and each track, participants may submit their results no more than 5 times.
All valid submissions must be accompanied by a system description, submitted via the submission system. All system descriptions will be published on the web page of the CNVSRC 2025 workshop. The submission deadline for system descriptions is 2025/10/01, 12:00 PM (UTC). The template for the system description can be downloaded [here].
In the system description, participants are allowed to hide their names and affiliations.
| Date | Event |
|---|---|
| 2025/07/04 | Registration kick-off |
| 2025/07/04 | Data release |
| 2025/07/04 | Baseline system release |
| 2025/08/01 | Submission system open |
| 2025/10/01 | Deadline for result submission |
| 2025/10/16-19 | Workshop at NCMMSC 2025 |
DONG WANG, Center for Speech and Language Technologies, Tsinghua University, China
LANTIAN LI, Beijing University of Posts and Telecommunications, China
ZEHUA LIU, Beijing University of Posts and Telecommunications, China
XIAOLOU LI, Beijing University of Posts and Telecommunications, China
KE LI, Beijing Haitian Ruisheng Science Technology Ltd., China
HUI BU, Beijing AIShell Technology Co. Ltd., China
Please contact e-mail cnvsrc@cnceleb.org if you have any queries.
This work is supported by the National Natural Science Foundation of China (NSFC) under Grants No. 62171250 and No. 62301075.
Special thanks to Beijing Haitian Ruisheng Science Technology Ltd. for their generous donation of Chinese lip recognition video datasets to support CNVSRC.