CNVSRC 2023
Chinese Continuous Visual Speech Recognition Challenge

Welcome to the first Chinese Continuous Visual Speech Recognition Challenge, CNVSRC 2023!
The challenge aims to probe the performance of Large Vocabulary Continuous Visual Speech Recognition (LVCVSR) in two scenarios: reading in a recording studio and speech on the Internet.


The challenge will be based on CN-CVS, a large-scale continuous visual-speech dataset in Mandarin Chinese consisting of short clips collected from TV news and Internet speech shows.

CNVSRC 2023 consists of two tasks: Single-speaker VSR (T1) and Multi-speaker VSR (T2). The former focuses on the performance of large-scale tuning for a specific speaker, while the latter focuses on the basic performance of the system for non-specific speakers. The organizers offer baseline systems for participants to refer to. The results will be announced and awarded at the NCMMSC 2023 conference.

CNVSRC 2023 Evaluation Plan

News

2023-Sept-20 CNVSRC 2023 Registration System is open now.

2023-Sept-20 The Development Sets and Baseline are released now.

2023-Oct-10 The Evaluation Sets are released now.

2023-Nov-01 The Submission System and Leaderboard are open now.

2023-Dec-02 The Submission System and Leaderboard are closed now.

Tasks

CNVSRC 2023 defines two tasks: Single-speaker VSR (T1) and Multi-speaker VSR (T2). In both tasks, participants are required to develop a system that can recognize corresponding Chinese text from silent facial videos.

Task 1. Single-speaker VSR (T1)

The objective of this task is to assess the performance of the VSR system when developed using a substantial amount of audio-visual data from a particular speaker and then applied to unseen video clips from the same speaker.

For the T1 task, there are two defined tracks based on the data used for system development:

Fixed Track: ONLY the CN-CVS dataset and the development set of T1 are allowed for training/tuning the system.

Open Track: ANY data sources can be used for developing the system, with the exception of the T1 evaluation set.

Task 2. Multi-speaker VSR (T2)

The objective of this task is to assess the performance of the VSR system when it is applied to multiple speakers. In this task, both the data used for developing and evaluating the VSR system involve the same group of speakers. This group comprises multiple speakers, but each speaker has relatively limited data available.

For the T2 task, there are two defined tracks based on the data used for system development:

Fixed Track: ONLY the CN-CVS dataset and the development set of T2 are allowed for training/tuning the system.

Open Track: ANY data sources can be used for developing the system, with the exception of the T2 evaluation set.

Specifically, resources that cannot be used in the fixed track include: non-public pre-trained models used as feature extractors, and pre-trained language models that either have more than 1B parameters or are non-public.
Tools and resources that can be used include: publicly available pre-processing tools such as face detection and extraction, lip area extraction, and contour extraction; publicly available external models, tools, and datasets for data augmentation; and publicly available word lists, pronunciation dictionaries, n-gram language models, and neural language models with fewer than 1B parameters.

Evaluation

For both tasks, the evaluation metric is the Character Error Rate (CER). CER is calculated as

CER = (N_Ins + N_Subs + N_Del) / N_Total × 100%

where N_Ins, N_Subs, and N_Del are the numbers of insertion, substitution, and deletion errors, respectively, and N_Total is the total number of characters in the ground-truth text transcription, which contains only Chinese characters.

We use CharErrorRate in TorchMetrics to perform evaluation.
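The official scoring relies on TorchMetrics' CharErrorRate; purely as an illustration of the formula above, the same quantity can be sketched with a stdlib-only character-level Levenshtein computation (this sketch is not the official scoring tool):

```python
# Illustrative Character Error Rate (CER) computation:
# CER = (insertions + substitutions + deletions) / N_Total,
# where N_Total is the number of characters in the reference.

def cer(reference: str, hypothesis: str) -> float:
    """Character-level edit distance divided by reference length."""
    m, n = len(reference), len(hypothesis)
    # dp[i][j] = minimal edit distance between reference[:i] and hypothesis[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # i deletions
    for j in range(n + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[m][n] / m

print(cer("abcd", "abxd"))  # one substitution over four characters -> 0.25
```

In practice, reference and hypothesis strings are normalized to Chinese characters only before scoring, so punctuation and whitespace do not contribute to N_Total.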

License

Participants can apply for the data by signing the data user agreement and uploading it to the system.

The organizers will review the application. If it is successful, the data download links will be automatically sent to the email address provided during participants' registration.

Data

This challenge is centered around the CN-CVS dataset, which encompasses over 200,000 utterances from 2,557 speakers, with a total duration exceeding 300 hours.
Please click the button below to obtain the data and transcription of CN-CVS.

Additionally, for each task, corresponding development sets will be made available to participants.

Task 1. Single-speaker VSR

Two datasets will be released: CNVSRC-Single.Dev and CNVSRC-Single.Eval.

CNVSRC-Single.Dev contains 25,947 utterances from a single speaker, with a total duration of approximately 94 hours. CNVSRC-Single.Eval contains 2,881 utterances from the same speaker, approximately 8.41 hours in total.

For the CNVSRC-Single.Dev, audio-video recordings and text transcriptions will be provided, while for the CNVSRC-Single.Eval, only video data will be available.

In the fixed track, ONLY the CN-CVS dataset and CNVSRC-Single.Dev may be used for system development.

In the open track, ANY data sources and tools may be used for system development, with the exception of CNVSRC-Single.Eval.

Task 2. Multi-speaker VSR

Two datasets will be released: CNVSRC-Multi.Dev and CNVSRC-Multi.Eval.

CNVSRC-Multi comprises two parts: (1) video data recorded in a studio setting from 23 speakers, and (2) video data collected from the Internet from 20 speakers.
Each speaker has approximately one hour of data. Two-thirds of each speaker's data make up CNVSRC-Multi.Dev, while the remaining data make up CNVSRC-Multi.Eval.

For the CNVSRC-Multi.Dev, audio-video recordings and text transcriptions will be provided, while for the CNVSRC-Multi.Eval, only video data will be available.

In the fixed track, ONLY the CN-CVS dataset and CNVSRC-Multi.Dev may be used for system development.

In the open track, ANY data sources and tools may be used for system development, with the exception of CNVSRC-Multi.Eval.

Participation Rules

  • Participation is open and free to all individuals and institutes.
  • Anonymity of affiliation/department is allowed in the leaderboard and result announcement.
  • Consent to the data user agreement is required.

Registration

Participants must register for a CNVSRC account where they can perform various activities such as signing the data user agreement as well as uploading the submission and system description.

Registration is free to all individuals and institutes and normally takes effect immediately, but the organizers may check the registration information and ask participants to provide additional details to validate the registration.

Once the account has been created, participants can apply for the data by signing the data agreement and uploading it to the system. The organizers will review the application, and if it is approved, participants will be sent a link to the data.

To sign up for an evaluation account, please click Quick Registration

Baseline

The organizers have constructed baseline systems for the Single-speaker VSR task and the Multi-speaker VSR task, using only the data resources permitted in the fixed track. The baselines use the Conformer architecture as the building block and offer reasonable performance, shown below:

Task               Single-speaker VSR   Multi-speaker VSR
CER on Dev Set     48.57%               58.77%
CER on Eval Set    48.60%               58.37%

Participants can download the source code of the baseline systems from [here]

Submission and Leaderboard

Participants should submit their results via the submission system. Once a submission is completed, it will be shown on the Leaderboard, where all participants can check their positions. For each task and each track, participants can submit their results no more than 5 times.

All valid submissions must be accompanied by a system description, submitted via the submission system. All system descriptions will be published on the web page of the CNVSRC 2023 workshop. The submission deadline for the system description is 2023/12/01, 12:00 PM (UTC). The template for the system description can be downloaded [here].

In the system description, participants are allowed to hide their name and affiliation.

Dates

2023/09/20 Registration kick-off
2023/09/20 Training data, development data release
2023/09/20 Baseline system release
2023/10/10 Evaluation set release
2023/11/01 Submission system open
2023/12/01 Deadline for result submission
2023/12/09 Workshop at NCMMSC 2023

Organization Committees

DONG WANG, Center for Speech and Language Technologies, Tsinghua University, China
CHEN CHEN, Center for Speech and Language Technologies, Tsinghua University, China 
LANTIAN LI, Beijing University of Posts and Telecommunications, China
KE LI, Beijing Haitian Ruisheng Science Technology Ltd., China
HUI BU, Beijing AIShell Technology Co. Ltd, China
                


Please contact e-mail cnvsrc@cnceleb.org if you have any queries.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (NSFC) under Grants No. 62171250 and No. 62301075.

Special thanks to Beijing Haitian Ruisheng Science Technology Ltd for their generous donation of Chinese lip recognition video datasets to support CNVSRC 2023.