Welcome to the second Chinese Continuous Visual Speech Recognition Challenge, CNVSRC 2024!
The challenge aims to probe the performance of Large Vocabulary Continuous Visual Speech Recognition (LVCVSR) in two scenarios: reading in a recording studio and speech on the Internet.
Compared to CNVSRC 2023, this year we offer (1) a more powerful baseline system for the fixed tracks; and (2) an extra data source, CN-CVS2-P1, for the open tracks.
The challenge will be based on CN-CVS, a large-scale continuous visual-speech dataset in Mandarin Chinese consisting of short clips collected from TV news and Internet speech shows.
CNVSRC 2024 consists of two tasks: Single-speaker VSR (T1) and Multi-speaker VSR (T2). T1 focuses on the performance of large-scale tuning for a specific speaker, while T2 focuses on the basic performance of the system for non-specific speakers. The organizers offer baseline systems for participants to refer to. The results will be announced and awarded at the NCMMSC 2024 conference.
CNVSRC 2024 defines two tasks: Single-speaker VSR (T1) and Multi-speaker VSR (T2). In both tasks, participants are required to develop a system that can recognize the corresponding Chinese text from silent facial videos.
The objective of this task is to assess the performance of the VSR system when developed using a substantial amount of audio-visual data from a particular speaker and then applied to unseen video clips from the same speaker.
For the T1 task, there are two defined tracks based on the data used for system development:
Fixed Track: ONLY the CN-CVS dataset and the development set of T1 are allowed for training/tuning the system. The CN-CVS2-P1 dataset is not allowed for training/tuning the system.
Open Track: ANY data sources can be used for developing the system, with the exception of the T1 evaluation set.
The objective of this task is to assess the performance of the VSR system when it is applied to multiple speakers. In this task, both the data used for developing and evaluating the VSR system involve the same group of speakers. This group comprises multiple speakers, but each speaker has relatively limited data available.
For the T2 task, there are two defined tracks based on the data used for system development:
Fixed Track: ONLY the CN-CVS dataset and the development set of T2 are allowed for training/tuning the system. The CN-CVS2-P1 dataset is not allowed for training/tuning the system.
Open Track: ANY data sources can be used for developing the system, with the exception of the T2 evaluation set.
Specifically, resources that cannot be used in the fixed tracks include: non-public pre-trained models used as feature extractors, and pre-trained language models that have more than 1B parameters or that are non-public.
Tools and resources that can be used include: publicly available pre-processing tools such as face detection, face extraction, lip area extraction, and contour extraction; publicly available external models, tools, and datasets for data augmentation; and publicly available word lists, pronunciation dictionaries, n-gram language models, and neural language models with fewer than 1B parameters.
For both tasks, the evaluation metric is the Character Error Rate (CER), calculated as

$$\mathrm{CER} = \frac{N_{\mathrm{Ins}} + N_{\mathrm{Subs}} + N_{\mathrm{Del}}}{N_{\mathrm{Total}}}$$

where $N_{\mathrm{Ins}}$, $N_{\mathrm{Subs}}$, and $N_{\mathrm{Del}}$ are the numbers of insertion, substitution, and deletion errors, respectively, and $N_{\mathrm{Total}}$ is the total number of characters in the ground-truth text transcription, which contains only Chinese characters.
We use the CharErrorRate metric in TorchMetrics to perform the evaluation.
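As a minimal sketch of how the scoring works (the sentence pairs below are hypothetical examples, not challenge data):

```python
from torchmetrics.text import CharErrorRate

# Hypothetical hypothesis/reference pairs; the real evaluation uses the eval-set transcriptions.
preds = ["今天天气很好", "我们去公园"]   # system outputs
target = ["今天天气真好", "我们去公园"]  # ground-truth transcriptions

cer = CharErrorRate()
# (insertions + substitutions + deletions) / total reference characters
print(cer(preds, target))  # tensor(0.0909): 1 substitution over 11 characters
```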
Participants can apply for the data by signing the data user agreement and uploading it to the system.
The organizers will review the application. If the application is approved, the data download links will be automatically sent to the email address provided during the participant's registration.
This challenge is centered around the CN-CVS dataset, which encompasses over 200,000 utterances from 2,557 speakers, with a total duration exceeding 300 hours.
Please click the button below to obtain the data and transcription of CN-CVS.
Additionally, for each task, corresponding development sets will be made available to participants.
Two datasets are released: CNVSRC-Single.Dev and CNVSRC-Single.Eval.
CNVSRC-Single.Dev contains 25,947 utterances from a single speaker, with a total duration of approximately 94 hours. CNVSRC-Single.Eval contains 2,881 utterances from the same speaker, approximately 8.41 hours in total.
For CNVSRC-Single.Dev, audio-video recordings and text transcriptions will be provided, while for CNVSRC-Single.Eval, only video data will be available.
In the fixed track, ONLY the CN-CVS dataset and CNVSRC-Single.Dev may be used for system development.
In the open track, ANY data sources and tools may be used for system development, with the exception of CNVSRC-Single.Eval.
Two datasets are released: CNVSRC-Multi.Dev and CNVSRC-Multi.Eval.
CNVSRC-Multi comprises two parts: (1) Video data recorded in a studio setting from 23 speakers. (2) Video data collected from the Internet, covering 20 speakers.
Each speaker possesses approximately one hour of data. Two-thirds of each person's data make up the CNVSRC-Multi.Dev, while the remaining data make up the CNVSRC-Multi.Eval.
For CNVSRC-Multi.Dev, audio-video recordings and text transcriptions will be provided, while for CNVSRC-Multi.Eval, only video data will be available.
In the fixed track, ONLY the CN-CVS dataset and CNVSRC-Multi.Dev may be used for system development.
In the open track, ANY data sources and tools may be used for system development, with the exception of CNVSRC-Multi.Eval.
For the open tracks, the organizers provide CN-CVS2-P1, the preview part of the CN-CVS2 dataset, for system development.
It encompasses over 160,000 utterances with a total duration of about 200 hours.
Note that this dataset can only be used for OPEN TRACKS.
Please click the button below to obtain the data and transcription of CN-CVS2-P1.
Participants must register for a CNVSRC account, through which they can sign the data user agreement and upload their submissions and system descriptions.
Registration is free to all individuals and institutes. It normally takes effect immediately, but the organizers may check the registration information and ask participants to provide additional information to validate the registration.
Once the account has been created, participants can apply for the data by signing the data agreement and uploading it to the system. The organizers will review the application, and if it is approved, participants will be notified of the data download links.
To sign up for an evaluation account, please click Quick Registration.
The organizers have constructed baseline systems for the Single-speaker VSR task and the Multi-speaker VSR task, using the data resources permitted in the fixed tracks. The baselines use the Conformer structure as their building block and offer reasonable performance, as shown below:
| Task | Single-speaker VSR | Multi-speaker VSR |
|---|---|---|
| CER on Dev Set | 41.22% | 52.42% |
| CER on Eval Set | 39.66% | 52.20% |
Participants can download the source code of the baseline systems from [here].
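For orientation, below is a minimal sketch of a Conformer encoder of the kind the baselines are built from, using torchaudio.models.Conformer; all dimensions and hyperparameters here are illustrative assumptions, and the actual configuration is defined in the released source code.

```python
import torch
from torchaudio.models import Conformer

# Hypothetical input: a batch of lip-region feature sequences from a visual frontend.
features = torch.randn(2, 150, 512)   # (batch, frames, feature_dim)
lengths = torch.tensor([150, 120])    # valid frames per sample

# Hypothetical hyperparameters; consult the baseline source code for the real ones.
encoder = Conformer(
    input_dim=512,
    num_heads=8,
    ffn_dim=2048,
    num_layers=12,
    depthwise_conv_kernel_size=31,
)

encoded, out_lengths = encoder(features, lengths)
print(encoded.shape)  # torch.Size([2, 150, 512])
```

In a full VSR system, the encoder output would then feed a decoder or CTC layer that maps frame-level representations to Chinese characters.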
Participants should submit their results via the submission system. Once a submission is completed, it will be shown on the Leaderboard, where all participants can check their positions. For each task and each track, participants can submit their results no more than 5 times.
All valid submissions are required to be accompanied by a system description, submitted via the submission system. All system descriptions will be published on the web page of the CNVSRC 2024 workshop. The deadline for submitting system descriptions to CNVSRC 2024 is 2024/08/01, 12:00 PM (UTC). The template for the system description can be downloaded [here].
In the system description, participants are allowed to hide their names and affiliations.
| Date | Event |
|---|---|
| 2024/05/08 | Registration kick-off |
| 2024/05/08 | Data release |
| 2024/05/08 | Baseline system release |
| 2024/07/01 | Submission system open |
| 2024/08/01 | Deadline for result submission |
| 2024/08/16 | Workshop at NCMMSC 2024 |
DONG WANG, Center for Speech and Language Technologies, Tsinghua University, China
CHEN CHEN, Center for Speech and Language Technologies, Tsinghua University, China
LANTIAN LI, Beijing University of Posts and Telecommunications, China
KE LI, Beijing Haitian Ruisheng Science Technology Ltd., China
HUI BU, Beijing AIShell Technology Co. Ltd., China
Please contact cnvsrc@cnceleb.org by e-mail if you have any queries.
This work is supported by the National Natural Science Foundation of China (NSFC) under Grants No. 62171250 and No. 62301075.
Special thanks to Beijing Haitian Ruisheng Science Technology Ltd. for their generous donation of Chinese lip recognition video datasets to support CNVSRC.