Open Speech and Language Resources



Contact
dpovey@gmail.com
Phone: 425 247 4129
(Daniel Povey)

MAGICDATA Mandarin Chinese Read Speech Corpus

Identifier: SLR68

Summary: A corpus by Magic Data Technology Co., Ltd., containing 755 hours of scripted read speech from 1080 native speakers of Mandarin Chinese as spoken in mainland China. The sentence transcription accuracy is higher than 98%.

Category: Speech

License: Attribution-NonCommercial-NoDerivatives 4.0 International Public License (CC BY-NC-ND 4.0)

Downloads (use a mirror closer to you):
train_set.tar.gz [52G]   (Training set speech and transcripts)   Mirrors: [US]
dev_set.tar.gz [1.0G]   (Development set speech and transcripts)   Mirrors: [US]
test_set.tar.gz [2.2G]   (Test set speech and transcripts)   Mirrors: [US]
metadata.tar.gz [3.8M]   (Supplementary resources, incl. data introduction in English and Chinese, and speaker information)   Mirrors: [US]
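
Once the archives have been downloaded, they can be unpacked with any tar tool. The following is a minimal Python sketch, not an official script: the function name extract_archives and the directory arguments are hypothetical, and only the four archive file names come from the list above. Archives that are not present in the download directory are skipped.

```python
import tarfile
from pathlib import Path

# Archive names as listed in the Downloads section.
ARCHIVES = ["train_set.tar.gz", "dev_set.tar.gz", "test_set.tar.gz", "metadata.tar.gz"]


def extract_archives(download_dir, out_dir, names=ARCHIVES):
    """Extract every listed archive found in download_dir into out_dir.

    Hypothetical helper for illustration; returns the names that were extracted.
    """
    download_dir, out_dir = Path(download_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for name in names:
        archive = download_dir / name
        if not archive.is_file():
            continue  # not downloaded, skip it
        with tarfile.open(archive, "r:gz") as tar:
            # Use the safer "data" extraction filter where this Python supports it.
            if hasattr(tarfile, "data_filter"):
                tar.extractall(out_dir, filter="data")
            else:
                tar.extractall(out_dir)
        extracted.append(name)
    return extracted
```

For example, extract_archives("downloads", "magicdata") would unpack whichever of the four archives exist under downloads/ into magicdata/. The internal directory layout of each archive is not described on this page, so the sketch makes no assumption about it.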

About this resource:

The MAGICDATA Mandarin Chinese Read Speech Corpus was developed by Magic Data Technology Co., Ltd. and is freely published for non-commercial use.

The corpus has the following characteristics:

  • The corpus contains 755 hours of speech data, most of it recorded on mobile devices.
  • 1080 speakers from different accent areas in China were invited to participate in the recording.
  • The sentence transcription accuracy is higher than 98%.
  • Recordings were made in a quiet indoor environment.
  • The database is divided into a training set, a validation set, and a testing set in a ratio of 51:1:2.
  • Detailed information, such as speech data coding and speaker information, is preserved in the metadata file.
  • The recording texts cover diverse domains, including interactive Q&A, music search, SNS messages, home command and control, etc.
  • Segmented transcripts are also provided.
The corpus aims to support researchers in speech recognition, machine translation, speaker recognition, and other speech-related fields. It is therefore free for academic use.
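
As a rough illustration of the 51:1:2 split, the sketch below works out what each part would amount to if the ratio applies to hours of speech. That is an assumption: this page does not say whether the ratio refers to hours, utterances, or speakers. The split names train, dev, and test are likewise illustrative, with "dev" standing in for the validation set.

```python
from fractions import Fraction

TOTAL_HOURS = 755                            # total stated for the corpus
RATIO = {"train": 51, "dev": 1, "test": 2}   # 51:1:2, assumed to apply to hours


def split_hours(total, ratio):
    """Divide a total proportionally to the given ratio weights (exact fractions)."""
    parts = sum(ratio.values())
    return {name: total * Fraction(weight, parts) for name, weight in ratio.items()}


hours = split_hours(TOTAL_HOURS, RATIO)
for name, value in hours.items():
    print(f"{name}: about {float(value):.1f} hours")
# train: about 713.1 hours
# dev: about 14.0 hours
# test: about 28.0 hours
```

Under this assumption the figures are approximate at best; the listed archive sizes (52G, 1.0G, 2.2G) are in roughly the same proportion, but the exact per-set durations should be taken from the metadata rather than from this estimate.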

The corpus is a subset of a much larger data set (the 10566.9-hour Chinese Mandarin Speech Corpus), which was recorded in the same environment. Please feel free to contact us via business@magicdatatech.com for more details.

Citation

Please cite the corpus as: Magic Data Technology Co., Ltd., "http://www.imagicdatatech.com/index.php/home/dataopensource/data_info/id/101", 05/2019.

About us

Magic Data Technology Co., Ltd. (referred to as Magic Data) was established in 2016. Through our high-expertise, high-precision data services, Magic Data has quickly grown into one of the foremost companies in the artificial intelligence industry. We strive to provide the most efficient and highest-quality one-stop data services for customers in the fields of speech recognition, intelligent imaging, and Natural Language Understanding (NLU). Our services include data scheme design, data collection, data annotation/transcription, etc.

Contact

  • Tel: (+86) 10-82527250
  • Email: business@magicdatatech.com
  • http://www.imagicdatatech.com

External URL: http://www.imagicdatatech.com/index.php/home/dataopensource/data_info/id/101 (full description on the company website)