Workshop: Platform- and language-independent framework for speech recognition

Create an easily extensible framework for using speech input in any language to query a dataset of content and return a result set. The demo system takes a spoken query in Telugu (a South Indian language) and converts it into text. The resulting text is refined with the help of POS taggers, and the relevant information is retrieved and returned as speech in the same language. The system uses standard open-source tools for speech recognition, synthesis and POS tagging, fine-tuned to produce the necessary results in Indian languages as an example.

In this presentation we show how the above is implemented through the following steps:

- Preparation of a global phone set, which replaces each native-script letter with a unique combination of English letters, based on the link below:

http://homepage.ntlworld.com/stone-catend/trimain1.htm

- Transliteration of the available Unicode data using the global phone set

- Writing and testing a parser that chunks the transcription into individual phonemes

- Using ehmm (in festvox) or HTKAlign to automatically label the speech at the phoneme level

- Using speech tools such as festvox or HTS to train the voice

- Using festival to generate the synthetic speech

- Writing and testing JavaScript for the Text-To-Speech synthesis system

- Writing code to automatically label the speech data that is to be used for speech recognition

- Preparing a dictionary that maps each word in the vocabulary to its corresponding phoneme representation

- Using speech tools such as SphinxTrain or HTK to train the acoustic models for speech recognition

- Using speech tools such as CMUCLMTK or HLMTools to prepare the language model

- Building and testing the system with sphinxdecode or HDecode

- Collecting sufficient transcribed phrases and corresponding exemplar recordings to test the decoding accuracy of the system

- Enhancing the system by passing speech through voice activity detection or noise-reduction algorithms before decoding

- Using POS tagging and word-sense disambiguation to retrieve the necessary information from the recognition output

- Integrating all the modules and preparing a simple interface.
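
The transliteration step above can be sketched in Python. The mapping tables here are a tiny hypothetical illustration for Telugu, not the actual global phone set from the linked page; a real table would cover the full script.

```python
# Minimal Unicode-to-Latin transliteration sketch (hypothetical phone codes,
# not the actual global phone set from the linked page).
CONSONANTS = {"క": "k", "త": "t", "మ": "m", "ల": "l"}
VOWELS = {"అ": "a", "ఆ": "aa", "ఇ": "i"}
VOWEL_SIGNS = {"ా": "aa", "ి": "i", "ీ": "ii"}
VIRAMA = "్"  # suppresses the inherent vowel of a consonant

def transliterate(text: str) -> str:
    """Replace each Telugu character with a unique combination of Latin letters."""
    out, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt in VOWEL_SIGNS:          # explicit vowel sign follows
                out.append(VOWEL_SIGNS[nxt])
                i += 2
            elif nxt == VIRAMA:             # bare consonant, no vowel
                i += 2
            else:                           # inherent vowel 'a'
                out.append("a")
                i += 1
        elif ch in VOWELS:
            out.append(VOWELS[ch])
            i += 1
        else:                               # pass through anything unmapped
            out.append(ch)
            i += 1
    return "".join(out)
```

For example, `transliterate("తల్లి")` yields `"talli"`, with the virama suppressing the inherent vowel of the middle consonant.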
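
The parsing and dictionary steps can be sketched as a greedy longest-match of each transliterated word against the phone inventory. The inventory below is a hypothetical fragment; a real system would use the full global phone set.

```python
# Greedy longest-match chunking of a transliterated word into phones, and a
# sphinx-style pronunciation dictionary built from it.
# PHONES is a hypothetical fragment of a real phone inventory.
PHONES = {"aa", "ii", "a", "i", "k", "l", "m", "t"}
MAX_PHONE_LEN = max(len(p) for p in PHONES)

def chunk_phones(word: str) -> list[str]:
    """Split a transliterated word into phones, preferring longer matches."""
    phones, i = [], 0
    while i < len(word):
        for n in range(min(MAX_PHONE_LEN, len(word) - i), 0, -1):
            if word[i:i + n] in PHONES:
                phones.append(word[i:i + n])
                i += n
                break
        else:
            raise ValueError(f"no phone matches at {word[i:]!r}")
    return phones

def build_dictionary(words: list[str]) -> list[str]:
    """One line per word: the word, a tab, and its space-separated phones."""
    return [f"{w}\t{' '.join(chunk_phones(w))}" for w in sorted(set(words))]
```

Longest-match matters because a code like "aa" must be taken as one phone rather than two "a" phones.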
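
To test decoding accuracy against the collected transcribed phrases, the usual metric is word error rate: the word-level edit distance between the reference transcription and the decoder output, divided by the reference length. A minimal sketch:

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))          # distances for the empty reference
    for i, rw in enumerate(r, 1):
        cur = [i] + [0] * len(h)
        for j, hw in enumerate(h, 1):
            cur[j] = min(prev[j] + 1,                 # deletion
                         cur[j - 1] + 1,              # insertion
                         prev[j - 1] + (rw != hw))    # substitution (0 if equal)
        prev = cur
    return prev[len(h)] / len(r)
```

For example, a hypothesis with one substituted word against a three-word reference gives a WER of 1/3.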

In more detail, the implementation and the challenges involved:

The system mainly requires two important modules:

- Speech-To-Text

- Text-To-Speech

Along with these, some knowledge of dialogue systems and POS tagging is required.

Tasks involved:

Speech-To-Text:

- Collection of audio data and corresponding text.

- Text in UTF-8 format and its transliteration to IT3 or Roman

- Construction of pre-defined dictionary based on given vocabulary

- Automatic Labelling of data

- Preparation of acoustic models

- Preparation of language Model (LM) or Finite State Grammar (FSG) in case of CMUSphinx
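
Tools such as CMUCLMTK estimate the n-gram probabilities of the language model from the transcription corpus. The counting behind a maximum-likelihood bigram model can be sketched as follows (no smoothing, which a real LM toolkit would add):

```python
from collections import Counter

def bigram_probs(sentences: list[str]) -> dict[tuple[str, str], float]:
    """Maximum-likelihood bigram probabilities with sentence markers."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(toks[:-1])          # every token that starts a bigram
        bigrams.update(zip(toks, toks[1:]))
    return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}
```

On the toy corpus `["a b", "a c"]` this gives P(b | a) = 0.5, since "a" is followed by "b" in one of its two occurrences.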

Text-To-Speech:

- Collection of audio data and corresponding text.

- Text in UTF-8 format and its transliteration to IT3 or Roman

- Language parser for chunking transcription into phones

- Automatic Labelling of data

- Question file preparation and acoustic models in case of HTS

Challenges involved:

- Collection of a large amount of audio data and its corresponding text for ASR (availability of speakers for recording)

- Preparation of global phone-set (so that the work can be easily extended to other languages)

- Language parser development (for chunking into phones - depends on transliteration)

- Automatic labelling of data (use of ehmm or HTKAlign)

- Use of noise-reduction algorithm or voice activity detection algorithm to enhance the system
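
The last point can be illustrated with the simplest possible voice activity detector: keep only frames whose short-time energy exceeds a threshold. This is a sketch assuming 16-bit PCM samples already read into a list (e.g. via Python's `wave` module); production systems use more robust spectral or model-based methods.

```python
def voiced_frames(samples, frame_len=400, threshold=1_000_000):
    """Indices of frames whose short-time energy exceeds the threshold.

    samples:   16-bit PCM sample values as plain integers.
    frame_len: samples per frame (400 = 25 ms at 16 kHz).
    threshold: energy cut-off; in practice tuned per recording setup.
    """
    voiced = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame)
        if energy > threshold:
            voiced.append(start // frame_len)
    return voiced
```

Passing only the voiced frames (with some padding) to the decoder reduces both compute and insertion errors caused by background noise.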

Info

Day: 2014-11-01
Start time: 19:00
Duration: 01:00
Room: C361
Track: Code, data and infrastructure
Language: en
