Aspiring Minds | Svar


About This Presentation

Voice recognition and synthesis to automatically evaluate spoken language skills.
Read more: research.aspiringminds.com


Slide Content

Spoken English Evaluation: Machine Learning with Crowd Intelligence
Varun Aggarwal. Presented at KDD 2015 and ACL 2015.

Problem Statement & Motivation Importance of spoken English English language has a very high socio-economic impact – with people speaking the language fluently reported to earn 30-50% more than their peers who don’t. Grading spoken English in a scalable way needed by companies, training organization and also individuals. Problem Statement Scalable grading of spontaneous English speech, as good as experts.

Why are automated methods not accurate?
Speaker-independent speech recognition of spontaneous speech is a hard problem!

Proposed system architecture
Crowdsourcing gives us accurate transcriptions, and crowd grades help improve the model further. The architecture combines crowd grades with forced-alignment (FA) features.
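To make the data flow concrete, here is a minimal sketch of the pipeline in Python. All names here (Sample, fa_extractor, nlp_extractor, model) are hypothetical illustrations under the slide's description, not APIs from the actual Svar system.

```python
# A minimal sketch of the scoring pipeline, assuming crowd workers supply
# transcriptions and per-dimension grades. All names are hypothetical.

from dataclasses import dataclass, field

@dataclass
class Sample:
    audio_path: str        # recorded spontaneous speech
    transcription: str     # produced by crowd workers
    crowd_grades: dict = field(default_factory=dict)  # e.g. {"fluency": 4, ...}

def score(sample, fa_extractor, nlp_extractor, model):
    """Concatenate the three feature classes and apply a trained regressor."""
    fa = fa_extractor(sample.audio_path, sample.transcription)  # forced alignment
    nlp = nlp_extractor(sample.transcription)                   # text features
    cg = [sample.crowd_grades[k] for k in sorted(sample.crowd_grades)]
    return model.predict([fa + nlp + cg])[0]
```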

Crowdsourcing task

Crowdsourcing task: worker quality control
Each worker is assigned a risk level that reflects the quality of his or her past work. Based on this risk level, the system determines how many gold-standard tasks to give the worker and when to give them.
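The slide does not specify the exact state machine, but the idea can be sketched as follows: the worker's risk level is updated from gold-standard outcomes, and riskier workers are audited more often. The smoothing factor and probabilities below are assumptions for illustration, not values from the deployed system.

```python
import random

class Worker:
    """Hedged sketch of risk-based worker quality control (values are assumed)."""

    def __init__(self):
        self.risk = 1.0  # new workers start at high risk until they build a record

    def record_gold_result(self, passed):
        # Exponentially smooth past gold-task outcomes into a risk score.
        self.risk = 0.8 * self.risk + 0.2 * (0.0 if passed else 1.0)

    def gold_probability(self):
        # Riskier workers receive known-answer (gold) tasks more frequently.
        return min(1.0, 0.05 + 0.5 * self.risk)

def next_task(worker, real_tasks, gold_tasks):
    """Decide whether the worker's next task is a real one or a gold audit."""
    if random.random() < worker.gold_probability():
        return random.choice(gold_tasks)
    return random.choice(real_tasks)
```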

Supervised learning setup
Experiment details: sample size of 566 speakers (319 from India, 247 from the Philippines).
Expert grading: two expert raters; the overall score is based on pronunciation, fluency, content organization, and grammar. Inter-rater correlation ~0.8.
The learning task: models were built separately for the Indian and Philippine sets, using linear ridge regression, neural networks, and SVM regression with different kernels.
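A comparison of these three model families can be sketched with scikit-learn, where X is the feature matrix and y the expert overall score; the hyperparameter values below are placeholders, not those used in the paper.

```python
# Illustrative model comparison with scikit-learn; hyperparameters are
# placeholders, not the values used in the actual experiments.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

def compare_models(X, y):
    """Cross-validate the three model families mentioned on the slide."""
    models = {
        "ridge": Ridge(alpha=1.0),
        "svr_linear": SVR(kernel="linear"),
        "svr_rbf": SVR(kernel="rbf"),
        "neural_net": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="r2")
        print(f"{name}: mean r2 = {np.mean(scores):.3f}")

# Modelling was done separately per region, e.g.:
# compare_models(X_india, y_india); compare_models(X_philippines, y_philippines)
```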

Case study
We studied a deployment of the proposed algorithm in the Philippines, at a hiring event with 500 applicants for the role of customer support executive. The scoring algorithm was tested on a subset of 150 candidates, and an internal expert graded each candidate's speech as hirable or not hirable.

Features used
We use three classes of features.
Forced alignment (FA) features: the speech sample is force-aligned to the crowdsourced transcription, and features such as rate of speech, position and length of pauses, log-likelihood of recognition, posterior probability, hesitations, and repetitions are derived.
Natural language processing (NLP) features: surface-level features such as the number of words, the complexity or difficulty of words, and the number of common words used; and semantic features such as the coherence of the text, the context of the words spoken, the sentiment of the text, and grammatical correctness.
Crowd grades (CG): the crowd provides scores on pronunciation, fluency, content organization, and grammar, which are combined into a composite score.
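A few of the named FA and surface NLP features can be sketched directly from a forced-alignment output. Here `alignment` is assumed to be a list of (token, start_sec, end_sec) tuples with silences marked "<sil>"; the exact feature definitions used in Svar are not given in this deck.

```python
def fa_features(alignment):
    """Rate of speech and pause statistics from a forced alignment.

    `alignment` is assumed to be [(token, start_sec, end_sec), ...] with
    silence segments marked "<sil>" -- a convention assumed for this sketch.
    """
    words = [t for t, s, e in alignment if t != "<sil>"]
    pauses = [e - s for t, s, e in alignment if t == "<sil>"]
    duration = alignment[-1][2] - alignment[0][1]
    return {
        "rate_of_speech": len(words) / duration,  # words per second
        "num_pauses": len(pauses),
        "mean_pause_len": sum(pauses) / len(pauses) if pauses else 0.0,
    }

def surface_nlp_features(transcription, common_words):
    """Surface-level text features from the crowdsourced transcription."""
    tokens = transcription.lower().split()
    return {
        "num_words": len(tokens),
        "mean_word_len": sum(map(len, tokens)) / len(tokens),  # difficulty proxy
        "frac_common_words": sum(t in common_words for t in tokens) / len(tokens),
    }
```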

Experiment and results
Crowdsourced transcriptions plus crowd grades outperform all other methods, with accuracy nearing inter-expert agreement (~0.8).
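The comparison implied here, machine scores against expert grades judged relative to the inter-expert baseline, can be computed as Pearson correlations, for example:

```python
# Comparing machine-expert agreement against the expert-expert baseline
# with Pearson correlation (the ~0.8 figure on this slide).

from scipy.stats import pearsonr

def agreement_report(machine_scores, expert1, expert2):
    expert_mean = [(a + b) / 2 for a, b in zip(expert1, expert2)]
    r_experts, _ = pearsonr(expert1, expert2)             # inter-rater baseline
    r_machine, _ = pearsonr(machine_scores, expert_mean)  # machine vs. consensus
    print(f"expert-expert r = {r_experts:.2f}, machine-expert r = {r_machine:.2f}")
```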

Summing it up
Svar provides an automated assessment of a candidate's pronunciation and fluency. Crowdsourcing, in addition to NLP features, yields reliable composite scores. Speech assessment can be made scalable, with accuracy nearly matching expert opinion.