LYU0103
Speech Recognition
Techniques for
Digital Video Library
Supervisor: Prof. Michael R. Lyu
Students: Gao Zheng Hong
Lei Mo
Outline of Presentation
Project objectives
ViaVoice recognition experiments
Speech recognition editing tool
Audio scene change detection
Speech classification
Summary
Our Project Objectives
Audio information retrieval
Speech recognition
Last Term’s Work
Extracted the audio channel (stereo, 44.1 kHz)
from MPEG video files into wave files
(mono, 22 kHz)
Segmented the wave files into sentences by
detecting their frame energy
Developed a visual training tool
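The frame-energy segmentation above can be sketched as follows; the frame length, energy threshold, and minimum silence length are illustrative assumptions, not the project's actual parameters:

```python
import numpy as np

def segment_by_energy(samples, rate=22050, frame_ms=20,
                      threshold=0.01, min_silence_frames=15):
    """Split a mono waveform into sentence-like chunks at low-energy runs.

    frame_ms, threshold, and min_silence_frames are assumed values for
    illustration; the project's real parameters are not stated.
    """
    samples = np.asarray(samples, dtype=np.float64)
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)          # short-time frame energy
    voiced = energy > threshold

    segments, start, silence = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:    # long pause -> sentence boundary
                segments.append((start * frame_len,
                                 (i - silence + 1) * frame_len))
                start, silence = None, 0
    if start is not None:
        segments.append((start * frame_len, n_frames * frame_len))
    return segments
```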
Visual Training Tool
Video Window; Dictation Window; Text Editor
IBM ViaVoice Experiments
Employed 7 student helpers to produce
transcripts of 77 news video clips
Four experiments:
Baseline measurement
Trained model measurement
Slow down measurement
Indoor news measurement
Baseline Measurement
To measure the ViaVoice recognition
accuracy on TVB news video
Testing set: 10 video clips
The segmented wave files are dictated
Employ the hidden Markov model toolkit
(HTK) to examine the accuracy
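HTK's HResults scores recognition output as accuracy = (N - S - D - I) / N over an optimal word alignment. A minimal Python stand-in (not HTK itself; whitespace tokenization is a simplifying assumption) might look like:

```python
def word_accuracy(reference, hypothesis):
    """HTK-style word accuracy (N - S - D - I) / N via Levenshtein alignment.

    The minimal edit distance equals S + D + I for the optimal alignment,
    so accuracy = (N - distance) / N. Can be negative with many insertions.
    """
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return (n - dp[n][m]) / n
```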
Trained Model Measurement
To measure the accuracy of ViaVoice after
training it with its own correctly recognized words
10 video clips are segmented and dictated
The correctly dictated words of the training set are
used to train ViaVoice via the SMAPI function
SmWordCorrection
Repeat the procedure of the "baseline measurement"
after training to get the recognition performance
Repeat the procedure using 20 video clips
Slow Down Measurement
Investigate the effect of slowing down the
audio channel
Resample the segmented wave files in the
testing set by ratios of 1.05, 1.1, 1.15, 1.2,
1.3, 1.4, and 1.6
Repeat the procedures of “baseline
measurement”
Indoor News Measurement
Eliminate the effect of noise
Select the indoor news reporter sentences
Dictate the test set using the untrained model
Repeat the procedure using the trained model
Experimental Results

Experiment                        Accuracy (max. performance)
Baseline                          25.27%
Trained Model                     25.87% (with 20 videos trained)
Slow Speech                       25.67% (max. at ratio = 1.15)
Indoor Speech (untrained model)   35.22%
Indoor Speech (trained model)     36.31% (with 20 videos trained)

Overall recognition results (ViaVoice, TVB News)
Experimental Result Cont.

Result of trained model with different numbers of training videos:

Trained video number   Untrained   10 videos   20 videos
Accuracy               25.27%      25.82%      25.87%

Result of using different slow-down ratios:

Ratio          1       1.05    1.1     1.15    1.2     1.3     1.4     1.5
Accuracy (%)   25.27   25.46   25.63   25.67   25.82   17.18   12.34   4.04
Analysis of Experimental Result
Trained model: about 1% accuracy
improvement
Slowing down speech: about 1% accuracy
improvement
Indoor speech is recognized much better
Mandarin: estimated baseline accuracy is
about 70% (far higher than Cantonese)
Speech Processor
Training does not increase accuracy
significantly
Manual editing of the recognition result
is needed
Word timing information is also important
Editing Functionality
The recognition result is organized in
basic units called "firm words"
Retrieve the timing information from the
speech engine
Record the timing information of every firm
word in an index
Highlight corresponding firm word during
video playback
Dynamic Time Index Alignment
While editing the recognition result, the firm
word structure may change
The time index needs to be updated to match
the new firm words
In the speech processor, the time index is
realigned with the firm words whenever the
user edits the text
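The realignment above can be sketched roughly as follows; the FirmWord structure, its field names, and the length-proportional split rule are illustrative assumptions, not the speech processor's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class FirmWord:
    # Hypothetical firm word record: recognized text plus its time span.
    text: str
    start_ms: int
    end_ms: int

def merge_firm_words(words, i, j):
    """Merge firm words i..j (inclusive) after a user edit, keeping the
    start time of the first word and the end time of the last."""
    merged = FirmWord(" ".join(w.text for w in words[i:j + 1]),
                      words[i].start_ms, words[j].end_ms)
    return words[:i] + [merged] + words[j + 1:]

def split_firm_word(words, i, parts):
    """Split firm word i into `parts`, dividing its time span in proportion
    to the character length of each piece (an assumed heuristic)."""
    w = words[i]
    total = sum(len(p) for p in parts)
    out, t = [], w.start_ms
    for k, p in enumerate(parts):
        dur = (w.end_ms - w.start_ms) * len(p) // total
        end = w.end_ms if k == len(parts) - 1 else t + dur
        out.append(FirmWord(p, t, end))
        t = end
    return words[:i] + out + words[i + 1:]
```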
Time Index Alignment Example
(Figure: firm word timing before, during, and after editing)
Motivation for Doing Speech
Segmentation and Classification
Gender classification can help us build
gender-dependent models
Detection of scene changes from the video
content alone is not accurate enough, so we
need audio scene change detection as a
complementary tool
Flow Diagram of Audio Information Retrieval System

(Diagram: audio signal from the news' audio channel → MFCC feature
extraction → segmentation and audio scene change detection → speech /
non-speech decision (continuous vowel > 30%) → for speech: male/female
classification (by 256-mixture GMM) and speaker identification/
classification (by clustering); for non-speech: music pattern matching
(by MFCC variance))
Feature Extraction by MFCC
The first step applied to the raw
audio input data
MFCC stands for “mel-frequency cepstral
coefficient”
Human perception of the frequency of
sound does not follow a linear scale
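A common form of the mel scale is mel(f) = 2595 log10(1 + f/700), which compresses high frequencies the way human pitch perception does. A small sketch of the mapping:

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the mel scale, which approximates the
    non-linear human perception of pitch."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, back from mels to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

By construction, 1000 Hz maps to roughly 1000 mels, and the scale grows only logarithmically above that.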
Detection of Audio Scene Change by
Bayesian Information Criterion (BIC)
The Bayesian information criterion (BIC) is a
likelihood criterion
We maximize the likelihood function
separately for each model M and obtain
L(X, M)
The main principle is to penalize the system
by the model complexity
Detection of a single point
change using BIC
We define:
H0: x1, x2, …, xN ~ N(μ, Σ)
to be the hypothesis that the whole sequence contains no change, and
H1: x1, …, xi ~ N(μ1, Σ1); xi+1, …, xN ~ N(μ2, Σ2)
to be the hypothesis that a change occurs at time i.
The maximum likelihood ratio is defined as:
R(i) = N log|Σ| - N1 log|Σ1| - N2 log|Σ2|
where N1 = i and N2 = N - i.
Detection of a single point
change using BIC
The difference between the BIC values of the
two models can be expressed as:
BIC(i) = R(i) - λP
P = (1/2)(d + d(d+1)/2) log N
where d is the feature dimension.
If BIC(i) > 0, a scene change is detected at i.
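Under these definitions, single change point detection can be sketched as below; λ = 1 and the search margin are assumed values, and the R used follows the slide's form (the original Chen-Gopalakrishnan formulation carries an extra factor of 1/2 on each term, which only rescales the effective λ):

```python
import numpy as np

def log_det_cov(X):
    """log-determinant of the sample covariance of frames X (rows = frames)."""
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    sign, logdet = np.linalg.slogdet(cov)
    return logdet

def bic_single_change(X, lam=1.0, margin=5):
    """Return the best changing point in X (N x d feature frames), or None.

    BIC(i) = R(i) - lam * P with
      R(i) = N log|S| - i log|S1| - (N - i) log|S2|
      P    = (1/2)(d + d(d+1)/2) log N
    lam and margin are illustrative assumptions.
    """
    N, d = X.shape
    P = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
    full = N * log_det_cov(X)
    best_i, best_bic = None, 0.0   # only accept BIC > 0
    for i in range(margin, N - margin):   # keep enough frames on each side
        R = full - i * log_det_cov(X[:i]) - (N - i) * log_det_cov(X[i:])
        bic = R - lam * P
        if bic > best_bic:
            best_i, best_bic = i, bic
    return best_i
```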
Detection of multiple point
changes by BIC
a. Initialize the interval [a, b] with a = 1, b = 2
b. Detect whether there is a changing point in interval [a, b]
   using BIC
c. If there is no change in [a, b]:
       b = b + 1
   else:
       let t be the changing point detected
       a = t + 1; b = a + 1
d. Go to step (b) while data remains
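The window-growing loop in steps (a)-(d) can be sketched generically; the `detect_single` callback and the `min_len` parameter are assumptions for illustration (any single-change detector, such as a BIC test, can be plugged in):

```python
def detect_multiple_changes(X, detect_single, min_len=2):
    """Window-growing search following steps (a)-(d): grow [a, b) until the
    single-change detector fires, record the change, restart after it.

    detect_single(window) returns a change index within the window, or None.
    """
    changes = []
    a, b = 0, min_len
    while b <= len(X):
        t = detect_single(X[a:b])
        if t is None:
            b += 1                    # no change found: extend the window
        else:
            changes.append(a + t)     # convert to a global index
            a = a + t + 1             # restart just after the change
            b = a + min_len
    return changes
```

For example, with a toy detector that reports the last index of the window's first constant run, the loop recovers every boundary in a piecewise-constant sequence.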
Advantages of BIC approach
Robustness
Thresholding-free
Optimality
Comparison of Different Algorithms
(Figure: comparison of audio scene change detection algorithms)
Gender Classification
The means and covariances of male and female
feature vectors are quite different
So we can model each gender with a Gaussian
Mixture Model (GMM)
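As a toy stand-in for the 256-mixture GMMs, a single diagonal-covariance Gaussian per gender already shows the likelihood-comparison idea (the real system fits full mixtures, e.g. with EM; this simplification is an assumption):

```python
import numpy as np

class DiagonalGaussianModel:
    """One diagonal Gaussian per class: a minimal stand-in for a GMM.
    Classification picks whichever class model gives higher likelihood."""

    def fit(self, X):
        self.mean = X.mean(axis=0)
        self.var = X.var(axis=0) + 1e-6   # variance floor avoids divide-by-zero
        return self

    def log_likelihood(self, X):
        z = (X - self.mean) ** 2 / self.var
        return float(np.sum(-0.5 * (np.log(2 * np.pi * self.var) + z)))

def classify_gender(frames, male_model, female_model):
    """Label a clip by whichever model explains its feature frames better."""
    lm = male_model.log_likelihood(frames)
    lf = female_model.log_likelihood(frames)
    return "male" if lm > lf else "female"
```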
Male/Female Classification
(Figure: frequency count vs. feature values, male and female histograms)
Music/Speech classification by
pitch tracking
Speech has a more continuous pitch contour
than music
A speech clip typically has 30%-55%
continuous contour, whereas silence or
music has 1%-15%
Thus, we choose > 20% as the threshold for speech
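The thresholding rule can be sketched as below; the pitch tracker itself is not shown (unvoiced frames are assumed to carry pitch 0), and `max_jump` is an assumed smoothness tolerance:

```python
def continuous_contour_fraction(pitch_track, max_jump=10.0):
    """Fraction of frame transitions whose pitch moves smoothly (within
    max_jump Hz of the previous voiced frame). Unvoiced frames are 0."""
    if len(pitch_track) < 2:
        return 0.0
    cont = 0
    for prev, cur in zip(pitch_track, pitch_track[1:]):
        if prev > 0 and cur > 0 and abs(cur - prev) <= max_jump:
            cont += 1
    return cont / (len(pitch_track) - 1)

def is_speech(pitch_track, threshold=0.20):
    """Speech shows roughly 30%-55% continuous contour, music/silence 1%-15%,
    so anything above the 20% threshold is labelled speech."""
    return continuous_contour_fraction(pitch_track) > threshold
```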
(Figure: frequency vs. number of frames, speech vs. music)
Summary
ViaVoice training experiments
Speech recognition editing tool
Dynamic time index alignment
Audio scene change detection
Speech classification
Integrated the above functions into a speech
processor
Future Work
Classify indoor and outdoor news for
further processing of the video clips
Train gender-dependent models for the
ViaVoice engine, which may increase the
recognition accuracy