Speech Recognition for Digital Video (.ppt)

JamesBond241 6 views 35 slides Jul 10, 2024
Slide Content

LYU0103
Speech Recognition
Techniques for
Digital Video Library
Supervisor: Prof. Michael R. Lyu
Students: Gao Zheng Hong
Lei Mo

Outline of Presentation
Project objectives
ViaVoice recognition experiments
Speech recognition editing tool
Audio scene change detection
Speech classification
Summary

Our Project Objectives
Audio information retrieval
Speech recognition

Last Term’s Work
Extracted the audio channel (stereo, 44.1 kHz) from MPEG video files into wave files (mono, 22 kHz)
Segmented the wave files into sentences by detecting their frame energy
Developed a visual training tool
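The frame-energy segmentation step can be sketched as follows; the frame size and energy threshold below are illustrative assumptions, not the project's actual parameters.

```python
# Sketch of sentence segmentation by frame energy (illustrative values).

def frame_energy(samples, frame_size=220):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_size]) / frame_size
        for i in range(0, len(samples), frame_size)
    ]

def segment_by_energy(samples, frame_size=220, threshold=0.01):
    """Return (start_frame, end_frame) runs whose energy exceeds threshold."""
    energies = frame_energy(samples, frame_size)
    segments, start = [], None
    for i, e in enumerate(energies):
        if e > threshold and start is None:
            start = i
        elif e <= threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(energies)))
    return segments

# Silence, a loud burst, then silence again -> one segment.
signal = [0.0] * 440 + [0.5] * 440 + [0.0] * 440
print(segment_by_energy(signal))  # -> [(2, 4)]
```

A real segmenter would also merge nearby segments and pad their boundaries so sentence onsets are not clipped.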

Visual Training Tool
Video Window; Dictation Window; Text Editor

IBM ViaVoice Experiments
Employed 7 student helpers
Produced transcripts of 77 news video clips
Four experiments:
Baseline measurement
Trained model measurement
Slow-down measurement
Indoor news measurement

Baseline Measurement
To measure the ViaVoice recognition accuracy on TVB news video
Testing set: 10 video clips
The segmented wave files are dictated
Employ the Hidden Markov Model Toolkit (HTK) to examine the accuracy
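HTK scores recognition as Acc = (N - S - D - I) / N, where the substitution, deletion, and insertion counts come from a minimum-edit-distance alignment of the hypothesis against the reference transcript. A small sketch of that computation:

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions
    needed to turn ref into hyp (unit costs)."""
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[j] = min(d[j] + 1,        # deletion
                       d[j - 1] + 1,    # insertion
                       prev + cost)     # match / substitution
            prev = cur
    return d[-1]

def word_accuracy(reference, hypothesis):
    """HTK-style accuracy: (N - S - D - I) / N over reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    return (len(ref) - edit_distance(ref, hyp)) / len(ref)

print(word_accuracy("the news at six tonight", "the news at six"))  # -> 0.8
```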

Trained Model Measurement
To measure the accuracy of ViaVoice after training it on its correctly recognized words
10 video clips are segmented and dictated
The correctly dictated words of the training set are used to train ViaVoice via the SMAPI function SmWordCorrection
Repeat the procedure of the “baseline measurement” after training to get the recognition performance
Repeat the procedure using 20 video clips

Slow Down Measurement
Investigate the effect of slowing down the audio channel
Resample the segmented wave files in the testing set by ratios of 1.05, 1.1, 1.15, 1.2, 1.3, 1.4, and 1.6
Repeat the procedure of the “baseline measurement”
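The slides do not name the resampling tool used. A naive stand-in, shown only to make the slow-down idea concrete, is to rewrite the wave file with a lower declared sample rate, which stretches time (and lowers pitch) by the given ratio:

```python
import io
import wave

def slow_down(src, dst, ratio):
    """Slow playback by `ratio` by lowering the declared sample rate
    (a naive stand-in for true resampling; it also lowers pitch)."""
    with wave.open(src, "rb") as w:
        params = w.getparams()
        frames = w.readframes(w.getnframes())
    with wave.open(dst, "wb") as w:
        w.setparams(params._replace(framerate=int(params.framerate / ratio)))
        w.writeframes(frames)

# Demo: one second of 22 050 Hz mono audio slowed by ratio 1.15.
buf_in, buf_out = io.BytesIO(), io.BytesIO()
with wave.open(buf_in, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(22050)
    w.writeframes(b"\x00\x00" * 22050)
buf_in.seek(0)
slow_down(buf_in, buf_out, 1.15)
buf_out.seek(0)
with wave.open(buf_out, "rb") as w:
    new_rate, n = w.getframerate(), w.getnframes()
print(new_rate)  # -> 19173
```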

Indoor News Measurement
Eliminate the effect of noise
Select the indoor news reporter sentences
Dictate the test set using the untrained model
Repeat the procedure using the trained model

Experimental Results
Overall Recognition Results (ViaVoice, TVB News)

Experiment                          Accuracy (max. performance)
Baseline                            25.27%
Trained model                       25.87% (with 20 videos trained)
Slow speech                         25.67% (max. at ratio = 1.15)
Indoor speech (untrained model)     35.22%
Indoor speech (trained model)       36.31% (with 20 videos trained)

Experimental Result Cont.
Result of the trained model with different numbers of training videos:

Trained videos    Untrained    10 videos    20 videos
Accuracy          25.27%       25.82%       25.87%

Result of using different slow-down ratios:

Ratio             1        1.05     1.1      1.15     1.2      1.3      1.4      1.5
Accuracy (%)      25.27    25.46    25.63    25.67    25.82    17.18    12.34    4.04

Analysis of Experimental Results
Trained model: about 1% accuracy improvement
Slowing down speeches: about 1% accuracy improvement
Indoor speeches are recognized much better
Mandarin: estimated baseline accuracy is about 70% (far higher than Cantonese)

Speech Processor
Training does not increase accuracy significantly
Manual editing of the recognition result is needed
Word timing information is also important

Editing Functionality
The recognition result is organized into basic units called “firm words”
Retrieve the timing information from the speech engine
Record the timing information of every firm word in an index
Highlight the corresponding firm word during video playback
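One way to implement the highlight-during-playback lookup is a binary search over the firm-word time index; the index layout below is a hypothetical illustration, not the processor's actual data structure.

```python
import bisect

# Hypothetical firm-word time index: (start_ms, end_ms, word), sorted by start.
index = [(0, 400, "hong"), (400, 900, "kong"), (900, 1500, "news")]
starts = [s for s, _, _ in index]

def word_at(t_ms):
    """Firm word being spoken at playback time t_ms, or None."""
    i = bisect.bisect_right(starts, t_ms) - 1
    if i >= 0 and index[i][1] > t_ms:
        return index[i][2]
    return None

print(word_at(650))  # -> kong
```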

Dynamic Time Index Alignment
While editing the recognition result, the firm word structure may change
The time index needs to be updated to match the new firm words
In the speech processor, the time index is re-aligned with the firm words whenever the user edits the text
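The slides do not spell out the alignment rule. One plausible sketch, under the assumption that an edit replaces a contiguous run of firm words, redistributes the edited span's time over the new words proportionally to word length:

```python
def realign(index, i, j, new_words):
    """Replace firm words index[i:j] with new_words, splitting the original
    time span proportionally to word length.  The split rule is a
    hypothetical choice; the slides do not specify the exact policy."""
    start, end = index[i][0], index[j - 1][1]
    total = sum(len(w) for w in new_words) or 1
    out, t = [], start
    for w in new_words:
        dur = (end - start) * len(w) // total
        out.append((t, t + dur, w))
        t += dur
    if out:  # absorb integer-division remainder into the last word
        out[-1] = (out[-1][0], end, out[-1][2])
    return index[:i] + out + index[j:]

before = [(0, 400, "hong"), (400, 900, "kong"), (900, 1500, "news")]
after = realign(before, 0, 2, ["hongkong"])
print(after)  # -> [(0, 900, 'hongkong'), (900, 1500, 'news')]
```

Merging two firm words into one keeps the combined span; splitting one word into several divides its span, so every firm word still has a playable time range.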

Time Index Alignment Example
[Figure: firm word time index before, during, and after editing]

Motivation for Doing Speech Segmentation and Classification
Gender classification can help us build gender-dependent models
Detection of scene changes from video content alone is not accurate enough, so we need audio scene change detection as an assisting tool

Flow Diagram of Audio Information Retrieval System
[Diagram] The audio signal from the news’ audio channel passes through MFCC feature extraction, then segmentation by audio scene change detection. Segments with more than 30% continuous vowel contour are classified as speech; the rest as non-speech. Speech goes to male/female classification and speaker identification/classification (by MFCC variance, by a 256-component GMM, and by clustering); non-speech goes to music pattern matching.

Feature Extraction by MFCC
Feature extraction is the first step applied to the raw audio input data
MFCC stands for “mel-frequency cepstral coefficient”
Human perception of the frequency of sound does not follow a linear scale
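The mel scale captures that non-linearity with the standard mapping mel(f) = 2595 log10(1 + f / 700):

```python
import math

def hz_to_mel(f):
    """Standard mel-scale mapping: perceived pitch vs. frequency in Hz."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used to place mel-spaced filterbank centers."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# 1000 Hz sits near 1000 mel; higher frequencies are compressed,
# mirroring human pitch perception.
print(round(hz_to_mel(1000)))  # -> 1000
```

MFCC extraction applies a filterbank spaced evenly in mel (hence unevenly in Hz) before taking the cepstral transform.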

Detection of Audio Scene Change by
Bayesian Information Criterion (BIC)
The Bayesian information criterion (BIC) is a likelihood criterion
We maximize the likelihood function separately for each model M and obtain L(X, M)
The main principle is to penalize the system by the model complexity

Detection of a single point
change using BIC
We define:
H0: x1, x2, …, xN ~ N(μ, Σ)
to be the hypothesis that the whole sequence contains no change, and
H1: x1, x2, …, xi ~ N(μ1, Σ1); xi+1, xi+2, …, xN ~ N(μ2, Σ2)
to be the hypothesis that a change occurs at time i.
The maximum likelihood ratio is defined as:
R(i) = N log|Σ| − N1 log|Σ1| − N2 log|Σ2|
where N1 = i and N2 = N − i.

Detection of a single point
change using BIC
The difference between the BIC values of the two models can be expressed as:
BIC(i) = R(i) − λP
P = (1/2)(d + d(d+1)/2) log N
where d is the dimension of the feature vectors.
If the BIC value > 0, a scene change is detected.
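A one-dimensional (d = 1) sketch of this test, using variances in place of the full covariance determinants (the synthetic data below is illustrative only):

```python
import math
import random
from statistics import pvariance

def bic_change(x, i, lam=1.0):
    """Delta-BIC for a single change at position i of a 1-D sequence:
    BIC(i) = R(i) - lambda * P, with d = 1 (variances instead of
    covariance determinants)."""
    n = len(x)
    r = (n * math.log(pvariance(x))
         - i * math.log(pvariance(x[:i]))
         - (n - i) * math.log(pvariance(x[i:])))
    p = 0.5 * (1 + 0.5 * 1 * (1 + 1)) * math.log(n)  # P for d = 1
    return r - lam * p

# A sequence whose mean jumps at position 100 scores positive there,
# and far higher than at an off-boundary split.
random.seed(0)
x = ([random.gauss(0, 1) for _ in range(100)]
     + [random.gauss(5, 1) for _ in range(100)])
print(bic_change(x, 100) > 0)
```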

Detection of multiple point
changes by BIC
a. Initialize the interval [a, b] with a = 1, b = 2
b. Detect whether there is one changing point in interval [a, b] using BIC
c. If there is no change in [a, b]:
       let b = b + 1
   else:
       let t be the changing point detected
       assign a = t + 1; b = a + 1
   end
d. Go to step (b) if necessary
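The steps above can be sketched as follows, again specialized to one-dimensional features; the minimum segment length and the synthetic signal are illustrative assumptions:

```python
import math
import random
from statistics import pvariance

def delta_bic(x, i, lam=1.0):
    """d = 1 version of BIC(i) = R(i) - lambda * P for a change at i."""
    n = len(x)
    r = (n * math.log(pvariance(x))
         - i * math.log(pvariance(x[:i]))
         - (n - i) * math.log(pvariance(x[i:])))
    return r - lam * math.log(n)  # P = log N when d = 1

def detect_changes(x, min_seg=10):
    """Growing-window search: test [a, b] for one change, grow b when
    none is found, restart just after a detected change."""
    changes, a, b = [], 0, 2 * min_seg
    while b <= len(x):
        window = x[a:b]
        scores = {i: delta_bic(window, i)
                  for i in range(min_seg, len(window) - min_seg + 1)}
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            changes.append(a + best)
            a += best
            b = a + 2 * min_seg
        else:
            b += 1
    return changes

# Synthetic signal: mean 0, then mean 6, then mean 0 again.
random.seed(1)
x = ([random.gauss(0, 1) for _ in range(60)]
     + [random.gauss(6, 1) for _ in range(60)]
     + [random.gauss(0, 1) for _ in range(60)])
changes = detect_changes(x)
print(changes)
```

The `min_seg` floor keeps each tested segment large enough to estimate a variance, a practical guard the slide's [a, a+1] initialization glosses over.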

Advantages of BIC approach
Robustness
Thresholding-free
Optimality

Comparison of different algorithms
[Chart: audio scene change detection results for different algorithms]

Gender Classification
The means and covariances of male and female feature vectors are quite different
So we can model them with a Gaussian Mixture Model (GMM)
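A toy illustration of GMM-based classification: the component weights, means, and variances below are made-up one-dimensional numbers on a pitch-like feature, whereas the project fits 256-component GMMs on MFCC vectors.

```python
import math

def log_gauss(x, mu, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def log_gmm(x, components):
    """Log density of a mixture; components: (weight, mean, variance)."""
    return math.log(sum(w * math.exp(log_gauss(x, m, v))
                        for w, m, v in components))

# Hypothetical per-gender models (illustrative numbers only).
male = [(0.6, 110.0, 400.0), (0.4, 130.0, 400.0)]
female = [(0.5, 200.0, 500.0), (0.5, 230.0, 500.0)]

def classify(frames):
    """Pick the model with the higher total log-likelihood."""
    m = sum(log_gmm(f, male) for f in frames)
    f_ = sum(log_gmm(f, female) for f in frames)
    return "male" if m > f_ else "female"

print(classify([115.0, 122.0, 108.0]))  # -> male
```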

Male/Female Classification
[Figure: frequency count vs. feature value for male and female]

Gender Classification

Music/Speech classification by
pitch tracking
Speech has a more continuous pitch contour than music
A speech clip always has 30%-55% continuous contour, whereas silence or music has 1%-15%
Thus, we choose >20% continuous contour as the threshold for speech
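The continuity rule can be sketched as the fraction of adjacent pitch frames with a small jump; the 10 Hz jump limit and the sample tracks are illustrative assumptions.

```python
def continuity(pitch_track, max_jump=10.0):
    """Fraction of adjacent frames whose pitch changes by < max_jump Hz."""
    pairs = list(zip(pitch_track, pitch_track[1:]))
    smooth = sum(1 for a, b in pairs if abs(a - b) < max_jump)
    return smooth / len(pairs)

def is_speech(pitch_track, threshold=0.20):
    """The slide's rule: speech if more than 20% of the contour is continuous."""
    return continuity(pitch_track) > threshold

# A speech-like contour moves in small steps; a music-like one leaps.
speech_like = [120, 122, 125, 300, 118, 121, 0, 119, 123, 126]
music_like = [440, 220, 880, 330, 660, 110, 550, 990, 770, 440]
print(is_speech(speech_like), is_speech(music_like))  # -> True False
```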

[Figure: frequency vs. number of frames, speech and music]

Summary
ViaVoice training experiments
Speech recognition editing tool
Dynamic time index alignment
Audio scene change detection
Speech classification
Integrated the above functions into a speech
processor

Future Work
Classify indoor news and outdoor news for further processing of the video clips
Train gender-dependent models for the ViaVoice engine; this may increase the recognition accuracy