Pushing Intelligence to Edge Nodes: Low-Power Circuits for Self-Localization and Speaker Recognition
Nick Iliev
Presented to: Prof. Trivedi, Prof. Paprotny, Prof. Rao, Prof. Metlushko, Prof. Zheng
Intelligence at the edge nodes: Applications
Internet-of-acoustic-things
Simultaneous Localization and Mapping (SLAM)
Autonomous vehicles
Wearables
Focus of this Research
•Develop ultralow-power computing platforms for:
–Speaker recognition hardware accelerator
–Localization hardware accelerator
•Low-power neural network implementations
•Low-power GMM-based speaker recognition
•Runtime adaptation: depending on battery state and performance, select the processing clock frequency and/or the number of quantization bits to use.
Ultralow-power spatial localization
Publications:
•1. On-board accelerators based on massively parallel neural networks (Recurrent Neural Networks, RNNs) for coordinate computation and localization. Initial results published in IEEE ICCD 2017.
•2. Non-RNN image-based localization and coordinate mapping (registration): published in IEEE ISM 2016.
•3. Review and Comparison of Spatial Localization Methods for Low-Power Wireless Sensor Networks: IEEE Sensors Journal 2015.
Spatial Localization: Centralized vs Distributed
[Diagram: cloud, user, server, an anchor sink node, the nearest anchor nodes (shaded), and unknown-location IoT nodes]
•Anchor nodes broadcast their own locations and route data to the sink nodes.
•An unknown-location node (IoT node) receives the anchors' locations and calculates its own location from its measurements.
•In centralized algorithms, the server receives all measurements and calculates the locations of all unknown nodes.
•In distributed algorithms, the server only stores the locations of all nodes; each unknown node computes its own location and broadcasts it to the network.
Distributed computational load to anchors
•The load at the anchors increases with the number of unknown nodes when there is no RNN capability at the unknown nodes (= computational load increase).
Distributed computational load to anchors (with RNN accelerators)
•Each unknown node computes its own location with an RNN accelerator, so the load at the anchors decreases. The RNN accelerator offloads the CPU and reduces power and latency; no off-line training is needed.
[Diagram legend: arrows mark computational load increase at the anchors vs load decrease when each unknown node carries an RNN block]
Spatial localization in 2D – AOA Geometry
•Two or more anchors illuminate each unknown node.
•Centralized: measure Φ1, Φ2 and transmit to the server; receive own (x, y) from the server.
•Decentralized (self-localization): measure Φ1, Φ2, compute own (x, y), and transmit (x, y) to the server; this saves communication bandwidth and power.
[Figure: anchor 'R1' at (X1, Y1) and anchor 'R2' at (X2, Y2) observe sensor 'U' of unknown location at angles of arrival (AOA) Φ1 and Φ2 in the X-Y plane]
Spatial Localization in 2D: Applications
[Figure: AOA sensor distribution of fields for a sensor with 12 photodetectors]
❑Most implementations use a CPU with matrix / linear-algebra hardware accelerators.
❑A few use a Recurrent Neural Network (RNN) in hardware/software:
S. Li, S. Chen, Y. Lou, B. Lu, and Y. Liang, "A Recurrent Neural Network for Inter-Localization of Mobile Phones," in Proc. IEEE-WCCI, Jun. 10-15, 2012.
•Recurrent Neural Network (RNN) hardware/software embedded accelerators are compared in Mop/s/W.
[Chart: current RNN solutions, up to 128 neurons, performance per unit power in Mop/s/W (0-300 range)]
Spatial Localization in 2D – my RNN Solution
•Formulate 2D AOA localization as a constrained primal-dual linear program.
•Solve it with an RNN, scaling from 2 to 128 neurons.
•Primal: minimize Cᵀθ subject to G·θ = H, θ ≥ 0
•Dual: maximize Hᵀφ subject to Gᵀφ ≤ C
The RNN model for solving the above system is:
•d/dt [θ; φ] = −[ θ − (θ + Gᵀφ − C)⁺ ;  G·(θ + Gᵀφ − C)⁺ − H ]
•Here, for a variable w, (w)⁺ = max(w, 0).
Localization in 2D – Discrete-time RNN
•We control the convergence rate via dt, which is implemented as a fixed-point fraction in Q15.17 format. All arithmetic operations in the data path also use the Q15.17 format.
•[θ(k+1); φ(k+1)] = [θ(k) + dt·(r(k) − θ(k)); φ(k) + dt·(H − G·r(k))], where r(k) = max[θ(k) + Gᵀφ(k) − C, 0].
•The min-cost-function coefficients C in the above primal problem can be chosen at random, since the primary goal is to solve for the primal state θ (plotted as q1, q2 in the convergence results).
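To make the update concrete, here is a minimal floating-point MATLAB sketch of the discrete-time primal-dual iteration; the matrix G, vector H, and the step count are placeholder values for illustration, not the AOA system or the hardware parameters:

```matlab
% Minimal sketch of the discrete-time primal-dual RNN update (placeholder data).
G  = [1.0 0.2; 0.1 1.0];        % constraint matrix (placeholder)
H  = [0.6; 1.0];                % right-hand side (placeholder)
C  = rand(2,1);                 % min-cost coefficients, chosen at random
dt = 0.01;                      % step size (a Q15.17 fraction in the hardware)

theta = zeros(2,1);             % primal state theta(k)
phi   = zeros(2,1);             % dual state phi(k)
for k = 1:4000
    r     = max(theta + G.'*phi - C, 0);    % hidden variable r(k)
    theta = theta + dt*(r - theta);         % primal update
    phi   = phi   + dt*(H - G*r);           % dual update
end
fprintf('primal state: [%.3f %.3f], residual ||G*theta - H|| = %.2e\n', ...
        theta(1), theta(2), norm(G*theta - H));
```

At a fixed point, r = θ and G·θ = H with Gᵀφ ≤ C and complementary slackness, which is why the iteration settles at the LP solution.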
Localization in 2D – Digital RNN Architecture
[Block diagram: register-adder loops hold the primal state θ(k) and the dual state φ(k); matrix-product evaluators compute Gᵀφ(k+1) and G×r(k); an adder forms Gᵀφ(k+1) + θ(k+1) − C, which a comparator against 0 clips to produce the hidden variable r(k+1) (RNN block); multipliers apply the step size dt, and an adder forms H − G×r(k) for the dual update. The datapaths are replicated ×2 up to ×M for M neurons. Outputs: the primal solution and the dual solution.]
Characterization of FPGA-based Localization
Platform: ProASIC3E A3PE3000
Combinatorial cells: 24946
Sequential cells (DFFs): 1453
Max clock frequency: 31.45 MHz
Power dissipation, core at 1.5 V: 180 mW
Power dissipation, core (1.5 V) and I/O pads (3.3 V): 301.219 mW
Digital RNN Architecture
•Characterization of ASIC-based localization – PDK45, 1 V VDD
•HSpice simulations with a netlist from Cadence Virtuoso were used to compute the average power dissipation with a 1 V supply, by measuring the total current drawn from the supply over a 3.2 μs period.
Design technology: NCSU PDK 45 nm
Combinatorial cells: 51890
Sequential cells (DFFs): 962
Max clock frequency: 516 MHz
Total power dissipation at VDD = 1 V: 6.15 mW
Simulated Performance – Mop/s/W
[Chart: performance per unit power (Mop/s/W, log scale 1-1000) of different embedded RNN realizations, the higher the better: RNN PDK45 (this work), RNN FPGA (this work), LSTM HW 2x Zynq FPGA, LSTM HW Zynq FPGA, Zynq ZC7020 CPU, Exynos5422 4x Cortex-A7, Exynos5422 4x Cortex-A15, Tegra TK1 GPU, Tegra TK1 CPU]
FPGA, 128 neurons (accounting for AOA measurements from 128 anchors): 13 Mop/s/W with a 31.25 MHz processing clock.
PDK45: 677.165 Mop/s/W with a 516 MHz processing clock.
A. Chang, B. Martini, E. Culurciello, "Recurrent neural networks hardware implementation on FPGA," IJAREEIE, vol. 5, no. 1, pp. 401-409, Jan. 2016.
Simulated RNN state convergence
[Plot: primal states q1 (blue) and q2 (red) and dual states f1 (magenta) and f2 (black) vs time steps, in multiples of dt = 0.01; the inset shows steps 400 to 1400]
Simulated convergence: q1 (blue) and q2 (red) are the 2D (x, y) coordinates. Solid lines are from the MATLAB reference simulation; dashed lines are from the Q17.15 fixed-point Verilog simulation.
Estimates with Noisy Measurements – 1
Error in the X and Y estimates against increasing measurement noise. Noise in the measurement angles β1 and β2 is normally distributed. The error in the X and Y estimates is defined as the sum of absolute differences between true and estimated coordinates. Each point is an average over 100 runs.
Estimates with Noisy Measurements – 2
Histogram of the estimated X and Y coordinates (normalized to 1).
Localization in 2D – Digital RNN Result Summary
•The proposed 2D AOA localization architecture uses a digital fixed-point RNN with a scalable number of neurons (2 to 128) in the hidden layer. The largest overdetermined system has 128 neurons, for AOA measurements from 128 anchors.
•The RNN solves a primal-dual LP program for the target's x, y coordinates.
Future Work in Localization
Localization with Digital RNN
•Reduce the power consumption of the HSpice netlist: apply power-gating PMOS / NMOS transistor techniques.
•Reduce the power consumption of the Verilog gate-level netlist by aggressive clock gating, arithmetic operand gating, and imprecise add/mult bit-widths with acceptable error bounds.
•Apply the RNN to 3D localization: a 3x3 primal/dual LP with 3 neurons for the basic 3x3 system; scale to Nx3 for overdetermined systems, where N = 3, 6, 9, etc.
•Compare the digital RNN solution with an analog OTA-based solution (backup slides).
Ultralow-power speaker recognition
Publications:
1. Paper to be submitted to IEEE ICCD 2018
Text-Independent Speaker Recognition
•Gaussian mixture model (GMM)-based speaker probability extraction
•Feature extraction as Mel-frequency cepstral coefficients (MFCCs)
IoT Device – Text-Independent Speaker Recognition
•The Classification block above is a maximum-likelihood GMM-based classifier, with all computations in the log domain; p(·|λi) is a speaker's GMM scored at each MFCC vector x1 ... xT.
Ref: D. Reynolds, 1995 Ph.D. Thesis
IoT Device – Text-Independent Speaker Recognition
•Example digital system for GMM scoring, up to the log domain (up to the Log-Sum of Exponents, LSE); see the backup slides for the GMM matrix equation. Simulated in floating-point MATLAB: 16 clocks to score one 12-dimensional z centroid for mixture GMM_i.
[Block diagram: GMM component i scoring (evaluation) for an incoming 1x12, 16-bit two's-complement Q(16,14) MFCC vector. The audio stream produces an MFCC vector of 12 jointly Gaussian random variables (X22_Reg1[11:0][15:0]). Load_GMM_i_params supplies Mu_[11:0][15:0] and Inv_sigma[11:0][15:0]. Stage 1 computes the element-wise differences Sub_0 ... Sub_11 (Sub_vec[11:0][15:0]); Stage 2 squares them (Sqr_0 ... Sqr_11, Sqr_sub[11:0][15:0]); Stage 3 multiply-accumulates the 12 products into Accum_Reg1[15:0] under accumulator control (12 iterations of mult-add) and passes accum[15:0] to the Log_Sum (LSE) domain.]
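The three stages compute the exponent of one weighted Gaussian component in the log domain. A minimal floating-point MATLAB sketch of that per-component computation is below; the dimensions follow the slide, but the vector values and the mixture weight are placeholders, and log(k_i) collects the weight and Gaussian normalization constant as on the next slide:

```matlab
% Sketch: log-domain score of one MFCC vector x against GMM component i
% (diagonal covariance), mirroring the subtract / square / mult-add stages.
d     = 12;                 % MFCC dimension
x     = randn(d,1);         % incoming MFCC vector (placeholder)
mu_i  = randn(d,1);         % component mean (placeholder)
var_i = 0.5 + rand(d,1);    % component variances, diagonal (placeholder)
w_i   = 0.05;               % mixture weight (placeholder)

inv_sigma = 1 ./ var_i;                          % Inv_sigma block
sub_vec   = x - mu_i;                            % Stage 1: Sub_0 .. Sub_11
sqr_sub   = sub_vec .^ 2;                        % Stage 2: Sqr_0 .. Sqr_11
accum     = sum(sqr_sub .* inv_sigma);           % Stage 3: 12 mult-adds

% Per-component constant, precomputed offline: log(k_i)
log_k_i     = log(w_i) - 0.5*( d*log(2*pi) + sum(log(var_i)) );
log_score_i = log_k_i - 0.5*accum;               % component log-likelihood to the LSE unit
disp(log_score_i);
```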
GMM scoring – in the log domain (Log-Sum of Exponents, LSE, domain); simulated in floating-point MATLAB
[Block diagram: the M accumulator (accum) outputs for x1 ... xM (accum1 ... accum20) enter the log-LSE unit. Pre-computed constants Log(k1) ... Log(kM) are added element-wise to form x1_new ... xM_new. A sorter (systolic bubble sort) finds the maximum element of x1_new ... xM_new in M cycles while an M-deep FIFO holds x1_new ... xM_new; the saved maximum x_max is then subtracted element-wise (xi_new − x_max, i = 1 ... M) and Register_sub feeds the exp unit. 20 clocks for M = 20 mixtures.]
GMM scoring – in the log domain (LSE); simulated in floating-point MATLAB
Total_1z = 16 + 20 + 5 = 41 clocks to score 1 z centroid with GMM_i
Total_40z = 4 * 41 = 164 clocks to score all 40 z centroids with all GMMs
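A minimal MATLAB sketch of the log-sum-exponents step that the sorter / subtract / exp stages implement (find the maximum, subtract it element-wise, exponentiate, accumulate, and add the maximum back); the M per-component log-scores are placeholder values:

```matlab
% Sketch: log-sum-exp (LSE) over M mixture components using the
% max-subtraction scheme realized by the sorter / subtract / exp stages.
M     = 20;                          % number of mixtures
x_new = -30 + 10*rand(M,1);          % per-component log-scores log(k_i) - 0.5*accum_i (placeholder)

x_max    = max(x_new);               % sorter: find the maximum in M cycles
shifted  = x_new - x_max;            % element-wise subtract x_max
log_like = x_max + log(sum(exp(shifted)));   % exp, accumulate, log

% Reference check against direct evaluation (safe here; overflows for large |x|)
assert(abs(log_like - log(sum(exp(x_new)))) < 1e-9);
disp(log_like);
```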
GMM scoring – number of operations – power analysis estimate based on published implementations
•NCSU PDK 45 nm, Vdd = 1.1 V, published implementations:
•One 16-bit carry-skip add = 20 uW (ref 1), clk 50 MHz, delay 20 nsec
•One 16x16 array mult = 55 uW (ref 2), clk 1.234 GHz, delay 0.824 nsec
•3-way magnitude comparator = 40 uW (ref 3), clk 1.2 GHz, delay 0.833 nsec
•SRAM, 4 Kb, read-access dynamic = 350 uW (leakage 800 uW) (ref 4), clk 250 MHz
[Bar chart: power for a GMM with 20 mixtures, one MFCC frame (12 random-variable features) scored; adds, mults, comparisons, and lookups for 1 GMM / 1 speaker vs 38 GMMs / 38 speakers]
Calculated worst-case power, with all operations in each 1.234 GHz cycle:
1 GMM total = 52.48 mW
38 GMMs total = 1994 mW – HIGH!
For each block/operation, using P = α·C·V²·f with α = 1 for all blocks and f = 1.234 GHz.
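For easy checking, the worst-case totals above follow directly from the per-operation powers and the operation counts listed in the backup table; a small MATLAB calculation reproduces the 52.48 mW and 1994 mW figures:

```matlab
% Worst-case power from per-operation powers (published 45 nm blocks) and
% operation counts for scoring one MFCC frame (counts from the backup table).
p_add  = 20e-6;   p_mult = 55e-6;    % W per 16-bit add / 16x16 mult
p_comp = 40e-6;   p_sram = 350e-6;   % W per 3-way compare / SRAM lookup

ops_1  = [  517   480   26   42];    % adds, mults, compares, lookups: 1 GMM (1 speaker)
ops_38 = [19646 18240  988 1596];    % adds, mults, compares, lookups: 38 GMMs (38 speakers)
p_ops  = [p_add p_mult p_comp p_sram];

P1  = ops_1  * p_ops.';   % = 52.48e-3 W  (52.48 mW)
P38 = ops_38 * p_ops.';   % = 1.994  W    (1994 mW)
fprintf('1 GMM: %.2f mW, 38 GMMs: %.1f mW\n', 1e3*P1, 1e3*P38);
```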
GMM Scoring – Worst-case power reduction techniques – 1
•1 – Clock frequency reduction: from the (maximum) 1.234 GHz down to 1.234 MHz (divide by 1000), the total drops from 1994 mW to 1.994 mW. In a 10 msec frame, the GMM scoring pipeline above has 10 stages, or 1 msec per stage; at 1.234 MHz (810 nsec period) we have 1234 clock cycles per stage, enough clocks for all operations in a stage, while still using 16 bits for the math operations.
Now-calculated worst-case power for all 38 GMMs = 1.994 mW
GMM Scoring – Worst-case power reduction techniques – 2
•2 – Imprecise arithmetic (fewer quantization bits), starting from the 16-bit design above:
•Using 6 bits (vs 16) for all arithmetic and for MFCC quantization reduces the adder power from 20 uW to 20 uW / (16/6) = 7.5 uW, the mult to 7.73 uW, and the comparator to 5.62 uW; the new worst-case totals are:
1 speaker (GMM), 16-bit = 52.48 mW = 517*(20 uW) + 480*(55 uW) + 26*(40 uW) + 42*(350 uW)
38 speakers (GMMs), 16-bit = 1994 mW = 19646*(20 uW) + 18240*(55 uW) + 988*(40 uW) + 1596*(350 uW)
1 speaker (GMM), 6-bit = 22.4 mW = 517*(7.5 uW) + 480*(7.73 uW) + 26*(5.62 uW) + 42*(350 uW)
38 speakers (GMMs), 6-bit = 852.5 mW = 19646*(7.5 uW) + 18240*(7.73 uW) + 988*(5.62 uW) + 1596*(350 uW)
•Reducing the clock rate from 1.234 GHz to 1.234 MHz further reduces this to 0.8525 mW for 38 GMMs, and to 0.0224 mW for 1 GMM.
Now-calculated worst-case power for all 38 GMMs = 0.8525 mW
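A small MATLAB check of the reduced-precision numbers: the adder value follows the slide's linear scaling (20 uW / (16/6) = 7.5 uW), while the quoted 7.73 uW and 5.62 uW for the multiplier and comparator are consistent with quadratic (bit-width squared) scaling, which is assumed here; the SRAM lookup power is left unchanged and the 1000x clock reduction is applied at the end:

```matlab
% Where the 6-bit per-operation powers come from (scaling laws assumed to
% match the slide's quoted values), plus the 1.234 GHz -> 1.234 MHz division.
bits16 = 16;  bits6 = 6;
p_add_6b  = 20e-6 * (bits6/bits16);        % 7.5  uW (linear in bit-width)
p_mult_6b = 55e-6 * (bits6/bits16)^2;      % ~7.73 uW (quadratic, assumed)
p_comp_6b = 40e-6 * (bits6/bits16)^2;      % ~5.62 uW (quadratic, assumed)
p_ops_6b  = [p_add_6b p_mult_6b p_comp_6b 350e-6];   % SRAM lookup unchanged

ops_38  = [19646 18240 988 1596];          % adds, mults, compares, lookups (38 GMMs)
clk_div = 1000;                            % 1.234 GHz -> 1.234 MHz
P38_6b  = ops_38 * p_ops_6b.';             % ~852.5 mW at 1.234 GHz
fprintf('38 GMMs, 6-bit, 1.234 MHz: %.4f mW\n', 1e3*P38_6b/clk_div);   % ~0.8525 mW
```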
GMM Scoring – Worst-case power reduction techniques – 3
•3 – Frame decimation (downsampling): the majority of today's GMM-based systems use fixed-rate frame skipping (usually rate = 1, i.e., skip every other frame); power is saved since fewer frames are scored with all the GMMs.
IoT Device – Text-Independent Speaker Recognition: Frame decimation
•Low-power focus, FS_mode = 0 (simulator mode, no frames skipped):
A) Reduce exact GMM evaluations (scoring of GMMs) to an absolute minimum.
Simulation result from the floating-point MATLAB simulation: 500 test frames, after minimum-energy filtering; no frame skipping is done, giving 100% success but at 100% computation (every frame is scored with all GMMs, maximum power dissipation).
[Plot note: the X axis, FS_Rate, is 0 at all times (not to scale)]
IoT Device – Text-Independent Speaker Recognition: Frame decimation
•Low-power focus, FS_mode = 1 (simulator mode, skip 1, 2, 4, ... 128 frames):
•B) Reduce exact GMM evaluations (scoring of GMMs) to an absolute minimum.
Simulation result from the floating-point MATLAB simulation: 500 test frames as before; frame skipping is done, giving less than 100% success at less than 100% computation (not every frame is scored with all GMMs). This saves power through fewer computations, but lowers the recognition success rate (already below 90% when skipping every 16th frame).
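A small MATLAB sketch of the computation-vs-rate trade-off the FS_mode = 1 sweep explores; the skipping semantics assumed here (score every (R+1)-th of the 500 post-filtering frames) is only an illustration, chosen so that FS_Rate = 1 reproduces the 250-of-500 case used in the comparison table later:

```matlab
% Sketch of fixed-rate frame skipping (FS_mode = 1). Assumed semantics:
% at rate R, R frames are skipped after each scored frame, so only every
% (R+1)-th of the 500 post-filtering test frames is scored with the GMMs.
n_frames = 500;                       % test frames after min-energy filtering
rates    = [0 1 2 4 8 16];            % FS_Rate values swept in the simulator
for R = rates
    scored   = 1:(R+1):n_frames;      % indices of frames that get scored
    fraction = numel(scored) / n_frames;
    fprintf('FS_Rate = %2d -> %3d frames scored (%.0f%% of computation)\n', ...
            R, numel(scored), 100*fraction);
end
```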
IoT Device – Text-Independent Speaker Recognition: Frame decimation
•The success rate increases as fewer frames are skipped (128, 64, ... 4, 2, 1), FS_Mode = 1.
Challenge: develop an algorithm and architecture that generate the red performance curve.
Text-Independent Speaker Recognition – Clustering test frames: challenge met
•Low-power focus, FS_mode = 1 vs FS_mode = 0 with k-means clusters; but k = ?
C) Reduce exact GMM evaluations (scoring of GMMs) to an absolute minimum.
- My idea: find clusters in the 500 test frames; use batch k-means, starting with 10 clusters and incrementing by 10. Use the centroids of all clusters to score all GMMs; this "decimates" the 500 test frames to N frames (N centroids), N << 500, for each N-cluster scenario.
– Simulation result from the floating-point MATLAB simulation: for N = k = 40 clusters the success rate is already 97%, with 8% computation (% of GMMs scored).
– Classical FS_mode = 1 achieves 94% success with 21% computation.
Text-Independent Speaker Recognition – number of computations – k-means clustering
•On-line k-means clustering – number of computations to find 40 clusters. Uses the LMS-like cluster-center update below; at each time step t, each frame x1 ... xt contributes equally to determining the updated centers z1 ... zk.
•Algorithm (Lloyd's): the on-line update is sketched after this list.
•Clustering (k-means, k = 40) method, using on-line k-means with k = 40, per iteration: 40 distance computations (480 adds, 480 mults); sort 40 values (40·log(40) ≈ 64 three-way comparisons); centroid update (1 counter add, 12 sub/add, 12 divides, 12 adds).
•Total for 10 iterations: 5,050 adds, 4,800 mults, 640 three-way comparisons, 120 divides. The above MATLAB simulation for k = 40 clusters converges in 10 iterations.
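A minimal MATLAB sketch of the on-line (LMS-update) k-means pass referenced above; the 500 MFCC frames are placeholder random data, and initializing the centers from the first k frames is an assumption made only for this sketch:

```matlab
% Minimal sketch of the on-line k-means pass used to "decimate" the test
% frames: for each frame, find the closest center and apply the LMS-like
% update z_i <- z_i + (1/n_i)*(x - z_i).
k      = 40;  d = 12;
frames = randn(500, d);                 % 500 post-filtering MFCC frames (placeholder)
z      = frames(1:k, :);                % initialize centers from the first k frames (assumption)
n      = ones(k, 1);                    % per-center hit counts n_1 .. n_k

for it = 1:10                           % the slide's simulation converges in ~10 passes
    for t = 1:size(frames, 1)
        x       = frames(t, :);
        dist2   = sum((z - repmat(x, k, 1)).^2, 2);   % squared Euclidean distances
        [~, i]  = min(dist2);                         % closest center (sorter stage in hardware)
        n(i)    = n(i) + 1;
        z(i, :) = z(i, :) + (1/n(i)) * (x - z(i, :)); % LMS update of the winning center
    end
end
% z now holds the k = 40 centroids that are scored against all GMMs
```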
Text-Independent Speaker Recognition – number of computations – GMM scoring with k centroids
•Compare computations for FS_mode = 1 vs computations for FS_mode = 0 with clustering and GMM scoring with k centroids:

                                          Adds     Mults    3-way comparisons   Lookups   Divisions
FS_mode = 1, 250 frames from 500,
1 GMM scored                             129250   120000          6500            10500        0
On-line k = 40 clusters from 500
frames, 1 GMM scored                      25730    24000          1680             1680      120
Text-Independent Speaker Recognition – number of computations – GMM scoring with k centroids
[Bar chart: number of operations (adds, mults, 3-way comparisons, lookups, divisions; 0-140000 range) for FS_mode = 1 vs FS_mode = 0 with k = 40 clusters]
Text-Independent Speaker Recognition – GMM scoring with k centroids – worst-case power analysis
•Using 6 bits (vs 16) for all arithmetic and for MFCC quantization; clock reduced from 1.234 GHz to 1.234 MHz.
•From slide 30, scoring all 38 GMMs with 1 frame dissipates 0.8525 mW; scale the power for rate = 1, 2, 4, 8, 16 with FS_mode = 1.
•Then scale the power for 10, 20, 30, 40, 50 centroids (frames) with FS_mode = 0.
•My worst-case estimate is 34 mW for FS_mode = 0 with k = 40 centroids.
•Competitive with the 54 mW design by G. He, "A 40-nm 54-mW 3x-Real-Time VLSI Processor for 60-Kword Continuous Speech Recognition."
•The state of the art is 6 mW: M. Price 2017, "A 6 mW 5000-Word Real-Time Speech Recognizer Using WFST Models."
•Not an apples-to-apples comparison, since in speech recognition the decoder's active-list feedback selects 1 GMM and the GMMs model senones rather than speakers; it is similar in that GMM scoring makes up the bulk of all computations.
Text-Independent Speaker Recognition – hardware for on-line k-means
[Block diagram: an FSM coordinates a counter block (n1 ... nk), a block storing the k cluster centers (z1 ... zk), a "find closest z_i to x_t" block, and an "update z_i" block; the new test data vector x_t arrives at time t, and the winning counter value n_i feeds the update.]
Text-Independent Speaker Recognition – hardware for on-line k-means – detail of the Euclidean-distance (closest-center) block
[Block diagram, stages 5 and 6: the reference bus is driven with z_i, i = 1 ... 40, and the test bus with x_t; the Euclidean distance (z_i − x_t) is sent to the sorter unit.]
•6 clocks to compute 1 Euclidean distance between the 40 12-dimensional z centroids and the incoming x vector.
Text-Independent Speaker Recognition – hardware for linear-time sorting of K words
[Block diagram: a chain of systolic sorting cells (sorting_cell_0 ... sorting_cell_39), each holding state, cell_data, prev_data_is_pushed, and data_is_pushed; a 32-bit unsorted_data bus, clk, and shift_up feed the chain, which produces 32-bit sorted_data.]
•For on-line k-means with K = 40: 40 systolic sorting cells; the Euclidean-distance block drives 1 to 40 words onto the unsorted_data bus.
•40 clocks to sort the 40 distance values.
•Only the winning (smallest) z is used in the next stage (the LMS update stage).
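A behavioral (not cycle-accurate) MATLAB model of the systolic sorting chain, assuming each cell keeps the smaller of its stored word and the incoming word and passes the larger one along; the hardware overlaps these compares across clocks, but the end state is the same sorted list:

```matlab
% Behavioral model of the systolic sorter: each clock a new word enters the
% cell chain; every cell keeps the smaller of (stored, incoming) and passes
% the larger onward, so after K pushes cell 1 holds the minimum and the
% chain holds the K words in ascending order.
K     = 40;
cells = inf(1, K);                   % empty cells modeled as +Inf
dists = rand(1, K);                  % 40 unsorted distance values (placeholder)

for clkcyc = 1:K                     % one push per clock: K clocks for K words
    incoming = dists(clkcyc);
    for c = 1:K                      % ripple through the chain
        if incoming < cells(c)
            tmp      = cells(c);     % keep the smaller value in this cell,
            cells(c) = incoming;     % push the larger one onward
            incoming = tmp;
        end
    end
end
assert(isequal(cells, sort(dists))); % chain now holds the sorted distances
min_dist = cells(1);                 % winning (smallest) distance for the LMS stage
```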
Text-Independent Speaker Recognition – linear-time sorter Verilog simulation
[Waveform screenshot of the sorter Verilog simulation]
Text-Independent Speaker Recognition – update of the winning cluster center (LMS update stage)
•Z_{i+1} = Z_i + (1/n_i)·(X − Z_i)
•4 clocks to compute the LMS update
•Total clocks for 1 iteration = 6 + 40 + 4 = 50
•Total for 10 iterations = 500 clocks
Text-Independent Speaker Recognition – Result summary
•For the TIMIT TEST/DR1 38-speaker set, I have shown that 40 clusters from on-line k-means can achieve a 97% recognition success rate.
•I have achieved a 12.5:1 (500 to 40) reduction in the number of frames used for GMM scoring while maintaining a 97% success rate; only 40 centroids are needed.
•5:1 reduction in the number of adds and mults
•3.9:1 reduction in the number of 3-way comparisons
•6.25:1 reduction in the number of lookups
•Estimated 6:1 reduction in worst-case power (34 mW vs 213 mW)
•The above estimates assume 6-bit quantization for all parameters and MFCC data, and a 1.234 MHz processing clock for the published PDK 45 nm implementations of the arithmetic blocks.
Future Work
•Complete the fixed-point Verilog implementation of the on-line 40-cluster k-means datapath.
•Complete its integration with the GMM scoring datapath.
•Simulate the end-to-end design and characterize its performance: power, latency, success rate.
•Evaluate additional low-power techniques:
•1 – at the GMM layer, select 1 GMM to score instead of all GMMs (pruning);
•2 – deeper pipelines for the on-line clustering unit and for the GMM scoring unit, preferred over adding parallel units due to leakage-current issues at 45 nm and below;
•3 – power modes: sleep, deep-sleep, doze (last GMM used stays on, the others off).
•Scale the design to all 168 speakers in the TIMIT TEST/DRx data set.
•Publication: paper to be submitted to IEEE ICCD 2018.
Backup Slides
IoT Device – Speaker Recognition
•If speaker-recognition computations can be offloaded from the cloud processor to the edge IoT node, the cloud processor does not have to be as fast.
•Smartphone apps (Alexa, Siri, Google Assistant) generally need 1 Watt of power to process a single speech-recognition query; 100 Watts for 100 queries.
•The dominant computation in max-likelihood GMM speaker recognition is Gaussian probability estimation (scoring): from 6 mW (MIT) to 1.8 W (CMU) with GMM accelerators and MFCC frames.
•I focus on reducing this power by reducing the total number of GMM scoring operations via a frame-downsampling accelerator, processing-clock frequency reduction, and imprecise arithmetic (fewer quantization bits).
•Initial results in a paper to be submitted to IEEE MWCS 2018.
IoT Device – Localization and Self-Localization
•Goal: offload cloud-server computations to the IoT device – less network congestion and faster response times for IoT device localization.
•The IoT device has custom low-power circuits for spatial self-localization.
•2D or 3D spatial coordinates of the IoT device: on-board sensors supply data (acoustic or optical AOA to anchors, the anchors' locations) to the device's processor and accelerators; the device then computes its coordinates (in its own coordinate system) and sends them to the cloud server.
•The cloud server then does the coordinate translation and maps the IoT device onto a global absolute coordinate map; alternatively, the IoT device does the coordinate translation on-board.
•My research area: on-board accelerators based on massively parallel neural networks (Recurrent Neural Networks, RNNs) for coordinate computation.
•Initial results published in IEEE ICCD 2017.
•Additional result: non-RNN image-based localization and coordinate mapping (registration), published in IEEE ISM 2016.
For AOA measurements from M anchors, this leads to a system of linear equations,

  [sin β_i   −cos β_i] · [x; y] = sin β_i·x_i − cos β_i·y_i,   i = 1, ..., M.   (2)

In the noiseless case, the above set of linear equations is consistent. However, due to noise in the AOA measurements, the system should be solved in a least-squares sense. Therefore (2) can be written in terms of per-anchor residuals as

  e_i = [sin β_i   −cos β_i   −(sin β_i·x_i − cos β_i·y_i)] · [x; y; 1],   i = 1, ..., M.   (3)

Here, the location of the sensor is estimated as

  ŝ = (x̂, ŷ) = argmin_s Σ_{i=1}^{M} e_i².   (4)

If we write H = Σ_{i=1}^{M} e_i², the total error is minimized when ∂H/∂s = 0. However, since H ≥ 0, ∂H/∂s ≤ 0 is also a sufficient condition to minimize H. ∂H/∂s expands as

  ∂H/∂s = 2 · Σ_{i=1}^{M} e_i · [sin β_i   −cos β_i] = 0,   with s = (x, y).
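A minimal MATLAB sketch of the least-squares solve behind equations (2)-(4); the anchor positions and the true target (used only to synthesize the angles β_i) are placeholders:

```matlab
% Least-squares 2D AOA localization, eqs. (2)-(4): each anchor i at (x_i, y_i)
% measures the bearing beta_i to the target, giving the row
%   sin(beta_i)*x - cos(beta_i)*y = sin(beta_i)*x_i - cos(beta_i)*y_i.
anchors = [0 0; 10 0; 0 10; 10 10];          % anchor positions (placeholder)
target  = [5; 15];                           % true target, used only to make the betas
beta    = atan2(target(2) - anchors(:,2), ...
                target(1) - anchors(:,1));   % noiseless AOA measurements
beta    = beta + 0.01*randn(size(beta));     % add measurement noise

A = [sin(beta), -cos(beta)];                             % M x 2 system matrix
b = sin(beta).*anchors(:,1) - cos(beta).*anchors(:,2);   % right-hand side
s_hat = A \ b;                               % least-squares estimate of (x, y)
fprintf('estimated target: (%.2f, %.2f)\n', s_hat(1), s_hat(2));
```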
Localization in 2D Future Work – low-power analog OTA circuit 1 (backup)
The localization problem can also be formulated as a system of linear differential equations, as shown below.
•The following OTA circuit is proposed for solving equation (8) above; Andrea Gualco's OTA design and OTA-based localizer are compared with the RNN circuit.
Localization in 2D Future Work – low-power analog OTA circuit 3 (backup)
[Schematic: two integrating capacitors C on the nodes xs and ys, four OTAs referenced to VCM, with R1 = 1/A11, R2 = 1/A22, GM1 = A12, GM2 = A12, and current sources I1 = −B11, I2 = −B21.]
Localization in 2D Future Work – low-power analog OTA circuit 3a (backup)
•Linear coupled differential-equation circuit for 2D localization; the OTA Verilog-A model is completed and unit-tested in HSpice simulations. The plot is an example simulation output: OTA output current I(Vcm_out) vs input voltage difference V, HSpice simulation, with Vcm (common-mode voltage) = 0.5 V.
Localization in 3D Future Work – low-power analog linear-system circuit 4 (backup)
•In some 3D spatial localization cases the A matrix in the above OTA circuit may not be positive definite, hence no convergence can be achieved.
•I have a solution for this case using a linear voltage op-amp (balanced adder-subtractor) circuit.
[Schematic: three op-amp adder-subtractor stages for x, y, and z, each with resistors R1-R5 and Rf to GND, a DC source (b1/a11, b2/a22, b3/a33), and cross-coupling of the other two coordinate outputs.]
Localization in 3D Future Work – low-power analog linear-system circuit 4a (backup)
•The coefficients in these equations are derived from three measured AOA values (azimuth angles beta1 and beta2, and elevation angle gamma1) and two anchors' known positions (x1, y1, z1) and (x2, y2, z2).
•The active analog network for solving the above 3x3 system requires 3 op-amps, 3 DC voltage sources, and 18 resistors, as shown on the previous slide.
Localization in 3D Future Work – low-power analog linear-system circuit 4b (backup)
•The following 45 nm op-amp and biasing network was used, based on R. J. Baker (Reference: Baker, "CMOS Circuit Design, Layout, and Simulation," 3rd edition, sect. 24.1, Fig. 24.2).
Localization in 3D Future Work – low-power analog linear-system circuit 5 (backup)
•X coordinate = V(out) convergence = 94.532 mV × 50 = 4.73, approx. 5 (true)
•Y coordinate = V(out2) = 309.014 mV × 50 = 15.45, approx. 15 (true)
Localization in 3D Future Work – low-power analog linear-system circuit 6 (backup)
•Z coordinate = V(out3) = 274.6874 mV × 50 = 13.7, approx. 14 (true)
RNN solver – Quadratic Program (backup)
•Solving a quadratic program with the QP block, via the "Select QP or LP" mux.
[Block diagram with blocks: a matrix-vector multiply of D = [D11 D12; D21 D22] with Y1(n), Y2(n); a "Select QP or LP" mux choosing between I and I + A (QP selected); adders subtracting C1, C2 and X1(n), X2(n); a max(·, 0) block producing R1(n), R2(n); a vector-scalar multiply by dt producing dX1(n), dX2(n); and an accumulator adding X1(n−1), X2(n−1) to form X1(n), X2(n).]
IoT Device – Text-Independent Speaker Recognition
•I am focusing on speaker recognition (identification of 1 speaker from a closed set of M enrolled speakers), not verification of a speaker's claimed identity.
•GMM-based, generative stochastic models, using the open-source TIMIT database for model construction and for algorithm and hardware verification. A GMM model is built with EM for each enrolled speaker, using the speaker's training set of MFCC feature vectors (frames); this is an offline process. A typical speaker training utterance, with 10 msec frames, can have 2000 12-element MFCC vectors for GMM model building during offline training.
•During online recognition, after voice activity detection and minimum acoustic-energy filtering, about 500 12-element MFCC frames are generated by the unknown (test) speaker.
•A typical maximum-likelihood, GMM-based speaker recognition system: online recognition uses the bottom path. [Block diagram omitted]
IoT Device – Text-Independent Speaker Recognition
•GMM model of 1 speaker: a mixture of multivariate Gaussian densities,
  p(x | λ) = Σ_{k=1}^{K} P_k · N(x; μ_k, Σ_k)
•The Gaussian mixture probability density function of model (speaker) λ consists of a sum of K weighted component densities, given by the above equation. K is the number of Gaussian components, P_k is the prior probability (mixture weight) of the k-th Gaussian component, and N(x; μ_k, Σ_k) is the d-variate Gaussian density function with mean vector μ_k and covariance matrix Σ_k. The mixture weights P_k ≥ 0 are constrained as Σ_{k=1}^{K} P_k = 1.
GMM scoring – number of operations – worst case, all done in each clock cycle (activity factor = 1 for all)
MFCC frames = 1; 20 GMM mixtures per speaker.

Speakers   Adds     Mults    3-way comparisons   Lookups
1            517      480           26               42
38         19646    18240          988             1596

1 speaker (GMM) = 52.48 mW = 517*(20 uW) + 480*(55 uW) + 26*(40 uW) + 42*(350 uW)
38 speakers (GMMs) = 1994 mW = 19646*(20 uW) + 18240*(55 uW) + 988*(40 uW) + 1596*(350 uW)
IoT Device – Text-Independent Speaker Recognition
•For the above FS_Mode = 0, FS_Rate = 0 simulation: evolution of probAll, p(·|λ_s), the posterior probabilities of all speakers; the winning speaker has the smallest negative log(prob), approx. −7990; the X axis is the number of test frames.
IoT Device – Text-Independent Speaker Recognition
•For the above FS_Mode = 1, FS_Rate = 1, 2, 4, 8, 16 simulation: evolution of probAll, p(·|λ_s), the posterior probabilities of all speakers; the winning speaker has the smallest negative log(prob), approx. −820; the X axis is the number of test frames. The curves jump when FS_Rate changes; probAll is recomputed only for a new FS_Rate.
Text-Independent Speaker Recognition – Clustering test frames
•Above FS_mode = 0, FS_Rate = 0, k-means 40-cluster simulation: evolution of probAll, the posterior probabilities of all speakers over all 40 test frames (centroids); the winning speaker has the smallest negative log(prob), approx. −590; the X axis is the number of test frames.
Text-Independent Speaker Recognition – GMM scoring with k centroids – power analysis
•ref 1 – S. Sharma et al., 2015, "Design of Low Power High Speed 16 bit Adder with McCMOS in 45 nm Technology"
•ref 2 – S. Mohan et al., 2017, "An improved implementation of hierarchy array multiplier using Cs1A adder and full swing GDI logic – 45 nm PDK"
•ref 3 – P. Sharma et al., 2016, "Design Analysis of 1-bit Comparator using 45nm Technology"
•ref 4 – J. Stine et al., 2017, "A high performance multi-port SRAM for low voltage shared memory systems"
Text-Independent Speaker Recognition – number of computations – k-means clustering – table of operations
•The above on-line k-means (k = 40, on 500 test frames) clustering algorithm requires the following operations per iteration (40 squared Euclidean distances, sorting to find the minimum of 40 values, and the LMS update of the winning cluster):

Number of iterations   Adds    Mults   3-way comparisons   Divisions
1                       505      480          64               12
10                     5050     4800         640              120
Text-Independent Speaker Recognition – number of computations – GMM scoring with k centroids; table of ops; FS_Mode = 0

                                           Adds     Mults    3-way comparisons   Divisions   Lookups
10 iterations to converge to
40 frames (centroids)                      5050      4800           640              120          0
Score GMMs with 40 frames (centroids)     20680     19200          1040                0       1680
Total, k-means and GMM                    25730     24000          1680              120       1680