GlobalLogicUkraine
5,486 views
53 slides
Jun 19, 2015
Slide 1 of 53
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
About This Presentation
The presentation is dedicated to advantages and disadvantages of FPGA (Field-Programmable Gate Array): its construction and speed features, as well as security elements. It also deals with such issues as new devices synthesis and expanding the existing hardware functionality, realisation of micropro...
The presentation is dedicated to advantages and disadvantages of FPGA (Field-Programmable Gate Array): its construction and speed features, as well as security elements. It also deals with such issues as new devices synthesis and expanding the existing hardware functionality, realisation of microprocessors for specialized tasks, as well as OpenCL, a system for parallel calculations.
This presentation by Andriy Smolskyy (Lead Software Engineer, GlobalLogic) was delivered at Embedded TechTalk Lviv on June 17, 2015.
Size: 3.4 MB
Language: en
Added: Jun 19, 2015
Slides: 53 pages
Slide Content
FPGA b y Andriy Smolskyy
FPGA Programmable Logic Evolution: TTL PLA CPLD FPGA ASIC Development aspects Using FPGA for high speed data processing OpenCL
Programmable Logic is Found Everywhere!
TTL Logic Design
Digital Design with TTL Logic
Digital Design with TTL Logic
Digital Design with TTL Logic
Digital Design with TTL Logic
Digital Design with TTL Logic
From TTL to Programmable Logic General features of logic implementations Sum of products (AND-OR gates, combinatorial logic) Stored results (registered outputs) Wired together What if Logic functions were fixed (like TTL), but combined into a single device? Wiring (routing) connections could be controlled (programmed) somehow?
Programmable Array Logic (PAL) Simplest implementation of programmable logic Logic gates and registers are fixed Programmable sum of products array and output control
Programmable Logic Advantages Fewer devices required Lower cost Power savings Simpler to test and debug Design security (prevent reverse engineering) Design flexibility Automated tools simplify and consolidate design flow In-system reprogrammability ! (in some cases)
From PAL to Programmable Logic Device (PLD) Arrange multiple PAL arrays in a single device
From PLD to Complex PLD (CPLD) Combine multiple PLDs in single device with programmable interconnect and I/O
General CPLD Advantages Ample amounts of logic and advanced configurable I/ Os Programmable routing Instant on Low cost Non-volatile configuration Reprogrammable
From CPLD to FPGAs Higher density CPLDs don’t scale well because of requires additional global routing Rearrange LABs themselves into an array
Field Programmable Gate Array (FPGA) LABs arranged in an array Row and column programmable interconnect Interconnect may span all or part of the array
CPLD LABs vs. FPGA LABs FPGA LABs made up of logic elements (LEs) instead of product terms and macrocells Easier to create complex functions through LE cascading
Lookup Tables (LUTs) Replaces product term array Combinational functions created with programmed “tables” (cascaded multiplexers) LUT inputs are mux select lines
Adaptive Logic Modules (ALM) Based on LE, but includes dedicated resources & adaptive LUT (ALUT) Improves performance and resource utilization
FPGA Routing All device resources can feed into or be fed by any routing in device Differing fixed lengths to adjust for timing Scales linearly as density increases Local interconnect Connects between Les or ALMs within a LAB Can include direct connections between adjacent LABs Row and column interconnect Fixed length routing segments Span a number of LABs or entire device
Other Typical FPGA Features Embedded multipliers Useful for DSP High-performance multiply/add/accumulate operations Memory blocks High-speed transceivers Replace some LABs with dedicated functional hardware blocks PLLs SDRAM controllers Hard Processor System
System on Chip ( SoC ) + FPGA
FPGA Programming FPGA programming information must be stored somewhere to program device at power on Use external EEPROM, CPLD or CPU to program T wo programming methods Active: FPGA controls programming sequence automatically at power on Passive: Intelligent host (typically CPU) controls programming Also programmable through JTAG connection
FPGA Advantages High density to create many complex logic functions High performance Low cost Integration of many functions Many available I/O standards and features Fast programming
From FPGA to ASIC A true ASIC: no configuration at power-on required Create and test design with FPGA device Migrate design to pin–compatible, functionally equivalent ASIC device
FPGA design development Verilog Hardware Description Language VDHL - Very high speed integrated circuits Hardware Description Language
FPGA design development Schematic design development
FPGA for high speed data processing CPU data processing optimization Pipelining, parallelism OpenCL
A simple CPU B A A ALU Op Val Instruction Fetch Registers Aaddr Baddr Caddr PC Load Store LdAddr StAddr CWriteEnable C Op LdData StData Op CData
Load immediate value into register B A A ALU Op Val Instruction Fetch Registers Aaddr Baddr Caddr PC Load Store LdAddr StAddr CWriteEnable C Op LdData StData Op CData
Load memory value into register B A A ALU Op Val Instruction Fetch Registers Aaddr Baddr Caddr PC Load Store LdAddr StAddr CWriteEnable C Op LdData StData Op CData
Store register value into memory B A A ALU Op Val Instruction Fetch Registers Aaddr Baddr Caddr PC Load Store LdAddr StAddr CWriteEnable C Op LdData StData Op CData
Add two registers, store result in register B A A ALU Op Val Instruction Fetch Registers Aaddr Baddr Caddr PC Load Store LdAddr StAddr CWriteEnable C Op LdData StData Op CData
A simple program Mem[100] += 42 * Mem[101 ] CPU instructions: R0 Load Mem [100] R1 Load Mem [101] R2 Load #42 R2 Mul R1, R2 R0 Add R2, R0 Store R0 Mem [100 ]
CPU activity, step by step A A A A A R0 Load Mem [100] R1 Load Mem [101] R2 Load #42 R2 Mul R1, R2 R0 Add R2, R0 Store R0 Mem [100] A Time
Unroll the CPU hardware… A A A A A R0 Load Mem [100] R1 Load Mem [101] R2 Load #42 R2 Mul R1, R2 R0 Add R2, R0 Store R0 Mem [100] A Space
… and specialize by position A A A A A R0 Load Mem [100] R1 Load Mem [101] R2 Load #42 R2 Mul R1, R2 R0 Add R2, R0 Store R0 Mem [100] A Instructions are fixed. Remove “Fetch”
… and specialize A A A A A R0 Load Mem [100] R1 Load Mem [101] R2 Load #42 R2 Mul R1, R2 R0 Add R2, R0 Store R0 Mem [100] A Instructions are fixed. Remove “Fetch” Remove unused ALU ops
and specialize A A A A A R0 Load Mem [100] R1 Load Mem [101] R2 Load #42 R2 Mul R1, R2 R0 Add R2, R0 Store R0 Mem [100] A Instructions are fixed. Remove “Fetch” Remove unused ALU ops Remove unused Load / Store
… and specialize R0 Load Mem [100] R1 Load Mem [101] R2 Load #42 R2 Mul R1, R2 R0 Add R2, R0 Store R0 Mem [100] Instructions are fixed. Remove “Fetch” Remove unused ALU ops Remove unused Load / Store Wire up registers properly! And propagate state.
… and specialize R0 Load Mem [100] R1 Load Mem [101] R2 Load #42 R2 Mul R1, R2 R0 Add R2, R0 Store R0 Mem [100] Instructions are fixed. Remove “Fetch” Remove unused ALU ops Remove unused Load / Store Wire up registers properly! And propagate state. Remove dead data.
… and specialize R0 Load Mem [100] R1 Load Mem [101] R2 Load #42 R2 Mul R1, R2 R0 Add R2, R0 Store R0 Mem [100] Instructions are fixed. Remove “Fetch” Remove unused ALU ops Remove unused Load / Store Wire up registers properly! And propagate state. Remove dead data. Reschedule!
So what ? Load Load Store 42 FPGA datapath = Your algorithm, in silicon Build exactly what you need: Operations Data widths Memory size, configuration Efficiency: Throughput / Latency / Power
OpenCL Programming Model Accelerator Local Mem Global Mem Local Mem Local Mem Local Mem Accelerator Accelerator Accelerator Processor Accelerator Local Mem Global Mem Local Mem Local Mem Local Mem Accelerator Accelerator Accelerator Processor Host Accelerator Local Mem Global Mem Local Mem Local Mem Local Mem Accelerator Accelerator Accelerator Processor __kernel void sum( __global float *a, __ global float *b, __ global float *y) { int gid = get_global_id (0); y[ gid ] = a[ gid ] + b[ gid ]; } main () { read_data ( … ); maninpulate ( … ); clEnqueueWriteBuffer ( … ); clEnqueueNDRange ( …,sum,…); clEnqueueReadBuffer ( … ); display_result ( … ); } Host + Accelerator Programming Model Sequential Host program on microprocessor Function offload onto a highly parallel accelerator device
OpenCL FPGA is NOT just ‘C’-to-HW IP IP EMIF IP Processor Rest of the System ? HLS void F(...) { # pragma ... for ( int i ...) { #pragma ... for ( int j ...) { #pragma ... } } } RTL OpenCL kernel void F(...) { for ( int i ...) { for ( int j ...) { } } } ? Complete Platform C-to-HW tools Standard OpenCL Users Hardware Designers Target FPGA Only FPGA Expertise Yes Timing Closure Manual Users Software Programmers Target Complete Platforms FPGA Expertise No Timing Closure Automatic
Traditional OpenCL Data Parallelism OpenCL kernels expresses parallelism explicitly __kernel void sum( __global const float *a, __global const float *b, __global float *answer) { int xid = get_global_id (0); answer[ xid ] = a[ xid ] + b[ xid ]; } for ( int i =0; i < n; i ++) { answer[ i ] = a[ i ] + b[ i ]; } Host Code Kernel Code setup_memory_buffers (); transfer_data_to_fpga (); size_t global_size = {N, 1, 1}; clEnqueueNDRangeKernel ( sum_kernel , .., & global_size , ..); read_data_from_fpga ();
Loop Pipelining To achieve acceleration, we can pipeline each iteration of the loop Analyze any dependencies between iterations Schedule these operations Launch the next iteration as soon as possible float array[M]; for ( int i =0; i < n* numSets ; i ++) { for ( int j=0; j < M-1; j++) array[j] = array[j+1]; array[M-1] = a[ i ]; for ( int j=0; j < M; j++) answer[ i ] += array[j] * coefs [j]; } At this point, we can launch the next iteration
Loop Pipelining Example No Loop Pipelining i0 i1 i2 i0 i1 i2 i3 i4 Looks almost like parallel thread execution! With Loop Pipelining