What is FPGA?

GlobalLogicUkraine 5,486 views 53 slides Jun 19, 2015
Slide 1
Slide 1 of 53
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37
Slide 38
38
Slide 39
39
Slide 40
40
Slide 41
41
Slide 42
42
Slide 43
43
Slide 44
44
Slide 45
45
Slide 46
46
Slide 47
47
Slide 48
48
Slide 49
49
Slide 50
50
Slide 51
51
Slide 52
52
Slide 53
53

About This Presentation

The presentation is dedicated to advantages and disadvantages of FPGA (Field-Programmable Gate Array): its construction and speed features, as well as security elements. It also deals with such issues as new devices synthesis and expanding the existing hardware functionality, realisation of micropro...


Slide Content

FPGA b y Andriy Smolskyy

FPGA Programmable Logic Evolution: TTL  PLA  CPLD  FPGA  ASIC Development aspects Using FPGA for high speed data processing OpenCL

Programmable Logic is Found Everywhere!

TTL Logic Design

Digital Design with TTL Logic

Digital Design with TTL Logic

Digital Design with TTL Logic

Digital Design with TTL Logic

Digital Design with TTL Logic

From TTL to Programmable Logic General features of logic implementations Sum of products (AND-OR gates, combinatorial logic) Stored results (registered outputs) Wired together What if Logic functions were fixed (like TTL), but combined into a single device? Wiring (routing) connections could be controlled (programmed) somehow?

Programmable Array Logic (PAL) Simplest implementation of programmable logic Logic gates and registers are fixed Programmable sum of products array and output control

Programmable Logic Advantages Fewer devices required Lower cost Power savings Simpler to test and debug Design security (prevent reverse engineering) Design flexibility Automated tools simplify and consolidate design flow In-system reprogrammability ! (in some cases)

From PAL to Programmable Logic Device (PLD) Arrange multiple PAL arrays in a single device

From PLD to Complex PLD (CPLD) Combine multiple PLDs in single device with programmable interconnect and I/O

General CPLD Advantages Ample amounts of logic and advanced configurable I/ Os Programmable routing Instant on Low cost Non-volatile configuration Reprogrammable

From CPLD to FPGAs Higher density CPLDs don’t scale well because of requires additional global routing Rearrange LABs themselves into an array

Field Programmable Gate Array (FPGA) LABs arranged in an array Row and column programmable interconnect Interconnect may span all or part of the array

CPLD LABs vs. FPGA LABs FPGA LABs made up of logic elements (LEs) instead of product terms and macrocells Easier to create complex functions through LE cascading

Lookup Tables (LUTs) Replaces product term array Combinational functions created with programmed “tables” (cascaded multiplexers) LUT inputs are mux select lines

Adaptive Logic Modules (ALM) Based on LE, but includes dedicated resources & adaptive LUT (ALUT) Improves performance and resource utilization

FPGA Routing All device resources can feed into or be fed by any routing in device Differing fixed lengths to adjust for timing Scales linearly as density increases Local interconnect Connects between Les or ALMs within a LAB Can include direct connections between adjacent LABs Row and column interconnect Fixed length routing segments Span a number of LABs or entire device

Other Typical FPGA Features Embedded multipliers Useful for DSP High-performance multiply/add/accumulate operations Memory blocks High-speed transceivers Replace some LABs with dedicated functional hardware blocks PLLs SDRAM controllers Hard Processor System

System on Chip ( SoC ) + FPGA

FPGA Programming FPGA programming information must be stored somewhere to program device at power on Use external EEPROM, CPLD or CPU to program T wo programming methods Active: FPGA controls programming sequence automatically at power on Passive: Intelligent host (typically CPU) controls programming Also programmable through JTAG connection

FPGA Advantages High density to create many complex logic functions High performance Low cost Integration of many functions Many available I/O standards and features Fast programming

From FPGA to ASIC A true ASIC: no configuration at power-on required Create and test design with FPGA device Migrate design to pin–compatible, functionally equivalent ASIC device

FPGA design development Verilog Hardware Description Language VDHL - Very high speed integrated circuits Hardware Description Language

FPGA design development Schematic design development

Dedicated FPGA design block Core IP SDRAM Controller Ethernet PHY, Custom Transceiver PHY PCIe PHY SDi , Display Port Megafunctions PLL I/O Custom logic blocks

FPGA for high speed data processing CPU data processing optimization Pipelining, parallelism OpenCL

A simple CPU B A A ALU Op Val Instruction Fetch Registers Aaddr Baddr Caddr PC Load Store LdAddr StAddr CWriteEnable C Op LdData StData Op CData

Load immediate value into register B A A ALU Op Val Instruction Fetch Registers Aaddr Baddr Caddr PC Load Store LdAddr StAddr CWriteEnable C Op LdData StData Op CData

Load memory value into register B A A ALU Op Val Instruction Fetch Registers Aaddr Baddr Caddr PC Load Store LdAddr StAddr CWriteEnable C Op LdData StData Op CData

Store register value into memory B A A ALU Op Val Instruction Fetch Registers Aaddr Baddr Caddr PC Load Store LdAddr StAddr CWriteEnable C Op LdData StData Op CData

Add two registers, store result in register B A A ALU Op Val Instruction Fetch Registers Aaddr Baddr Caddr PC Load Store LdAddr StAddr CWriteEnable C Op LdData StData Op CData

A simple program Mem[100] += 42 * Mem[101 ] CPU instructions: R0  Load Mem [100] R1  Load Mem [101] R2  Load #42 R2  Mul R1, R2 R0  Add R2, R0 Store R0  Mem [100 ]

CPU activity, step by step A A A A A R0  Load Mem [100] R1  Load Mem [101] R2  Load #42 R2  Mul R1, R2 R0  Add R2, R0 Store R0  Mem [100] A Time

Unroll the CPU hardware… A A A A A R0  Load Mem [100] R1  Load Mem [101] R2  Load #42 R2  Mul R1, R2 R0  Add R2, R0 Store R0  Mem [100] A Space

… and specialize by position A A A A A R0  Load Mem [100] R1  Load Mem [101] R2  Load #42 R2  Mul R1, R2 R0  Add R2, R0 Store R0  Mem [100] A Instructions are fixed. Remove “Fetch”

… and specialize A A A A A R0  Load Mem [100] R1  Load Mem [101] R2  Load #42 R2  Mul R1, R2 R0  Add R2, R0 Store R0  Mem [100] A Instructions are fixed. Remove “Fetch” Remove unused ALU ops

and specialize A A A A A R0  Load Mem [100] R1  Load Mem [101] R2  Load #42 R2  Mul R1, R2 R0  Add R2, R0 Store R0  Mem [100] A Instructions are fixed. Remove “Fetch” Remove unused ALU ops Remove unused Load / Store

… and specialize R0  Load Mem [100] R1  Load Mem [101] R2  Load #42 R2  Mul R1, R2 R0  Add R2, R0 Store R0  Mem [100] Instructions are fixed. Remove “Fetch” Remove unused ALU ops Remove unused Load / Store Wire up registers properly! And propagate state.

… and specialize R0  Load Mem [100] R1  Load Mem [101] R2  Load #42 R2  Mul R1, R2 R0  Add R2, R0 Store R0  Mem [100] Instructions are fixed. Remove “Fetch” Remove unused ALU ops Remove unused Load / Store Wire up registers properly! And propagate state. Remove dead data.

… and specialize R0  Load Mem [100] R1  Load Mem [101] R2  Load #42 R2  Mul R1, R2 R0  Add R2, R0 Store R0  Mem [100] Instructions are fixed. Remove “Fetch” Remove unused ALU ops Remove unused Load / Store Wire up registers properly! And propagate state. Remove dead data. Reschedule!

So what ? Load Load Store 42 FPGA datapath = Your algorithm, in silicon Build exactly what you need: Operations Data widths Memory size, configuration Efficiency: Throughput / Latency / Power

OpenCL Programming Model Accelerator Local Mem Global Mem Local Mem Local Mem Local Mem Accelerator Accelerator Accelerator Processor Accelerator Local Mem Global Mem Local Mem Local Mem Local Mem Accelerator Accelerator Accelerator Processor Host Accelerator Local Mem Global Mem Local Mem Local Mem Local Mem Accelerator Accelerator Accelerator Processor __kernel void sum( __global float *a, __ global float *b, __ global float *y) { int gid = get_global_id (0); y[ gid ] = a[ gid ] + b[ gid ]; } main () { read_data ( … ); maninpulate ( … ); clEnqueueWriteBuffer ( … ); clEnqueueNDRange ( …,sum,…); clEnqueueReadBuffer ( … ); display_result ( … ); } Host + Accelerator Programming Model Sequential Host program on microprocessor Function offload onto a highly parallel accelerator device

OpenCL  FPGA is NOT just ‘C’-to-HW IP IP EMIF IP Processor Rest of the System ? HLS void F(...) { # pragma ... for ( int i ...) { #pragma ... for ( int j ...) { #pragma ... } } } RTL OpenCL kernel void F(...) { for ( int i ...) { for ( int j ...) { } } } ? Complete Platform C-to-HW tools Standard OpenCL Users Hardware Designers Target FPGA Only FPGA Expertise Yes Timing Closure Manual Users Software Programmers Target Complete Platforms FPGA Expertise No Timing Closure Automatic

Traditional OpenCL Data Parallelism OpenCL kernels expresses parallelism explicitly __kernel void sum( __global const float *a, __global const float *b, __global float *answer) { int xid = get_global_id (0); answer[ xid ] = a[ xid ] + b[ xid ]; } for ( int i =0; i < n; i ++) { answer[ i ] = a[ i ] + b[ i ]; } Host Code Kernel Code setup_memory_buffers (); transfer_data_to_fpga (); size_t global_size = {N, 1, 1}; clEnqueueNDRangeKernel ( sum_kernel , .., & global_size , ..); read_data_from_fpga ();

Loop Pipelining To achieve acceleration, we can pipeline each iteration of the loop Analyze any dependencies between iterations Schedule these operations Launch the next iteration as soon as possible float array[M]; for ( int i =0; i < n* numSets ; i ++) { for ( int j=0; j < M-1; j++) array[j] = array[j+1]; array[M-1] = a[ i ]; for ( int j=0; j < M; j++) answer[ i ] += array[j] * coefs [j]; } At this point, we can launch the next iteration

Loop Pipelining Example No Loop Pipelining i0 i1 i2 i0 i1 i2 i3 i4 Looks almost like parallel thread execution! With Loop Pipelining

Digital Filter

FPGA Q&A

FPGA Thank You!