high level synthesis algorithms_slides.pptx


About This Presentation

Xilinx HLS design methodology explained


Slide Content

High Level Synthesis CSE 237C Professor Ryan Kastner

High Level Synthesis. Goal: convert high-level code into an RTL circuit while optimizing for power, performance, area, cost, security, ... Steps: 1) build an FSM for the controller; 2) build a datapath based on the FSM. Running example: acc = 0; for (i = 0; i < 128; i++) acc += a[i]; Source: Greg Stitt

High Level Synthesis Steps. Build an FSM (controller): decompose the code into states. For the running example, the states are {acc = 0, i = 0} → {if (i < 128)} → {load a[i]} → {acc += a[i]} → {i++} → back to the test; when the test fails, Done. Source: Greg Stitt

High Level Synthesis Steps. Build a datapath: allocate resources for each state. [Figure: datapath with registers acc and i, a comparator for i < 128 (constant 128), adders with constant input 1, and an addr unit computing the address of a[i].] Source: Greg Stitt

High Level Synthesis Steps. Build a datapath: determine register inputs. [Figure: 2x1 multiplexers added in front of the acc, i, and addr registers; inputs include the constant &a and data arriving from memory.] Source: Greg Stitt

High Level Synthesis Steps. Build a datapath: add outputs. [Figure: the datapath now drives a memory-address output and exposes acc as the result.] Source: Greg Stitt

High Level Synthesis Steps. Build a datapath: add control signals. [Figure: mux-select and register-enable control signals added to the datapath.]

High Level Synthesis Steps. Combine controller + datapath. [Figure: the FSM controller drives the datapath's control signals and a memory-read signal; the datapath returns the Done condition to the controller.] Source: Greg Stitt

High Level Synthesis Steps. Alternative design: use only one adder, shared among the operations through additional MUXes. [Figure: single-adder datapath with muxes steering the operands.] Source: Greg Stitt

High Level Synthesis Steps. Flow: High-level Code → Syntactic Analysis → Intermediate Representation → Optimization → Scheduling/Resource Allocation → Binding/Resource Sharing → Controller + Datapath. The front end (syntactic analysis) converts code to an intermediate representation, which lets all following steps use a language-independent format. The back end then determines when each operation will execute and which resources are used (scheduling/resource allocation), and maps operations onto physical resources (binding/resource sharing). Source: Greg Stitt

High Level Synthesis Steps. Syntactic analysis: analysis of code to verify syntactic correctness; converts code into an intermediate representation. Two steps: 1) lexical analysis (lexing), 2) parsing. Source: Greg Stitt

High Level Synthesis Steps. Lexical analysis (lexing): break code into a series of defined tokens, where a token is a defined language construct. Example: x = 0; if (y < z) x = 1; lexes to ID(x), ASSIGN, INT(0), SEMICOLON, IF, LPAREN, ID(y), LT, ID(z), RPAREN, ID(x), ASSIGN, INT(1), SEMICOLON. Source: Greg Stitt
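
The token stream above can be produced by a small regular-expression lexer. A minimal sketch in Python (the token names follow the slide; everything else is illustrative):

```python
import re

# Token definitions in priority order; keywords must come before the ID rule.
TOKEN_SPEC = [
    ("IF",        r"if\b"),
    ("INT",       r"\d+"),
    ("ID",        r"[A-Za-z_]\w*"),
    ("NE",        r"!="),
    ("ASSIGN",    r"="),
    ("LT",        r"<"),
    ("LPAREN",    r"\("),
    ("RPAREN",    r"\)"),
    ("SEMICOLON", r";"),
    ("SKIP",      r"\s+"),
]

def lex(code):
    """Break source code into a list of (token, value) pairs."""
    pattern = "|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC)
    tokens = []
    for m in re.finditer(pattern, code):
        if m.lastgroup != "SKIP":          # drop whitespace
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(lex("x = 0; if (y < z) x = 1;"))
# [('ID', 'x'), ('ASSIGN', '='), ('INT', '0'), ('SEMICOLON', ';'), ('IF', 'if'), ...]
```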

High Level Synthesis Steps. Parsing: analysis of the token sequence to determine correct grammatical structure. Languages are defined by a context-free grammar, e.g.: Program = Exp; Exp = Stmt SEMICOLON | IF LPAREN Cond RPAREN Exp | Exp Exp; Cond = ID Comp ID; Comp = LT | NE; Stmt = ID ASSIGN INT. Correct programs under this grammar include "x = 0;", "x = 0; y = 1;", "if (a < b) x = 10;", "if (var1 != var2) x = 10;", and "x = 0; if (y < z) x = 1; y = 5; t = 1;". Source: Greg Stitt

High Level Synthesis Steps. The parser converts tokens to an intermediate representation, e.g., an abstract syntax tree. For x = 0; if (y < z) x = 1; d = 6; the AST's top-level children are assign(x, 0), if(cond: y < z, body: assign(x, 1)), and assign(d, 6). Source: Greg Stitt

High Level Synthesis Steps. Intermediate representation: a language-independent model of computation. Each input language (C/C++, OpenCL, MATLAB) is translated by its own syntactic-analysis front end into the same intermediate representation, so the HLS back end (scheduling, resource allocation, binding) is independent of the source language. Source: Greg Stitt

Dataflow Model of Computation. [Figures: a three-node dataflow graph (+, -, *) corresponding to the instructions ADD R0, R1, R2; SUB R3, R4, R5; MULT R6, R0, R3. The slide sequence animates token flow: with inputs R1 = 1, R2 = 3, R4 = 6, R5 = 4, the subtractor fires to produce 2, the adder fires to produce 4, and the multiplier consumes both to produce 8.]

Example Dataflow System: FIR Filter (all single-rate). [Figure: a tapped delay line of one-cycle delays and dup (duplicate) nodes; each tap feeds a constant multiply *c (a filter coefficient), and the products are accumulated through a chain of adders.]
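
Read as a computation, the tapped delay line above is just a convolution. A small Python sketch of the same single-rate FIR structure (the coefficients and input are made-up examples):

```python
def fir_filter(samples, coeffs):
    """Single-rate FIR: a tapped delay line of one-cycle delays.
    Each tap is multiplied by a coefficient (*c) and summed (+)."""
    delay_line = [0.0] * len(coeffs)   # state of the one-cycle delays
    out = []
    for x in samples:
        # dup + delay: shift the new sample into the delay line
        delay_line = [x] + delay_line[:-1]
        # one *c per tap, one + per stage
        out.append(sum(c * d for c, d in zip(coeffs, delay_line)))
    return out

# Example: 4-tap moving average (hypothetical coefficients)
print(fir_filter([1, 2, 3, 4, 5], [0.25, 0.25, 0.25, 0.25]))
```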

Modeling Control Flow. Control flow graph (CFG): represents the control-flow dependencies of basic blocks. A basic block is a section of code that always executes from beginning to end, i.e., with no jumps into or out of the block. For the accumulator example, the blocks are {acc = 0, i = 0}, the test if (i < 128), the body {acc += a[i]; i++}, and Done. Source: Greg Stitt

Control Data Flow Graph. A CDFG combines the CFG and DFG, maintaining a DFG for each node of the CFG. [Figure: for the accumulator example, the loop-body block carries the dataflow acc + a[i] → acc and i + 1 → i.] Source: Greg Stitt

Operation Scheduling Problem. The input is a data-flow graph: nodes are operations, edges are dependencies, and the graph is directed and acyclic. Resource allocation fixes the available components; scheduling determines the start time of each operation. Resource-constrained scheduling minimizes latency given the allocated resources; timing-constrained scheduling minimizes the total number of components given a maximum latency. (A minimal DFG representation is sketched below.)
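
All the scheduling algorithms that follow operate on this kind of DAG. A minimal Python representation (the node names and tiny example graph are illustrative):

```python
# A data-flow graph as: op type per node, plus dependence edges (pred -> succ).
dfg = {
    "ops":   {"v1": "*", "v2": "*", "v3": "*", "v4": "+", "v5": "-"},
    "edges": [("v1", "v3"), ("v2", "v3"), ("v3", "v5"), ("v4", "v5")],
}

def predecessors(dfg, v):
    return [a for (a, b) in dfg["edges"] if b == v]

def successors(dfg, v):
    return [b for (a, b) in dfg["edges"] if a == v]

print(predecessors(dfg, "v3"))  # ['v1', 'v2']
```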

Auto Regressive Filter

Cosine Transform

Matrix Inversion

Operation Scheduling Algorithms. The problem is NP-hard, and it is fundamental, so many different heuristic methods have been developed: ILP, force-directed, genetic algorithms, path-based, graph-theoretic, computational geometry, and list scheduling. [Figure: example DFG with operations v1..v11 (multiplications, additions, subtractions, and a comparison) scheduled into steps 1-4 between start and end nodes.]

Synthesis in Temporal Domain. Scheduling and allocation can be done in different orders or together. A schedule is a mapping of operations to time slots (cycles); a scheduled sequencing graph is a labeled graph. [Figure: two alternative four-step schedules of the same sequencing graph, with NOP source and sink nodes. ©Gupta]

Operation Types. For each operation, define its type; for each resource, define a resource type and a delay (in cycles). T is a relation that maps an operation to a resource type that can implement it: T: V → {1, 2, ..., n_res}. More general case: a resource type may implement more than one operation type (e.g., an ALU). Resource binding: map each operation to a resource of the same type; there may be multiple options. Source: Gupta

Schedule in Spatial Domain. Resource sharing: more than one operation is bound to the same resource, so those operations must be serialized. Sharing can be represented using hyperedges (which define a vertex partition). [Figure: scheduled sequencing graph with operations grouped by shared resource. Source: Gupta]

Scheduling and Binding. Resource constraints: the number of resource instances of each type, {a_k : k = 1, 2, ..., n_res}. Scheduling labels the vertices, e.g., f(v_3) = 1; binding is given by hyperedges (or vertex partitions), e.g., b(v_2) = adder1. Cost: the number of resources determines area, together with registers, steering logic (muxes, busses), wiring, and the control unit. Delay: the start time of the "sink" node; it may be affected by steering logic and the schedule (control logic), distinguishing resource-dominated from control-dominated designs.

Architectural Optimization. Optimization in view of design-space flexibility is a multi-criteria problem: determine the schedule f and binding b under area A, latency l, and cycle time t objectives. Find the non-dominated points in the solution space; we want Pareto-optimal solutions. The solution-space tradeoff curves over area/latency/cycle time (and possibly more) are non-linear and discontinuous, so the cost functions must be evaluated (estimated). For resource-dominated circuits the problems become unconstrained: min area reduces to solving for a minimal binding, and min latency to minimum-latency scheduling. Source: Gupta

Operation Scheduling. Input: sequencing graph G(V, E) with n vertices, cycle time t, and operation delays D = {d_i : i = 0..n}. Output: schedule f, which determines the start time t_i of each operation v_i; latency l = t_n − t_0. Goal: determine the area/latency tradeoff. Classes: non-hierarchical and unconstrained, latency-constrained, resource-constrained, and hierarchical. Source: Gupta

Min Latency Unconstrained Scheduling. Simplest case: no constraints, find the minimum latency. Given a set of vertices V, delays D, and a partial order on operations given by E, find an integer labeling of operations f: V → Z+ such that t_i = f(v_i), t_i ≥ t_j + d_j for all (v_j, v_i) ∈ E, and l = t_n − t_0 is minimum. Solvable in polynomial time; it also gives bounds on latency for resource-constrained problems. The ASAP algorithm, which visits vertices in topological order, solves it.

ASAP Schedules. Schedule v_0 at t_0 = 0. While v_n is not scheduled: select a v_i whose predecessors are all scheduled, and schedule it at t_i = max {t_j + d_j} over its predecessors v_j. Return t_n. [Figure: ASAP schedule of the example graph.] (A runnable sketch follows.)
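
A direct transcription of the ASAP loop in Python (the example graph and delays are made up):

```python
def asap(edges, delay):
    """ASAP schedule: t_i = max over predecessors j of (t_j + d_j).
    edges: list of (pred, succ); delay: dict op -> cycles."""
    ops = set(delay)
    preds = {v: [a for (a, b) in edges if b == v] for v in ops}
    t = {v: 0 for v in ops if not preds[v]}   # sources start at step 0
    while len(t) < len(ops):
        for v in ops:
            if v not in t and all(p in t for p in preds[v]):
                t[v] = max(t[p] + delay[p] for p in preds[v])
    return t

delay = {"a": 1, "b": 1, "c": 1}
print(asap([("a", "c"), ("b", "c")], delay))  # {'a': 0, 'b': 0, 'c': 1}
```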

ALAP Schedules. Schedule v_n at t = l. While v_0 is not scheduled: select a v_i whose successors are all scheduled, and schedule it at t_i = min {t_j} − d_i over its successors v_j. [Figure: ALAP schedule of the example graph.] (A runnable sketch follows.)
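
The mirrored ALAP pass, scheduling backward from the sinks at the latency bound (same made-up conventions as the ASAP sketch):

```python
def alap(edges, delay, latency):
    """ALAP schedule: t_i = min over successors j of t_j, minus d_i."""
    ops = set(delay)
    succs = {v: [b for (a, b) in edges if a == v] for v in ops}
    # Sinks start as late as possible while still finishing by `latency`.
    t = {v: latency - delay[v] for v in ops if not succs[v]}
    while len(t) < len(ops):
        for v in ops:
            if v not in t and all(s in t for s in succs[v]):
                t[v] = min(t[s] for s in succs[v]) - delay[v]
    return t

delay = {"a": 1, "b": 1, "c": 1}
print(alap([("a", "c"), ("b", "c")], delay, latency=4))
# {'c': 3, 'a': 2, 'b': 2}
```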

Resource Constrained Scheduling. Constrained scheduling in the general case is NP-complete. ML-RCS: minimize latency given constraints on area or resources. MR-LCS: minimize resources subject to a bound on latency. Exact solution methods: ILP (Integer Linear Programming), and Hu's algorithm for identical processors. Heuristics: list scheduling and force-directed scheduling.

Hu's Algorithm. A simple case of the scheduling problem: all operations have unit delay, all operations (and resources) are of the same type, and the graph is a forest. Hu's algorithm is greedy, polynomial, AND optimal. It computes a lower bound on the number of resources for a given latency, or a lower bound on latency subject to resource constraints. ©Gupta

Basic Idea: Hu's Algorithm. Rely on a labeling of operations based on their distance from the sink: the label is the length of the longest path passing through the node. Schedule nodes with higher labels first (i.e., the most "critical" operations have priority). Schedule up to a nodes at a time, where a is the number of resources, and only schedule nodes whose parents/predecessors have all been scheduled. Each iteration fills one time step (step 1, 2, 3, ...). ©Gupta

Hu's Algorithm. HU(G(V,E), a) { Label the vertices (label = length of the longest path passing through the vertex); l = 1; repeat { U = unscheduled vertices in V whose predecessors have all been scheduled (or that have no predecessors); select S ⊆ U such that |S| ≤ a and the labels in S are maximal; schedule the S operations at step l by setting t_i = l for all v_i ∈ S; l = l + 1 } until v_n is scheduled. } (A runnable sketch follows.)
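
A runnable version of HU under the slide's assumptions (unit delay, a single operation type, a resources); the example graph is made up:

```python
def hu_schedule(edges, ops, a):
    """Hu's algorithm: unit-delay, single-type scheduling with a resources."""
    succs = {v: [b for (x, b) in edges if x == v] for v in ops}
    preds = {v: [x for (x, b) in edges if b == v] for v in ops}

    def label(v):  # length of the longest path from v to a sink (in nodes)
        return 1 + max((label(s) for s in succs[v]), default=0)

    labels = {v: label(v) for v in ops}
    t, step = {}, 1
    while len(t) < len(ops):
        ready = [v for v in ops if v not in t and all(p in t for p in preds[v])]
        ready.sort(key=lambda v: -labels[v])   # highest label = most critical
        for v in ready[:a]:                    # schedule at most a ops per step
            t[v] = step
        step += 1
    return t

ops = ["a", "b", "c", "d", "e"]
edges = [("a", "c"), ("b", "c"), ("c", "e"), ("d", "e")]
print(hu_schedule(edges, ops, a=2))
# {'a': 1, 'b': 1, 'c': 2, 'd': 2, 'e': 3}
```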

Hu's Algorithm example, with r = 3 resources (Source: Greg Stitt). [Figures: a DFG with nodes a, b, c, d, e, f, g, j, k is scheduled step by step. Step 1: label each node by its maximum distance from the output, i.e., use path length as priority (a = 4, b = 4, c = 3, e = 3, d = 2, f = 2, g = 1, j = 1, k = 1). Step 2: determine C, the set of scheduling candidates. Step 3: from C, schedule up to r nodes into the current cycle, using the label as priority; lower-priority candidates are left for later cycles. Repeating for cycles 2, 3, and 4 schedules the remaining nodes.]

Hu's Algorithm is optimal and polynomial for the simplified problem. Common extensions: multiple resource types and multi-cycle operations. [Figure: example with mixed +, -, /, * operations scheduled over two cycles. Source: Greg Stitt]

ILP Formulation of ML-RCS [Mic94, p. 198]. Use binary decision variables x_il, for i = 0, 1, ..., n and l = 1, 2, ..., l'+1, where l' is a given upper bound on latency: x_il = 1 if operation i starts at step l, and 0 otherwise. The formulation is a set of linear inequalities (constraints) plus an objective function (minimum latency). Observations: the start time of operation i is t_i = Σ_l l · x_il, and "is operation v_i (still) executing at step l?" can also be expressed as a sum of the x variables (next slides).

Start Time vs. Execution Time. Each operation v_i has only one start time. If d_i = 1, the questions "does operation v_i start at step l?" and "is operation v_i running at step l?" are the same, and both are answered by x_il = 1. If d_i > 1, the two differ: v_i starts at step l iff x_il = 1, while v_i is running at step l iff x_i,l + x_i,l−1 + ... + x_i,l−d_i+1 = 1.

Operation v_i Still Running at Step l? Example: is v_9 running at step 6? Yes iff x_9,6 + x_9,5 + x_9,4 = 1; with a delay of 3, v_9 is running at step 6 exactly when it started at step 4, 5, or 6, and only one (if any) of the three cases can hold. To meet resource constraints, we must ask the same question for ALL steps and ALL operations of that type.

Operation v_i Still Running at Step l? In general, v_i is running at step l iff x_i,l + x_i,l−1 + ... + x_i,l−d_i+1 = 1. [Figure: the d_i possible start steps l−d_i+1, ..., l−1, l that leave v_i executing at step l.]

ILP Formulation of ML-RCS (cont.). Constraints: unique start times, Σ_l x_il = 1 for all i; sequencing (dependency) relations, t_i ≥ t_j + d_j for each edge (v_j, v_i), i.e., Σ_l l·x_il ≥ Σ_l l·x_jl + d_j; resource constraints, Σ_{i : T(v_i) = k} Σ_{m = l−d_i+1}^{l} x_im ≤ a_k for every type k and step l. Objective: min c^T t, where t is the vector of start times and c is a cost-weight vector (e.g., [0 0 ... 1]); when c = [0 0 ... 1], c^T t = t_n, the start time of the sink.

ILP Example. Assume l = 4. First, perform ASAP and ALAP (the ILP can be written without them, but using ASAP and ALAP simplifies the inequalities). [Figure: ASAP and ALAP schedules of the example graph v1..v11 over four steps.]

ILP Example: Unique Start Times Constraint. Without the ASAP and ALAP values, every operation needs variables for every step: x_i1 + x_i2 + x_i3 + x_i4 = 1 for each i. Using ASAP and ALAP, variables outside an operation's mobility interval are dropped; e.g., an operation with ASAP = ALAP = 1 contributes just x_i1 = 1, while one with mobility over steps 1-2 contributes x_i1 + x_i2 = 1.

ILP Example: Dependency Constraints. Using ASAP and ALAP, only a few non-trivial inequalities remain (assuming unit delay for + and *); each takes the form Σ_l l·x_il ≥ Σ_l l·x_jl + 1 for an edge (v_j, v_i) whose mobility intervals overlap.

ILP Example: Resource Constraints. Resource constraints (assuming 2 adders and 2 multipliers): at every step, the number of multiplications executing is at most 2, and the number of ALU operations (add/subtract/compare) executing is at most 2. Objective: min x_{n,4}.
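
To make the formulation concrete, here is a minimal ML-RCS ILP for a toy graph, sketched with the PuLP library (assuming PuLP is installed; the graph, delays, and resource counts are made-up examples, not the slides' v1..v11 graph):

```python
import pulp

ops   = {"m1": "mul", "m2": "mul", "a1": "add"}   # op -> type
edges = [("m1", "a1"), ("m2", "a1")]              # dependencies (pred, succ)
delay = {"mul": 2, "add": 1}
avail = {"mul": 1, "add": 1}                      # a_k: one unit of each type
steps = range(1, 7)                               # l = 1..6 (latency upper bound)

prob = pulp.LpProblem("ml_rcs", pulp.LpMinimize)
x = {(i, l): pulp.LpVariable(f"x_{i}_{l}", cat="Binary") for i in ops for l in steps}
start = {i: pulp.lpSum(l * x[i, l] for l in steps) for i in ops}

for i in ops:                                     # unique start times
    prob += pulp.lpSum(x[i, l] for l in steps) == 1
for j, i in edges:                                # sequencing: t_i >= t_j + d_j
    prob += start[i] >= start[j] + delay[ops[j]]
for k in avail:                                   # resource bound at every step
    for l in steps:
        prob += pulp.lpSum(x[i, m] for i in ops if ops[i] == k
                           for m in steps if l - delay[k] + 1 <= m <= l) <= avail[k]

prob += start["a1"]                               # minimize start of the sink op
prob.solve(pulp.PULP_CBC_CMD(msg=0))
print({i: int(pulp.value(start[i])) for i in ops})
# With one multiplier of delay 2, the mults serialize: e.g. m1 -> 1, m2 -> 3, a1 -> 5
```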

ILP Formulation of MR-LCS. The dual problem to ML-RCS. Objective: optimize total resource usage a; the objective function is c^T a, where the entries in c are the respective area costs of the resources. Constraints: the same as the ML-RCS constraints, plus a latency bound on the sink: Σ_l l·x_{n,l} ≤ l' + 1. Note: the unknowns a_k now appear in the constraints. ©Gupta

Real Design Complications. Heterogeneous mapping: one operation has many implementations, with different bit-widths (e.g., a 32-bit multiplier is good for mul(24) and mul(32)) and different area and delay. Real technology libraries are extremely sophisticated: it is hard to estimate final timing and total area, and sharing depends on the cost of multiplexers. Downstream tools may not generate what we expect: after resource sharing and register sharing, logic synthesis, placement, and routing break component boundaries.

List Scheduling. A heuristic method for ML-RCS and MR-LCS; it does NOT guarantee an optimal solution. Similar to Hu's algorithm: a greedy strategy with operation selection decided by criticality, and O(n) time complexity. It handles more general input: it works on general graphs (unlike Hu's) and supports resource constraints on different resource types.

RCS: List Scheduling. A simple scheduling algorithm based on a greedy strategy. Construct a priority list based on some metric (operation mobility, number of successors, etc.). While not all operations are scheduled: for each available resource, select an operation from the ready list in descending priority; assign these operations to the current clock cycle; update the ready list; increment the clock cycle. Solution quality depends on the benchmark and the particular metric.

List Scheduling Algorithm: ML-RCS. LIST_L(G(V,E), a) { l = 1; repeat { for each resource type k { U_l,k = available vertices in V; T_l,k = operations in progress; select S_k ⊆ U_l,k such that |S_k| + |T_l,k| ≤ a_k; schedule the S_k operations at step l } l = l + 1 } until v_n is scheduled. } (A runnable sketch follows.)
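
A compact implementation of LIST_L, using longest path to a sink as the priority metric (one of the criticality measures the slides mention); the example data is made up:

```python
def list_schedule(ops, edges, delay, avail):
    """ML-RCS list scheduling: at each step l, fill each resource type k
    with the highest-priority ready ops while |S_k| + |T_l,k| <= a_k."""
    preds = {v: [a for (a, b) in edges if b == v] for v in ops}
    succs = {v: [b for (a, b) in edges if a == v] for v in ops}

    def priority(v):  # longest path from v to a sink, in cycles
        return delay[ops[v]] + max((priority(s) for s in succs[v]), default=0)

    t, l = {}, 1
    while len(t) < len(ops):
        for k in avail:
            # T_l,k: ops of type k still executing at step l
            running = [v for v in t if ops[v] == k and t[v] + delay[k] > l]
            # U_l,k: unscheduled ops of type k whose predecessors have finished
            ready = sorted(
                (v for v in ops if v not in t and ops[v] == k
                 and all(p in t and t[p] + delay[ops[p]] <= l for p in preds[v])),
                key=priority, reverse=True)
            for v in ready[:avail[k] - len(running)]:
                t[v] = l
        l += 1
    return t

ops = {"m1": "mul", "m2": "mul", "a1": "add"}
print(list_schedule(ops, [("m1", "a1"), ("m2", "a1")],
                    {"mul": 2, "add": 1}, {"mul": 1, "add": 1}))
# {'m1': 1, 'm2': 3, 'a1': 5} -- matches the ILP sketch above
```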

List Scheduling Example. Assumptions: three multipliers with latency 2, and one ALU with latency 1. Source: Gupta

List Scheduling Algorithm: MR-LCS. LIST_R(G(V,E), l') { a = 1; l = 1; compute the ALAP times t^L; if any t^L < 0, return (not feasible); repeat { for each resource type k { U_l,k = available vertices in V; compute the slacks s_i = t_i^L − l for all v_i ∈ U_l,k; schedule operations with zero slack, updating a as needed; schedule additional operations S_k ⊆ U_l,k under the a constraints } l = l + 1 } until v_n is scheduled. }

Force-Directed Scheduling. Developed by Paulin and Knight [DAC87]. Similar to list scheduling: it can handle ML-RCS and MR-LCS, and for ML-RCS it schedules step by step. BUT the selection of operations tries to find the globally best set: it differs from list scheduling by selecting the operations with the least force, considering the effect on the type distribution and the effect on successor nodes and their type distributions. Idea: find the mobility m_i = t_i^L − t_i^S of each operation, look at the operation-type probability distributions, and try to flatten them. ©Gupta

Force-Directed Scheduling. Rationale: reward a uniform distribution of operations across schedule steps. Force is used as a priority function related to concurrency: operations are sorted for least force. Mechanical analogy: force = constant × displacement, where the constant is the operation-type distribution and the displacement is the change in probability. Definition: the operation probability density is p_i(l) = Pr{v_i starts at step l}; assuming a uniform distribution, p_i(l) = 1/(m_i + 1) for l ∈ [t_i^S, t_i^L] and 0 otherwise. ©Gupta

Force-Directed Scheduling: Definitions. The operation-type distribution (NOT normalized to 1) combines the operation probabilities over control steps: the distribution graph of type k over all steps is q_k(l) = Σ_{i : T(v_i) = k} p_i(l). q_k(l) can be thought of as the expected operator cost for implementing operations of type k at step l.

Example. [Figure: distribution graphs for the four-step example: multiply type q = (2.83, 2.33, 0.83, 0); add/sub/compare type q = (1, 2, 1.66, 0.33).] (A sketch of the computation follows.)
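
The probabilities and distribution graphs fall straight out of ASAP/ALAP mobility. A small sketch (the graph data here is illustrative, not the slides' example):

```python
def distribution_graph(asap_t, alap_t, op_type, k, steps):
    """q_k(l) = sum over ops of type k of p_i(l), with p_i uniform
    over the mobility interval [ASAP_i, ALAP_i]."""
    q = {l: 0.0 for l in steps}
    for v, kind in op_type.items():
        if kind != k:
            continue
        lo, hi = asap_t[v], alap_t[v]
        p = 1.0 / (hi - lo + 1)          # uniform probability density
        for l in range(lo, hi + 1):
            q[l] += p
    return q

# Hypothetical multiply ops with (ASAP, ALAP) mobility over steps 1..4
asap_t = {"v1": 1, "v2": 1, "v3": 2, "v6": 1}
alap_t = {"v1": 1, "v2": 1, "v3": 2, "v6": 2}
op_type = {v: "mul" for v in asap_t}
print(distribution_graph(asap_t, alap_t, op_type, "mul", range(1, 5)))
# {1: 2.5, 2: 1.5, 3: 0.0, 4: 0.0}
```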

Forces. Self-force: the sum of the forces to the other steps; the self-force for operation v_i in step l is Σ_m q_k(m) · Δp_i(m), where Δp_i(m) is the change in v_i's probability at step m caused by fixing v_i to step l (1 − p_i(l) at m = l, and −p_i(m) at the other steps of its range). Successor-force: related to the scheduling of the successor operations, since delaying an operation may delay its successors.

Example: operation v_6 (a multiply). It can be scheduled in the first two steps: p(1) = p(2) = 0.5, p(3) = p(4) = 0; the multiply distribution is q(1) = 2.8, q(2) = 2.3. Assign v_6 to step 1: the variation in probability at step 1 is 1 − 0.5 = 0.5, and at step 2 it is 0 − 0.5 = −0.5. Self-force: 2.8 × 0.5 + 2.3 × (−0.5) = 0.25.

Example: operation v_6 (continued). Assign v_6 to step 2: the variation in probability at step 1 is 0 − 0.5 = −0.5, and at step 2 it is 1 − 0.5 = 0.5. Self-force: 2.8 × (−0.5) + 2.3 × 0.5 = −0.25.

Example: operation v_6 (continued). Successor-force: operation v_7 would then be assigned to step 3, contributing 2.3 × (0 − 0.5) + 0.8 × (1 − 0.5) = −0.75. Total force = −1. Conclusion: the least force is for step 2; assigning v_6 to step 2 reduces concurrency. (A sketch of the self-force computation follows.)
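
The v_6 numbers above drop out of a few lines of code. A sketch of the self-force computation (q values as in the slides, rounded to 2.8 and 2.3):

```python
def self_force(q, p, target_step, steps):
    """Self-force of fixing an op to target_step:
    sum over steps of q(l) * (new probability - old probability)."""
    return sum(q[l] * ((1.0 if l == target_step else 0.0) - p[l])
               for l in steps)

q = {1: 2.8, 2: 2.3}   # multiply distribution graph (from the slide)
p = {1: 0.5, 2: 0.5}   # v6 can start at step 1 or 2
print(self_force(q, p, 1, [1, 2]))   # 2.8*0.5 + 2.3*(-0.5) = 0.25
print(self_force(q, p, 2, [1, 2]))   # -0.25 -> step 2 has the least force
```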

Force Directed Scheduling Algorithm

Ant System Optimization: Overview. Ants work cooperatively on the graph, each creating a feasible solution. Ants leave pheromones on their trails and make decisions based partly on the amount of pheromone. Global optimization: evaporation (pheromones dissipate over time) and reinforcement (pheromones are updated from good solutions). The system quickly converges to good solutions.

Solving Design Problems using AS. Problem model: define the solution space by creating decision variables. Pheromone model (global heuristic): provides a history of the search-space traversal. Ant search strategy (local heuristic): a deterministic strategy for individual ant decision making. Solution construction: probabilistically derive solutions from the local and global heuristics. Feedback: evaluate solution quality, reinforce good solutions (pheromones), and slightly evaporate all decisions (weakening poor solutions).

Max-Min Ant System (MMAS). Problem: some pheromones can overpower others, leading to local minima (premature convergence). Solution: bound the strength of the pheromones to [τ_min, τ_max]. With τ_min > 0 there is always a chance to make any decision; if the pheromones are all forced to the same value, the decision is based solely on local heuristics, i.e., no past information is taken into account.

MMAS RCS Formulation. Idea: combine ACO and list scheduling. Ants determine the priority list; the list-scheduling framework evaluates the "goodness" of the list. The global heuristic is a pheromone over (operation, permutation-index) pairs; the local heuristic can use different properties: instruction mobility (IM), instruction depth (ID), latency-weighted instruction depth (LWID), or successor number (SN).

RCS: List Scheduling (review). A simple scheduling algorithm based on a greedy strategy: construct a priority list based on some metric (operation mobility, number of successors, etc.); while not all operations are scheduled, for each available resource select an operation from the ready list in descending priority, assign these operations to the current clock cycle, update the ready list, and increment the clock cycle. Solution quality depends on the benchmark and the particular metric.

MMAS RCS: Global and Local Heuristics. Global heuristic: the pheromone τ_ij gives the favorableness of selecting operation i for position j, stored in a global pheromone matrix. Local heuristic: local metrics such as instruction mobility or the number of successors. Local decision making combines the two probabilistically; pheromones are evaporated and good solutions reinforced. (A sketch follows.)
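
The heart of MMAS-RCS is turning the pheromone matrix plus a local metric into a priority list, one probabilistic pick per position, then evaporating and reinforcing with clamping. A sketch of that construction (the symbols τ, η, α, β follow the slides' spirit; the update rule and all numbers are illustrative):

```python
import random

def construct_priority_list(ops, tau, eta, alpha=1.0, beta=1.0):
    """One ant: build a permutation of ops position by position.
    tau[i][j]: pheromone for op i at position j; eta[i]: local metric."""
    remaining, order = list(ops), []
    for j in range(len(ops)):
        weights = [(tau[i][j] ** alpha) * (eta[i] ** beta) for i in remaining]
        pick = random.choices(remaining, weights=weights)[0]
        order.append(pick)
        remaining.remove(pick)
    return order

def update_pheromones(tau, best_order, rho=0.1, reward=1.0,
                      t_min=0.1, t_max=5.0):
    """MMAS update: evaporate everything, reinforce the best list,
    and clamp into [t_min, t_max] to avoid premature convergence."""
    for i in tau:
        for j in tau[i]:
            tau[i][j] *= (1.0 - rho)         # evaporation
    for j, i in enumerate(best_order):
        tau[i][j] += reward                  # reinforcement
    for i in tau:
        for j in tau[i]:
            tau[i][j] = min(t_max, max(t_min, tau[i][j]))

ops = ["v1", "v2", "v3"]
tau = {i: {j: 1.0 for j in range(len(ops))} for i in ops}
eta = {"v1": 3.0, "v2": 2.0, "v3": 1.0}      # e.g., instruction depth
order = construct_priority_list(ops, tau, eta)
update_pheromones(tau, order)
print(order)
```

The resulting `order` would then be handed to the list-scheduling framework, whose schedule length scores the list.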

MMAS RCS Algorithm

RCS Results: Pheromones (ARF)

RCS Experimental Results. ILP (optimal) solved with CPLEX; list scheduling with instruction mobility (IM), instruction depth (ID), latency-weighted instruction depth (LWID), and successor number (SN) priorities; ant scheduling (MMAS-IS) with the same local heuristics, averaged over 5 runs, each run 100 iterations with 5 ants. Resource key: a = adder, fm = fast multiplier, m = multiplier, i = input, o = output.

Benchmark (nodes/edges) | Resources           | CPLEX lat./runtime | FDS | List: IM / ID / LWID / SN | MMAS-IS: IM / ID / LWID / SN
HAL (21/25)             | 1a, 1fm, 1m, 3i, 3o | 8 / 32             | 8   | 8 / 8 / 9 / 8             | 8 / 8 / 8 / 8
ARF (28/30)             | 2a, 1fm, 2m         | 11 / 22            | 11  | 11 / 13 / 13 / 13         | 11 / 11 / 11 / 11
EWF (34/47)             | 1a, 1fm, 1m         | 27 / 24000         | 28  | 28 / 31 / 31 / 28         | 27.2 / 27.2 / 27 / 27.2
FIR1 (40/39)            | 2a, 2m, 3i, 3o      | 13 / 232           | 19  | 19 / 19 / 19 / 18         | 17.2 / 17.2 / 17 / 17.8
FIR2 (44/43)            | 1a, 1fm, 1m, 3i, 3o | 14 / 11560         | 19  | 19 / 21 / 21 / 21         | 16.2 / 16.4 / 16.2 / 17
COSINE1 (66/76)         | 2a, 2m, 1fm, 3i, 3o | n/a                | 18  | 19 / 20 / 18 / 18         | 17.4 / 18.2 / 17.6 / 17.6
COSINE2 (82/91)         | 2a, 2m, 1fm, 3i, 3o | n/a                | 23  | 23 / 23 / 23 / 23         | 21.2 / 21.2 / 21.2 / 21.2
Average                 |                     |                    | 18  | 18.2 / 19.3 / 20.5 / 18.5 | 16.8 / 17.0 / 16.9 / 17.1

MMAS RCS: Results. Consistently generates better results over all test cases: up to 23.8% better than the list scheduler, an average of 6.4% (and up to 15%) better than force-directed scheduling, and quantitatively closer to the known optimal solutions.

MMAS TCS Formulation. Idea: combine ACO and force-directed scheduling. Quick FDS review: uniformly distribute the operations onto the available resources, using the operation probabilities and the distribution graph. Self-force: the change in the DG caused by scheduling an operation; predecessor/successor force: the implicit effects on the DG. Schedule each operation to the step with the minimum force.

ACO Formulation for TCS. Initialize the pheromone model; while termination is not satisfied: create ants, let each ant find a solution, then evaluate the solutions and update the pheromones; report the best result found. The pheromone trails τ_ij indicate the favorableness of assigning instruction i to position j. [Figure: example graph v1..v11 with pheromone trails linking operations to steps 1-4.]

ACO Formulation for TCS (cont.). Within each ant's construction step: select an operation op_h probabilistically, then select its time step using global heuristics (tied to the searching experience, i.e., the pheromones) and local heuristics (the inverse of the distribution graph, 1/q_k(j)). Here α and β are constants weighting the two terms.

ACO Formulation for TCS (cont.). After each iteration: pheromone evaporation, then rewarding of good partial solutions based on solution quality.

Final Version of MMAS-TCS

Effectiveness of MMAS-TCS

MMAS TCS: Results. MMAS TCS is more stable than FDS, especially when the solution is highly unconstrained: 258 out of 263 test cases are equal to or better than the FDS results, with 16.4% fewer resources.

Exploiting TCS and RCS Duality. TCS and RCS are dual problems: can we use them in harmony to perform area/delay exploration? MMAS enables seamless switching between the two problems.

Theorem. If C is the optimal TCS result at time t_1, then the RCS result t_2 of C satisfies t_2 ≤ t_1. More importantly, no configuration C′ with a smaller cost can produce an execution time within [t_2, t_1].

What does this give us? It implies we can construct the tradeoff curve L: starting from the rightmost deadline t, find the TCS solution C, push it leftward using the RCS solution of C, and iterate (switching between TCS and RCS). Exhaustive scanning is the worst case, occurring only when, for every deadline t with TCS result C at t, the RCS result equals t. (A sketch of the loop follows.)
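
A sketch of the exploration loop implied by the theorem, with tcs() and rcs() as stand-in solvers (hypothetical interfaces: tcs(deadline) returns the cheapest configuration meeting the deadline, and rcs(config) returns that configuration's best latency):

```python
def explore(t_max, t_min, tcs, rcs):
    """Walk the area/latency curve right-to-left using TCS/RCS duality:
    the RCS latency t2 of the TCS solution at t1 satisfies t2 <= t1,
    and no cheaper configuration can land inside [t2, t1]."""
    curve, t = [], t_max
    while t >= t_min:
        config = tcs(t)          # cheapest resources meeting deadline t
        if config is None:
            break                # no feasible configuration at this deadline
        t2 = rcs(config)         # best latency achievable with config
        curve.append((t2, config))
        t = t2 - 1               # jump past the covered interval [t2, t]
    return curve

# Toy demo with made-up solvers: cost = number of units, latency = ceil(12/units).
demo_tcs = lambda t: max(1, -(-12 // t))   # fewest units meeting deadline t
demo_rcs = lambda units: -(-12 // units)   # best latency with those units
print(explore(12, 1, demo_tcs, demo_rcs))
# [(12, 1), (6, 2), (4, 3), (3, 4), (2, 6), (1, 12)]
```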

DSE Using Time/Resource Duality

Experimental Results

Resource Allocation and Scheduling. What is the scheduling cost function? Homogeneous TCS: the total number of components. Heterogeneous TCS: the total area of the functional units; for FPGA designs, LUTs, slices, BRAMs, ...; for ASIC designs, silicon area. Total area comes from functional units, registers, multiplexers, and interconnect.

Towards the Real World: Constraint Graph. A hierarchical directed graph: nodes V are operations, and edges E(v_i, v_j, T_ij) carry timing constraints. A timing constraint T_i,j(c, o) can express start-time dependencies, finish-time dependencies, and chained dependencies.

Constraint Graph: Examples. Operations a and b scheduled in the same cycle; operation b scheduled exactly one cycle after the start of operation a; operation b must start after operation a; operation a starts at least two cycles after the start of operation b. (These are difference constraints; see the sketch below.)
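
All four examples are difference constraints on start times, so they map onto weighted edges and a longest-path computation. A sketch (an edge encodes t_dst − t_src ≥ w; an equality becomes a pair of edges; the node names combine the slide's examples and are illustrative):

```python
def earliest_times(nodes, constraints):
    """Bellman-Ford-style longest-path relaxation over difference constraints.
    constraints: list of (src, dst, w) meaning t_dst - t_src >= w."""
    t = {v: 0 for v in nodes}
    for _ in range(len(nodes)):
        changed = False
        for src, dst, w in constraints:
            if t[src] + w > t[dst]:
                t[dst] = t[src] + w
                changed = True
        if not changed:
            return t
    # A positive cycle means the constraints cannot all be met.
    raise ValueError("inconsistent (ill-posed) timing constraints")

cons = [
    ("a", "b", 0), ("b", "a", 0),   # a and b in the same cycle
    ("a", "c", 1), ("c", "a", -1),  # c exactly one cycle after a
    ("a", "d", 1),                  # d must start after a
    ("e", "a", 2),                  # a at least two cycles after e
]
print(earliest_times(["a", "b", "c", "d", "e"], cons))
# {'a': 2, 'b': 2, 'c': 3, 'd': 3, 'e': 0}
```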

Pipelined Designs. Start a new task before the prior one has completed; feedback constraints among nodes enforce a specific initiation interval. Pipelining improves throughput but requires more hardware.

Operation Chaining. Two or more operations scheduled in the same clock cycle, using a faster/larger component: shorter latency and saved registers. Chaining can also cross clock edges.

Speculative Execution

Problem Formulation. Constraint graph: nodes V are operations; edges E are data dependencies and timing constraints. Technology library Q: area and timing of the available components. Given the resource constraints and a desired clock period C, determine the start time and the allocation of each resource type, under resource-constrained or timing-constrained scheduling.

MMAS CRAAS: Overview. Start with an initial result: use the fastest components, run ASAP/ALAP, and resolve resource conflicts so that the schedule meets the timing and resource constraints. MMAS then iteratively searches for optimal solutions.

MMAS CRAAS: ASAP/ALAP. Run ASAP/ALAP iteratively to handle loops/feedback in the constraint graph and to check for ill-posed timing constraints.

MMAS CRAAS: Initial Schedule. Resource conflicts arise when the ASAP result uses more than the available resources; they are resolved by pushing operations forward.

MMAS CRAAS: Overview (cont.). Each ant constructs a schedule: load the ASAP timing results; update the mobility ranges and operation probabilities; update the distribution graph; probabilistically defer operations; probabilistically select operations and schedule them using p(i, j, k); update the ASAP/ALAP results.

MMAS CRAAS: Global Heuristics. Local heuristics: favor smaller functional units and fewer registers for the operation, with uniform probability among all compatible resources. Global heuristics: favor solutions with smaller area.

MMAS CRAAS: Scheduling. Defer operations from the current iteration, favoring operations with many options; schedule an operation; update the ASAP schedules; update the global heuristics.

MMAS CRAAS: Results. Implemented in a leading high-level synthesis framework, leveraging its HDL back ends to collect results; the front end parses C and performs optimizations, with resource sharing and register sharing after scheduling. The existing algorithm is based on FDS/FDLS and refined for real designs: force-directed operation deferring, re-allocating resources and iterating until the area increases. Results overview: 3-15% smaller (optimizing area) and 1-4% faster (optimizing latency).

MMAS CRAAS: Results

MMAS CRAAS: Results (cont.). It is hard to generate good results for control-dominated designs (benchmarks 158, 160, and 54). MMAS CRAAS achieves better resource allocation and sharing, while the existing algorithm converges prematurely; this is consistent with previous observations.

Conclusion. ILP: optimal, but exponential worst-case runtime. Hu's: optimal and polynomial, but only works in restricted cases. List scheduling: an extension of Hu's to the general case; greedy (fast) but suboptimal. Force-directed: a more sophisticated list-style algorithm that takes a more global view of the graph; still suboptimal. Ant colony optimization: ...

Instruction Scheduling (1). Advances in process technology make computing resources ever more abundant; how best to use these resources is the problem. Instruction scheduling is one of the fundamental problems, and it is becoming more important: poor instruction scheduling can easily offset a system's performance gains.

Problem Formulation. Given a set of operations and a collection of computational units, typically modeled using a data flow graph (DFG): a directed acyclic graph in which each node is an operation and each edge is a data dependence. Find a schedule for the operations that minimizes some function (latency, area, power, ...). [Figure: auto-regressive filter DFG.]

List Scheduling (1). Simple and effective: a greedy strategy in which operation selection is decided by criticality, with O(n) time complexity (i.e., one scan). Make a priority list of the instructions based on some measure (mobility, instruction depth, number of successors, etc.). No single priority function works well over all applications: quality is highly dependent on the problem instance.

List Scheduling (2). Procedure ListScheduling(G, R, L). Input: DFG G(V, E), resource set R, priority list L. Output: instruction schedule. cycle ← 0; ReadyList ← successors of start. While node end is not scheduled: for each op in ReadyList in descending priority, if a resource exists for op to start, schedule op at time cycle and update ReadyList; cycle ← cycle + 1. Return the schedule. [Figure: example DFG v1..v11 scheduled into four steps.]