Why Halide?
3
Halide's answer: decouple the Algorithm from the Schedule
Algorithm: what is computed.
Schedule: where and when it's computed.
Easy for programmers to build pipelines
•simplifies algorithm code
•improves modularity
Easy for programmers to specify & explore optimizations
•fusion, tiling, parallelism, vectorization
•can’t break the algorithm
Easy for the compiler to generate fast code
Image Processing Tradeoffs
Experienced engineers always keep PARALLELISM, LOCALITY, and REDUNDANT WORK in mind.
Performance - It's All about Scheduling
6
To optimize is to find a better schedule within the valid space.
Example – C++, Optimized C++ and Halide
7
How Halide works
8
Defining Algorithms & JIT in Halide
9
Types – Var, Expr, Func and RDom
10
Func: represents a (schedulable) pipeline stage.
Func gradient;
Var: names to use as variables in the definition of a Func.
Var x, y;
Expr: an expression built from Vars, constants, and calls to other Funcs.
Expr e = x + y;
Add a definition to the Func object:
gradient(x, y) = e;
RDom: a reduction domain; iterates over a range of inputs, acting as the loop of a reduction.
RDom r(-1, 3); // MIN, EXTENT
Expr e = sum(f(x + r, y)); // f: some previously defined Func
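RDom r(-1, 3) spans r = -1, 0, 1 (min -1, extent 3), so sum(f(x + r, y)) adds three horizontal neighbors. A plain C++ sketch of those reduction semantics (f stands in for a 2-D input; all names illustrative):

```cpp
#include <vector>

// Emulates Expr e = sum(f(x + r, y)) with RDom r(-1, 3):
// the reduction loop runs r = min .. min + extent - 1, i.e. -1 .. 1.
int rdom_sum(const std::vector<std::vector<int>>& f, int x, int y) {
    int total = 0;
    for (int r = -1; r < -1 + 3; r++) {  // RDom r(-1, 3)
        total += f[y][x + r];
    }
    return total;
}
```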
Advanced Features of Funcs
11
bfloat16_t: truncated 16-bit version of float32 (bfloat16).
float16_t: IEEE 754 16-bit float representation.
Tuple: represents a Func with multiple outputs.
Func ops;
ops(x, y) = Tuple(expr_add, expr_sub, expr_mul, expr_div);
Halide::* : special math functions and operations, see https://halide-lang.org/docs/namespace_halide.html
Expr u8val;
u8val = u8(clamp(out, 0, 255));
u8val = saturating_cast<uint8_t>(out);
// other math ops: ceil, floor, pow, sin/cos/tan, ...
Expr logval = log(x);
// select works like "?:" in C, or like switch-case when nested
Expr c = select(e < 0, 0, e); // 0 where e is negative, e otherwise
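Both u8(clamp(out, 0, 255)) and saturating_cast<uint8_t>(out) clamp to the uint8 range before narrowing. A plain C++ sketch of that behavior on a scalar (illustrative, not Halide's implementation):

```cpp
#include <algorithm>
#include <cstdint>

// Equivalent of u8(clamp(out, 0, 255)) / saturating_cast<uint8_t>(out)
// for a scalar int: clamp into [0, 255], then narrow to 8 bits.
uint8_t saturating_u8(int out) {
    return static_cast<uint8_t>(std::clamp(out, 0, 255));
}
```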
JIT Image Processing Example
12
// load the input image
Buffer<uint8_t> input = load_image("images/rgb.png");
// function used to brighten the image
Func brighter;
// variables used to define brighter function
Var x, y, c;
// 'value' Expr is used to define the procedure of image processing
Expr value = input(x, y, c);
value = Halide::cast<float>( value);
value = value * 1.5f;
value = Halide::min(value, 255.0f);
value = Halide::cast<uint8_t>( value);
// define the function
brighter(x, y, c) = value;
// get output result
Buffer<uint8_t> output =
brighter.realize(input.width(), input.height(), input.channels());
// save the output to a file
save_image(output, "brighter.png");
Put It All Together! - 3x3 Blur - In JIT
13
https://github.com/champyen/halide_2019.git
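For comparison, a straightforward C++ reference of a 3x3 box blur on one channel (clamp-to-edge boundary; an illustrative baseline the Halide version should match):

```cpp
#include <algorithm>
#include <vector>

// Naive 3x3 box blur with clamp-to-edge boundary handling.
// in: w*h pixels in row-major order.
std::vector<int> blur3x3(const std::vector<int>& in, int w, int h) {
    std::vector<int> out(w * h);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int sum = 0;
            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++) {
                    int xx = std::clamp(x + dx, 0, w - 1);
                    int yy = std::clamp(y + dy, 0, h - 1);
                    sum += in[yy * w + xx];
                }
            }
            out[y * w + x] = sum / 9;
        }
    }
    return out;
}
```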
Scheduling in Halide
14
Scheduling Basics – Default Loop Structure
15
func_foo(a, b, c, … x, y, z) = …
// arguments run from the inner-most loop (a) to the outermost loop (z)
//default scheduling is equivalent to the loop nest below:
for(z = 0; z < Z_MAX; z++){
    for(y = 0; y < Y_MAX; y++){
        for(x = 0; x < X_MAX; x++){
            …
            for(a = 0; a < A_MAX; a++){
                //computation is here
            }
            …
        }
    }
}
Scheduling Basics - Reordering
16
func_foo.reorder(z, y, x, … c, b, a);
// reorder lists variables from the inner-most loop (z) to the outermost loop (a)
//reordered scheduling is equivalent to the loop nest below:
for(a = 0; a < A_MAX; a++){
    for(b = 0; b < B_MAX; b++){
        for(c = 0; c < C_MAX; c++){
            …
            for(z = 0; z < Z_MAX; z++){
                //computation is here
            }
            …
        }
    }
}
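Reordering changes traversal order and cache behavior, not the set of points computed; a minimal C++ sketch of a reordered (column-major) traversal (illustrative):

```cpp
#include <vector>

// Visit every (x, y) column-major instead of row-major; the set of
// points computed is identical, only the memory access pattern differs.
std::vector<int> traverse_reordered(int X_MAX, int Y_MAX) {
    std::vector<int> visits(X_MAX * Y_MAX, 0);
    for (int x = 0; x < X_MAX; x++) {       // now outermost
        for (int y = 0; y < Y_MAX; y++) {   // now innermost
            visits[y * X_MAX + x]++;
        }
    }
    return visits;
}
```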
Scheduling Basics - Splitting
17
func_foo(x, y) = ...
func_foo.split(y, yo, yi, 32);
//split scheduling is equivalent to the loop nest below
//(y = yo*32 + yi; assumes Y_MAX is a multiple of 32):
for(yo = 0; yo < Y_MAX/32; yo++){
    for(yi = 0; yi < 32; yi++){
        for(x = 0; x < X_MAX; x++){
            //computation is here
        }
    }
}
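Inside the split loops the original index is recovered as y = yo*32 + yi; a plain C++ sketch (assumes Y_MAX is a multiple of 32; Halide itself also handles non-dividing extents, e.g. by clamping the tail):

```cpp
#include <vector>

// Split y into yo (outer) and yi (inner) with factor 32; every
// original y in [0, Y_MAX) is visited exactly once as yo*32 + yi.
std::vector<int> split_visits(int Y_MAX) {  // assumes Y_MAX % 32 == 0
    std::vector<int> visits(Y_MAX, 0);
    for (int yo = 0; yo < Y_MAX / 32; yo++) {
        for (int yi = 0; yi < 32; yi++) {
            int y = yo * 32 + yi;  // reconstructed original index
            visits[y]++;
        }
    }
    return visits;
}
```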
Scheduling Basics - Tiling
18
func_foo(x, y) = ...
func_foo.tile(x, y, xo, xi, yo, yi, 32, 32);
//tiled scheduling is equivalent to the loop nest below
//(x = xo*32 + xi, y = yo*32 + yi):
for(yo = 0; yo < Y_MAX/32; yo++){
    for(xo = 0; xo < X_MAX/32; xo++){
        for(yi = 0; yi < 32; yi++){
            for(xi = 0; xi < 32; xi++){
                //computation is here
            }
        }
    }
}
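A tile is a split of both x and y plus a reorder; the original coordinates come back as x = xo*32 + xi and y = yo*32 + yi. A plain C++ sketch (assumes both extents are multiples of 32):

```cpp
#include <vector>

// Tiled traversal with 32x32 tiles: each (x, y), with
// x = xo*32 + xi and y = yo*32 + yi, is visited exactly once.
std::vector<int> tile_visits(int X_MAX, int Y_MAX) {
    std::vector<int> visits(X_MAX * Y_MAX, 0);
    for (int yo = 0; yo < Y_MAX / 32; yo++) {
        for (int xo = 0; xo < X_MAX / 32; xo++) {
            for (int yi = 0; yi < 32; yi++) {
                for (int xi = 0; xi < 32; xi++) {
                    int x = xo * 32 + xi, y = yo * 32 + yi;
                    visits[y * X_MAX + x]++;
                }
            }
        }
    }
    return visits;
}
```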
Schedule Basics - Fuse
19
func_foo(x, y) = ...
func_foo.fuse(x, y, fidx);
//fused scheduling is equivalent to the loop below
//(x and y are serialized into the single index fidx):
for(fidx = 0; fidx < X_MAX*Y_MAX; fidx++){
    //computation is here
}
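The fused index decomposes back into the original coordinates as x = fidx % X_MAX and y = fidx / X_MAX; a plain C++ sketch (illustrative):

```cpp
#include <utility>

// Recover (x, y) from the fused loop index: the fused loop
// for (fidx = 0; fidx < X_MAX*Y_MAX; fidx++) walks rows in order.
std::pair<int, int> unfuse(int fidx, int X_MAX) {
    int x = fidx % X_MAX;
    int y = fidx / X_MAX;
    return {x, y};
}
```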
Scheduling – Vectorize, Parallel
20
Vectorize
func_foo(x, y) = ...
func_foo.vectorize(x, 8);
//vectorized scheduling is equivalent to the loop nest below:
for(y = 0; y < Y_MAX; y++){
    for(x = 0; x < X_MAX; x += 8){
        //8-lane vectorized computation
    }
}
Parallel
func_foo(x, y) = ...
func_foo.parallel(y);
//parallel scheduling is equivalent to the loop nest below:
#pragma omp parallel for
for(y = 0; y < Y_MAX; y++){
    for(x = 0; x < X_MAX; x++){
        //computation is here
    }
}
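vectorize(x, 8) strip-mines x by 8 and maps the 8 inner iterations onto vector lanes; a scalar C++ sketch of that loop shape (assumes the extent is a multiple of 8; in generated code one SIMD op replaces the inner loop):

```cpp
#include <cstddef>
#include <vector>

// Each outer iteration handles 8 contiguous "lanes" at once --
// the loop shape vectorize(x, 8) produces.
std::vector<int> double_all(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    for (std::size_t x = 0; x < in.size(); x += 8) {    // vectorized loop
        for (std::size_t lane = 0; lane < 8; lane++) {  // one SIMD op in practice
            out[x + lane] = in[x + lane] * 2;
        }
    }
    return out;
}
```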
compute_at/store_at, compute_root/store_root
●the store position must be at the same loop level as, or outside, the compute position
●store_root => the stage/function keeps a whole-frame output buffer
●compute_root => breadth-first: the stage is computed entirely before its consumers
○also implies store_root
●store_at(Func, Var)
○the Func's storage is allocated in Var's loop of the consumer Func
●compute_at(Func, Var)
○the Func is computed in Var's loop of the consumer Func
○also implies store_at(Func, Var)
●Var::outermost()
21
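The compute_root vs compute_at trade-off can be sketched in plain C++: compute_root materializes the whole producer before any consumer runs (breadth-first), while compute_at(consumer, y) allocates and fills only what each y iteration needs. Hypothetical producer p(x) = x*x and consumer c(x, y) = p(x) + y; all names illustrative:

```cpp
#include <vector>

// compute_root style: produce the entire p buffer, then consume it.
std::vector<int> consume_root(int W, int H) {
    std::vector<int> p(W);                      // whole-frame storage (store_root)
    for (int x = 0; x < W; x++) p[x] = x * x;   // breadth-first: all of p first
    std::vector<int> out(W * H);
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) out[y * W + x] = p[x] + y;
    return out;
}

// compute_at(consumer, y) style: p is produced inside each y iteration,
// trading redundant work for locality (p stays hot in cache).
std::vector<int> consume_at_y(int W, int H) {
    std::vector<int> out(W * H);
    for (int y = 0; y < H; y++) {
        std::vector<int> p(W);                     // storage lives at y's loop level
        for (int x = 0; x < W; x++) p[x] = x * x;  // recomputed per y iteration
        for (int x = 0; x < W; x++) out[y * W + x] = p[x] + y;
    }
    return out;
}
```

Both schedules compute the same result; only where p is stored and when it is computed differ.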
AOT code structure & example
24
//box_aot.cpp: Box_2x2 DownSample
class BoxDown2 : public Generator<BoxDown2> {
public:
    // Input/Output types are not specified; they are set in the code-generation phase.
    Input<Buffer<>> input{"input", 3};
    Output<Func> output{"output", 3};

    Var x{"x"}, y{"y"}, c{"c"};

    void generate() {
        Func clamp_input = BoundaryConditions::repeat_edge(input);
        output(x, y, c) = cast(output.type(),
            ((clamp_input(2*x,   2*y,   c) +
              clamp_input(2*x+1, 2*y,   c) +
              clamp_input(2*x,   2*y+1, c) +
              clamp_input(2*x+1, 2*y+1, c)) >> 2));
    }