Halide tutorial 2019

champ_yen 345 views 29 slides Apr 08, 2021
Slide 1
Slide 1 of 29
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29

About This Presentation

Halide Language Tutorial


Slide Content

Introduction to Halide
Champ Yen
[email protected]

https://tinyurl.com/ubqye3y

Overview of Halide
2

Why Halide?
3
Halide's answer: decouples Algorithm from Scheduling

Algorithm: what is computed.
Schedule: where and when it's computed.
Easy for programmers to build pipelines 
•simplifies algorithm code
•improves modularity
Easy for programmers to specify & explore optimizations 
•fusion, tiling, parallelism, vectorization
•can’t break the algorithm
Easy for the compiler to generate fast code

Image Processing Tradeoffs
Experienced Engineers always keep
PARALLELISM, LOCALITY and REDUNDANT
WORK in mind.

Processing Policies/Skills used in image processing coding
5
bh(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y)/3
bv(x, y) = (bh(x, y-1) + bh(x, y) + bh(x, y+1)/3
Breadth-First
Sliding-Window
Fusion
Tiling
Sliding-Window
with Tiling

Performance - It's All about Scheduling
6
To optimize is to find a
better scheduling in the
valid space.

Example – C++, Optimized C++ and Halide
7

How Halide works
8

define Algorithms & JIT in
Halide
9

Types – Var, Expr, Func and RDom
10
Func: represents a (schedulable) pipeline stage.
Func gradient;
Var: names to use as variables in the definition of a Func.
Var x, y;
Expr: calculations of those variables, expressions and other functions in a function.
Expr e = x + y;
gradient(x, y) = e;
add a definition for the Func object:
RDom: reduction domain, calculate a value from a area of inputs, as loops for calculation
RDom r(-1, 3)// MIN, EXTENTS
Expr e = sum(f(x+r, y));

Advanced things in Functions
11
bfloat16_t: truncated version 16b version of float32
Func ops;
ops(x, y) = Tuple( expr_add, expr_sub, expr_mul, expr_div);
Tuple: represents a Func with mutiple outputs
float16_t: IEEE754 16-bit float representation
Halide::* : special math or operations, refer to https://halide-lang.org/docs/namespace_halide.html
Expr u8val;
u8val = u8(clamp(out, 0, 255));
u8val = saturating_cast<uint8>(out);
// math, other like ceil, floor, pow, sin/cos/tan ...
Expr logval = log(x);

// select, works like “?:” in C or switch-case in complex cases
Expr c = select( c < 0, 0, c);

JIT Image Processing Example
12
// load the input image
Buffer<uint8_t> input = load_image("images/rgb.png");
// function used to brighter the image
Func brighter;
// variables used to define brighter function
Var x, y, c;

// 'value' Expr is used to define the procedure of image processing
Expr value = input(x, y, c);
value = Halide::cast<float>( value);
value = value * 1.5f;
value = Halide::min(value, 255.0f);
value = Halide::cast<uint8_t>( value);

// define the function
brighter(x, y, c) = value;
// get output result
Buffer<uint8_t> output = 
        brighter.realize(input.width(), input.height(), input.channels());
// save the output to a file
save_image(output, "brighter.png");

Put It All Together! - 3x3 Blur - In JIT

13
https://github.com/champyen/halide_2019.git

Scheduling in Halide
14

Scheduling Basics – Default Loop Structure
15
func_foo (a, b, c, … x, y, z) = …
inner-most loop
outermost loop
//default scheduling equal to the below loop:
for(z = 0; z < Z_MAX; z++){
    for(y = 0; y < Y_MAX; y++){
        for(x = 0; x < X_MAX; x++){
            …
                for(a = 0; a < A_MAX; A++){
                    // computing at here
                }
            … 
        }
    }
}

Scheduling Basics - Reodering
16
func_foo.reorder (z, y, x, … c, b, a) = …
inner-most loop
outermost loop
//reordered scheduling equal to the below loop:
for(a = 0; a < A_MAX; a++){
    for(b = 0; b < B_MAX; b++){
        for(c = 0; c < C_MAX; c++){
            …
                for(z = 0; z < Z_MAX; Z++){
                    // computing at here
                }
            … 
        }
    }
}

Scheduling Basics - Splitting
17
func_foo(x, y) = ...
func_foo.split(y, yo, yi, 32);
//splitted scheduling equal to the below loop:
for(yo = 0; yo < Y_MAX/ 32; yo++){
    for(yi = 0; yi < 32; yi++){
        for(x = 0; x < X_MAX; x++){
            //computation is here
        }
    }
}

Scheduling Basics - Tiling
18
func_foo(x, y) = ...
func_foo.tile(x, y, xo, xi, yo, yi, 32, 32);
//tiled scheduling equal to the below loop:
for(yo = 0; yo < Y_MAX/ 32; yo++){
    for(xo = 0; xo < X_MAX/ 32; xo++){
        for(yi = 0; yi < 32; yi++){
            for(xi = 0; xi < 32; xi++{
                //computation is here
            }
        }
    }
}

Schedule Basics - Fuse
19
func_foo(x, y) = ...
func_foo.fuse(x, y, fidx);
//fused scheduling equal to the below loop:
for(fidx = 0; fidx < X_MAX*Y_MAX; fidx++){
    //computation is here
}
serialized by fidx

Scheduling – Vectorize, Parallel
20
func_foo(x, y) = ...
func_foo.vectorize(x, 8);
//vectorized scheduling equal to the below loop:
for(y = 0; y < Y_MAX; y++){
    for(x = 0; x < X_MAX; x+=8){
        //8-LANE auto-vectorization
    }
}
func_foo(x, y) = ...
func_foo.parallel(y);
//parallel scheduling equal to the below loop:
#pragma omp paralle for
for(y = 0; y < Y_MAX; y++){
    for(x = 0; x < X_MAX; x++){
        //computation is here
    }
}
Vectorize
Parallel

compute_at/store_at, compute_root/store_root
●store position should be same or outer than computation
●store_root => indicate the stage/function has whole frame buffer output
●compute_root => bread-first
○and also mean store_root
●store_at(Func, Var)
○the Func’s storage is declared in Var’s loop of Func
●compute_at( Func, Var )
○computed in Var’s loop of Func
○also mean store_at(Func, Var)
●Var::outermost()
21

The Schedule Directives Combinations
22

Ahead-of-Time(AOT) Workflow
23
CodeGen
Executable
Halide
Code
Static
Library
(.a + .h)
Function
Implement
Code
Halide
Shared
Library
(.so)
Final
Executable
/Library
Halide
Runtime
Buffer
(.h)

AOT code structure & example
24
//box_aot.cpp: Box_2x2 DownSample
class BoxDown2 : public Generator<BoxDown2> {
public:
    // Input/Output types are not specified, they are set in code-generation phase.
    Input<Buffer<>> input{" input", 3};
    Output<Func> output{" output", 3};

    void generate() {
        Func clamp_input = BoundaryConditions::repeat_edge(input);
        output(x, y, c) = cast(output.type(), 
                            ((clamp_input(2*x, 2*y, c)+
                            clamp_input(2*x+1, 2*y, c)+ 
                            clamp_input(2*x, 2*y+1, c)+ 
                            clamp_input(2*x+1, 2*y+1), c) >> 2) );
    }

    void schedule() {
        output.vectorize(x, 16).parallel(y);
    }

private:
    Var x, y, c;
};
HALIDE_REGISTER_GENERATOR( BoxDown2, box_down2);
$ clang++ -O3 -fno-rtti -std=c++11 -o box_aot box_aot.cpp $HALIDE_ROOT/tools/GenGen.cpp -I $HALIDE_ROOT/include/ -L $HALIDE_ROOT/bin/ -lHalide -ltinfo
-lpthread -ldl; 
//change targe to "arm-64-android" for Android usage
$ LD_LIBRARY_PATH=$HALIDE_ROOT/bin/ ./box_aot -g box_down2 -o ./aot input.type=uint8 output.type=uint8 target=host

AOT code usage
25
//test.cpp

#include "halide_image_io.h"
#include "HalideBuffer.h"
#include "box_down2.h"

using namespace Halide::Tools;
using Halide::Runtime::Buffer;

int main(int argc, char** argv)
{
    Buffer<uint8_t> input = load_image(argv[1]);
    Buffer<uint8_t> output(input.width()/2, input.height()/2, input.channels());
    box_down2(input, output);

    save_image(output, "output.png");
}

$ clang++ -fno-rtti -std=c++11 -O3 -o test test.cpp aot/box_down2.a -I aot -I $HALIDE_ROOT/include -I
aot/ -lpthread -ldl -ljpeg -ltinfo -lpng –lz
$ ./test input.jpg

More about Runtime Buffer Manipulation
Buffer<uint8_t> buf(width, height); //2D buffer

// get buffer pointer
unsigned char* buf_ptr = (unsigned char*)( buf.data());

// get ROI buffer object
Buffer<uint8_t> crop_buf= buf. cropped(0, crop_x, crop_w).cropped(1, crop_y,
crop_h);



// use external memory (from other place, eg: OpenCV mat) for Buffer creation
uint8_t *data = (uint8*)malloc(width*height*channels);
Buffer<uint8_t> external_buf( data, channels, width, height);
26

Put It All Together! - Matrix Multiplication
•https://github.com/champyen/halide_2019
•halide_mm
•Generator
•mm_generator.cpp
•Application
•mm.cpp
27

Resource
•Halide Official Tutorial
•http://halide-lang.org/tutorials/tutorial_introduction.html
•Halide Site
•http://halide-lang.org/
•Halide GitHub
•https://github.com/halide/Halide
•https://suif.stanford.edu/~courses/cs243/lectures/l14-halide.pdf
•Qualcomm Halide Software (in Hexagon SDK)
•https://developer.qualcomm.com/software/hexagon-dsp-sdk/tools
28

Q & A
29
Tags