Why Halide?
3
Halide's answer: decouple the Algorithm from the Schedule
Algorithm: what is computed.
Schedule: where and when it's computed.
Easy for programmers to build pipelines
•simplifies algorithm code
•improves modularity
Easy for programmers to specify & explore optimizations
•fusion, tiling, parallelism, vectorization
•can’t break the algorithm
Easy for the compiler to generate fast code
Image Processing Tradeoffs
Experienced engineers always keep PARALLELISM, LOCALITY, and REDUNDANT WORK in mind.
Performance - It's All about Scheduling
6
To optimize is to find a better schedule within the valid space.
Example – C++, Optimized C++ and Halide
7
How Halide works
8
Defining Algorithms & JIT in Halide
9
Types – Var, Expr, Func and RDom
10
Func: represents a (schedulable) pipeline stage.
Func gradient;
Var: names to use as variables in the definition of a Func.
Var x, y;
Expr: an expression built from Vars, constants, and calls to other Funcs.
Expr e = x + y;
Add a definition to the Func object:
gradient(x, y) = e;
RDom: a reduction domain; iterates over a range of inputs, acting as the loop of a reduction.
RDom r(-1, 3); // MIN, EXTENT
Expr e = sum(f(x + r, y)); // f: some previously defined Func
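RDom r(-1, 3) spans r = -1, 0, 1 (min -1, extent 3), so sum(f(x + r, y)) adds three horizontal neighbors. A plain C++ sketch of those reduction semantics (f stands in for a 2-D input; all names illustrative):

```cpp
#include <vector>

// Emulates Expr e = sum(f(x + r, y)) with RDom r(-1, 3):
// the reduction loop runs r = min .. min + extent - 1, i.e. -1 .. 1.
int rdom_sum(const std::vector<std::vector<int>>& f, int x, int y) {
    int total = 0;
    for (int r = -1; r < -1 + 3; r++) {  // RDom r(-1, 3)
        total += f[y][x + r];
    }
    return total;
}
```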
Advanced Features of Funcs
11
bfloat16_t: truncated 16-bit version of float32 (bfloat16).
float16_t: IEEE 754 16-bit float representation.
Tuple: represents a Func with multiple outputs.
Func ops;
ops(x, y) = Tuple(expr_add, expr_sub, expr_mul, expr_div);
Halide::* : special math functions and operations, see https://halide-lang.org/docs/namespace_halide.html
Expr u8val;
u8val = u8(clamp(out, 0, 255));
u8val = saturating_cast<uint8_t>(out);
// other math ops: ceil, floor, pow, sin/cos/tan, ...
Expr logval = log(x);
// select works like "?:" in C, or like switch-case when nested
Expr c = select(e < 0, 0, e); // 0 where e is negative, e otherwise
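Both u8(clamp(out, 0, 255)) and saturating_cast<uint8_t>(out) clamp to the uint8 range before narrowing. A plain C++ sketch of that behavior on a scalar (illustrative, not Halide's implementation):

```cpp
#include <algorithm>
#include <cstdint>

// Equivalent of u8(clamp(out, 0, 255)) / saturating_cast<uint8_t>(out)
// for a scalar int: clamp into [0, 255], then narrow to 8 bits.
uint8_t saturating_u8(int out) {
    return static_cast<uint8_t>(std::clamp(out, 0, 255));
}
```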
JIT Image Processing Example
12
// load the input image
Buffer<uint8_t> input = load_image("images/rgb.png");
// function used to brighten the image
Func brighter;
// variables used to define brighter function
Var x, y, c;
// 'value' Expr is used to define the procedure of image processing
Expr value = input(x, y, c);
value = Halide::cast<float>( value);
value = value * 1.5f;
value = Halide::min(value, 255.0f);
value = Halide::cast<uint8_t>( value);
// define the function
brighter(x, y, c) = value;
// get output result
Buffer<uint8_t> output =
brighter.realize(input.width(), input.height(), input.channels());
// save the output to a file
save_image(output, "brighter.png");
Put It All Together! - 3x3 Blur - In JIT
13
https://github.com/champyen/halide_2019.git
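For comparison, a straightforward C++ reference of a 3x3 box blur on one channel (clamp-to-edge boundary; an illustrative baseline the Halide version should match):

```cpp
#include <algorithm>
#include <vector>

// Naive 3x3 box blur with clamp-to-edge boundary handling.
// in: w*h pixels in row-major order.
std::vector<int> blur3x3(const std::vector<int>& in, int w, int h) {
    std::vector<int> out(w * h);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int sum = 0;
            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++) {
                    int xx = std::clamp(x + dx, 0, w - 1);
                    int yy = std::clamp(y + dy, 0, h - 1);
                    sum += in[yy * w + xx];
                }
            }
            out[y * w + x] = sum / 9;
        }
    }
    return out;
}
```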
Scheduling in Halide
14
Scheduling Basics – Default Loop Structure
15
func_foo(a, b, c, … x, y, z) = …
// arguments run from the inner-most loop (a) to the outermost loop (z)
//default scheduling is equivalent to the loop nest below:
for(z = 0; z < Z_MAX; z++){
    for(y = 0; y < Y_MAX; y++){
        for(x = 0; x < X_MAX; x++){
            …
            for(a = 0; a < A_MAX; a++){
                //computation is here
            }
            …
        }
    }
}
Scheduling Basics - Reordering
16
func_foo.reorder(z, y, x, … c, b, a);
// reorder lists variables from the inner-most loop (z) to the outermost loop (a)
//reordered scheduling is equivalent to the loop nest below:
for(a = 0; a < A_MAX; a++){
    for(b = 0; b < B_MAX; b++){
        for(c = 0; c < C_MAX; c++){
            …
            for(z = 0; z < Z_MAX; z++){
                //computation is here
            }
            …
        }
    }
}
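Reordering changes traversal order and cache behavior, not the set of points computed; a minimal C++ sketch of a reordered (column-major) traversal (illustrative):

```cpp
#include <vector>

// Visit every (x, y) column-major instead of row-major; the set of
// points computed is identical, only the memory access pattern differs.
std::vector<int> traverse_reordered(int X_MAX, int Y_MAX) {
    std::vector<int> visits(X_MAX * Y_MAX, 0);
    for (int x = 0; x < X_MAX; x++) {       // now outermost
        for (int y = 0; y < Y_MAX; y++) {   // now innermost
            visits[y * X_MAX + x]++;
        }
    }
    return visits;
}
```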
Scheduling Basics - Splitting
17
func_foo(x, y) = ...
func_foo.split(y, yo, yi, 32);
//split scheduling is equivalent to the loop nest below
//(y = yo*32 + yi; assumes Y_MAX is a multiple of 32):
for(yo = 0; yo < Y_MAX/32; yo++){
    for(yi = 0; yi < 32; yi++){
        for(x = 0; x < X_MAX; x++){
            //computation is here
        }
    }
}
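Inside the split loops the original index is recovered as y = yo*32 + yi; a plain C++ sketch (assumes Y_MAX is a multiple of 32; Halide itself also handles non-dividing extents, e.g. by clamping the tail):

```cpp
#include <vector>

// Split y into yo (outer) and yi (inner) with factor 32; every
// original y in [0, Y_MAX) is visited exactly once as yo*32 + yi.
std::vector<int> split_visits(int Y_MAX) {  // assumes Y_MAX % 32 == 0
    std::vector<int> visits(Y_MAX, 0);
    for (int yo = 0; yo < Y_MAX / 32; yo++) {
        for (int yi = 0; yi < 32; yi++) {
            int y = yo * 32 + yi;  // reconstructed original index
            visits[y]++;
        }
    }
    return visits;
}
```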
Scheduling Basics - Tiling
18
func_foo(x, y) = ...
func_foo.tile(x, y, xo, xi, yo, yi, 32, 32);
//tiled scheduling is equivalent to the loop nest below
//(x = xo*32 + xi, y = yo*32 + yi):
for(yo = 0; yo < Y_MAX/32; yo++){
    for(xo = 0; xo < X_MAX/32; xo++){
        for(yi = 0; yi < 32; yi++){
            for(xi = 0; xi < 32; xi++){
                //computation is here
            }
        }
    }
}
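A tile is a split of both x and y plus a reorder; the original coordinates come back as x = xo*32 + xi and y = yo*32 + yi. A plain C++ sketch (assumes both extents are multiples of 32):

```cpp
#include <vector>

// Tiled traversal with 32x32 tiles: each (x, y), with
// x = xo*32 + xi and y = yo*32 + yi, is visited exactly once.
std::vector<int> tile_visits(int X_MAX, int Y_MAX) {
    std::vector<int> visits(X_MAX * Y_MAX, 0);
    for (int yo = 0; yo < Y_MAX / 32; yo++) {
        for (int xo = 0; xo < X_MAX / 32; xo++) {
            for (int yi = 0; yi < 32; yi++) {
                for (int xi = 0; xi < 32; xi++) {
                    int x = xo * 32 + xi, y = yo * 32 + yi;
                    visits[y * X_MAX + x]++;
                }
            }
        }
    }
    return visits;
}
```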
Schedule Basics - Fuse
19
func_foo(x, y) = ...
func_foo.fuse(x, y, fidx);
//fused scheduling is equivalent to the loop below
//(x and y are serialized into the single index fidx):
for(fidx = 0; fidx < X_MAX*Y_MAX; fidx++){
    //computation is here
}
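The fused index decomposes back into the original coordinates as x = fidx % X_MAX and y = fidx / X_MAX; a plain C++ sketch (illustrative):

```cpp
#include <utility>

// Recover (x, y) from the fused loop index: the fused loop
// for (fidx = 0; fidx < X_MAX*Y_MAX; fidx++) walks rows in order.
std::pair<int, int> unfuse(int fidx, int X_MAX) {
    int x = fidx % X_MAX;
    int y = fidx / X_MAX;
    return {x, y};
}
```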
Scheduling – Vectorize, Parallel
20
Vectorize
func_foo(x, y) = ...
func_foo.vectorize(x, 8);
//vectorized scheduling is equivalent to the loop nest below:
for(y = 0; y < Y_MAX; y++){
    for(x = 0; x < X_MAX; x += 8){
        //8-lane vectorized computation
    }
}
Parallel
func_foo(x, y) = ...
func_foo.parallel(y);
//parallel scheduling is equivalent to the loop nest below:
#pragma omp parallel for
for(y = 0; y < Y_MAX; y++){
    for(x = 0; x < X_MAX; x++){
        //computation is here
    }
}
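vectorize(x, 8) strip-mines x by 8 and maps the 8 inner iterations onto vector lanes; a scalar C++ sketch of that loop shape (assumes the extent is a multiple of 8; in generated code one SIMD op replaces the inner loop):

```cpp
#include <cstddef>
#include <vector>

// Each outer iteration handles 8 contiguous "lanes" at once --
// the loop shape vectorize(x, 8) produces.
std::vector<int> double_all(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    for (std::size_t x = 0; x < in.size(); x += 8) {    // vectorized loop
        for (std::size_t lane = 0; lane < 8; lane++) {  // one SIMD op in practice
            out[x + lane] = in[x + lane] * 2;
        }
    }
    return out;
}
```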
compute_at/store_at, compute_root/store_root
●the store position must be at the same loop level as, or outside, the compute position
●store_root => the stage/function keeps a whole-frame output buffer
●compute_root => breadth-first: the stage is computed entirely before its consumers
○also implies store_root
●store_at(Func, Var)
○the Func's storage is allocated in Var's loop of the consumer Func
●compute_at(Func, Var)
○the Func is computed in Var's loop of the consumer Func
○also implies store_at(Func, Var)
●Var::outermost()
21
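The compute_root vs compute_at trade-off can be sketched in plain C++: compute_root materializes the whole producer before any consumer runs (breadth-first), while compute_at(consumer, y) allocates and fills only what each y iteration needs. Hypothetical producer p(x) = x*x and consumer c(x, y) = p(x) + y; all names illustrative:

```cpp
#include <vector>

// compute_root style: produce the entire p buffer, then consume it.
std::vector<int> consume_root(int W, int H) {
    std::vector<int> p(W);                      // whole-frame storage (store_root)
    for (int x = 0; x < W; x++) p[x] = x * x;   // breadth-first: all of p first
    std::vector<int> out(W * H);
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) out[y * W + x] = p[x] + y;
    return out;
}

// compute_at(consumer, y) style: p is produced inside each y iteration,
// trading redundant work for locality (p stays hot in cache).
std::vector<int> consume_at_y(int W, int H) {
    std::vector<int> out(W * H);
    for (int y = 0; y < H; y++) {
        std::vector<int> p(W);                     // storage lives at y's loop level
        for (int x = 0; x < W; x++) p[x] = x * x;  // recomputed per y iteration
        for (int x = 0; x < W; x++) out[y * W + x] = p[x] + y;
    }
    return out;
}
```

Both schedules compute the same result; only where p is stored and when it is computed differ.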
AOT code structure & example
24
//box_aot.cpp: Box_2x2 DownSample
class BoxDown2 : public Generator<BoxDown2> {
public:
    // Input/Output types are not specified; they are set in the code-generation phase.
    Input<Buffer<>> input{"input", 3};
    Output<Func> output{"output", 3};

    Var x{"x"}, y{"y"}, c{"c"};

    void generate() {
        Func clamp_input = BoundaryConditions::repeat_edge(input);
        output(x, y, c) = cast(output.type(),
            ((clamp_input(2*x,   2*y,   c) +
              clamp_input(2*x+1, 2*y,   c) +
              clamp_input(2*x,   2*y+1, c) +
              clamp_input(2*x+1, 2*y+1, c)) >> 2));
    }