A Peek into TFRT


About This Presentation

TensorFlow Runtime (TFRT) is a new runtime for TensorFlow. TFRT has some well-thought-out designs that are worth spending some time on.


Slide Content

Koan-Sin Tan,
[email protected]
COSCUP, Aug 2nd, 2020
TensorFlow Runtime
A Peek into the Future of TensorFlow
1

•disclaimer: opinions are my own
•feel free to interrupt me if you have any questions during the presentation
•questions can be in Taiwanese, English, or Mandarin
•most of the TFRT materials are adapted from the TFRT deep dive at an MLIR open design meeting [1] and the TFRT docs [2]
•code around Aug 1, 2020 (git commit ecf1c20 [3])
[1] TFRT Deep Dive, slides and recording: https://mlir.llvm.org/talks/
[2] https://github.com/tensorflow/runtime/tree/master/documents
[3] https://github.com/tensorflow/runtime/commit/ecf1c20
2

who i am
•Used open source before the term “open source” was coined
•A software guy; learned to use Unix and open source software on a VAX-11/780 running 4.3BSD
•Used to be a programming language junkie
•Worked on various system software, e.g., CPU scheduling and power management of non-CPU components
•Recently, working on NN performance on edge devices and related stuff
•Contributed from time to time to TensorFlow Lite
•started a command-line label_image for TFLite
https://gunkies.org/w/images/c/c1/DEC-VAX-11-780.jpg
3

What is TFRT
•TensorFlow Runtime (TFRT) is one of two new MLIR-based runtimes that have emerged in 2020 so far.
•The other is the Intermediate Representation Execution Environment (IREE). So far, TFRT seems to have better design documentation
•Both of them have mobile/edge environments in mind.
•I haven’t seen mobile acceleration code in TFRT yet.
•IREE has some Vulkan-related code, and some simple code already works on Android
•ResNet GPU inference is 28% faster with TFRT
•https://github.com/tensorflow/runtime, https://youtu.be/15tiQoPpuZ8
4

Build it
•if you follow the instructions described in README.md, it should just work, at least on x86_64 Linux
•however, it’s not tested on non-Linux environments yet
•ssize_t and int64_t (see the sketch after this list)
•on Mac OS X: ssize_t is long, int64_t is long long
•the current code mixes the use of ssize_t and int64_t
•test: one of the acclaimed features of TFRT, like MLIR, is its use of LLVM FileCheck
•my hacks: shape-related (ssize_t) tests not fixed yet
•it’s not tested on non-x86 platforms, such as aarch64, either 
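A minimal sketch of the ssize_t/int64_t mismatch mentioned above (a hypothetical snippet, not TFRT code): on macOS both types are 64 bits wide, but int64_t is long long while ssize_t is long, so pointers to them are not interchangeable.

#include <cstdint>      // int64_t
#include <sys/types.h>  // ssize_t

// Hypothetical shape-filling API that takes int64_t*, as TFRT shape code does.
static void FillShape(int64_t* dims, int rank) {
  for (int i = 0; i < rank; ++i) dims[i] = 1;
}

void Caller() {
  ssize_t shape[4];
  // On macOS the direct call would not compile: long* does not convert to long long*.
  // FillShape(shape, 4);
  FillShape(reinterpret_cast<int64_t*>(shape), 4);  // compiles, but only hides the mismatch
}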


5

•The three key directories under the TFRT root directory are
•lib: Contains core TFRT infrastructure code
•backends: Contains device specific infrastructure and op/kernel implementations
•include: Contains public header files for core TFRT infrastructure
6

Walking thru the tutorial
•unfortunately, it seems it’s not easy to jump directly into the source code without some background knowledge
•so we’ll walk thru the tutorial [1]
•What’s in the tutorial:
•print hello world
•print integer
•adding kernels
[1] https://github.com/tensorflow/runtime/blob/master/documents/tutorial.md
7

using tfrt and tfrt_test
hello.mlir
func @hello() {
  %chain = tfrt.new.chain

  // Create a string containing "hello world" and store it in %hello.
  %hello = "tfrt_test.get_string"() { string_attr = "hello world" } : () -> !tfrt.string

  // Print the string in %hello.
  "tfrt_test.print_string"(%hello, %chain) : (!tfrt.string, !tfrt.chain) -> !tfrt.chain

  tfrt.return
}
The @hello function above shows how to create and print a string. The text after each ‘:’ specifies the types involved:
•() -> !tfrt.string means that tfrt_test.get_string takes no arguments and returns a !tfrt.string. tfrt is an MLIR dialect prefix (or namespace) for TFRT
•(!tfrt.string, !tfrt.chain) -> !tfrt.chain means that tfrt_test.print_string takes two arguments (!tfrt.string and !tfrt.chain) and returns a !tfrt.chain. A chain [1] is a TFRT abstraction to manage dependencies
[1] https://github.com/tensorflow/runtime/blob/master/documents/explicit_dependency.md
8

hello world in MLIR
func @stringconstant() -> !llvm<"[12 x i8]"> {
  %1 = llvm.constant("Hello world!") : !llvm<"[12 x i8]">
  // CHECK: ret [12 x i8] c"Hello world!"
  llvm.return %1 : !llvm<"[12 x i8]">
}

func @main() {
  %0 = llvm.constant(0) : !llvm.i64
  %1 = call @stringconstant() : () -> !llvm<"[12 x i8]">
  %2 = llvm.getelementptr %1[%0] : (!llvm<"[12 x i8]">, !llvm.i64) -> !llvm<"i8*">
  %3 = llvm.call @puts(%2) : (!llvm<"i8*">) -> !llvm.i32
  return
}

func @puts(!llvm<"i8*">) -> !llvm.i32
•the MLIR “standard dialect” doesn’t have I/O functions
•there is an LLVM dialect, though, so we can use it to call a standard libc function such as puts()
9

Hello integer
func @hello_integers() {
  %chain = tfrt.new.chain

  // Create an integer containing 42.
  %forty_two = tfrt.constant.i32 42

  // Print 42.
  tfrt.print.i32 %forty_two, %chain

  tfrt.return
}
•as stated in the tutorial, we can run other functions in the same module
•we can turn to more basic types, such as integers or floating point numbers
•@hello_integers shows how to create and print integers
•This example does not have the verbose type information we saw in @hello because there are custom parsers for the tfrt.constant.i32 and tfrt.print.i32 kernels in basic_kernels.td
10

basic_kernels.td
•.td files are LLVM TableGen descriptions [1]
[1] TableGen, https://llvm.org/docs/TableGen/
class ConstantOp<string suffix, Type baseType, Attr attr>
    : TFRT_Op<"constant." # suffix, [NoSideEffect]> {
  let summary = "host executor constant value constructor";

  let arguments = (ins attr:$value);
  let results = (outs baseType);
}

class PrintOp<string suffix, Type type> : TFRT_Op<"print." # suffix> {
  let summary = "tfrt.print operation";

  let description = [{
    An operation takes a number input and a chain input.
    It prints the number to stdout and returns a chain output.
    The chain input must be the second operand.

    Example:
      %2 = tfrt.print.i32 %0, %1
  }];

  let arguments = (ins type, TFRT_ChainType);
  let results = (outs TFRT_ChainType);
  let assemblyFormat = "operands attr-dict";
  let verifier = ?;
}
https://github.com/tensorflow/runtime/blob/master/include/tfrt/basic_kernels/opdefs/basic_kernels.td#L376-L390
https://github.com/tensorflow/runtime/blob/master/include/tfrt/basic_kernels/opdefs/basic_kernels.td#L58-L64
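The TableGen classes above only define the op syntax; the kernels behind tfrt.constant.i32 and tfrt.print.i32 are ordinary C++ functions registered under the same names. A rough sketch of what they might look like (simplified and hypothetical, not the exact code in lib/basic_kernels/integer_kernels.cc; header paths approximate):

#include <cstdint>
#include <cstdio>

#include "tfrt/host_context/chain.h"
#include "tfrt/host_context/kernel_registry.h"
#include "tfrt/host_context/kernel_utils.h"

namespace tfrt {

// Backs tfrt.constant.i32: reads the 'value' attribute and returns an i32.
static int32_t TFRTConstantI32(Attribute<int32_t> value) { return *value; }

// Backs tfrt.print.i32: takes an i32 and a chain, prints, and returns a chain.
static Chain TFRTPrintI32(int32_t value, Chain chain) {
  printf("int32 = %d\n", value);
  return chain;
}

// TFRT_KERNEL adapts a typed C++ function to the AsyncKernelFrame calling
// convention used by the executor.
void RegisterMyIntegerKernels(KernelRegistry* registry) {
  registry->AddKernel("tfrt.constant.i32", TFRT_KERNEL(TFRTConstantI32));
  registry->AddKernel("tfrt.print.i32", TFRT_KERNEL(TFRTPrintI32));
}

}  // namespace tfrt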
11

Define kernels
12

user defined kernels
func @print_coordinate() {
  %chain = tfrt.new.chain

  %two = tfrt.constant.i32 2
  %four = tfrt.constant.i32 4

  %coordinate = "my.create_coordinate"(%two, %four) : (i32, i32) -> !my.coordinate
  "my.print_coordinate"(%coordinate, %chain) : (!my.coordinate, !tfrt.chain) -> !tfrt.chain

  tfrt.return
}
coordinate.mlir shows several TFRT features:
•MLIR types that begin with an exclamation mark (!) are user-defined types like !my.coordinate, compared to built-in types like i32
•Kernels are just C++ functions with a name in MLIR: my.print_coordinate is the MLIR name for the C++ PrintCoordinate function (see the C++ sketch below)
•Kernels may pass arbitrary user-defined types: my.create_coordinate passes a custom Coordinate struct to my.print_coordinate
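On the C++ side, the tutorial implements these kernels as plain functions plus a registration call; a sketch along those lines (names, headers, and printing details approximate rather than copied verbatim):

#include <cstdint>
#include <cstdio>

#include "tfrt/host_context/chain.h"
#include "tfrt/host_context/kernel_registry.h"
#include "tfrt/host_context/kernel_utils.h"

namespace {

// The arbitrary user-defined type that flows between the two kernels.
struct Coordinate {
  int32_t x = 0;
  int32_t y = 0;
};

// Backs my.create_coordinate: (i32, i32) -> !my.coordinate.
Coordinate CreateCoordinate(int32_t x, int32_t y) { return Coordinate{x, y}; }

// Backs my.print_coordinate: (!my.coordinate, !tfrt.chain) -> !tfrt.chain.
tfrt::Chain PrintCoordinate(Coordinate coordinate, tfrt::Chain chain) {
  printf("(%d, %d)\n", coordinate.x, coordinate.y);
  return chain;
}

}  // namespace

// Hook the C++ functions up to their MLIR kernel names.
void RegisterCoordinateKernels(tfrt::KernelRegistry* registry) {
  registry->AddKernel("my.create_coordinate", TFRT_KERNEL(CreateCoordinate));
  registry->AddKernel("my.print_coordinate", TFRT_KERNEL(PrintCoordinate));
}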
13

to dig into some code we need more system information
14

Host Runtime
15

•TensorFlow user passes into TFRT a TensorFlow graph created via high-level TensorFlow APIs, and
•TFRT then calls the MLIR-based graph compiler to optimize and lower the graph into BEF, a Binary Executable Format for TFRT graph execution (MLIR is the compiler infrastructure that we use to represent TFRT host programs).
•The blue arrows in the simplified TensorFlow training stack diagram show this flow.
16

•In the README.md we are told to build two binaries: tfrt_translate and bef_executor
•tfrt_translate
•The tfrt_translate program does round-trip translation between MLIR and BEF, similar to an assembler and disassembler.
•bef_executor
•The bef_executor program is the execution driver of BEF files. It reads in a BEF file, sets up the runtime, and asynchronously executes function(s) in that file.
17

TFRT Host Runtime
•Foundation of TFRT: schedules work on the host and devices
•Clean separation between host and device runtimes:
•Host runtime does not know anything about devices, just their runtimes (sets of kernels)
•Key design points:
•Fully asynchronous: kernel executions cannot block
•Excellent error propagation in the presence of asynchrony
•Performance as a first-class concern, for graph and eager
•Outline:
•Common runtime infrastructure
•Graph execution
•Op-by-op execution (“eager”)
18

Key Abstraction: AsyncValue
•Container for data or resources
•Not Tensor specific
•A “future” type, fulfilled with exactly one value, or an error
•Lock-free, low memory overhead, type-erased, reference counted
•Helper class AsyncValueRef<T> provides type safety when the contained type is known
•AsyncValues enable efficient asynchronous compute
•Asynchronous functions return unavailable AsyncValues
•Caller can schedule dependent computations with AsyncValue::AndThen() (see the sketch below)
•Caller need not block until the AsyncValue becomes available
https://github.com/tensorflow/runtime/blob/master/include/tfrt/host_context/async_value.h
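A minimal sketch of the AndThen() pattern described above (simplified; the lambda capture style and header path are approximate rather than copied from TFRT):

#include "tfrt/host_context/async_value_ref.h"

// Schedule dependent work on an AsyncValueRef<int> without blocking the caller.
void ConsumeLater(tfrt::AsyncValueRef<int> result) {
  // The callback runs once `result` becomes available (or immediately if it
  // already is); errors propagate through the same AsyncValue.
  result.AndThen([result = result.CopyRef()] {
    if (result.IsError()) return;  // error handling elided in this sketch
    int value = result.get();
    (void)value;                   // use the concrete value here
  });
}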
19

Kernels
•Kernel: unit of computation scheduled by the runtime
•Similar to kernel concept in current TensorFlow
•Kernels accept AsyncValue inputs and produce AsyncValue outputs
•Runtime coordinates dataflow of AsyncValues between kernels
•Outputs may not be immediately available, unlike current TensorFlow
•Runtime generally does not understand kernel semantics
// Kernel that adds two integers.
// AsyncKernelFrame holds the kernel's arguments and results.
static void TFRTAdd(AsyncKernelFrame* frame) {
  // Fetch the kernel's 0th argument.
  AsyncValue* arg1 = frame->GetArgAt(0);
  // Fetch the kernel's 1st argument.
  AsyncValue* arg2 = frame->GetArgAt(1);

  int v1 = arg1->get<int>();
  int v2 = arg2->get<int>();

  // Set the kernel's 0th result.
  frame->EmplaceResultAt<int>(0, v1 + v2);
}
https://github.com/tensorflow/runtime/blob/master/documents/tfrt_host_runtime_design.md
https://github.com/tensorflow/runtime/blob/master/lib/basic_kernels/integer_kernels.cc#L39-L45
https://github.com/tensorflow/runtime/blob/master/include/tfrt/host_context/kernel_utils.h#L61-L149
20

Host Program
•Host programs encode a dataflow graph
•Similar to GraphDef in current TensorFlow
•Expressed in MLIR. Typically compiler generated
•Designed for low-level dispatch efficiency
•Designed for compiler transformations and analysis, e.g.,
•Use dataflow analysis for buffer reuse



func @sample_function() -> i32 {
  %one = tfrt.constant.i32 1        // Make AsyncValue with value 1
  %two = tfrt.constant.i32 2        // Make AsyncValue with value 2
  %three = tfrt.add.i32 %one, %two  // Make AsyncValue with value 3 (1+2)

  %ch0 = tfrt.new.chain
  tfrt.print.i32 %three, %ch0       // Print AsyncValue %three

  tfrt.return %three : i32          // Return AsyncValue %three
}
21

TFRT Binary Executable Format (BEF)
•BEF encodes a hardware-specific lowered graph function
•Primary interface between compiler and runtime
•Designed for efficient execution
•Low overhead: execute program by reading mmap'd byte array
•Persistent and stable: compile once offline, run many times online. Great for inference use-cases
•Composed of sections, similar to ELF. Each section has its own format
•Extensible: BEF is versioned, reader ignores unknown sections, new versions may define new sections
 https://github.com/tensorflow/runtime/blob/master/documents/binary_executable_format.md
22

BEF Executor
•BEF Executor evaluates a BEF dataflow graph “executor” style:
•Not a bytecode-like interpreter: no concept of program counter
•“Strict” execution by default: run a kernel only when all its inputs are available
•Executor features:
•Lock-free: atomics instead of mutexes
•Non-blocking: defer dependent work with AsyncValue::AndThen
•Supports “non-strict” execution: may run a kernel when some of its inputs are available
•Good for efficiently forwarding unavailable inputs to outputs
•Key concepts:
•BEF: dataflow graph
•Kernel: dataflow node
•AsyncValues: dataflow edge
https://github.com/tensorflow/runtime/blob/master/lib/bef_executor/bef_interpreter.cc#L223-L254
23

Host Runtime Summary
24

How about Core Runtime?
•Surely, we can do a similar walkthrough, but that would take more time
•Two things
•Op Execution API, Execute()
•BEF Executor can handle it too



void CoreRuntime::Impl::Execute(const ExecutionContext& exec_ctx,
                                string_view op_name, OpHandler* op_handler,
                                MutableArrayRef<TensorHandle> arguments,
                                const OpAttrsRef& attrs,
                                MutableArrayRef<TensorHandle> results,
                                AsyncValueRef<Chain>* chain) {
  // Ask the op_handler to execute the op. If successful, we're done.
  auto op_handle = op_handler->MakeOp(op_name);
  if (op_handle) {
    op_handle.get()(exec_ctx, arguments, attrs, results, chain);
    return;
  }

  // Otherwise, we fail with an 'unknown op' error.
  auto err =
      EmitErrorAsync(exec_ctx, "op '" + op_name.str() + "' is not supported");
  for (auto& result : results) result = TensorHandle(err.CopyRef());
  if (chain) *chain = std::move(err);
}
25
https://github.com/tensorflow/runtime/blob/master/lib/core_runtime/core_runtime.cc#L124-L143
https://github.com/tensorflow/runtime/blob/master/documents/tfrt_op_by_op_execution_design.md

BEF Executor for “op” graph
•corert.executeop
•sample



26

https://github.com/tensorflow/runtime/blob/master/lib/core_runtime/kernels.cc
func @example() -> !tfrt.chain {
  %cpu = corert.get_op_handler("cpu")

  // Create TensorHandles
  %lhs = corert.executeop(%cpu) "test.create_dense_tensor"() { shape = [1, 1], values = [-1.0 : f32] }
  %rhs = corert.executeop(%cpu) "test.create_dense_tensor"() { shape = [1, 1], values = [-2.0 : f32] }

  %result = corert.executeop(%cpu) "test.add"(%lhs, %rhs)

  %ch0 = tfrt.new.chain
  %ch1 = corert.print_tensorhandle(%result, %ch0)
  tfrt.return %ch1 : !tfrt.chain
}

A second variant of the same example:

func @example() -> !tfrt.chain {
  %ch0 = tfrt.new.chain
  %cpu = corert.get_op_handler %ch0 "cpu"

  // Create TensorHandles
  %lhs = corert.executeop(%cpu) "test.create_dense_tensor"() { shape = [1, 1], values = [-1.0 : f32] } : 1
  %rhs = corert.executeop(%cpu) "test.create_dense_tensor"() { shape = [1, 1], values = [-2.0 : f32] } : 1

  %result = corert.executeop(%cpu) "test.add"(%lhs, %rhs) : 1

  %ch1 = "corert.print_tensorhandle"(%result, %ch0) : (!corert.tensorhandle, !tfrt.chain) -> !tfrt.chain
  tfrt.return %ch1 : !tfrt.chain
}

Device Runtime
CPU
27
//===----------------------------------------------------------------------===//
// CPU Relu kernels
//===----------------------------------------------------------------------===//

// Computes B = Relu(A).
template <typename T>
static AsyncValueRef<Chain> Relu(const DenseHostTensor& A, DenseHostTensor* B,
                                 const ExecutionContext& exec_ctx) {
  auto fn = [](auto& a, auto& b) { return a.cwiseMax(static_cast<T>(0)); };
  return ::tfrt::compat::UnaryEigenKernelAsync<T, T>(A, B, std::move(fn),
                                                     exec_ctx);
}

//===----------------------------------------------------------------------===//
// CPU BiasAdd kernels
//===----------------------------------------------------------------------===//

// A special case of tf.add where bias is restricted to be 1-D.
// Currently only supports the NHWC data format.
template <typename T, size_t RANK>
static AsyncValueRef<Chain> BiasAdd(const DenseHostTensor& input,
                                    const DenseHostTensor& bias,
                                    DenseHostTensor* output,
                                    const ExecutionContext& exec_ctx) {
  DHTIndexableView<T, RANK> input_view(&input);
  MutableDHTIndexableView<T, RANK> output_view(output);
  DHTIndexableView<T, 1> bias_view(&bias);

  const auto& shape_input = input_view.FixedShape();
  const auto& shape_bias = bias_view.FixedShape();
  const auto& shape_output = output_view.FixedShape();

  if (shape_input != shape_output) {
    return EmitErrorAsync(exec_ctx, "unexpected output shape");
  }
  if (shape_bias[0] != shape_input[RANK - 1]) {
    return EmitErrorAsync(exec_ctx, "bias shape does not match input shape");
  }

  // Reshape bias to the shape of input. Broadcast along the last axis of input.
  Eigen::array<Eigen::Index, RANK> reshape_dims;
  Eigen::array<Eigen::Index, RANK> broadcast_dims;
  for (size_t i = 0; i < RANK - 1; ++i) {
    reshape_dims[i] = static_cast<Eigen::Index>(1);
    broadcast_dims[i] = static_cast<Eigen::Index>(shape_input[i]);
  }
  reshape_dims[RANK - 1] = static_cast<Eigen::Index>(shape_bias[0]);
  broadcast_dims[RANK - 1] = static_cast<Eigen::Index>(1);

  auto input_t = AsEigenConstTensor(input_view);
  auto bias_t = AsEigenConstTensor(bias_view);
  auto output_t = AsEigenTensor(output_view);
  auto expr = input_t + bias_t.reshape(reshape_dims).broadcast(broadcast_dims);

  return AsyncAssign(
      exec_ctx.host()->GetOrCreateSharedContext<EigenHostContext>(),
      std::move(output_t), std::move(expr),
      KeepBuffers::alive(&input, &bias, output));
}
https://github.com/tensorflow/runtime/blob/master/backends/cpu/lib/kernels/cpu_kernels.h

Dialects we can see now
•tfrt: we know what this is for
•tfrt_test: to test tfrt
•tfrt_data: tf.data, to deal with input pipeline
•tfrt_dht: dense host tensor
•corert: Core Runtime, eager execution
•ts: tensor shape
•coo: COOrdinate list sparse tensor
•eigen: wrapper around the eigen library
•btf: binary tensor format
•cuda: you know what cuda means :-)
28

Concluding Remarks
•MLIR related talks and publications, https://mlir.llvm.org/talks/
•We scratched the surface of TFRT host runtime and core runtime. There are more details
•threading model: thread pool / work queue,
•memory allocation: tcmalloc for server, other small allocators for embedded systems,
•non-strict execution, and
•registers: BEF executor is a register machine
•we didn’t touch other important components, such as device runtimes (esp. the GPU part) and the distributed environment
29

Fin
30

Device Runtime Design Principles
•A thin wrapper over low-level (driver) APIs, exposing device capabilities to the graph compiler
•Memory Allocation
•Async host <-> device transfer, and kernel execution
•Dependency management
•Focus on mechanism instead of policy
•E.g. No built-in special-purpose streams for GPU support:
•For pure eager execution, can default to one stream for everything
•For tf.function execution, compiler can pick streams
31