A Peek into TFRT


About This Presentation

TensorFlow Runtime (TFRT) is a new runtime for TensorFlow. TFRT has some well-thought-out designs that are worth spending some time on.


Slide Content

Koan-Sin Tan,
[email protected]
COSCUP, Aug 2nd, 2020
TensorFlow Runtime
A Peek into the Future of TensorFlow
1

•disclaimer: opinions are my own
•feel free to interrupt me if you have any questions during the presentation
•questions can be in Taiwanese, English, or Mandarin
•most of the TFRT materials are adapted from the TFRT deep dive at an MLIR open design meeting [1] and the TFRT docs [2]
•code around Aug 1, 2020 (git commit ecf1c20 [3])
[1] TFRT Deep Dive, slides and recording: https://mlir.llvm.org/talks/
[2] https://github.com/tensorflow/runtime/tree/master/documents
[3] https://github.com/tensorflow/runtime/commit/ecf1c20
2

who i am
•Used open source before the term “open source” was coined
•A software guy; learned to use Unix and open source software on a VAX-11/780 running 4.3BSD
•Used to be a programming language junkie
•Worked on various system software, e.g., CPU scheduling and power management of non-CPU components
•Recently, working on NN performance on edge devices and related stuff
•Contributed from time to time to TensorFlow Lite
•started a command-line label_image for TFLite
https://gunkies.org/w/images/c/c1/DEC-VAX-11-780.jpg
3

What is TFRT
•TensorFlow Runtime (TFRT) is one of two new MLIR-based runtimes that have emerged in 2020 so far.
•The other is the Intermediate Representation Execution Environment (IREE). So far, TFRT seems to have better design documentation
•Both of them have mobile/edge environments in mind.
•I haven’t seen mobile acceleration code in TFRT yet.
•IREE has some Vulkan-related code, and some simple code already works on Android
•ResNet GPU inference is 28% faster with TFRT
•https://github.com/tensorflow/runtime, https://youtu.be/15tiQoPpuZ8
4

Build it
•if you follow the instructions described in README.md, it should just work, at least on x86_64 Linux
•however, it’s not tested on non-Linux environments yet
•ssize_t and int64_t (see the sketch after this list)
•on Mac OS X: ssize_t is long, int64_t is long long
•the current code mixes the use of ssize_t and int64_t
•test: one of the acclaimed features of TFRT, like MLIR, is its use of LLVM FileCheck
•my hacks: shape-related (ssize_t) tests not fixed yet
•it’s not tested on non-x86 platforms, such as aarch64, either 
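A minimal sketch of the ssize_t/int64_t mismatch mentioned above (a hypothetical snippet, not TFRT code): on macOS both types are 64 bits wide, but int64_t is long long while ssize_t is long, so pointers to them are not interchangeable.

#include <cstdint>      // int64_t
#include <sys/types.h>  // ssize_t

// Hypothetical shape-filling API that takes int64_t*, as TFRT shape code does.
static void FillShape(int64_t* dims, int rank) {
  for (int i = 0; i < rank; ++i) dims[i] = 1;
}

void Caller() {
  ssize_t shape[4];
  // On macOS the direct call would not compile: long* does not convert to long long*.
  // FillShape(shape, 4);
  FillShape(reinterpret_cast<int64_t*>(shape), 4);  // compiles, but only hides the mismatch
}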


5

•The three key directories under the TFRT root directory are
•lib: Contains core TFRT infrastructure code
•backends: Contains device specific infrastructure and op/kernel implementations
•include: Contains public header files for core TFRT infrastructure
6

Walking thru the tutorial
•unfortunately, it seems it’s not easy to jump directly into the source code without some background knowledge
•so we’ll walk thru the tutorial [1]
•What’s in the tutorial:
•print hello world
•print integer
•adding kernels
[1] https://github.com/tensorflow/runtime/blob/master/documents/tutorial.md
7

using tfrt and tfrt_test
hello.mlir
func @hello() {
  %chain = tfrt.new.chain

  // Create a string containing "hello world" and store it in %hello.
  %hello = "tfrt_test.get_string"() { string_attr = "hello world" } : () -> !tfrt.string

  // Print the string in %hello.
  "tfrt_test.print_string"(%hello, %chain) : (!tfrt.string, !tfrt.chain) -> !tfrt.chain

  tfrt.return
}
The @hello function above shows how to create and print a string. The text after each ‘:’ specifies the types involved:
•() -> !tfrt.string means that tfrt_test.get_string takes no arguments and returns a !tfrt.string. tfrt is an MLIR dialect prefix (or namespace) for TFRT
•(!tfrt.string, !tfrt.chain) -> !tfrt.chain means that tfrt_test.print_string takes two arguments (!tfrt.string and !tfrt.chain) and returns a !tfrt.chain. A chain [1] is a TFRT abstraction to manage dependencies
[1] https://github.com/tensorflow/runtime/blob/master/documents/explicit_dependency.md
8

hello world in MLIR
func @stringconstant() -> !llvm<"[12 x i8]"> {
  %1 = llvm.constant("Hello world!") : !llvm<"[12 x i8]">
  // CHECK: ret [12 x i8] c"Hello world!"
  llvm.return %1 : !llvm<"[12 x i8]">
}

func @main() {
  %0 = llvm.constant(0) : !llvm.i64
  %1 = call @stringconstant() : () -> !llvm<"[12 x i8]">
  %2 = llvm.getelementptr %1[%0] : (!llvm<"[12 x i8]">, !llvm.i64) -> !llvm<"i8*">
  %3 = llvm.call @puts(%2) : (!llvm<"i8*">) -> !llvm.i32
  return
}

func @puts(!llvm<"i8*">) -> !llvm.i32
•the MLIR “standard dialect” doesn’t have I/O functions
•there is an LLVM dialect, though, so we can use it to call a standard libc function such as puts()
9

Hello integer
func @hello_integers() {
  %chain = tfrt.new.chain

  // Create an integer containing 42.
  %forty_two = tfrt.constant.i32 42

  // Print 42.
  tfrt.print.i32 %forty_two, %chain

  tfrt.return
}
•as stated in the tutorial, we can run other functions in the same module
•we can turn to more basic types, such as integers or floating point numbers
•@hello_integers shows how to create and print integers
•This example does not have the verbose type information we saw in @hello because there are custom parsers for the tfrt.constant.i32 and tfrt.print.i32 kernels in basic_kernels.td
10

basic_kernels.td
•.td files are LLVM TableGen descriptions [1]
[1] TableGen, https://llvm.org/docs/TableGen/
class ConstantOp<string suffix, Type baseType, Attr attr>
    : TFRT_Op<"constant." # suffix, [NoSideEffect]> {
  let summary = "host executor constant value constructor";

  let arguments = (ins attr:$value);
  let results = (outs baseType);
}

class PrintOp<string suffix, Type type> : TFRT_Op<"print." # suffix> {
  let summary = "tfrt.print operation";

  let description = [{
    An operation takes a number input and a chain input.
    It prints the number to stdout and returns a chain output.
    The chain input must be the second operand.

    Example:
      %2 = tfrt.print.i32 %0, %1
  }];

  let arguments = (ins type, TFRT_ChainType);
  let results = (outs TFRT_ChainType);
  let assemblyFormat = "operands attr-dict";
  let verifier = ?;
}
https://github.com/tensorflow/runtime/blob/master/include/tfrt/basic_kernels/opdefs/basic_kernels.td#L376-L390
https://github.com/tensorflow/runtime/blob/master/include/tfrt/basic_kernels/opdefs/basic_kernels.td#L58-L64
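The TableGen classes above only define the op syntax; the kernels behind tfrt.constant.i32 and tfrt.print.i32 are ordinary C++ functions registered under the same names. A rough sketch of what they might look like (simplified and hypothetical, not the exact code in lib/basic_kernels/integer_kernels.cc; header paths approximate):

#include <cstdint>
#include <cstdio>

#include "tfrt/host_context/chain.h"
#include "tfrt/host_context/kernel_registry.h"
#include "tfrt/host_context/kernel_utils.h"

namespace tfrt {

// Backs tfrt.constant.i32: reads the 'value' attribute and returns an i32.
static int32_t TFRTConstantI32(Attribute<int32_t> value) { return *value; }

// Backs tfrt.print.i32: takes an i32 and a chain, prints, and returns a chain.
static Chain TFRTPrintI32(int32_t value, Chain chain) {
  printf("int32 = %d\n", value);
  return chain;
}

// TFRT_KERNEL adapts a typed C++ function to the AsyncKernelFrame calling
// convention used by the executor.
void RegisterMyIntegerKernels(KernelRegistry* registry) {
  registry->AddKernel("tfrt.constant.i32", TFRT_KERNEL(TFRTConstantI32));
  registry->AddKernel("tfrt.print.i32", TFRT_KERNEL(TFRTPrintI32));
}

}  // namespace tfrt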
11

Define kernels
12

user defined kernels
func @print_coordinate() {
  %chain = tfrt.new.chain

  %two = tfrt.constant.i32 2
  %four = tfrt.constant.i32 4

  %coordinate = "my.create_coordinate"(%two, %four) : (i32, i32) -> !my.coordinate
  "my.print_coordinate"(%coordinate, %chain) : (!my.coordinate, !tfrt.chain) -> !tfrt.chain

  tfrt.return
}
coordinate.mlir shows several TFRT features:
•MLIR types that begin with an exclamation mark (!) are user-defined types like !my.coordinate, compared to built-in types like i32
•Kernels are just C++ functions with a name in MLIR: my.print_coordinate is the MLIR name for the C++ PrintCoordinate function (see the C++ sketch below)
•Kernels may pass arbitrary user-defined types: my.create_coordinate passes a custom Coordinate struct to my.print_coordinate
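On the C++ side, the tutorial implements these kernels as plain functions plus a registration call; a sketch along those lines (names, headers, and printing details approximate rather than copied verbatim):

#include <cstdint>
#include <cstdio>

#include "tfrt/host_context/chain.h"
#include "tfrt/host_context/kernel_registry.h"
#include "tfrt/host_context/kernel_utils.h"

namespace {

// The arbitrary user-defined type that flows between the two kernels.
struct Coordinate {
  int32_t x = 0;
  int32_t y = 0;
};

// Backs my.create_coordinate: (i32, i32) -> !my.coordinate.
Coordinate CreateCoordinate(int32_t x, int32_t y) { return Coordinate{x, y}; }

// Backs my.print_coordinate: (!my.coordinate, !tfrt.chain) -> !tfrt.chain.
tfrt::Chain PrintCoordinate(Coordinate coordinate, tfrt::Chain chain) {
  printf("(%d, %d)\n", coordinate.x, coordinate.y);
  return chain;
}

}  // namespace

// Hook the C++ functions up to their MLIR kernel names.
void RegisterCoordinateKernels(tfrt::KernelRegistry* registry) {
  registry->AddKernel("my.create_coordinate", TFRT_KERNEL(CreateCoordinate));
  registry->AddKernel("my.print_coordinate", TFRT_KERNEL(PrintCoordinate));
}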
13

to dig into some code we need more system information
14

Host Runtime
15

•TensorFlow user passes into TFRT a TensorFlow graph created via high-level TensorFlow APIs, and
•TFRT then calls the MLIR-based graph compiler to optimize and lower the graph into BEF, a Binary Executable Format for TFRT graph execution (MLIR is the compiler infrastructure that we use to represent TFRT host programs).
•The blue arrows in the simplified TensorFlow training stack diagram show this flow.
16

•In the README.md we are told to build two binaries: tfrt_translate and bef_executor
•tfrt_translate
•The tfrt_translate program does round-trip translation between MLIR and BEF, similar to an assembler and disassembler.
•bef_executor
•The bef_executor program is the execution driver of BEF files. It reads in a BEF file, sets up the runtime, and asynchronously executes function(s) in that file.
17

TFRT Host Runtime
•Foundation of TFRT: schedules work on the host and devices
•Clean separation between host and device runtimes:
•Host runtime does not know anything about devices, just their runtimes (sets of kernels)
•Key design points:
•Fully asynchronous: kernel executions cannot block
•Excellent error propagation in the presence of asynchrony
•Performance as a first-class concern, for graph and eager
•Outline:
•Common runtime infrastructure
•Graph execution
•Op-by-op execution (“eager”)
18

Key Abstraction: AsyncValue
•Container for data or resources
•Not Tensor specific
•A “future” type, fulfilled with exactly one value, or an error
•Lock-free, low memory overhead, type-erased, reference counted
•Helper class AsyncValueRef<T> provides type safety when the contained type is known
•AsyncValues enable efficient asynchronous compute
•Asynchronous functions return unavailable AsyncValues
•Caller can schedule dependent computations with AsyncValue::AndThen() (see the sketch below)
•Caller need not block until the AsyncValue becomes available
https://github.com/tensorflow/runtime/blob/master/include/tfrt/host_context/async_value.h
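A minimal sketch of the AndThen() pattern described above (simplified; the lambda capture style and header path are approximate rather than copied from TFRT):

#include "tfrt/host_context/async_value_ref.h"

// Schedule dependent work on an AsyncValueRef<int> without blocking the caller.
void ConsumeLater(tfrt::AsyncValueRef<int> result) {
  // The callback runs once `result` becomes available (or immediately if it
  // already is); errors propagate through the same AsyncValue.
  result.AndThen([result = result.CopyRef()] {
    if (result.IsError()) return;  // error handling elided in this sketch
    int value = result.get();
    (void)value;                   // use the concrete value here
  });
}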
19

Kernels
•Kernel: unit of computation scheduled by the runtime
•Similar to kernel concept in current TensorFlow
•Kernels accept AsyncValue inputs and produce AsyncValue outputs
•Runtime coordinates dataflow of AsyncValues between kernels
•Outputs may not be immediately available, unlike current TensorFlow
•Runtime generally does not understand kernel semantics
// Kernel that adds two integers.
// AsyncKernelFrame holds the kernel's arguments and results.
static void TFRTAdd(AsyncKernelFrame* frame) {
  // Fetch the kernel's 0th argument.
  AsyncValue* arg1 = frame->GetArgAt(0);
  // Fetch the kernel's 1st argument.
  AsyncValue* arg2 = frame->GetArgAt(1);

  int v1 = arg1->get<int>();
  int v2 = arg2->get<int>();

  // Set the kernel's 0th result.
  frame->EmplaceResultAt<int>(0, v1 + v2);
}
https://github.com/tensorflow/runtime/blob/master/documents/tfrt_host_runtime_design.md
https://github.com/tensorflow/runtime/blob/master/lib/basic_kernels/integer_kernels.cc#L39-L45
https://github.com/tensorflow/runtime/blob/master/include/tfrt/host_context/kernel_utils.h#L61-L149
20

Host Program
•Host programs encode a dataflow graph
•Similar to GraphDef in current TensorFlow
•Expressed in MLIR. Typically compiler generated
•Designed for low-level dispatch efficiency
•Designed for compiler transformations and analysis, e.g.,
•Use dataflow analysis for buffer reuse



func @sample_function() -> i32 {
  %one = tfrt.constant.i32 1        // Make AsyncValue with value 1
  %two = tfrt.constant.i32 2        // Make AsyncValue with value 2
  %three = tfrt.add.i32 %one, %two  // Make AsyncValue with value 3 (1+2)

  %ch0 = tfrt.new.chain
  tfrt.print.i32 %three, %ch0       // Print AsyncValue %three

  tfrt.return %three : i32          // Return AsyncValue %three
}
21

TFRT Binary Executable Format (BEF)
•BEF encodes a hardware-specific lowered graph function
•Primary interface between compiler and runtime
•Designed for efficient execution
•Low overhead: execute program by reading mmap'd byte array
•Persistent and stable: compile once offline, run many times online. Great for inference use-cases
•Composed of sections, similar to ELF. Each section has its own format
•Extensible: BEF is versioned, reader ignores unknown sections, new versions may define new sections
 https://github.com/tensorflow/runtime/blob/master/documents/binary_executable_format.md
22

BEF Executor
•BEF Executor evaluates a BEF dataflow graph “executor” style:
•Not a bytecode-like interpreter: no concept of program counter
•“Strict” execution by default: run a kernel only when all its inputs are available
•Executor features:
•Lock-free: atomics instead of mutexes
•Non-blocking: defer dependent work with AsyncValue::AndThen
•Supports “non-strict” execution: may run a kernel when some of its inputs are available
•Good for efficiently forwarding unavailable inputs to outputs
•Key concepts:
•BEF: dataflow graph
•Kernel: dataflow node
•AsyncValues: dataflow edge
https://github.com/tensorflow/runtime/blob/master/lib/bef_executor/bef_interpreter.cc#L223-L254
23

Host Runtime Summary
24

How about Core Runtime?
•Surely, we can do a similar walkthrough, but that would take more time
•Two things
•Op Execution API, Execute()
•BEF Executor can handle it too



void CoreRuntime::Impl::Execute(const ExecutionContext& exec_ctx,
                                string_view op_name, OpHandler* op_handler,
                                MutableArrayRef<TensorHandle> arguments,
                                const OpAttrsRef& attrs,
                                MutableArrayRef<TensorHandle> results,
                                AsyncValueRef<Chain>* chain) {
  // Ask the op_handler to execute the op. If successful, we're done.
  auto op_handle = op_handler->MakeOp(op_name);
  if (op_handle) {
    op_handle.get()(exec_ctx, arguments, attrs, results, chain);
    return;
  }

  // Otherwise, we fail with an 'unknown op' error.
  auto err =
      EmitErrorAsync(exec_ctx, "op '" + op_name.str() + "' is not supported");
  for (auto& result : results) result = TensorHandle(err.CopyRef());
  if (chain) *chain = std::move(err);
}
25
https://github.com/tensorflow/runtime/blob/master/lib/core_runtime/core_runtime.cc#L124-L143
https://github.com/tensorflow/runtime/blob/master/documents/tfrt_op_by_op_execution_design.md

BEF Executor for “op” graph
•corert.executeop
•sample



26

https://github.com/tensorflow/runtime/blob/master/lib/core_runtime/kernels.cc
func @example() -> !tfrt.chain {
  %cpu = corert.get_op_handler("cpu")

  // Create TensorHandles
  %lhs = corert.executeop(%cpu) "test.create_dense_tensor"() { shape = [1, 1], values = [-1.0 : f32] }
  %rhs = corert.executeop(%cpu) "test.create_dense_tensor"() { shape = [1, 1], values = [-2.0 : f32] }

  %result = corert.executeop(%cpu) "test.add"(%lhs, %rhs)

  %ch0 = tfrt.new.chain
  %ch1 = corert.print_tensorhandle(%result, %ch0)
  tfrt.return %ch1 : !tfrt.chain
}

A second variant of the same example:

func @example() -> !tfrt.chain {
  %ch0 = tfrt.new.chain
  %cpu = corert.get_op_handler %ch0 "cpu"

  // Create TensorHandles
  %lhs = corert.executeop(%cpu) "test.create_dense_tensor"() { shape = [1, 1], values = [-1.0 : f32] } : 1
  %rhs = corert.executeop(%cpu) "test.create_dense_tensor"() { shape = [1, 1], values = [-2.0 : f32] } : 1

  %result = corert.executeop(%cpu) "test.add"(%lhs, %rhs) : 1

  %ch1 = "corert.print_tensorhandle"(%result, %ch0) : (!corert.tensorhandle, !tfrt.chain) -> !tfrt.chain
  tfrt.return %ch1 : !tfrt.chain
}

Device Runtime
CPU
27
//===----------------------------------------------------------------------===//
// CPU Relu kernels
//===----------------------------------------------------------------------===//

// Computes B = Relu(A).
template <typename T>
static AsyncValueRef<Chain> Relu(const DenseHostTensor& A, DenseHostTensor* B,
                                 const ExecutionContext& exec_ctx) {
  auto fn = [](auto& a, auto& b) { return a.cwiseMax(static_cast<T>(0)); };
  return ::tfrt::compat::UnaryEigenKernelAsync<T, T>(A, B, std::move(fn),
                                                     exec_ctx);
}

//===----------------------------------------------------------------------===//
// CPU BiasAdd kernels
//===----------------------------------------------------------------------===//

// A special case of tf.add where bias is restricted to be 1-D.
// Currently only supports the NHWC data format.
template <typename T, size_t RANK>
static AsyncValueRef<Chain> BiasAdd(const DenseHostTensor& input,
                                    const DenseHostTensor& bias,
                                    DenseHostTensor* output,
                                    const ExecutionContext& exec_ctx) {
  DHTIndexableView<T, RANK> input_view(&input);
  MutableDHTIndexableView<T, RANK> output_view(output);
  DHTIndexableView<T, 1> bias_view(&bias);

  const auto& shape_input = input_view.FixedShape();
  const auto& shape_bias = bias_view.FixedShape();
  const auto& shape_output = output_view.FixedShape();

  if (shape_input != shape_output) {
    return EmitErrorAsync(exec_ctx, "unexpected output shape");
  }
  if (shape_bias[0] != shape_input[RANK - 1]) {
    return EmitErrorAsync(exec_ctx, "bias shape does not match input shape");
  }

  // Reshape bias to the shape of input. Broadcast along the last axis of input.
  Eigen::array<Eigen::Index, RANK> reshape_dims;
  Eigen::array<Eigen::Index, RANK> broadcast_dims;
  for (size_t i = 0; i < RANK - 1; ++i) {
    reshape_dims[i] = static_cast<Eigen::Index>(1);
    broadcast_dims[i] = static_cast<Eigen::Index>(shape_input[i]);
  }
  reshape_dims[RANK - 1] = static_cast<Eigen::Index>(shape_bias[0]);
  broadcast_dims[RANK - 1] = static_cast<Eigen::Index>(1);

  auto input_t = AsEigenConstTensor(input_view);
  auto bias_t = AsEigenConstTensor(bias_view);
  auto output_t = AsEigenTensor(output_view);
  auto expr = input_t + bias_t.reshape(reshape_dims).broadcast(broadcast_dims);

  return AsyncAssign(
      exec_ctx.host()->GetOrCreateSharedContext<EigenHostContext>(),
      std::move(output_t), std::move(expr),
      KeepBuffers::alive(&input, &bias, output));
}
https://github.com/tensorflow/runtime/blob/master/backends/cpu/lib/kernels/cpu_kernels.h

Dialects we can see now
•tfrt: we know what this is for
•tfrt_test: to test tfrt
•tfrt_data: tf.data, to deal with input pipeline
•tfrt_dht: dense host tensor
•corert: Core Runtime, eager execution
•ts: tensor shape
•coo: COOrdinate list sparse tensor
•eigen: wrapper around the eigen library
•btf: binary tensor format
•cuda: you know what cuda means :-)
28

Concluding Remarks
•MLIR related talks and publications, https://mlir.llvm.org/talks/
•We scratched the surface of TFRT host runtime and core runtime. There are more details
•threading model: thread pool / work queue,
•memory allocation: tcmalloc for server, other small allocators for embedded systems,
•non-strict execution, and
•registers: BEF executor is a register machine
•we didn’t touch other important components, such as device runtimes (esp. the GPU part) and the distributed environment
29

Fin
30

Device Runtime Design Principles
•A thin wrapper over low-level (driver) APIs, exposing device capabilities to the graph compiler
•Memory Allocation
•Async host <-> device transfer, and kernel execution
•Dependency management
•Focus on mechanism instead of policy
•E.g. No built-in special-purpose streams for GPU support:
•For pure eager execution, can default to one stream for everything
•For tf.function execution, compiler can pick streams
31