How WebAssembly can be used to optimize and accelerate Large Language Model inference in the cloud.
Added: May 19, 2024
Slides: 41 pages
Samy Fodil
Founder & CEO of Taubyte
WebAssembly is Key to Better LLM Performance
Taubyte
An open-source cloud platform on autopilot, where coding in a local environment equals scaling to global production.
Local Coding (Taubyte locally) → Global Production (Taubyte in prod)
https://tau.how
LLMs
Generative AI: INPUT → Large Language Models → OUTPUT
Proprietary / Open Source
✅ Train
✅ Host
Large Language Models
Lightweight
Sand-boxed
⚡ Easy to orchestrate
LLM Inference
WASM to the rescue
Lightweight
<2 MB (WASM module) vs. ~4 GB (typical container image)
Lightweight
⚡ Fast to provision
Cheap to cache
Closer to data (data gravity) & users (edge computing)
Sand-boxed
Secure
Interfaces
Which means networking, the filesystem, and more can be absent, restricted, virtualized, or mocked.
LLM Inference
You can use transformers directly from Python, but this pulls in PyTorch and other dependencies:
Huge PyTorch images
Complex dependencies
Locked GPU resources
LLM Inference
You can, for example, use onnx, llama.cpp, candle (which wraps llama.cpp), or TensorRT-LLM for a lower footprint. But this comes with a few challenges:
Locked GPU resources
Inference engines for LLMs are way harder to implement
Load balancing?
Routing?
With WASM
If we made LLMs, and AI in general, available through a set of common host calls, we could combine the benefits of:
Inference Engines
Orchestration
Caching
WebAssembly
Lightweight
Sand-boxed
Load balancing?
Routing?
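One way to picture the "common host calls" idea is as a small interface the WASM guest imports from the host. The names below (`LLMHost`, `LoadModel`, `Generate`) are illustrative assumptions, not taubyte's actual ABI; a minimal Go sketch with a mock host standing in for a real inference engine:

```go
package main

import "fmt"

// LLMHost is a hypothetical host-call surface a WASM guest could import
// for LLM inference; names are illustrative, not taubyte's actual ABI.
type LLMHost interface {
	// LoadModel asks the host to make a model available (possibly cached).
	LoadModel(name string) error
	// Generate runs inference on the host and returns the completion.
	Generate(model, prompt string) (string, error)
}

// mockHost stands in for a real backend (llama.cpp, onnx, TensorRT-LLM, ...).
type mockHost struct {
	loaded map[string]bool
}

func (m *mockHost) LoadModel(name string) error {
	m.loaded[name] = true
	return nil
}

func (m *mockHost) Generate(model, prompt string) (string, error) {
	if !m.loaded[model] {
		return "", fmt.Errorf("model %q not loaded", model)
	}
	// A real host would stream tokens; the mock just echoes the prompt.
	return "echo: " + prompt, nil
}

func main() {
	var host LLMHost = &mockHost{loaded: map[string]bool{}}
	host.LoadModel("llama3")
	out, _ := host.Generate("llama3", "hello")
	fmt.Println(out)
}
```

Because the guest only sees the interface, the same WASM module could run against any backend the host wires in, which is what makes orchestration and caching a host-side concern.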
github.com/taubyte/tau
Will provide:
Load balancing
Routing
Orchestration
It will also provide abstractions so that what is built locally works in production with no changes.
Implementation of 1️⃣
tau implements a protocol called `gateway` that determines which host is best suited to serve a request, based on:
WebAssembly caching and dependency module availability (including host modules)
Host Module resource availability
Host resource availability
Other constraints defined by the developer, like data gravity
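The selection step can be sketched as a weighted score over the criteria above. The field names and weights below are made up for illustration and are not tau's actual `gateway` implementation:

```go
package main

import "fmt"

// hostScore mirrors the criteria a gateway could weigh per candidate host;
// fields and weights are illustrative, not tau's actual implementation.
type hostScore struct {
	Name      string
	Cached    float64 // module/model already cached? (0..1)
	Resources float64 // host resource availability (0..1)
	Queue     float64 // 1 = empty queue, 0 = full
}

// pick returns the host with the highest combined score.
func pick(hosts []hostScore) string {
	best, bestScore := "", -1.0
	for _, h := range hosts {
		// Weighted sum; caching dominates because loading a multi-GB
		// model is far more expensive than waiting in a short queue.
		s := 0.5*h.Cached + 0.3*h.Resources + 0.2*h.Queue
		if s > bestScore {
			best, bestScore = h.Name, s
		}
	}
	return best
}

func main() {
	hosts := []hostScore{
		{Name: "node-a", Cached: 0, Resources: 0.9, Queue: 0.9},
		{Name: "node-b", Cached: 1, Resources: 0.5, Queue: 0.6},
	}
	// node-b wins despite fewer free resources: it already has the model.
	fmt.Println(pick(hosts))
}
```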
Diagram: a node running the Gateway protocol dispatches requests to Serving Nodes (Substrate); each Serving Node talks over a muxed tunnel (Orbit protocol) to a Satellite (e.g., LLM inference).
Host Module Resource
Availability
This is a yet-to-be-implemented feature that will ask the Host Module to provide basic metrics to the gateway:
Caching score. Example: is the particular model already loaded?
Resource availability score. Example: is there enough GPU memory to spin up the model?
Queue score. Example: if the Host Module uses queues, how full is the queue?
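A hypothetical shape for those three metrics, each normalized to 0..1 (all names and formulas below are assumptions for illustration, not the planned tau interface):

```go
package main

import "fmt"

// moduleState is an illustrative sketch of the state a host module might
// derive its gateway metrics from; not an actual tau interface.
type moduleState struct {
	loadedModels map[string]bool
	freeGPUMem   uint64 // bytes
	queueLen     int
	queueCap     int
}

// cachingScore: 1 if the requested model is already in memory.
func (s moduleState) cachingScore(model string) float64 {
	if s.loadedModels[model] {
		return 1
	}
	return 0
}

// resourceScore: is there enough GPU memory to spin up the model?
// Degrades linearly when memory falls short.
func (s moduleState) resourceScore(required uint64) float64 {
	if s.freeGPUMem >= required {
		return 1
	}
	return float64(s.freeGPUMem) / float64(required)
}

// queueScore: 1 = empty queue, 0 = full.
func (s moduleState) queueScore() float64 {
	if s.queueCap == 0 {
		return 1
	}
	return 1 - float64(s.queueLen)/float64(s.queueCap)
}

func main() {
	s := moduleState{
		loadedModels: map[string]bool{"llama3": true},
		freeGPUMem:   8 << 30, // 8 GiB free
		queueLen:     2,
		queueCap:     10,
	}
	fmt.Println(s.cachingScore("llama3"), s.resourceScore(16<<30), s.queueScore())
}
```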
github.com/taubyte/vm-orbit
Extending the WebAssembly runtime in a secure way.
Diagram: orbit proxies WASM calls, and access to module memory, to an external process.
Example
github.com/ollama-cloud
turning ollama into a satellite
Compile llama.cpp
Build plugin
Install dreamland
Start local Cloud
Attach plug-in
Login to Local Cloud
Create a project
Create a function
Call Generate
Stream Tokens
No SDK!
Check my previous project
github.com/samyfodil/taubyte-llama-satellite
Which actually has a nice SDK!
Trigger the function
In production
- Copy the ollama plugin to /tb/plugins
- Add it to the config
Your Application is Live!
taubyte/dllama
Ready for the cloud, with better backends.
Always local-friendly!