How WebAssembly can be used to optimize and accelerate Large Language Model inference in the cloud.
Added: May 19, 2024
Slides: 41 pages
Samy Fodil
Founder & CEO of Taubyte
WebAssembly is Key to Better LLM Performance
Taubyte
An open-source cloud platform on autopilot, where coding in a local environment equals scaling to global production.
Local Coding (Taubyte locally) → Global Production (Taubyte in prod)
https://tau.how
LLMs
Generative AI: INPUT → Large Language Models → OUTPUT
Proprietary / Open Source
✅ Train
✅ Host
Large Language Models
Lightweight
Sand-boxed
⚡ Easy to orchestrate
LLM Inference
WASM to the rescue
Lightweight
<2 MB (WASM module) vs. ~4 GB (typical container image)
Lightweight
⚡ Fast to provision
Cheap to cache
Closer to data (data gravity) & users (edge computing)
Sand-boxed
Secure
Interfaces
Which means networking, the filesystem, and more can be absent, restricted, virtualized, or mocked.
LLM Inference
You can use transformers directly from Python, but this pulls in PyTorch and other dependencies:
Huge PyTorch images
Complex dependencies
Locked GPU resources
LLM Inference
You can, for example, use onnx, llama.cpp, candle (which wraps llama.cpp), or TensorRT-LLM for a lower footprint. But this comes with a few challenges:
Locked GPU resources
Inference engines for LLMs are way harder to implement
Load balancing?
Routing?
With WASM
If we made LLMs, and AI in general, available through a set of common host calls, we could combine the benefits of:
Inference Engines
Orchestration
Caching
WebAssembly
Lightweight
Sand-boxed
Load balancing?
Routing?
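One way to picture the "common host calls" idea is as a small interface the WASM guest imports from the host. The names below (`LLMHost`, `LoadModel`, `Generate`) are illustrative assumptions, not taubyte's actual ABI; a minimal Go sketch with a mock host standing in for a real inference engine:

```go
package main

import "fmt"

// LLMHost is a hypothetical host-call surface a WASM guest could import
// for LLM inference; names are illustrative, not taubyte's actual ABI.
type LLMHost interface {
	// LoadModel asks the host to make a model available (possibly cached).
	LoadModel(name string) error
	// Generate runs inference on the host and returns the completion.
	Generate(model, prompt string) (string, error)
}

// mockHost stands in for a real backend (llama.cpp, onnx, TensorRT-LLM, ...).
type mockHost struct {
	loaded map[string]bool
}

func (m *mockHost) LoadModel(name string) error {
	m.loaded[name] = true
	return nil
}

func (m *mockHost) Generate(model, prompt string) (string, error) {
	if !m.loaded[model] {
		return "", fmt.Errorf("model %q not loaded", model)
	}
	// A real host would stream tokens; the mock just echoes the prompt.
	return "echo: " + prompt, nil
}

func main() {
	var host LLMHost = &mockHost{loaded: map[string]bool{}}
	host.LoadModel("llama3")
	out, _ := host.Generate("llama3", "hello")
	fmt.Println(out)
}
```

Because the guest only sees the interface, the same WASM module could run against any backend the host wires in, which is what makes orchestration and caching a host-side concern.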
github.com/taubyte/tau
Will provide:
Load balancing
Routing
Orchestration
It will also provide abstractions so that what is built locally works in production with no changes.
Implementation of 1️⃣
tau implements a protocol called `gateway` that determines which host is best suited to serve a request, based on:
WebAssembly caching and dependency module availability (including host modules)
Host Module resource availability
Host resource availability
Other constraints defined by the developer, like data gravity
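The selection step can be sketched as a weighted score over the criteria above. The field names and weights below are made up for illustration and are not tau's actual `gateway` implementation:

```go
package main

import "fmt"

// hostScore mirrors the criteria a gateway could weigh per candidate host;
// fields and weights are illustrative, not tau's actual implementation.
type hostScore struct {
	Name      string
	Cached    float64 // module/model already cached? (0..1)
	Resources float64 // host resource availability (0..1)
	Queue     float64 // 1 = empty queue, 0 = full
}

// pick returns the host with the highest combined score.
func pick(hosts []hostScore) string {
	best, bestScore := "", -1.0
	for _, h := range hosts {
		// Weighted sum; caching dominates because loading a multi-GB
		// model is far more expensive than waiting in a short queue.
		s := 0.5*h.Cached + 0.3*h.Resources + 0.2*h.Queue
		if s > bestScore {
			best, bestScore = h.Name, s
		}
	}
	return best
}

func main() {
	hosts := []hostScore{
		{Name: "node-a", Cached: 0, Resources: 0.9, Queue: 0.9},
		{Name: "node-b", Cached: 1, Resources: 0.5, Queue: 0.6},
	}
	// node-b wins despite fewer free resources: it already has the model.
	fmt.Println(pick(hosts))
}
```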
Diagram: a node running the Gateway protocol dispatches requests to Serving Nodes (Substrate); each Serving Node talks over a muxed tunnel (Orbit protocol) to a Satellite (e.g., LLM inference).
Host Module Resource
Availability
This is a yet-to-be-implemented feature that will ask the Host Module to provide basic metrics to the gateway:
Caching score. Example: is the particular model already loaded?
Resource availability score. Example: is there enough GPU memory to spin up the model?
Queue score. Example: if the Host Module uses queues, how full is the queue?
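A hypothetical shape for those three metrics, each normalized to 0..1 (all names and formulas below are assumptions for illustration, not the planned tau interface):

```go
package main

import "fmt"

// moduleState is an illustrative sketch of the state a host module might
// derive its gateway metrics from; not an actual tau interface.
type moduleState struct {
	loadedModels map[string]bool
	freeGPUMem   uint64 // bytes
	queueLen     int
	queueCap     int
}

// cachingScore: 1 if the requested model is already in memory.
func (s moduleState) cachingScore(model string) float64 {
	if s.loadedModels[model] {
		return 1
	}
	return 0
}

// resourceScore: is there enough GPU memory to spin up the model?
// Degrades linearly when memory falls short.
func (s moduleState) resourceScore(required uint64) float64 {
	if s.freeGPUMem >= required {
		return 1
	}
	return float64(s.freeGPUMem) / float64(required)
}

// queueScore: 1 = empty queue, 0 = full.
func (s moduleState) queueScore() float64 {
	if s.queueCap == 0 {
		return 1
	}
	return 1 - float64(s.queueLen)/float64(s.queueCap)
}

func main() {
	s := moduleState{
		loadedModels: map[string]bool{"llama3": true},
		freeGPUMem:   8 << 30, // 8 GiB free
		queueLen:     2,
		queueCap:     10,
	}
	fmt.Println(s.cachingScore("llama3"), s.resourceScore(16<<30), s.queueScore())
}
```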
github.com/taubyte/vm-orbit
Extending the WebAssembly runtime in a secure way.
Diagram: orbit proxies WASM calls, and access to module memory, to an external process.
Example
github.com/ollama-cloud
turning ollama into a satellite
Compile llama.cpp
Build plugin
Install dreamland
Start local Cloud
Attach plug-in
Login to Local Cloud
Create a project
Create a function
Call Generate
Stream Tokens
No SDK!
Check my previous project
github.com/samyfodil/taubyte-llama-satellite
Which actually has a nice SDK!
Trigger the function
In production
- Copy the ollama plugin to /tb/plugins
- Add it to the config
Your Application is Live!
taubyte/dllama
Ready for the cloud, with better backends.
Always local-friendly!