Everything You Need to Know About Running LLMs Locally
About This Presentation
Presented at All Things Open 2025
Presented by Cedric Clyburn - Red Hat
Title: Everything You Need to Know About Running LLMs Locally
Abstract: As large language models (LLMs) become more accessible, running them locally unlocks exciting opportunities for developers, engineers, and privacy-focused users. Why rely on costly cloud AI services that share your data when you could deploy your own models tailored to your needs? In this session, we’ll dive into the advantages of local LLM deployment, from selecting the right open source model to optimizing performance on consumer hardware and integrating with your unique data.
Let’s explore the journey to your own local stack for AI, and cover the important technical details such as model quantization, API integrations with IDE code assistants, and advanced methods like Retrieval-Augmented Generation (RAG) to connect your LLM to private data sources. Don’t miss out on the fun live demos that prove the bright future of open source AI is already here!
Find more info about All Things Open:
On the web: https://www.allthingsopen.org/
Twitter: https://twitter.com/AllThingsOpen
LinkedIn: https://www.linkedin.com/company/all-things-open/
Instagram: https://www.instagram.com/allthingsopen/
Facebook: https://www.facebook.com/AllThingsOpen
Mastodon: https://mastodon.social/@allthingsopen
Threads: https://www.threads.net/@allthingsopen
Bluesky: https://bsky.app/profile/allthingsopen.bsky.social
YouTube: https://www.youtube.com/@allthingsopen
2025 conference: https://2025.allthingsopen.org/
Slide Content
Everything You Need to Know About Running LLMs Locally
All Things Open: 2025
Cedric Clyburn
Senior Developer Advocate
@cedricclyburn
We’ve got a lot to cover today!
Wait, so you can run your own language models… completely local?
And there are plenty of open source tools to do so?!
But there are over 2 million models. Which to pick?
Or how can I use my own PDFs, codebase, or APIs?
Today’s Schedule
Agenda
▸ Running your own AI & LLMs
▸ How to choose the right model?
▸ Integrating your data & codebase!
Demos
▸ Demo #1: Model serving & RAG
▸ Demo #2: Code assistance
▸ Demo #3: Adding AI features to apps (Agentic AI)
Session Slides
red.ht/local-llm
Why’s everyone running their own AI models?
Why run a model locally?
For Developers
▸ Convenience & Simplicity: developers stay in a familiar environment and keep their “local developer experience,” particularly for testing and debugging.
▸ Direct Access to Hardware
▸ Ease of Integration: simplifies integrating the model with existing systems and applications that are already running locally.
For Organizations
▸ Data Privacy and Security: data is the fuel for AI and a differentiating factor (quality, quantity, qualification). Keeping data on-premises ensures sensitive information doesn’t leave the local environment → crucial for privacy-sensitive applications.
▸ Cost Control: while there is an initial investment in hardware and setup, running locally can reduce the ongoing costs of cloud computing services and ease the vendor lock-in imposed by Amazon, Microsoft, and Google.
▸ Regulatory Compliance: some industries have strict regulations about where and how data is processed.
▸ Customization & Control: take advantage of total AI customization and control; easily train or fine-tune your own model from the convenience of the developer’s local machine.
But the stack can be a bit overwhelming!
2024 MAD (Machine Learning, Artificial Intelligence & Data) Landscape
Average developer trying to download & manage models, configure serving runtimes, quantize and
compress LLMs, ensure correct prompt templates… (Colorized, 2023)
So, what open source
tech can help us run AI?
Fortunately, there’s a lot… for every use case!
LLM Tools
Tool #1: Ollama (for simple model downloading & serving)
▸ Simple CLI: “Docker”-style tool for running LLMs locally, offline, and privately
▸ Extensible: basic model customization (Modelfile) and importing of fine-tuned LLMs
▸ Lightweight: efficient and resource-friendly
▸ Easy API: an API for both inferencing and Ollama itself (e.g., downloading models)
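As a quick illustration (mine, not from the slides), here is what hitting Ollama’s local REST API can look like, assuming Ollama is installed and a model has been pulled (e.g. `ollama pull llama3.2`):

```python
# Minimal sketch: query Ollama's REST API on its default port (11434).
import json
import urllib.request

payload = {
    "model": "llama3.2",  # any model you've pulled locally
    "prompt": "Why run an LLM locally? Answer in one sentence.",
    "stream": False,      # return one JSON object instead of a token stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```

Leaving `stream` at its default instead returns tokens incrementally, which is how the interactive CLI behaves.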
LLM Tools
Tool #2: vLLM (for scaling things up in production environments)
▸ Research-Based: UC Berkeley project to improve model speeds and GPU consumption
▸ Standardized: works with Hugging Face & the OpenAI API
▸ Versatile: supports NVIDIA, AMD, Intel, TPUs & more
▸ Scalable: manages multiple requests efficiently, e.g., with Kubernetes as an LLM runtime
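To illustrate the “works with the OpenAI API” point, here is a minimal sketch (my example, not the talk’s demo). It assumes you started the server with something like `vllm serve ibm-granite/granite-3.0-8b-instruct` (default port 8000) and installed the client via `pip install openai`:

```python
# Minimal sketch: talk to vLLM's OpenAI-compatible endpoint with the openai client.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible server
    api_key="not-needed-for-local",       # ignored unless the server sets --api-key
)
resp = client.chat.completions.create(
    model="ibm-granite/granite-3.0-8b-instruct",  # whichever model you served
    messages=[{"role": "user", "content": "Summarize why local LLMs matter."}],
)
print(resp.choices[0].message.content)
```

Because the endpoint is OpenAI-compatible, existing apps can be pointed at it just by changing the base URL.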
LLM Tools
Tool #3: RamaLama (to make AI boring by using containers)
▸ AI in Containers: run models with Podman/Docker with no config needed
▸ Registry Agnostic: freedom to pull models from Hugging Face, Ollama, or OCI registries
▸ GPU Optimized: auto-detects hardware & accelerates performance
▸ Flexible: supports llama.cpp, vLLM, whisper.cpp & more
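A hedged sketch of what “registry agnostic” can look like in practice. The CLI transports below follow RamaLama’s documentation; the serve port (8080) and endpoint path are my assumptions, on the basis that RamaLama fronts llama.cpp or vLLM, both of which expose an OpenAI-compatible API:

```python
# Assumed setup (RamaLama CLI; placeholders, not verified commands for your model):
#   ramalama pull ollama://<model>           # pull from the Ollama registry
#   ramalama pull huggingface://<org>/<model>  # or from Hugging Face
#   ramalama serve <model>                   # serve it in a container
import json
import urllib.request

payload = {
    "model": "granite",  # the name you served; local servers often ignore/alias this
    "messages": [{"role": "user", "content": "Say hello from a container!"}],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # port and path assumed, see above
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```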
LLM Tools
Tool #4: Podman AI Lab (for developers looking to build AI features)
▸ For App Builders: choose from various recipes like RAG, agents, and summarizers
▸ Curated Models: easily access Apache 2.0 open source options
▸ Container Native: easy app integration and movement from local to production
▸ Interactive Playgrounds: test & optimize models with your custom prompts and data
Cool! But what specific model should I be using?
There are plenty of open and closed model choices!
Darn, we’re back to here again!
But again, we’ll use another video game analogy!
Model Selection
So, which model should you select? Well… it depends!
▸ It depends on the use case that you want to tackle.
▸ DeepSeek models excel in reasoning tasks and complex problem-solving.
▸ Granite SLM models perform well in various NLP tasks and multimodal applications.
▸ Mistral and Llama are particularly strong in summarization and sentiment analysis.
Model Selection
But not all models are the same! Our data isn’t always in one format: it’s text, images, audio, etc.
Unimodal (text OR image): text-to-text, text-to-image, image-to-text, image-to-image, text-to-code
✓ Single data input
✓ Fewer resources
✓ Single modality
✓ Limited depth and accuracy
Multimodal (text, image, audio, video): any-to-any
✓ Multiple data inputs
✓ More resources
✓ Multiple modalities
✓ Better understanding and accuracy
Model Selection
Also! There’s a naming convention, kind of like how our apps are compiled for various architectures!
ibm-granite/granite-3.0-8b-base
・granite: family name
・3.0: model architecture and version
・8b: number of parameters
・base: model fine-tuned to be a baseline
Mixtral-8x7B-Instruct-v0.1
・Mixtral: family name
・8x7B: architecture type and number of parameters
・Instruct: model fine-tuned for instructive tasks
・v0.1: model version
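To make the convention concrete, here is a toy parser (my illustration; real model names on Hugging Face vary far more than this regex allows):

```python
# Toy parser for names shaped like "granite-3.0-8b-base" (illustrative only).
import re

pattern = re.compile(
    r"(?P<family>[A-Za-z]+)-(?P<version>[\d.]+)-(?P<params>\d+b)-(?P<variant>\w+)"
)
print(pattern.match("granite-3.0-8b-base").groupdict())
# -> {'family': 'granite', 'version': '3.0', 'params': '8b', 'variant': 'base'}
```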
Model Selection
How to deploy a larger model? For example, DeepSeek-R1 or Llama 3.1-405B.
Let’s say you want the best benchmarks with a frontier model… but what about model size? Neither of these situations is ideal :)
Model Selection
Well, most models for local usage are quantized! It’s a way to compress models; think of it like a .zip or .tar.
▸ Quantization: a technique to compress LLMs by reducing numerical precision.
▸ Converts high-precision weights (FP32) into lower-bit formats (FP16, INT8, INT4).
▸ Reduces memory footprint, making models easier to deploy.
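The memory savings are easy to estimate: weight memory is roughly parameter count times bytes per parameter. A quick back-of-the-envelope sketch (weights only; real deployments also need room for the KV cache and activations):

```python
# Rough weight-memory math for an 8B-parameter model at various precisions.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

params = 8e9  # e.g., an 8B model like granite-3.0-8b
for fmt, nbytes in BYTES_PER_PARAM.items():
    gib = params * nbytes / 2**30
    print(f"{fmt}: ~{gib:.1f} GiB just for weights")
# FP32: ~29.8 GiB, FP16: ~14.9 GiB, INT8: ~7.5 GiB, INT4: ~3.7 GiB
```

This is why an INT4 quantization of an 8B model fits comfortably on a laptop while the FP32 original does not.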
▸ The benefit? Run LLMs on “any” device: not just your local machine, but IoT & edge too.
▸ Results in faster and lighter models that still maintain reasonable accuracy.
・In testing with Llama 3.1, the W4A16-INT scheme delivered a 2.4x performance speedup and 3.5x model size compression.
▸ Works on GPUs & CPUs!
Source: https://neuralmagic.com/blog/we-ran-over-half-a-million-evaluations-on-quantized-llms-heres-what-we-found
Model Selection
& there’s an open repository of quantized models! Check it out on Hugging Face & save resources on LLM serving.
Broad collection, comprehensive validation, extensive selection: ready-to-deploy, inference-optimized checkpoints that can cut GPU costs in half.
・Model families: Llama, Qwen, Mistral, DeepSeek, Gemma, Phi, Molmo, Granite, Nemotron
・Formats: W4/8A16, W8A8-INT8, W8A8-FP8, 2:4 sparse
・Algorithms: GPTQ / AWQ, SmoothQuant, SparseGPT, RTN
・Hardware: Instinct GPUs, CPUs, TPUs
AI Engine? Check ✔
AI Model? Check ✔
What about your data?
AI + Your Data
How can you integrate AI with your unique data? Fortunately, many tools exist for this too!
Data Interfaces
Pull documents (PDFs), web results, and agents together. Ask a question of a PDF & receive citations!
Tools: AnythingLLM, OpenWebUI, LM Studio
Code Assistance
Use a model as a pair programmer to generate and explain your codebase. No more copy/pasting; it’s part of the IDE!
Tools: Continue, Cody, Cursor, Windsurf
Prompting & Building Apps
Experiment with data, build proofs of concept, and integrate AI into apps. Starting points for common AI apps!
Tools: Podman AI Lab, Docker Gen AI Stack
Demo Time!
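To make the “ask a question of a PDF” idea concrete, here is a deliberately tiny RAG sketch (my toy example, not the talk’s demo code): retrieval is naive word overlap standing in for real embeddings and a vector store, and the generation call assumes an Ollama server on its default port.

```python
# Toy RAG: retrieve the most relevant chunk, then stuff it into the prompt.
import json
import urllib.request

chunks = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support hours are 9am-5pm EST, Monday through Friday.",
    "Enterprise plans include a dedicated account manager.",
]

def retrieve(question: str) -> str:
    """Score each chunk by words shared with the question; return the best one."""
    q = set(question.lower().split())
    return max(chunks, key=lambda c: len(q & set(c.lower().split())))

question = "What is the refund policy?"
context = retrieve(question)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"

payload = {"model": "llama3.2", "prompt": prompt, "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",  # Ollama's default port assumed
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```

Production RAG swaps the `retrieve` function for embedding-similarity search over chunked documents, which is exactly what tools like AnythingLLM and OpenWebUI handle for you.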
Demo Time!
Adding AI features to
apps (aka Agentic AI)
Code Repository URL
red.ht/local-llm-demo
Demo Repository:
https://github.com/rh-aiservices-bu/deploy-local-llms-talk-demo
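For a feel of the agentic pattern, here is a toy sketch (my illustration, not code from the demo repository): the model picks a tool and its arguments as JSON, the app runs the tool, and the result is fed back for a final answer. It assumes a local Ollama server; `get_weather` is a made-up stand-in tool.

```python
# Toy agent loop: model chooses a tool as JSON -> app runs it -> model answers.
import json
import urllib.request

def get_weather(city: str) -> str:
    """Hypothetical stand-in tool; a real app would call an actual API."""
    return f"It is sunny in {city}."

TOOLS = {"get_weather": get_weather}

def ask(prompt: str, as_json: bool = False) -> str:
    """Query a local Ollama server (default port assumed)."""
    payload = {"model": "llama3.2", "prompt": prompt, "stream": False}
    if as_json:
        payload["format"] = "json"  # ask Ollama to constrain output to valid JSON
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

question = "What's the weather in Raleigh?"
plan = json.loads(ask(
    'Reply only with JSON like {"tool": "get_weather", "args": {"city": "..."}} '
    f"to answer: {question}",
    as_json=True,
))
result = TOOLS[plan["tool"]](**plan["args"])  # run the tool the model chose
print(ask(f"Tool result: {result}\nNow answer the user's question: {question}"))
```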
Thank you! You’re awesome!
Session Slides
red.ht/local-llm
▸ Running your own AI & LLMs
▸ How to choose the right model?
▸ Integrating your data & codebase!
Feel free to connect on LinkedIn!
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHat
Join the DevNation
Red Hat Developer serves the builders. The problem solvers who
create careers with code. Let’s keep in touch!
●Join Red Hat Developer at developers.redhat.com/register
●Follow us on any of our social channels
●Visit dn.dev/upcoming for a schedule of our upcoming events
Red Hat Developer
Build here. Go anywhere.
Thank you