Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Application Development by Ettikan Karuppiah, NVIDIA

APIdays_official · May 02, 2024

About This Presentation

Scalable LLM APIs for AI and Generative AI Application Development
Ettikan Karuppiah, Director/Technologist - NVIDIA

Apidays Singapore 2024: Connecting Customers, Business and Technology (April 17 & 18, 2024)

------

Check out our conferences at https://www.apidays.global/



Slide Content

Scalable LLM APIs for AI and Generative AI Application Development
Ettikan Kandasamy Karuppiah (Ph.D)
Director/Technologist, NVIDIA ROAP Region

Generative AI Can Learn and Understand Everything
[Diagram: input modalities (text, sound, image, video, speech, multi-modal, amino acid, brainwaves) feed a model that learns and understands everything and generates text, image, video, speech, 3D, animation, manipulation, and protein outputs.]
Example prompts:
"An adorable cat in 3D confidently riding a flying, rocket-powered bike, adorned with a sleek black leather jacket."
"A close shot of a cat in a futuristic space suit confidently operating controls in the cockpit of a sci-fi spaceship. The cockpit has lots of lights and holographic screens with data in cool colors. The spaceship is traveling through a warm, colorful nebula. Shot on 35mm, vivid colors."

Generative AI Adoption Across Industries
• Finance: Fraud Detection, Personalized Banking, Investment Insights
• Healthcare: Molecule Simulation, Drug Discovery, Clinical Trial Data Analysis
• Retail: Personalized Shopping, Automated Catalog Descriptions, Automatic Price Optimization
• Telecommunications: AI Virtual Assistants, Network Performance Tuning, Remote Support Capabilities
• Media & Entertainment: Character Development, Video Editing & Image Creation, Style Augmentation
• Manufacturing: Factory Simulation, Product Design, Predictive Maintenance
• Federal: Document Summarization, Audit Compliance, AI Virtual Assistants
• Energy: Knowledge Base Q&A, Predictive Maintenance, Customer Service

Enterprises Are on the Generative AI Journey
2022 (Explosion): ChatGPT was announced in late 2022, gaining over 100 million users in just two months. Users of all levels can experience AI and feel the benefits firsthand.
2023 (Experimentation): Enterprise application developers kick off POCs for generative AI applications with API services and open models including Llama 2, Mistral, NVIDIA, and others.
2024 (Production): Organizations have set aside budget and are ramping up efforts to build accelerated infrastructure to support generative AI in production.

Enterprises Face Challenges Experimenting with Generative AI
Organizations must choose between ease of use and control.
Managed Generative AI Services:
• Easy-to-use APIs for development
• Fast path to getting started with AI
• Data and prompts are shared externally
• Infrastructure limited to the managed environment
• Limited control over the overall generative AI strategy
Open-Source Deployment:
• Enterprise-controlled environment: securely manage data in a self-hosted environment
• Run anywhere across data center and cloud
• Ongoing maintenance and updates
• Tuning required for different infrastructure
• Custom code for APIs and fine-tuned models

Experience and Run Enterprise Generative AI Models Anywhere
Use the NVIDIA API catalog to get access to NVIDIA NIM: experience models, prototype with APIs, then deploy with NIM.
Performance optimized, with enterprise support, security, and data privacy.
Example domains: Drug Discovery, Visual Content, Text Summarization, Speech Generation.
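The "prototype with APIs" step maps onto ordinary REST calls. Below is a minimal sketch of querying a catalog model from Python; the OpenAI-compatible endpoint URL, the model id, and the NVIDIA_API_KEY environment variable are illustrative assumptions, not details taken from the slides.

```python
# Minimal prototyping sketch against the NVIDIA API catalog.
# Assumptions (not stated on the slides): an OpenAI-compatible endpoint at
# https://integrate.api.nvidia.com/v1, the model id "meta/llama2-70b",
# and an API key stored in the NVIDIA_API_KEY environment variable.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed catalog endpoint
    api_key=os.environ["NVIDIA_API_KEY"],
)

completion = client.chat.completions.create(
    model="meta/llama2-70b",  # illustrative model id
    messages=[{"role": "user", "content": "Summarize what NVIDIA NIM provides."}],
    temperature=0.2,
    max_tokens=256,
)
print(completion.choices[0].message.content)
```

Because the request shape is the standard chat-completions schema, the same client code can later point at a self-hosted deployment by changing only the base URL.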

NVIDIA NIM
Components of the NIM stack:
• Industry-standard APIs: text, speech, image, video, 3D, biology
• Customization cache: p-tuning, LoRA, model weights
• Optimized model: single GPU, multi-GPU, multi-node
• Triton Inference Server: cuDF, CV-CUDA, DALI, NCCL, post-processing decoder
• TensorRT and TensorRT-LLM: cuBLAS, cuDNN, in-flight batching, memory optimization, FP8 quantization
• Cloud-native stack: GPU Operator, Network Operator, Kubernetes
• Enterprise management: health check, identity, metrics, monitoring, secrets management
• NVIDIA CUDA: 100s of millions of CUDA GPUs installed base

NVIDIA NIM Optimized Inference Microservices
Accelerated runtime for generative AI
• Simplified development of AI applications that can run in enterprise environments
• Day 0 support for all generative AI models, providing choice across the ecosystem
• Best accuracy for enterprises by enabling tuning with proprietary data sources
• Improved TCO with best latency and throughput running on accelerated infrastructure
• Enterprise software with feature branches, validation, and support
• Deploy anywhere and maintain control of generative AI applications and data
[Diagram: NVIDIA NIM packages optimized inference engines, domain-specific code, support for custom models, and industry-standard APIs as a prebuilt container and Helm chart, deployable on DGX and DGX Cloud.]

NVIDIA NIM is the Fastest Path to AI Inference
Reduces engineering resources required to deploy optimized, accelerated models (NVIDIA NIM vs. Triton + TRT-LLM open source; example models: Llama 2, Nemotron).
• Deployment Time | NIM: 5 minutes | Open source: ~1 week
• API Standardization | NIM: industry-standard protocols (OpenAI for LLMs; Google Translate/Speech), as in the sketch below | Open source: user creates a shim layer (reducing performance) or modifies Triton to generate custom endpoints
• Pre-Built Engine | NIM: pre-built TRT-LLM engines for NVIDIA and community models | Open source: user converts the checkpoint to TRT-LLM format, then runs sweeps through different parameters to find the optimal config
• Triton Ensemble / BLS Backend | NIM: pre-built with TRT-LLM to handle pre/post processing (tokenization) | Open source: user manually sets up and configures
• Triton Deployment | NIM: automated | Open source: user manually sets up and configures
• Customization | NIM: supported (p-tuning and LoRA, more planned) | Open source: user needs to create custom logic
• Container Validation | NIM: pre-validated with QA testing | Open source: no pre-validation
• Support | NIM: NVIDIA AI Enterprise security, CVE scanning/patching, and tech support | Open source: no enterprise support
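To make the API-standardization row concrete: a self-hosted NIM can be queried with the same OpenAI-style chat-completions schema as the hosted catalog. The hedged sketch below assumes a container listening on localhost port 8000 and an illustrative model name; both are assumptions about a particular deployment, not details from the slides.

```python
# Hedged sketch: querying a self-hosted NIM through its OpenAI-style API.
# Assumptions: the container serves chat completions on localhost:8000 and the
# deployed model is registered as "meta/llama2-70b"; adjust both for a real setup.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed local NIM endpoint
    json={
        "model": "meta/llama2-70b",
        "messages": [{"role": "user", "content": "What is in-flight batching?"}],
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])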

Inference Microservices for Generative AI
NVIDIA NIM is the fastest way to deploy AI models on accelerated infrastructure across cloud, data center, and PC
MIXTRAL 8x7B, VISTA-3D, DIFFDOCK, GEMMA 7B, FUYU, AI GENERATOR, KOSMOS 2, AUDIO2FACE, ESM FOLD, MolMIM, NEMO RETRIEVER, 3D GENERATOR
NVIDIA API Catalog


NVIDIA NIM for Every Domain
• Language NIMs: Code Llama 70B, Nemotron-3 22B Persona, Gemma 7B, Llama 2 70B, Mistral 7B, Mixtral 8x7B, Adept 110B, Jamba, Cohere 35B, Phi-2
• Visual / Multimodal NIMs: Deplot, Edify (Getty), Edify (Shutterstock), FuYu 8B/55B, Kosmos-2, NeVA, SDXL 1.0, SDXL Turbo
• Digital Human NIMs: Audio2Face, Riva ASR
• Optimization / Simulation NIMs: cuOpt, Earth-2
• Application NIMs: Llama Guard, Retrieval Embedding, Retrieval Reranking
• Digital Biology NIMs: DeepVariant, DiffDock, ESMFold, MolMIM, Vista 3D

Enterprise-Ready Generative AI with RAG and NVIDIA NIM
NVIDIA AI Enterprise eases the journey from pilot to production, from development to deployment.
https://www.nvidia.com/en-us/ai-data-science/ai-workflows/generative-ai-chatbots/

Gen AI for Technician Support
Information retrieval for technical documents
• Two-volume "Aviation Maintenance Technician Handbook—Airframe" FAA manual (~1,200 pages in total)
NeMo LLM models
• 43B model (4K token limit)
• 22B model (16K token limit)
Features
• Ingest (embed) large documents into a vector database for semantic search (see the sketch below)
• Cite sources in retrieved answers
• Extract and cite images and captions*
• Guardrails to mitigate hallucination
* on the product roadmap
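The ingestion and semantic-search features above can be sketched in a few lines. This is a minimal, hedged illustration rather than the workflow's actual implementation: pypdf, sentence-transformers, and an in-memory cosine-similarity index stand in for the NeMo embedding model and vector database, and the manual's file name is hypothetical.

```python
# Minimal RAG ingestion/retrieval sketch for the technician-support use case.
# Assumptions (not from the slides): sentence-transformers for embeddings and a
# simple in-memory cosine-similarity index in place of the production vector DB.
import numpy as np
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

def load_chunks(pdf_path, chunk_chars=1000):
    """Split the manual into text chunks, keeping page numbers for citations."""
    reader = PdfReader(pdf_path)
    chunks = []
    for page_no, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        for start in range(0, len(text), chunk_chars):
            chunks.append({"page": page_no, "text": text[start:start + chunk_chars]})
    return chunks

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

def build_index(chunks):
    vectors = embedder.encode([c["text"] for c in chunks], normalize_embeddings=True)
    return np.asarray(vectors)

def retrieve(query, chunks, index, k=3):
    """Semantic search: return the top-k chunks with page citations."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return [(chunks[i]["page"], chunks[i]["text"]) for i in top]

# chunks = load_chunks("amt_airframe_vol1.pdf")   # hypothetical local copy of the manual
# index = build_index(chunks)
# for page, passage in retrieve("How do I inspect a control cable?", chunks, index):
#     print(f"[p.{page}] {passage[:120]}...")
```

The retrieved passages, with their page numbers, are then placed into the LLM prompt so the answer can cite its sources.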

Gen AI for Document Summarization
Document Summarization
• LLMs excel at understanding and synthesizing text.
• Given a set of documents, LLMs can summarize the text; for example, a 183-page NIST publication like the one presented below.
Some NeMo LLM models of interest
• 43B model (4K token limit)
• 22B model (16K token limit)
Features
• Can be fine-tuned on a custom dataset (various ways of adaptation: p-tuning, LoRA, adapters, etc.)
• Can handle arbitrary-length documents (see the chunked sketch below)
• More features to come soon
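A 183-page document will not fit a 4K-token context window, so long inputs are typically summarized chunk by chunk and the partial summaries merged. The sketch below shows that map-reduce pattern under the same assumptions as the earlier prototyping sketch (OpenAI-compatible endpoint, illustrative model id); it is not the NeMo workflow's actual code.

```python
# Hedged map-reduce summarization sketch for long documents (e.g., a 183-page
# NIST publication) that do not fit in a 4K-token context window.
# Assumptions: the same OpenAI-compatible endpoint and illustrative model id as
# before; a character-based chunker stands in for proper token counting.
import os
from openai import OpenAI

client = OpenAI(base_url="https://integrate.api.nvidia.com/v1",
                api_key=os.environ["NVIDIA_API_KEY"])
MODEL = "meta/llama2-70b"  # illustrative model id

def summarize(text: str, instruction: str) -> str:
    out = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"{instruction}\n\n{text}"}],
        max_tokens=300,
    )
    return out.choices[0].message.content

def summarize_document(document: str, chunk_chars: int = 8000) -> str:
    # Map step: summarize each chunk independently.
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]
    partials = [summarize(c, "Summarize this section in a few sentences.") for c in chunks]
    # Reduce step: merge the partial summaries into one overall summary.
    return summarize("\n".join(partials), "Combine these section summaries into one summary.")
```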

Applying Multi-Modal Foundation Models
Use cases: video search and summarization, real-time asset tracking, customer assistance, content generation, detecting hazardous conditions, human-robot interaction.

NVIDIA Optimized Visual Foundation Models
• NV-DINOv2: vision-only backbone for downstream vision AI tasks (image classification, detection, segmentation)
• NV-CLIP: image-text matching model that aligns image features with text features; backbone for downstream vision AI tasks (image classification, detection, segmentation)
• Grounding-DINO: open-vocabulary object detection with text prompts as input
• EfficientViT-SAM: faster, more efficient version of SAM (Segment Anything Model), a visual foundation model for segmenting any object based on different types of visual prompts such as a single coordinate or a bounding box
• VILA: family of visual language models for image and video understanding and Q&A
• LITA: visual language model for video understanding and context, with spatial and temporal localization
• Foundation Pose: 6-DoF object pose estimation and tracking, providing the object pose and 3D bounding box
• BEVFusion: sensor fusion model that fuses multiple input sensors (cameras, LiDAR, radar, etc.) to create a bird's-eye view of the scene with 3D bounding-box representations of the objects
• NeVA: multi-modal visual language model for image understanding and Q&A
• LiDARSAM: segment any object based on user-provided text prompts on 3D point-cloud LiDAR data

Multi-Modal Foundation Backbone: NV-CLIP
• Commercially viable: trained on ethically sourced data and compares favorably to other non-commercial public models
• Trained on a very large dataset: 700M image-text pairs for text and image embeddings
• Foundation backbone for vision AI: used in many downstream vision tasks like zero-shot detection/segmentation, VLMs, and more
Zero-shot accuracy (ImageNet-1K):
  Model  | NV-CLIP* | OpenAI CLIP**
  ViT-B  | 70.4     | 68.6
  ViT-H  | 77.4     | 78.0
  * Trained on 700M image-text pairs vs. 2B for CLIP
  ** Non-commercial use only
Available in April 2024
Available in April 2024
[Diagram: CLIP-style contrastive training. A text encoder embeds captions (e.g., "Colorful cat") into text features T_1..T_N and an image encoder embeds images into image features I_1..I_N; the model is trained on the N x N matrix of image-text similarity scores I_i * T_j.]
[Chart: model inference performance (FPS) for ViT-B and ViT-H variants across H100, A100, L40, A30, L4, and A2 GPUs.]
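A CLIP-style backbone enables zero-shot classification by comparing an image embedding against text embeddings of candidate labels. The sketch below illustrates that mechanism with the public OpenAI CLIP checkpoint via Hugging Face transformers, standing in for NV-CLIP (which is delivered through NVIDIA's own tooling); the image path and label set are placeholders.

```python
# Zero-shot image classification with a CLIP-style backbone, as in the NV-CLIP slide.
# The public "openai/clip-vit-base-patch32" checkpoint stands in for NV-CLIP here:
# embed the candidate labels and the image, then compare similarities.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a colorful cat", "a dog", "a rocket-powered bike"]  # placeholder labels
image = Image.open("example.jpg")                              # any local test image

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)[0]  # image-vs-label similarities

for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```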

Using Foundation Backbones for Downstream CV Tasks
[Diagram: input data (an image, or text plus image) passes through a foundation backbone (NV-DINO / NV-CLIP) to produce a feature vector that feeds downstream heads.]
Downstream tasks: classification, detection, segmentation, zero-shot tasks, image retrieval, VLMs, and diffusion, producing outputs such as class labels, bounding boxes and labels, per-pixel masks, and text.

Fine-Tune with 100 or Fewer Samples for Image Classification
Workflow: the NV-DINOv2 foundational model (trained on >100M image/text pairs) is kept frozen; running inference on the dataset produces feature vectors, which are fine-tuned with TAO against ground-truth labels to yield trained weights used for inference and prediction.
[Chart: few-shot learning on NV-DINOv2 vs. GC-ViT for PCB defect classification; accuracy (50-100%) against number of training samples (10, 100, 1000). Training works with as few as 10 samples.]
Demo: foundational model fine-tuning with a frozen NV-DINOv2 producing feature vectors (see the sketch below).
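The same frozen-backbone, few-shot pattern can be sketched with open components. In the hedged example below, torchvision's pretrained ResNet-50 stands in for NV-DINOv2 and scikit-learn's logistic regression stands in for TAO fine-tuning; the PCB image paths and labels are placeholders.

```python
# Hedged sketch of the frozen-backbone, few-shot workflow on this slide:
# features from a frozen backbone plus a small classifier trained on few labels.
# Assumptions: torchvision's pretrained ResNet-50 stands in for NV-DINOv2,
# and scikit-learn's logistic regression stands in for TAO fine-tuning.
import torch
import torch.nn as nn
from torchvision import models
from sklearn.linear_model import LogisticRegression
from PIL import Image

weights = models.ResNet50_Weights.DEFAULT
backbone = models.resnet50(weights=weights)
backbone.fc = nn.Identity()   # drop the classification head: output is a 2048-d feature vector
backbone.eval()               # frozen backbone, no gradient updates
preprocess = weights.transforms()

@torch.no_grad()
def embed(paths):
    """Run frozen-backbone inference to turn images into feature vectors."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    return backbone(batch).numpy()

# Few-shot training set, e.g. a handful of labeled PCB images (paths/labels are placeholders).
train_paths = ["pcb_ok_01.jpg", "pcb_ok_02.jpg", "pcb_defect_01.jpg", "pcb_defect_02.jpg"]
train_labels = [0, 0, 1, 1]

clf = LogisticRegression(max_iter=1000).fit(embed(train_paths), train_labels)
print(clf.predict(embed(["pcb_new.jpg"])))   # predict on a new sample
```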

[Demo: prompt "Player holding basketball", results shown before and after fine-tuning.]

Enterprise Gen AI: Enhance the Accuracy and Reliability of Generative AI with RAG
• Improve accuracy and security
• Control costs
• Increase productivity
• Avoid vendor lock-in
[Diagram: a user's prompt is answered through NVIDIA NIM, with retrieval and ranking over enterprise data such as images, office docs, text, and PDFs; see the sketch below.]
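To close the loop in the diagram above, the generation step places the retrieved, ranked passages into the prompt before calling the LLM. This hedged sketch reuses the retrieve() helper from the technician-support sketch as a stand-in for retrieval and reranking services, along with the same assumed OpenAI-compatible endpoint and illustrative model id as earlier.

```python
# Hedged sketch of the generation step in the RAG loop: retrieved, ranked
# enterprise passages are placed into the prompt before calling the LLM.
# Assumptions: the same OpenAI-compatible endpoint and illustrative model id as
# earlier; passages come from a retrieval step such as the earlier retrieve() sketch.
import os
from openai import OpenAI

client = OpenAI(base_url="https://integrate.api.nvidia.com/v1",
                api_key=os.environ["NVIDIA_API_KEY"])

def answer_with_context(question: str, passages: list[str]) -> str:
    context = "\n\n".join(passages)   # top-ranked chunks from enterprise data
    prompt = (
        "Answer the question using only the context below, and cite it.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    out = client.chat.completions.create(
        model="meta/llama2-70b",      # illustrative model id
        messages=[{"role": "user", "content": prompt}],
        max_tokens=300,
    )
    return out.choices[0].message.content
```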