Innovating Inference - Remote Triggering of Large Language Models on HPC Clusters Using Globus Compute

Uploaded by globusonline, May 31, 2024

About This Presentation

Large Language Models (LLMs) are currently the center of attention in the tech world, particularly for their potential to advance research. In this presentation, we'll explore a straightforward and effective method for quickly initiating inference runs on supercomputers using the vLLM tool with ...


Slide Content

Innovating Inference at Exascale
Remote triggering of Large Language Models on HPC clusters using Globus Compute


Aditya Tanikanti
[email protected]
Computer Scientist
ALCF

Inference at ALCF
[Figure: A subset of neurons from human brain tissue, reconstructed on Aurora using the FFN convolutional network from Harvard electron-microscopy data (2-micron scale bar shown). Larger-volume analysis will continue on Aurora under an INCITE award.]
[Figure: Potential to screen 240 billion molecules in 10 minutes on 10K Aurora nodes for the RTCB cancer protein. High-performance binding-affinity prediction with a Transformer-based surrogate model, A. Vasan et al., to appear in HICOMB 2024.]
Connectomics and the RTCB cancer protein (referenced above) are two of the many model-inference examples that run on ALCF clusters.

GPU workloads for Inference
Inference for LLMs, among other models, represents the cutting edge of inference workloads. Our overarching
objective is to democratize access to inference services built on rich scientific data for all users of ALCF clusters.
Source: https://www.investors.com/news/technology/ai-stocks-market-shifting-to-inferencing-from-training/

LLM Inference
●Unlocking the potential of Large Language Models (LLMs) such as ChatGPT requires tools like vLLM, a
library designed specifically for fast LLM inference and serving.
●vLLM supports a wide range of generative Transformer models from the HuggingFace Transformers
ecosystem, and its documentation lists the supported model architectures along with notable
models that use each architecture.
Source: https://docs.vllm.ai/
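As a concrete illustration, here is a minimal offline-inference sketch using the vLLM library. The model name `facebook/opt-125m`, the sampling parameters, and the prompt are illustrative assumptions (not from the slides), and running it requires a GPU node with vLLM installed:

```python
# Minimal vLLM offline-inference sketch. The model name and sampling values
# are illustrative; any architecture listed in the vLLM docs should work.
def run_inference(prompts, model="facebook/opt-125m", max_tokens=64):
    # Import deferred so the sketch can be read without vLLM installed;
    # on a GPU node, `pip install vllm` provides these.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model)  # downloads/loads the HuggingFace weights
    params = SamplingParams(temperature=0.8, max_tokens=max_tokens)
    # generate() returns one RequestOutput per prompt; take the first completion.
    return [out.outputs[0].text for out in llm.generate(prompts, params)]

# Example usage (on a GPU node):
#   run_inference(["What does an exascale supercomputer enable?"])
```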

Single User: Globus Compute for remote inference at ALCF
Client App: users interact from notebooks, using Globus tools to execute vLLM.
Polaris (remote, reached after user-credential authentication):
●Endpoint: set up to run vLLM & Ray to serve LLM models
●GPU Node: elastic, sized to the model's requirements
●Model weights
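The single-user flow above can be sketched with the Globus Compute SDK: a function is shipped to a registered endpoint on Polaris and executed there. The endpoint UUID is a placeholder, and the `/generate` route and JSON fields assume vLLM's demo API server is listening locally on the node; treat both as assumptions, not the presenters' exact setup:

```python
ENDPOINT_ID = "00000000-0000-0000-0000-000000000000"  # placeholder endpoint UUID

def query_vllm(prompt, max_tokens=64):
    """Runs on the remote endpoint: forward the prompt to a vLLM server
    assumed to be listening on localhost:8000 (route/fields are assumptions)."""
    import json
    import urllib.request
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()
    req = urllib.request.Request(
        "http://localhost:8000/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text"][0]

def submit_remote(prompt):
    # Globus Compute serializes query_vllm, runs it on the endpoint,
    # and returns a future whose result is the generated text.
    from globus_compute_sdk import Executor  # pip install globus-compute-sdk
    with Executor(endpoint_id=ENDPOINT_ID) as ex:
        return ex.submit(query_vllm, prompt).result()

# Example usage (requires Globus credentials and a running endpoint):
#   submit_remote("Summarize exascale computing in one sentence.")
```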


Multi User: Globus Compute for remote inference at ALCF
Client App: users interact with the Django web portal through its UI or API.
Django Portal: provides UI and API access to the running vLLM endpoints.
Polaris (nodes allocated through the job scheduler):
●Endpoint: multiple pre-registered endpoints running vLLM + Ray with various models
●GPU Nodes: elastic, sized to the model requested
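In the multi-user flow, a client talks to the Django portal's API rather than to Globus Compute directly. The portal URL, JSON field names, and model identifier below are hypothetical placeholders chosen for illustration; consult the inference-as-a-service repository for the actual API:

```python
import json
import urllib.request

# Hypothetical portal URL; the real deployment address is not in the slides.
PORTAL_URL = "https://inference-portal.example.org/api/completions"

def build_payload(prompt, model="meta-llama/Llama-2-7b-chat-hf", max_tokens=64):
    """Assemble a completion request body (field names are assumptions)."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def query_portal(prompt, token):
    """POST a completion request with a bearer token; return the JSON reply."""
    req = urllib.request.Request(
        PORTAL_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```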


Summary
At present, our inference service delivers models from Polaris. Ongoing work focuses on building
computational endpoints to support model deployment from Aurora, the AI Testbed, and dedicated inference clusters.

Step-by-step guide to running vLLM on Polaris:
https://github.com/atanikan/vllm_service
Globus Compute notebook to run vLLM remotely:
https://github.com/atanikan/vllm_service/blob/main/inference_using_globus/vLLM_Inference.ipynb
Django web portal to run inference:
https://github.com/argonne-lcf/inference-as-a-service/tree/main
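The step-by-step guide linked above covers endpoint setup in detail; a condensed sketch of the `globus-compute-endpoint` CLI workflow follows. The endpoint name "vllm-endpoint" is arbitrary, and the config path reflects recent versions of the CLI (treat it as an assumption to verify against the guide):

```shell
# Install the endpoint software plus the serving stack on Polaris.
pip install globus-compute-endpoint vllm ray

# Create a named endpoint; this writes a config under ~/.globus_compute/.
globus-compute-endpoint configure vllm-endpoint

# Edit ~/.globus_compute/vllm-endpoint/config.yaml to request GPU nodes
# through the Polaris job scheduler before starting.
globus-compute-endpoint start vllm-endpoint

# Lists endpoints with their UUIDs; the UUID is what clients pass as endpoint_id.
globus-compute-endpoint list
```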

This research used resources of the Argonne Leadership Computing Facility,
which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.