Pratik Mishra delivered insights on architecting scalable, deployable, and resilient AI infrastructure at scale. His discussion on fault tolerance, checkpoint optimization, and the democratization of AI compute through AMD's open ecosystem resonated strongly with the challenges teams face in production ML deployments.
Size: 2.22 MB
Language: en
Added: Oct 02, 2025
Slides: 26 pages
Slide Content
AI at scale: Architecting Scalable, Deployable, and Resilient Infrastructure
Pratik Mishra
AMD
September 20, 2025
Alluxio AI/ML Meetup
San Francisco, CA
Agenda
Disclaimer: Please refer to the copyrights and disclaimer in the presentation. We have tried to cite the most relevant sources. We (the authors and the associated organization) take no responsibility for the content's accuracy or claims; it should be viewed as personal viewpoints/opinions intended to foster open discussion.
• AI Deployments and Challenges
• Infrastructure, Reliability, and Foundation Model Training
• Conclusion
SDC’24, FMS’25, SDC’25
AI Infrastructure: Deployments and Challenges
How to train a dragon model?
Pipeline (diagram): Data & the 5 Vs → Data Ingestion → Data Storage → Data Preparation → Training ("the magic") → Foundation Model Deployment → Downstream Tasks → Users.
• Data Ingestion: stream bulk "objects" to clouds/data-centers.
• Data Preparation: ETL into training-accessible formats, annotation, indexing, etc.
• Training (on GPUs): model set-up (training strategies), execution (run training), persistence (save/load checkpoints), validation & monitoring.
• Foundation Model Deployment (on GPUs): deploy the FM for downstream tasks such as fine-tuning, post-training, and inference endpoints.
• Users: prompts, agent interactions, etc. (UXI).
AI developer priorities: fast model convergence, efficient algorithm design, and rapid deployment to accelerate time-to-market.
But think deeper: maximize GPU utilization, minimize stalls, optimize throughput, and reduce latency to drive "real" ROI.
AI Tech Stack: the 100K-foot bird's-eye view
Layers, top to bottom:
• AI Developers and Applications: pre-training, fine-tuning, post-training, inference, agents (multi-modal data throughout)
• Training & Inference Frameworks (PyTorch, TensorFlow, vLLM, SGLang)
• Distributed AI Compute Managers (Ray, Spark, etc.)
• Model Deployment (k8s, slurm) & Container Orchestrators
• Data Storage and Management: ingestion, processing, labeling, archive, data lake, vector DB; file / block / object
• Compute Infrastructure: GPU, CPU, NIC/DPU, frontend + backend networks, memory, local storage
• CSPs and/or on-prem infrastructure
What do they need to care about?
• The highly simplified AI tech stack
• Access to tools, infrastructure, and deployments
• Most importantly, access to SOTA GPUs
On top of all that, ecosystems with closed stacks limit innovation and flexibility and raise the barrier to entry.
Sovereign AI Case Study: Motif Technologies Multi-Modal Training with the AMD Ecosystem
Motif Technologies Training Infrastructure powered by AMD
Motif Technologies (South Korea) runs multi-modal AI workloads on AMD Instinct MI250
GPUs using AMD-optimized Docker containers with SkyPilot orchestration.
Motif Technologies: AMD Developer Cloud with MI300X
Disclaimer: The performance metrics and results presented are based on partner-provided data and have not been independently verified by AMD. These figures are shared as-is and may vary depending on system configuration, workload characteristics, and optimization levels. AMD makes no representations or warranties regarding the accuracy or completeness of third-party performance claims.
AI for ALL: a democratized platform with an open, optimized AI ecosystem and access to SOTA AMD GPUs fosters innovation, especially for startups, researchers, and emerging markets.
Motif 2.6B on 1x MI250 vs. 1x MI300X: 5X throughput gains on a single MI300X, bigger batches, etc.
Motif Kernel: https://huggingface.co/Motif-Technologies/activation
Call to Action!
Built by developers for developers.
• AMD is building for you, come build on us.
• Commitment to an open AI ecosystem
• Full AI lifecycle
• Industry-leading GPU technology
AI Infrastructure: Reliability and Scalability
AI Training Infra Reliability 101: Metrics
• Training Goodput = actual progress made / total time
• Effective Training Time Ratio (ETTR) = actual training time / total time
• Model FLOPs Utilization (MFU) = FLOPs the model utilizes / peak HW FLOPs available
• Mean Time Between Failures (MTBF) = total time / # of failures
Achieving high training goodput and maximizing model FLOPs utilization to improve the
Effective Training Time Ratio remains a significant and ongoing challenge.
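As a rough illustration of how these quantities relate (not from the talk; the function name and every input below are placeholders, and real deployments derive them from job telemetry such as step timers, scheduler/failure logs, and hardware counters):

```python
def training_reliability_metrics(total_time_s, productive_time_s, num_failures,
                                 model_flops_per_s, peak_hw_flops_per_s):
    """Hedged, illustrative sketch of the metrics defined above."""
    ettr = productive_time_s / total_time_s           # Effective Training Time Ratio
    mfu = model_flops_per_s / peak_hw_flops_per_s     # Model FLOPs Utilization
    mtbf_s = total_time_s / max(num_failures, 1)      # Mean Time Between Failures
    # Training goodput has the same shape as ETTR: progress actually banked
    # toward the final model, divided by total wall-clock time.
    return {"ETTR": ettr, "MFU": mfu, "MTBF_hours": mtbf_s / 3600}


# Example: a 7-day job that keeps 6.4 days of productive training across 6 failures.
print(training_reliability_metrics(
    total_time_s=7 * 24 * 3600,
    productive_time_s=6.4 * 24 * 3600,
    num_failures=6,
    model_flops_per_s=4.0e14,      # placeholder achieved FLOPs/s
    peak_hw_flops_per_s=1.0e15,    # placeholder peak HW FLOPs/s
))
```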
Failures and Training Efficiency?
Reliability and Training Efficiency @scale
With growing scale of AI deployments, the MTBF decreases significantly.
Therefore, resiliency is core to achieving training efficiency and increasing training goodput and ETTR.
Figure: projected Mean Time Between Failures (log scale, normalized minutes) versus number of accelerators, falling from years at node scale through months at rack scale down to hours and minutes at cluster and data-center scale (<24 hrs, <30 mins, <5 mins). Projections of AI training system failures at scale, not specific to any accelerator, spanning the millions to billions of components across the SW and HW stacks in the data-center hierarchy.
MTBF ∝ 1 / (no. of components)
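The scaling intuition is simple arithmetic; a back-of-the-envelope sketch (the 5-year node MTBF and the node counts are made-up placeholders, and real failures are not perfectly independent):

```python
# With roughly independent failures, fleet MTBF shrinks inversely with
# component count: MTBF_fleet ~= MTBF_node / N.
node_mtbf_hours = 5 * 365 * 24        # placeholder: one node fails every ~5 years

for n_nodes in (1, 64, 1_024, 16_384, 131_072):
    fleet_mtbf_hours = node_mtbf_hours / n_nodes
    print(f"{n_nodes:>7} nodes -> MTBF ~ {fleet_mtbf_hours:10.2f} hours "
          f"({fleet_mtbf_hours * 60:12.1f} minutes)")
```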
Fault Tolerance, Training Efficiency and Checkpointing
• Fault tolerance, resiliency, and recovery are of utmost importance for the training-efficiency metrics discussed earlier.
• Checkpointing is the critical fault-tolerance mechanism for periodically persisting training snapshots, enabling recovery via rollbacks in the event of failure.
• Also used for: hardware refresh, resource re-balancing, post-training, concurrent evaluation, improving accuracy, etc.
With scale and ever-lower MTBFs, checkpointing frequency, size, and complexity increase significantly, imposing a heavy data-center tax (GPU underutilization).
• Checkpointing is the storage community's poster-child AI use case.
Fault Tolerance Tax: Checkpointing
Timeline: T_progress_1 … T_progress_n interleaved with T_chkpt_save_1 … T_chkpt_save_n; on failure, add T_itr_lost, T_recovery, and T_chkpt_load.
FT_overhead = T_chkpt_save + T_itr_lost + T_recovery + T_chkpt_load (as a fraction of total time)
ETTR = 1 − FT_overhead
• Achieving optimal ETTR at data-center scale is a "real" challenge.
• Without optimization, systems may spend more time managing failures than actually training.
• Trade-off: excessive checkpointing increases the data-center tax, while infrequent checkpointing increases risk (cost of lost work).
• Data-center tax: compute, network, and storage.
Therefore, to achieve optimal ETTR (and goodput), reliability mechanisms must strike a balance between performance, scalability, and cost-effectiveness.
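One common way to reason about this trade-off, not covered in the slides, is the Young/Daly first-order approximation, which picks a checkpoint interval from the checkpoint cost and the MTBF; a minimal sketch with placeholder numbers:

```python
import math

def young_daly_interval_s(checkpoint_cost_s, mtbf_s):
    """Young/Daly first-order optimum for the checkpoint interval.

    Checkpointing too often pays the save cost repeatedly (data-center tax);
    checkpointing too rarely loses more work per failure. The square-root
    rule balances the two to first order.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Placeholder numbers: a 60 s checkpoint save and a 4-hour fleet MTBF.
interval = young_daly_interval_s(checkpoint_cost_s=60, mtbf_s=4 * 3600)
print(f"checkpoint roughly every {interval / 60:.1f} minutes")
```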
Checkpoint (save) = Serialization + Persistence
SDC’24, SDC’25, FMS’25
Synchronous chkpt: the main training thread waits until the checkpoint is persisted.
• Short, periodic, bursty writes.
• Over-subscribes front-end NICs and storage infrastructure.
• Leads to GPU stalls before training can resume.
Asynchronous chkpt: the main training thread is alleviated from I/O persistence.
• Overlaps I/O with computation.
• Reduces peak pressure on network and storage by "buffering".
• Still not truly asynchronous (I/O verbs!).
Existing implementations need further optimization to reduce at-scale overheads. Reliable and unified memory + storage tiering is essential, masking I/O and communication overheads with computation.
Example: local NVMe → PFS → object storage, or combinations.
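A minimal PyTorch-style sketch of the asynchronous pattern described above (illustrative only, not AMD's or any framework's actual implementation; production systems add sharding, tiering across NVMe/PFS/object storage, and integrity checks):

```python
import threading
import torch

def _to_cpu(obj):
    # Recursively snapshot tensors to host memory so the background writer
    # never races with GPU tensors that training continues to mutate.
    if torch.is_tensor(obj):
        return obj.detach().to("cpu", copy=True)
    if isinstance(obj, dict):
        return {k: _to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(_to_cpu(v) for v in obj)
    return obj

def async_checkpoint(model, optimizer, step, path):
    """Stage 1 (blocking, short): device-to-host copy of the training state.
    Stage 2 (background): persist to storage, overlapping I/O with compute."""
    snapshot = {
        "step": step,
        "model": _to_cpu(model.state_dict()),
        "optimizer": _to_cpu(optimizer.state_dict()),
    }
    writer = threading.Thread(target=torch.save, args=(snapshot, path), daemon=True)
    writer.start()
    return writer   # join() before taking the next checkpoint
```

The "still not truly asynchronous" caveat shows up here too: the device-to-host copy still blocks the training thread briefly, and the background writer still issues synchronous I/O verbs to the storage backend.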
Checkpoint (loads) = Loading + Deserialization
• Loading checkpoints is mission-critical.
  • All GPUs simultaneously load their states to resume training.
  • Massive I/O amplification compared to save(s).
  • Deserialization overheads are massive.
  • Concurrent loading can destabilize the entire infrastructure.
  • Also needed for downstream tasks: post-training, inference, etc.
• Optimizations:
  • GPU-GPU network-aware checkpoint loading.
  • Metadata optimizations (unpickling) and file formats.
  • Predictive storage tiering.
Efficient, fault-tolerant checkpoint loading at scale requires GPU-storage path optimizations and topology-aware strategies to sustain robust infrastructure and high MFU.
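As an illustration of the GPU-GPU network-aware idea, here is a hypothetical data-parallel sketch in PyTorch where a single rank reads the checkpoint and fans it out over the collective backend (RCCL/NCCL) instead of every rank hitting storage at once; the function name and flow are assumptions, and sharded or tensor-parallel layouts would need a sharded reader:

```python
import torch
import torch.distributed as dist

def broadcast_load(path, model, src_rank=0):
    """Hypothetical topology-aware load for a data-parallel job.

    A single rank touches the storage backend; every other rank receives
    the weights over the GPU-GPU network, avoiding the read amplification
    of all ranks loading the same checkpoint simultaneously.
    """
    if dist.get_rank() == src_rank:
        state = torch.load(path, map_location="cpu")
        model.load_state_dict(state["model"])

    # Fan out in place: state_dict() tensors alias the live parameters/buffers.
    for tensor in model.state_dict().values():
        dist.broadcast(tensor, src=src_rank)
```

This also concentrates the deserialization (unpickling) cost on one rank instead of paying it N times, which is where the metadata and file-format optimizations above come in.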
Data Movement: The necessary Evil!
The goal is to maximize GPU utilization while reducing the impact of data entropy. Large amounts of data must move across and within nodes, servers, racks, and even data-centers, in all directions (E-W, N-S).
Fault-Tolerance and Reliability in Cloud AI
Collaborators: Tian Xia (PhD student) and Zhifei Li (visiting research student), with Dr. Ion Stoica
Sky Computing Labs, UC Berkeley
AI Training in the Cloud
Training interruptions are common (discussed earlier):
• VM failures due to HW or SW failures in the allocated servers
• VM preemptions/re-allocations to different locations (servers, regions, etc.)
Emerging use-case: spot instances
• Significant cost-effectiveness across regions and clouds
• Particularly useful for offline training jobs
• However, preemptions can happen at any moment
How do we recover efficiently, retaining the cost savings while striking a balance between performance and scalability across cloud networks?
Tian Xia, UC Berkeley
Spot-Training Resumption: Checkpoint Migration
Checkpoint migration enables spot-instance recovery by overlapping instance startup with checkpoint transfer and loading across regions or geographic boundaries.
Lots of dynamically moving parts: which location to resume in, data-egress cost, moving and loading checkpoints.
How do we achieve high training throughput and ETTR while on a tight cost and time budget?
Tian Xia, UC Berkeley
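One way to picture that decision is as a small cost/time model. The sketch below is hypothetical: the field names, the overlap assumption, and the idea of folding egress, transfer time, and spot price into a single score are illustrative assumptions, not part of this work.

```python
def pick_recovery_region(candidates, ckpt_gb, deadline_s, dollars_per_lost_hour):
    """Score candidate regions for spot-training resumption (toy model).

    Each candidate is a dict with placeholder fields:
      spot_price      - $/GPU-hour in that region
      egress_per_gb   - $/GB to move the checkpoint there
      bandwidth_gbps  - effective checkpoint-transfer bandwidth
      startup_s       - instance provisioning time (overlapped with transfer)
    """
    best = None
    for c in candidates:
        transfer_s = ckpt_gb * 8 / c["bandwidth_gbps"]       # GB -> Gb / Gbps
        time_to_resume_s = max(c["startup_s"], transfer_s)   # startup overlaps transfer
        if time_to_resume_s > deadline_s:
            continue                                         # misses the time budget
        cost = (ckpt_gb * c["egress_per_gb"]                  # data-egress cost
                + time_to_resume_s / 3600 * dollars_per_lost_hour  # idle-GPU cost
                + c["spot_price"])                            # first GPU-hour as a proxy
        if best is None or cost < best[0]:
            best = (cost, c)
    return best
```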
Need for a Unified Global Storage System
A unified, geo-distributed storage system can reduce the north-south data-entropy tax while optimizing compute, network, and storage utilization, balancing infrastructure constraints for GPU-accelerated AI workloads.
Conclusion
Unlocking the full potential of GPU-accelerated AI requires overcoming key barriers.
The community must unite to innovate and strike a balance between performance, scalability, and cost
with an open AI ecosystem—building an inclusive AI future for all.
Thank you!
Pratik Mishra | AMD