Triton Inference Server

Triton Inference Server high-level architecture (Source: NVIDIA Developer, https://developer.nvidia.com/nvidia-triton-inference-server)

- Multi-framework support through backends, including TensorRT, vLLM, ONNX, and others, with CPU offloading via OpenVINO
- The scheduler receives each incoming inference request and routes it to the appropriate model and backend; per-model scheduling and batching policies determine how and when the request is executed (see the client sketch below)
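To make the request flow concrete, here is a minimal sketch of sending a request to a running Triton server with the official Python client (installable via pip as tritonclient[http]). The server URL, model name ("my_model"), tensor names ("INPUT0"/"OUTPUT0"), shape, and datatype are placeholders for illustration; they must match the model's configuration in your model repository.

```python
# Minimal Triton HTTP client sketch.
# Assumptions: a Triton server is running on localhost:8000 and serves a model
# named "my_model" with an FP32 input "INPUT0" of shape [1, 3, 224, 224] and an
# output "OUTPUT0". These names/shapes are placeholders, not from the source text.
import numpy as np
import tritonclient.http as httpclient

# Connect to Triton's HTTP endpoint (default port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the input tensor; shape and dtype must match the model configuration.
data = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

# Ask Triton to return a specific output tensor by name.
requested_output = httpclient.InferRequestedOutput("OUTPUT0")

# The request names the target model explicitly; Triton hands it to that
# model's scheduler, which may queue or batch it before the backend runs it.
response = client.infer(
    model_name="my_model",
    inputs=[infer_input],
    outputs=[requested_output],
)

print(response.as_numpy("OUTPUT0").shape)
```

The same request could be sent over gRPC with tritonclient.grpc; which backend (TensorRT, vLLM, ONNX Runtime, OpenVINO, etc.) actually executes the model is decided by the model's configuration on the server, not by the client.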