Workshop

Advanced

Effortless Scalability: Orchestrating Large Language Model Inference with Kubernetes

11:30 AM - 12:15 PM ET; Novella Room

In the dynamic landscape of AI/ML, deploying and orchestrating large open-source inference models on Kubernetes has become paramount. This talk delves into automating the deployment of heavyweight models like Falcon and Llama 2, using Kubernetes Custom Resource Definitions (CRDs) to manage large model files through container images. An HTTP server streamlines the deployment by serving inference calls through the model library.


This session will explore how preset configurations eliminate the manual tuning of deployment parameters to fit GPU hardware. Learn how to auto-provision GPU nodes based on specific model requirements, ensuring efficient use of resources. We'll also discuss how users can deploy their own containerized models by providing a pod template in the inference field of the workspace custom resource. The controller, in turn, dynamically creates deployment workloads that utilize all GPU nodes.

Joinal Ahmed

Joinal is an experienced data science professional with an interest in building solutions through quick prototypes, community engagement, and influencing technology adoption. Joinal brings strong technical understanding and experience in managing cross-functional teams of data scientists, data engineers, backend developers, and SREs delivering end-to-end ML projects; recruiting and mentoring engineers across levels; streamlining ML and data workflows for high-performing teams; setting up best practices; and developing highly performant and reliable MLOps and data platforms that support end-to-end ML projects and analytics use cases.

Grab your ticket @ Ticket Tailor