Effortless Scalability: Orchestrating Large Language Model Inference with Kubernetes

Workshop

Advanced

11:30 AM - 12:15 PM ET; Novella Room

Deploying and orchestrating large open-source inference models on Kubernetes has become a central concern for AI/ML teams. This talk covers how to automate the deployment of heavyweight models such as Falcon and Llama 2, using Kubernetes Custom Resource Definitions (CRDs) to manage large model files through container images. Each deployment includes an HTTP server that handles inference calls using the model library.

This session explores how preset configurations eliminate the manual tuning of deployment parameters to fit GPU hardware. Learn how to auto-provision GPU nodes based on each model's requirements so that resources are used efficiently. We'll also discuss how users can deploy their own containerized models by providing a pod template in the inference field of the workspace custom resource. The controller, in turn, dynamically creates deployment workloads that use all GPU nodes.

Joinal Ahmed

Joinal is an experienced Data Science professional with an interest in building solutions through quick prototypes, community engagement, and influencing technology adoption. He brings a strong technical understanding and is experienced in managing cross-functional teams of data scientists, data engineers, backend developers, and SREs delivering end-to-end ML projects. His work includes recruiting and mentoring engineers across levels, streamlining ML and data workflows for high-performing teams, setting up best practices, and developing highly performant and reliable MLOps and data platforms that support end-to-end ML projects and analytics use cases.

Visit us @ ca.kubehuddle.com

Grab your ticket @ Ticket Tailor