Efficiently Serving LLMs
Travis Addair · DeepLearning.AI
In this course, you will learn how auto-regressive large language models generate text one token at a time. You will implement the foundational elements of a modern LLM inference stack in code, including KV caching, continuous batching, and model quantization, and benchmark their impact on inference throughput and latency. You will explore how LoRA adapters work and learn how batching techniques allow many different LoRA adapters to be served to multiple customers simultaneously. Get hands-on with Predibase's LoRAX framework to see these optimization techniques implemented in a real-world LLM inference server. Knowing how LLM servers operate under the hood will greatly expand the options you have for increasing the performance and efficiency of your LLM-powered applications.
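To make "one token at a time" concrete, here is a minimal sketch of the auto-regressive decoding loop the course builds on. The model here is a toy stand-in (a deterministic next-token function and a plain list standing in for per-layer key/value tensors), not a real LLM; the point is the loop structure: each step consumes the cached state plus the last token and appends exactly one new token.

```python
def toy_next_token(last_token, kv_cache):
    # A real LLM runs a forward pass attending over all cached
    # key/value states; here we just record the token so the cache
    # grows by one entry per decoded token, as a real KV cache does.
    kv_cache.append(last_token)
    return (last_token + 1) % 50257  # hypothetical vocab size


def generate(prompt_tokens, max_new_tokens):
    kv_cache = []                 # stands in for per-layer K/V tensors
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = toy_next_token(tokens[-1], kv_cache)
        tokens.append(nxt)        # one token per step: auto-regressive
    return tokens


print(generate([1, 2, 3], 4))  # → [1, 2, 3, 4, 5, 6, 7]
```

Without the cache, each step would have to re-process the entire sequence from scratch; caching the per-token state is what keeps the cost of each decode step roughly constant.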