Demo: Optimizing Gemma inference on NVIDIA GPUs with TensorRT-LLM

Published Apr 2, 2024

Even the smallest large language models are compute-intensive, which significantly affects the cost of your generative AI application. Your ability to increase throughput and reduce latency can make or break many business cases. NVIDIA TensorRT-LLM is an open-source tool that can considerably speed up the execution of your models. In this talk, we demonstrate its application to Gemma.

Subscribe to Google for Developers → https://goo.gle/developers

#Gemma #GemmaDeveloperDay
