Running Google's Gemma LLMs in the browser with MediaPipe Web
Google's Gemma 3 and Gemma 3n large language model (LLM) families offer some of the most powerful LLMs yet for their respective sizes. Moreover, they are multilingual and multimodal by design—Gemma 3 models can handle up to 140 languages, while Gemma 3n supports arbitrary combinations of text, audio, and image inputs by default. In this talk, Tyler Mullen, Staff Software Engineer on the MediaPipe team at Google, discusses how his team leveraged the same technology powering Chrome Built-in AI to bring these model families to the web. He walks through some of the challenges they overcame in order to achieve top speeds while running fully on the user's device. Then, he covers their straightforward API for running these models with only a few lines of code, allowing anyone to quickly and easily build powerful WebAI applications. He concludes by showcasing several practical (and fun) examples of this tech in action.
- Published
- Published Nov 26, 2025
- Uploaded
- Uploaded Jun 13, 2026
- File type
- YouTube
- Queried
- 00
- Source
- youtube.com
Full transcript
Showing the full transcript for this video.
AI-generated transcript with timestamped sections.
[00:05] - Hi, I'm Tyler Mullen, the creator of MediaPipe Web and the web tech lead for Google AI Edge. [00:12] And today I'm gonna be talking about how we're able to run Gemma [00:15] Google's open large language model family entirely in the browser. [00:20] So this year, the Google AI Edge team has been so busy that we had to split our work into two talks. [00:26] I'll be focusing exclusively on the progress we made running large language models, or LLMs for short, [00:32] But if you're interested in running other models, then be sure to check out my colleague's presentation from [00:36] just a bit ago on the brand new LiDAR TJS. [00:40] So for those of you who haven't heard of MediaPipeWeb, [00:42] We're probably most well known for powering Google Meet video effects in the browser, [00:47] and providing the real-time segmentation [00:49] used in Bilibili's bullet screen comments on the web. [00:52] We use technologies like WebGL, [00:54] WebGPU and WebAssembly [00:56] to bring C++ research and machine learning to the browser [00:59] in a cross-platform and scalable way. [01:02] So that's just the tip of the iceberg. So if you're interested in learning more about any of that, please check out my presentation from last year's WebAI Summit. [01:09] where I dive into all those details. [01:11] And you can find that at yt.be/mediapipewebaisummit.com. [01:15] 2024. [01:19] To quote the Google DeepMind website, "Gemma is a collection of lightweight, state-of-the-art, open models, [01:25] built from the same technology that powers our Gemini models. [01:29] And 2025 was a big year for Gemma. We saw the launch of two new model families, Gemma 3 and Gemma 3N,
[01:36] So Gemma 3 provides wide language support in developer-friendly sizes. [01:41] making it ideal for really pushing our text processing capabilities to handle a wide variety of use cases. [01:48] Gemma 3N is a mobile-first architecture optimized for low-latency audio and visual understanding. [01:54] So even though Gemma 3 has vision capabilities too, Gemma 3N seems like a great choice for our first forays into the handling of mixed or personalized. [02:03] multimodal input types. [02:09] So let's look at Gemma 3 in a little more detail. [02:13] As of today, the smallest and largest variants differ in parameter count by a factor of about 100. [02:19] So the smallest, Gemma3 270m, so named because it has 270 million parameters, [02:25] is very fast and easy to fine tune for custom task specific use cases. [02:30] as can be seen in the text to emoji demo [02:32] at the bottom left. [02:34] The largest, Gemma327b, having 27 billion parameters, provides enhanced understanding for sophisticated applications. [02:42] So this was used as the base for training the text only MedGemma 27b model shown at the bottom right. [02:49] currently listing some possible ways to help differentiate between bacterial and viral pneumonia. [02:55] GEMMA3 is also a multilingual model family. [02:58] and is trained on material from over 140 different languages in order to reach a global audience. [03:05] and [03:06] Gemma 3N actually shares architecture with Gemini Nano.
[03:10] So for the very first time, even though I'm usually focused more on web, my small team was able to simultaneously build a Web Gemma 3N runner [03:18] and the newest Gemini Nano GPU implementation [03:21] for Chrome's built-in AI. [03:23] This concurrent development was also made possible by MediaPipe's cross-platform approach. [03:28] Because that way, we could write all our deep engine stuff once in C++. [03:32] and then have the different pieces run wherever we want on all our target platforms. [03:36] If you're interested in learning more about built-in AI on Chrome, be sure to check out Kenji's talk. [03:43] Now, in order to run a new LLM architecture, [03:46] Sometimes we need to add GPU pipelines in and around the low-level transformer stack itself [03:52] And other times, we need to augment our higher level auxiliary system on top of the transformer stack. [03:58] For GEMMA3, [03:59] We had just one transformer architecture feature to add, which improves the model's performance [04:04] with larger working memory or context sizes. [04:07] However, we had a lot of system work to do in order to support this wide variety of models in the family. [04:14] so much, in fact, that we didn't fully finish yet. [04:16] In particular, our Web Inference API currently only supports some of the available QAT models. [04:22] which are models specifically trained to be robust to extra compression [04:26] for smaller file sizes. [04:29] The implementation challenges for running Gemma 3N, on the other hand, were somewhat reversed. So we only needed to support two model variants, the larger called E4B and the smaller called E2B,
[04:40] So we didn't end up having to do much auxiliary system work [04:43] Aside from the multimodality, of course, but that's kind of a separate issue. [04:49] So... [04:50] The low-level architecture itself, though, featured actually a ton of new cutting-edge research, most of them having this common theme of mobile-first efficiency. So that means they're designed to increase speed and reduce compute resource usage, especially [05:08] especially for mobile devices. For example, GPU memory is often constrained there. [05:16] So a good example of this would be the per-layer embeddings caching, which allows for some weights to be efficiently computed and kept on the CPU, [05:24] freeing space for the rest of the model running on the GPU. [05:29] OK, so how do we do? Let's take a look at the results. Now, there's a lot of data packed into just this one table. But the key takeaway here really is just that it's low overhead and fast. [05:40] On a 2024 MacBook Pro, we see that even the largest models can run in the browser and generate content at approaching human reading speeds and using remarkably little CPU memory. [05:51] Generation speed is this highlighted decode column. [05:54] Smaller models are, of course, much faster. [05:57] And input processing speeds, denoted by pre-fill here, are across the board significantly faster than output generation. [06:06] So all of this was made possible by our streaming loading system, which we launched last year
[06:12] to enable us to load models piece by piece. [06:15] So this was necessary for us to load large models in browsers where WebAssembly memory was limited, to only 2 gigabytes or only 4 gigabytes. [06:23] But this year, streaming loading proved invaluable for both Gemma 3 and Gemma 3N. [06:28] For GEMA3, as seen on the previous slide, [06:31] It allowed us to keep our overall CPU footprint [06:34] really tiny, even with these 27 gigabyte models. [06:37] And for Gemma 3n, it enabled us to combine all of these model components into a single file [06:42] and then pick and choose the parts to load on demand. [06:45] So that made it easy for us to offer vision and audio as optional and not required modalities. [06:52] But not every challenge we faced was as easily overcome. [06:56] For example, [06:57] Properly handling 16-bit floats for Gemma 3 models required a brand new system. [07:02] Why do float sizes matter? [07:04] Using smaller float sizes can, of course, reduce GPU memory usage. [07:08] But it also has a huge impact on the model's reading [07:10] or pre-fill speed. [07:12] That's because generating content is often limited to one token at a time, but reading can usually be performed on a lot of tokens simultaneously. So this sort of optimization has a much bigger impact there. [07:23] as can be seen in this table. [07:27] The difficulty comes from a formatting mismatch. [07:30] Gemma 3 models were trained using one format called bfloat16. [07:35] which is a popular choice for machine learning applications and runs natively on many ML devices like most TPUs. [07:40] It trades some precision bytes for exponent bytes,
[07:44] So it has a huge maximum value, which is larger than a 3 with 38 zeros after it. [07:49] However, [07:50] Our GPU inference backends use just regular float 16. And that has a maximum value of only 65,504. [07:58] So models trained using BFloat16 [08:00] especially large ones, [08:02] often produce tons of values internally when they're run, which simply just don't fit into float 16s. [08:09] To try to get the best of both worlds in a relatively simple fashion, we run part of our model in float32, [08:15] and part and float 16 using a special transition system. [08:18] In each layer of our LLM, [08:20] we often have lots of smaller operations in a group, as depicted to the right. [08:25] These are small enough that they don't overflow by themselves, and are also the primary source of speedup from using smaller floats, [08:31] So we just use float 16 for everything here. [08:34] The rest of the layer, shown in the middle, has operations which accumulate, normalize, or combine previous results. And these values alone can already get really big. [08:43] So we run these and the rest of our LLM [08:46] using float32 as kind of our default for everything, just to make sure that we don't overflow. [08:52] OK, enough theory. Let's see the API in some source code. [08:55] And for reference, more complete documentation is available at goo.gull/mediapipellminferenceweb. [09:03] Here's a small JavaScript demo. [09:05] which runs an LLM using MediaPipeWeb. [09:08] It has three URLs. [09:10] One for the MediaPipe JavaScript library, [09:12] One for the MediaPipe WebAssembly files,
[09:14] and one for your particular web LLM model file. [09:18] But aside from the URLs, the logic here is just to load an LLM, [09:21] and then generate a response, logging the result to the developer console. [09:25] Thank you. [09:25] Once an LLM has been loaded, it can be used to generate multiple responses. [09:30] For our web conversions of GEMA3 models, there's one more important detail. [09:34] We use what are called instruction-tuned versions of the models. [09:38] So they expect a certain very specific template to be followed exactly for any queries we give them. [09:42] It's very important we follow this expected pattern [09:45] or else things can break, sometimes in really weird ways. [09:48] For our Gemma 3 models, the extra characters required can be seen in the highlighted lines of code at the bottom. [09:54] So basically, I just add this exact prefix and this exact postfix around my query, and then I'm good. [10:01] And here's an example of the sort of output you can get in this fashion. [10:04] So this was actually taken from one of our new demos, which I'll talk about more shortly. [10:08] It's running a Gemma 3 4B on my MacBook Pro 2023. [10:12] and doing a pretty good job at figuring out this tricky math problem. [10:16] For a list of more available models, you can also browse our web models collection from the LiDAR-T [10:21] community page and hugging face. [10:26] For multimodality with Gemma 3N, [10:28] we need to make two more changes. [10:30] First, we need to enable multimodality in the inference options. [10:34] setting maxNumImages to anything larger than zero [10:37] enables vision. [10:38] and setting support audio to true enables audio. [10:41] Thank you. [10:41] So Gemma 3N can be run with both audio and vision, or just one of these modalities enabled, or can actually even be run without either for text-only processing.
[10:51] Afterwards, we're able to insert vision and audio into our queries. [10:55] For full flexibility, our API just takes an ordered list of these prompt pieces. So you can choose exactly how to interleave different audio, text, and image parts. [11:05] For vision, we support image URLs and most common image, video, or canvas objects. But for audio, only single channel audio buffer and audio file URLs are currently supported. [11:16] And here's the result. [11:17] So on the left, I'm actually passing in my webcams video element directly into Gemma 3N, and then it describes me. [11:24] my backpack, and my background. [11:26] Uh, it-- [11:27] Also additionally surmises that I might be involved in tech, possibly in AI or web-related applications. So right on all counts. [11:37] And on the right, Gemma 3N first transcribes my audio file, [11:41] which says Gemma 3N can handle image and audio inputs. [11:44] and then proceeds to expand on the significance of that a little bit more. [11:54] So now that we've gone through a few small examples, let's see what a full application might look like. [12:00] We're open sourcing two demos, and I'm also hosting them both publicly as hugging face spaces, so you can try them out for yourself today. [12:06] They require a device and browser with sufficient resources and web GPU support. And they also require you to have a Hugging Face account [12:14] and accept the Gemma Terms of Use license on that account. [12:17] So with that out of the way, let's jump right into a demonstration of our Gemma 3 chat suite.
[12:23] So we can see that a few models have been cached locally already, so I can immediately use those even though I'm currently signed up. [12:29] I change the options as desired. [12:31] and then start chatting with the model. [12:33] My first chat with new settings causes the model to load or reload. [12:37] You can see it's pretty fast when it's loading locally from cache, [12:40] But some of these models are very large, again, up to 27 gigabytes. So if the file needs to be downloaded remotely, especially over a slower internet connection, that can take a long time. [12:50] To download a new model, I'll sign in with my Hugging Face account, [12:54] and it's not necessary to give it access to any of my organizations, [12:59] So I don't. But after clicking Authorize, [13:04] I can know. [13:05] and try out any of the other models. [13:07] They will automatically download from Hugging Face [13:11] This is on Chrome. [13:14] And since I've already accepted the Gemma license on my account, [13:17] everything will work perfectly, and then it'll cache itself locally when done. I should also mention that the MedGemma 27B model in particular requires accepting a different license in order to download it. [13:29] And this is our Gemma 3N multimodal demo. [13:32] where you can use text or your microphone to ask questions about what's in your webcam. [13:37] So here, for a good test, I try to keep my webcam feed [13:41] relatively stable, and I first type in a question, asking it what color my shirt is. [13:49] After that, I used my microphone to give it a tougher one, asking it to transcribe all the text and then translate it to English.
[14:01] And it gets that right, too. [14:02] Thank you. [14:04] So this demo is a really good one to check out the source code for, because it's around 400 lines of human-readable JavaScript code. And that includes everything. [14:12] the model caching, the authentication, the webcam and microphone usage, all the UI elements, and the LLM running. [14:21] Since Gemma 3N is a mobile-first architecture, it stands to reason that this demo might even be portable to some mobile devices. [14:28] And indeed, it can run on some mobile browsers, like Chrome on the Pixel 9. [14:32] at least with vision and text. [14:35] Audio is a little trickier. [14:37] To help with running on more resource-constrained devices, we offer a E2B version of the demo, as opposed to the original, which uses the E4B variant of Gemma 3. But otherwise, the two demos are completely identical. [14:51] And that concludes my presentation. Thank you for joining me.
Want to learn more?