Offline vector search with SQLite and EmbeddingGemma

Learn from Rody Davis, Senior Developer Relations Engineer at Google, how to embed and query documents using SQLite with EmbeddingGemma and Gemma 3. Create a RAG system that runs in the browser, fully offline.

Published: Nov 26, 2025
Uploaded: Jun 13, 2026
File type: YouTube
Source: youtube.com

Full transcript

AI-generated transcript with timestamped sections.

0:06-1:44

[00:06] How's it going? My name is Rody, and I am a developer relations engineer at Google, working on the AI Workflows team. [00:14] Super excited to be here today at the WebAI Summit. Something I'm very passionate about is on-device [00:20] and local-first types of applications. [00:23] And so before we get started, let's talk about vectors and databases. [00:27] Vectors, as you know, can be generated with both hosted and local models. There are a lot of trade-offs that go with each one of those, [00:35] and typically it will depend on the [00:37] specific needs of your application. [00:40] So vector stores can grow quite large and often require an API to access. [00:45] And while that's fine for certain types of applications, [00:49] it may not be ideal when you have intermittent network connectivity, for example. [00:54] One important thing about vectors, if you're not familiar with them, is you have to have the same encoder and decoder [01:01] both [01:02] wherever you query and wherever you update documents. [01:07] This is really important because you can't take advantage of a really powerful [01:11] encoder and then a very lightweight decoder; you have to use the same, [01:17] which was kind of frustrating when I was first getting into it. [01:21] And then another thing that [01:23] a server side can really take advantage of is, they can be so much faster because they have a lot of RAM, they're optimized for NVMe storage. [01:30] I've listed a lot of pros on the server side, but why would you even want them on the client? Well, first of all, you can store the vectors just for the user. You never have to worry about running a query and getting

1:44-3:25

[01:44] some sort of dimensionality for content that's not theirs. You can also just have the advantage of it already being partitioned for that user on the client. [01:54] So with most things that require trade-offs, usually a hybrid approach is more appropriate. [02:01] And in this case, we can use the server side [02:04] to have some nice parallel compute, to be able to batch encode a bunch of vectors. And we can store them [02:11] inside of Firebase using Firestore vector support, which was added. And of course, there are vector databases that already work [02:20] a lot better for vectors, but one of the nice things about Firestore is it gives us a nice syncing [02:26] modality that we can use on the client, so that we can store everything in a bucket per user. [02:31] And really, it's meant to be a fallback when the model isn't downloaded. [02:35] I'm a huge fan of SQLite, and one of the cool things about SQLite is you can load extensions, including vector support. [02:41] So we can actually pull in those vectors from Firestore into SQLite, [02:45] and then you can query them directly on the client. [02:48] But here's where the magic really starts to happen. When you go to update models, [02:52] you can use that local encoder and decoder [02:56] to incrementally regenerate new documents. [02:59] This makes it really nice to pull down a massive data set, and then as the user's making changes and edits, [03:04] you get to keep that up to date without having to do that round trip and always requiring internet. [03:10] So, EmbeddingGemma is a super awesome encoder and decoder that we have launched, and I really love it. It's about 308 million parameters. It's meant to be run on mobile devices, but

3:25-5:08

[03:25] just because you can do that doesn't mean you can't use it on the server, which is really awesome, including support for things like Cloud Run, where we make it really easy to launch it [03:33] with Ollama. So you can have a nice fallback API when the model isn't downloaded yet and you just want to have this [03:41] kind of ad hoc experience. [03:43] One of the reasons I like using it is it has 768 dimensions, so it has a very significant amount of [03:49] quality for the types of tasks that you can throw at it. [03:53] It's still configurable, and just the whole Gemma family is really awesome. [03:59] But I know a lot of people today have talked about Gemma 3n, and you can totally use that with this, but for this talk it's just going to be on the database side and vector support, [04:09] without LLMs. [04:12] Another cool thing about these models is you can use Transformers.js, which was [04:16] talked about many times today. It allows us to use the CPU and GPU to [04:21] run inference on these encoder models. [04:24] And it also supports the 768-dimension space [04:28] that EmbeddingGemma can use and output for the vectors. [04:33] Here's a code snippet on how you would get this running with EmbeddingGemma. I'm using the ONNX runtime for the EmbeddingGemma version of it, the 300 million parameter option. [04:45] And here we can just create a simple pipeline that uses feature extraction, as well as being able to take that embedder, [04:52] give it the correct task type, which can be query or document or others listed in the documentation. And then we just kind of normalize the vectors before we return them. And since we're on the web, it's important to return it as a Float32Array, because that's what SQLite's also going to expect here.
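The slide's code is not captured in the transcript, so the following is only a rough sketch of the two post-processing steps described here: attaching a task type and normalizing the result into a Float32Array. The helper names `withTaskPrefix` and `l2Normalize` are hypothetical, and the prefix strings are an assumption based on the EmbeddingGemma model card that should be checked against the documentation. The call to the Transformers.js pipeline itself is left out so the sketch stays self-contained.

```typescript
// Hypothetical helpers, not taken from the talk's slides.
type TaskType = "query" | "document";

// Assumed prompt prefixes for retrieval; verify against the model documentation.
function withTaskPrefix(text: string, task: TaskType): string {
  return task === "query"
    ? `task: search result | query: ${text}`
    : `title: none | text: ${text}`;
}

// Scale a raw model output to unit length and return it as a Float32Array.
function l2Normalize(values: ArrayLike<number>): Float32Array {
  let sumOfSquares = 0;
  for (let i = 0; i < values.length; i++) {
    sumOfSquares += values[i] * values[i];
  }
  const norm = Math.sqrt(sumOfSquares);
  const out = new Float32Array(values.length);
  if (norm === 0) return out; // an all-zero vector stays all-zero
  for (let i = 0; i < values.length; i++) {
    out[i] = values[i] / norm;
  }
  return out;
}

// Toy 2-dimensional stand-in for a 768-dimension model output.
const rawOutput = [3, 4];
const queryVector = l2Normalize(rawOutput);
console.log(withTaskPrefix("happy face", "query"), queryVector);
```

Normalizing up front means every stored vector has length 1, so later distance comparisons are not skewed by vector magnitude.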

5:08-6:42

[05:08] for the storage as well as Firestore. [05:14] So like I said, Firestore supports vectors, which is awesome. It makes it really easy to sync. When Firestore first loads into your application, it'll pull down the documents that you have queried for that user. [05:26] And as you make updates, Firestore takes care of all of the work: if you update a single document on the server side, [05:33] it pulls down the incremental patches, and as you make updates, it can send them back up to the server, [05:39] so you don't have to manage any complex sync logic on your side. [05:43] But they also launched vector support, which means you can literally add the vector type directly into those documents, [05:49] keeping it co-located with that user and their collections. [05:53] So here's just a simple snippet of how you might do that in Firestore, using the modular JavaScript SDK. You can just create a Firestore application [06:04] using the app that you initialize. And in this case, it's an emoji application, [06:09] and you have the embedding, which you can then add the doc, and then use the vector type, [06:13] which you can import as well from the SDK. [06:17] So, SQLite, huge fan. There's a really cool project called sqlite-vec. If you're not familiar with it, I definitely suggest you give it a look. [06:26] It allows us to use low-level [06:28] KNN queries directly inside of SQLite by extending the syntax. [06:33] This project has also expanded a lot since the first version. It now has metadata filtering, partitioning, and virtual columns, and so much more.
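To illustrate why the Float32Array matters for storage, here is a small sketch of turning an embedding into raw bytes and back. It assumes the vector is handed to SQLite as a blob of raw float32 values (4 bytes per dimension), which is my understanding of the compact binary form sqlite-vec accepts. The helper names `toBlob` and `fromBlob` are hypothetical and do not come from the talk.

```typescript
// Hypothetical helpers for moving an embedding in and out of blob storage.

// Copy the float32 values into a standalone byte array (4 bytes per dimension).
function toBlob(vec: Float32Array): Uint8Array {
  return new Uint8Array(
    vec.buffer.slice(vec.byteOffset, vec.byteOffset + vec.byteLength)
  );
}

// Reinterpret stored bytes as float32 values again.
function fromBlob(blob: Uint8Array): Float32Array {
  if (blob.byteLength % 4 !== 0) {
    throw new Error("blob length must be a multiple of 4 bytes");
  }
  // Copy first so the float view starts on an aligned offset of its own buffer.
  const copy = blob.slice();
  return new Float32Array(copy.buffer, copy.byteOffset, copy.byteLength / 4);
}

// A 768-dimension vector, the size mentioned in the talk, takes 3072 bytes.
const embedding = new Float32Array(768);
embedding[0] = 0.5;
const stored = toBlob(embedding);
console.log(stored.byteLength, fromBlob(stored)[0]);
```

The same round trip applies whether the bytes live in a regular table column or are passed as a query parameter.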

6:42-8:14

[06:42] But this allows us to create those embeddings directly in SQLite. Now, you can also store the blobs of the float32 directly inside of [06:54] regular tables, but one of the cool things about the virtual tables is they're optimized for those queries, so it doesn't have to do a full table scan every time you do a query. [07:06] Also, SQLite compiles to WASM, and you can add any extensions that you have inside of that. So in this example that I'm going to share on GitHub later, [07:16] it has sqlite-vec pre-installed, but you can totally add your custom ones as well. [07:22] So here's an example of how you might do that in SQLite. We're importing the official SQLite package here from sqlite.org, as well as just pulling down the WASM module. [07:34] You can just create it like another table using the vec0 table syntax. And this allows us to have that float [07:41] 768-dimension syntax, and you would obviously change this for the type of encoder and decoder you're using. [07:48] But that's it. You just work with it like a normal SQLite database, if you're familiar with that. [07:53] But this is all happening on the client. It can do massive data sets. It's often that you can run millions of queries in just like a second on the browser. So, [08:02] definitely suggest giving it a look. [08:05] So when it comes to querying, it's also very similar to SQL. I know this may not be familiar for everyone, but as a mobile developer and

8:14-9:45

[08:14] someone who likes to build applications, writing SQL queries on the client, knowing it's just the data set, makes it really easy to create the types of views that I want. And in this case, I just query from the emojis embedding table, [08:26] I join on that foreign key, and then here's where the magic comes in with the MATCH keyword, which is using the vec0 functions as well as the KNN queries with the limit. [08:36] And then we can order it by the distance and then [08:39] grab it out and present it to the user later. [08:42] So, time for a demo. This is a little bit different take than the other demo from earlier, which was about using EmbeddingGemma for [08:52] emojis. I want to create a better [08:55] vector search for emojis. So I took the entire Unicode data set and I vectorized each of the descriptions with the emoji. So as you're typing, [09:05] it returns the emojis that are closest in that [09:08] embedding space based on what your query is, and each time, [09:12] on key press, it will actually vectorize the query itself. So this model, once it gets downloaded onto the browser, this can happen completely offline. So you can obviously expand this to other applications, where you can have [09:27] documents that you pull in for your business data or just specific types of tool calls. Like for example, you can vectorize a thousand tool definitions and only provide [09:38] maybe five to the model at any given time. [09:41] It really opens up and expands the types of use cases that you can build.
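As a way to see what the match, order-by-distance, and limit steps compute, here is a brute-force sketch in plain TypeScript. It is not how sqlite-vec works internally and it is not the demo's code. It uses cosine distance as a stand-in metric, and the `Item`, `Match`, `cosineDistance`, and `nearest` names are hypothetical. The same shape applies to the tool-selection idea: embed every tool description once, then keep only the few closest to the current query.

```typescript
// Hypothetical types and helpers; a brute-force stand-in for a KNN query.
interface Item {
  id: string;
  embedding: Float32Array;
}

interface Match {
  id: string;
  distance: number;
}

// Cosine distance: 0 means same direction, larger means less similar.
function cosineDistance(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 1; // treat zero vectors as unrelated
  return 1 - dot / Math.sqrt(normA * normB);
}

// Score every item, sort by ascending distance, and keep the k closest.
function nearest(query: Float32Array, items: Item[], k: number): Match[] {
  return items
    .map((item) => ({ id: item.id, distance: cosineDistance(query, item.embedding) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k);
}

// Toy 2-dimensional vectors standing in for 768-dimension embeddings.
const catalog: Item[] = [
  { id: "dog", embedding: new Float32Array([1, 0]) },
  { id: "cat", embedding: new Float32Array([0.9, 0.1]) },
  { id: "car", embedding: new Float32Array([0, 1]) },
];
console.log(nearest(new Float32Array([1, 0.05]), catalog, 2));
```

A full scan like this is fine for a few thousand items; the virtual table approach described in the talk exists to avoid scanning everything on each query.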

9:45-10:01

[09:45] This code is available on my GitHub. You can check it out at Emojisearch. I am usually pretty available on GitHub and Twitter and LinkedIn, so definitely feel free to reach out, but thanks so much. [10:00] Thank you.
