Build a content recommendation API using vector search
In this guide you'll learn how to calculate vector embeddings using HuggingFace models and use Tinybird to perform vector search to find similar content based on vector distances.
GitHub Repository
TL;DR
In this tutorial, you will learn how to:
- Use Python to fetch content from an RSS feed
- Calculate vector embeddings on long form content (blog posts) using SentenceTransformers in Python
- Post vector embeddings to a Tinybird Data Source using the Tinybird Events API
- Write a dynamic SQL query to calculate the closest content matches to a given blog post based on vector distances
- Publish your query as an API and integrate it into a frontend application
Prerequisites
To complete this tutorial, you'll need:
- A free Tinybird account
- An empty Tinybird Workspace
- Python >= 3.8
This tutorial does not include a frontend, but we provide an example snippet below on how you might integrate the published API into a React frontend.
1. Setup
Clone the demo_vector_search_recommendation repo. We'll use the repository as reference throughout this tutorial.
Authenticate the Tinybird CLI using your user admin token from your Tinybird Workspace:
cd tinybird
tb auth --token $USER_ADMIN_TOKEN
2. Fetch content and calculate embeddings
In this tutorial we fetch blog posts from the Tinybird Blog using the Tinybird Blog RSS feed. You can use any rss.xml feed to fetch blog posts and calculate embeddings from their content.
You can fetch and parse the RSS feed using the feedparser library in Python, get a list of posts, and then fetch each post and parse the content with the BeautifulSoup library.
Once you've fetched each post, you can calculate an embedding using the HuggingFace sentence_transformers library. In this demo, we use the all-MiniLM-L6-v2 model, which maps sentences and paragraphs to a 384-dimensional dense vector space. You can browse other models here.
You can achieve this and the following step (fetch posts, calculate embeddings, and send them to Tinybird) by running load.py from the code repository. We walk through the function of that script below so you can understand how it works.
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer
import datetime
import feedparser
import requests
import json

# One retrieval timestamp for the whole run; the Data Source uses it as the row version
timestamp = datetime.datetime.now().isoformat()

url = "https://www.tinybird.co/blog-posts/rss.xml"  # Update to your preferred RSS feed
feed = feedparser.parse(url)

model = SentenceTransformer("all-MiniLM-L6-v2")

posts = []
for entry in feed.entries:
    # Fetch the post and parse its HTML
    doc = BeautifulSoup(requests.get(entry.link).content, features="html.parser")
    # Skip pages that have no element with id="content"
    if (content := doc.find(id="content")):
        # encode() on a one-item list returns one embedding row
        embedding = model.encode([content.get_text()])
        posts.append(json.dumps({
            "timestamp": timestamp,
            "title": entry.title,
            "url": entry.link,
            # Average over the rows to get a flat list of floats
            "embedding": embedding.mean(axis=0).tolist()
        }))
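The script above passes each post's full text to the model in a single call. Sentence-embedding models generally truncate input beyond a maximum sequence length, so for very long posts one possible variation is to split the text into chunks, embed each chunk, and average the resulting vectors. The sketch below shows only the chunking and averaging steps in plain Python. The names chunk_words and mean_pool and the 200-word default are illustrative assumptions, not part of load.py.

```python
from typing import List


def chunk_words(text: str, max_words: int = 200) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be >= 1")
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def mean_pool(vectors: List[List[float]]) -> List[float]:
    """Average equal-length vectors element by element."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

With the model from the script, passing model.encode(chunk_words(text)) through the existing .mean(axis=0) call should produce the same kind of averaged vector, since that call already averages over rows. Whether chunking improves recommendations for a given feed is something to verify empirically.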
3. Post content metadata and embeddings to Tinybird
Once you've calculated the embeddings, you can push them along with the content metadata to Tinybird using the Events API.
First, set up some environment variables for your Tinybird host and a token with DATASOURCES:APPEND scope (the Python script below reads them as TB_HOST and TB_APPEND_TOKEN):
export TB_HOST=your_tinybird_host
export TB_APPEND_TOKEN=your_tinybird_token
Next, you'll need to set up a Tinybird Data Source to receive your data. Note that if the Events API doesn't find a Tinybird Data Source by the supplied name, it will create one. But since we want control over our schema, we're going to create an empty Data Source first.
In the tinybird/datasources folder of the repository, you'll find a posts.datasource file that looks like this:
SCHEMA >
    `timestamp` DateTime `json:$.timestamp`,
    `title` String `json:$.title`,
    `url` String `json:$.url`,
    `embedding` Array(Float32) `json:$.embedding[:]`

ENGINE ReplacingMergeTree
ENGINE_PARTITION_KEY ""
ENGINE_SORTING_KEY title, url
ENGINE_VER timestamp
This Data Source will receive the updated post metadata and calculated embeddings and deduplicate based on the most up-to-date retrieval. The ReplacingMergeTree engine is used to deduplicate, relying on the ENGINE_VER setting, which in this case is set to the timestamp column. This tells the engine that the versioning of each entry is based on the timestamp column, and only the entry with the latest timestamp will be kept in the Data Source.
The Data Source has the title column as its primary sorting key, because we will be filtering by title to retrieve the embedding for the current post. Having title as the primary sorting key makes that filter more performant.
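To make the deduplication rule concrete, here is a small plain-Python sketch of the outcome that the engine settings describe: for each (title, url) sorting key, only the row with the latest timestamp survives. It models the end result only, not how or when the engine merges rows. The helper name latest_by_key and the sample rows are hypothetical.

```python
from typing import Dict, List, Tuple


def latest_by_key(rows: List[dict]) -> List[dict]:
    """Keep, per (title, url) key, the row with the greatest timestamp."""
    latest: Dict[Tuple[str, str], dict] = {}
    for row in rows:
        key = (row["title"], row["url"])
        # ISO 8601 timestamps in the same format compare correctly as strings
        if key not in latest or row["timestamp"] >= latest[key]["timestamp"]:
            latest[key] = row
    return list(latest.values())


# Made-up rows: "Post A" was fetched twice, "Post B" once
rows = [
    {"timestamp": "2024-01-01T00:00:00", "title": "Post A", "url": "/a", "embedding": [0.1]},
    {"timestamp": "2024-01-02T00:00:00", "title": "Post A", "url": "/a", "embedding": [0.2]},
    {"timestamp": "2024-01-01T00:00:00", "title": "Post B", "url": "/b", "embedding": [0.3]},
]
deduplicated = latest_by_key(rows)
```

In the database, replacement happens during background merges, so older versions of a row can remain visible for a while after new data arrives. That is why the query in the next step reads from posts with FINAL.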
Push this Data Source to Tinybird:
cd tinybird
tb push datasources/posts.datasource
Then, you can use a Python script to push the post metadata and embeddings to the Data Source using the Events API:
import os
import requests

TB_APPEND_TOKEN = os.getenv("TB_APPEND_TOKEN")
TB_HOST = os.getenv("TB_HOST")

def send_posts(posts):
    params = {
        "name": "posts",
        "token": TB_APPEND_TOKEN
    }
    data = "\n".join(posts)  # ndjson
    r = requests.post(f"{TB_HOST}/v0/events", params=params, data=data)
    print(r.status_code)

send_posts(posts)
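The script above sends every post in a single request. If a feed is large, one option is to split the NDJSON lines into several smaller request bodies. The sketch below groups lines so that each body stays under a byte budget. The helper name batch_ndjson and the max_bytes parameter are illustrative assumptions; check the Events API documentation for the actual request size limit.

```python
from typing import Iterator, List


def batch_ndjson(lines: List[str], max_bytes: int) -> Iterator[str]:
    """Yield newline-joined bodies whose UTF-8 size is at most max_bytes."""
    batch: List[str] = []
    size = 0
    for line in lines:
        line_size = len(line.encode("utf-8"))
        if line_size > max_bytes:
            raise ValueError("a single line exceeds max_bytes")
        # +1 accounts for the newline that joins this line to the previous one
        extra = line_size + (1 if batch else 0)
        if batch and size + extra > max_bytes:
            # Current batch is full: emit it and start a new one
            yield "\n".join(batch)
            batch, size = [], 0
            extra = line_size
        batch.append(line)
        size += extra
    if batch:
        yield "\n".join(batch)
```

Each yielded body could then be posted in place of the single data string in send_posts, checking the status code of every request.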
To keep embeddings up to date, you should retrieve new content on a schedule and push it to Tinybird. In the repository, you'll find a GitHub Action called tinybird_recommendations.yml that fetches new content from the Tinybird blog every 12 hours and pushes it to Tinybird. The Tinybird Data Source in this project uses a ReplacingMergeTree to deduplicate blog post metadata and embeddings as new data arrives.
4. Calculate distances in SQL using Tinybird Pipes
If you've completed the steps above, you should have a posts Data Source in your Tinybird Workspace containing the last fetched timestamp, title, url, and embedding for each blog post fetched from your RSS feed.
You can verify that you have data from the Tinybird CLI with:
tb sql 'SELECT * FROM posts'
This tutorial includes a single-node SQL Pipe to calculate the vector distance from each post to a specific post supplied as a query parameter. The Pipe config is contained in the similar_posts.pipe file in the tinybird/pipes folder, and the SQL is copied below for reference and explanation.
%
WITH (
    SELECT embedding
    FROM (
        SELECT 0 AS id, embedding
        FROM posts
        WHERE title = {{ String(title) }}
        ORDER BY timestamp DESC
        LIMIT 1

        UNION ALL

        SELECT 999 AS id, arrayWithConstant(384, 0.0) embedding
    )
    ORDER BY id
    LIMIT 1
) AS post_embedding
SELECT
    title,
    url,
    L2Distance(embedding, post_embedding) similarity
FROM posts FINAL
WHERE title <> {{ String(title) }}
ORDER BY similarity ASC
LIMIT 10
This query first fetches the embedding of the requested post, and returns an array of 0s in the event an embedding can't be fetched. It then calculates the Euclidean vector distance between each additional post and the specified post using the L2Distance() function, sorts them by ascending distance, and limits to the top 10 results. Note that the similarity column holds a distance, so lower values mean closer matches.
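For readers who want to see the ranking logic outside SQL, the following plain-Python sketch mirrors what the query computes: a Euclidean distance from the requested post's embedding to every other post, a zero-vector fallback when the title is unknown, and the closest matches first. The function names and the tiny 2-dimensional sample vectors are illustrative assumptions; the real embeddings have 384 dimensions.

```python
import math
from typing import Dict, List, Tuple


def l2_distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similar_posts(posts: Dict[str, List[float]], title: str, limit: int = 10) -> List[Tuple[str, float]]:
    """Rank every other post by ascending distance to the post named title."""
    if not posts:
        return []
    dim = len(next(iter(posts.values())))
    # Fall back to a zero vector when the title is not found, as the SQL does
    target = posts.get(title, [0.0] * dim)
    ranked = [(other, l2_distance(vec, target)) for other, vec in posts.items() if other != title]
    ranked.sort(key=lambda pair: pair[1])
    return ranked[:limit]


# Made-up titles and vectors for illustration only
posts = {
    "Intro to vectors": [1.0, 0.0],
    "Vector search in SQL": [0.8, 0.6],
    "Company news": [0.0, 1.0],
}
closest = similar_posts(posts, "Intro to vectors", limit=2)
```

Here "Vector search in SQL" ranks ahead of "Company news" because its vector lies closer to the target. If the stored embeddings are unit-normalized, ranking by Euclidean distance gives the same order as ranking by cosine similarity, so the choice between the two mostly matters for unnormalized vectors.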
You can push this Pipe to your Tinybird server with:
cd tinybird
tb push pipes/similar_posts.pipe
When you push it, Tinybird will automatically publish it as a scalable, dynamic REST API Endpoint that accepts a title query parameter.
You can test your API Endpoint with cURL. First, create an environment variable with a token that has PIPES:READ scope for your Pipe. You can get this token from your Workspace UI or in the CLI with tb token commands.
export TB_READ_TOKEN=your_read_token
Then request your Endpoint:
curl --compressed -H "Authorization: Bearer $TB_READ_TOKEN" -G "https://api.tinybird.co/v0/pipes/similar_posts.json" --data-urlencode "title=Some blog post title"
You will get a JSON object containing the 10 most similar posts to the post whose title you supplied in the request.
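As a complement to the cURL call, here is a hedged Python sketch that builds the same request URL with the title safely URL-encoded and extracts titles and URLs from a response body. It assumes the result rows sit under a data key in the response JSON, which is also what the frontend snippet in the next step reads. The helper names build_url and parse_posts are hypothetical, and the sample response is made up for illustration.

```python
import json
from typing import List, Tuple
from urllib.parse import urlencode


def build_url(host: str, title: str) -> str:
    """Build the similar_posts Endpoint URL with an encoded title parameter."""
    query = urlencode({"title": title})
    return f"{host.rstrip('/')}/v0/pipes/similar_posts.json?{query}"


def parse_posts(body: str) -> List[Tuple[str, str]]:
    """Return (title, url) pairs from a JSON response body."""
    payload = json.loads(body)
    return [(row["title"], row["url"]) for row in payload.get("data", [])]


url = build_url("https://api.tinybird.co", "Some blog post title")

# Made-up response body used only to exercise parse_posts
sample = '{"data": [{"title": "Another post", "url": "https://example.com/blog/another-post", "similarity": 0.42}]}'
pairs = parse_posts(sample)
```

To send the request, the URL could be passed to requests.get together with the same Authorization: Bearer header used in the cURL example.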
5. Integrate into the frontend
Integrating your vector search API into the frontend is relatively straightforward, as it's just a RESTful Endpoint. Here's an example implementation (pulled from the actual code used to fetch related posts in the Tinybird Blog):
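// `host` and `token` are assumed to be defined elsewhere (for example, in environment config),
// and `getPost` is the blog's own helper for loading a post by slug.
export async function getRelatedPosts(title: string) {
  // Encode the title so spaces and special characters survive in the query string
  const recommendationsUrl = `${host}/v0/pipes/similar_posts.json?token=${token}&title=${encodeURIComponent(title)}`;
  const recommendationsResponse = await fetch(recommendationsUrl).then(
    function (response) {
      return response.json();
    }
  );

  if (!recommendationsResponse.data) return;

  return Promise.all(
    recommendationsResponse.data.map(async ({ url }) => {
      const slug = url.split("/").pop();
      return await getPost(slug);
    })
  ).then((data) => data.filter(Boolean));
}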
6. See it in action
You can see how this looks by checking out any blog post in the Tinybird Blog. At the bottom of each post, you'll find a Related Posts section that's powered by a real Tinybird API using the method described here!
Next steps
- Read more about vector search and content recommendation use cases.
- Join the Tinybird Slack Community for additional support.