Using hf tokenizers in Rust
Example of how to use Tokenizers from Huggingface in Rust
· 2 min read
The tokenizers library from Hugging Face provides an efficient way to work with text tokenization in Rust. This guide shows you how to get started with pretrained tokenizers.
Setup #
First, add the tokenizers crate to your project:
cargo add tokenizers --features http,hf-hub
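This adds a dependency entry like the following to your Cargo.toml (the version shown is illustrative; use whatever cargo resolves):

```toml
[dependencies]
tokenizers = { version = "0.21", features = ["http", "hf-hub"] }
```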
Basic Usage #
Here’s a complete example that loads a pretrained tokenizer and processes text:
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    // Load a pretrained tokenizer
    let tokenizer = Tokenizer::from_pretrained("hf-internal-testing/llama-tokenizer", None)?;

    let text = "This is a sample string to tokenize";

    // Encode the text (false = no special tokens)
    let encoding = tokenizer.encode(text, false)?;

    // Get token IDs
    let token_ids = encoding.get_ids();
    println!("Token IDs: {:?}", token_ids);

    // Get token text
    let tokens = encoding.get_tokens();
    println!("Tokens: {:?}", tokens);
    println!("Number of tokens: {}", token_ids.len());

    // Decode back to a string and compare with the original
    let decoded = tokenizer.decode(token_ids, true)?;
    println!("Original: {}", text);
    println!("Decoded: {}", decoded);

    Ok(())
}
Working with Different Models #
You can use various pretrained models:
// GPT-2 tokenizer
let gpt_tokenizer = Tokenizer::from_pretrained("gpt2", None)?;
// BERT tokenizer
let bert_tokenizer = Tokenizer::from_pretrained("bert-base-uncased", None)?;
// Llama tokenizer
let llama_tokenizer = Tokenizer::from_pretrained("hf-internal-testing/llama-tokenizer", None)?;
Configuration #
To change the cache directory for downloaded models, set the HF_HOME environment variable:
export HF_HOME=/path/to/your/cache
Setting environment variables programmatically is not recommended: as of the Rust 2024 edition, std::env::set_var is unsafe and requires an unsafe block.
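To see where downloads will land, you can resolve the cache directory the same way the hub client does: HF_HOME wins if set, otherwise it falls back to $HOME/.cache/huggingface. The helper below is a sketch for illustration (the library computes this internally; resolve_cache_dir is not part of its API):

```rust
use std::path::PathBuf;

// Sketch of the assumed lookup order: HF_HOME if set,
// otherwise $HOME/.cache/huggingface.
fn resolve_cache_dir(hf_home: Option<&str>, home: &str) -> PathBuf {
    match hf_home {
        Some(dir) => PathBuf::from(dir),
        None => PathBuf::from(home).join(".cache").join("huggingface"),
    }
}

fn main() {
    let hf_home = std::env::var("HF_HOME").ok();
    let home = std::env::var("HOME").unwrap_or_else(|_| ".".to_string());
    println!("Cache dir: {}", resolve_cache_dir(hf_home.as_deref(), &home).display());
}
```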
Private Repositories #
If you encounter this error:
Error: RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/google/gemma-3-12b-it/resolve/main/tokenizer.json]))
It means you are not authenticated; the repository requires an access token. There are two ways to provide one:
- Write your token to $HF_HOME/token (by default $HOME/.cache/huggingface/token)
- Within Rust code:
use tokenizers::{Tokenizer, FromPretrainedParameters};
let params = FromPretrainedParameters {
    token: Some("<your very secret token>".to_string()),
    ..Default::default()
};
let tokenizer = Tokenizer::from_pretrained("google/gemma-3-4b-it", Some(params))?;
Note that you may still need to get permission to access the repos.
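For the first option, the token file can be written from the shell (this assumes the default cache location; adjust if HF_HOME points elsewhere):

```shell
# Create the cache directory and write the token file the hub client reads.
mkdir -p "${HF_HOME:-$HOME/.cache/huggingface}"
printf '%s' '<your very secret token>' > "${HF_HOME:-$HOME/.cache/huggingface}/token"
```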
Branches #
You can pin a particular branch, tag, or commit via the revision field:
use tokenizers::{Tokenizer, FromPretrainedParameters};
let params = FromPretrainedParameters {
    revision: "main".to_string(), // or a specific commit hash
    ..Default::default()
};
let tokenizer = Tokenizer::from_pretrained("google/gemma-3-4b-it", Some(params))?;
User-Agent #
FromPretrainedParameters also has a user_agent field for customizing the user agent string sent with HTTP requests:
use tokenizers::{Tokenizer, FromPretrainedParameters};
let params = FromPretrainedParameters {
    user_agent: Some("my-rust-app/1.0".to_string()),
    ..Default::default()
};
let tokenizer = Tokenizer::from_pretrained("gpt2", Some(params))?;
Summary #
The Hugging Face tokenizers library provides a robust, production-ready solution for text processing in Rust applications. With support for pretrained models, authentication for private repositories, and flexible configuration options, it’s an excellent choice for NLP workflows in Rust.