High-Performance Model Streamer

Load models faster by overlapping download, RAM staging, and CUDA transfer.

Model Load Time

Hugging Face Rust Loader: 32.04s
Vajra: 20.14s
Performance Delta: 59% Faster

Measured from request start to model weights staged in memory on the same model, machine, and network.

With Vajra, GPU loading began at 3.20s.

Installation & Usage

Install the Python SDK

Install the package with pip, then stream models from your own Python code.

Load Hugging Face .safetensors models directly into PyTorch tensors with one Python call.

$ pip install vajra-streamer

from vajra import VajraStreamer, StreamConfig

# Settings exactly as shown on the original page; the token is elided.
config = StreamConfig(
    auth_token="hf_...",
    chunk_size_mb=64,
    chunk_workers=16,
    gpu_workers=3,
    disable_cache=True,
)

# Hugging Face model repository ID
url = "meta-llama/Meta-Llama-3-8B"

with VajraStreamer(config) as streamer:
    tensors = streamer.load(url)
    print(f"Loaded {len(tensors)} tensors")
    print(tensors["model.layers.0.self_attn.q_proj.weight"].shape)

Benchmarks

Benchmarked On Meta Llama 3 8B

Transfer Metric

Total Weight Transfer Time

Lower is better
[Bar chart: total weight transfer time, axis from 0s to 35s]
hf_transfer: 32.04s
Vajra: 20.14s

Vajra moved 14.96 GB of Llama 3 8B .safetensors weights through the streaming pipeline 59% faster than hf_transfer in this run (20.14s versus 32.04s).

The comparison used hf_transfer, Hugging Face's Rust-backed downloader, enabled with HF_HUB_ENABLE_HF_TRANSFER=1.
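
To make the 59% figure concrete, the short calculation below recomputes it from the two reported times. It is only a sketch: the function names are illustrative, and it assumes both runs moved the same 14.96 GB payload. Read as a speed ratio, 32.04s / 20.14s comes to roughly 59% faster; read as wall-clock time saved, the same two numbers come to roughly 37% less time.

```python
# Figures reported in the benchmark above.
HF_TRANSFER_SECONDS = 32.04
VAJRA_SECONDS = 20.14
PAYLOAD_GB = 14.96  # assumed identical for both runs


def speedup_percent(baseline_s: float, candidate_s: float) -> float:
    """Relative speed gain of the candidate over the baseline, in percent."""
    return (baseline_s / candidate_s - 1.0) * 100.0


def time_saved_percent(baseline_s: float, candidate_s: float) -> float:
    """Share of the baseline's wall-clock time that the candidate saves."""
    return (baseline_s - candidate_s) / baseline_s * 100.0


def throughput_gb_per_s(size_gb: float, seconds: float) -> float:
    """Average throughput over the whole run."""
    return size_gb / seconds


speedup = speedup_percent(HF_TRANSFER_SECONDS, VAJRA_SECONDS)
saved = time_saved_percent(HF_TRANSFER_SECONDS, VAJRA_SECONDS)

print(f"Speed gain: {speedup:.1f}%")
print(f"Time saved: {saved:.1f}%")
print(f"hf_transfer: {throughput_gb_per_s(PAYLOAD_GB, HF_TRANSFER_SECONDS):.2f} GB/s")
print(f"Vajra:       {throughput_gb_per_s(PAYLOAD_GB, VAJRA_SECONDS):.2f} GB/s")
```

Both percentages describe the same measurement; which one to quote depends on whether the reader cares about throughput or about waiting time.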

Streaming Timeline

GPU transfer starts before the download finishes

Traditional Loading Path
Download full model with hf_transfer: 32.04s
GPU work starts after this

Vajra Architecture
Parallel Download & Transfer (Continuous Streaming Pipeline): 20.14s
GPU transfer begins at 3.20s

In this benchmark, Vajra started GPU transfer at 3.20s, while the 14.96 GB download and RAM staging path completed in 20.14s. The hf_transfer download completed in 32.04s.
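
The overlap described above can be illustrated with a small producer/consumer sketch. This is not Vajra's implementation: it uses only the Python standard library, a single download thread and a single transfer thread, and it simulates both the network read and the host-to-device copy with in-memory byte operations. All names and sizes (`run_pipeline`, `CHUNK_COUNT`, `STAGING_SLOTS`) are hypothetical. What it shows is that a bounded staging buffer lets the first transfer begin before the last chunk has been downloaded.

```python
import queue
import threading

CHUNK_COUNT = 8      # hypothetical number of chunks in the model file
CHUNK_SIZE = 1024    # bytes per simulated chunk
STAGING_SLOTS = 2    # capacity of the bounded RAM staging buffer


def run_pipeline(chunk_count=CHUNK_COUNT, chunk_size=CHUNK_SIZE, slots=STAGING_SLOTS):
    staging = queue.Queue(maxsize=slots)
    events = []                      # ordered log of (kind, chunk_index)
    events_lock = threading.Lock()
    device_buffer = bytearray()      # stand-in for GPU memory

    def log(kind, index):
        with events_lock:
            events.append((kind, index))

    def download():
        for i in range(chunk_count):
            chunk = bytes([i % 256]) * chunk_size  # stand-in for a network read
            log("downloaded", i)
            staging.put((i, chunk))  # blocks while the staging buffer is full
        staging.put(None)            # sentinel: no more chunks

    def transfer():
        while True:
            item = staging.get()
            if item is None:
                break
            i, chunk = item
            log("transferred", i)
            device_buffer.extend(chunk)  # stand-in for a host-to-device copy

    threads = [threading.Thread(target=download), threading.Thread(target=transfer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events, bytes(device_buffer)


events, data = run_pipeline()
first_transfer = events.index(("transferred", 0))
last_download = events.index(("downloaded", CHUNK_COUNT - 1))
print("Transfer began before the download finished:", first_transfer < last_download)
```

Because the staging queue holds only two chunks, the download thread has to wait for the transfer thread to drain it, so the two stages are forced to interleave rather than run back to back. A real loader would use several workers per stage, as the `chunk_workers` and `gpu_workers` settings in the usage example suggest.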

Ready to optimize your model loading?

The Python SDK is available on GitHub.