Quick Start
This is the smallest normal usage pattern:
from vajra import VajraStreamer, StreamConfig

config = StreamConfig(
    auth_token="hf_...",  # Required for gated Hugging Face models.
    chunk_size_mb=64,
    chunk_workers=16,
    gpu_workers=3,
    disable_cache=False,
    log_level=4,  # 4 = info
)

with VajraStreamer(config) as streamer:
    tensors = streamer.load("Qwen/Qwen2.5-0.5B-Instruct")
    for name, tensor in tensors.items():
        print(name, tensor.shape, tensor.dtype, tensor.device)
What This Does
streamer.load(...) resolves the Hugging Face repo, finds its .safetensors files, streams them through the native library, allocates GPU memory, and returns a dictionary that maps each tensor name to a CUDA tensor.
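Because the result is a plain name-to-tensor dictionary, it can be inspected with ordinary Python. The helper below is a minimal sketch, not part of Vajra: `summarize_by_dtype` is a hypothetical name, and it assumes the returned tensors expose PyTorch-style `numel()` and `element_size()` methods, which this page does not confirm.

```python
from collections import defaultdict


def summarize_by_dtype(tensors):
    """Return {dtype_name: (tensor_count, total_bytes)} for a name -> tensor mapping.

    Assumption: each tensor has PyTorch-style numel() and element_size().
    """
    totals = defaultdict(lambda: [0, 0])
    for tensor in tensors.values():
        entry = totals[str(tensor.dtype)]
        entry[0] += 1  # number of tensors with this dtype
        entry[1] += tensor.numel() * tensor.element_size()  # bytes used
    return {dtype: (count, size) for dtype, (count, size) in totals.items()}
```

Call it inside the with block, for example `print(summarize_by_dtype(tensors))`, to get a quick view of how much memory each dtype occupies.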
Keep Tensor Work Inside with
The returned tensors point at GPU memory owned by the native library. When the with block exits, VajraStreamer frees that memory.
with VajraStreamer(config) as streamer:
    tensors = streamer.load("owner/model")
    first_tensor = next(iter(tensors.values()))
    print(first_tensor.shape)
If you need a tensor after the with block, clone it while the block is still active:
with VajraStreamer(config) as streamer:
    tensors = streamer.load("owner/model")
    copied = next(iter(tensors.values())).clone()
# `copied` owns separate memory now.
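The ownership rule can be illustrated without a GPU. The toy below is only an analogy and says nothing about how Vajra is implemented: `ToyOwner` is a hypothetical class that hands out `memoryview` objects over a buffer it owns and releases them on exit, so a view becomes unusable after the with block while a copy made inside it survives.

```python
class ToyOwner:
    """Hypothetical stand-in for a loader that owns its buffers (not Vajra code)."""

    def __enter__(self):
        self._buffer = bytearray(b"weights")
        self._views = []
        return self

    def load(self):
        # Hand out a view that points at memory owned by this object.
        view = memoryview(self._buffer)
        self._views.append(view)
        return view

    def __exit__(self, exc_type, exc, tb):
        # Simulate freeing: every view handed out earlier becomes invalid.
        for view in self._views:
            view.release()
        self._buffer = None
        return False


with ToyOwner() as owner:
    view = owner.load()
    copied = bytes(view)  # Analogous to .clone(): the copy owns its own memory.

print(copied)  # The copy is still usable here.
try:
    view[0]
except ValueError:
    print("view is no longer valid")
```

Python raises a clear error for a released `memoryview`. Freed GPU memory is unlikely to fail this politely, which is why the copy has to be made while the block is still active.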
For the full memory rule, read Tensors and Lifetime.