Tensors and Lifetime

load() normally returns a dictionary that maps tensor names to CUDA-backed PyTorch tensors:

tensors = streamer.load("owner/model")

for name, tensor in tensors.items():
    print(name, tensor.shape, tensor.dtype, tensor.device)

The keys come from the .safetensors metadata. The values reference GPU memory that the native library allocates and owns.

Zero-Copy Tensors

The returned tensors are zero-copy: PyTorch views memory that already exists instead of copying the tensor bytes into a new allocation.

This keeps peak memory lower, but it also ties each tensor's lifetime to the native allocation behind it.
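The difference between a view and a copy can be illustrated with nothing but the Python standard library. This is only an analogy, not how the library is implemented: a bytearray stands in for the native allocation, a memoryview stands in for a zero-copy tensor, and all names below are made up for the illustration.

```python
# Analogy only: `backing` plays the role of the native allocation,
# `view` plays the role of a zero-copy tensor.
backing = bytearray(8)             # the "native" buffer, all zeros

view = memoryview(backing)         # zero-copy: no bytes are duplicated
copy = bytes(backing)              # copy: a new, independent allocation

backing[0] = 0xFF                  # mutate the original buffer

view_sees_change = view[0] == 0xFF   # the view reads the same memory
copy_sees_change = copy[0] == 0xFF   # the copy kept the old bytes

print(view_sees_change, copy_sees_change)  # prints: True False
```

The view costs no extra memory but is only as valid as the buffer it points at, which is the trade-off described above.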

The Lifetime Rule

Use the returned tensors only inside the with VajraStreamer(...) block.

with VajraStreamer(config) as streamer:
    tensors = streamer.load("owner/model")
    tensor = tensors["model.embed_tokens.weight"]
    print(tensor.shape)

When the with block exits, the native free_model_memory function frees the VRAM arenas and CPU metadata. Any tensor still pointing at that memory becomes invalid.

# Wrong: tensor memory is freed after the with block exits.
with VajraStreamer(config) as streamer:
    tensors = streamer.load("owner/model")

tensor = tensors["model.embed_tokens.weight"]  # dangling pointer

If you need to keep a tensor after the block, copy it before leaving:

with VajraStreamer(config) as streamer:
    tensors = streamer.load("owner/model")
    copied = tensors["model.embed_tokens.weight"].clone()

# `copied` owns separate memory (allocated by PyTorch on the same device), so it is still valid here.
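The same rule can be reproduced in a small standard-library sketch. It is a toy model under stated assumptions, not the library's code: toy_arena is a hypothetical stand-in for the streamer, a bytearray stands in for the VRAM arena, and releasing the memoryviews on exit stands in for free_model_memory.

```python
from contextlib import contextmanager


@contextmanager
def toy_arena(size):
    """Hypothetical stand-in for a streamer that owns native memory."""
    backing = bytearray(size)      # plays the role of the VRAM arena
    views = []                     # every view handed out to the caller

    def load():
        v = memoryview(backing)    # zero-copy view into the arena
        views.append(v)
        return v

    try:
        yield load
    finally:
        # Plays the role of freeing the arena: all views become invalid.
        for v in views:
            v.release()


with toy_arena(4) as load:
    view = load()
    view[0] = 7
    kept = bytes(view)             # copy before leaving, like .clone()

try:
    view[0]                        # use after the block has exited
    use_after_exit_failed = False
except ValueError:
    use_after_exit_failed = True

print(use_after_exit_failed, kept[0])  # prints: True 7
```

In this toy, Python raises a clean ValueError on the stale view. A tensor pointing at freed GPU memory is a dangling pointer and should not be expected to fail this politely, which is why copying before the block exits matters.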
