Zarr 3 and Icechunk: Git-Like Version Control for Planetary-Scale Arrays
The Zarr 3 specification and Icechunk storage engine are transforming how we manage multi-dimensional scientific data. Transactional semantics meet cloud-native arrays.
Scientific computing has long struggled with a fundamental tension: the need to analyze massive multi-dimensional arrays (satellite imagery, climate simulations, genomic sequences) while maintaining data integrity across distributed teams. Zarr 3 and Icechunk represent a paradigm shift—bringing cloud-native design, transactional semantics, and Git-like version control to petabyte-scale array data.
The Evolution from Zarr 2 to Zarr 3
Zarr emerged as the answer to a critical problem: how do you efficiently store and access chunked, compressed N-dimensional arrays on cloud object storage? The v2 specification proved transformative for organizations like NASA and NOAA, but years of production use revealed opportunities for improvement—better async support, more flexible codecs, and sharding for workloads with many small chunks.
Zarr-Python 3, released in January 2025, isn't just an incremental update. The library was fundamentally rewritten around an asynchronous core, enabling concurrent I/O operations that dramatically improve performance on high-latency cloud storage backends. Where Zarr 2 issues sequential requests, Zarr 3 can saturate network bandwidth by parallelizing chunk fetches, cutting latency most visibly when data lives on cloud object storage.
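As a minimal sketch of what the async core enables, here is a concurrent read using Zarr-Python 3's zarr.api.asynchronous module (the store path is illustrative; most applications can stay on the synchronous API, which drives the same async core internally):

import asyncio
import zarr.api.asynchronous as azarr

async def fetch_tiles():
    # Open lazily; returns an AsyncArray (assumes data/example.zarr exists)
    arr = await azarr.open_array(store="data/example.zarr")
    # Fetch two regions concurrently instead of one request after another
    return await asyncio.gather(
        arr.getitem((slice(0, 100), slice(0, 100))),
        arr.getitem((slice(0, 100), slice(100, 200))),
    )

tiles = asyncio.run(fetch_tiles())

On high-latency object storage, overlapping requests like this is where the 2-5× speedups discussed later come from.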
Zarr 3: Key Architectural Changes
Core Improvements
- Fully asynchronous core: built on Python's asyncio for concurrent I/O operations, dramatically faster on cloud storage
- Sharding support: groups multiple chunks into single storage objects, improving performance on small-chunk workloads
- Extensible store ABC: a new abstract base class for custom storage backends (S3, GCS, Azure, or your own)
- Custom codec pipelines: Python entry points for defining compression and transformation codecs
- Multi-scale features: native support for pyramids and overviews in array hierarchies
The sharding feature deserves special attention. When working with large arrays and small chunks—common in time-series satellite data—Zarr 2's one-file-per-chunk model overwhelms file systems and object stores. Sharding groups multiple chunks into single storage objects, reducing metadata overhead by orders of magnitude while maintaining random access to individual chunks.
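In Zarr-Python 3, sharding is a property of the array itself. A minimal sketch using zarr.create_array (path and shapes are illustrative):

import zarr

# One shard object on disk holds 10×10 = 100 chunks
arr = zarr.create_array(
    store="data/example.zarr",
    shape=(10_000, 10_000),
    chunks=(100, 100),      # unit of compression and random access
    shards=(1_000, 1_000),  # unit of storage: one object per shard
    dtype="float32",
)
arr[:100, :100] = 1.0  # a write touches only the affected shard

The store sees 100 objects instead of 10,000, while readers can still decode any single 100×100 chunk without pulling the whole shard.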
Working with Zarr 3 and Xarray
For most users, the path to Zarr 3 runs through Xarray. The integration is seamless—Xarray's open_zarr function works with both v2 and v3 stores, automatically detecting the format version and handling the complexity of chunked, lazy loading.
import xarray as xr

# Open dataset with Zarr-Python 3
ds = xr.open_zarr(
    "s3://climate-data/era5.zarr",
    storage_options={"anon": True},
    consolidated=True
)

# Lazy loading—only metadata fetched
print(ds)  # Shows dimensions, coords, variables

# Compute on specific region—only needed chunks loaded
subset = ds.temperature.sel(
    lat=slice(40, 50),
    lon=slice(-10, 10),
    time="2024"
).mean(dim="time")

result = subset.compute()  # Triggers actual I/O

The lazy evaluation model is crucial for cloud-native workflows. Opening a multi-terabyte dataset takes milliseconds—only metadata is fetched. Computation triggers just the chunk I/O needed for the specific spatial and temporal subset requested.
Icechunk: Transactional Storage for Zarr
Zarr solves the storage format problem, but production data lakes face additional challenges: concurrent writers, data consistency, and the ability to roll back failed pipelines. Icechunk addresses these gaps by layering transactional semantics on top of Zarr 3.
Think of Icechunk as Git for datasets. When you read from an Icechunk store, you're checking out a specific snapshot. Even if others commit changes afterward, your view stays fixed until you explicitly pull updates. This isolation is transformative for reproducible science—analyses can reference exact dataset versions.
Icechunk Capabilities
- Transactional commits — All changes committed atomically, no partial writes
- Snapshot isolation — Read from fixed points in history while others write
- Branch management — Multiple branches can coexist, like Git repositories
- Conflict detection — Automatic detection when concurrent writers modify same data
- Time travel queries — Access any historical state of your dataset
import icechunk
import numpy as np
import zarr

# Create an Icechunk repository on S3 (credentials read from the environment)
storage = icechunk.s3_storage(
    bucket="my-data-lake",
    prefix="climate-analysis",
    from_env=True
)
repo = icechunk.Repository.create(storage)

# All writes happen inside a session on a branch
session = repo.writable_session("main")
root = zarr.group(store=session.store)
root.create_array("temperature", shape=(1000, 180, 360), dtype="float32")

# Commit changes atomically with a message
snapshot_id = session.commit("Initial temperature array")

# Later: create a branch for experimental analysis
repo.create_branch("experimental", snapshot_id=snapshot_id)
session = repo.writable_session("experimental")
new_data = np.random.rand(180, 360)  # placeholder calibration values
zarr.open_group(store=session.store)["temperature"][0] = new_data
session.commit("Experimental calibration")

# Reads from main are isolated; the original data is untouched
main = repo.readonly_session(branch="main")

Production Readiness: Icechunk 1.0
Icechunk 1.0, released in July 2025, marks the transition from experimental to production-ready. The stability guarantees are explicit: data written by Icechunk 1.0 and greater will forever be readable by future versions. For organizations building long-lived data archives, this commitment is essential.
The underlying implementation is sophisticated. Icechunk maintains a manifest that tracks chunk locations across snapshots, enabling efficient storage through deduplication. When you modify a single chunk, only that chunk is rewritten—the rest of the dataset shares storage with previous snapshots.
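This is also what makes the time-travel queries listed above cheap. A minimal sketch against the repository created earlier, using Icechunk 1.0's ancestry and readonly_session APIs (the snapshot ID is a placeholder):

import icechunk

storage = icechunk.s3_storage(
    bucket="my-data-lake",
    prefix="climate-analysis",
    from_env=True
)
repo = icechunk.Repository.open(storage)

# Walk the commit history on main, newest first
for snapshot in repo.ancestry(branch="main"):
    print(snapshot.id, snapshot.message, snapshot.written_at)

# Pin an analysis to one exact historical state
session = repo.readonly_session(snapshot_id="...")  # an ID from the listing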
Ecosystem Adoption
Zarr has achieved remarkable adoption across Earth science organizations:
- NASA — Earth observation data products and climate archives
- NOAA — Weather model outputs and forecast ensembles
- ECMWF — Global atmospheric reanalysis datasets
- Copernicus — Sentinel satellite data distribution
- Pangeo — Community-driven cloud-native geoscience
The CMIP6 climate model archive, one of the largest scientific datasets ever assembled, is increasingly distributed in Zarr format. Cloud-optimized versions enable researchers to query specific variables, time periods, and regions without downloading complete model outputs.
VirtualiZarr: Bridging Legacy Archives
Not every organization can re-process their archives to native Zarr. VirtualiZarr offers an alternative: create virtual Zarr stores that reference chunks within existing NetCDF, HDF5, or GRIB files. The virtual references enable Zarr tooling to access legacy data without duplication.
This approach is particularly valuable for space agencies with decades of satellite archives. Data remains in place; only lightweight reference manifests are created. Users get the benefits of Zarr's chunked access patterns on data originally designed for download-and-process workflows.
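A minimal sketch of the workflow (the file names are hypothetical, and the writer accessor has shifted between VirtualiZarr releases; this follows the open_virtual_dataset / to_kerchunk pattern from its documentation):

import virtualizarr

# Scan a legacy NetCDF file and record byte ranges of its chunks—no data copied
vds = virtualizarr.open_virtual_dataset("sst_2023.nc")

# Persist the lightweight reference manifest; Zarr readers can then
# pull chunks straight from the original file
vds.virtualize.to_kerchunk("sst_2023_refs.json", format="json")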
Performance Characteristics
Benchmarks from the Zarr development team show significant improvements in v3. Asynchronous I/O provides 2-5× speedups on cloud storage compared to synchronous v2 access patterns. Sharding reduces metadata operations by 10-100× for small-chunk workloads.
Icechunk adds minimal overhead—typically single-digit percentages for write operations. The manifest updates are designed to be append-only, with compaction happening in the background. Read performance is essentially unchanged from raw Zarr, as chunk data is stored identically.
Our Perspective
For organizations managing scientific array data, the Zarr 3 + Icechunk combination represents the current state of the art. The cloud-native design aligns with where infrastructure is heading; the transactional semantics solve real problems we've encountered in production pipelines.
The migration path is clear: existing Zarr v2 stores can be accessed by Zarr-Python 3 (v2 compatibility is maintained), new stores should use v3 format, and teams requiring versioning should evaluate Icechunk. The ecosystem is mature enough for production use while remaining under active development.
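As a sketch of that path (paths and the array name are illustrative; zarr_format selects the on-disk format in Zarr-Python 3):

import zarr

# Zarr-Python 3 reads existing v2 stores transparently
old = zarr.open_group("data/legacy-v2.zarr", mode="r")

# New arrays should request the v3 format explicitly
new = zarr.create_array(
    store="data/new-v3.zarr",
    shape=old["temperature"].shape,
    dtype=old["temperature"].dtype,
    zarr_format=3,
)
new[:] = old["temperature"][:]  # naive copy; stream chunk-by-chunk at scale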
What excites us most is the convergence with other cloud-native formats. GeoParquet for vector data, COG for raster imagery, and Zarr for multi-dimensional arrays—together they form a coherent stack for modern geospatial infrastructure. The days of proprietary file formats and monolithic GIS databases are numbered.
References & Further Reading
Zarr-Python 3 Release Announcement
Official release notes for Zarr-Python 3
https://zarr.dev/blog/zarr-python-3-release/
Zarr Core Specification v3
Complete Zarr v3 specification documentation
https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html
Icechunk: Transactions and Version Control
Icechunk documentation on transactional semantics
https://icechunk.io/en/latest/version-control/
Earthmover: Why Teams That Use Zarr Need Icechunk
Production use cases for Icechunk
https://earthmover.io/blog/multi-player-mode-why-teams-that-use-zarr-need-icechunk/
Cloud-Optimized Geospatial Formats Guide: Zarr
Comprehensive guide to Zarr for geospatial applications
https://guide.cloudnativegeo.org/zarr/intro.html
VirtualiZarr Documentation
Virtual references for existing archival data
https://virtualizarr.readthedocs.io/en/latest/index.html