Xarray-Beam is a Python library for building Apache Beam pipelines with Xarray datasets.
The project aims to facilitate data transformations and analysis on large-scale multi-dimensional labeled arrays, such as:
- Ad-hoc computation on Xarray data, by dividing a
xarray.Dataset
into many smaller pieces ("chunks"). - Adjusting array chunks, using the Rechunker algorithm.
- Ingesting large, multi-dimensional array datasets into an analysis-ready, cloud-optimized format, namely Zarr (see also Pangeo Forge).
- Calculating statistics (e.g., "climatology") across distributed datasets with arbitrary groups.
For more about our approach and how to get started, read the documentation!
Warning: Xarray-Beam is a sharp tool 🔪
Xarray-Beam is relatively new, and focused on expert users:
- We use it extensively at Google for processing large-scale weather datasets, but there is not yet a vibrant external community.
- It provides low-level abstractions that facilitate writing very large scale data pipelines (e.g., 100+ TB), but by design it requires explicitly thinking about how every operation is parallelized.
Xarray-Beam requires recent versions of immutabledict, Xarray, Dask, Rechunker, Zarr, and Apache Beam. For best performance when writing Zarr files, use Xarray 0.19.0 or later.
Xarray-Beam is an experiment that we are sharing with the outside world in the hope that it will be useful. It is not a supported Google product. We welcome feedback, bug reports and code contributions, but cannot guarantee they will be addressed.
See the "Contribution guidelines" for more.
Contributors:
- Stephan Hoyer
- Jason Hickey
- Cenk Gazen
- Alex Merose