Quick Start

This is a run-through example of how to use this package. We scan a set of netCDF4/HDF5 files and create a single ensemble, virtual dataset which can be read in parallel from remote storage using zarr.

Single file JSONs

This will create a reference set for each of the files defined in urls. Here we simply keep the resulting reference sets in memory, but we could also have written them to JSON files. Writing to files is useful so that we can later access the individual datasets, or redo the combine (the next step, below), without rescanning the originals; see the sketch after the code block.

import kerchunk.hdf
import fsspec

urls = ["s3://" + p for p in [
    'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010000.CHRTOUT_DOMAIN1.comp',
    'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010100.CHRTOUT_DOMAIN1.comp',
    'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010200.CHRTOUT_DOMAIN1.comp',
    'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010300.CHRTOUT_DOMAIN1.comp',
    'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010400.CHRTOUT_DOMAIN1.comp',
    'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010500.CHRTOUT_DOMAIN1.comp',
    'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010600.CHRTOUT_DOMAIN1.comp',
    'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010700.CHRTOUT_DOMAIN1.comp',
    'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010800.CHRTOUT_DOMAIN1.comp',
    'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010900.CHRTOUT_DOMAIN1.comp'
]]
so = dict(
    # anonymous S3 access; cache only the first block of each file, which
    # is typically where the HDF5 metadata lives
    anon=True, default_fill_cache=False, default_cache_type='first'
)
singles = []
for u in urls:
    with fsspec.open(u, **so) as inf:
        # scan one HDF5 file; chunks smaller than 100 bytes are inlined
        # directly into the reference set
        h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, u, inline_threshold=100)
        singles.append(h5chunks.translate())
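
If you prefer to persist these single-file references instead, a minimal sketch is below; the output filenames, derived here from each URL, are just an illustration.

import json
import os

for u, refs in zip(urls, singles):
    # write each reference set to a local JSON file named after the source file
    fname = os.path.basename(u) + ".json"
    with open(fname, "w") as f:
        json.dump(refs, f)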

Multi-file JSONs

This code uses the output generated above to create a single ensemble dataset, with one set of references pointing to all of the chunks in the individual files.

from kerchunk.combine import MultiZarrToZarr
mzz = MultiZarrToZarr(
    singles,
    remote_protocol="s3",
    remote_options={'anon': True},
    concat_dims=["time"]  # concatenate the single-file datasets along "time"
)

out = mzz.translate()

Again, out could be written to a JSON file by providing arguments to translate(). Crucially, there is no restriction on where this file lives; it can be anywhere that fsspec can read from.
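
For example, a minimal sketch of writing the combined references yourself, using fsspec so that the target can be local or remote (the name combined.json is just an illustration):

import json
import fsspec

# write the combined reference set; any fsspec-supported location would do
with fsspec.open("combined.json", "w") as f:
    json.dump(out, f)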

Using the output

This is what a user of the generated dataset would do. This person does not need to have kerchunk installed, or even h5py (the library we used to initially scan the files).

import xarray as xr
ds = xr.open_dataset(
    "reference://", engine="zarr",
    backend_kwargs={
        "storage_options": {
            "fo": out,
            "remote_protocol": "s3",
            "remote_options": {"anon": True}
        },
        "consolidated": False
    }
)
# do analysis...
ds.velocity.mean()

Since the invocation for xarray to read this data is a little involved, we recommend declaring the dataset in an intake catalog. Alternatively, you can split the command into multiple steps by first constructing the filesystem or mapper (you will see this pattern in some examples, and in the sketch below).
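
As a sketch, the equivalent multi-step form looks like this, building the reference filesystem first and then handing xarray a mapper:

import fsspec
import xarray as xr

# build a virtual filesystem from the in-memory reference set
fs = fsspec.filesystem(
    "reference",
    fo=out,
    remote_protocol="s3",
    remote_options={"anon": True},
)
# expose the references as a key-value mapping and open it as zarr
m = fs.get_mapper("")
ds = xr.open_dataset(m, engine="zarr", consolidated=False)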

Note that, if the combine was done previously and saved to a JSON file, the path to that file should replace out above, along with target_options supplying any extra arguments fsspec might need to access it.
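
For instance, assuming the combined references were saved to a hypothetical location s3://mybucket/combined.json, the call might look like:

import xarray as xr

ds = xr.open_dataset(
    "reference://", engine="zarr",
    backend_kwargs={
        "storage_options": {
            "fo": "s3://mybucket/combined.json",  # hypothetical path to the saved references
            "target_options": {"anon": True},  # options fsspec needs to fetch the JSON itself
            "remote_protocol": "s3",
            "remote_options": {"anon": True}
        },
        "consolidated": False
    }
)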

Example/Tutorial Notebook

A set of tutorial notebooks, presented at the Earth Science Information Partners (ESIP) 2022 Winter Meeting, can be found at the following link, along with links to run the code in free cloud-based notebook environments: https://github.com/lsterzinger/2022-esip-kerchunk-tutorial