Quick Start
This is a run-through example of how to use this package. We scan a set of netCDF4/HDF5 files and create a single virtual ensemble dataset, which can be read in parallel from remote storage using zarr.
Single file JSONs
This will create a set of references for each of the files defined in urls. In this case, we simply keep the resultant reference sets in memory, but we could have written them to JSON files (a sketch of this follows the code below). Writing to files is useful so that we can access the individual datasets later, or redo the combine (which is the next step, below).
import kerchunk.hdf
import fsspec

urls = ["s3://" + p for p in [
    'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010000.CHRTOUT_DOMAIN1.comp',
    'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010100.CHRTOUT_DOMAIN1.comp',
    'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010200.CHRTOUT_DOMAIN1.comp',
    'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010300.CHRTOUT_DOMAIN1.comp',
    'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010400.CHRTOUT_DOMAIN1.comp',
    'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010500.CHRTOUT_DOMAIN1.comp',
    'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010600.CHRTOUT_DOMAIN1.comp',
    'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010700.CHRTOUT_DOMAIN1.comp',
    'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010800.CHRTOUT_DOMAIN1.comp',
    'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010900.CHRTOUT_DOMAIN1.comp'
]]

# fsspec storage options: anonymous S3 access, with caching tuned for
# scanning file headers rather than streaming whole files
so = dict(
    anon=True, default_fill_cache=False, default_cache_type='first'
)

singles = []
for u in urls:
    with fsspec.open(u, **so) as inf:
        # scan one HDF5 file; chunks smaller than 100 bytes are inlined
        # into the references instead of being fetched from the remote file
        h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, u, inline_threshold=100)
        singles.append(h5chunks.translate())
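If we instead wanted to persist the single-file reference sets, a minimal sketch (the output filenames here are only illustrative):

import json
import os

# write each reference set to a local JSON file, so the combine step can
# later be re-run without re-scanning the HDF5 files
for u, refs in zip(urls, singles):
    fname = os.path.basename(u) + ".json"
    with open(fname, "w") as f:
        json.dump(refs, f)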
Multi-file JSONs
This code uses the output generated above to create a single ensemble dataset, with one set of references pointing to all of the chunks in the individual files.
from kerchunk.combine import MultiZarrToZarr

mzz = MultiZarrToZarr(
    singles,                    # the in-memory reference sets from above
    remote_protocol="s3",
    remote_options={'anon': True},
    concat_dims=["time"]        # concatenate the inputs along "time"
)

out = mzz.translate()
Again, out could be written to a JSON file by providing arguments to translate(). Crucially, there is no restriction on where this file lives: it can be anywhere that fsspec can read from.
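For example, a minimal sketch (the filename combined.json is only illustrative):

# write the combined references to a local JSON file as well as returning them
out = mzz.translate("combined.json")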
Using the output
This is what a user of the generated dataset would do. This person does not need to have kerchunk installed, or even h5py (the library we used to initially scan the files).
import xarray as xr

ds = xr.open_dataset(
    "reference://", engine="zarr",
    backend_kwargs={
        "storage_options": {
            "fo": out,              # the reference set (or a path to a reference file)
            "remote_protocol": "s3",
            "remote_options": {"anon": True}
        },
        "consolidated": False
    }
)

# do analysis...
ds.velocity.mean()
Since the invocation for xarray to read this data is a little involved, we recommend declaring the dataset in an intake catalog. Alternatively, you might split the command into multiple lines by first constructing the filesystem or mapper, as sketched below (you will see this pattern in some examples).
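A minimal sketch of that two-step form, reusing the in-memory out references from above:

import fsspec
import xarray as xr

# build a reference filesystem over the combined references, then hand
# xarray a mapper instead of the "reference://" URL
fs = fsspec.filesystem(
    "reference",
    fo=out,
    remote_protocol="s3",
    remote_options={"anon": True},
)
m = fs.get_mapper("")
ds = xr.open_dataset(m, engine="zarr", backend_kwargs={"consolidated": False})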
Note that, if the combining was done previously and saved to a JSON file, then the path to that file should replace out, above, along with target_options for any additional arguments fsspec might need to access it.
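For instance, a sketch assuming the combined references were saved to a hypothetical s3://mybucket/combined.json:

import xarray as xr

ds = xr.open_dataset(
    "reference://", engine="zarr",
    backend_kwargs={
        "storage_options": {
            "fo": "s3://mybucket/combined.json",  # hypothetical saved reference file
            "target_options": {"anon": True},     # options for fetching the JSON itself
            "remote_protocol": "s3",
            "remote_options": {"anon": True}      # options for fetching the data chunks
        },
        "consolidated": False
    }
)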
Example/Tutorial Notebook
A set of tutorial notebooks, presented at the Earth Science Information Partners 2022 Winter Meeting, can be found at the following link, along with links to run the code on free cloud-based notebook environments: https://github.com/lsterzinger/2022-esip-kerchunk-tutorial