API Reference
File format backends
- ``kerchunk.hdf.SingleHdf5ToZarr``: Translate the content of one HDF5 file into Zarr metadata.
- ``kerchunk.grib2.scan_grib``: Generate references for a GRIB2 file
- ``kerchunk.fits.process_file``: Create JSON references for a single FITS file as a zarr group
- ``kerchunk.tiff.tiff_to_zarr``: Wraps TIFFFile's fsspec writer to extract metadata as attributes
- ``kerchunk.netCDF3.NetCDF3ToZarr``: Generate references for a netCDF3 file
- ``kerchunk.hdf4.HDF4ToZarr``: Experimental: interface to HDF4 archival files
- class kerchunk.hdf.SingleHdf5ToZarr(h5f: BinaryIO | str | File | Group, url: str | None = None, spec=1, inline_threshold=500, storage_options=None, error='warn', vlen_encode='embed', out=None)[source]
Translate the content of one HDF5 file into Zarr metadata.
HDF5 groups become Zarr groups. HDF5 datasets become Zarr arrays. Zarr array chunks remain in the HDF5 file.
- Parameters:
- h5f: file-like or str
Input HDF5 file. Can be a binary Python file-like object (duck-typed; adhering to BinaryIO is optional), in which case url must also be provided. If a str, the file will be opened using fsspec and storage_options.
- url: string
URI of the HDF5 file, if passing a file-like object or h5py File/Group
- spec: int
The version of output to produce (see README of this repo)
- inline_threshold: int
Include chunks smaller than this value directly in the output. Zero or negative to disable.
- storage_options: dict
passed to fsspec if h5f is a str
- error: “warn” (default) | “pdb” | “ignore” | “raise”
- vlen_encode: [“embed”, “null”, “leave”, “encode”]
What to do with VLEN string variables or columns of tabular variables:
- leave: pass through the 16-byte garbage IDs unaffected, but requires no codec
- null: set all the strings to None or empty; requires that this library is available at read time
- embed: include all the values in the output JSON (should not be used for large tables)
- encode: save the ID-to-value mapping in a codec, to produce the real values at read time; requires this library to be available. Can be efficient storage where there are few unique values.
- out: dict-like or None
This allows you to supply an fsspec.implementations.reference.LazyReferenceMapper to write out parquet as the references get filled, or some other dictionary-like class to customise how references get stored
Methods
- translate([preserve_linked_dsets]): Translate content of one HDF5 file into Zarr storage format.
- translate(preserve_linked_dsets=False)[source]
Translate content of one HDF5 file into Zarr storage format.
This method is the main entry point to execute the workflow, and returns a “reference” structure to be used with zarr/kerchunk
No data is copied out of the HDF5 file.
- Parameters:
- preserve_linked_dsets: bool (optional, default False)
If True, translate HDF5 soft and hard links for each h5py.Dataset into the reference structure. Requires h5py version 3.11.0 or later. Will not translate external links or links to h5py.Group objects.
- Returns:
- dict
Dictionary containing reference structure.
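A minimal usage sketch follows; the input URL, output filename and inline threshold are hypothetical choices, not values required by the API.

```python
import json

import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://example-bucket/data.h5"  # hypothetical input file
with fsspec.open(url, "rb", anon=True) as f:
    # Scan the HDF5 metadata only; chunk data stays in the original file
    refs = SingleHdf5ToZarr(f, url, inline_threshold=300).translate()

# Save the reference set as JSON for use with fsspec's "reference" filesystem
with open("data.json", "w") as out:
    json.dump(refs, out)
```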
- kerchunk.grib2.scan_grib(url, common=None, storage_options=None, inline_threshold=100, skip=0, filter={}) List[Dict] [source]
Generate references for a GRIB2 file
- Parameters:
- url: str
File location
- common: (deprecated, do not use)
- storage_options: dict
For accessing the data, passed to filesystem
- inline_threshold: int
If given, store array data smaller than this value directly in the output
- skip: int
If non-zero, stop processing the file after this many messages
- filter: dict
keyword filtering. For each key, only messages where the key exists and has the exact given value, or a value in the given set, are processed. E.g., the cf-style filter
{'typeOfLevel': 'heightAboveGround', 'level': 2}
only keeps messages where heightAboveGround==2.
- Returns:
- list(dict): references dicts in Version 1 format, one per message in the file
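A short sketch, assuming a hypothetical GRIB2 file on S3; note that the return value is a list of reference sets, one per message, which are typically combined afterwards (e.g. with MultiZarrToZarr).

```python
from kerchunk.grib2 import scan_grib

# Keep only messages at 2 m height above ground (hypothetical file location)
messages = scan_grib(
    "s3://example-bucket/forecast.grib2",
    storage_options={"anon": True},
    filter={"typeOfLevel": "heightAboveGround", "level": 2},
)
print(len(messages))  # one reference dict per matching message
```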
- kerchunk.fits.process_file(url, storage_options=None, extension=None, inline_threshold=100, primary_attr_to_group=False, out=None)[source]
Create JSON references for a single FITS file as a zarr group
- Parameters:
- url: str
Where the file is
- storage_options: dict
How to load that file (passed to fsspec)
- extension: list(int | str) | int | str or None
Which extensions to include. Can be ordinal integer(s) or the extension name (str); if None, uses the first data extension
- inline_threshold: int
(not yet implemented)
- primary_attr_to_group: bool
Whether the output top-level group contains the attributes of the primary extension (which often contains no data, just a general description)
- out: dict-like or None
This allows you to supply an fsspec.implementations.reference.LazyReferenceMapper to write out parquet as the references get filled, or some other dictionary-like class to customise how references get stored
- Returns:
- dict of the references
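A brief sketch with a hypothetical FITS file; extension=1 is just an illustrative choice.

```python
import json

from kerchunk.fits import process_file

# Reference the first extension of a (hypothetical) remote FITS file
refs = process_file(
    "https://example.org/image.fits",
    extension=1,
)

with open("image.json", "w") as f:
    json.dump(refs, f)
```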
- kerchunk.tiff.tiff_to_zarr(urlpath, remote_options=None, target=None, target_options=None)[source]
Wraps TIFFFile’s fsspec writer to extract metadata as attributes
- Parameters:
- urlpath: str
Location of input TIFF
- remote_options: dict
pass these to fsspec when opening urlpath
- target: str
Write JSON to this location. If not given, no file is output
- target_options: dict
pass these to fsspec when opening target
- Returns:
- references dict
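A minimal sketch; the TIFF location and output path are hypothetical.

```python
from kerchunk.tiff import tiff_to_zarr

# Return the references and also write them to a local JSON file via target=
refs = tiff_to_zarr(
    "s3://example-bucket/scene.tif",
    remote_options={"anon": True},
    target="scene.json",
)
```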
- class kerchunk.netCDF3.NetCDF3ToZarr(filename, storage_options=None, inline_threshold=100, max_chunk_size=0, out=None, **kwargs)[source]
Generate references for a netCDF3 file
Uses scipy’s netCDF3 reader, but only reads the metadata. Note that instances do behave like actual scipy netcdf files, but contain no valid data. Also appears to work for netCDF2, although this is not currently tested.
Methods
- translate(): Produce references dictionary
- __init__(filename, storage_options=None, inline_threshold=100, max_chunk_size=0, out=None, **kwargs)[source]
- Parameters:
- filename: str
location of the input
- storage_options: dict
passed to fsspec when opening filename
- inline_threshold: int
Byte size below which an array will be embedded in the output. Use 0 to disable inlining.
- max_chunk_size: int
How big a chunk can be before triggering subchunking. If 0, there is no subchunking, and there is never subchunking for coordinate/dimension arrays. E.g., if an array contains 10,000 bytes and this value is 6000, there will be two output chunks, split on the biggest available dimension. [TBC]
- out: dict-like or None
This allows you to supply an fsspec.implementations.reference.LazyReferenceMapper to write out parquet as the references get filled, or some other dictionary-like class to customise how references get stored
- args, kwargs: passed to scipy superclass ``scipy.io.netcdf.netcdf_file``
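A minimal sketch with a hypothetical netCDF3 file; only the metadata is read during translation.

```python
import json

from kerchunk.netCDF3 import NetCDF3ToZarr

refs = NetCDF3ToZarr(
    "s3://example-bucket/model_output.nc",  # hypothetical input
    storage_options={"anon": True},
    inline_threshold=100,
).translate()

with open("model_output.json", "w") as f:
    json.dump(refs, f)
```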
Codecs
- ``kerchunk.codecs.GRIBCodec``: Read GRIB stream of bytes as a message using eccodes
- ``kerchunk.codecs.AsciiTableCodec``: Decodes ASCII-TABLE extensions in FITS files
- ``kerchunk.codecs.FillStringsCodec``: Sets fixed-length string fields to empty
- ``kerchunk.codecs.VarArrCodec``: Variable length arrays in a FITS BINTABLE extension
- ``kerchunk.codecs.RecordArrayMember``: Read components of a record array (complex dtype)
- class kerchunk.codecs.GRIBCodec(var, dtype=None)[source]
Read GRIB stream of bytes as a message using eccodes
- class kerchunk.codecs.AsciiTableCodec(indtypes, outdtypes)[source]
Decodes ASCII-TABLE extensions in FITS files
- class kerchunk.codecs.FillStringsCodec(dtype, id_map=None)[source]
Sets fixed-length string fields to empty
To be used with HDF fields of strings, to fill in the values of the opaque 16-byte string IDs.
- class kerchunk.codecs.VarArrCodec(dt_in, dt_out, nrow, types)[source]
Variable length arrays in a FITS BINTABLE extension
Combining
- ``kerchunk.combine.MultiZarrToZarr``: Combine multiple kerchunk'd datasets into a single logical aggregate dataset
- ``kerchunk.combine.merge_vars``: Merge variables across datasets with identical coordinates
- ``kerchunk.combine.concatenate_arrays``: Simple concatenate of one zarr array along an axis
- ``kerchunk.combine.auto_dask``: Batched tree combine using dask.
- ``kerchunk.combine.drop``: Generate example preprocessor removing given fields
- class kerchunk.combine.MultiZarrToZarr(path, indicts=None, coo_map=None, concat_dims=None, coo_dtypes=None, identical_dims=None, target_options=None, remote_protocol=None, remote_options=None, inline_threshold: int = 500, preprocess=None, postprocess=None, out=None)[source]
Combine multiple kerchunk’d datasets into a single logical aggregate dataset
- Parameters:
- path: str, list(str) or list(dict)
Local paths, each containing a references JSON; or a list of references dicts. You may pass a list of reference dicts only, but then they will not have associated filenames; if you need filenames for producing coordinates, pass the list of filenames with path=, and the references with indicts=.
- indicts: list(dict)
- concat_dims: str or list(str)
Names of the dimensions to expand with
- coo_map: dict(str, selector)
The special key "var" means the variable name in the output, which will be "VARNAME" by default (i.e., variable names are the same as in the input datasets). The default for any other coordinate is data:varname, i.e., look for an array with that name.
Selectors ("how to get coordinate values from a dataset") can be:
  - a constant value (usually str for a var name, number for a coordinate)
  - a compiled regex (re.Pattern), which will be applied to the filename. Should return exactly one value
  - a string beginning "attr:", which will fetch this attribute from the zarr dataset of each path
  - a string beginning "vattr:{var}:", as above, but the attribute is taken from the array named var
  - "VARNAME", a special value for where a dataset contains multiple variables: just use the variable names as given
  - "INDEX", a special value for the index of how far through the list of inputs we are so far
  - a string beginning "data:{var}", which will get the appropriate zarr array from each input dataset
  - "cf:{var}", interpret the value of var using cftime, returning a datetime. These will be automatically re-encoded with cftime, unless you specify an "M8[*]" dtype for the coordinate, in which case a conversion will be attempted.
  - a list with the values that are known beforehand
  - a function with signature (index, fs, var, fn) -> value, where index is an int counter, fs is the file system made for the current input, var is the variable we are probing (may be "var"), and fn is the filename or None if dicts were used as input
- coo_dtypes: map(str, str|np.dtype)
Coerce the final type of coordinate arrays (otherwise use numpy default)
- identical_dims: list[str]
Variables that are to be copied across from the first input dataset, because they do not vary.
- target_options: dict
Storage options for opening path
- remote_protocol: str
The protocol of the original data
- remote_options: dict
- inline_threshold: int
Size below which binary blocks are included directly in the output
- preprocess: callable
Acts on the references dict of all inputs before processing. See drop() for an example.
- postprocess: callable
Acts on the references dict before output. postprocess(dict) -> dict
- out: dict-like or None
This allows you to supply an fsspec.implementations.reference.LazyReferenceMapper to write out parquet as the references get filled, or some other dictionary-like class to customise how references get stored
- append: bool
If True, will load the references specified by out and add to them rather than starting from scratch. Assumes the same coordinates are being concatenated.
Methods
- append(path, original_refs[, ...]): Update an existing combined reference set with new references
- translate([filename, storage_options]): Perform all stages and return the resultant references dict
- __init__(path, indicts=None, coo_map=None, concat_dims=None, coo_dtypes=None, identical_dims=None, target_options=None, remote_protocol=None, remote_options=None, inline_threshold: int = 500, preprocess=None, postprocess=None, out=None)[source]
- classmethod append(path, original_refs, remote_protocol=None, remote_options=None, target_options=None, **kwargs)[source]
Update an existing combined reference set with new references
There are two main usage patterns:
- if the input original_refs is JSON, the combine happens in memory and the output should be written to JSON. This could then be optionally converted to parquet in a separate step
- if original_refs is a lazy parquet reference set, then it will be amended in-place
If you want to extend JSON references and output to parquet, you must first convert to parquet in the location you would like the final product to live.
The other arguments should be the same as they were at the creation of the original combined reference set.
NOTE: if the original combine used a postprocess function, appending may not work as intended, since the combine is done “before” postprocessing. Postprocess functions that only add information (such as setting attrs) should be OK.
- Parameters:
- path: list of reference sets to add. If remote/target options would be different to those of original_refs, these can be given as dicts or LazyReferenceMapper instances
- original_refs: combined reference set to be extended
- remote_protocol, remote_options, target_options: referring to ``original_refs``
- kwargs: to MultiZarrToZarr
- Returns:
- MultiZarrToZarr
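A sketch of a typical combine along a time dimension; the input JSON filenames, coordinate names, and the use of cf:time decoding are assumptions about the data, not requirements of the API.

```python
import json

from kerchunk.combine import MultiZarrToZarr

mzz = MultiZarrToZarr(
    ["day1.json", "day2.json", "day3.json"],  # hypothetical per-file references
    concat_dims=["time"],
    coo_map={"time": "cf:time"},            # decode the time coordinate via cftime
    identical_dims=["lat", "lon"],          # assumed not to vary between inputs
    remote_protocol="s3",
    remote_options={"anon": True},
)
combined = mzz.translate()

with open("combined.json", "w") as f:
    json.dump(combined, f)
```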
- kerchunk.combine.merge_vars(files, storage_options=None)[source]
Merge variables across datasets with identical coordinates
- Parameters:
files – list(dict), list(str) or list(fsspec.OpenFile) List of reference dictionaries or list of paths to reference json files to be merged
storage_options – dict Dictionary containing kwargs to fsspec.open_files
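A one-line sketch; the filenames are hypothetical reference sets for different variables on the same grid.

```python
from kerchunk.combine import merge_vars

# Merge per-variable reference sets that share identical coordinates
merged = merge_vars(["temperature.json", "precipitation.json"])
```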
- kerchunk.combine.concatenate_arrays(files, storage_options=None, axis=0, key_seperator='.', path=None, check_arrays=False)[source]
Simple concatenate of one zarr array along an axis
Assumes that each array is identical in shape/type.
If the inputs are groups, provide the path to the contained array, and all other arrays will be ignored. You could concatenate the arrays separately and then recombine them with merge_vars.
- Parameters:
- files: list[dict] | list[str]
Input reference sets, maybe generated by kerchunk.zarr.single_zarr
- storage_options: dict | None
To create the filesystems, such as target/remote protocol and target/remote options
- key_seperator: str
“.” or “/”, how the zarr keys are stored
- path: str or None
If the datasets are groups rather than simple arrays, this is the location in the group hierarchy to concatenate. The group structure will be recreated.
- check_arrays: bool
Whether we check the size and chunking of the inputs. If True, and an inconsistency is found, an exception is raised. If False (default), the user is expected to be certain that the chunking and shapes are compatible.
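A sketch assuming the inputs are groups, each containing an array at the hypothetical path "data".

```python
from kerchunk.combine import concatenate_arrays

refs = concatenate_arrays(
    ["part1.json", "part2.json", "part3.json"],  # hypothetical reference sets
    axis=0,
    path="data",          # hypothetical array path within each group
    check_arrays=True,    # raise if shapes or chunking are inconsistent
)
```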
- kerchunk.combine.auto_dask(urls: List[str], single_driver: type, single_kwargs: dict, mzz_kwargs: dict, n_batches: int, remote_protocol=None, remote_options=None, filename=None, output_options=None)[source]
Batched tree combine using dask.
If you wish to run on a distributed cluster (recommended), create a client before calling this function.
- Parameters:
- urls: list[str]
input dataset URLs
- single_driver: class
class with a translate() method
- single_kwargs: to pass to single-input driver
- mzz_kwargs: passed to ``MultiZarrToZarr`` for each batch
- n_batches: int
Number of MZZ instances in the first combine stage. May be set equal to the number of dask workers, or a multiple thereof.
- remote_protocol: str | None
- remote_options: dict
To fsspec for opening the remote files
- filename: str | None
Output filename, if writing
- output_options
If filename is not None, open it with these options
- Returns:
- reference set
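A sketch of a batched combine; the file URLs, MZZ settings and batch count are illustrative assumptions.

```python
from dask.distributed import Client

from kerchunk.combine import auto_dask
from kerchunk.hdf import SingleHdf5ToZarr

client = Client()  # local cluster here; a distributed cluster is recommended for large jobs

urls = [f"s3://example-bucket/file_{i}.h5" for i in range(100)]  # hypothetical inputs

refs = auto_dask(
    urls,
    single_driver=SingleHdf5ToZarr,
    single_kwargs={"inline_threshold": 300},
    mzz_kwargs={"concat_dims": ["time"], "identical_dims": ["lat", "lon"]},
    n_batches=10,                    # e.g. one batch per dask worker
    remote_protocol="s3",
    remote_options={"anon": True},
    filename="combined.json",        # write the final reference set here
)
```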
Utilities
- ``kerchunk.utils.rename_target``: Utility to change URLs in a reference set in a predictable way
- ``kerchunk.utils.rename_target_files``: Perform URL renames on a reference set - read and write from JSON
- ``kerchunk.utils.subchunk``: Split uncompressed chunks into integer subchunks on the largest axis
- ``kerchunk.utils.dereference_archives``: Directly point to uncompressed byte ranges in ZIP/TAR archives
- ``kerchunk.utils.consolidate``: Turn raw references into output
- ``kerchunk.utils.do_inline``: Replace short chunks with the value of that chunk and inline metadata
- ``kerchunk.utils.inline_array``: Inline whole arrays by threshold or name, replace with a single metadata chunk
- ``kerchunk.df.refs_to_dataframe``: Write references as a parquet files store.
- kerchunk.utils.rename_target(refs, renames)[source]
Utility to change URLs in a reference set in a predictable way
For reference sets including templates, this is more easily done by using template overrides at access time; but rewriting the references and saving a new file means not having to do that every time.
- Parameters:
- refs: dict
Reference set
- renames: dict[str, str]
Mapping from the old URL (including protocol, if this is how they appear in the original) to new URL
- Returns:
- dict: the altered reference set, which can be saved
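A sketch; the reference file and the URL mapping are hypothetical. rename_target_files (below) performs the same operation reading from and writing to JSON directly.

```python
import json

from kerchunk.utils import rename_target

with open("data.json") as f:  # hypothetical existing reference set
    refs = json.load(f)

# Map each old URL to its new location, e.g. after moving the data to S3
new_refs = rename_target(
    refs,
    {"file:///local/data.nc": "s3://example-bucket/data.nc"},
)

with open("data_s3.json", "w") as f:
    json.dump(new_refs, f)
```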
- kerchunk.utils.rename_target_files(url_in, renames, url_out=None, storage_options_in=None, storage_options_out=None)[source]
Perform URL renames on a reference set - read and write from JSON
- Parameters:
- url_in: str
Original JSON reference set
- renames: dict
URL renamings to perform (see rename_target)
- url_out: str | None
Where to write to. If None, overwrites original
- storage_options_in: dict | None
passed to fsspec for opening url_in
- storage_options_out: dict | None
passed to fsspec for opening url_out. If None, storage_options_in is used.
- Returns:
- None
- kerchunk.tiff.generate_coords(attrs, shape)[source]
Produce coordinate arrays for given variable
Specific to GeoTIFF input attributes
- Parameters:
- attrs: dict
Containing the geoTIFF tags, probably the root group of the dataset
- shape: tuple[int]
The array size in numpy (C) order
- kerchunk.utils.subchunk(store, variable, factor)[source]
Split uncompressed chunks into integer subchunks on the largest axis
- Parameters:
- store: dict
reference set
- variable: str
the named zarr variable (give as /-separated path if deep)
- factor: int
the number of chunks each input chunk turns into. Must be an exact divisor of the original largest dimension length.
- Returns:
- modified store
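A sketch; the reference file, variable name, and factor are hypothetical, and the variable's chunks are assumed to be uncompressed.

```python
import json

from kerchunk.utils import subchunk

with open("data.json") as f:  # hypothetical reference set
    store = json.load(f)

# Split each chunk of "temperature" into 4 along its largest axis;
# 4 must exactly divide the length of that largest dimension
store = subchunk(store, "temperature", 4)
```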
- kerchunk.utils.dereference_archives(references, remote_options=None)[source]
Directly point to uncompressed byte ranges in ZIP/TAR archives
If a set of references have been made for files contained within ZIP or (uncompressed) TAR archives, the “zip://…” and “tar://…” URLs should be converted to byte ranges in the overall file.
- Parameters:
- references: dict
a simple reference set
- remote_options: dict or None
For opening the archives
- kerchunk.utils.do_inline(store, threshold, remote_options=None, remote_protocol=None)[source]
Replace short chunks with the value of that chunk and inline metadata
The chunk may need encoding with base64 if not ascii, so actual length may be larger than threshold.
- kerchunk.utils.inline_array(store, threshold=1000, names=None, remote_options=None)[source]
Inline whole arrays by threshold or name, replace with a single metadata chunk
Inlining whole arrays results in fewer keys. If the constituent keys were already inlined, this also results in a smaller file overall. No action is taken for arrays that are already of one chunk.
- Parameters:
- store: dict/JSON file
reference set
- threshold: int
Size in bytes below which to inline. Set to 0 to prevent inlining by size
- names: list[str] | None
If the array name (as a dotted full path) appears in this list, it will be inlined irrespective of the threshold size. Useful for coordinates.
- remote_options: dict | None
Needed to fetch data, if the required keys are not already individually inlined in the data.
- Returns:
- amended references set (simple style)
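A sketch; the reference file, array names, and options are hypothetical.

```python
import json

from kerchunk.utils import inline_array

with open("combined.json") as f:  # hypothetical reference set
    refs = json.load(f)

# Inline small arrays, and always inline the named (hypothetical) coordinate arrays
refs = inline_array(
    refs,
    threshold=1000,
    names=["time", "lat", "lon"],
    remote_options={"anon": True},  # needed if chunk data must be fetched
)
```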
- kerchunk.df.refs_to_dataframe(fo, url, target_protocol=None, target_options=None, storage_options=None, record_size=100000, categorical_threshold=10)[source]
Write references as a parquet files store.
The directory structure should mimic a normal zarr store but instead of standard chunk keys, references are saved as parquet dataframes.
- Parameters:
- fo: str | dict
Location of a JSON file containing references or a reference set already loaded into memory.
- url: str
Location for the output, together with protocol. This must be a writable directory.
- target_protocol: str
Used for loading the reference file, if it is a path. If None, protocol will be derived from the given path
- target_options: dict
Extra FS options for loading the reference file fo, if given as a path
- storage_options: dict | None
Passed to fsspec for writing the parquet.
- record_size: int
Number of references to store in each reference file (default 100000). Bigger values mean fewer read requests but larger memory footprint.
- categorical_threshold: int
Encode urls as pandas.Categorical to reduce memory footprint if the ratio of the number of unique urls to total number of refs for each variable is greater than or equal to this number. (default 10)
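A sketch; the input JSON, output directory, and remote protocol are hypothetical.

```python
import fsspec

from kerchunk.df import refs_to_dataframe

# Convert a (hypothetical) large JSON reference set into a parquet store
refs_to_dataframe("combined.json", "combined.parq", record_size=100000)

# The parquet store can then be opened lazily with fsspec's reference filesystem
fs = fsspec.filesystem(
    "reference",
    fo="combined.parq",
    remote_protocol="s3",
    remote_options={"anon": True},
)
```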