API Reference

File format backends

kerchunk.hdf.SingleHdf5ToZarr(h5f[, url, ...])

Translate the content of one HDF5 file into Zarr metadata.

kerchunk.grib2.scan_grib(url[, common, ...])

Generate references for a GRIB2 file

kerchunk.fits.process_file(url[, ...])

Create JSON references for a single FITS file as a zarr group

kerchunk.tiff.tiff_to_zarr(urlpath[, ...])

Wraps TIFFFile's fsspec writer to extract metadata as attributes

kerchunk.netCDF3.NetCDF3ToZarr(filename[, ...])

Generate references for a netCDF3 file

class kerchunk.hdf.SingleHdf5ToZarr(h5f: BinaryIO | str | File | Group, url: str | None = None, spec=1, inline_threshold=500, storage_options=None, error='warn', vlen_encode='embed', out=None)[source]

Translate the content of one HDF5 file into Zarr metadata.

HDF5 groups become Zarr groups. HDF5 datasets become Zarr arrays. Zarr array chunks remain in the HDF5 file.

Parameters:
h5f: file-like or str

Input HDF5 file. Can be a binary Python file-like object (duck-typed; adhering to BinaryIO is optional), in which case url must also be provided. If a str, the file will be opened using fsspec and storage_options.

url: str

URI of the HDF5 file, if passing a file-like object or h5py File/Group

spec: int

The version of output to produce (see README of this repo)

inline_threshold: int

Include chunks smaller than this value directly in the output. Zero or negative to disable

storage_options: dict

passed to fsspec if h5f is a str

error: “warn” (default) | “pdb” | “ignore” | “raise”
vlen_encode: [“embed”, “null”, “leave”, “encode”]

What to do with VLEN string variables or columns of tabular variables:

leave: pass through the 16-byte garbage IDs unaffected, but requires no codec

null: set all the strings to None or empty; requires that this library is available at read time

embed: include all the values in the output JSON (should not be used for large tables)

encode: save the ID-to-value mapping in a codec, to produce the real values at read time; requires this library to be available. Can be efficient storage where there are few unique values.

out: dict-like or None

This allows you to supply an fsspec.implementations.reference.LazyReferenceMapper to write out parquet as the references get filled, or some other dictionary-like class to customise how references get stored

Methods

translate()

Translate content of one HDF5 file into Zarr storage format.

translate()[source]

Translate content of one HDF5 file into Zarr storage format.

This method is the main entry point to execute the workflow, and returns a “reference” structure to be used with zarr/kerchunk

No data is copied out of the HDF5 file.

Returns:
dict

Dictionary containing reference structure.
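
A minimal sketch of typical usage, assuming an illustrative S3 path; the file is opened with fsspec, translated, and the resulting references written to JSON:

    import fsspec
    import ujson
    from kerchunk.hdf import SingleHdf5ToZarr

    url = "s3://bucket/path/file.h5"   # hypothetical input location
    with fsspec.open(url, "rb", anon=True) as f:
        refs = SingleHdf5ToZarr(f, url=url, inline_threshold=300).translate()

    with open("file.json", "w") as out:
        out.write(ujson.dumps(refs))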

kerchunk.grib2.scan_grib(url, common=None, storage_options=None, inline_threshold=100, skip=0, filter={})[source]

Generate references for a GRIB2 file

Parameters:
url: str

File location

common: (deprecated, do not use)
storage_options: dict

For accessing the data, passed to filesystem

inline_threshold: int

If given, store array data smaller than this value directly in the output

skip: int

If non-zero, stop processing the file after this many messages

filter: dict

Keyword filtering. For each key, only messages where the key exists and has the exact value (or a value in the given set) are processed. E.g., the cf-style filter {'typeOfLevel': 'heightAboveGround', 'level': 2} only keeps messages where heightAboveGround==2.

Returns:
list(dict): references dicts in Version 1 format, one per message in the file
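
A sketch of scanning a GRIB2 file with a cf-style filter; the URL and filter values are illustrative only:

    from kerchunk.grib2 import scan_grib

    messages = scan_grib(
        "s3://bucket/path/forecast.grib2",          # hypothetical location
        storage_options={"anon": True},
        filter={"typeOfLevel": "heightAboveGround", "level": 2},
    )
    # one version-1 reference dict per matching GRIB message
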
kerchunk.fits.process_file(url, storage_options=None, extension=None, inline_threshold=100, primary_attr_to_group=False, out=None)[source]

Create JSON references for a single FITS file as a zarr group

Parameters:
url: str

Where the file is

storage_options: dict

How to load that file (passed to fsspec)

extension: list(int | str) | int | str or None

Which extensions to include. Can be ordinal integer(s), the extension name (str) or if None, uses the first data extension

inline_threshold: int

(not yet implemented)

primary_attr_to_group: bool

Whether the output top-level group contains the attributes of the primary extension (which often contains no data, just a general description)

out: dict-like or None

This allows you to supply an fsspec.implementations.reference.LazyReferenceMapper to write out parquet as the references get filled, or some other dictionary-like class to customise how references get stored

Returns:
dict of the references
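
A sketch of referencing a single FITS file; the URL and extension choice are illustrative:

    from kerchunk.fits import process_file

    refs = process_file(
        "https://example.org/archive/image.fits",   # hypothetical location
        extension=1,                 # omit to use the first data extension
        primary_attr_to_group=True,  # copy primary-HDU attributes to the root group
    )
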
kerchunk.tiff.tiff_to_zarr(urlpath, remote_options=None, target=None, target_options=None)[source]

Wraps TIFFFile’s fsspec writer to extract metadata as attributes

Parameters:
urlpath: str

Location of input TIFF

remote_options: dict

pass these to fsspec when opening urlpath

target: str

Write JSON to this location. If not given, no file is output

target_options: dict

pass these to fsspec when opening target

Returns:
references dict
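
A sketch with illustrative paths; passing target also writes the references to a JSON file:

    from kerchunk.tiff import tiff_to_zarr

    refs = tiff_to_zarr(
        "s3://bucket/scene.tif",            # hypothetical input TIFF
        remote_options={"anon": True},
        target="scene.json",                # optional JSON output location
    )
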
class kerchunk.netCDF3.NetCDF3ToZarr(filename, storage_options=None, inline_threshold=100, max_chunk_size=0, out=None, **kwargs)[source]

Generate references for a netCDF3 file

Uses scipy’s netCDF3 reader, but only reads the metadata. Note that instances do behave like actual scipy netcdf files, but contain no valid data. Also appears to work for netCDF2, although this is not currently tested.

Methods

translate()

Produce references dictionary

__init__(filename, storage_options=None, inline_threshold=100, max_chunk_size=0, out=None, **kwargs)[source]
Parameters:
filename: str

location of the input

storage_options: dict

passed to fsspec when opening filename

inline_threshold: int

Byte size below which an array will be embedded in the output. Use 0 to disable inlining.

max_chunk_size: int

How big a chunk can be before triggering subchunking. If 0, there is no subchunking, and there is never subchunking for coordinate/dimension arrays. E.g., if an array contains 10,000 bytes and this value is 6000, there will be two output chunks, split on the biggest available dimension. [TBC]

out: dict-like or None

This allows you to supply an fsspec.implementations.reference.LazyReferenceMapper to write out parquet as the references get filled, or some other dictionary-like class to customise how references get stored

args, kwargs: passed to scipy superclass ``scipy.io.netcdf.netcdf_file``
translate()[source]

Produce references dictionary
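
A sketch with an illustrative file location:

    from kerchunk.netCDF3 import NetCDF3ToZarr

    refs = NetCDF3ToZarr(
        "s3://bucket/data.nc",              # hypothetical netCDF3 file
        storage_options={"anon": True},
        inline_threshold=100,
    ).translate()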


Codecs

kerchunk.codecs.GRIBCodec(var[, dtype])

Read GRIB stream of bytes as a message using eccodes

kerchunk.codecs.AsciiTableCodec(indtypes, ...)

Decodes ASCII-TABLE extensions in FITS files

kerchunk.codecs.FillStringsCodec(dtype[, id_map])

Sets fixed-length string fields to empty

kerchunk.codecs.VarArrCodec(dt_in, dt_out, ...)

Variable length arrays in a FITS BINTABLE extension

kerchunk.codecs.RecordArrayMember(member, dtype)

Read components of a record array (complex dtype)

class kerchunk.codecs.GRIBCodec(var, dtype=None)[source]

Read GRIB stream of bytes as a message using eccodes

__init__(var, dtype=None)[source]
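
These codecs are numcodecs implementations that kerchunk records in the generated zarr metadata, so they are rarely constructed by hand; a purely illustrative instantiation:

    from kerchunk.codecs import GRIBCodec

    # "t2m" is a hypothetical GRIB variable name
    codec = GRIBCodec(var="t2m", dtype="float64")
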
class kerchunk.codecs.AsciiTableCodec(indtypes, outdtypes)[source]

Decodes ASCII-TABLE extensions in FITS files

__init__(indtypes, outdtypes)[source]
Parameters:
indtypes: list[str]

dtypes of the fields as in the table

outdtypes: list[str]

requested final dtypes

class kerchunk.codecs.FillStringsCodec(dtype, id_map=None)[source]

Sets fixed-length string fields to empty

To be used with HDF fields of strings, to fill in the values of the opaque 16-byte string IDs.

__init__(dtype, id_map=None)[source]

Note: we must pass id_map using strings, because this is JSON-encoded by zarr.

Parameters:
id_map: None | str | dict(str, str)
class kerchunk.codecs.VarArrCodec(dt_in, dt_out, nrow, types)[source]

Variable length arrays in a FITS BINTABLE extension

__init__(dt_in, dt_out, nrow, types)[source]
class kerchunk.codecs.RecordArrayMember(member, dtype)[source]

Read components of a record array (complex dtype)

__init__(member, dtype)[source]
Parameters:
member: str

name of desired subarray

dtype: list of lists

description of the complex dtype of the overall record array. Must be parsable by np.dtype() and JSON serialisable

Combining

kerchunk.combine.MultiZarrToZarr(path[, ...])

Combine multiple kerchunk'd datasets into a single logical aggregate dataset

kerchunk.combine.merge_vars(files[, ...])

Merge variables across datasets with identical coordinates

kerchunk.combine.concatenate_arrays(files[, ...])

Simple concatenate of one zarr array along an axis

kerchunk.combine.auto_dask(urls, ...[, ...])

Batched tree combine using dask.

kerchunk.combine.drop(fields)

Generate example preprocessor removing given fields

class kerchunk.combine.MultiZarrToZarr(path, indicts=None, coo_map=None, concat_dims=None, coo_dtypes=None, identical_dims=None, target_options=None, remote_protocol=None, remote_options=None, inline_threshold=500, preprocess=None, postprocess=None, out=None)[source]

Combine multiple kerchunk’d datasets into a single logical aggregate dataset

Parameters:
  • path – str, list(str) or list(dict) Local paths, each containing a references JSON; or a list of references dicts. You may pass a list of reference dicts only, but then they will not have associated filenames; if you need filenames for producing coordinates, pass the list of filenames with path=, and the references with indicts=

  • indicts – list(dict)

  • concat_dims – str or list(str) Names of the dimensions to expand with

  • coo_map

    dict(str, selector) The special key “var” means the variable name in the output, which will be “VARNAME” by default (i.e., variable names are the same as in the input datasets). The default for any other coordinate is data:varname, i.e., look for an array with that name.

    Selectors (“how to get coordinate values from a dataset”) can be:
    • a constant value (usually str for a var name, number for a coordinate)

    • a compiled regex re.Pattern, which will be applied to the filename. Should return exactly one value

    • a string beginning “attr:” which will fetch this attribute from the zarr dataset of each path

    • a string beginning “vattr:{var}:” as above, but the attribute is taken from the array named var

    • the special value ”VARNAME”: where a dataset contains multiple variables, just use the variable names as given

    • the special value ”INDEX”: the position of the current input in the list of inputs

    • a string beginning “data:{var}” which will get the appropriate zarr array from each input dataset.

    • ”cf:{var}”, interpret the value of var using cftime, returning a datetime. These will be automatically re-encoded with cftime, unless you specify an “M8[*]” dtype for the coordinate, in which case a conversion will be attempted.

    • a list with the values that are known beforehand

    • a function with signature (index, fs, var, fn) -> value, where index is an int counter, fs is the file system made for the current input, var is the variable we are probing (may be “var”) and fn is the filename or None if dicts were used as input

  • coo_dtypes – map(str, str|np.dtype) Coerce the final type of coordinate arrays (otherwise use numpy default)

  • identical_dims – list[str] Variables that are to be copied across from the first input dataset, because they do not vary.

  • target_options – dict Storage options for opening path

  • remote_protocol – str The protocol of the original data

  • remote_options – dict

  • inline_threshold – int Size below which binary blocks are included directly in the output

  • preprocess – callable Acts on the references dict of all inputs before processing. See drop() for an example.

  • postprocess – callable Acts on the references dict before output. postprocess(dict)-> dict

  • out – dict-like or None This allows you to supply an fsspec.implementations.reference.LazyReferenceMapper to write out parquet as the references get filled, or some other dictionary-like class to customise how references get stored

  • append – bool If True, will load the references specified by out and add to them rather than starting from scratch. Assumes the same coordinates are being concatenated.

Methods

append(path, original_refs[, ...])

Update an existing combined reference set with new references

translate([filename, storage_options])

Perform all stages and return the resultant references dict

__init__(path, indicts=None, coo_map=None, concat_dims=None, coo_dtypes=None, identical_dims=None, target_options=None, remote_protocol=None, remote_options=None, inline_threshold=500, preprocess=None, postprocess=None, out=None)[source]
classmethod append(path, original_refs, remote_protocol=None, remote_options=None, target_options=None, **kwargs)[source]

Update an existing combined reference set with new references

There are two main usage patterns:

  • if the input original_refs is JSON, the combine happens in memory and the output should be written to JSON. This could then be optionally converted to parquet in a separate step

  • if original_refs is a lazy parquet reference set, then it will be amended in-place

If you want to extend JSON references and output to parquet, you must first convert to parquet in the location you would like the final product to live.

The other arguments should be the same as they were at the creation of the original combined reference set.

NOTE: if the original combine used a postprocess function, appending may not work correctly, because the combine is done “before” postprocessing. Postprocess functions that only add information (such as setting attrs) should be OK.

Parameters:
path: list of reference sets to add

If remote/target options would be different from those of original_refs, these can be given as dicts or LazyReferenceMapper instances

original_refs: combined reference set to be extended
remote_protocol, remote_options, target_options: referring to ``original_refs``
kwargs: to MultiZarrToZarr
Returns:
MultiZarrToZarr
translate(filename=None, storage_options=None)[source]

Perform all stages and return the resultant references dict

If filename and storage options are given, the output is written to this file using ujson and fsspec.
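
A sketch of combining per-file reference sets along a time dimension; the paths, coordinate names and selectors are illustrative:

    from kerchunk.combine import MultiZarrToZarr

    mzz = MultiZarrToZarr(
        ["file1.json", "file2.json"],       # hypothetical per-input reference JSONs
        concat_dims=["time"],
        coo_map={"time": "cf:time"},        # decode the time coordinate with cftime
        identical_dims=["lat", "lon"],
        remote_protocol="s3",
        remote_options={"anon": True},
    )
    combined = mzz.translate("combined.json")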

kerchunk.combine.merge_vars(files, storage_options=None)[source]

Merge variables across datasets with identical coordinates

Parameters:
  • files – list(dict), list(str) or list(fsspec.OpenFile) List of reference dictionaries or list of paths to reference json files to be merged

  • storage_options – dict Dictionary containing kwargs to fsspec.open_files
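
A sketch of merging two reference sets that share coordinates but hold different variables (filenames are hypothetical):

    from kerchunk.combine import merge_vars

    merged = merge_vars(["temperature.json", "humidity.json"])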

kerchunk.combine.concatenate_arrays(files, storage_options=None, axis=0, key_seperator='.', path=None, check_arrays=False)[source]

Simple concatenate of one zarr array along an axis

Assumes that each array is identical in shape/type.

If the inputs are groups, provide the path to the contained array, and all other arrays will be ignored. You could concatenate the arrays separately and then recombine them with merge_vars.

Parameters:
files: list[dict] | list[str]

Input reference sets, e.g. generated by kerchunk.zarr.single_zarr

storage_options: dict | None

To create the filesystems, such as target/remote protocol and target/remote options

key_seperator: str

“.” or “/”, how the zarr keys are stored

path: str or None

If the datasets are groups rather than simple arrays, this is the location in the group hierarchy to concatenate. The group structure will be recreated.

check_arrays: bool

Whether we check the size and chunking of the inputs. If True, and an inconsistency is found, an exception is raised. If False (default), the user is expected to be certain that the chunking and shapes are compatible.
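
A sketch of concatenating one array from several reference sets along the first axis; the filenames and array path are hypothetical:

    from kerchunk.combine import concatenate_arrays

    refs = concatenate_arrays(
        ["part1.json", "part2.json"],
        axis=0,
        path="data",            # location of the array within each group
        check_arrays=True,      # raise if shapes/chunking are inconsistent
    )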

kerchunk.combine.auto_dask(urls: List[str], single_driver: str, single_kwargs: dict, mzz_kwargs: dict, n_batches: int, remote_protocol=None, remote_options=None, filename=None, output_options=None)[source]

Batched tree combine using dask.

If you wish to run on a distributed cluster (recommended), create a client before calling this function.

Parameters:
urls: list[str]

input dataset URLs

single_driver: class

class with translate() method

single_kwargs: to pass to single-input driver
mzz_kwargs: passed to ``MultiZarrToZarr`` for each batch
n_batches: int

Number of MZZ instances in the first combine stage. May be set equal to the number of dask workers, or a multiple thereof.

remote_protocol: str | None
remote_options: dict

To fsspec for opening the remote files

filename: str | None

Output filename, if writing

output_options

If filename is not None, open it with these options

Returns:
reference set
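
A sketch of a two-stage combine of HDF5 inputs on a dask cluster; URLs, kwargs and the batch count are illustrative:

    from dask.distributed import Client
    from kerchunk.hdf import SingleHdf5ToZarr
    from kerchunk.combine import auto_dask

    client = Client()                       # recommended: a distributed client
    refs = auto_dask(
        urls=["s3://bucket/f1.h5", "s3://bucket/f2.h5"],   # hypothetical inputs
        single_driver=SingleHdf5ToZarr,
        single_kwargs={"storage_options": {"anon": True}},
        mzz_kwargs={"concat_dims": ["time"]},
        n_batches=4,
        remote_protocol="s3",
        remote_options={"anon": True},
        filename="combined.json",
    )
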
kerchunk.combine.drop(fields)[source]

Generate example preprocessor removing given fields
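
Typically passed as a preprocessor to MultiZarrToZarr; the filenames and "unwanted_var" are hypothetical:

    from kerchunk.combine import MultiZarrToZarr, drop

    mzz = MultiZarrToZarr(
        ["file1.json", "file2.json"],
        concat_dims=["time"],
        preprocess=drop("unwanted_var"),
    )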

Utilities

kerchunk.utils.rename_target(refs, renames)

Utility to change URLs in a reference set in a predictable way

kerchunk.utils.rename_target_files(url_in, ...)

Perform URL renames on a reference set - read and write from JSON

kerchunk.utils.subchunk(store, variable, factor)

Split uncompressed chunks into integer subchunks on the largest axis

kerchunk.utils.dereference_archives(references)

Directly point to uncompressed byte ranges in ZIP/TAR archives

kerchunk.utils.consolidate(refs)

Turn raw references into output

kerchunk.utils.do_inline(store, threshold[, ...])

Replace short chunks with the value of that chunk and inline metadata

kerchunk.utils.inline_array(store[, ...])

Inline whole arrays by threshold or name, replace with a single metadata chunk

kerchunk.df.refs_to_dataframe(fo, url[, ...])

Write references as a parquet file store.

kerchunk.utils.rename_target(refs, renames)[source]

Utility to change URLs in a reference set in a predictable way

For reference sets including templates, this is more easily done by using template overrides at access time; but rewriting the references and saving a new file means not having to do that every time.

Parameters:
refs: dict

Reference set

renames: dict[str, str]

Mapping from the old URL (including protocol, if this is how they appear in the original) to new URL

Returns:
dict: the altered reference set, which can be saved
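
A sketch of renaming the data URL in an existing reference set; the paths are illustrative:

    import json
    from kerchunk.utils import rename_target

    with open("combined.json") as f:        # an existing reference set
        refs = json.load(f)
    new_refs = rename_target(
        refs, {"s3://old-bucket/data.nc": "s3://new-bucket/data.nc"}
    )
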
kerchunk.utils.rename_target_files(url_in, renames, url_out=None, storage_options_in=None, storage_options_out=None)[source]

Perform URL renames on a reference set - read and write from JSON

Parameters:
url_in: str

Original JSON reference set

renames: dict

URL renamings to perform (see rename_target)

url_out: str | None

Where to write to. If None, overwrites original

storage_options_in: dict | None

passed to fsspec for opening url_in

storage_options_out: dict | None

passed to fsspec for opening url_out. If None, storage_options_in is used.

Returns:
None
kerchunk.tiff.generate_coords(attrs, shape)[source]

Produce coordinate arrays for given variable

Specific to GeoTIFF input attributes

Parameters:
attrs: dict

Containing the geoTIFF tags, probably the root group of the dataset

shape: tuple[int]

The array size in numpy (C) order

kerchunk.utils.subchunk(store, variable, factor)[source]

Split uncompressed chunks into integer subchunks on the largest axis

Parameters:
store: dict

reference set

variable: str

the named zarr variable (give as /-separated path if deep)

factor: int

the number of chunks each input chunk turns into. Must be an exact divisor of the original largest dimension length.

Returns:
modified store
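
A sketch splitting each chunk of an uncompressed variable into four pieces; "data" is a hypothetical variable name, and 4 must divide its largest dimension:

    import json
    from kerchunk.utils import subchunk

    with open("refs.json") as f:            # an existing reference set
        store = json.load(f)
    store = subchunk(store, variable="data", factor=4)
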
kerchunk.utils.dereference_archives(references, remote_options=None)[source]

Directly point to uncompressed byte ranges in ZIP/TAR archives

If a set of references have been made for files contained within ZIP or (uncompressed) TAR archives, the “zip://…” and “tar://…” URLs should be converted to byte ranges in the overall file.

Parameters:
references: dict

a simple reference set

remote_options: dict or None

For opening the archives

kerchunk.utils.consolidate(refs)[source]

Turn raw references into output

kerchunk.utils.do_inline(store, threshold, remote_options=None, remote_protocol=None)[source]

Replace short chunks with the value of that chunk and inline metadata

The chunk may need encoding with base64 if not ascii, so actual length may be larger than threshold.

kerchunk.utils.inline_array(store, threshold=1000, names=None, remote_options=None)[source]

Inline whole arrays by threshold or name, replace with a single metadata chunk

Inlining whole arrays results in fewer keys. If the constituent keys were already inlined, this also results in a smaller file overall. No action is taken for arrays that are already a single chunk (those can be inlined with do_inline, if desired).

Parameters:
store: dict/JSON file

reference set

threshold: int

Size in bytes below which to inline. Set to 0 to prevent inlining by size

names: list[str] | None

If the array name (as a dotted full path) appears in this list, it will be inlined irrespective of the threshold size. Useful for coordinates.

remote_options: dict | None

Needed to fetch data, if the required keys are not already individually inlined in the data.

Returns:
amended references set (simple style)
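
A sketch inlining small arrays plus a named coordinate; the reference file and the "time" name are illustrative:

    import json
    from kerchunk.utils import inline_array

    with open("refs.json") as f:
        refs = json.load(f)
    refs = inline_array(
        refs, threshold=1000, names=["time"], remote_options={"anon": True}
    )
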
kerchunk.df.refs_to_dataframe(fo, url, target_protocol=None, target_options=None, storage_options=None, record_size=100000, categorical_threshold=10)[source]

Write references as a parquet file store.

The directory structure should mimic a normal zarr store but instead of standard chunk keys, references are saved as parquet dataframes.

Parameters:
fo: str | dict

Location of a JSON file containing references or a reference set already loaded into memory.

url: str

Location for the output, together with protocol. This must be a writable directory.

target_protocol: str

Used for loading the reference file, if it is a path. If None, protocol will be derived from the given path

target_options: dict

Extra FS options for loading the reference file fo, if given as a path

storage_options: dict | None

Passed to fsspec for writing the parquet.

record_size: int

Number of references to store in each reference file (default 100,000). Bigger values mean fewer read requests but larger memory footprint.

categorical_threshold: int

Encode urls as pandas.Categorical to reduce memory footprint if the ratio of the number of unique urls to total number of refs for each variable is greater than or equal to this number. (default 10)
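
A sketch converting a JSON reference set into a parquet store; both paths are illustrative. The resulting directory can later be opened with fsspec's "reference" filesystem by pointing fo= at it.

    from kerchunk.df import refs_to_dataframe

    refs_to_dataframe("combined.json", "combined.parq", record_size=100_000)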