API Reference
File format backends
- ``kerchunk.hdf.SingleHdf5ToZarr``: Translate the content of one HDF5 file into Zarr metadata.
- ``kerchunk.grib2.scan_grib``: Generate references for a GRIB2 file
- ``kerchunk.fits.process_file``: Create JSON references for a single FITS file as a zarr group
- ``kerchunk.tiff.tiff_to_zarr``: Wraps TIFFFile's fsspec writer to extract metadata as attributes
- ``kerchunk.netCDF3.NetCDF3ToZarr``: Generate references for a netCDF3 file
- ``kerchunk.hdf4.HDF4ToZarr``: Experimental: interface to HDF4 archival files
- class kerchunk.hdf.SingleHdf5ToZarr(h5f: BinaryIO | str | File | Group, url: str | None = None, spec=1, inline_threshold=500, storage_options=None, error='warn', vlen_encode='embed', out=None)[source]
Translate the content of one HDF5 file into Zarr metadata.
HDF5 groups become Zarr groups. HDF5 datasets become Zarr arrays. Zarr array chunks remain in the HDF5 file.
- Parameters:
- h5f: file-like or str
Input HDF5 file. Can be a binary Python file-like object (duck-typed; adhering to BinaryIO is optional), in which case url must also be provided. If a str, the file will be opened using fsspec and storage_options.
- url: string
URI of the HDF5 file, if passing a file-like object or h5py File/Group
- spec: int
The version of output to produce (see README of this repo)
- inline_threshold: int
Include chunks smaller than this value directly in the output. Zero or negative to disable.
- storage_options: dict
passed to fsspec if h5f is a str
- error: “warn” (default) | “pdb” | “ignore” | “raise”
- vlen_encode: [“embed”, “null”, “leave”, “encode”]
What to do with VLEN string variables or columns of tabular variables:
- leave: pass through the 16-byte garbage IDs unaffected, but requires no codec
- null: set all the strings to None or empty; requires that this library is available at read time
- embed: include all the values in the output JSON (should not be used for large tables)
- encode: save the ID-to-value mapping in a codec, to produce the real values at read time; requires this library to be available. Can be efficient storage where there are few unique values.
- out: dict-like or None
This allows you to supply an fsspec.implementations.reference.LazyReferenceMapper to write out parquet as the references get filled, or some other dictionary-like class to customise how references get stored
Methods
- translate([preserve_linked_dsets]): Translate content of one HDF5 file into Zarr storage format.
- translate(preserve_linked_dsets=False)[source]
Translate content of one HDF5 file into Zarr storage format.
This method is the main entry point to execute the workflow, and returns a “reference” structure to be used with zarr/kerchunk
No data is copied out of the HDF5 file.
- Parameters:
- preserve_linked_dsets: bool (optional, default False)
If True, translate HDF5 soft and hard links for each h5py.Dataset into the reference structure. Requires h5py version 3.11.0 or later. Will not translate external links or links to h5py.Group objects.
- Returns:
- dict
Dictionary containing reference structure.
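A minimal usage sketch follows; the input URL, output filename and inline threshold are hypothetical choices, not values required by the API.

```python
import json

import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://example-bucket/data.h5"  # hypothetical input file
with fsspec.open(url, "rb", anon=True) as f:
    # Scan the HDF5 metadata only; chunk data stays in the original file
    refs = SingleHdf5ToZarr(f, url, inline_threshold=300).translate()

# Save the reference set as JSON for use with fsspec's "reference" filesystem
with open("data.json", "w") as out:
    json.dump(refs, out)
```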
- kerchunk.grib2.scan_grib(url, common=None, storage_options=None, inline_threshold=100, skip=0, filter={}) List[Dict] [source]
Generate references for a GRIB2 file
- Parameters:
- url: str
File location
- common: (deprecated, do not use)
- storage_options: dict
For accessing the data, passed to filesystem
- inline_threshold: int
If given, store array data smaller than this value directly in the output
- skip: int
If non-zero, stop processing the file after this many messages
- filter: dict
keyword filtering. For each key, only messages where the key exists and has the exact given value, or a value in the given set, are processed. E.g., the cf-style filter
{'typeOfLevel': 'heightAboveGround', 'level': 2}
only keeps messages where heightAboveGround==2.
- Returns:
- list(dict): references dicts in Version 1 format, one per message in the file
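A short sketch, assuming a hypothetical GRIB2 file on S3; note that the return value is a list of reference sets, one per message, which are typically combined afterwards (e.g. with MultiZarrToZarr).

```python
from kerchunk.grib2 import scan_grib

# Keep only messages at 2 m height above ground (hypothetical file location)
messages = scan_grib(
    "s3://example-bucket/forecast.grib2",
    storage_options={"anon": True},
    filter={"typeOfLevel": "heightAboveGround", "level": 2},
)
print(len(messages))  # one reference dict per matching message
```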
- kerchunk.fits.process_file(url, storage_options=None, extension=None, inline_threshold=100, primary_attr_to_group=False, out=None)[source]
Create JSON references for a single FITS file as a zarr group
- Parameters:
- url: str
Where the file is
- storage_options: dict
How to load that file (passed to fsspec)
- extension: list(int | str) | int | str or None
Which extensions to include. Can be ordinal integer(s) or the extension name (str); if None, uses the first data extension
- inline_threshold: int
(not yet implemented)
- primary_attr_to_group: bool
Whether the output top-level group contains the attributes of the primary extension (which often contains no data, just a general description)
- out: dict-like or None
This allows you to supply an fsspec.implementations.reference.LazyReferenceMapper to write out parquet as the references get filled, or some other dictionary-like class to customise how references get stored
- Returns:
- dict of the references
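A brief sketch with a hypothetical FITS file; extension=1 is just an illustrative choice.

```python
import json

from kerchunk.fits import process_file

# Reference the first extension of a (hypothetical) remote FITS file
refs = process_file(
    "https://example.org/image.fits",
    extension=1,
)

with open("image.json", "w") as f:
    json.dump(refs, f)
```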
- kerchunk.tiff.tiff_to_zarr(urlpath, remote_options=None, target=None, target_options=None)[source]
Wraps TIFFFile’s fsspec writer to extract metadata as attributes
- Parameters:
- urlpath: str
Location of input TIFF
- remote_options: dict
pass these to fsspec when opening urlpath
- target: str
Write JSON to this location. If not given, no file is output
- target_options: dict
pass these to fsspec when opening target
- Returns:
- references dict
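A minimal sketch; the TIFF location and output path are hypothetical.

```python
from kerchunk.tiff import tiff_to_zarr

# Return the references and also write them to a local JSON file via target=
refs = tiff_to_zarr(
    "s3://example-bucket/scene.tif",
    remote_options={"anon": True},
    target="scene.json",
)
```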
- class kerchunk.netCDF3.NetCDF3ToZarr(filename, storage_options=None, inline_threshold=100, max_chunk_size=0, out=None, **kwargs)[source]
Generate references for a netCDF3 file
Uses scipy’s netCDF3 reader, but only reads the metadata. Note that instances do behave like actual scipy netcdf files, but contain no valid data. Also appears to work for netCDF2, although this is not currently tested.
Methods
- translate(): Produce references dictionary
- __init__(filename, storage_options=None, inline_threshold=100, max_chunk_size=0, out=None, **kwargs)[source]
- Parameters:
- filename: str
location of the input
- storage_options: dict
passed to fsspec when opening filename
- inline_threshold: int
Byte size below which an array will be embedded in the output. Use 0 to disable inlining.
- max_chunk_size: int
How big a chunk can be before triggering subchunking. If 0, there is no subchunking, and there is never subchunking for coordinate/dimension arrays. E.g., if an array contains 10,000 bytes and this value is 6000, there will be two output chunks, split on the biggest available dimension. [TBC]
- out: dict-like or None
This allows you to supply an fsspec.implementations.reference.LazyReferenceMapper to write out parquet as the references get filled, or some other dictionary-like class to customise how references get stored
- args, kwargs: passed to scipy superclass ``scipy.io.netcdf.netcdf_file``
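A minimal sketch with a hypothetical netCDF3 file; only the metadata is read during translation.

```python
import json

from kerchunk.netCDF3 import NetCDF3ToZarr

refs = NetCDF3ToZarr(
    "s3://example-bucket/model_output.nc",  # hypothetical input
    storage_options={"anon": True},
    inline_threshold=100,
).translate()

with open("model_output.json", "w") as f:
    json.dump(refs, f)
```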
Codecs
- ``kerchunk.codecs.GRIBCodec``: Read GRIB stream of bytes as a message using eccodes
- ``kerchunk.codecs.AsciiTableCodec``: Decodes ASCII-TABLE extensions in FITS files
- ``kerchunk.codecs.FillStringsCodec``: Sets fixed-length string fields to empty
- ``kerchunk.codecs.VarArrCodec``: Variable length arrays in a FITS BINTABLE extension
- ``kerchunk.codecs.RecordArrayMember``: Read components of a record array (complex dtype)
- class kerchunk.codecs.GRIBCodec(var, dtype=None)[source]
Read GRIB stream of bytes as a message using eccodes
- class kerchunk.codecs.AsciiTableCodec(indtypes, outdtypes)[source]
Decodes ASCII-TABLE extensions in FITS files
- class kerchunk.codecs.FillStringsCodec(dtype, id_map=None)[source]
Sets fixed-length string fields to empty
To be used with HDF fields of strings, to fill in the values of the opaque 16-byte string IDs.
- class kerchunk.codecs.VarArrCodec(dt_in, dt_out, nrow, types)[source]
Variable length arrays in a FITS BINTABLE extension
Combining
- ``kerchunk.combine.MultiZarrToZarr``: Combine multiple kerchunk'd datasets into a single logical aggregate dataset
- ``kerchunk.combine.merge_vars``: Merge variables across datasets with identical coordinates
- ``kerchunk.combine.concatenate_arrays``: Simple concatenate of one zarr array along an axis
- ``kerchunk.combine.auto_dask``: Batched tree combine using dask.
- ``kerchunk.combine.drop``: Generate example preprocessor removing given fields
- class kerchunk.combine.MultiZarrToZarr(path, indicts=None, coo_map=None, concat_dims=None, coo_dtypes=None, identical_dims=None, target_options=None, remote_protocol=None, remote_options=None, inline_threshold: int = 500, preprocess=None, postprocess=None, out=None)[source]
Combine multiple kerchunk’d datasets into a single logical aggregate dataset
- Parameters:
- path: str, list(str) or list(dict)
Local paths, each containing a references JSON; or a list of references dicts. You may pass a list of reference dicts only, but then they will not have associated filenames; if you need filenames for producing coordinates, pass the list of filenames with path=, and the references with indicts=.
- indicts: list(dict)
- concat_dims: str or list(str)
Names of the dimensions to expand with
- coo_map: dict(str, selector)
The special key "var" means the variable name in the output, which will be "VARNAME" by default (i.e., variable names are the same as in the input datasets). The default for any other coordinate is data:varname, i.e., look for an array with that name.
Selectors ("how to get coordinate values from a dataset") can be:
  - a constant value (usually str for a var name, number for a coordinate)
  - a compiled regex (re.Pattern), which will be applied to the filename. Should return exactly one value
  - a string beginning "attr:", which will fetch this attribute from the zarr dataset of each path
  - a string beginning "vattr:{var}:", as above, but the attribute is taken from the array named var
  - "VARNAME", a special value for where a dataset contains multiple variables: just use the variable names as given
  - "INDEX", a special value for the index of how far through the list of inputs we are so far
  - a string beginning "data:{var}", which will get the appropriate zarr array from each input dataset
  - "cf:{var}", interpret the value of var using cftime, returning a datetime. These will be automatically re-encoded with cftime, unless you specify an "M8[*]" dtype for the coordinate, in which case a conversion will be attempted.
  - a list with the values that are known beforehand
  - a function with signature (index, fs, var, fn) -> value, where index is an int counter, fs is the file system made for the current input, var is the variable we are probing (may be "var"), and fn is the filename or None if dicts were used as input
- coo_dtypes: map(str, str|np.dtype)
Coerce the final type of coordinate arrays (otherwise use numpy default)
- identical_dims: list[str]
Variables that are to be copied across from the first input dataset, because they do not vary.
- target_options: dict
Storage options for opening path
- remote_protocol: str
The protocol of the original data
- remote_options: dict
- inline_threshold: int
Size below which binary blocks are included directly in the output
- preprocess: callable
Acts on the references dict of all inputs before processing. See drop() for an example.
- postprocess: callable
Acts on the references dict before output. postprocess(dict) -> dict
- out: dict-like or None
This allows you to supply an fsspec.implementations.reference.LazyReferenceMapper to write out parquet as the references get filled, or some other dictionary-like class to customise how references get stored
- append: bool
If True, will load the references specified by out and add to them rather than starting from scratch. Assumes the same coordinates are being concatenated.
Methods
- append(path, original_refs[, ...]): Update an existing combined reference set with new references
- translate([filename, storage_options]): Perform all stages and return the resultant references dict
- __init__(path, indicts=None, coo_map=None, concat_dims=None, coo_dtypes=None, identical_dims=None, target_options=None, remote_protocol=None, remote_options=None, inline_threshold: int = 500, preprocess=None, postprocess=None, out=None)[source]
- classmethod append(path, original_refs, remote_protocol=None, remote_options=None, target_options=None, **kwargs)[source]
Update an existing combined reference set with new references
There are two main usage patterns:
- if the input original_refs is JSON, the combine happens in memory and the output should be written to JSON. This could then be optionally converted to parquet in a separate step
- if original_refs is a lazy parquet reference set, then it will be amended in-place
If you want to extend JSON references and output to parquet, you must first convert to parquet in the location you would like the final product to live.
The other arguments should be the same as they were at the creation of the original combined reference set.
NOTE: if the original combine used a postprocess function, appending may not work as intended, since the combine is done “before” postprocessing. Postprocess functions that only add information (such as setting attrs) should be OK.
- Parameters:
- path: list of reference sets to add. If remote/target options would be different to those of original_refs, these can be given as dicts or LazyReferenceMapper instances
- original_refs: combined reference set to be extended
- remote_protocol, remote_options, target_options: referring to ``original_refs``
- kwargs: to MultiZarrToZarr
- Returns:
- MultiZarrToZarr
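A sketch of a typical combine along a time dimension; the input JSON filenames, coordinate names, and the use of cf:time decoding are assumptions about the data, not requirements of the API.

```python
import json

from kerchunk.combine import MultiZarrToZarr

mzz = MultiZarrToZarr(
    ["day1.json", "day2.json", "day3.json"],  # hypothetical per-file references
    concat_dims=["time"],
    coo_map={"time": "cf:time"},            # decode the time coordinate via cftime
    identical_dims=["lat", "lon"],          # assumed not to vary between inputs
    remote_protocol="s3",
    remote_options={"anon": True},
)
combined = mzz.translate()

with open("combined.json", "w") as f:
    json.dump(combined, f)
```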
- kerchunk.combine.merge_vars(files, storage_options=None)[source]
Merge variables across datasets with identical coordinates
- Parameters:
files – list(dict), list(str) or list(fsspec.OpenFile) List of reference dictionaries or list of paths to reference json files to be merged
storage_options – dict Dictionary containing kwargs to fsspec.open_files
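A one-line sketch; the filenames are hypothetical reference sets for different variables on the same grid.

```python
from kerchunk.combine import merge_vars

# Merge per-variable reference sets that share identical coordinates
merged = merge_vars(["temperature.json", "precipitation.json"])
```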
- kerchunk.combine.concatenate_arrays(files, storage_options=None, axis=0, key_seperator='.', path=None, check_arrays=False)[source]
Simple concatenate of one zarr array along an axis
Assumes that each array is identical in shape/type.
If the inputs are groups, provide the path to the contained array, and all other arrays will be ignored. You could concatenate the arrays separately and then recombine them with merge_vars.
- Parameters:
- files: list[dict] | list[str]
Input reference sets, maybe generated by kerchunk.zarr.single_zarr
- storage_options: dict | None
To create the filesystems, such as target/remote protocol and target/remote options
- key_seperator: str
“.” or “/”, how the zarr keys are stored
- path: str or None
If the datasets are groups rather than simple arrays, this is the location in the group hierarchy to concatenate. The group structure will be recreated.
- check_arrays: bool
Whether we check the size and chunking of the inputs. If True, and an inconsistency is found, an exception is raised. If False (default), the user is expected to be certain that the chunking and shapes are compatible.
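A sketch assuming the inputs are groups, each containing an array at the hypothetical path "data".

```python
from kerchunk.combine import concatenate_arrays

refs = concatenate_arrays(
    ["part1.json", "part2.json", "part3.json"],  # hypothetical reference sets
    axis=0,
    path="data",          # hypothetical array path within each group
    check_arrays=True,    # raise if shapes or chunking are inconsistent
)
```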
- kerchunk.combine.auto_dask(urls: List[str], single_driver: type, single_kwargs: dict, mzz_kwargs: dict, n_batches: int, remote_protocol=None, remote_options=None, filename=None, output_options=None)[source]
Batched tree combine using dask.
If you wish to run on a distributed cluster (recommended), create a client before calling this function.
- Parameters:
- urls: list[str]
input dataset URLs
- single_driver: class
class with a translate() method
- single_kwargs: to pass to single-input driver
- mzz_kwargs: passed to ``MultiZarrToZarr`` for each batch
- n_batches: int
Number of MZZ instances in the first combine stage. May be set equal to the number of dask workers, or a multiple thereof.
- remote_protocol: str | None
- remote_options: dict
To fsspec for opening the remote files
- filename: str | None
Output filename, if writing
- output_options
If filename is not None, open it with these options
- Returns:
- reference set
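A sketch of a batched combine; the file URLs, MZZ settings and batch count are illustrative assumptions.

```python
from dask.distributed import Client

from kerchunk.combine import auto_dask
from kerchunk.hdf import SingleHdf5ToZarr

client = Client()  # local cluster here; a distributed cluster is recommended for large jobs

urls = [f"s3://example-bucket/file_{i}.h5" for i in range(100)]  # hypothetical inputs

refs = auto_dask(
    urls,
    single_driver=SingleHdf5ToZarr,
    single_kwargs={"inline_threshold": 300},
    mzz_kwargs={"concat_dims": ["time"], "identical_dims": ["lat", "lon"]},
    n_batches=10,                    # e.g. one batch per dask worker
    remote_protocol="s3",
    remote_options={"anon": True},
    filename="combined.json",        # write the final reference set here
)
```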
Utilities
- ``kerchunk.utils.rename_target``: Utility to change URLs in a reference set in a predictable way
- ``kerchunk.utils.rename_target_files``: Perform URL renames on a reference set - read and write from JSON
- ``kerchunk.utils.subchunk``: Split uncompressed chunks into integer subchunks on the largest axis
- ``kerchunk.utils.dereference_archives``: Directly point to uncompressed byte ranges in ZIP/TAR archives
- ``kerchunk.utils.consolidate``: Turn raw references into output
- ``kerchunk.utils.do_inline``: Replace short chunks with the value of that chunk and inline metadata
- ``kerchunk.utils.inline_array``: Inline whole arrays by threshold or name, replace with a single metadata chunk
- ``kerchunk.df.refs_to_dataframe``: Write references as a parquet files store.
- kerchunk.utils.rename_target(refs, renames)[source]
Utility to change URLs in a reference set in a predictable way
For reference sets including templates, this is more easily done by using template overrides at access time; but rewriting the references and saving a new file means not having to do that every time.
- Parameters:
- refs: dict
Reference set
- renames: dict[str, str]
Mapping from the old URL (including protocol, if this is how they appear in the original) to new URL
- Returns:
- dict: the altered reference set, which can be saved
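A sketch; the reference file and the URL mapping are hypothetical. rename_target_files (below) performs the same operation reading from and writing to JSON directly.

```python
import json

from kerchunk.utils import rename_target

with open("data.json") as f:  # hypothetical existing reference set
    refs = json.load(f)

# Map each old URL to its new location, e.g. after moving the data to S3
new_refs = rename_target(
    refs,
    {"file:///local/data.nc": "s3://example-bucket/data.nc"},
)

with open("data_s3.json", "w") as f:
    json.dump(new_refs, f)
```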
- kerchunk.utils.rename_target_files(url_in, renames, url_out=None, storage_options_in=None, storage_options_out=None)[source]
Perform URL renames on a reference set - read and write from JSON
- Parameters:
- url_in: str
Original JSON reference set
- renames: dict
URL renamings to perform (see rename_target)
- url_out: str | None
Where to write to. If None, overwrites original
- storage_options_in: dict | None
passed to fsspec for opening url_in
- storage_options_out: dict | None
passed to fsspec for opening url_out. If None, storage_options_in is used.
- Returns:
- None
- kerchunk.tiff.generate_coords(attrs, shape)[source]
Produce coordinate arrays for given variable
Specific to GeoTIFF input attributes
- Parameters:
- attrs: dict
Containing the geoTIFF tags, probably the root group of the dataset
- shape: tuple[int]
The array size in numpy (C) order
- kerchunk.utils.subchunk(store, variable, factor)[source]
Split uncompressed chunks into integer subchunks on the largest axis
- Parameters:
- store: dict
reference set
- variable: str
the named zarr variable (give as /-separated path if deep)
- factor: int
the number of chunks each input chunk turns into. Must be an exact divisor of the original largest dimension length.
- Returns:
- modified store
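A sketch; the reference file, variable name, and factor are hypothetical, and the variable's chunks are assumed to be uncompressed.

```python
import json

from kerchunk.utils import subchunk

with open("data.json") as f:  # hypothetical reference set
    store = json.load(f)

# Split each chunk of "temperature" into 4 along its largest axis;
# 4 must exactly divide the length of that largest dimension
store = subchunk(store, "temperature", 4)
```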
- kerchunk.utils.dereference_archives(references, remote_options=None)[source]
Directly point to uncompressed byte ranges in ZIP/TAR archives
If a set of references have been made for files contained within ZIP or (uncompressed) TAR archives, the “zip://…” and “tar://…” URLs should be converted to byte ranges in the overall file.
- Parameters:
- references: dict
a simple reference set
- remote_options: dict or None
For opening the archives
- kerchunk.utils.do_inline(store, threshold, remote_options=None, remote_protocol=None)[source]
Replace short chunks with the value of that chunk and inline metadata
The chunk may need encoding with base64 if not ascii, so actual length may be larger than threshold.
- kerchunk.utils.inline_array(store, threshold=1000, names=None, remote_options=None)[source]
Inline whole arrays by threshold or name, replace with a single metadata chunk
Inlining whole arrays results in fewer keys. If the constituent keys were already inlined, this also results in a smaller file overall. No action is taken for arrays that are already of one chunk.
- Parameters:
- store: dict/JSON file
reference set
- threshold: int
Size in bytes below which to inline. Set to 0 to prevent inlining by size
- names: list[str] | None
If the array name (as a dotted full path) appears in this list, it will be inlined irrespective of the threshold size. Useful for coordinates.
- remote_options: dict | None
Needed to fetch data, if the required keys are not already individually inlined in the data.
- Returns:
- amended references set (simple style)
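A sketch; the reference file, array names, and options are hypothetical.

```python
import json

from kerchunk.utils import inline_array

with open("combined.json") as f:  # hypothetical reference set
    refs = json.load(f)

# Inline small arrays, and always inline the named (hypothetical) coordinate arrays
refs = inline_array(
    refs,
    threshold=1000,
    names=["time", "lat", "lon"],
    remote_options={"anon": True},  # needed if chunk data must be fetched
)
```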
- kerchunk.df.refs_to_dataframe(fo, url, target_protocol=None, target_options=None, storage_options=None, record_size=100000, categorical_threshold=10)[source]
Write references as a parquet files store.
The directory structure should mimic a normal zarr store but instead of standard chunk keys, references are saved as parquet dataframes.
- Parameters:
- fo: str | dict
Location of a JSON file containing references or a reference set already loaded into memory.
- url: str
Location for the output, together with protocol. This must be a writable directory.
- target_protocol: str
Used for loading the reference file, if it is a path. If None, protocol will be derived from the given path
- target_options: dict
Extra FS options for loading the reference file fo, if given as a path
- storage_options: dict | None
Passed to fsspec for writing the parquet.
- record_size: int
Number of references to store in each reference file (default 100000). Bigger values mean fewer read requests but larger memory footprint.
- categorical_threshold: int
Encode urls as pandas.Categorical to reduce memory footprint if the ratio of the number of unique urls to total number of refs for each variable is greater than or equal to this number. (default 10)
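A sketch; the input JSON, output directory, and remote protocol are hypothetical.

```python
import fsspec

from kerchunk.df import refs_to_dataframe

# Convert a (hypothetical) large JSON reference set into a parquet store
refs_to_dataframe("combined.json", "combined.parq", record_size=100000)

# The parquet store can then be opened lazily with fsspec's reference filesystem
fs = fsspec.filesystem(
    "reference",
    fo="combined.parq",
    remote_protocol="s3",
    remote_options={"anon": True},
)
```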