References specification
========================

The content of a reference set should match the description given here;
``fsspec``'s ``ReferenceFileSystem`` expects this kind of input.

Version 0
---------

Prototype spec for the structure required by ``ReferenceFileSystem``:

.. code-block:: json

    {
      "key0": "data",
      "key1": ["protocol://target_url", 10000, 100]
    }

where:

* ``key0`` includes data as-is (stored as text)
* ``key1`` refers to a data file URL, the offset within the file (in bytes),
  and the length of the data item (in bytes).

For example, Zarr data in this proposed spec might be represented as:

.. code-block:: json

    {
      ".zgroup": "{\n    \"zarr_format\": 2\n}",
      ".zattrs": "{\n    \"Conventions\": \"UGRID-0.9.0\"\n}",
      "x/.zattrs": "{\n    \"_ARRAY_DIMENSIONS\": [\n        \"node\"\n ...",
      "x/.zarray": "{\n    \"chunks\": [\n        9228245\n    ],\n    \"compressor\": null,\n    \"dtype\": \"<f8\",\n ...",
      "x/0": ["s3://bucket/path/file.nc", 294094376, 73825960]
    }

Parquet references
------------------

Since JSON is rather verbose, it is easy, with enough chunks, to make a
references file that is too big: slow to load and heavy on memory. Although
the former can be alleviated by compression (I recommend Zstd), the latter
cannot. This becomes particularly apparent during the combine phase, when
many reference sets are loaded at once.

The class `fsspec.implementations.reference.LazyReferenceMapper`_ provides an
alternative *implementation* whose on-disk layout is effectively a new
reference spec, described here. The class itself has a dict mapper interface,
just like references rendered from JSON files, except that it assumes it is
working on a zarr dataset. This is because the references are split across
files, and an array's shape/chunk information is used to figure out which
reference file to load.

.. _fsspec.implementations.reference.LazyReferenceMapper: https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=lazyreference#fsspec.implementations.reference.LazyReferenceMapper

The following code

.. code-block:: python

    import fsspec
    import zarr

    lz = fsspec.implementations.reference.LazyReferenceMapper.create("ref.parquet")
    z = zarr.open_group(lz, mode="w")
    d = z.create_dataset("name", shape=(1,))
    d[:] = 1
    g2 = z.create_group("deep")
    d = g2.create_dataset("name", shape=(1,))
    d[:] = 1

produces the files

.. code-block:: text

    ref.parquet/deep/name/refs.0.parq
    ref.parquet/name/refs.0.parq
    ref.parquet/.zmetadata

Here, ``.zmetadata`` holds the metadata of all subgroups and arrays (similar
to zarr "consolidated metadata"), with two top-level fields: "metadata"
(a dict[str, str] of all the zarr metadata key/values) and "record_size",
an integer set during ``.create()``.

Each parquet file contains the references for keys under the path where it
sits. For example, key "name/0" will be the zeroth reference in
"./name/refs.0.parq". If there are multiple dimensions, normal C index
ordering is used to find the Nth reference. Each file holds up to
"record_size" references (default 10000), so references 0-9999 are in
"./name/refs.0.parq", references 10000-19999 in "./name/refs.1.parq", and so
on (a sketch of this mapping is given below). Each file is (for now) padded
to record_size rows, but they compress really well.

Each row of the parquet data contains the fields

.. code-block:: text

    path: optional str/categorical, remote location URL
    offset: int, start location of block
    size: int, number of bytes in block
    raw: optional bytes, binary data

If ``raw`` is populated, it holds the data of the key. If ``path`` is
populated but ``size`` is 0, the reference is to the whole of the indicated
file (like a JSON ``[url]`` reference). Otherwise, it is a byte block within
the indicated file (like a JSON ``[url, offset, size]`` reference). If both
``raw`` and ``path`` are NULL, the key does not exist.
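As a minimal sketch of this decision logic, assuming the four row fields are
already in hand (the ``resolve_row`` helper and its return convention are
hypothetical, purely for illustration; the real logic lives inside
``LazyReferenceMapper``/``ReferenceFileSystem``):

.. code-block:: python

    from typing import Optional, Tuple, Union


    def resolve_row(
        path: Optional[str],
        offset: int,
        size: int,
        raw: Optional[bytes],
    ) -> Union[bytes, Tuple]:
        """Interpret one parquet reference row, following the rules above."""
        if raw is not None:
            # Inline data: the bytes of the key are stored directly in the row.
            return raw
        if path is not None and size == 0:
            # Whole-file reference, equivalent to a JSON ["url"] entry.
            return (path,)
        if path is not None:
            # Byte-range reference, like a JSON ["url", offset, size] entry.
            return (path, offset, size)
        # Both raw and path are NULL: the key does not exist.
        raise KeyError("no such reference")

For example, ``resolve_row("s3://bucket/file.nc", 294094376, 73825960, None)``
yields a byte-range reference, while ``resolve_row(None, 0, 0, b"...")``
returns inline data directly.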
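Similarly, the mapping from a chunk key to a particular parquet file and row,
described above, can be sketched as follows (the ``locate_chunk`` helper and
its arguments are hypothetical; the real lookup uses the chunk grid recorded
in the zarr metadata):

.. code-block:: python

    def locate_chunk(key: str, chunks_per_dim: tuple, record_size: int = 10000):
        """Map a zarr chunk key such as "x/1.2" to (parquet file, row number).

        ``chunks_per_dim`` is the number of chunks along each dimension of
        the array, derived from its shape/chunk metadata.
        """
        path, _, chunk_id = key.rpartition("/")
        indices = [int(i) for i in chunk_id.split(".")]
        # Normal C (row-major) ordering of the chunk grid gives the linear index.
        n = 0
        for idx, nchunks in zip(indices, chunks_per_dim):
            n = n * nchunks + idx
        record, row = divmod(n, record_size)
        return f"./{path}/refs.{record}.parq", row


    locate_chunk("deep/name/0", (1,))    # -> ('./deep/name/refs.0.parq', 0)
    locate_chunk("x/12345", (20000,))    # -> ('./x/refs.1.parq', 2345)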
We reserve the possibility to store small array data in ``.zmetadata``
instead of creating a small or mostly empty parquet file for each such
array.
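As a usage sketch for reading such a reference set back (assuming a recent
``fsspec`` in which passing a directory as ``fo`` to the "reference"
filesystem loads it lazily via ``LazyReferenceMapper``, and the same zarr
version used in the writing example above):

.. code-block:: python

    import fsspec
    import zarr

    # Reopen the parquet reference set written above. For references that
    # point at remote data, one would also pass e.g. remote_protocol="s3".
    fs = fsspec.filesystem("reference", fo="ref.parquet")
    z = zarr.open_group(fs.get_mapper(""), mode="r")
    print(z["name"][:])        # [1.]
    print(z["deep/name"][:])   # [1.]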