Fill Value Handling¶
This document describes how virtual-tiff handles fill values when mapping TIFF metadata to the Zarr data model, and the design decisions behind the approach.
The problem¶
TIFF files — especially those converted from NetCDF via GDAL — can carry fill value information in multiple redundant places, all stored as strings:
| Source | TIFF location | Example |
|---|---|---|
gdal_no_data |
GDAL_NODATA tag (42113) | "-9999" |
_FillValue |
GDAL metadata XML (tag 42112) | "-9999" |
missing_value |
GDAL metadata XML (tag 42112) | "-9999" |
| Per-variable prefixed | GDAL metadata XML | "swe#_FillValue", "swe#missing_value" |
Meanwhile, the Zarr data model has a single, typed fill_value field on each array, and CF-aware readers like xarray expect _FillValue attributes to be encoded in a specific format.
Naively passing the TIFF's string attributes through to Zarr causes several failures:
- Type mismatch: xarray's
FillValueCoder.decode()expects base64-encoded bytes for float_FillValueattributes, not a plain string like"-9999". Passing the raw string causes astruct.error. - Zarr fill_value defaults to zero: Without parsing the nodata value, the Zarr
fill_valueis set to the dtype default (e.g.,0.0), which is semantically wrong — uninitialized chunks should return the nodata value, not zero. - Conflicting attributes: If
_FillValueandmissing_valueare both present with inconsistent encoding, xarray raises a "multiple fill values" warning or error.
Two distinct fill value concepts¶
The Zarr and CF data models use "fill value" for two different purposes:
- Zarr
fill_value(storage-level) - The default value returned for uninitialized or missing chunks. Set in
ArrayV3Metadata.fill_value. Zarr handles its own serialization — no special encoding needed. - CF
_FillValueattribute (data-level) - A sentinel value that CF-aware readers (xarray, rioxarray) use to mask individual data points as missing within chunks that do contain data. Stored as an array attribute and must be encoded for xarray's
FillValueCoder.
In the original NetCDF files, these are often the same value (e.g., -9999), serving both roles. The TIFF conversion collapses both into string metadata, losing the distinction. Virtual-tiff must reconstruct it.
The conversion chain¶
Understanding the data flow explains why type information is lost:
NetCDF (typed attributes)
_FillValue: float32(-9999.0)
missing_value: float32(-9999.0)
│
▼ GDAL conversion (gdal_translate, rio.to_raster, etc.)
GeoTIFF (string metadata)
GDAL_NODATA tag: "-9999" ← ASCII string
GDAL metadata XML:
_FillValue: "-9999" ← text in XML
missing_value: "-9999" ← text in XML
swe#_FillValue: "-9999" ← per-variable duplicate
│
▼ virtual-tiff parser
Virtual Zarr store
fill_value: float32(-9999.0) ← properly typed
_FillValue: "AAAAAICHw8A=" ← base64-encoded for xarray
missing_value: -9999.0 ← plain numeric for xarray
Consolidation approach¶
The _consolidate_fill_value function in parser.py handles this mapping. It runs after _get_attributes() extracts the raw TIFF metadata and before ArrayV3Metadata is constructed.
Step 1: Extract¶
Collect string values from the three known fill value attribute keys (_FillValue, missing_value, gdal_no_data). Only string values are collected — if an attribute is already numeric or encoded, it is left untouched.
Step 2: Parse¶
Each string value is normalized and cast to the array's numpy dtype. MSVC-style infinity and NaN representations (e.g., "-1.#INF", "1.#QNAN", "-1.#IND") are normalized to their standard forms ("-inf", "nan") before casting. The cast uses np.dtype(dtype).type(value) — for example, "-9999" for a float32 array becomes np.float32(-9999.0).
If a value cannot be represented in the target dtype (e.g., a GDAL_NODATA value of -32768 on a uint8 array), a ValueError is raised. This is treated as an error rather than silently ignored because an out-of-range nodata value indicates a mismatch between the file's metadata and its storage dtype that should be investigated.
Step 3: Validate¶
All parsed values are compared. If they disagree, a warning is emitted. This can happen when:
- The source file used different nodata values for storage vs. masking
- A conversion tool introduced a discrepancy
- Precision was lost in the string round-trip (e.g.,
float32→ string →float64)
Step 4: Select authoritative value¶
Priority order:
gdal_no_data— the TIFF-native nodata value from the GDAL_NODATA tag. This is the most direct representation of the file's intended nodata._FillValue— CF convention attribute, reconstructed from GDAL metadata XML.missing_value— older CF convention attribute.
The selected value becomes both the Zarr fill_value and the basis for encoded attributes.
Step 5: Encode and clean up¶
_FillValueis encoded viaFillValueCoder.encode()(e.g., base64 for floats) because xarray's Zarr backend decodes this attribute throughFillValueCoder.decode().missing_valueis set as a plain Python numeric (via.item()) because xarray passes this attribute through without decoding — it must already be a native numeric type.gdal_no_datais left as the original string, since neither xarray nor rioxarray process it.- Per-variable prefixed duplicates (e.g.,
swe#_FillValue,swe#missing_value) are removed when they carry the same value as the top-level attribute, to reduce clutter.
Downstream reader compatibility¶
xarray¶
xarray recognizes _FillValue and missing_value as CF masking attributes. Its Zarr backend decodes _FillValue via FillValueCoder but passes missing_value through as-is. Both must decode to the same numeric value, or xarray raises a "multiple fill values" warning.
The use_zarr_fill_value_as_mask option (default: None/False for open_zarr) controls whether the Zarr-level fill_value is also treated as a masking sentinel. With our approach, the Zarr fill_value and the decoded _FillValue attribute are the same value, so both paths produce correct results.
rioxarray¶
rioxarray's nodata property checks attributes in this order: _FillValue, missing_value, fill_value, nodata, then falls back to rasterio's DatasetReader.nodata. It does not recognize gdal_no_data by name. When data is accessed through a virtual Zarr store (no rasterio file manager), rioxarray relies entirely on the attribute chain — so properly encoding _FillValue is essential.
xarray's FillValueCoder dtype support¶
Not all dtypes are supported by xarray's FillValueCoder. The vendored FillValueCoder in virtual-tiff supports the following types:
| dtype kind | Encoding format |
|---|---|
f (float) |
base64-encoded little-endian double |
c (complex) |
list of two base64-encoded little-endian doubles [real, imag] |
iu (integer) |
Python int |
b (boolean) |
Python bool |
U (unicode string) |
Python str |
S (byte string) |
base64-encoded bytes |
Complex dtype support has been added to the vendored FillValueCoder but has not yet been merged into xarray's released version. Structured, compound, and datetime dtypes are not supported and will raise a ValueError during encoding or decoding.
When no fill value is present¶
If none of the three fill value keys are present as string attributes, _consolidate_fill_value returns None and the caller falls back to the dtype's default scalar (e.g., 0.0 for float32, 0 for int16). This is correct for TIFF files that have no nodata concept — the Zarr fill_value is only meaningful for uninitialized chunks, and no _FillValue attribute is emitted.