Provenance and Versioning

Overview

A connectome is not a static object. It evolves continuously as proofreaders correct errors, new segmentation models are applied, annotations are added, and analyses reveal regions needing further review. Without rigorous version control and provenance tracking, it becomes impossible to reproduce a published result, diagnose an unexpected finding, or compare analyses performed at different times.

This document covers the principles and practical systems for maintaining data lineage in connectomics.

Instructor script: why provenance matters

The reproducibility challenge

Consider this scenario: A paper published in 2024 reports that a specific circuit motif is enriched 3.2× in mouse visual cortex. In 2025, another group queries the same dataset and finds only 1.8× enrichment. Is the difference:

(a) A real scientific disagreement about methods? (b) A change in the underlying data — proofreading corrections since 2024 altered the graph? (c) A difference in which version of the synapse detection was used? (d) A software bug in one of the analyses?

Without provenance, answering this question requires extensive detective work. With provenance, you can immediately identify which data version, segmentation version, and synapse detection version each analysis used, and pinpoint where results diverged.

The FAIR principle applied to connectomics

Connectomics data should be Findable, Accessible, Interoperable, and Reproducible (Wilkinson et al. 2016). Provenance is the backbone of reproducibility:

Every analysis result should cite the exact dataset version used
Every dataset version should record the processing pipeline that created it
Every processing pipeline should record its code version, model version, and parameters
Every proofreading edit should record who made it, when, and why

CAVE: Connectome Annotation Versioning Engine

Core architecture

CAVE (Dorkenwald et al. 2022) is the most widely used versioning system for large-scale connectomics. It provides:

Chunked segmentation graph: The segmentation is stored as a graph of supervoxels (small, atomically correct fragments). Proofreading edits (merges and splits) are graph operations — adding or removing edges between supervoxels. The segmentation volume itself is never rewritten.
Annotation tables: Synapses, cell-type labels, and other annotations are stored in database tables with spatial coordinates. Each annotation records which segment it belongs to (via the supervoxel it falls within).
Materialization: Periodically (daily to weekly), CAVE takes a snapshot (“materialization”) that freezes the state of the segmentation graph and all annotation tables. A materialization version is a complete, self-consistent view of the connectome at a specific point in time.

How materialization works

When you “materialize” at version N:

The segmentation graph is resolved: every supervoxel’s current root segment ID is computed by traversing the edit history up to version N.
All annotations are updated: each annotation’s segment ID is recomputed based on the version-N segmentation.
The result is a table where every synapse, every cell label, and every segment is consistent — as if the entire dataset were re-segmented from scratch with all proofreading edits applied.

Key insight: Materialization decouples the time of analysis from the time of proofreading. You can always go back to a specific materialization version and get the exact same results.

Practical usage

# Pseudocode for reproducible analysis
client = CAVEclient("minnie65_public")

# Pin to a specific materialization version
mat_version = 943  # the version used in my paper

# Query the connectivity graph at that exact version
synapses = client.materialize.synapse_query(
    pre_ids=[my_neuron_id],
    materialization_version=mat_version
)

# This query will return the same results today, tomorrow, and in 5 years
# regardless of subsequent proofreading edits

Projects using CAVE

FlyWire (Dorkenwald et al. 2024): Entire Drosophila brain, ~140K neurons
MICrONS (minnie65, minnie35): Mouse visual cortex volumes
Allen Institute datasets: Multiple mouse brain regions

Pipeline provenance

What to record at each stage

For every computational step in the reconstruction pipeline:

Stage	Required provenance
Raw ingest	Microscope instrument ID, acquisition date, operator, imaging parameters (see acquisition-qa.md)
Alignment	Input section IDs, alignment software version (git hash), transform parameters, registration residuals
Segmentation	Input volume version, model artifact ID (hash of trained weights), inference parameters (threshold, chunk size), software version
Agglomeration	Segmentation version, agglomeration parameters (size threshold, affinity threshold), software version
Synapse detection	Input volume + segmentation version, synapse model ID, detection parameters, software version
Proofreading	Editor ID, timestamp, operation type (merge/split), affected supervoxels, before/after state
Analysis	All input data versions (materialization number), analysis code version, parameters, random seeds

Implementation patterns

Option 1: Inline metadata — Each output file/chunk carries its provenance as attributes (HDF5 attributes, Zarr metadata, JSON sidecar files). Simple but can become unwieldy for complex pipelines.

Option 2: Provenance database — A central database records every processing step with inputs, outputs, parameters, and timestamps. Query-friendly but requires infrastructure.

Option 3: Workflow managers — Tools like Nextflow, Snakemake, or Airflow automatically track input/output dependencies and record execution metadata. Best for reproducible pipeline execution.

Recommended practice: Combine all three. Workflow manager for execution tracking, inline metadata for self-describing outputs, and a database for cross-pipeline queries.

Version control for analysis code

The minimum standard

Every analysis script, notebook, or pipeline used to generate a figure or result in a publication should be:

Under git version control — with the exact commit hash recorded alongside the result
Dependency-pinned — exact versions of all libraries (requirements.txt, conda environment.yml, or Docker image hash)
Parameterized — all parameters (thresholds, random seeds, dataset versions) as explicit configuration, not hardcoded values
Deterministic — same inputs + same parameters + same code → same outputs. Pin random seeds. Avoid non-deterministic GPU operations (or document them).

Docker/container reproducibility

For maximum reproducibility, package the entire analysis environment as a Docker container:

FROM python:3.11-slim
RUN pip install caveclient==5.15.0 networkx==3.2.1 numpy==1.26.2
COPY analysis/ /app/analysis/
ENTRYPOINT ["python", "/app/analysis/run_motif_search.py"]

Record the Docker image hash alongside results. Anyone can re-run the analysis years later with the exact same environment.

Worked example: publishing a reproducible connectomics result

Scenario: You’re writing a paper showing that reciprocal connections between layer 2/3 pyramidal cells are 4.2× enriched relative to a degree-preserving null model.

Reproducibility checklist:

Dataset version: “All analyses used MICrONS minnie65_public, CAVE materialization version 943 (2025-01-15).”
Cell selection: “Pyramidal cells identified using cell-type labels from the minnie65_public nucleus detection table, version 943.”
Synapse source: “Synapses from the synapses_pni_2 table, materialized at version 943.”
Thresholds: “We defined connected pairs as those with ≥3 synapses (sensitivity analysis for thresholds 1-10 in Supplementary Figure S3).”
Null model: “Degree-preserving random rewiring (Maslov & Sneppen 2002), 10,000 randomizations, random seed 42.”
Code: “Analysis code available at github.com/lab/reciprocal-motifs, commit abc123.”
Environment: “Docker image lab/reciprocal-motifs:v1.0, sha256:def456.”

With this information, anyone can reproduce the exact result. Without any single element, reproducibility is compromised.

Common misconceptions

Misconception	Reality	Teaching note
“The connectome is finished”	Connectomes are living datasets — proofreading and annotation continue indefinitely	Always cite a specific version
“Git for code is enough”	Code version means nothing without data version and environment version	Track all three together
“Provenance is overhead”	Provenance prevents far more expensive problems: irreproducible results, retracted papers, wasted re-analysis	Build it into the pipeline from day one
“We can always re-run the analysis”	If the data version has changed and you didn’t record which version you used, re-running gives different results	Pin versions at analysis time, not after

References

Dorkenwald S et al. (2022) “CAVE: Connectome Annotation Versioning Engine.” bioRxiv. doi:10.1101/2023.07.26.550598.
Dorkenwald S et al. (2024) “Neuronal wiring diagram of an adult brain.” Nature 634:124-138.
Wilkinson MD et al. (2016) “The FAIR Guiding Principles for scientific data management and stewardship.” Scientific Data 3:160018.
Maslov S, Sneppen K (2002) “Specificity and stability in topology of protein networks.” Science 296(5569):910-913.
Turner NL et al. (2022) “Reconstruction of neocortex: Organelles, compartments, cells, circuits, and activity.” Cell 185(6):1082-1100.