Interactive Lab
Practice in short loops: checkpoint quiz, microtask decision, and competency progress tracking.
Capability target
Produce a scalable, reproducible query-and-analysis plan for a large connectomics dataset, including storage assumptions, indexing strategy, and provenance capture.
Why this module matters
Connectomics is now as data-system-limited as it is algorithm-limited. If learners cannot reason about throughput, storage, and indexing, they cannot execute reliable analyses on real datasets.
Concept set
1) Data architecture is scientific method infrastructure
Technical: storage format, chunking, and indexing influence what questions are tractable.
Plain language: bad architecture can make good science impossible.
Misconception guardrail: compute scale alone does not solve poor data design.
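As a toy illustration of how chunk layout changes what is tractable, the sketch below counts how many stored chunks a box query must read under two hypothetical chunk shapes. The `chunks_touched` helper and the shapes are illustrative assumptions, not taken from any particular store:

```python
import math

def chunks_touched(query_extent, chunk_shape):
    """Upper bound on chunks a box query must read.

    query_extent: (dz, dy, dx) size of the requested subvolume in voxels.
    chunk_shape:  (cz, cy, cx) chunk dimensions of the stored array.
    The +1 per axis is an upper bound assuming worst-case alignment.
    """
    return math.prod(math.ceil(q / c) + 1 for q, c in zip(query_extent, chunk_shape))

# Hypothetical query: a 64 x 2048 x 2048 voxel slab (a thin z-slice read).
slab = (64, 2048, 2048)

# Isotropic chunks suit 3D box queries; flat chunks suit 2D slab reads.
print(chunks_touched(slab, (64, 64, 64)))    # many small reads
print(chunks_touched(slab, (16, 512, 512)))  # fewer, larger reads
```

The same query touches 2,178 chunks under the first layout but only 125 under the second, which is why the storage decision has to be made with the expected access pattern in hand.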
2) Query cost is a research variable
Technical: query plans and index locality affect reproducibility, latency, and iteration speed.
Plain language: how you ask the data matters as much as what you ask.
Misconception guardrail: “it runs eventually” is not acceptable for iterative science.
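The point can be made concrete with a synthetic synapse table: the same question answered by a full scan versus a precomputed index. The row counts and field names here are illustrative, not any real dataset's schema:

```python
# Synthetic synapse table: 100,000 rows, 500 distinct presynaptic IDs.
rows = [{"pre_id": i % 500, "post_id": i, "weight": 1.0} for i in range(100_000)]

def scan_synapses(rows, pre_id):
    # Full scan: O(n), touches every row regardless of selectivity.
    return [r for r in rows if r["pre_id"] == pre_id]

def build_index(rows):
    # One-time O(n) build; afterwards each lookup is O(k) in result size.
    index = {}
    for r in rows:
        index.setdefault(r["pre_id"], []).append(r)
    return index

index = build_index(rows)
assert scan_synapses(rows, 42) == index[42]  # same answer, very different cost
print(len(index[42]))  # 200 synapses for this presynaptic neuron
```

In an iterative analysis, the scan pays its O(n) cost on every question; the index pays it once.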
3) Provenance must be first-class
Technical: every output should include dataset version, query definition, environment, and transform lineage.
Plain language: if you cannot reconstruct your output path, you cannot defend your result.
Misconception guardrail: notebook history alone is insufficient provenance.
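A minimal sketch of what a first-class provenance record might look like in Python. The field names, the table, and the `minnie65_v117` release tag are illustrative assumptions, not a published schema:

```python
from dataclasses import dataclass, asdict
import hashlib, json, platform, sys

@dataclass(frozen=True)
class ProvenanceRecord:
    """Minimal provenance for one analysis output (illustrative schema)."""
    dataset_version: str      # dataset release/materialization tag
    query_sql: str            # exact query text that produced the output
    transform_lineage: tuple  # ordered post-query transforms applied
    python_version: str = sys.version.split()[0]
    platform_tag: str = platform.platform()

    def query_hash(self) -> str:
        # Stable short fingerprint of the query text, for figure captions.
        return hashlib.sha256(self.query_sql.encode()).hexdigest()[:12]

rec = ProvenanceRecord(
    dataset_version="minnie65_v117",  # hypothetical release tag
    query_sql="SELECT pre_id, post_id FROM synapses WHERE weight > 0.5",
    transform_lineage=("filter_autapses", "dedupe_edges"),
)
print(json.dumps({**asdict(rec), "query_hash": rec.query_hash()}, sort_keys=True))
```

Emitting this record next to every figure table is what makes "reconstruct the output path" a mechanical exercise rather than an archaeology project.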
Hidden curriculum scaffold
Unwritten engineering expectations in connectomics teams:
benchmark before optimizing.
record query versions for every figure and table.
separate exploratory scripts from release pipelines.
How to teach explicitly:
require query provenance fields in assignments.
include failure-postmortem mini-reviews.
grade reproducibility alongside correctness.
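The "benchmark before optimizing" expectation can be taught with nothing more than the standard library. In this sketch, the two membership tests stand in for two formulations of the same query; the data and functions are illustrative:

```python
import timeit

data = list(range(50_000))

def membership_list(x):
    # O(n) scan, analogous to an unindexed query.
    return x in data

data_set = set(data)

def membership_set(x):
    # O(1) hash lookup, analogous to an indexed query.
    return x in data_set

t_list = timeit.timeit(lambda: membership_list(49_999), number=200)
t_set = timeit.timeit(lambda: membership_set(49_999), number=200)
print(t_set < t_list)  # the measurement, not intuition, justifies the change
```

The habit to instill is that the optimized variant is only adopted after the numbers are in, and the numbers are recorded alongside the change.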
Core workflow: scalable query planning
1. Define analysis question and required data granularity.
2. Select storage/index strategy aligned to access pattern.
3. Prototype baseline query and profile bottlenecks.
4. Add provenance logging and version controls.
5. Validate reproducibility and publish query package.
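The workflow above can be sketched end-to-end against an in-memory SQLite store standing in for a real connectomics backend. The schema, toy data, and `run_provenanced_query` helper are illustrative assumptions:

```python
import sqlite3, time, hashlib

def run_provenanced_query(conn, sql, dataset_version):
    """Execute a query, profile it, and return rows plus a provenance stub."""
    t0 = time.perf_counter()
    result = conn.execute(sql).fetchall()
    elapsed = time.perf_counter() - t0
    return result, {
        "dataset_version": dataset_version,
        "query_sha256": hashlib.sha256(sql.encode()).hexdigest(),
        "row_count": len(result),
        "elapsed_s": round(elapsed, 6),
    }

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE synapses (pre_id INT, post_id INT, weight REAL)")
conn.executemany("INSERT INTO synapses VALUES (?, ?, ?)",
                 [(i % 10, i, 0.1 * (i % 7)) for i in range(1000)])
# Step 2: index aligned to the expected access pattern (lookups by pre_id).
conn.execute("CREATE INDEX idx_pre ON synapses (pre_id)")

rows, prov = run_provenanced_query(
    conn, "SELECT post_id FROM synapses WHERE pre_id = 3", "toy_v1")
print(prov["row_count"])  # 100
```

Publishing the query package then means shipping the SQL, the provenance stub, and the index definitions together, so the result can be regenerated from the versioned store.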
60-minute tutorial run-of-show
**00:00-08:00** Architecture framing and failure examples
**08:00-20:00** Access-pattern-to-index mapping exercise
**20:00-34:00** Query profiling and bottleneck diagnosis
**34:00-46:00** Provenance logging implementation
**46:00-56:00** Team review of reproducibility gaps
**56:00-60:00** Competency check and next-step assignment
Studio activity: petascale query design lab
Scenario: Your team must deliver a weekly motif-analysis report from a multi-terabyte connectomics store.
Tasks
Propose storage/index layout for expected query patterns.
Write or outline two critical queries and estimate performance risk.
Define minimum provenance fields for outputs.
Produce one optimization proposal and one reproducibility safeguard.
Expected outputs
Query architecture sketch.
Baseline vs optimized query plan.
Provenance checklist.
Assessment rubric
Minimum pass
Query design matches analysis goal and data shape.
Provenance requirements are explicit and actionable.
Bottlenecks are identified with one realistic mitigation.
To ground the abstract concepts, here are the data scales learners will encounter:
| Dataset | Raw volume | Neurons | Synapses | Storage |
| --- | --- | --- | --- | --- |
| MICrONS (minnie65) | 1 mm³ mouse V1 | ~80,000 | ~500M | ~2 PB |
| H01 | ~1 mm³ human temporal cortex | ~57,000 cells | ~150M | ~1.4 PB |
| FlyWire | Whole adult Drosophila brain | ~139,255 | ~54.5M | ~100 TB |
| MouseConnects (planned) | ~10 mm³ mouse hippocampus | TBD | TBD | >10 PB |
Teaching point: “When your synapse table has 500 million rows, a poorly written query doesn’t just run slowly — it may not finish at all. Architecture decisions determine whether your science is feasible.”
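One way to demonstrate this point in class is to ask the planner itself, using SQLite as a stand-in for a real connectomics store (the schema is illustrative): the same question asked two ways yields very different plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE synapses (pre_id INT, post_id INT, weight REAL)")
conn.execute("CREATE INDEX idx_pre ON synapses (pre_id)")

# Indexed predicate: the planner can seek straight to the matching rows.
plan_indexed = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM synapses WHERE pre_id = 7").fetchall()

# Wrapping the column in a function defeats the index and forces a scan.
plan_scan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM synapses WHERE abs(pre_id) = 7").fetchall()

print(plan_indexed[0][3])  # a SEARCH step using idx_pre
print(plan_scan[0][3])     # a SCAN over the whole table
```

On a toy table the difference is cosmetic; on a 500-million-row synapse table, the second plan is the one that never finishes.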