Interactive Lab
Practice in short loops: checkpoint quiz, microtask decision, and competency progress tracking.
Capability target
Produce a scalable, reproducible query-and-analysis plan for a large connectomics dataset, including storage assumptions, indexing strategy, and provenance capture.
Why this module matters
Connectomics is now as data-system-limited as it is algorithm-limited. If learners cannot reason about throughput, storage, and indexing, they cannot execute reliable analyses on real datasets.
Concept set
1) Data architecture is scientific method infrastructure
Technical: storage format, chunking, and indexing influence what questions are tractable.
Plain language: bad architecture can make good science impossible.
Misconception guardrail: compute scale alone does not solve poor data design.
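As a toy illustration of how chunk layout changes what is tractable, the sketch below counts how many stored chunks a box query must read under two hypothetical chunk shapes. The `chunks_touched` helper and the shapes are illustrative assumptions, not taken from any particular store:

```python
import math

def chunks_touched(query_extent, chunk_shape):
    """Upper bound on chunks a box query must read.

    query_extent: (dz, dy, dx) size of the requested subvolume in voxels.
    chunk_shape:  (cz, cy, cx) chunk dimensions of the stored array.
    The +1 per axis is an upper bound assuming worst-case alignment.
    """
    return math.prod(math.ceil(q / c) + 1 for q, c in zip(query_extent, chunk_shape))

# Hypothetical query: a 64 x 2048 x 2048 voxel slab (a thin z-slice read).
slab = (64, 2048, 2048)

# Isotropic chunks suit 3D box queries; flat chunks suit 2D slab reads.
print(chunks_touched(slab, (64, 64, 64)))    # many small reads
print(chunks_touched(slab, (16, 512, 512)))  # fewer, larger reads
```

The same query touches 2,178 chunks under the first layout but only 125 under the second, which is why the storage decision has to be made with the expected access pattern in hand.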
2) Query cost is a research variable
Technical: query plans and index locality affect reproducibility, latency, and iteration speed.
Plain language: how you ask the data matters as much as what you ask.
Misconception guardrail: “it runs eventually” is not acceptable for iterative science.
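The point can be made concrete with a synthetic synapse table: the same question answered by a full scan versus a precomputed index. The row counts and field names here are illustrative, not any real dataset's schema:

```python
# Synthetic synapse table: 100,000 rows, 500 distinct presynaptic IDs.
rows = [{"pre_id": i % 500, "post_id": i, "weight": 1.0} for i in range(100_000)]

def scan_synapses(rows, pre_id):
    # Full scan: O(n), touches every row regardless of selectivity.
    return [r for r in rows if r["pre_id"] == pre_id]

def build_index(rows):
    # One-time O(n) build; afterwards each lookup is O(k) in result size.
    index = {}
    for r in rows:
        index.setdefault(r["pre_id"], []).append(r)
    return index

index = build_index(rows)
assert scan_synapses(rows, 42) == index[42]  # same answer, very different cost
print(len(index[42]))  # 200 synapses for this presynaptic neuron
```

In an iterative analysis, the scan pays its O(n) cost on every question; the index pays it once.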
3) Provenance must be first-class
Technical: every output should include dataset version, query definition, environment, and transform lineage.
Plain language: if you cannot reconstruct your output path, you cannot defend your result.
Misconception guardrail: notebook history alone is insufficient provenance.
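A minimal sketch of what a first-class provenance record might look like in Python. The field names, the table, and the `minnie65_v117` release tag are illustrative assumptions, not a published schema:

```python
from dataclasses import dataclass, asdict
import hashlib, json, platform, sys

@dataclass(frozen=True)
class ProvenanceRecord:
    """Minimal provenance for one analysis output (illustrative schema)."""
    dataset_version: str      # dataset release/materialization tag
    query_sql: str            # exact query text that produced the output
    transform_lineage: tuple  # ordered post-query transforms applied
    python_version: str = sys.version.split()[0]
    platform_tag: str = platform.platform()

    def query_hash(self) -> str:
        # Stable short fingerprint of the query text, for figure captions.
        return hashlib.sha256(self.query_sql.encode()).hexdigest()[:12]

rec = ProvenanceRecord(
    dataset_version="minnie65_v117",  # hypothetical release tag
    query_sql="SELECT pre_id, post_id FROM synapses WHERE weight > 0.5",
    transform_lineage=("filter_autapses", "dedupe_edges"),
)
print(json.dumps({**asdict(rec), "query_hash": rec.query_hash()}, sort_keys=True))
```

Emitting this record next to every figure table is what makes "reconstruct the output path" a mechanical exercise rather than an archaeology project.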
Hidden curriculum scaffold
Unwritten engineering expectations in connectomics teams:
benchmark before optimizing.
record query versions for every figure and table.
separate exploratory scripts from release pipelines.
How to teach explicitly:
require query provenance fields in assignments.
include failure-postmortem mini-reviews.
grade reproducibility alongside correctness.
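The "benchmark before optimizing" expectation can be taught with nothing more than the standard library. In this sketch, the two membership tests stand in for two formulations of the same query; the data and functions are illustrative:

```python
import timeit

data = list(range(50_000))

def membership_list(x):
    # O(n) scan, analogous to an unindexed query.
    return x in data

data_set = set(data)

def membership_set(x):
    # O(1) hash lookup, analogous to an indexed query.
    return x in data_set

t_list = timeit.timeit(lambda: membership_list(49_999), number=200)
t_set = timeit.timeit(lambda: membership_set(49_999), number=200)
print(t_set < t_list)  # the measurement, not intuition, justifies the change
```

The habit to instill is that the optimized variant is only adopted after the numbers are in, and the numbers are recorded alongside the change.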
Core workflow: scalable query planning
1. Define analysis question and required data granularity.
2. Select storage/index strategy aligned to access pattern.
3. Prototype baseline query and profile bottlenecks.
4. Add provenance logging and version controls.
5. Validate reproducibility and publish query package.
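The workflow above can be sketched end-to-end against an in-memory SQLite store standing in for a real connectomics backend. The schema, toy data, and `run_provenanced_query` helper are illustrative assumptions:

```python
import sqlite3, time, hashlib

def run_provenanced_query(conn, sql, dataset_version):
    """Execute a query, profile it, and return rows plus a provenance stub."""
    t0 = time.perf_counter()
    result = conn.execute(sql).fetchall()
    elapsed = time.perf_counter() - t0
    return result, {
        "dataset_version": dataset_version,
        "query_sha256": hashlib.sha256(sql.encode()).hexdigest(),
        "row_count": len(result),
        "elapsed_s": round(elapsed, 6),
    }

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE synapses (pre_id INT, post_id INT, weight REAL)")
conn.executemany("INSERT INTO synapses VALUES (?, ?, ?)",
                 [(i % 10, i, 0.1 * (i % 7)) for i in range(1000)])
# Step 2: index aligned to the expected access pattern (lookups by pre_id).
conn.execute("CREATE INDEX idx_pre ON synapses (pre_id)")

rows, prov = run_provenanced_query(
    conn, "SELECT post_id FROM synapses WHERE pre_id = 3", "toy_v1")
print(prov["row_count"])  # 100
```

Publishing the query package then means shipping the SQL, the provenance stub, and the index definitions together, so the result can be regenerated from the versioned store.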
60-minute tutorial run-of-show
**00:00-08:00** Architecture framing and failure examples
**08:00-20:00** Access-pattern-to-index mapping exercise
**20:00-34:00** Query profiling and bottleneck diagnosis
**34:00-46:00** Provenance logging implementation
**46:00-56:00** Team review of reproducibility gaps
**56:00-60:00** Competency check and next-step assignment
Studio activity: petascale query design lab
Scenario: Your team must deliver a weekly motif-analysis report from a multi-terabyte connectomics store.
Tasks
Propose storage/index layout for expected query patterns.
Write or outline two critical queries and estimate performance risk.
Define minimum provenance fields for outputs.
Produce one optimization proposal and one reproducibility safeguard.
Expected outputs
Query architecture sketch.
Baseline vs optimized query plan.
Provenance checklist.
Assessment rubric
Minimum pass
Query design matches analysis goal and data shape.
Provenance requirements are explicit and actionable.
Bottlenecks are identified with one realistic mitigation.
To ground the abstract concepts, here are the data scales learners will encounter:
| Dataset | Raw volume | Neurons | Synapses | Storage |
| --- | --- | --- | --- | --- |
| MICrONS (minnie65) | 1 mm³ mouse V1 | ~80,000 | ~500M | ~2 PB |
| H01 | ~1 mm³ human temporal cortex | ~57,000 cells | ~150M | ~1.4 PB |
| FlyWire | Whole adult Drosophila brain | ~139,255 | ~54.5M | ~100 TB |
| MouseConnects (planned) | ~10 mm³ mouse hippocampus | TBD | TBD | >10 PB |
Teaching point: “When your synapse table has 500 million rows, a poorly written query doesn’t just run slowly — it may not finish at all. Architecture decisions determine whether your science is feasible.”
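One way to demonstrate this point in class is to ask the planner itself, using SQLite as a stand-in for a real connectomics store (the schema is illustrative): the same question asked two ways yields very different plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE synapses (pre_id INT, post_id INT, weight REAL)")
conn.execute("CREATE INDEX idx_pre ON synapses (pre_id)")

# Indexed predicate: the planner can seek straight to the matching rows.
plan_indexed = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM synapses WHERE pre_id = 7").fetchall()

# Wrapping the column in a function defeats the index and forces a scan.
plan_scan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM synapses WHERE abs(pre_id) = 7").fetchall()

print(plan_indexed[0][3])  # a SEARCH step using idx_pre
print(plan_scan[0][3])     # a SCAN over the whole table
```

On a toy table the difference is cosmetic; on a 500-million-row synapse table, the second plan is the one that never finishes.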