Lesson Flow

Learn

Goals and Concepts

Start with the capability target and concept set for this module.

Practice

Studio Activity

Apply the ideas in a guided activity tied to realistic outputs.

Check

Assessment Rubric

Use the rubric to verify competency and identify improvement targets.

Interactive Lab

Practice in short loops: checkpoint quiz, microtask decision, and competency progress tracking.

Checkpoint Quiz

Q1. Which output most clearly demonstrates module competency?

Answer: Competency is shown through measurable, method-linked evidence.

Q2. What should always accompany a technical claim in this curriculum?

Answer: Every claim should include boundaries and uncertainty.

Q3. What is the best next step after identifying a gap in understanding?

Answer: Progress improves when gaps become explicit practice targets.

Microtask Decision

Choose the action that best improves scientific reliability.

Progress Tracker

State is saved locally in your browser for this module.


Annotation Challenge

Click the hotspot with the strongest evidence for the requested feature.

Connectomics training scene


Capability target

Produce a reproducible preprocessing release that transforms raw or intermediate connectomics outputs into analysis-ready data, with explicit quality gates and full provenance. Students will be able to identify the specific cleaning operations that shape biological conclusions, justify every threshold decision, and document their preprocessing pipeline so that another researcher can audit and reproduce it.

Why this module matters

Most downstream failures in connectome analysis are not model failures first; they are data-quality and preprocessing failures. A synapse table with unfiltered false positives inflates connectivity estimates. A neuron table that includes tiny orphan fragments skews degree distributions. A graph built without handling volume-boundary neurons misrepresents the network. Every preprocessing decision — what to filter, what threshold to set, what to include or exclude — directly shapes the biological conclusions that follow. This module teaches how to clean data without erasing signal, and how to document each transformation so conclusions remain defensible.
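A toy calculation (synthetic degrees, not from any dataset) makes the fragment effect concrete: orphan fragments contribute degree zero, so leaving them in drags the mean degree down and inflates the low-degree tail.

```python
import numpy as np

# Synthetic example: 100 real neurons vs the same neurons plus 400 orphan fragments.
rng = np.random.default_rng(0)
real = rng.poisson(lam=50, size=100)   # plausible neuron degrees, mean near 50
fragments = np.zeros(400, dtype=int)   # unfiltered debris contributes degree 0
with_frag = np.concatenate([real, fragments])

print(real.mean())       # close to 50: the biological signal
print(with_frag.mean())  # roughly a fifth of that: an artifact of the debris
```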

Concept set

1) Data cleaning in connectomics: what needs fixing and why

2) Threshold decisions that shape analysis

3) Cleaning vs distortion

4) Provenance as a scientific requirement

5) Documenting cleaning decisions for reproducibility

6) QC metrics must be decision-linked

Core workflow: preprocessing for connectomics

  1. Ingest and integrity validation
    • Confirm file completeness, schema conformance, and version compatibility.
    • Log dataset identifiers, CAVE materialization version, and checksums.
    • Verify that the synapse table, segment table, and cell-type annotations refer to the same materialization.
  2. Artifact and anomaly screening
    • Compute segment size distribution and flag outliers (extremely large segments may be merge errors; extremely small segments may be debris).
    • Compute synapse confidence score distribution and identify the threshold region.
    • Check for duplicate segment IDs, conflicting cell-type labels, and missing foreign keys.
    • Identify boundary neurons by mesh-bounding-box intersection.
    • Triage issues by likely biological impact: high-impact issues block analysis; low-impact issues are documented and accepted.
  3. Cleaning transforms
    • Apply synapse confidence threshold with documented rationale.
    • Remove orphan segments (segments that appear in no synapse as either the presynaptic or postsynaptic partner).
    • Apply segment size threshold with documented rationale.
    • Flag or remove boundary neurons with documented policy.
    • Resolve duplicate IDs and label conflicts.
    • Normalize units (e.g., convert voxel coordinates to nanometers using dataset resolution metadata).
  4. QC and drift checks
    • Compare pre/post distributions: synapse count per neuron, segment size, graph density, degree distribution.
    • Verify that cleaning did not selectively remove a specific cell type or spatial region.
    • Check that graph topology statistics (clustering coefficient, connected components) are consistent with expectations.
  5. Release packaging
    • Publish analysis-ready tables plus: preprocessing decision log, transform code with commit hash, QC metric report with threshold justifications, known limitations and residual risks.
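The cleaning transforms in step 3 can be sketched in pandas. This is a minimal sketch, not a reference implementation: the column names (`confidence`, `pre_id`, `post_id`, `segment_id`, `size_voxels`) and both threshold values are assumptions to be replaced with the actual export schema and your documented choices.

```python
import pandas as pd

# Assumed thresholds -- each must carry a documented rationale in the release note.
CONF_MIN = 30          # synapse confidence threshold
SIZE_MIN = 1_000_000   # segment size threshold, in voxels

def clean(synapses: pd.DataFrame, segments: pd.DataFrame):
    """Apply the step-3 cleaning transforms; returns new tables, inputs untouched."""
    # 1. Synapse confidence threshold
    syn = synapses[synapses["confidence"] >= CONF_MIN].copy()
    # 2. Orphan segments: appear in no surviving synapse as pre or post
    connected = set(syn["pre_id"]) | set(syn["post_id"])
    seg = segments[segments["segment_id"].isin(connected)].copy()
    # 3. Segment size threshold
    seg = seg[seg["size_voxels"] >= SIZE_MIN]
    # 4. Resolve duplicate IDs (here: keep the first occurrence)
    seg = seg.drop_duplicates(subset="segment_id")
    # Keep only synapses whose endpoints both survived the segment filters
    keep = set(seg["segment_id"])
    syn = syn[syn["pre_id"].isin(keep) & syn["post_id"].isin(keep)]
    return syn, seg
```

Boundary-neuron policy and unit normalization are omitted here because they depend on dataset metadata; they would slot in as further steps of the same function.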

Studio activity: preprocessing release simulation

Scenario: Your team receives a connectomics export from MICrONS minnie65 (CAVE materialization v795) containing: a synapse table (4.2 million rows) with confidence scores, a segment table (120,000 segments) with volumes, and a cell-type annotation table (8,400 classified neurons). Initial inspection reveals: 12% of synapses have confidence scores below 30, 35,000 segments have fewer than 2 synapses, 847 segments intersect the volume bounding box, and 23 segment IDs appear in the synapse table but not in the segment table.

Tasks

  1. Artifact triage: classify each issue (low-confidence synapses, small segments, boundary neurons, orphan IDs) by likely biological impact and propose a cleaning policy for each.
  2. Threshold justification: for synapse confidence and segment size thresholds, propose two candidate values each and argue for your preferred choice. Explain what biological signal you might lose at each threshold.
  3. Implement preprocessing pipeline: write pseudocode or notebook-level steps for the full cleaning workflow, from ingest through release.
  4. QC comparison: compute (or estimate) pre/post metrics: total synapse count, total segment count, mean degree, graph density, and the fraction of each cell type remaining after cleaning.
  5. Release note: produce a one-page release note that includes: input dataset version, all thresholds and parameters, code reference, QC metrics with pass/fail calls, and known residual risks (e.g., “boundary neurons were excluded, which may underrepresent connectivity of neurons near volume edges”).
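The pre/post comparison in task 4 can be sketched as a small metrics function. Column names (`pre_id`, `post_id`, `segment_id`) are assumed, and density here treats each distinct pre-to-post pair as one directed edge; adapt both to your schema and graph definition.

```python
import pandas as pd

def qc_metrics(synapses: pd.DataFrame, segments: pd.DataFrame) -> dict:
    """Summary metrics to compute once on the raw tables and once after cleaning."""
    n = segments["segment_id"].nunique()
    # Collapse multiple synapses between the same pair into one directed edge
    edges = synapses[["pre_id", "post_id"]].drop_duplicates()
    return {
        "synapse_count": len(synapses),
        "segment_count": n,
        "edge_count": len(edges),
        "mean_degree": len(edges) / n if n else 0.0,          # mean out-degree
        "graph_density": len(edges) / (n * (n - 1)) if n > 1 else 0.0,
    }

# Usage: compute before = qc_metrics(raw_syn, raw_seg) and
# after = qc_metrics(clean_syn, clean_seg), then report both in the release note.
```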

Expected outputs

Assessment rubric

Content library cross-references

Teaching resources

60-minute tutorial run-of-show

Materials

Timing and instructor script

00:00-08:00 | Setup and target framing

Instructor presents the scenario: “You have received a connectomics export. Before you can analyze it, you must clean it. But every cleaning decision changes your results. Today we learn to clean responsibly.” Display the raw data summary statistics. Define the release objective: an analysis-ready synapse table and neuron table with documented provenance. Define non-negotiable quality gates: no duplicate IDs, no unresolved foreign keys, all thresholds documented.

08:00-18:00 | Instructor modeling: ingest and anomaly screening

Live demonstration: load the synapse table, compute the confidence score distribution, and identify its two modes (true synapses vs false positives). Show the segment size distribution on a log scale and point out the debris tail. Check for orphan IDs. Key script line: “Before you touch the data, understand its shape. The distribution plot is your first diagnostic tool.”
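The demo's first-look diagnostics can also be computed without plotting. A minimal NumPy sketch (the bin count and confidence range 0-100 are illustrative assumptions):

```python
import numpy as np

def shape_report(confidence: np.ndarray, sizes: np.ndarray, bins: int = 20):
    """Histogram the confidence scores (look for two modes: false positives
    vs true synapses) and the log10 segment sizes (debris forms the low tail)."""
    conf_counts, conf_edges = np.histogram(confidence, bins=bins, range=(0, 100))
    log_counts, log_edges = np.histogram(np.log10(sizes), bins=bins)
    return conf_counts, conf_edges, log_counts, log_edges
```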

18:00-32:00 | Team preprocessing design

Teams of 3-4 draft cleaning rules for each identified issue. Each team must produce a preprocessing decision table with columns: issue, proposed action, threshold, rationale, estimated impact. Instructor circulates, challenging threshold choices: “Why 50 and not 40? What do you lose at 50 that you keep at 40?”

32:00-44:00 | QC pass

Teams compute (or estimate from the provided distributions) pre/post metrics: total synapse count, total segment count, mean synapses per neuron, fraction of each cell type remaining. Teams make a release/no-release decision based on their quality gates. Instructor asks: “Did cleaning change the relative representation of cell types? If it did, that is a bias you must report.”
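The cell-type representation check the instructor raises can be sketched as a drift table. A minimal pandas sketch, assuming a `cell_type` column in both the pre- and post-cleaning neuron tables:

```python
import pandas as pd

def cell_type_drift(before: pd.DataFrame, after: pd.DataFrame,
                    col: str = "cell_type") -> pd.DataFrame:
    """Fraction of each cell type remaining after cleaning. A fraction far
    below the overall retention rate flags selective removal -- a bias to report."""
    pre = before[col].value_counts()
    post = after[col].value_counts().reindex(pre.index, fill_value=0)
    out = pd.DataFrame({"before": pre, "after": post})
    out["fraction_remaining"] = out["after"] / out["before"]
    return out
```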

44:00-54:00 | Cross-team review

Teams swap preprocessing decision tables and QC reports. Each team audits the other’s transform log for: missing rationale, unjustified thresholds, potential biological signal loss, and reproducibility gaps. Teams write two specific improvement suggestions.

54:00-60:00 | Competency checkpoint

Each team submits one release note with: dataset version, all thresholds and parameters, QC metrics with pass/fail, and at least one documented residual risk. Instructor reviews one example live.

Success criteria for this session

Evidence anchors from connectomics practice

Key papers to use in this module

Key datasets to practice on

Competency checks

Quick practice prompt

Take one connectomics table (real or mock) and write:

  1. Three cleaning rules with rationale tied to specific data artifacts.
  2. Two QC thresholds with associated pass/fail actions and biological justification.
  3. One sensitivity analysis: what happens to your key metric if you relax or tighten your primary threshold by 20%?
  4. One limitation that remains after preprocessing, stated concretely enough to guide interpretation.
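The sensitivity analysis in item 3 can be sketched as a threshold sweep. A minimal pandas sketch, assuming a `confidence` column and using synapse retention as the stand-in key metric; swap in whatever metric your analysis actually reports:

```python
import pandas as pd

def threshold_sensitivity(synapses: pd.DataFrame, base_threshold: float,
                          delta: float = 0.2) -> dict:
    """Sweep the confidence threshold +/- delta (20% by default) and report
    how many synapses each setting keeps -- a quick fragility check on the
    primary threshold choice."""
    results = {}
    for t in (base_threshold * (1 - delta), base_threshold, base_threshold * (1 + delta)):
        results[round(t, 2)] = int((synapses["confidence"] >= t).sum())
    return results
```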

Teaching Materials

Activity Worksheet

Learner worksheet aligned to the studio activity and rubric.

Open worksheet

Slide Source

Marp source file for editing and rendering.

course/decks/marp/modules/module18.marp.md

Related Content