Module 18: Data Cleaning and Preprocessing
Teaching Deck
Learning Objectives
- Diagnose common connectomics data-quality issues before analysis
- Apply reproducible preprocessing steps with documented decision rules
- Quantify preprocessing impact with auditable QC metrics
- Produce an analysis-ready dataset package with provenance metadata
Session Outcomes
- Learners can complete the module capability target: a reproducible, quality-gated preprocessing release.
- Learners can produce one evidence-backed artifact (e.g., a release note with pre/post QC metrics).
- Learners can state one limitation or uncertainty that remains after preprocessing.
Agenda (60 min)
- 0-10 min: Frame and model
- 10-35 min: Guided practice
- 35-50 min: Debrief and misconception correction
- 50-60 min: Competency check + exit ticket
Capability Target
Produce a reproducible preprocessing release that transforms raw or intermediate connectomics outputs into analysis-ready data, with explicit quality gates and full provenance.
Concept Focus
1) Cleaning vs distortion
- Technical: preprocessing should reduce known artifacts/noise while preserving biologically meaningful structure.
- Plain language: fix mistakes, do not “polish away” the biology.
- Misconception guardrail: more filtering is not always better.
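The guardrail can be made concrete: an outlier filter that is too aggressive removes heavy-tailed values that may be real biology (e.g., hub neurons), not artifacts. A minimal sketch, using purely illustrative synapse counts and cutoffs:

```python
import statistics

# Illustrative synapse counts per neuron; the two large values could be
# real hub neurons rather than artifacts (all numbers are made up).
counts = [12, 15, 14, 13, 16, 11, 14, 15, 13, 98, 120]

def zscore_filter(values, cutoff):
    """Keep values whose absolute z-score is below `cutoff`."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / sd < cutoff]

lenient = zscore_filter(counts, cutoff=4.0)     # keeps the heavy tail
aggressive = zscore_filter(counts, cutoff=1.5)  # "polishes away" the tail
print(len(counts), len(lenient), len(aggressive))  # 11 11 9
```

The point is not any specific cutoff: the filter choice is itself a biological claim and should be documented and defended as one.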
Core Workflow
- Ingest and validate the raw table (schema, IDs, units).
- Apply documented cleaning rules, recording every transform in a log.
- Compute pre/post QC metrics and compare them against thresholds.
- Decide release/no-release and package the data with provenance metadata.
- See the module page for full details.
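One way to make this workflow tangible is a small, self-contained sketch. The field names (`neuron_id`, `length_um`) and cleaning rules below are hypothetical, not taken from the module dataset:

```python
# Hypothetical raw rows exhibiting the three issues the module targets:
# duplicated IDs, missing values, and suspect units.
rows = [
    {"neuron_id": "n1", "length_um": 120.0},
    {"neuron_id": "n1", "length_um": 120.0},  # duplicated ID
    {"neuron_id": "n2", "length_um": None},   # missing value
    {"neuron_id": "n3", "length_um": 0.34},   # plausible unit error (mm vs um);
                                              # a real pipeline would add a unit check
]

transform_log = []  # every edit is recorded; no silent fixes

def drop_duplicate_ids(data):
    """Keep the first row per neuron_id and log each drop."""
    seen, kept = set(), []
    for row in data:
        if row["neuron_id"] in seen:
            transform_log.append(f"dropped duplicate id {row['neuron_id']}")
        else:
            seen.add(row["neuron_id"])
            kept.append(row)
    return kept

def drop_missing_length(data):
    """Remove rows with missing length and log each removal."""
    for row in data:
        if row["length_um"] is None:
            transform_log.append(f"dropped missing length for {row['neuron_id']}")
    return [r for r in data if r["length_um"] is not None]

cleaned = drop_missing_length(drop_duplicate_ids(rows))
print(len(cleaned), len(transform_log))  # 2 2
```

The log, not the cleaned table, is what makes the run auditable: every row that disappears leaves a reason behind.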
60-Minute Run-of-Show
Materials
- One noisy connectomics table (missing values, duplicated IDs, inconsistent units).
- Shared preprocessing decision sheet.
- QC dashboard template (pre/post metrics).
Schedule
- **00:00-08:00 Setup and target** - Define release objective and non-negotiable quality gates.
- **08:00-18:00 Instructor modeling** - Live demonstration of ingest checks and anomaly triage logic.
- **18:00-32:00 Team preprocessing design** - Teams draft cleaning rules and escalation criteria.
- **32:00-44:00 QC pass** - Teams compute/estimate pre/post metrics and decide release/no-release.
- **44:00-54:00 Cross-team review** - Teams audit each other’s transform logs for reproducibility gaps.
- **54:00-60:00 Competency checkpoint** - Submit one release note with provenance, thresholds, and residual risk.
Quality Gates
- Cleaning decisions are deterministic and documented.
- QC thresholds are tied to operational actions.
- Release note exposes at least one unresolved interpretation risk.
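The gate criteria above can be expressed as data, so every threshold carries an operational action rather than a bare number. A minimal sketch with invented metric names and limits:

```python
# Hypothetical QC gate: each metric threshold maps to an operational
# action; names and limits are illustrative only.
THRESHOLDS = {
    "missing_rate":   (0.02, "hold: re-run imputation review"),
    "duplicate_rate": (0.00, "hold: re-run ID deduplication"),
    "unit_outliers":  (0.01, "escalate: check acquisition metadata"),
}

def gate(metrics):
    """Return ('release', []) or ('no-release', [required actions])."""
    actions = [
        action
        for name, (limit, action) in THRESHOLDS.items()
        if metrics.get(name, 0.0) > limit
    ]
    return ("release", []) if not actions else ("no-release", actions)

decision, todo = gate({"missing_rate": 0.05, "duplicate_rate": 0.0})
print(decision, todo)  # no-release ['hold: re-run imputation review']
```

A metric that cannot block or trigger anything is reporting, not quality control; the table above forces each metric to pick an action.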
Misconceptions to Watch
- More filtering is not always better.
- Version-control notes alone are insufficient without data lineage.
- Reporting metrics without thresholds is not quality control.
Studio Activity
- Full activity brief: /assets/worksheets/module18/module18-activity.md
Activity Output Checklist
- Evidence-linked artifact submitted.
- At least one limitation or uncertainty stated.
- Revision point captured from feedback.
Assessment Rubric
- Minimum pass
- Cleaning decisions are explicit and reproducible.
- QC metrics include thresholds tied to actions.
- Release package includes provenance metadata.
- Strong performance
- Distinguishes low-risk cleanup from biologically sensitive transforms.
- Quantifies and explains pre/post changes clearly.
- Documents limitations and unresolved risks transparently.
- Common failure modes
- Silent ad-hoc edits with no transform log.
- Aggressive filtering that removes biologically meaningful variation.
- Metrics reported without operational thresholds.
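The "silent ad-hoc edits" failure mode is easiest to avoid when the release note is itself a machine-readable record. A sketch with placeholder values (none of these field names are mandated by the module):

```python
import json

# Sketch of a machine-readable release note with provenance fields;
# every value below is a placeholder, not a real release.
release_note = {
    "release": "module18-demo-v0.1",
    "source": {"dataset": "raw_connectome_table.csv", "sha256": "<checksum>"},
    "transforms": [
        {"rule": "drop_duplicate_ids", "rows_removed": 1},
        {"rule": "drop_missing_length", "rows_removed": 1},
    ],
    "qc": {"missing_rate": 0.0, "threshold": 0.02, "decision": "release"},
    "residual_risk": "dropped rows may include real but mislabeled neurons",
}
print(json.dumps(release_note, indent=2))
```

Note the mandatory `residual_risk` entry: it operationalizes the rubric requirement that a release note expose at least one unresolved interpretation risk.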
Exit Ticket
Take one connectomics table (real or mock) and write:
- Three cleaning rules with rationale.
- Two QC thresholds and associated actions.
- One limitation that remains after preprocessing.
References (Instructor)
- Wilkinson et al., 2016. The FAIR Guiding Principles for scientific data management and stewardship.
- Peng, 2011. Reproducible Research in Computational Science.
- MICrONS and related connectomics workflow documentation.
Teaching Materials
- Module page: /modules/module18/
- Slide page: /modules/slides/module18/
- Worksheet: /assets/worksheets/module18/module18-activity.md