EM Connectomics Bibliometrics Pipeline — COMPLETE

Session Summary

Objective: Execute the complete end-to-end pipeline, with quality improvements and a novel analysis.

Status: ✅ COMPLETE AND TESTED


What Was Fixed/Improved

1. Critical Bug: Graph Edges Missing

2. Corpus Quality: Duplicate Papers

3. Data Quality: Spot Checks


Final Outputs (Ready to Use)

Visualizations

Journal Club Selection

Reading List

Metrics & Analysis


Novel Analysis: Paper Roles via Directed Edges

Key Insight

Directed citation edges reveal different paper importance signatures:

Landmark/Foundational Papers (High In-Degree)

Methods/Infrastructure Papers (High Out-Degree)

Applied Biology (Moderate Both)

Results

Key takeaway: A high out-degree indicates that a methods paper draws on critical infrastructure (CV, ML, EM techniques), even when it is cited less often within connectomics.
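As a rough illustration of the in/out-degree idea, the sketch below counts degrees from a directed edge list and assigns one of the three roles named above. It is not the pipeline's code: the function names, the `(citing, cited)` edge format, and the `high` threshold are all assumptions made for this example.

```python
from collections import Counter


def degree_profile(edges):
    """Return {paper: (in_degree, out_degree)} for directed (citing, cited) pairs."""
    in_deg, out_deg = Counter(), Counter()
    for citing, cited in set(edges):  # set() drops repeated edges
        out_deg[citing] += 1
        in_deg[cited] += 1
    papers = set(in_deg) | set(out_deg)
    return {p: (in_deg[p], out_deg[p]) for p in papers}


def classify_role(in_degree, out_degree, high=3):
    """Map a degree pair to a coarse role.

    `high` is an illustrative cutoff, not a value taken from the pipeline.
    """
    if in_degree >= high and in_degree >= out_degree:
        return "landmark"   # heavily cited within the corpus
    if out_degree >= high and out_degree > in_degree:
        return "methods"    # cites many corpus papers
    return "applied"        # moderate on both counts


# Toy edge list with made-up paper IDs.
toy_edges = [
    ("m", "a"), ("m", "b"), ("m", "c"),
    ("x", "a"), ("y", "a"), ("z", "a"),
    ("x", "b"),
]
profile = degree_profile(toy_edges)
roles = {paper: classify_role(*degrees) for paper, degrees in profile.items()}
```

In practice a fixed cutoff is crude; percentile-based thresholds over the whole corpus would likely separate the groups more sensibly.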


Quality Metrics

Corpus

Citations Merged

Networks

Journal Club


Files Location

All outputs in: scripts/bibliometrics/output/

Key files:


Ready For

✅ Journal club page updates (use journal_club_final_strict)
✅ Comparison with hand-curated 200-paper set
✅ Website publication (field_map.html)
✅ Further analysis (all metrics in paper_rankings.json)
✅ In-depth paper role studies (use in/out degree metrics)


Pipeline Steps Executed

  1. ✅ Load & validate corpus
  2. ✅ Deduplication (title/author similarity + arXiv detection)
  3. ✅ Graph building (4 networks)
  4. ✅ Metrics computation (PageRank, HITS, betweenness, communities)
  5. ✅ HTML visualization generation
  6. ✅ K-core analysis
  7. ✅ Reading list generation
  8. ✅ Journal club selection (strict filtering)
  9. ✅ Paper role analysis (in/out degree)
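To make step 4 a little more concrete, here is a minimal, dependency-free sketch of PageRank by power iteration on a directed citation graph. It is only an illustration under assumed inputs (`(citing, cited)` pairs, damping 0.85); the pipeline itself may well rely on a graph library with different defaults.

```python
def pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Power-iteration PageRank over directed (citing, cited) pairs.

    Rank flows from the citing paper to the cited paper. Papers with no
    outgoing links (dangling nodes) spread their rank uniformly.
    """
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}
    out_links = {node: set() for node in nodes}
    for citing, cited in edges:
        if citing != cited:  # ignore self-citations
            out_links[citing].add(cited)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for v in nodes:
            if out_links[v]:
                share = damping * rank[v] / len(out_links[v])
                for target in out_links[v]:
                    new_rank[target] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy graph with made-up paper IDs: "a" is cited by every other paper.
scores = pagerank([("b", "a"), ("c", "a"), ("d", "a"), ("d", "b")])
top_paper = max(scores, key=scores.get)
```

The scores sum to 1, so they can be compared across papers as shares of total importance within the corpus.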

Implementation Notes

Deduplication Strategy
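
Pipeline step 2 describes deduplication as title/author similarity plus arXiv detection. One way this could look is sketched below, using only the standard library. The record keys (`title`, `first_author`, `venue`), the 0.9 similarity threshold, and the rule of preferring the published version over the preprint are assumptions for illustration, not the pipeline's actual settings.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase, strip punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def is_preprint(record):
    """Hypothetical arXiv check: looks for 'arxiv' in the venue string."""
    return "arxiv" in record.get("venue", "").lower()


def likely_duplicates(a, b, threshold=0.9):
    """True when first authors match and titles are near-identical."""
    same_author = a["first_author"].strip().lower() == b["first_author"].strip().lower()
    ratio = SequenceMatcher(
        None, normalize_title(a["title"]), normalize_title(b["title"])
    ).ratio()
    return same_author and ratio >= threshold


def deduplicate(records, threshold=0.9):
    """Keep one record per duplicate group, preferring non-preprints."""
    kept = []
    for rec in records:
        for i, existing in enumerate(kept):
            if likely_duplicates(rec, existing, threshold):
                if is_preprint(existing) and not is_preprint(rec):
                    kept[i] = rec  # swap the preprint for the published version
                break
        else:
            kept.append(rec)
    return kept


# Toy records with placeholder titles and authors.
toy_records = [
    {"title": "Toy paper alpha: a study", "first_author": "Smith", "venue": "arXiv"},
    {"title": "Toy Paper Alpha - A Study.", "first_author": "smith", "venue": "Journal A"},
    {"title": "An unrelated toy paper", "first_author": "Lee", "venue": "Journal B"},
]
unique = deduplicate(toy_records)
```

The pairwise comparison is quadratic in the number of records; for a corpus of several thousand papers, blocking on first author or publication year before comparing titles would keep it tractable.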

Visualization Fixes

Metrics Robustness


Next Steps


Generated: March 31, 2026
Corpus: 7,504 papers (EM connectomics related)
Analysis: Complete bibliometric analysis with network metrics