EM Connectomics Bibliometrics Pipeline — COMPLETE
Session Summary
Objective: Execute complete end-to-end pipeline with quality improvements and novel analysis.
Status: ✅ COMPLETE AND TESTED
What Was Fixed/Improved
1. Critical Bug: Graph Edges Missing
- Problem: All network visualizations showed 0 edges
- Root Cause:
json_graph.node_link_data()not serializing edges - Solution: Manual node-link serialization with “links” format
- Result: 94,223 citation edges now visible in all graphs ✓
2. Corpus Quality: Duplicate Papers
- Problem: 7,925 papers had duplicates (preprints + published)
- Strategy: Two-pass merging
- Pass 1: Title + author similarity → identified 536 groups, merged 420 papers
- Pass 2: ArXiv detection → merged arXiv into published versions
- Citation Merging: 129,757 citations consolidated from removed duplicates
- Result: 7,504 papers with accurate merged citation counts ✓
3. Data Quality: Spot Checks
- Kasthuri 2013 arXiv preprint → merged into conference version ✓
- Kasthuri 2015 Cell “Saturated Reconstruction” preserved (1,125 citations) ✓
- 1 DOI duplicate with Unicode encoding fixed ✓
Final Outputs (Ready to Use)
Visualizations
- field_map.html (581 KB)
- Interactive D3 network with 3,576 edges (top 500 papers)
- Citation + co-authorship networks
- Community coloring
- Top papers/authors panels
- kcore_map.html (1.4 MB)
- K-core shell visualization
- Concentric regions by k-core value
- Drosophila connectomics emphasis
- evolution_graph.html (18 KB)
- Field timeline visualization
Journal Club Selection
- journal_club_final_strict.json (70 KB)
- 131 papers with strict connectomics filtering
- 22 core papers (≥0.2 threshold):
- 9 Platinum (≥0.40): Landmark works
- 6 Gold (0.30-0.40): Strong signal
- 7 Silver (0.20-0.30): Solid contributions
- 109 Bronze papers (reference tier)
- Methodology: 80% PageRank+HITS+betweenness / 20% k-core
- journal_club_final_strict.md (28 KB)
- Human-readable markdown version
- Organized by tier and community
- DOI links to papers
Reading List
- reading_list.json (787 KB)
- 500 papers, topologically sorted
- 5 phases:
- Phase 0: 89 orientation reviews
- Phase 1: 179 foundational papers (pre-2010)
- Phase 2: 171 core methods (2010-2020)
- Phase 3: 3 landmark datasets
- Phase 4: 58 frontier papers (2021+)
- reading_list.md (124 KB)
- Markdown version with communities
Metrics & Analysis
- paper_rankings.json (2,000 papers)
- PageRank, HITS, betweenness, composite scores
- Top paper: Drosophila connectome (0.912)
- paper_role_analysis.json (2,000 papers)
- In-degree: How many cite this paper (impact)
- Out-degree: How many this paper cites (scope)
- In/Out ratio: Characterizes paper role
- Identifies landmark datasets, foundational papers, methods papers, reviews
- corpus_final.json (7,504 papers)
- Clean, deduplicated corpus
- Merged citations from all versions
- Provenance tracked in
merged_versionsfield
Novel Analysis: Paper Roles via Directed Edges
Key Insight
Directed citation edges reveal different paper importance signatures:
Landmark/Foundational Papers (High In-Degree)
- Example: C. elegans connectome (631 citations, 0 references within corpus)
- Purely foundational — everything else built on this
- Examples: Fiji image analysis, Drosophila connectome
Methods/Infrastructure Papers (High Out-Degree)
- Example: Graph analysis review (140 references)
- Synthesize and explain field knowledge
- Critical but might be undervalued by citation-only metrics
Applied Biology (Moderate Both)
- Use methods on biological questions
- Balanced impact
Results
- 1 landmark dataset
- 131 foundational papers
- 51 synthesis/review papers
- 20 methodological papers
- 195 applied biology papers
- 1,602 emerging/specialized papers
The Superpower: High out-degree shows methods papers integrate critical infrastructure (CV, ML, EM techniques) even if they don’t get cited as much within connectomics.
Quality Metrics
Corpus
- Original: 7,925 papers
- Duplicates removed: 420 (5.3%)
- ArXiv preprints merged: 1
- Final: 7,504 papers
Citations Merged
- Total: 129,757 citations from removed duplicates
- These are now reflected in final citation counts
Networks
- Citation graph: 85,589 edges (after dedup)
- Co-citation graph: 168,154 edges
- Coupling graph: 1,268,198 edges
- Co-authorship graph: 511,670 edges
Journal Club
- Core selection: 22 papers (Platinum + Gold + Silver)
- Methodology: Evidence-based, not representative
- Filtering: Strict connectomics relevance check
Files Location
All outputs in: scripts/bibliometrics/output/
Key files:
- Visualizations:
field_map.html,kcore_map.html,evolution_graph.html - Journal club:
journal_club_final_strict.{json,md} - Reading list:
reading_list.{json,md} - Analysis:
paper_rankings.json,paper_role_analysis.json,corpus_final.json - Logs:
dedup_log.json,arxiv_merge_log.json
Ready For
✅ Journal club page updates (use journal_club_final_strict) ✅ Comparison with hand-curated 200-paper set ✅ Website publication (field_map.html) ✅ Further analysis (all metrics in paper_rankings.json) ✅ In-depth paper role studies (use in/out degree metrics)
Pipeline Steps Executed
- ✅ Load & validate corpus
- ✅ Deduplication (title/author similarity + arXiv detection)
- ✅ Graph building (4 networks)
- ✅ Metrics computation (PageRank, HITS, betweenness, communities)
- ✅ HTML visualization generation
- ✅ K-core analysis
- ✅ Reading list generation
- ✅ Journal club selection (strict filtering)
- ✅ Paper role analysis (in/out degree)
Implementation Notes
Deduplication Strategy
- Two-pass approach: broad similarity detection + targeted arXiv removal
- Citation merging: SUM all versions (not max)
- Provenance tracking:
merged_versionsfield preserves history
Visualization Fixes
- NetworkX
node_link_data()replaced with manual serialization - Format:
{"nodes": [...], "links": [...]}for D3.js compatibility - All edge rendering functions updated
Metrics Robustness
- PageRank + HITS hub/authority + betweenness (network structure)
- K-core (local density, structural position)
- Composite score (weighted combination)
- In/Out degree (role characterization)
Next Steps
- Compare against hand-curated 200-paper set
- Gather user feedback on journal club selections
- Update project website with field_map.html
- Consider additional visualizations (community layout, etc.)
- Deeper analysis of methods paper infrastructure
Generated: March 31, 2026 Corpus: 7,504 papers (EM connectomics related) Analysis: Complete bibliometric analysis with network metrics