EM Connectomics Bibliometrics: Methodology & Pipeline

Complete Technical Documentation for Team Review


1. PIPELINE OVERVIEW

1.1 High-Level Data Flow

OpenAlex Query
    ↓
Raw Data (~8,000 papers)
    ↓
Deduplication (420 duplicates removed)
    ↓
Author Merging (17 name merges applied)
    ↓
Corpus: 7,503 papers, 35,641 authors
    ↓
Citation Network Analysis
    ├─ PageRank computation (2,000 papers)
    ├─ HITS hubs/authorities
    ├─ Betweenness centrality
    ├─ K-core decomposition
    └─ Degree statistics (ALL papers)
    ↓
Paper Role Classification (10 categories)
    ↓
Author Rankings (35,641 authors)
    ↓
Key Experts Analysis (top 50)
    ├─ Career arcs (15 expert trajectories)
    ├─ Synthesis/infrastructure papers
    └─ Important work by role
    ↓
Composite Scoring (80% PageRank + 20% k-core)
    ↓
Journal Club Selection (4 thresholds: 10, 15, 20, 30)
    ↓
Visualizations (6 HTML files + 1 dashboard)
    ↓
Documentation & QA

1.2 Code Structure

scripts/bibliometrics/
├── 01_query_and_fetch.py         (OpenAlex query, raw data)
├── 02_build_graphs.py            (citation network construction)
├── 03_deduplicate_corpus.py      (duplicate removal)
├── 04_compute_metrics.py         (PageRank, HITS, betweenness)
├── 05_generate_full_visualizations.py
├── 06_compute_kcore.py           (k-core decomposition)
├── 07_community_detection.py     (Louvain algorithm)
├── 08_author_rankings.py         (author scoring)
├── 09_generate_journal_club_thresholds.py
├── 10_paper_role_analysis_granular.py
├── 11_extend_metrics_to_all_papers.py
├── 12_composite_scoring.py       (final importance score)
├── 13_synthesis_papers_analysis.py
├── 14_career_arcs_analysis.py
├── 15_apply_author_merges_to_corpus.py
├── 16_key_experts_analysis.py
├── 18_career_arcs_analysis.py
├── 19_seed_list_comparison.py
├── 20_apply_enrichment_decisions.py
└── output/                       (all JSON files, HTML visualizations)

2. METRIC COMPUTATIONS

2.1 Citation Network & Basic Metrics

Network Structure:

In-Degree: Citations received

Out-Degree: Citations given
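
As a minimal sketch of the two degree definitions above, the snippet below counts in-degree and out-degree from a list of (citing, cited) pairs. The edge-list representation and the name degree_counts are assumptions made for illustration; the pipeline's own graph format is not shown here.

```python
from collections import Counter


def degree_counts(edges):
    """Return (in_degree, out_degree) Counters for (citing, cited) pairs."""
    in_degree, out_degree = Counter(), Counter()
    for citing, cited in edges:
        out_degree[citing] += 1  # the citing paper gives one citation
        in_degree[cited] += 1    # the cited paper receives one citation
    return in_degree, out_degree


# Toy edge list (hypothetical papers): A cites B and C, B cites C.
edges = [("A", "B"), ("A", "C"), ("B", "C")]
in_deg, out_deg = degree_counts(edges)
```

Papers that never appear on one side of an edge simply report 0 for that degree, because Counter returns 0 for missing keys.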

2.2 PageRank Algorithm

Implementation:

Formula:

PR(p) = (1 - d) / N + d * Σ(PR(t) / |t|)
  where:
    d = 0.85 (damping factor)
    N = 7,503 (total papers)
    t = papers that cite p
    |t| = out-degree of t
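
The formula above can be sketched as a plain power iteration. This is an illustrative version, not necessarily how 04_compute_metrics.py does it. One assumption goes beyond the formula: rank held by papers with no outgoing citations is spread evenly over all papers, so that the scores keep summing to 1. The function name and the toy edges are hypothetical.

```python
def pagerank(edges, d=0.85, iterations=100):
    """Power-iteration PageRank over (citing, cited) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out_links = {node: [] for node in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Assumption: papers with no outgoing citations share their rank
        # evenly with every paper (not stated in the formula above).
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - d) / n + d * dangling / n for v in nodes}
        for t in nodes:
            if out_links[t]:
                share = d * rank[t] / len(out_links[t])  # d * PR(t) / |t|
                for p in out_links[t]:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Toy graph (hypothetical papers): A cites B and C, B cites C.
edges = [("A", "B"), ("A", "C"), ("B", "C")]
ranks = pagerank(edges)
```

In this toy graph C ends up with the highest score and A with the lowest, since nothing cites A.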

Interpretation:

Limitations:

2.3 HITS (Hubs and Authorities)

Hubs: Papers that cite many important papers (high out-degree to authorities)

Authorities: Papers cited by important hubs (high in-degree from hubs)
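
The mutual definition of hubs and authorities above can be illustrated with a short iterative sketch. The fixed iteration count and the Euclidean normalisation are assumptions chosen for simplicity, and the function name is hypothetical; the pipeline's actual settings are not documented here.

```python
import math


def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for (citing, cited) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the papers citing it.
        auth = {v: 0.0 for v in nodes}
        for citing, cited in edges:
            auth[cited] += hub[citing]
        # Hub score: sum of authority scores of the papers it cites.
        hub = {v: 0.0 for v in nodes}
        for citing, cited in edges:
            hub[citing] += auth[cited]
        # Rescale both vectors to unit length so values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
            for v in scores:
                scores[v] /= norm
    return hub, auth


# Toy graph (hypothetical papers): review R cites P1 and P2, S cites P1.
edges = [("R", "P1"), ("R", "P2"), ("S", "P1")]
hub, auth = hits(edges)
```

Here R is the stronger hub because it cites both papers, and P1 is the stronger authority because both hubs cite it.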

Usage:

2.4 Betweenness Centrality

Definition: How many shortest paths between other papers pass through this paper?
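
To make the definition concrete, here is a compact sketch of Brandes' algorithm for a directed, unweighted citation graph. It returns unnormalised scores and assumes the edge list has no duplicate edges. Whether the pipeline normalises or samples source nodes is not stated, so treat this as an illustration only.

```python
from collections import deque


def betweenness(edges):
    """Unnormalised betweenness for a directed, unweighted graph."""
    nodes = sorted({node for edge in edges for node in edge})
    succ = {v: [] for v in nodes}
    for u, v in edges:
        succ[u].append(v)

    score = {v: 0.0 for v in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in nodes}
        sigma = {v: 0 for v in nodes}
        dist = {v: -1 for v in nodes}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in succ[v]:
                if dist[w] < 0:  # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


# Toy graph (hypothetical papers): A and D both cite B, and B cites C.
edges = [("A", "B"), ("D", "B"), ("B", "C")]
scores = betweenness(edges)
```

B scores 2.0 because the only paths from A to C and from D to C pass through it; every other paper scores 0.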

Interpretation:

Limitations:

2.5 K-Core Decomposition

Definition: Maximal subgraph where every node has degree ≥ k within the subgraph

Process:

  1. Set k = 1
  2. Repeatedly remove nodes with degree < k (recomputed in the remaining graph) until none are left
  3. Increase k by 1 and repeat step 2 until no nodes remain
  4. Each node’s k-value = the largest k whose pruning step it survived
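
The peeling process above can be sketched as follows. The sketch assumes citations are treated as undirected links and that self-citations are ignored; both are assumptions, since the document does not say how 06_compute_kcore.py handles direction. Papers with no links never appear in an edge list and would have k = 0.

```python
def core_numbers(edges):
    """Core number per node, treating each citation as an undirected link."""
    neighbours = {}
    for u, v in edges:
        if u == v:
            continue  # assumption: ignore self-citations
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    degree = {v: len(nbrs) for v, nbrs in neighbours.items()}
    remaining = set(neighbours)
    core = {}
    k = 0
    while remaining:
        # Nodes whose remaining degree is at most k cannot be in the (k+1)-core.
        peel = [v for v in remaining if degree[v] <= k]
        if not peel:
            k += 1
            continue
        while peel:
            v = peel.pop()
            if v not in remaining:
                continue  # already removed via another neighbour
            remaining.discard(v)
            core[v] = k
            for w in neighbours[v]:
                if w in remaining:
                    degree[w] -= 1
                    if degree[w] <= k:
                        peel.append(w)
    return core


# Toy graph (hypothetical papers): triangle A-B-C plus D linked only to A.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("D", "A")]
core = core_numbers(edges)
```

D is peeled at k = 1, while the triangle survives until k = 2.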

Range: 0–32 (empirically observed in our network)

Interpretation:

Usage in Pipeline:

2.6 Composite Importance Score

Formula:

composite_score = 0.80 × normalized_pagerank + 0.20 × normalized_kcore

Normalization:

Role-Based Boost:

if out_degree >= 37:  # top 5% threshold
    composite_score += 0.10  # methods/infrastructure boost
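
A worked sketch of the composite formula and the boost is given below. The normalisation method is not spelled out in this document, so min-max scaling to [0, 1] is an assumption here. The boost is applied at out_degree >= 37 to match the ≥37 cutoff in the role table. The function names and toy values are hypothetical.

```python
def min_max(values):
    """Scale a dict of numbers to [0, 1]; all-equal inputs map to 0.0."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in values.items()}


def composite_scores(pagerank, kcore, out_degree,
                     w_pagerank=0.80, w_kcore=0.20,
                     boost_threshold=37, boost=0.10):
    """Weighted sum of normalised PageRank and k-core, plus the role boost."""
    pr_norm = min_max(pagerank)  # assumption: min-max normalisation
    kc_norm = min_max(kcore)
    scores = {}
    for paper in pagerank:
        score = w_pagerank * pr_norm[paper] + w_kcore * kc_norm[paper]
        if out_degree.get(paper, 0) >= boost_threshold:
            score += boost  # methods/infrastructure boost
        scores[paper] = score
    return scores


# Toy inputs (hypothetical values, not taken from the corpus).
pagerank = {"p1": 0.004, "p2": 0.002, "p3": 0.001}
kcore = {"p1": 32, "p2": 16, "p3": 0}
out_degree = {"p1": 12, "p2": 40, "p3": 5}
scores = composite_scores(pagerank, kcore, out_degree)
```

For p2 the arithmetic is 0.80 × (1/3) + 0.20 × 0.5 + 0.10 ≈ 0.467, where the last term is the boost for citing 40 papers.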

Rationale:

Result Distribution:


3. PAPER ROLE CLASSIFICATION

3.1 Ten Granular Roles

Role Assignment Logic:

Role                    In-Degree  Out-Degree  % of Corpus  Interpretation
landmark_influential    ≥150       Any         0.3%         Most cited, foundational
landmark_connectome     ≥300       <20         0.0%         Connectome papers only, narrow
foundational            50–149     <37         1.6%         Cited heavily, not broad synthesis
methods_infrastructure  <50        ≥37         1.6%         High citations given, implementation focus
well_cited              30–49      <37         3.2%         Moderately cited, specific
synthesis_integration   20–29      20–36       5.4%         Integrative work, medium breadth
cited_technical         20–29      <20         7.7%         Technical papers, narrow scope
balanced_contribution   10–19      10–19       7.9%         Even influence both directions
methodological_focus    10–19      Any         24.0%        Active research, methods-oriented
active_contributor      1–9        Any         45.4%        Emerging/narrow work, limited impact yet
orphaned_paper          0          Any         2.9%         No citations received (data artifact?)
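
One way to read the table is as a first-match rule list, sketched below. Several rows overlap (for example landmark_connectome falls inside landmark_influential, and balanced_contribution inside methodological_focus), so the evaluation order matters. The order used here, with the narrower row checked first, is an assumption and may differ from 10_paper_role_analysis_granular.py. Combinations that no row covers are returned as "unclassified", a label introduced only for this sketch.

```python
import math

INF = math.inf

# (role, in_min, in_max, out_min, out_max); all bounds inclusive.
ROLE_RULES = [
    ("landmark_connectome", 300, INF, 0, 19),
    ("landmark_influential", 150, INF, 0, INF),
    ("foundational", 50, 149, 0, 36),
    ("methods_infrastructure", 0, 49, 37, INF),
    ("well_cited", 30, 49, 0, 36),
    ("synthesis_integration", 20, 29, 20, 36),
    ("cited_technical", 20, 29, 0, 19),
    ("balanced_contribution", 10, 19, 10, 19),
    ("methodological_focus", 10, 19, 0, INF),
    ("active_contributor", 1, 9, 0, INF),
    ("orphaned_paper", 0, 0, 0, INF),
]


def classify_role(in_degree, out_degree):
    """Return the first role whose in/out-degree ranges both match."""
    for role, in_min, in_max, out_min, out_max in ROLE_RULES:
        if in_min <= in_degree <= in_max and out_min <= out_degree <= out_max:
            return role
    return "unclassified"  # hypothetical fallback, not a role from the table


example_roles = {
    (200, 5): classify_role(200, 5),
    (25, 25): classify_role(25, 25),
    (100, 40): classify_role(100, 40),
}
```

The last example (in-degree 100, out-degree 40) matches no row as the table is written, which may be worth checking against the real assignment logic.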

3.2 Rationale for Thresholds

In/Out-degree cutoffs (10 minimum):

Out-degree >= 37 (methods/infrastructure):

Example Classifications:

3.3 Known Biases

  1. Recency bias: New papers have low in-degree even if impactful
    • Mitigation: Use recent_pagerank for papers <3 years old
  2. Field bias: Connectomics papers often narrower scope than reviews
    • Mitigation: In/out-degree captures scope independently of field
  3. Preprint treatment: Some preprints not yet cited widely
    • Mitigation: Keep preprints in corpus, but flag in role assignment
  4. Citation patterns: Different fields cite differently (EM vs. CS)
    • Limitation: Pipeline assumes uniform citation patterns
    • Accepted: This is inherent to cross-domain analysis

4. AUTHOR MERGING STRATEGY

4.1 Why Merge?

Problem: Same author appears under multiple name variants

Solution: Apply 17 explicit merges based on domain knowledge

4.2 Merge Table

Canonical Name           Aliases                                                 Papers (Merged Count)
William Gray-Roncal      Will Gray-Roncal, W. Gray-Roncal, Gray-Roncal William   28
H. Sebastian Seung       Seung HS, Sebastian Seung, H.S. Seung                   64
Gregory S.X.E. Jefferis  Jefferis G.S.X.E., G. Jefferis, Gregory Jefferis        60
Shin-ya Takemura         Takemura Shin-ya, Takemura S., Shinichi Takemura        39
Davi D. Bock             Bock DD, Davi Bock, D.D. Bock                           56
Moritz Helmstaedter      Helmstaedter M., Moritz H., M. Helmstaedter             43
(and 11 others)

4.3 Implementation

Script: 15_apply_author_merges_to_corpus.py

Process:

  1. Load corpus_final.json (7,503 papers)
  2. For each paper, for each author, apply merge table
  3. Update author names and co-author networks
  4. Recompute author rankings with merged names
  5. Save corpus_merged.json and author_rankings_merged.json
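
Steps 2 and 3 above can be sketched as an alias lookup applied to each paper's author list. The exact-string matching, the function names, and the toy records are assumptions for illustration; the real script may normalise names differently. Only two of the 17 merges are included.

```python
# Two merges copied from the merge table; the remaining entries are omitted.
MERGES = {
    "H. Sebastian Seung": ["Seung HS", "Sebastian Seung", "H.S. Seung"],
    "Davi D. Bock": ["Bock DD", "Davi Bock", "D.D. Bock"],
}


def build_alias_lookup(merges):
    """Invert {canonical: [aliases]} into {alias: canonical}."""
    lookup = {}
    for canonical, aliases in merges.items():
        for alias in aliases:
            lookup[alias] = canonical
    return lookup


def apply_merges(papers, merges):
    """Return (new paper list, number of author entries renamed)."""
    lookup = build_alias_lookup(merges)
    renamed = 0
    merged_papers = []
    for paper in papers:
        authors = []
        for author in paper["authors"]:
            canonical = lookup.get(author["name"], author["name"])
            if canonical != author["name"]:
                renamed += 1
            authors.append({**author, "name": canonical})
        merged_papers.append({**paper, "authors": authors})
    return merged_papers, renamed


# Toy corpus with hypothetical IDs; real records carry more fields.
papers = [
    {"openalex_id": "W1", "authors": [{"name": "Seung HS"}, {"name": "Davi Bock"}]},
    {"openalex_id": "W2", "authors": [{"name": "H. Sebastian Seung"}]},
]
merged_papers, renamed = apply_merges(papers, MERGES)
```

The input list is left untouched, which makes it easy to compare author counts before and after merging.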

Impact:

4.4 Known Limitations


5. JOURNAL CLUB SELECTION STRATEGY

5.1 Philosophy

Goal: Identify “core papers” that best represent the field’s knowledge

NOT based on:

BASED on:

5.2 Threshold Selection

Four thresholds (10, 15, 20, 30):

Threshold  Papers  Platinum  Gold  Silver  Bronze  Philosophy
10         64      10        5     7       42      Very inclusive, all active research
15         33      10        5     7       11      Selective, high impact
20         22      10        5     7       0       RECOMMENDED (hand-curated feel)
30         15      10        5     0       0       Ultra-core, elite papers only

Threshold 20 is recommended because:

5.3 Tier Assignment

Tier Logic:

if composite_score >= 0.8:
    tier = "Platinum"  # elite
elif composite_score >= 0.5:
    tier = "Gold"  # high impact
elif composite_score >= 0.3:
    tier = "Silver"  # solid contribution
else:
    tier = "Bronze"  # foundational/supporting

Platinum Papers (10):

Gold Papers (5):

Silver Papers (7):

Bronze Papers:

5.4 In-Degree/Out-Degree Inclusion

Why both directions matter:

Example use:


6. VISUALIZATION DESIGN

6.1 Main Visualizations

index.html — Unified Dashboard

field_map_full.html — Citation Network

Design Rationale:

coauthor_map_full.html — Collaboration Network

Design Rationale:

kcore_map.html — Structural Density

evolution_graph_full.html — Field Timeline

6.2 Design Principles

1. All papers included (not limited to top 500)

2. Force-directed layouts over hierarchical

3. Directed vs. undirected edges

4. Color schemes

6.3 Performance & Scale

File Sizes:

Optimization Techniques:

Browser Requirements:


7. KNOWN LIMITATIONS & BIASES

7.1 Data Limitations

1. OpenAlex Coverage

2. Citation Counting

3. Author Name Normalization

7.2 Methodological Biases

1. PageRank Bias

2. K-Core Bias

3. Journal Club Selection

4. Paper Role Classification

7.3 How We Mitigate

1. Multiple Metrics

2. Role Classification

3. Expert Validation

4. Transparency


8. QUALITY ASSURANCE CHECKLIST

Before “Ready for Review”

Ongoing Validation


9. FUTURE IMPROVEMENTS

9.1 Short-Term (Next Sprint)

  1. Automated Author Merging
    • Implement fuzzy name matching (edit distance < 2)
    • Validate matches against co-authorship patterns
    • Flag ambiguous cases for manual review
  2. Temporal Metrics
    • Compute PageRank at yearly snapshots
    • Track metric evolution over time
    • Identify “rising stars” vs. “established leaders”
  3. Cross-Validation
    • Compare rankings against citation counts alone
    • Compare against human expert rankings
    • Sensitivity analysis on weights (is the 80/20 split optimal?)
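
For the fuzzy name matching proposed under item 1 above, a minimal sketch of the edit-distance check might look like this. "Edit distance < 2" is read as a Levenshtein distance of at most 1. The pairwise comparison, the lower-casing, and the misspelled toy name are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programme)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def candidate_merges(names, max_distance=1):
    """Return name pairs within max_distance edits of each other."""
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if edit_distance(first.lower(), second.lower()) <= max_distance:
                pairs.append((first, second))
    return pairs


# "Davi Bok" is a made-up misspelling used only to trigger a match.
names = ["Davi Bock", "Davi Bok", "Moritz Helmstaedter"]
pairs = candidate_merges(names)
```

A distance check alone would also pair distinct people with near-identical names, which is why the plan above adds co-authorship validation and manual review. Comparing all pairs is quadratic in the number of names, so a corpus of 35,641 authors would likely need blocking, for example by surname.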

9.2 Medium-Term (Next Quarter)

  1. Methods Paper Weighting
    • Should methods papers be weighted differently in PageRank?
    • Consider temporal discount (older methods less important)
    • Boost infrastructure papers empirically
  2. Community Detection Robustness
    • Try alternative algorithms (Leiden, Girvan-Newman)
    • Compare community assignments
    • Validate against hand-labeled communities
  3. Seed List Enrichment
    • Add 15 missing papers (when ready)
    • Verify bioRxiv preprints are published
    • Update metrics and visualizations

9.3 Long-Term (Methodology Review)

  1. Composite Score Optimization
    • Grid search over weights (0–100% PageRank vs. k-core)
    • Optimize against expert rankings
    • Document trade-offs
  2. Role Classification Refinement
    • Unsupervised clustering of papers by in/out degree
    • Compare to hand-assigned roles
    • Adjust thresholds based on findings
  3. Field-Specific Normalization
    • Should connectomics papers be scored differently?
    • Account for field size and citation norms
    • Implement per-field PageRank

10. REFERENCES & DOCUMENTATION

Key Papers on Methodology

Data Sources

Configuration

Global Parameters:

DAMPING_FACTOR = 0.85           # PageRank
PAGERANK_ITERATIONS = 100
KCORE_ITERATIONS = 1000
COMPOSITE_WEIGHT_PAGERANK = 0.80
COMPOSITE_WEIGHT_KCORE = 0.20
METHODS_PAPER_THRESHOLD_OUTDEGREE = 37  # Top 5%
IN_OUT_DEGREE_MINIMUM = 10              # Active research threshold

Appendix: Data Dictionary

corpus_final.json

{
  "openalex_id": "string (unique paper ID)",
  "doi": "string (DOI, may be null)",
  "title": "string",
  "year": "integer (1977–2026)",
  "authors": [{
    "name": "string",
    "author_position": "string (first, middle, last)",
    "ror": "string (institution ID, may be null)"
  }],
  "cited_by_count": "integer",
  "publication_date": "string (ISO 8601)",
  "type": "string (journal-article, preprint, etc.)",
  "concepts": [{
    "display_name": "string",
    "level": "integer"
  }]
}

paper_rankings_all.json

[{
  "openalex_id": "string",
  "title": "string",
  "year": "integer",
  "rank": "integer (1–7503)",
  "composite_score": "float (0–1)",
  "pagerank": "float",
  "hits_hub": "float",
  "hits_authority": "float",
  "betweenness": "float",
  "in_degree": "integer",
  "out_degree": "integer",
  "k_core": "integer (0–32)",
  "inferred_role": "string (10 categories)",
  "total_citations": "integer"
}]
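
A record in this shape can be sanity-checked with a small validator such as the sketch below. The field list is copied from the schema above. The checking rules (integers accepted where a float is expected, booleans rejected) and the sample values are assumptions, not pipeline behaviour.

```python
import json

# Field names and types taken from the paper_rankings_all.json schema above.
EXPECTED_TYPES = {
    "openalex_id": str, "title": str, "year": int, "rank": int,
    "composite_score": float, "pagerank": float, "hits_hub": float,
    "hits_authority": float, "betweenness": float, "in_degree": int,
    "out_degree": int, "k_core": int, "inferred_role": str,
    "total_citations": int,
}


def validate_record(record):
    """Return a list of problems; an empty list means the record looks fine."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        if isinstance(value, bool):
            ok = False  # JSON true/false is never a valid value here
        elif expected is float:
            ok = isinstance(value, (int, float))  # JSON may store 1.0 as 1
        else:
            ok = isinstance(value, expected)
        if not ok:
            problems.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


# Hypothetical record; the values are illustrative, not from the corpus.
sample = json.loads("""{
  "openalex_id": "W0000000000", "title": "Example paper", "year": 2015,
  "rank": 1, "composite_score": 0.93, "pagerank": 0.0041, "hits_hub": 0.0,
  "hits_authority": 0.12, "betweenness": 0.002, "in_degree": 310,
  "out_degree": 18, "k_core": 32, "inferred_role": "landmark_connectome",
  "total_citations": 950
}""")
problems = validate_record(sample)
```

Running the same check over every record after each pipeline run would catch schema drift before the visualizations are regenerated.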

author_rankings.json

[{
  "name": "string",
  "rank": "integer (1–35641)",
  "paper_count": "integer",
  "co_author_count": "integer",
  "composite_score": "float",
  "merged": "boolean (was name merged?)"
}]

Last Updated: 2026-03-31
Pipeline Version: 2.0 (end-to-end, all papers, all metrics)
Status: Ready for team review and methodology discussion