Enrichment & Recomputation Plan
Current State
✅ Phase 1 Complete:
- Author merging applied (17 expert name merges)
- Paper role categories refined (granular, lower thresholds)
- Key experts identified (top 50 researchers)
- Career arcs extracted (publication trajectories)
- Synthesis papers analyzed (references checked)
Result: All papers cited by synthesis experts already in corpus — no enrichment candidates needed (good news!)
Phase 2: Seed List Comparison & Strategic Enrichment
Step 1: Load User Seed List
- Load the original 200 hand-curated papers
- Compare against current corpus (7,503 papers)
- Identify missing papers from seed list
Step 2: Cross-Reference with Key Experts
- Check which seed papers are authored by top 50 experts
- Check which seed papers are cited by synthesis experts
- Identify missing papers by key authors
Step 3: Strategic Enrichment Decision
Framework: Confidence >0.5 threshold, with justification
- Papers cited by 2+ expert synthesis papers ✓ (already in corpus)
- Papers from seed list by key experts (TO BE ADDED)
- Papers with explicit domain expert notation (TO BE REVIEWED)
Flag for manual review:
- Decision: INCLUDE / EXCLUDE / MAYBE
- Confidence score: 0.0-1.0
- Reason: Why we flagged this paper
Phase 3: Final Recomputation Pass
After enrichment is approved, recompute everything:
- Paper Metrics (all papers)
- PageRank on citation graph (85,000+ edges)
- HITS hub/authority
- Betweenness centrality
- K-core decomposition
- Composite scores
- Author Rankings (all 35,797 authors)
- Productivity (paper count after merges)
- Collaboration breadth (co-author count)
- PageRank on co-authorship network
- Updated with merged names
- Network Analysis
- Community detection (Louvain) — may change with new papers
- Citation neighborhoods
- Collaboration patterns
- Journal Club (new selections)
- Recompute on updated paper rankings
- 4 thresholds: 0.10, 0.15, 0.20, 0.30
- Re-evaluate role-weighted scores
- Visualizations
- field_map_full.html (updated citation network)
- coauthor_map_full.html (updated collaboration network)
- career_arcs visualization (publication trajectories)
- kcore_map.html (updated structural density)
- evolution_graph_full.html (temporal evolution)
- Career Arc Visualization (NEW)
- Timeline plot: publications per year × expert
- Color by paper role (landmark, foundational, synthesis, etc.)
- Collaboration evolution (co-authors over time)
- Impact trajectory (citations accumulating over time)
Execution Order
1. Load seed list
↓
2. Compare with corpus → missing papers list
↓
3. Create enrichment candidates with justification
↓
4. Manual review → approve/reject decisions
↓
5. Add approved papers to corpus
↓
6. Recompute all metrics (Step-by-step)
- PageRank, HITS, betweenness
- K-core decomposition
- Communities
- Author rankings
↓
7. Regenerate visualizations
- Updated graphs with new edges
- Journal club at all thresholds
- Career arc plots
↓
8. Final validation
- Check for new anomalies
- Verify citation counts
- Spot-check key authors
Data Flow Diagram
User Seed List (200 papers)
↓
Compare with Corpus (7,503)
↓
Missing Papers
↓
Key Expert Analysis
↓
Enrichment Candidates (flagged)
↓
Manual Review → APPROVED
↓
Add to Corpus → Corpus_enriched.json
↓
Recompute Metrics (all papers)
↓
Update Rankings, K-core, Communities
↓
Regenerate Visualizations
↓
Final Output (complete, enriched pipeline)
Key Decisions to Make
- Enrichment threshold: Confidence >0.5 confirmed?
- Seed list scope: Compare all 200, or just key ones?
- Author merge confidence: >0.5 threshold for name merges?
- Career arc detail: Show all 50 experts, or top 15?
Files to Create/Update
New files:
seed_list_comparison.json— Missing papers from seedenrichment_decisions_log.json— Flagged papers with approvalscorpus_enriched.json— Updated corpus with new paperscareer_arcs_visualization.html— Interactive timeline
Updated files:
paper_rankings_all.json— Recomputed metricsauthor_rankings.json— Recomputed with full corpusjournal_club_threshold_*.json— New selections- All visualization HTML files
Timeline
- Phase 2 (Enrichment): 1 step — load seed list, compare, flag
- Phase 3 (Recomputation): 6-7 steps — can parallelize some metrics
- Total: Ready when you approve enrichment candidates
Next Action
→ Load user seed list and compare against corpus (7,503 papers)