AI Judge Evaluation Methodology
Understanding how we score and compare consciousness theories
Current Evaluation
Overview
The pentagonal radar chart scores reflect how well each theory's answers address key questions about consciousness. These are not judgments about which theory is "correct"—they measure explanatory clarity and completeness within the theory's own framework.
The 5 Questions
All 15 theories answer the same 5 standardized questions, designed to be theory-neutral and centered on the hard problem. Each question elicits concrete, testable claims rather than meta-theoretical positioning.
- The hard problem: Why does physical processing give rise to subjective experience?
- Mechanism: What specific mechanism distinguishes conscious processing from unconscious processing of equal complexity?
- Empirical dissociations: How does this theory account for known dissociations between consciousness and neural processing—such as anesthesia abolishing consciousness while preserving neural activity, split-brain cases, and blindsight?
- Boundary conditions: What is the smallest or simplest system this theory predicts could be conscious, and how would that be tested?
- Falsifiability: What specific, falsifiable predictions does this theory make that could distinguish it from competing theories, and what evidence would falsify it?
The 5 Radar Dimensions
Each theory is scored on 5 dimensions (scale 1–5), visualized as a pentagonal radar chart. Each dimension maps directly to one question, and every score includes a rationale explaining the reasoning.
| Dimension | Question | What It Measures |
|---|---|---|
| Hard Problem | Q1 | Does it explain why experience exists, not just when or where it occurs? |
| Mechanism | Q2 | Is the proposed mechanism concrete? Does it fit known phenomena? |
| Empirical Fit | Q3 | Does it account for known dissociations (anesthesia, split-brain, blindsight)? |
| Boundary Precision | Q4 | Testable claims about where consciousness starts and stops. |
| Falsifiability | Q5 | What would break it? Predictions that distinguish it from competitors. |
Scores reflect how well the theory has been explained on the platform, not its inherent quality. A high score means clear, specific answers—not that the theory is correct.
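The five dimension scores map naturally onto a pentagonal radar chart. As an illustrative sketch (not the platform's actual rendering code), the following shows how five 1–5 scores could be converted into the pentagon's vertex coordinates, with the first dimension pointing straight up:

```python
import math

DIMENSIONS = ["Hard Problem", "Mechanism", "Empirical Fit",
              "Boundary Precision", "Falsifiability"]

def radar_vertices(scores, max_score=5):
    """Map five 1-5 scores to (x, y) vertices of a pentagon.

    Vertex 0 points straight up; the rest follow clockwise.
    Each score is normalised to [0, 1] so the outline fits
    inside the unit circle.
    """
    if len(scores) != len(DIMENSIONS):
        raise ValueError("expected one score per dimension")
    if any(not 1 <= s <= max_score for s in scores):
        raise ValueError("scores must be between 1 and 5")
    vertices = []
    for i, s in enumerate(scores):
        # Start at 90 degrees (top of the chart) and step clockwise.
        angle = math.pi / 2 - 2 * math.pi * i / len(scores)
        r = s / max_score
        vertices.append((r * math.cos(angle), r * math.sin(angle)))
    return vertices

# A theory scoring 4 on every dimension traces a regular pentagon
# at 80% of the chart's radius.
pentagon = radar_vertices([4, 4, 4, 4, 4])
```

Normalising by the maximum score keeps every theory's profile on the same scale, so differently shaped pentagons can be compared directly.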
The 15 Evaluated Theories
| Code | Full Name | Category |
|---|---|---|
| IIT | Integrated Information Theory | Physicalist |
| GWT | Global Workspace Theory | Functionalist |
| HOT | Higher-Order Thought Theory | Functionalist |
| PP | Predictive Processing | Functionalist |
| PCT | Perceptual Control Theory | Functionalist |
| ILLU | Illusionism | Eliminativist |
| ORCH | Orchestrated Objective Reduction | Emergentist |
| DLCT | Dual-Level Causal Theory | Emergentist |
| IRRUP | Irruption Theory | Emergentist |
| FEP | Free Energy Principle | Physicalist |
| RPT | Recurrent Processing Theory | Physicalist |
| AST | Attention Schema Theory | Functionalist |
| NPS | Neurophenomenal Structuralism | Physicalist |
| EC | Embodied Cognition | Embodied |
| DIT | Dendritic Integration Theory | Physicalist |
Scoring Process
1. PDF Extraction: Academic papers are processed using OpenAI gpt-4o-2024-11-20 to extract key claims and mechanisms.
2. Answer Generation: Claude Opus 4.6 generates answers to each question based solely on extracted source material.
3. Metrics Scoring: Claude Opus 4.6 scores each dimension with a 1–5 score and a rationale explaining the reasoning.
4. Calibration Check: Scores are checked against fixed anchor examples to prevent drift across theories and over time.
5. Community Review: All AI-generated content can be challenged, refined, and overridden by the research community.
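The pipeline above can be sketched as a single function. This is a hypothetical skeleton, not the platform's implementation: the LLM calls (extraction, answering, scoring) are injected as plain callables so only the control flow and the range guard are shown.

```python
def score_theory(pdf_text, questions, extract, answer, score):
    """Sketch of the scoring pipeline.

    `extract`, `answer`, and `score` stand in for the LLM calls
    (PDF extraction, answer generation, metrics scoring); here they
    are passed in as ordinary callables so the flow is testable.
    """
    claims = extract(pdf_text)                 # step 1: extract key claims
    answers = {q: answer(q, claims)            # step 2: answers grounded
               for q in questions}             #         only in the sources
    results = {}
    for q, a in answers.items():               # step 3: score each dimension
        value, rationale = score(q, a)
        if not 1 <= value <= 5:                # step 4: calibration guard
            raise ValueError(f"score out of range for {q!r}")
        results[q] = {"answer": a, "score": value, "rationale": rationale}
    return results                             # step 5: exposed for community review
```

Because every stage is a parameter, the same skeleton runs identically for all 15 theories, which is what makes the fairness guarantees below checkable.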
Calibration & Fairness
Fairness in Content Generation
- Identical prompts for each theory (only the theory name changes)
- Same questions asked of every theory (5 standardized questions)
- Same rubric applied to all evaluations
- No theory-specific modifications to prompts or evaluation criteria
- Source-grounded: All content is generated from academic PDF sources
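The "identical prompts" guarantee is easy to make mechanical. As a hypothetical sketch (the template text here is invented, not the platform's actual prompt), a single shared template with the theory name as its only free variable ensures two theories' prompts can differ in nothing else:

```python
from string import Template

# Hypothetical prompt template: everything is fixed except the
# theory name, the question, and the extracted source material.
ANSWER_PROMPT = Template(
    "You are answering on behalf of $theory, using only the extracted "
    "source material below.\n\nQuestion: $question\n\nSources:\n$sources"
)

def build_prompt(theory, question, sources):
    """Build the answer-generation prompt from the shared template."""
    return ANSWER_PROMPT.substitute(
        theory=theory, question=question, sources=sources)

# Swapping IIT for GWT changes only the theory name, nothing else.
p1 = build_prompt("Integrated Information Theory", "Q1", "...")
p2 = build_prompt("Global Workspace Theory", "Q1", "...")
```

A test asserting that the two prompts are identical once the theory names are masked is one way to enforce the no-theory-specific-modifications rule in CI.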
Theory Selection
For details on how we decide which theories to include and exclude, see our Theory Selection Criteria page.
Important Caveats
- Answer Quality, Not Theory Truth: A high score means the theory provides clear, specific answers—not that the theory is correct.
- Source-Grounded: All scores derive from academic source material, not general knowledge.
- Trade-offs Are Real: Some theories intentionally sacrifice one dimension for another.
- Improvable: As source materials are added or answers refined, scores can improve.
Version History
| Version | Date | Changes |
|---|---|---|
| 4.0 | Feb 20, 2026 | Reduced to 5 questions centered on the hard problem. 5 radar dimensions now map 1:1 to questions (Hard Problem, Mechanism, Empirical Fit, Boundary Precision, Falsifiability). Pentagonal radar chart. |
| 3.0 | Feb 19, 2026 | Removed 5 theories that failed selection criteria. Added 4 empirically grounded theories (AST, NPS, EC, DIT). Removed leaderboard/ranking system in favor of radar chart profiles. Added Embodied category. |
| 2.1 | Dec 11, 2024 | Fair regeneration complete: all theories regenerated from PDFs with identical prompts. New 8-dimension rubric with rationales stored in the database. |
| 2.0 | Dec 9, 2024 | Overhaul: 8 dimensions derived from 15 curated questions. Added FEP and RPT theories. |
| 1.0 | Dec 3, 2024 | Initial release with 24 questions and original dimension labels. |