AI Judge Evaluation Methodology

Understanding how we score and compare consciousness theories

Current Evaluation

Model: Claude Opus 4.6
Theories: 15
Questions: 5
Rubric Version: 4.0
Last Updated: Feb 20, 2026

Overview

The pentagonal radar chart scores reflect how well each theory's answers address key questions about consciousness. These are not judgments about which theory is "correct"—they measure explanatory clarity and completeness within the theory's own framework.

The 5 Questions

All 15 theories answer the same 5 standardized questions, which are designed to be theory-neutral and centered on the hard problem. Each question is phrased to elicit concrete, testable claims rather than meta-theoretical positioning.

  1. The hard problem: Why does physical processing give rise to subjective experience?
  2. Mechanism: What specific mechanism distinguishes conscious processing from unconscious processing of equal complexity?
  3. Empirical dissociations: How does this theory account for known dissociations between consciousness and neural processing—such as anesthesia abolishing consciousness while preserving neural activity, split-brain cases, and blindsight?
  4. Boundary conditions: What is the smallest or simplest system this theory predicts could be conscious, and how would that be tested?
  5. Falsifiability: What specific, falsifiable predictions does this theory make that could distinguish it from competing theories, and what evidence would falsify it?

The 5 Radar Dimensions

Each theory is scored on 5 dimensions (scale 1–5), visualized as a pentagonal radar chart. Each dimension maps directly to one question, and every score includes a rationale explaining the reasoning.

| Dimension | Question | What It Measures |
|-----------|----------|------------------|
| Hard Problem | Q1 | Does it explain why experience exists, not just when or where? |
| Mechanism | Q2 | Is the proposed mechanism concrete? Does it fit known phenomena? |
| Empirical Fit | Q3 | Does it account for known dissociations (anesthesia, split-brain, blindsight)? |
| Boundary Precision | Q4 | Testable claims about where consciousness starts and stops. |
| Falsifiability | Q5 | What would break it? Predictions that distinguish it from competitors. |
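The 1:1 mapping between questions and dimensions, and the 1–5 range, can be sketched as a small validation routine. This is purely illustrative: the dimension keys and function names are assumptions, not the platform's actual schema.

```python
# Illustrative sketch of a theory's radar-score record. The 1:1
# question-to-dimension mapping and the 1-5 range come from the rubric;
# the key names and helper are assumptions for illustration.

DIMENSION_TO_QUESTION = {
    "hard_problem": "Q1",
    "mechanism": "Q2",
    "empirical_fit": "Q3",
    "boundary_precision": "Q4",
    "falsifiability": "Q5",
}

def validate_scores(scores):
    """Require every dimension to be present with a score in 1-5."""
    for dim in DIMENSION_TO_QUESTION:
        if dim not in scores:
            raise ValueError(f"missing dimension: {dim}")
        if not 1 <= scores[dim] <= 5:
            raise ValueError(f"score out of range for {dim}: {scores[dim]}")

# Hypothetical example profile (the scores are made up):
example = {
    "hard_problem": 4,
    "mechanism": 5,
    "empirical_fit": 3,
    "boundary_precision": 4,
    "falsifiability": 3,
}
validate_scores(example)  # a complete, in-range pentagonal profile
```

A record that omits a dimension or exceeds the 1–5 range would fail validation rather than silently distorting the radar chart.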

Scores reflect how well the theory has been explained on the platform, not its inherent quality. A high score means clear, specific answers—not that the theory is correct.

The 15 Evaluated Theories

| Code | Full Name | Category |
|------|-----------|----------|
| IIT | Integrated Information Theory | Physicalist |
| GWT | Global Workspace Theory | Functionalist |
| HOT | Higher-Order Thought Theory | Functionalist |
| PP | Predictive Processing | Functionalist |
| PCT | Perceptual Control Theory | Functionalist |
| ILLU | Illusionism | Eliminativist |
| ORCH | Orchestrated Objective Reduction | Emergentist |
| DLCT | Dual-Level Causal Theory | Emergentist |
| IRRUP | Irruption Theory | Emergentist |
| FEP | Free Energy Principle | Physicalist |
| RPT | Recurrent Processing Theory | Physicalist |
| AST | Attention Schema Theory | Functionalist |
| NPS | Neurophenomenal Structuralism | Physicalist |
| EC | Embodied Cognition | Embodied |
| DIT | Dendritic Integration Theory | Physicalist |

Scoring Process

  1. PDF Extraction: Academic papers are processed using OpenAI gpt-4o-2024-11-20 to extract key claims and mechanisms.
  2. Answer Generation: Claude Opus 4.6 generates answers to each question based solely on extracted source material.
  3. Metrics Scoring: Claude Opus 4.6 scores each dimension with a 1–5 score and a rationale explaining the reasoning.
  4. Calibration Check: Anchor points are verified to prevent score drift.
  5. Community Review: All AI-generated content can be challenged, refined, and overridden by the research community.
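The five steps above can be sketched as a single pipeline function. The stubs below stand in for the real LLM calls (gpt-4o extraction, Claude Opus answering and scoring); every name and signature here is an illustrative assumption, not the platform's code.

```python
# Sketch of the five-step scoring pipeline; stub functions stand in
# for the actual LLM calls. All names here are assumptions.

def extract_claims(pdf_text):
    # Step 1: PDF extraction (done by gpt-4o in the real pipeline).
    return [line for line in pdf_text.splitlines() if line.strip()]

def generate_answer(theory, question, claims):
    # Step 2: answer generation from extracted source material only.
    return f"{theory} answer to {question}, grounded in {len(claims)} claims"

def score_dimension(answer):
    # Step 3: each dimension gets a 1-5 score plus a rationale.
    return {"score": 3, "rationale": f"placeholder rationale for: {answer[:30]}"}

def check_calibration(scores):
    # Step 4: verify scores stay within the anchored 1-5 range.
    assert all(1 <= s["score"] <= 5 for s in scores.values())

def run_evaluation(theory, pdf_text, questions):
    claims = extract_claims(pdf_text)
    answers = {q: generate_answer(theory, q, claims) for q in questions}
    scores = {q: score_dimension(a) for q, a in answers.items()}
    check_calibration(scores)
    # Step 5: the returned record is what the community reviews/overrides.
    return {"theory": theory, "answers": answers, "scores": scores}

result = run_evaluation("IIT", "Phi measures integration.\n", ["Q1", "Q2"])
```

The key structural point is that every step consumes only the output of the previous one, so the final record is traceable back to the extracted source claims.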

Calibration & Fairness

Fairness in Content Generation

  • Identical prompts for each theory (only the theory name changes)
  • Same questions asked of every theory (5 standardized questions)
  • Same rubric applied to all evaluations
  • No theory-specific modifications to prompts or evaluation criteria
  • Source-grounded: All content generated from academic PDF sources
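The identical-prompt principle can be made concrete with a template in which only the theory name varies. The template wording below is illustrative, not the platform's actual prompt.

```python
# Sketch of the identical-prompt principle: one template, with only the
# theory name substituted. The template text is an illustrative assumption.

PROMPT_TEMPLATE = (
    "Using only the attached source material for {theory}, answer:\n{question}"
)

def build_prompt(theory, question):
    return PROMPT_TEMPLATE.format(theory=theory, question=question)

q = "What specific mechanism distinguishes conscious processing?"
p1 = build_prompt("Integrated Information Theory", q)
p2 = build_prompt("Global Workspace Theory", q)

# Apart from the theory name, the two prompts are character-identical:
assert p1.replace("Integrated Information Theory", "X") == \
       p2.replace("Global Workspace Theory", "X")
```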

Theory Selection

For details on how we decide which theories to include and exclude, see our Theory Selection Criteria page.

Important Caveats

  • Answer Quality, Not Theory Truth: A high score means the theory provides clear, specific answers—not that the theory is correct.
  • Source-Grounded: All scores derive from academic source material, not general knowledge.
  • Trade-offs Are Real: Some theories intentionally sacrifice one dimension for another.
  • Improvable: As source materials are added or answers refined, scores can improve.

Version History

| Version | Date | Changes |
|---------|------|---------|
| 4.0 | Feb 20, 2026 | Reduced to 5 questions centered on the hard problem. 5 radar dimensions now map 1:1 to questions (Hard Problem, Mechanism, Empirical Fit, Boundary Precision, Falsifiability). Pentagonal radar chart. |
| 3.0 | Feb 19, 2026 | Removed 5 theories that failed selection criteria. Added 4 empirically grounded theories (AST, NPS, EC, DIT). Removed leaderboard/ranking system in favor of radar chart profiles. Added Embodied category. |
| 2.1 | Dec 11, 2024 | Fair regeneration complete: all theories regenerated from PDFs with identical prompts. New 8 dimensions with rationales stored in the database. |
| 2.0 | Dec 9, 2024 | Overhaul: 8 dimensions derived from 15 curated questions. Added FEP and RPT theories. |
| 1.0 | Dec 3, 2024 | Initial release with 24 questions and original dimension labels. |