AI Judge Evaluation Methodology

Understanding how we score and compare consciousness theories

Current Evaluation

Model: Claude Opus 4.6
Theories: 15
Questions: 5
Rubric Version: 4.0
Last Updated: Feb 20, 2026

Overview

The pentagonal radar chart scores reflect how well each theory's answers address key questions about consciousness. These are not judgments about which theory is "correct"—they measure explanatory clarity and completeness within the theory's own framework.

The 5 Questions

All 15 theories answer the same 5 standardized questions, which are designed to be theory-neutral and centered on the hard problem. Each question is phrased to elicit concrete, testable claims rather than meta-theoretical positioning.

  1. The hard problem: Why does physical processing give rise to subjective experience?
  2. Mechanism: What specific mechanism distinguishes conscious processing from unconscious processing of equal complexity?
  3. Empirical dissociations: How does this theory account for known dissociations between consciousness and neural processing—such as anesthesia abolishing consciousness while preserving neural activity, split-brain cases, and blindsight?
  4. Boundary conditions: What is the smallest or simplest system this theory predicts could be conscious, and how would that be tested?
  5. Falsifiability: What specific, falsifiable predictions does this theory make that could distinguish it from competing theories, and what evidence would falsify it?

The 5 Radar Dimensions

Each theory is scored on 5 dimensions (scale 1–5), visualized as a pentagonal radar chart. Each dimension maps directly to one question, and every score includes a rationale explaining the reasoning.

| Dimension | Question | What It Measures |
|-----------|----------|------------------|
| Hard Problem | Q1 | Does it explain why experience exists, not just when or where? |
| Mechanism | Q2 | Is the proposed mechanism concrete? Does it fit known phenomena? |
| Empirical Fit | Q3 | Does it account for known dissociations (anesthesia, split-brain, blindsight)? |
| Boundary Precision | Q4 | Testable claims about where consciousness starts and stops. |
| Falsifiability | Q5 | What would break it? Predictions that distinguish it from competitors. |
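The 1:1 mapping between questions and dimensions, and the 1–5 range, can be sketched as a small validation routine. This is purely illustrative: the dimension keys and function names are assumptions, not the platform's actual schema.

```python
# Illustrative sketch of a theory's radar-score record. The 1:1
# question-to-dimension mapping and the 1-5 range come from the rubric;
# the key names and helper are assumptions for illustration.

DIMENSION_TO_QUESTION = {
    "hard_problem": "Q1",
    "mechanism": "Q2",
    "empirical_fit": "Q3",
    "boundary_precision": "Q4",
    "falsifiability": "Q5",
}

def validate_scores(scores):
    """Require every dimension to be present with a score in 1-5."""
    for dim in DIMENSION_TO_QUESTION:
        if dim not in scores:
            raise ValueError(f"missing dimension: {dim}")
        if not 1 <= scores[dim] <= 5:
            raise ValueError(f"score out of range for {dim}: {scores[dim]}")

# Hypothetical example profile (the scores are made up):
example = {
    "hard_problem": 4,
    "mechanism": 5,
    "empirical_fit": 3,
    "boundary_precision": 4,
    "falsifiability": 3,
}
validate_scores(example)  # a complete, in-range pentagonal profile
```

A record that omits a dimension or exceeds the 1–5 range would fail validation rather than silently distorting the radar chart.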

Scores reflect how well the theory has been explained on the platform, not its inherent quality. A high score means clear, specific answers—not that the theory is correct.

The 15 Evaluated Theories

| Code | Full Name | Category |
|------|-----------|----------|
| IIT | Integrated Information Theory | Physicalist |
| GWT | Global Workspace Theory | Functionalist |
| HOT | Higher-Order Thought Theory | Functionalist |
| PP | Predictive Processing | Functionalist |
| PCT | Perceptual Control Theory | Functionalist |
| ILLU | Illusionism | Eliminativist |
| ORCH | Orchestrated Objective Reduction | Emergentist |
| DLCT | Dual-Level Causal Theory | Emergentist |
| IRRUP | Irruption Theory | Emergentist |
| FEP | Free Energy Principle | Physicalist |
| RPT | Recurrent Processing Theory | Physicalist |
| AST | Attention Schema Theory | Functionalist |
| NPS | Neurophenomenal Structuralism | Physicalist |
| EC | Embodied Cognition | Embodied |
| DIT | Dendritic Integration Theory | Physicalist |

Scoring Process

  1. PDF Extraction: Academic papers are processed using OpenAI gpt-4o-2024-11-20 to extract key claims and mechanisms.
  2. Answer Generation: Claude Opus 4.6 generates answers to each question based solely on extracted source material.
  3. Metrics Scoring: Claude Opus 4.6 scores each dimension with a 1–5 score and a rationale explaining the reasoning.
  4. Calibration Check: Anchor points are verified to prevent score drift.
  5. Community Review: All AI-generated content can be challenged, refined, and overridden by the research community.
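The five steps above can be sketched as a single pipeline function. The stubs below stand in for the real LLM calls (gpt-4o extraction, Claude Opus answering and scoring); every name and signature here is an illustrative assumption, not the platform's code.

```python
# Sketch of the five-step scoring pipeline; stub functions stand in
# for the actual LLM calls. All names here are assumptions.

def extract_claims(pdf_text):
    # Step 1: PDF extraction (done by gpt-4o in the real pipeline).
    return [line for line in pdf_text.splitlines() if line.strip()]

def generate_answer(theory, question, claims):
    # Step 2: answer generation from extracted source material only.
    return f"{theory} answer to {question}, grounded in {len(claims)} claims"

def score_dimension(answer):
    # Step 3: each dimension gets a 1-5 score plus a rationale.
    return {"score": 3, "rationale": f"placeholder rationale for: {answer[:30]}"}

def check_calibration(scores):
    # Step 4: verify scores stay within the anchored 1-5 range.
    assert all(1 <= s["score"] <= 5 for s in scores.values())

def run_evaluation(theory, pdf_text, questions):
    claims = extract_claims(pdf_text)
    answers = {q: generate_answer(theory, q, claims) for q in questions}
    scores = {q: score_dimension(a) for q, a in answers.items()}
    check_calibration(scores)
    # Step 5: the returned record is what the community reviews/overrides.
    return {"theory": theory, "answers": answers, "scores": scores}

result = run_evaluation("IIT", "Phi measures integration.\n", ["Q1", "Q2"])
```

The key structural point is that every step consumes only the output of the previous one, so the final record is traceable back to the extracted source claims.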

Calibration & Fairness

Fairness in Content Generation

  • Identical prompts for each theory (only the theory name changes)
  • Same questions asked of every theory (5 standardized questions)
  • Same rubric applied to all evaluations
  • No theory-specific modifications to prompts or evaluation criteria
  • Source-grounded: All content generated from academic PDF sources
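The identical-prompt principle can be made concrete with a template in which only the theory name varies. The template wording below is illustrative, not the platform's actual prompt.

```python
# Sketch of the identical-prompt principle: one template, with only the
# theory name substituted. The template text is an illustrative assumption.

PROMPT_TEMPLATE = (
    "Using only the attached source material for {theory}, answer:\n{question}"
)

def build_prompt(theory, question):
    return PROMPT_TEMPLATE.format(theory=theory, question=question)

q = "What specific mechanism distinguishes conscious processing?"
p1 = build_prompt("Integrated Information Theory", q)
p2 = build_prompt("Global Workspace Theory", q)

# Apart from the theory name, the two prompts are character-identical:
assert p1.replace("Integrated Information Theory", "X") == \
       p2.replace("Global Workspace Theory", "X")
```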

Theory Selection

For details on how we decide which theories to include and exclude, see our Theory Selection Criteria page.

Important Caveats

  • Answer Quality, Not Theory Truth: A high score means the theory provides clear, specific answers—not that the theory is correct.
  • Source-Grounded: All scores derive from academic source material, not general knowledge.
  • Trade-offs Are Real: Some theories intentionally sacrifice one dimension for another.
  • Improvable: As source materials are added or answers refined, scores can improve.

Version History

| Version | Date | Changes |
|---------|------|---------|
| 4.0 | Feb 20, 2026 | Reduced to 5 questions centered on the hard problem. 5 radar dimensions now map 1:1 to questions (Hard Problem, Mechanism, Empirical Fit, Boundary Precision, Falsifiability). Pentagonal radar chart. |
| 3.0 | Feb 19, 2026 | Removed 5 theories that failed selection criteria. Added 4 empirically grounded theories (AST, NPS, EC, DIT). Removed leaderboard/ranking system in favor of radar chart profiles. Added Embodied category. |
| 2.1 | Dec 11, 2024 | Fair regeneration complete: all theories regenerated from PDFs with identical prompts. New 8 dimensions with rationales stored in the database. |
| 2.0 | Dec 9, 2024 | Overhaul: 8 dimensions derived from 15 curated questions. Added FEP and RPT theories. |
| 1.0 | Dec 3, 2024 | Initial release with 24 questions and original dimension labels. |