There is something quietly consequential about a company deciding, on its own terms, how to measure its own progress toward one of the most transformative technologies in human history. That is precisely what OpenAI has done with its newly introduced framework for measuring progress toward artificial general intelligence, paired with a Kaggle hackathon inviting outside researchers to build the evaluations that will populate it.
The framework is cognitive in nature, meaning it attempts to map AI capabilities against benchmarks of human-like reasoning, problem-solving, and learning. On the surface, this looks like responsible science: define your terms, measure your progress, invite scrutiny. But the deeper architecture of the move deserves more careful attention than the press release invites.
The question of what AGI actually is has never been settled. Researchers across academia, government, and industry hold genuinely different views. Some define AGI as a system that can perform any intellectual task a human can. Others require that it generalize across domains without task-specific training. Still others insist that consciousness or embodiment must be part of the picture. OpenAI's framework sidesteps this philosophical thicket by operationalizing the question: AGI becomes whatever the benchmarks say it is.
This is not a trivial move. Benchmark-setting in AI has historically shaped research priorities, funding flows, and public perception in ways that outlast the benchmarks themselves. When ImageNet defined computer vision progress in the 2010s, the entire field reorganized around it. Institutions that controlled the benchmark effectively controlled the narrative of what counted as a breakthrough. OpenAI, by launching its own cognitive framework and simultaneously seeding a Kaggle community to build evaluations within that framework, is attempting something similar at far higher stakes.
The Kaggle hackathon element is particularly worth examining. Crowdsourcing benchmark construction creates a veneer of democratic participation while the underlying framework, the conceptual architecture that determines what kinds of tasks even get evaluated, remains proprietary. Thousands of contributors may build the walls of a house whose floor plan they never approved.
There is a systems-level dynamic embedded in this arrangement that deserves explicit naming. When a lab both develops an AI system and controls the metrics by which that system's progress is judged, it creates a closed feedback loop with no external corrective mechanism. The lab can, consciously or not, design evaluations that its systems are already good at, declare progress, attract investment, and use that investment to build the next generation of systems, which then perform well on slightly updated versions of the same evaluations.
This is not a hypothetical risk. The history of AI benchmarks is littered with examples of systems that achieved superhuman performance on a specific test while failing basic reasoning tasks that any child could handle. The Winograd Schema Challenge, various reading comprehension datasets, and early versions of the GLUE benchmark all saw rapid saturation by models that turned out to be exploiting statistical patterns rather than demonstrating genuine understanding. Each time, the response was a new, harder benchmark, often designed by the same community whose models had just broken the previous one.
What makes OpenAI's framework different in scale, if not in kind, is the explicit framing around AGI itself. Once a company can credibly claim, by its own metrics, that it has achieved AGI, the legal, regulatory, and commercial consequences are enormous. OpenAI's own charter contains provisions that trigger structural changes upon AGI achievement. Investors, governments, and the public would respond to such a declaration in ways that could reshape the entire technology landscape within months.
The second-order effect worth watching is regulatory. Policymakers in the U.S. and Europe are already struggling to define AI risk thresholds in legislation. If a private lab's internal framework becomes the de facto reference point for what AGI means, regulators may find themselves writing rules around a definition they never chose and cannot independently verify. The benchmark becomes the law before anyone votes on it.
Science has always had a self-referential quality: researchers define problems, design experiments, and interpret results. But the best scientific communities build in adversarial review, replication requirements, and institutional separation between those who build and those who evaluate. What OpenAI is constructing is a measurement system for a technology that could restructure labor markets, national security, and democratic governance, and it is doing so largely in-house, with crowdsourced decoration around the edges.
The hackathon will produce interesting evaluations. Some may even be genuinely rigorous. But the more important question, which institution gets to declare that AGI has arrived and on what basis, remains unanswered. And the longer it stays that way, the more likely it is that the answer will simply be whoever built the benchmark first.
References
- Raji et al. (2021). AI and the Everything in the Whole Wide World Benchmark
- Bowman et al. (2021). What Will it Take to Fix Benchmarking in Natural Language Understanding?
- Marcus et al. (2019). Rebooting AI: Building Artificial Intelligence We Can Trust
- Birhane et al. (2022). The Values Encoded in Machine Learning Research
- OpenAI (2023). OpenAI Charter