Claude Plays Pokémon
It has been a busy week for frontier AI labs, but perhaps a busier one for Claude. Since Anthropic launched Claude 3.7, formalizing an ever more fine-grained naming convention, the chatbot has been streaming nearly continuously on Twitch in an effort to beat the game Pokémon Red. In homage to “Twitch Plays Pokémon,” a 2014 experiment in crowdsourcing commands over a livestream, Claude is slowly and methodically reasoning its way through each movement and command, using emulator software to control the game.
Despite many failed attempts, Manifold thinks Claude has a shot, giving the Anthropic model almost 50% odds of beating the game this year, and far higher odds if you give Claude until the end of next year.
You can also bet on specific events during Claude’s stream. As I write this, Claude has spent nearly 24 hours stuck in a place called “Mt. Moon”, where the Twitch chat is imploring it to figure out how to use a ladder in order to traverse the dungeon. Yesterday, viewers were in fits when Claude named its in-game rival Waclaud.
Progress on Agency Benchmarks
The popularity of this type of demo makes it likely that other AI models will get their own video game streams. This bodes well for a long-standing market on whether an AI will be able to successfully play randomly selected computer games without practice by the start of 2028. It is an insightful question, and perhaps a very practical benchmark: navigating the complex world of a video game is a useful halfway step between problem-solving in contained, rules-based settings, such as a game of chess or a logic puzzle, and operating agentically in the real world. The market currently stands at 62%, although allowing the models to train through self-play bumps that up to 77%.
Claude is able to perform quite well in the world of Pokémon Red thanks to extended thinking (reasoning), a knowledge base it can write to and read from, and some improved computer vision tools. It’s not hard to imagine that a reasoning model such as GPT-5 or Claude 4 will soon be able to operate agentically, not just inside a video game, but inside a robot in the real world.
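For the curious, the ingredients listed above (a reasoning step, a read/write knowledge base, and a vision tool) can be pictured as a simple loop. The Python sketch below is a toy illustration only, not the harness used on the stream: every name in it (KnowledgeBase, describe_screen, choose_button, run) is hypothetical, and the "vision" and "reasoning" steps are hard-coded stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    """Hypothetical notes store the agent can write to and read back later."""
    notes: dict[str, str] = field(default_factory=dict)

    def write(self, key: str, value: str) -> None:
        self.notes[key] = value

    def read(self, key: str, default: str = "") -> str:
        return self.notes.get(key, default)


def describe_screen(step: int) -> str:
    """Stand-in for a vision tool: returns a canned description of the screen."""
    return "ladder ahead" if step % 2 == 0 else "cave wall"


def choose_button(observation: str, kb: KnowledgeBase) -> str:
    """Stand-in for the reasoning step: reuse a stored note, else decide and save."""
    remembered = kb.read(observation)
    if remembered:
        return remembered  # a lesson learned earlier is reused as-is
    button = "UP" if observation == "ladder ahead" else "LEFT"
    kb.write(observation, button)  # record the decision for next time
    return button


def run(steps: int) -> tuple[list[str], KnowledgeBase]:
    """Observe, decide, act: repeat for a fixed number of steps."""
    kb = KnowledgeBase()
    pressed: list[str] = []
    for step in range(steps):
        observation = describe_screen(step)
        pressed.append(choose_button(observation, kb))
    return pressed, kb


if __name__ == "__main__":
    buttons, kb = run(4)
    print(buttons)
    print(kb.notes)
```

In a real setup, the two stand-in functions would presumably be replaced by calls to a vision tool and to the model itself; the loop and the notes store are the only parts this sketch is meant to convey.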
While the next month seems too soon for a ChatGPT-level breakthrough in robotics, forecasters seem to think it’s more plausible by the end of the year.
Manifold users estimate that a robot will be able to pass the “Tea Test” by 2027, or alternatively, Steve Wozniak’s “Coffee Test”. In the “Coffee Test”…
a machine is required to enter an average American home and figure out how to make coffee: find the coffee machine, find the coffee, add water, find a mug, and brew the coffee by pushing the proper buttons
If beating a game of Pokémon is a good benchmark for the general intelligence capabilities of an AI, the Coffee Test is probably one step further along that axis, and perhaps a couple of years shy of Manifold’s AGI estimate of ~2030.
Manifold users are getting creative in their ways of testing AI progress, with one enterprising user pitting her left hand against a robot hand, although the left hand (YES) appears to have the edge over the robot hand (NO) for the moment.
The continued, compounding progress seems to be causing forecasters to lower their odds on an “AI winter”, now down from 30% at the start of the year to 15%, just a couple months into 2025.
This comes despite an underwhelming GPT-4.5 release. Bettors were quite confident that GPT-4.5 would be announced today, reacting to accidental leaks.
Indeed it was, on an OpenAI livestream, to boos from the audience on Twitter. Odds on GPT-4.5 topping the LMArena leaderboard dropped precipitously from 80% during the livestream. Despite this, some commentators think that GPT-4.5 is quite good at creative writing and might have some agentic improvements.
A Human Edge at the Oscars, For Now
Currently, the use of AI seems to hurt one’s odds at the Oscars. A recent controversy over The Brutalist’s use of AI may be behind its falling odds of taking Best Picture next week. Also, note the steadily rising odds for “Conclave”, perhaps tied to rising interest in a possible upcoming real-world conclave.
Robots, however, remain likable, with The Wild Robot holding an edge over Flow in the Best Animated Feature Film category.
You can bet on every Oscars category on Manifold, but we might have to wait until 2028 or (far) later if we want to see a fully AI-generated film represented among the picks. One of the most traded markets in Manifold history continues to hover in the low 40s on the odds that Hollywood-level films will be AI-generated by the start of 2028.
Happy Forecasting!
-Above the Fold