A new AI coding challenge, the K Prize, has just published its inaugural results, and they are surprisingly low, setting a new and challenging bar for AI-powered software engineers. Launched by Databricks and Perplexity co-founder Andy Konwinski, the K Prize aims to provide a “contamination-free” evaluation of AI models’ ability to solve real-world programming problems, and its first winner answered just 7.5% of the test correctly.
The K Prize: A Hard Benchmark by Design
On Wednesday, July 23, 2025, the nonprofit Laude Institute announced the first winner of the K Prize: Brazilian prompt engineer Eduardo Rocha de Andrade, who received $50,000 for his performance. The most striking revelation, however, was his winning score: just 7.5% of problems solved correctly.
Konwinski expressed satisfaction with the difficulty of the benchmark. “We’re glad we built a benchmark that is actually hard,” he stated, emphasising that “benchmarks should be hard if they’re going to matter.” He also highlighted that the K Prize operates offline with limited compute resources, a design choice intended to favour smaller and open models, thereby levelling the playing field against larger, proprietary models from major AI labs. Konwinski has publicly pledged $1 million to the first open-source model that scores above 90% on the test.
Contamination-Free Testing and Discrepancies with SWE-Bench
Similar to the well-known SWE-Bench system, the K Prize evaluates models against flagged GitHub issues, simulating real-world programming problems. However, unlike SWE-Bench, which uses a fixed set of problems that models can inadvertently train on, the K Prize is designed to be “contamination-free.” For its first round, models had to be submitted by March 12, and the organisers then built the test using only GitHub issues flagged after that date, ensuring that the models could not have trained on the specific test problems.
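To make the anti-contamination idea concrete, here is a minimal, purely illustrative sketch of that kind of date cutoff in Python: it pulls open issues from a repository via the public GitHub REST API and keeps only those opened after a submission deadline. The repository name, label filter, and use of the `requests` library are assumptions for the example; the article does not describe the K Prize organisers’ actual selection tooling or criteria.

```python
import requests
from datetime import datetime, timezone

# Hypothetical cutoff: only issues opened after the model-submission deadline
# may enter the test set, so submitted models cannot have trained on them.
CUTOFF = datetime(2025, 3, 12, tzinfo=timezone.utc)  # illustrative date, per the article

def issues_after_cutoff(repo: str, label: str = "bug") -> list[dict]:
    """Return open issues on `repo` created strictly after the cutoff date.

    `repo` is an "owner/name" string; `label` is a placeholder filter --
    the real K Prize selection criteria are not public in this article.
    """
    url = f"https://api.github.com/repos/{repo}/issues"
    params = {
        "labels": label,
        "state": "open",
        "since": CUTOFF.isoformat(),  # GitHub filters on last update; creation time re-checked below
        "per_page": 100,
    }
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    fresh = []
    for issue in resp.json():
        created = datetime.fromisoformat(issue["created_at"].replace("Z", "+00:00"))
        if created > CUTOFF and "pull_request" not in issue:  # skip pull requests, keep genuine issues
            fresh.append(issue)
    return fresh

# Example: collect candidate problems from a sample repository
candidates = issues_after_cutoff("psf/requests")
print(f"{len(candidates)} issues opened after the cutoff")
```

The key point the sketch captures is simply the cutoff: anything filed before the model-submission deadline is excluded, so it cannot have leaked into a model’s training data.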
The 7.5% top score on the K Prize stands in stark contrast to SWE-Bench’s reported top scores of 75% on its easier “Verified” test and 34% on its harder “Full” test. Konwinski acknowledges the disparity but is unsure whether it is due solely to contamination on SWE-Bench or to the inherent challenge of collecting truly new issues from GitHub. He anticipates that as the K Prize runs more rounds and participants adapt to its dynamic nature, the project will provide clearer answers to this question.
A Reality Check for AI Development
The surprisingly low scores from the K Prize are seen by many as a necessary “reality check” for an AI industry facing a growing evaluation problem as existing benchmarks become too easy. Princeton researcher Sayash Kapoor, who has proposed similar ideas, emphasises the importance of such new tests for determining whether high scores on other benchmarks are due to contamination or human intervention.
For Konwinski, the K Prize is more than just a better benchmark; it’s an open challenge to the entire industry. He argues against the prevailing hype surrounding AI’s capabilities. “If you listen to the hype, it’s like we should be seeing AI doctors and AI lawyers and AI software engineers, and that’s just not true,” Konwinski stated. “If we can’t even get more than 10% on a contamination-free SWE-Bench, that’s the reality check for me.” The results underscore that despite the widespread availability of AI coding tools, truly autonomous and highly proficient AI software engineers remain a distant goal, highlighting the need for continued, rigorous testing and a focus on real-world applicability.