A new AI coding challenge, the K Prize, has just published its inaugural results, and they are surprisingly low, setting a new and challenging bar for AI-powered software engineers. Launched by Databricks and Perplexity co-founder Andy Konwinski, the K Prize aims to provide a “contamination-free” evaluation of AI models’ ability to solve real-world programming problems, and its first winner answered just 7.5% of the test correctly.
The K Prize: A Hard Benchmark by Design
On Wednesday, July 23, 2025, the nonprofit Laude Institute announced the first winner of the K Prize: Brazilian prompt engineer Eduardo Rocha de Andrade, who received $50,000 for his performance. The most striking revelation, however, was his winning score: just 7.5% of problems solved correctly.
Konwinski expressed satisfaction with the difficulty of the benchmark. “We’re glad we built a benchmark that is actually hard,” he stated, emphasising that “benchmarks should be hard if they’re going to matter.” He also highlighted that the K Prize operates offline with limited compute resources, a design choice intended to favour smaller and open models, thereby levelling the playing field against larger, proprietary models from major AI labs. Konwinski has publicly pledged $1 million to the first open-source model that scores above 90% on the test.
Contamination-Free Testing and Discrepancies with SWE-Bench
Similar to the well-known SWE-Bench system, the K Prize evaluates models against flagged GitHub issues, simulating real-world programming problems. However, unlike SWE-Bench, which uses a fixed set of problems that models can inadvertently train on, the K Prize is designed to be “contamination-free.” For its first round, models had to be submitted by March 12, and the organisers then built the test using only GitHub issues flagged after that date, ensuring that the models could not have trained on the specific test problems.
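To make the anti-contamination idea concrete, here is a minimal, purely illustrative sketch of that kind of date cutoff in Python: it pulls open issues from a repository via the public GitHub REST API and keeps only those opened after a submission deadline. The repository name, label filter, and use of the `requests` library are assumptions for the example; the article does not describe the K Prize organisers’ actual selection tooling or criteria.

```python
import requests
from datetime import datetime, timezone

# Hypothetical cutoff: only issues opened after the model-submission deadline
# may enter the test set, so submitted models cannot have trained on them.
CUTOFF = datetime(2025, 3, 12, tzinfo=timezone.utc)  # illustrative date, per the article

def issues_after_cutoff(repo: str, label: str = "bug") -> list[dict]:
    """Return open issues on `repo` created strictly after the cutoff date.

    `repo` is an "owner/name" string; `label` is a placeholder filter --
    the real K Prize selection criteria are not public in this article.
    """
    url = f"https://api.github.com/repos/{repo}/issues"
    params = {
        "labels": label,
        "state": "open",
        "since": CUTOFF.isoformat(),  # GitHub filters on last update; creation time re-checked below
        "per_page": 100,
    }
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    fresh = []
    for issue in resp.json():
        created = datetime.fromisoformat(issue["created_at"].replace("Z", "+00:00"))
        if created > CUTOFF and "pull_request" not in issue:  # skip pull requests, keep genuine issues
            fresh.append(issue)
    return fresh

# Example: collect candidate problems from a sample repository
candidates = issues_after_cutoff("psf/requests")
print(f"{len(candidates)} issues opened after the cutoff")
```

The key point the sketch captures is simply the cutoff: anything filed before the model-submission deadline is excluded, so it cannot have leaked into a model’s training data.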
The 7.5% top score on the K Prize stands in stark contrast to SWE-Bench’s reported top scores of 75% on its easier “Verified” test and 34% on its harder “Full” test. Konwinski acknowledges the disparity but is unsure whether it is due solely to contamination on SWE-Bench or to the inherent challenge of collecting truly new issues from GitHub. He anticipates that as the K Prize runs more rounds and participants adapt to its dynamic nature, the project will provide clearer answers to this question.
A Reality Check for AI Development
The surprisingly low scores from the K Prize are seen by many as a necessary “reality check” for an AI industry facing a growing evaluation problem as existing benchmarks become too easy. Princeton researcher Sayash Kapoor, who has proposed similar ideas, emphasises the importance of such new tests for determining whether high scores on other benchmarks are due to contamination or human intervention.
For Konwinski, the K Prize is more than just a better benchmark; it’s an open challenge to the entire industry. He argues against the prevailing hype surrounding AI’s capabilities. “If you listen to the hype, it’s like we should be seeing AI doctors and AI lawyers and AI software engineers, and that’s just not true,” Konwinski stated. “If we can’t even get more than 10% on a contamination-free SWE-Bench, that’s the reality check for me.” The results underscore that despite the widespread availability of AI coding tools, truly autonomous and highly proficient AI software engineers remain a distant goal, highlighting the need for continued, rigorous testing and a focus on real-world applicability.