LLM Coding Benchmarks

Study finds newer LLMs introduce more severe coding bugs despite higher benchmark scores

A new report today from code quality testing startup SonarSource SA is warning that while the latest large language models may be getting better at passing coding benchmarks, at the same time they are ...

SiliconANGLE

Researchers develop new LiveBench benchmark for measuring AI models’ response accuracy

A group of researchers has developed a new benchmark, dubbed LiveBench, to ease the task of evaluating large language models’ question-answering capabilities. The researchers released the benchmark on ...

VentureBeat

Nvidia, Intel claim new LLM training speed records in new MLPerf 3.1 benchmark

Training AI models is a whole lot faster in 2023, according to the results from the MLPerf Training 3.1 benchmark released today. The pace of innovation in the generative AI space is breathtaking to ...

Forbes

How Open Benchmarking Ensures AI Development Is Reliable And Safe

Artificial intelligence (AI) is essential to our daily lives. It influences everything from the way we drive and secure our homes to how we manage our money and receive medical care. However, the rush ...

VentureBeat

Beyond generic benchmarks: How Yourbench lets enterprises evaluate AI models against actual data

Every AI model release inevitably includes charts touting how it outperformed its competitors in this benchmark test or that evaluation matrix. However, these benchmarks often test for general ...

Virtualization Review

AI's Heavy Hitters: Best Models for Every Task

In today's crowded AI landscape, organizations looking to leverage AI models are faced with an overwhelming number of options. But how to choose? An obvious starting point is all the various AI ...

Morningstar

Diffblue’s Latest Innovations in Unit Test Generation Deliver 20x Productivity Advantage Versus AI Coding Assistants

New benchmark study confirms Diffblue’s advantages over LLM coding assistants realized through its reinforcement learning-powered agentic capabilities. Diffblue today announced the release of the next ...

Ars Technica

New Claude 4 AI model refactored code for 7 hours straight

On Thursday, Anthropic released Claude Opus 4 and Claude Sonnet 4, marking the company’s return to larger model releases after primarily focusing on mid-range Sonnet variants since June of last year.

Geeky Gadgets

Free Qwen 3 Coder AI Coding Assistant: Insanely Powerful and Open Source

What if coding could be faster, smarter, and more accessible than ever before? Enter Qwen 3 Coder, a new open source large language model (LLM) developed by Alibaba. With a staggering 480 billion ...

Hosted on MSN

Diffblue’s Latest Innovations in Unit Test Generation Deliver 20x Productivity Advantage Versus AI Coding Assistants

Diffblue today announced the release of the next generation of its flagship product, Diffblue Cover, to address the unmet need for automated, high quality unit test generation at scale. Focused on ...
