PhAIL ranks top robotics foundation models on real hardware

Positronic Robotics evaluated four VLA models on bin-to-bin order picking. | Credit: Positronic Robotics

Positronic Robotics, which said it helps developers make robots work with artificial intelligence, has launched its “Physical AI Leaderboard,” or PhAIL. It’s an ongoing benchmark evaluating robotics foundation models on commercial tasks.

Founded in September 2025, Positronic said it has developed an open-source infrastructure to standardize and scale physical AI by bridging the gap between research foundation models and real-world robotic production. The Springfield, Mo.-based company’s system uses a unified Python toolkit for the entire robotics lifecycle and the PhAIL benchmark.

PhAIL evaluates models on physical robot setups performing commercially relevant operations. Positronic Robotics has started with bin-to-bin order picking, one of the most common tasks in logistics and industrial automation. In this task, items are transferred one by one from an inbound container to an outbound container.

The current evaluation rig uses a Franka Research 3 robot arm paired with a Robotiq 2F-85 gripper in a DROID-style configuration, a widely used and reproducible research platform.

PhAIL measures throughput and reliability

Physical AI has advanced rapidly in recent years, with foundation models capable of handling increasingly diverse manipulation tasks. But most benchmarks still rely on simulation or controlled laboratory conditions, and many public evaluations emphasize curated demonstration videos rather than sustained operation. For commercial deployment, two variables dominate: throughput and reliability.

PhAIL measures both directly. Every run is executed on real hardware, not in simulation. Model checkpoints are selected randomly and evaluated in blinded conditions. Every run is logged and published with synchronized video, robot telemetry, station metadata, and scoring artifacts.
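To make the idea of randomized, blinded evaluation concrete, here is a minimal Python sketch of one way such a schedule could be built. It is an illustration under stated assumptions, not PhAIL's actual procedure, which is defined in its white paper. The function name `blinded_schedule`, the checkpoint labels, and the number of runs per checkpoint are all hypothetical.

```python
import random
import secrets


def blinded_schedule(checkpoints, runs_per_checkpoint, seed=None):
    """Build a shuffled run order with opaque IDs (hypothetical helper).

    Returns (operator_view, key):
      operator_view: ordered list of opaque run IDs the operator sees.
      key: dict mapping each run ID to its checkpoint, kept sealed until scoring.
    """
    rng = random.Random(seed)
    # One slot per (checkpoint, repetition), then shuffle the execution order.
    slots = [c for c in checkpoints for _ in range(runs_per_checkpoint)]
    rng.shuffle(slots)

    key = {}
    for checkpoint in slots:
        run_id = secrets.token_hex(4)
        while run_id in key:  # regenerate on the rare ID collision
            run_id = secrets.token_hex(4)
        key[run_id] = checkpoint
    # dicts preserve insertion order, so this is the shuffled run order.
    return list(key), key


# Placeholder labels for illustration only.
checkpoints = ["model_a", "model_b", "model_c", "model_d"]
operator_view, key = blinded_schedule(checkpoints, runs_per_checkpoint=2, seed=0)
print(operator_view)  # opaque IDs only; no checkpoint names are exposed
```

The operator running the station works from `operator_view` alone, while `key` is consulted only after scoring, so the person assisting or judging a run cannot tell which model is under test.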

From these runs, PhAIL computes units per hour (UPH) and mean time between failures or assists (MTBF/A), the same metrics an operations manager would use to evaluate a deployment, rather than an academic “success rate.” The protocol is fully documented in the PhAIL white paper.
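As a rough illustration of the two metrics, the sketch below computes UPH and MTBF/A from per-run summaries using their conventional definitions: units moved divided by operating hours, and operating time divided by the number of failures plus assists. The `RunLog` fields and the sample numbers are made up for the example; PhAIL's exact scoring rules are those in its white paper and may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunLog:
    """Hypothetical per-run summary; field names are not PhAIL's schema."""
    duration_s: float   # operating time of the run, in seconds
    units_moved: int    # items transferred to the outbound container
    failures: int       # errors that stopped the task
    assists: int        # human interventions needed to continue


def units_per_hour(runs: list[RunLog]) -> float:
    """Total units moved divided by total operating hours."""
    total_hours = sum(r.duration_s for r in runs) / 3600.0
    if total_hours <= 0:
        raise ValueError("no operating time logged")
    return sum(r.units_moved for r in runs) / total_hours


def mtbf_a_minutes(runs: list[RunLog]) -> Optional[float]:
    """Operating minutes per failure-or-assist; None if no events occurred."""
    events = sum(r.failures + r.assists for r in runs)
    if events == 0:
        return None  # undefined: nothing failed during the observed time
    total_minutes = sum(r.duration_s for r in runs) / 60.0
    return total_minutes / events


# Made-up sample data: two 30-minute runs.
runs = [
    RunLog(duration_s=1800, units_moved=60, failures=1, assists=2),
    RunLog(duration_s=1800, units_moved=90, failures=0, assists=1),
]
print(units_per_hour(runs))   # 150 units over 1 hour -> 150.0
print(mtbf_a_minutes(runs))   # 60 minutes over 4 events -> 15.0
```

Unlike a per-episode success rate, both numbers depend on elapsed time, so a model that succeeds slowly or needs frequent help scores visibly worse.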

The Physical AI Leaderboard itself is hardware-agnostic. Positronic Robotics said it plans to add robot embodiments in Q2 2026 to reflect the diversity of real-world deployments. Bin-to-bin picking is just the starting point, it said. The benchmark’s goal is to measure how well AI models perform on repetitive, economically important operations that occur thousands of times per day in real facilities.

“We all dream about a robot that folds our laundry – but that’s a task that happens once a day. In factories and logistics, the same operation runs hundreds of times per shift, and most of those still aren’t solved,” said Sergey Arkhangelskiy, founder of Positronic Robotics. “Physical AI needs to prove itself there first, and PhAIL is how we measure whether it can.”

Positronic Robotics evaluates models

In the inaugural evaluations, four models were fine-tuned and tested: OpenPI 0.5 from Physical Intelligence, GR00T from NVIDIA, SmolVLA from HuggingFace/LeRobot, and ACT from LeRobot – as well as teleoperated and human baselines. The results show a measurable gap between current foundation models and human-level performance in both throughput and reliability on commercial picking tasks.

Positronic Robotics described it as calibration, a clear baseline that allows progress to be measured consistently over time. As new models are released, they can be evaluated under the same protocol, creating a continuous, comparable record of performance, it said.

The company asserted that PhAIL targets three structural issues in the physical AI ecosystem:

  • Lack of objective measurement of commercial readiness. Most public metrics don’t reflect factory-floor constraints.
  • Unclear return-on-investment (ROI) signals for operators. Success rates don’t translate directly into deployment decisions.
  • A broken feedback loop for model developers. Without standardized, auditable benchmarks, it’s difficult to iterate toward real-world reliability.

By publishing synchronized video, logs, firmware versions, hardware configuration, and scoring artifacts for every run, PhAIL emphasizes auditability and reproducibility, said Positronic Robotics.

It launched PhAIL as a governed consortium rather than as a proprietary product. Nebius, which provides an AI cloud foundation for the robotics lifecycle, has joined as a founding consortium partner. Toloka participates as a data partner supporting evaluation processes. Positronic Robotics noted that the benchmark is intended as a shared industry yardstick, not as a competitive marketing vehicle.

“Scaling physical AI requires a clear, shared standard for production readiness,” said Evan Helda, head of physical AI at Nebius. “With no established blueprint for deploying these systems at scale, the PhAIL Leaderboard delivers an important benchmark grounded in real-world performance data, bringing greater transparency to what’s ready for deployment.”

“Nebius is committed to accelerating physical AI development across the ecosystem,” he added. “Through our participation in the PhAIL consortium, we’re proud to help advance the next phase of commercial robotics alongside industry partners.”

The PhAIL dataset and fine-tuning scripts are publicly available. Model developers can fine-tune their systems and submit checkpoints for evaluation. Hardware vendors can validate model performance across embodiments. Operators can review published artifacts directly.





The post PhAIL ranks top robotics foundation models on real hardware appeared first on The Robot Report.