Project Page
Fail2Drive is the first CARLA benchmark designed to test closed-loop generalization on truly unseen long-tail scenarios. By pairing each shifted route with an in-distribution counterpart, it exposes substantial hidden failures in current state-of-the-art driving models.
As autonomous driving moves closer to real-world deployment, the focus is shifting toward rare, long-tail situations that are not represented in the training distribution. This is exactly where generalization matters most. Simulators like CARLA are ideal for evaluating these cases in a safe and cost-effective setting. However, current CARLA benchmarks often fail to leverage this potential, as they evaluate on the same scenarios used during training. Fail2Drive addresses this limitation by introducing truly unseen scenarios and establishing a concrete, reproducible basis for evaluating generalization, exposing widespread overfitting in recent CARLA models and making generalization quantifiable.
Public benchmark with 200 routes serves as a fair, model independent comparison and enables systematic work on generalization.
Evaluation of seven models reveals widespread shortcut learning, brittle visual cues, and missing fallback behavior. Inspiring avenues for future work.
Extensible scenarios, assets, and expert policy enable community-driven customization.
All generalization scenarios are paired with an in-distribution baseline to isolate generalization performance.
CARLA uses the same scenarios for training and testing. Fail2Drive adds unseen generalization scenarios to measure the true generalization potential of current models.





We analyze seven recent driving models, revealing significant overfitting and weak generalization to unseen long-tail scenarios.
Fail2Drive makes generalization failures measurable. The quantitative metrics capture the gap between in-distribution and unseen scenarios and show that breakdowns are widespread across recent state-of-the-art models.
Fail2Drive maintains an official leaderboard of driving methods to ensure fair and consistent comparison across approaches.
We encourage users to share their results in the leaderboard repository:
Submit results via GitHub →
The Fail2Drive toolbox provides tools for researchers and practitioners to create and share custom long-tail scenarios. The toolbox is designed to make scenario authoring lightweight, reproducible, and easy to integrate into evaluation workflows.
Drag an object into the scene, place it anywhere, and rotate it. Drag objects outside the scene to delete them.
Construction cone
Street barrier
Food cart
Trash bin
Plastic chair
Fail2Drive is designed to grow with the community. The toolbox empowers researchers to easily create custom routes and scenarios tailored to their specific needs, as well as expand the existing benchmark. Together, Fail2Drive provides a unified and rigorous testing ground for evaluating generalization in CARLA. We welcome code contributions and offer a scenario hub for sharing and discovering interesting scenarios.
@misc{gerstenecker2026fail2drivebenchmarkingclosedloopdriving,
title={Fail2Drive: Benchmarking Closed-Loop Driving Generalization},
author={Simon Gerstenecker and Andreas Geiger and Katrin Renz},
year={2026},
eprint={2604.08535},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2604.08535},
}