V-2 Rockets, Teslas, and Iran: The Precision of AI We Aren't Measuring Will Save Lives
This is the follow-up to “Measuring the Machines that Kill.” That piece broke down where Claude actually sits in the Maven Smart System kill chain and estimated the incremental accuracy contribution of the large language model (LLM) at roughly 5-8%. This one takes up the question I left open: what if using an LLM actually saves lives?
168 children died at the Shajareh Tayyebeh school in Minab. Within days, the global conversation was about whether an AI model was responsible. Did Claude pick the target? Did the algorithm fail? Was this what happens when you let machines into the kill chain?
Nobody asked whether an AI is what prevents the next Minab.
I know that arguing for more AI in targeting two weeks after 168 children died sounds like the wrong conversation to be having. But the alternative, refusing to measure whether the technology prevents the next one, has a civilian casualty count too. It’s the count less frequently tracked: the people who die because the intelligence was stale, the analyst was overloaded, and the system did what it has always done, processed too much data too slowly with too few people checking too many targets.
And in many cases, humans will likely make more mistakes than AI.
-----------------
We’ve watched this exact pattern play out on American roads.
Tesla’s Full Self-Driving (FSD) system and Waymo’s autonomous fleet are both safer than human drivers. The specifics vary depending on who’s counting. Phil Koopman at Carnegie Mellon, probably the most rigorous outside critic of Tesla’s data, puts the real FSD improvement at about 1.8x once you control for fleet age and road type. [11] Waymo’s peer-reviewed data across 56.7 million fully autonomous miles shows an 85% reduction in injury-causing crashes versus adjusted human benchmarks. [12] Swiss Re, which prices catastrophic risk for a living, found 92% fewer bodily injury claims across 25 million Waymo miles. [13] The methodological debates are real, but nobody seriously argues the technology is less safe than a human behind the wheel.
When a human driver runs a red light and hits a Tesla operating on FSD, the headlines are about the Tesla. Not the human who caused the accident. The technology gets blamed for being present at a failure it didn’t create, for a failure that would have been more likely without it. If both cars had been autonomous, the accident probably doesn’t happen. Meanwhile, 38,000 Americans die in car crashes every year, almost all from human error. Nobody proposes banning human drivers.
Minab is structurally the same. The emerging evidence points to stale intelligence, a target nomination built on data that predated the building’s conversion from military compound to civilian girls’ school. That is a failure of the human intelligence pipeline, not the AI.
But because AI was present in the broader system, the conversation became: did the AI cause this?
The better question is whether more AI applied to the problem, specifically to monitoring and validating targets against current intelligence, would have flagged the discrepancy before the strike was authorized. If the system had been doing what language models are actually good at, continuously cross-referencing facility data against open-source intelligence (OSINT), commercial imagery timestamps, and local records, the building probably gets flagged as civilian.
The technology gets blamed for being in the room. The failure that killed those children is the kind of failure better AI would have caught.
-----------------
I can already hear the counterargument: if AI was in the system and didn’t catch the stale data, doesn’t that prove it failed?
No. It proves the AI wasn’t being used for this. As I laid out in “Measuring the Machines,” Claude was integrated into Maven as a natural language interface and unstructured data processor. It helps analysts query the system and synthesize intelligence faster. It was not tasked with continuous facility validation, staleness detection, or cross-referencing target nominations against current OSINT. I’m not describing a failure of the AI that was deployed. I’m describing an AI application that should exist and doesn’t yet.
The numbers make the case for why it should.
In 1943, the average circular error probable (CEP) for B-17 high-altitude bombing was roughly 1,200 feet. Only about 20% of Eighth Air Force bombs fell within 1,000 feet of the aim point. Destroying a single power plant required 108 bombers, 648 bombs, and 1,080 airmen in the air. That wasn’t precision. It was brute-force probability applied to geography. Over 300,000 German civilians were killed. 7.5 million were made homeless.
The V-2 was worse. In 1946, R.D. Clarke showed that V-1 flying bomb impacts on London followed a Poisson distribution: the mathematical signature of pure randomness. [7] The V-2’s effective CEP against London was roughly 12 kilometers. Two people died per rocket on average. The weapon cost as much as the Manhattan Project and achieved less than conventional area bombing.
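For readers who want to see what “followed a Poisson distribution” means in practice, here is a minimal sketch in Python. The grid counts are the figures as commonly reproduced from Clarke’s 1946 note (576 quarter-square-kilometer squares of south London, 537 flying-bomb impacts). They do not appear in this article, so treat them as an assumption to check against [7]. The function name is mine, not Clarke’s.

```python
import math

# Counts as commonly reproduced from Clarke (1946): number of grid squares
# that received 0, 1, 2, 3, 4, and 5-or-more flying-bomb impacts.
# Assumption: these figures come from outside this article; verify against [7].
OBSERVED = [229, 211, 93, 35, 7, 1]
TOTAL_SQUARES = sum(OBSERVED)
TOTAL_IMPACTS = 537


def poisson_expected(n_squares, n_impacts, max_k):
    """Expected squares with k = 0..max_k-1 hits, plus a final 'max_k or more' bin."""
    lam = n_impacts / n_squares  # mean impacts per square
    probs = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(max_k)]
    probs.append(1.0 - sum(probs))  # tail probability for the last bin
    return [n_squares * p for p in probs]


expected = poisson_expected(TOTAL_SQUARES, TOTAL_IMPACTS, max_k=5)
chi_square = sum((o - e) ** 2 / e for o, e in zip(OBSERVED, expected))

for k, (o, e) in enumerate(zip(OBSERVED, expected)):
    label = f"{k}+" if k == 5 else str(k)
    print(f"{label:>2} hits: observed {o:3d}, expected {e:6.1f}")
print(f"chi-square = {chi_square:.2f}")
```

If the impacts had been aimed, some squares would have collected far more hits than chance predicts and others far fewer. The observed counts instead track the Poisson expectation closely, which is the pattern you get when nothing is steering where the weapon lands.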
The compression since then, for both bombs and missiles:
Both categories started wildly inaccurate and converged to single-digit meters through completely different engineering. Bombs got there through laser guidance kits bolted onto iron bombs. Missiles got there through inertial navigation, terrain matching, and GPS. The paths were different. The destination is the same: the guidance problem is effectively solved. [8] [9] [10]
In the Gulf War, precision-guided munitions (PGMs) made up 9% of ordnance dropped but accounted for 75% of successful hits. A single sortie doing what used to take hundreds of aircraft and thousands of tons.
The civilian casualty ratios across conflicts tell a messier story, because the ratio depends on the nature of the war, the density of the battlespace, and whether combatants embed among civilians. But the broad pattern across conflicts where the U.S. had operational control:
Korea’s high ratio reflects massive area bombardment of North Korean cities with unguided weapons. Vietnam’s decline tracks with the introduction of the first laser-guided bombs (LGBs). Bosnia and Afghanistan reflect increasingly precise munitions combined with intelligence-driven targeting and formal collateral damage estimation (CDE) processes. The trend is not monotonic and the variables are not controlled, but the direction across a half-century of improving precision and intelligence is real. [1] [2] [3]
The CEP problem is now largely solved. A Joint Direct Attack Munition (JDAM) hits within meters, a Small Diameter Bomb (SDB) within feet. What’s left is the intelligence that determines what the weapon is aimed at.
Larry Lewis at the Center for Naval Analyses (CNA) spent a decade studying how civilian casualties actually happen. His Joint Civilian Casualty Study for General Petraeus found the same drivers over and over: misidentification, outdated intelligence, incomplete pattern-of-life data, bad collateral damage estimates. Not malice. Not carelessness. Overload. Too many sources, not enough time, not enough analysts, and a targeting tempo that compresses decisions from hours to minutes. [4] [5]
Consider what a staleness detection pass would look like on the Minab target. The system queries the targeting package for the facility adjacent to the Islamic Revolutionary Guard Corps (IRGC) naval compound. The military database entry says: IRGC compound, last updated 2014. The LLM cross-references against commercial satellite imagery tagged March 2025, which shows playground equipment and a walled-off schoolyard. It pulls a Fars News article from 2017 reporting the school’s opening ceremony. It finds a non-governmental organization (NGO) education report listing enrollment figures. Three independent sources contradict the military classification. The conflict gets flagged. The package gets kicked back for human review before the strike is authorized.
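The comparison rule at the end of that pass can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of anything deployed in Maven: every class, field, and function name is hypothetical, the day and month values are placeholders, and the date on the NGO report is invented because the scenario above does not give one. The hard part, reading imagery and articles and deciding what they say about a site, is the work an LLM would do upstream; the sketch only shows what happens once those readings exist.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SourceObservation:
    source: str          # where the reading came from
    observed_on: date    # the date the source describes
    classification: str  # what the source says the site is


@dataclass(frozen=True)
class FacilityRecord:
    name: str
    classification: str
    last_updated: date


def staleness_check(record, observations, min_sources=2):
    """Return (needs_review, contradicting) for a database record.

    The record is kicked back for human review when at least `min_sources`
    distinct sources, each newer than the record's last update, disagree
    with its classification. The threshold is an illustrative assumption.
    """
    contradicting = [
        obs for obs in observations
        if obs.observed_on > record.last_updated
        and obs.classification != record.classification
    ]
    distinct_sources = {obs.source for obs in contradicting}
    return len(distinct_sources) >= min_sources, contradicting


# Hypothetical inputs mirroring the scenario; exact dates are placeholders.
record = FacilityRecord("example facility", "military", date(2014, 1, 1))
observations = [
    SourceObservation("commercial imagery", date(2025, 3, 1), "civilian"),
    SourceObservation("news report", date(2017, 1, 1), "civilian"),
    SourceObservation("NGO education report", date(2020, 1, 1), "civilian"),
]

needs_review, conflicts = staleness_check(record, observations)
print(needs_review, [c.source for c in conflicts])
```

The design choice worth noting is that the check never overwrites the record. It only refuses to let a stale classification pass silently, which keeps the decision with a human reviewer.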
This isn’t a hypothetical technology. It is retrieval-augmented generation (RAG) or, more recently, iterative reasoning agents (IRA), run against a multi-source intelligence corpus. These models do this every day in corporate environments. Nobody has built it for targeting validation, and nobody has mandated it.
There is a structural obstacle worth naming. Palantir’s ontology, the data model underlying the Maven Smart System, is a governed, human-curated system. Object types, property definitions, and the action types that modify them are all configured by administrators through a controlled workflow. Changing a facility’s classification from “military” to “civilian” is technically possible, but it requires someone to have built the action type, configured the permissions, and triggered the change.
The ontology’s strength, its auditability and governance, is also its limitation for this failure mode: a facility designation doesn’t update itself when reality changes on the ground. If the underlying data source carries a stale classification and nobody initiates the update through the governed pipeline, the old designation persists and everything downstream treats it as current. An LLM operating outside the ontology’s rigid classification structure, pulling from unstructured and open sources that the ontology doesn’t ingest, could surface the contradiction that the structured system never will.
In “Measuring the Machines,” I named a risk I called the Fluency Trap: the danger that an LLM synthesizes incomplete data into a confident answer, papering over gaps that a structured database would leave visible. That risk is real and it needs to be engineered against.
The Staleness Trap is its inverse, and it’s the one that more plausibly killed those children. A system that processed old data without flagging that it was old, because nobody had time to check and the workflow didn’t require it. An LLM built to refuse confident answers when the underlying intelligence is stale or contradictory, one that surfaces the conflict instead of smoothing over it, is a direct countermeasure to the failure that appears to have produced Minab.
-----------------
If the 5-8% accuracy improvement I estimated in the first piece is roughly correct, and concentrated at the Filter and Identify steps where civilian protection decisions are made, the math isn’t complicated. Assume a campaign involving a thousand targeting decisions. Assume a baseline civilian casualty incident rate of 3-5% per decision, consistent with recent conflicts involving PGMs and formal CDE processes. That is 30-50 incidents. If a 5-8% relative improvement in targeting accuracy at the Filter step translates one-for-one into fewer incidents, it means roughly 2-4 fewer civilian casualty incidents per thousand decisions. In a campaign the scale of the Iran strikes, that is plausibly dozens of lives.
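The arithmetic behind that range, as a short Python sketch. The function name is mine, and the key assumption is stated in the docstring: the accuracy improvement is treated as a one-for-one relative reduction in the incident rate, which is exactly the part that needs empirical validation.

```python
def incidents_avoided(decisions, baseline_rate, relative_improvement):
    """Incidents avoided, assuming the improvement maps one-for-one onto the incident rate."""
    return decisions * baseline_rate * relative_improvement


DECISIONS = 1_000
low = incidents_avoided(DECISIONS, 0.03, 0.05)   # lowest rate, smallest improvement
high = incidents_avoided(DECISIONS, 0.05, 0.08)  # highest rate, largest improvement
print(f"{low:.1f} to {high:.1f} fewer incidents per {DECISIONS} decisions")
```

The two calls are the corners of the range: 1.5 at the conservative end and 4.0 at the optimistic end, which rounds to the 2-4 figure.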
That estimate is rough and it needs empirical validation. But nobody is doing the work, and that’s not caution. It’s avoidance.
There is a counterpoint worth taking seriously. Precision has historically made strikes more politically viable, not less frequent. If AI reduces the expected civilian cost per strike, decision-makers may authorize more strikes, and the total civilian toll could stay flat or even increase. This is a real dynamic in the precision munitions literature. [6] But the answer is not to suppress the technology. It is to measure the total effect, per-strike rates and total volume together, and hold decision-makers accountable for both. Refusing to improve per-strike precision because politicians might authorize more strikes is an argument for keeping the tools blunt so the political cost stays high, paid in other people’s lives.
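The counterpoint has a simple break-even form, sketched here in Python. It assumes total incidents equal the per-strike incident rate times the number of strikes, and nothing else; the function name is illustrative.

```python
def break_even_volume_increase(relative_rate_reduction):
    """Fractional rise in strike volume that exactly cancels a per-strike rate reduction.

    Assumes total incidents = per-strike rate * number of strikes.
    """
    return 1.0 / (1.0 - relative_rate_reduction) - 1.0


for r in (0.05, 0.08):
    increase = break_even_volume_increase(r)
    print(f"rate down {r:.0%} -> total stays flat if volume rises {increase:.1%}")
```

Under that assumption, a 5-8% per-strike improvement is erased by a rise in strike volume of under 9%. That is why the per-strike rate and the total volume have to be measured together.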
The question is whether anyone is building the staleness detection system and measuring the result.
Take a set of historical targeting packages. Run them with and without LLM assistance. Compare the collateral damage estimates. Compare staleness detection rates. Compare false positive and false negative rates on civilian presence. CNA or the RAND Corporation could do this independently. Congress could mandate it. Anthropic, Palantir, and the Department of Defense (DoD) could jointly commission it.
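The scoring side of such a study is not exotic. A minimal Python sketch follows, with toy labels that are made up purely for illustration and a hypothetical function name; a real study would use adjudicated ground truth from historical packages.

```python
def error_rates(predicted, actual):
    """False positive and false negative rates for 'civilians present' flags.

    predicted, actual: equal-length lists of booleans (True = civilians present).
    """
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    negatives = sum(not a for a in actual)
    positives = sum(actual)
    return {
        "false_positive_rate": fp / negatives if negatives else 0.0,
        "false_negative_rate": fn / positives if positives else 0.0,
    }


# Toy data, invented for illustration only: one entry per historical package.
ground_truth = [True, True, True, False, False, False, False, True]
without_llm = [True, False, False, False, False, True, False, True]
with_llm = [True, True, False, False, False, True, False, True]

baseline = error_rates(without_llm, ground_truth)
assisted = error_rates(with_llm, ground_truth)
print("without LLM:", baseline)
print("with LLM:   ", assisted)
```

The false negative rate is the number that matters most here: it counts the packages where civilians were present and the process missed them. Running both arms on the same packages keeps the comparison paired, so differences cannot be blamed on one arm getting easier cases.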
Waymo published 56.7 million miles of peer-reviewed safety data. The medical AI community showed that human-AI diagnostic teams outperform either alone. Tesla, for all the legitimate critiques of its methods, at least publishes quarterly safety reports. The targeting community has published nothing comparable. The defense establishment treats the question as too sensitive to study. The AI safety community treats it as too dangerous to engage with. Nobody is collecting the data that would tell us whether this technology saves lives or costs them.
-----------------
The CEP went from 12 kilometers to less than 5 meters across eighty years. Each compression was feared, debated, and adopted because someone measured the difference and the data won.
The next compression isn’t in where the weapon lands. It’s in the quality of the decision about where to point it. And as often happens in charged events, the public and the media are debating AI without anyone insisting on the data.
168 children are dead, and the only conversation we’re having is whether the AI caused it. The conversation we should also be having is whether the AI, done right, is what prevents the next one. There is a school somewhere on a target list right now, and whether the intelligence on it is current is not a theoretical question. It has an empirical answer. It is past time someone went and got it.
-----------------
Sources
[1] William Eckhardt, “Civilian Deaths in Wartime,” Bulletin of Peace Proposals, 1989.
[2] Spagat et al., “Estimating the Number of Civilian Casualties in Modern Armed Conflicts,” Frontiers in Public Health, 2021.
[3] Wikipedia, “Civilian casualty ratio.” Compiled estimates across conflicts.
[4] Larry Lewis and Sarah Sewall, “Joint Civilian Casualty Study,” Center for Naval Analyses (CNA), 2010.
[5] Larry Lewis, “Leveraging AI to Mitigate Civilian Harm,” CNA, 2021.
[6] Watson and McKay, cited in “Precision Paradox and Myths of Precision Strike in Modern Armed Conflict,” Royal United Services Institute (RUSI) Journal, 2024.
[7] R.D. Clarke, “An Application of the Poisson Distribution,” Journal of the Institute of Actuaries, Vol. 72, 1946.
[8] Air & Space Forces Magazine, “The Emergence of Smart Bombs,” 2010.
[9] Center for Strategic and Budgetary Assessments (CSBA), “Six Decades of Guided Munitions and Battle Networks,” 2007.
[10] Encyclopedia.com, “Precision-Guided Munitions.”
[11] Phil Koopman, “New Tesla FSD Safety Data,” Substack, November 2025. Methodological critique of Tesla’s fleet comparison; estimates real improvement at ~1.8x after controls.
[12] Kusano et al., “Comparison of Waymo Rider-Only Crash Data to Human Benchmarks at 56.7 Million Miles,” Traffic Injury Prevention, 2025. Peer-reviewed; 85% reduction in any-injury-reported crashes.
[13] Swiss Re / Waymo safety analysis, 2024. 92% fewer bodily injury claims, 88% fewer property damage claims over 25 million miles.

AI in warfare creates an unacceptable moral hazard: accountability. Increasing use of AI means our politicians will blame AI, which cannot suffer meaningful consequences such as losing a job, fines, or imprisonment. Loss of innocent lives will be relegated to the civil arena, where the government generally cannot be sued, or to defense companies, where lawsuits can be effectively fought as just a cost of business.
We have lost so much accountability in this country, and accountability is the backbone of a strong meritocracy. AI is just more lubricant on our downward descent unless carefully regulated.