Every major carrier now requires a photo at delivery.

The idea is simple: the driver snaps a picture of the package at the door, uploads it to the system, and that image becomes the record. If a customer claims their package never arrived, you pull the photo. Case closed.

It's a reasonable system in theory. In practice, I spent two years at FedEx learning exactly how many ways it breaks down — and building a model to catch the ones that mattered most.

The problem with "proof"

When I started digging into proof-of-delivery data at FedEx, I expected to find edge cases. What I found instead was a set of overlapping failure modes that were systematic, not exceptional.

The first was the most straightforward: images that were missing entirely, blurry, or so poorly composed they couldn't establish anything. A photo of a driver's shoe. A completely dark frame. An image that uploaded successfully but contained nothing identifiable. These weren't rare. They were common enough that a meaningful percentage of delivery disputes had no usable visual record at all — which meant that when a customer filed a claim, there was nothing to dispute it with.

The second failure mode was more deliberate: drivers marking packages as delivered without completing the delivery (what the industry calls a "false scan") and either taking no photo or submitting one that couldn't be verified. A photo of a generic doorstep. An image timestamped and geotagged but taken somewhere that wasn't the delivery address. The photo existed in the system. It just wasn't proof of anything.

The third was on the customer side: customers filing non-delivery claims despite valid, legitimate delivery photos, betting (correctly, in many cases) that nobody was actually reviewing the images systematically. If the volume is high enough and the review process is manual, most claims just get paid.

All three failure modes were present. All three were costing money. And before the model existed, the only defense against any of them was a human reviewer looking at images one at a time.

What the model actually did

The system I built approached this as a multi-label classification problem: not a single question but several questions asked simultaneously about every image that came through.

First: is there a valid photo here at all? Not just whether an image file exists in the system, but whether it contains a legible depiction of a delivered package. Blurry, obstructed, or empty images get flagged before they're ever used as evidence in a dispute.
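To make the first check concrete, here is a minimal sketch of a rule-based quality gate. It is not the production model; it only shows the kind of signal involved. The thresholds, the function names, and the choice of Laplacian variance as a sharpness proxy are all assumptions made for illustration.

```python
from statistics import pvariance

# Hypothetical thresholds; real values would be tuned on labeled images.
MIN_MEAN_BRIGHTNESS = 20.0  # 0-255 scale; below this the frame counts as dark
MIN_SHARPNESS = 50.0        # minimum variance of the Laplacian response


def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian over the interior pixels.

    `gray` is a list of rows of 0-255 intensities. Low variance means
    few edges, which is typical of blurry or empty frames.
    """
    responses = []
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return pvariance(responses) if responses else 0.0


def quality_flags(gray):
    """Return the reasons an image is unusable (empty list if it passes)."""
    pixels = [p for row in gray for p in row]
    if not pixels:
        return ["missing"]
    flags = []
    if sum(pixels) / len(pixels) < MIN_MEAN_BRIGHTNESS:
        flags.append("too_dark")
    if laplacian_variance(gray) < MIN_SHARPNESS:
        flags.append("blurry_or_empty")
    return flags
```

A hand-written rule like this catches the completely dark frame but not the photo of a shoe, which is sharp and well lit. That gap is why the question "does this show a delivered package?" needs a trained classifier rather than a fixed threshold.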

Second: does the image show delivery at the right location? Using geolocation data alongside visual signals in the image itself, the model could identify cases where the photo was taken somewhere inconsistent with the delivery address — a significant flag for potential false scans or staged deliveries.
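The geolocation half of that check can be sketched with a great-circle distance between the photo's geotag and the geocoded delivery address. This is an illustration under assumptions: the 150-metre tolerance and the function names are hypothetical, and the visual signals mentioned above are not covered here.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres
MAX_DISTANCE_M = 150        # hypothetical tolerance; would be tuned per area


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def location_mismatch(photo, address, max_distance_m=MAX_DISTANCE_M):
    """True if the photo geotag is farther from the address than the tolerance.

    `photo` and `address` are (lat, lon) tuples in degrees.
    """
    return haversine_m(*photo, *address) > max_distance_m
```

For example, `location_mismatch((35.0, -90.0), (35.01, -90.0))` is true, because a hundredth of a degree of latitude is a little over a kilometre. A fixed radius is a blunt instrument: GPS accuracy varies, and a rural driveway can put a legitimate photo far from the geocoded point, so a distance like this works better as one input to a model than as a verdict.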

Third: does anything in the image suggest it was staged or manipulated? This is the subtler classification, but the signal is there if you train for it. Certain patterns in lighting, composition, and context cluster around fraudulent submissions in ways that are statistically distinguishable from legitimate ones.

Every delivery image ran through all three checks. The model didn't make final decisions — it surfaced flags for human review, and it ranked those flags by confidence and estimated financial exposure. The highest-risk cases got human attention. The clean ones moved through automatically.
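One simple way to picture that ranking is below. The `Flag` fields, the `review_queue` function, and the choice to multiply confidence by exposure (an expected-loss score) are assumptions for this sketch, not a description of how the production system combined the two.

```python
from dataclasses import dataclass


@dataclass
class Flag:
    delivery_id: str
    check: str           # "quality", "location", or "staging"
    confidence: float    # model confidence that the flag is real, 0..1
    exposure_usd: float  # estimated financial exposure if the flag is real

    @property
    def priority(self) -> float:
        # Expected loss: an assumed way to fold both signals into one score.
        return self.confidence * self.exposure_usd


def review_queue(flags, capacity):
    """Return the `capacity` highest-priority flags, highest first."""
    ranked = sorted(flags, key=lambda f: f.priority, reverse=True)
    return ranked[:capacity]
```

With a score like this, a moderately confident flag on a $400 shipment outranks a near-certain flag on a $10 one, which matches the goal of putting reviewer time where the money is.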

Where the $2M came from

The savings came from two places, and they're worth separating because they represent different kinds of value.

The first was claim deflection. When a customer filed a non-delivery complaint, the system could now automatically pull the relevant POD image, run it through the model, and return a confidence assessment of whether the image constituted valid delivery evidence. For claims where the model returned high confidence in a legitimate delivery, the process of disputing the claim became faster, more consistent, and less dependent on whether a human reviewer happened to catch it. A significant portion of fraudulent claims — ones that previously got paid simply because nobody looked closely enough — stopped getting paid.
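The claim-side flow can be sketched as a small triage function. The threshold, the names, and the three outcomes are hypothetical. The output is a recommendation for a human, consistent with the model not making final decisions.

```python
from enum import Enum
from typing import Optional


class Recommendation(Enum):
    DISPUTE = "dispute"          # strong evidence of a legitimate delivery
    REVIEW = "review"            # evidence exists but is not conclusive
    NO_EVIDENCE = "no_evidence"  # no usable POD image to rely on


# Hypothetical cut-off; in practice it would be set from review outcomes.
DISPUTE_THRESHOLD = 0.90


def triage_claim(valid_delivery_confidence: Optional[float]) -> Recommendation:
    """Map the model's confidence in a valid delivery to a recommendation.

    `None` means there was no usable image for the claimed delivery.
    """
    if valid_delivery_confidence is None:
        return Recommendation.NO_EVIDENCE
    if valid_delivery_confidence >= DISPUTE_THRESHOLD:
        return Recommendation.DISPUTE
    return Recommendation.REVIEW
```

The consistency gain comes from the fact that every claim passes through the same function, so whether a claim gets contested no longer depends on which reviewer happened to open it.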

The second was operational. Manual POD review is expensive when it scales across millions of daily deliveries. Routing only flagged images to human reviewers, rather than sampling randomly or reviewing everything, meant the same team could cover dramatically more volume, with its attention concentrated where it actually mattered.

Together, those two levers — fewer fraudulent claims paid out, less labor spent on low-risk images — added up to $2M in yearly savings. Not from a single dramatic intervention, but from systematic pressure applied consistently across an enormous volume of daily transactions.

What this looks like from a 3PL's perspective

I've thought a lot about the POD problem since leaving the carrier side, because it looks completely different depending on where you're sitting.

At FedEx, the images were ours. We owned the data, we owned the infrastructure, and we could build a model that sat directly in the delivery workflow. The integration was hard, but the access was total.

A 3PL doesn't have that. You're coordinating across multiple carriers, each with their own image formats, upload cadences, and API behaviors. The POD data you get is often incomplete, inconsistently structured, and delayed. Building the same kind of systematic auditing layer is harder — but the exposure is identical.
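A first step toward that auditing layer is usually a normalization pass that maps each carrier's payload into one record shape, with missing fields made explicit. The sketch below assumes two made-up carriers; every field name in the raw payloads is hypothetical and does not reflect any real carrier's API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class PodRecord:
    carrier: str
    tracking_id: str
    image_url: Optional[str]
    captured_at: Optional[datetime]
    lat: Optional[float]
    lon: Optional[float]


def from_carrier_a(raw: dict) -> PodRecord:
    # Hypothetical carrier A: flat keys, ISO 8601 timestamp with offset.
    ts = raw.get("photo_time")
    return PodRecord(
        carrier="carrier_a",
        tracking_id=raw["tracking_number"],
        image_url=raw.get("photo_url"),
        captured_at=datetime.fromisoformat(ts) if ts else None,
        lat=raw.get("lat"),
        lon=raw.get("lng"),
    )


def from_carrier_b(raw: dict) -> PodRecord:
    # Hypothetical carrier B: nested "pod" object, epoch seconds, [lat, lon] list.
    pod = raw.get("pod") or {}
    epoch = pod.get("ts")
    gps = pod.get("gps") or [None, None]
    return PodRecord(
        carrier="carrier_b",
        tracking_id=raw["id"],
        image_url=pod.get("image"),
        captured_at=(datetime.fromtimestamp(epoch, tz=timezone.utc)
                     if epoch is not None else None),
        lat=gps[0],
        lon=gps[1],
    )


ADAPTERS = {"carrier_a": from_carrier_a, "carrier_b": from_carrier_b}


def normalize(carrier: str, raw: dict) -> PodRecord:
    """Convert one carrier-specific payload into the shared record shape."""
    return ADAPTERS[carrier](raw)
```

Keeping absent images and geotags as explicit `None` values matters here: for a 3PL, "the carrier never sent a photo" is itself a signal worth counting, not a row to drop.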

Your shipper clients are filing claims against you. Some of those claims are legitimate. Some of them aren't. And without a systematic way to evaluate the image evidence, you're making decisions based on whoever makes the most noise — not on what the data actually shows.

The tools exist to change that. In most cases, so does the data.

The pattern underneath

The POD auditing work and the Shipper at Risk work I've written about before are the same problem at different altitudes.

In both cases, there was an enormous volume of operational data that was being generated, stored, and largely ignored. In both cases, the damage was happening quietly — claims being paid, drivers gaming the system, customers exploiting the gap between what the data showed and what anyone was actually looking at. And in both cases, the fix wasn't a more sophisticated algorithm. It was building a system that actually looked at the data, consistently, at scale.

That's the work. It's less glamorous than it sounds and more valuable than most people expect.


I write about ML systems, logistics operations, and building for production. If you're working on a problem that sounds like this, I'd like to hear about it — waugh.joseph10@gmail.com