BGP Prefix Leak, RPKI, and the Cold Email That Confirmed It

On day 1, I spent the night building a hypothesis I couldn’t prove. On day 2, a Google network engineer confirmed it by mid-morning.

Mid-afternoon on day 1, the entire campus lost access to every major CDN simultaneously. Google, Azure, AWS … gone. SSH to my home lab still worked from my desk. That scope told me something: this wasn’t a campus failure. Something upstream was routing CDN-bound traffic into a black hole.

By 10pm I had a theory. Our BGP posture advertised more-specific prefixes to our R&E peer to bias inbound traffic onto that path. If those more-specifics had leaked into the global table, CDNs worldwide would prefer the longer, more-specific match regardless of whether the path was actually reachable. Longest prefix wins. Every time.
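To make the "longest prefix wins" rule concrete, here is a minimal sketch of how a router picks a route. The prefixes and next-hop labels are hypothetical stand-ins, not our actual address space, and a real router does this lookup in a trie rather than a list scan.

```python
import ipaddress

def best_route(table, destination):
    """Return the (network, next_hop) entry with the longest matching prefix, or None."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in table if dest in net]
    if not matches:
        return None
    # Prefix length is compared before any path attribute, so a longer
    # match wins even if the path behind it is far worse.
    return max(matches, key=lambda entry: entry[0].prefixlen)

# Hypothetical routing table as a remote CDN might see it after a leak.
table = [
    (ipaddress.ip_network("198.18.0.0/16"), "commodity-isp"),
    (ipaddress.ip_network("198.18.0.0/19"), "commodity-isp"),
    (ipaddress.ip_network("198.18.16.0/20"), "leaked-re-path"),
]

net, hop = best_route(table, "198.18.20.5")
print(net, hop)
```

The /20 covers the destination, so it beats both the /19 and the /16 no matter how long or broken the path behind it is.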

By 2am we’d exhausted every other avenue. The CIO stood the team down. I went home with an unconfirmed theory and no management clearance to act on it.

I remembered a name from a network security listserv: Chris Morrow, Google. The next morning I wrote the cold email at 8:40am with zero expectation of a reply. Chris replied the same morning and confirmed it.

Our R&E provider had recently turned up a new customer on the other side of the globe who hadn’t sanitized their outbound route advertisements. Our more-specifics had escaped into the global table through that customer’s commodity ISP. CDNs globally were preferring our leaked more-specifics — and routing return traffic across a path so long most packets died a TTL death before arriving.

I had his reply in hand before the war room reconvened. We shut down the R&E peer. CDN access began returning within minutes.

But triage isn’t resolution.

Shutting down the peer stopped the bleeding. It didn’t close the wound. The vulnerability was the more-specific advertisement policy itself. I reached out directly to Steve Wallace, a Sr. Engineer at Internet2. Same approach as the Morrow email: cold contact, no ticket, no queue. He responded the same day. We screenshared. He walked me through RPKI Route Origin Authorization registration using our ARIN credentials. I executed the registrations myself. ROAs published and live: any network enforcing Route Origin Validation now rejects announcements of our space that conflict with the ROAs, meaning the wrong origin AS or a prefix longer than the registered maximum length.
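For readers who haven't worked with Route Origin Validation, the decision a validating router makes can be sketched roughly as below, following the valid / invalid / not-found logic of RFC 6811. The prefix, the AS numbers, and the maximum length of 19 are all assumptions for illustration; they are not the values in our actual ROAs.

```python
import ipaddress
from typing import NamedTuple

class VRP(NamedTuple):
    """A validated ROA payload: prefix, maximum allowed length, authorized origin AS."""
    prefix: ipaddress.IPv4Network
    max_length: int
    asn: int

def validate(route_prefix, origin_asn, vrps):
    """Classify an announcement as 'valid', 'invalid', or 'not-found'."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        if route.version != vrp.prefix.version:
            continue
        if route.subnet_of(vrp.prefix):
            covered = True
            # Valid only if the origin matches and the prefix is not
            # longer than the ROA's maximum length.
            if vrp.asn == origin_asn and route.prefixlen <= vrp.max_length:
                return "valid"
    return "invalid" if covered else "not-found"

# Hypothetical ROA: a /16 authorized for AS 64496, max length /19.
vrps = [VRP(ipaddress.ip_network("198.18.0.0/16"), 19, 64496)]

print(validate("198.18.0.0/19", 64496, vrps))    # valid
print(validate("198.18.16.0/20", 64496, vrps))   # invalid
print(validate("198.18.0.0/19", 64511, vrps))    # invalid
print(validate("203.0.113.0/24", 64496, vrps))   # not-found
```

Note what the check covers: origin AS and prefix length, not the path the announcement took. A leaked route with the right origin and an allowed length still validates, which is why the ROAs alone were not the whole fix.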

The fix wasn’t just pulling the peer. The /20s that had given I2 the more-specific edge for CDN traffic steering were retired entirely. I2 had been receiving the explicit /16 plus all 16 of its /20 subnets — full coverage at one additional bit of granularity beyond what either commodity ISP saw. Post-incident, the /20s came out. I2 now receives the /16 plus the same 8x /19s carried by the commodity upstreams. With no peer holding a more-specific prefix than any other, a leak through I2 produces no preferential attraction. Longest prefix wins nothing it didn’t already win everywhere else.
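The prefix arithmetic is easy to check. This sketch uses a hypothetical /16 rather than our real block, and models the commodity upstreams as carrying only the eight /19s; it compares the most specific prefix each peer held before and after the redesign.

```python
import ipaddress

# Hypothetical aggregate standing in for the campus /16.
aggregate = ipaddress.ip_network("198.18.0.0/16")

slash20s = list(aggregate.subnets(new_prefix=20))
slash19s = list(aggregate.subnets(new_prefix=19))

def longest_prefix(adverts):
    """Length of the most specific prefix in a set of advertisements."""
    return max(net.prefixlen for net in adverts)

commodity = slash19s                  # assumption: what the commodity ISPs carry
i2_before = [aggregate] + slash20s    # pre-incident: /16 plus sixteen /20s
i2_after = [aggregate] + slash19s     # post-incident: /16 plus eight /19s

print(len(slash20s), len(slash19s))                              # 16 8
print(longest_prefix(i2_before) - longest_prefix(commodity))     # 1
print(longest_prefix(i2_after) - longest_prefix(commodity))      # 0
```

A difference of one bit was the whole traffic-steering advantage, and it was also the whole exposure. At zero, a leaked I2 route ties with the commodity routes on prefix length and has to compete on ordinary path attributes.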

Three phases: triage, root cause closure, architectural redesign. The outage lasted hours. The fix is still standing.

Is your BGP advertisement policy designed around what happens if your peer’s customer doesn’t filter?

