On day 1, I spent the night building a hypothesis I couldn’t prove. On day 2, a Google network engineer confirmed it by mid-morning.
Mid-afternoon on day 1, the entire campus lost access to every major CDN simultaneously. Google, Azure, AWS … gone. SSH to my home lab still worked from my desk. That scope told me something: this wasn’t a campus failure. Something upstream was routing CDN-bound traffic into a black hole.
By 10pm I had a theory. Our BGP posture advertised more-specific prefixes to our R&E peer to bias inbound traffic onto that path. If those more-specifics had leaked into the global table, CDNs worldwide would prefer the longer, more-specific match regardless of whether the path was actually reachable. Longest prefix wins. Every time.
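To make the "longest prefix wins" rule concrete, here is a minimal Python sketch of how a router picks among overlapping routes. The prefixes and next-hop labels are hypothetical placeholders, not our real address space, and real routers use tries or hardware lookups rather than a linear scan.

```python
import ipaddress

def longest_prefix_match(table, destination):
    """Return the (network, next_hop) entry with the longest matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in table if addr in net]
    if not matches:
        return None
    # The most specific route (largest prefix length) always wins,
    # regardless of how good or bad the path behind it is.
    return max(matches, key=lambda entry: entry[0].prefixlen)

# Hypothetical table: placeholder prefixes and labels, for illustration only.
table = [
    (ipaddress.ip_network("10.20.0.0/16"), "commodity-isp"),
    (ipaddress.ip_network("10.20.0.0/19"), "commodity-isp"),
    (ipaddress.ip_network("10.20.16.0/20"), "leaked-path"),
]

covered_by_leak = longest_prefix_match(table, "10.20.17.5")[1]   # "leaked-path"
outside_the_leak = longest_prefix_match(table, "10.20.40.1")[1]  # "commodity-isp"
```

The first lookup matches all three routes, and the /20 wins purely on length. That is the behavior the theory depended on.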
By 2am we’d exhausted every other avenue. The CIO stood the team down. I went home with an unconfirmed theory and no management clearance to act on it.
I remembered a name from a network security listserv: Chris Morrow, Google. The next morning I wrote a cold email at 8:40am with zero expectation of a reply. Chris replied the same morning and confirmed the theory.
Our R&E provider had recently turned up a new customer on the other side of the globe who hadn’t sanitized their outbound route advertisements. Our more-specifics had escaped into the global table through that customer’s commodity ISP. CDNs globally were preferring our leaked more-specifics, and routing return traffic across a path so long that most packets died a TTL death before arriving.
I had his reply in hand before the war room reconvened. We shut down the R&E peer. CDN access began returning within minutes.
But triage isn’t resolution.
Shutting down the peer stopped the bleeding. It didn’t close the wound. The vulnerability was the more-specific advertisement policy itself. I reached out directly to Steve Wallace, a Sr. Engineer at Internet2. Same approach as the Morrow email: cold contact, no ticket, no queue. He responded the same day. We screenshared. He walked me through RPKI Route Origin Authorization (ROA) registration using our ARIN credentials. I executed the registrations myself. ROAs published and live: any network enforcing Route Origin Validation now rejects announcements of our address space that conflict with them, whether through the wrong origin AS or a prefix longer than the ROA allows.
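For readers who haven't worked with Route Origin Validation, the check a validating network runs can be sketched roughly as below, following the valid / invalid / not-found outcomes of RFC 6811. The ROA values (prefix, maximum length, AS number) are hypothetical placeholders. I'm not reproducing our actual registrations here.

```python
import ipaddress
from typing import NamedTuple

class ROA(NamedTuple):
    prefix: ipaddress.IPv4Network  # address block the ROA covers
    max_length: int                # longest prefix length the holder authorizes
    asn: int                       # AS authorized to originate it

def validate_origin(prefix, origin_asn, roas):
    """Classify an announcement as 'valid', 'invalid', or 'not-found'."""
    announced = ipaddress.ip_network(prefix)
    covering = [r for r in roas if announced.subnet_of(r.prefix)]
    if not covering:
        return "not-found"  # no ROA speaks for this address space
    for roa in covering:
        if roa.asn == origin_asn and announced.prefixlen <= roa.max_length:
            return "valid"
    return "invalid"        # covered, but wrong origin or too specific

# Hypothetical ROA: a /16 authorized down to /19, documentation-range ASN.
roas = [ROA(ipaddress.ip_network("10.20.0.0/16"), 19, 64496)]

ok = validate_origin("10.20.0.0/19", 64496, roas)             # "valid"
too_specific = validate_origin("10.20.16.0/20", 64496, roas)  # "invalid"
wrong_origin = validate_origin("10.20.0.0/16", 64511, roas)   # "invalid"
```

One caveat the sketch makes visible: a leaked route usually keeps its original origin AS, so origin validation only flags it when the prefix is longer than the ROA's maximum length.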
The fix wasn’t just pulling the peer. The /20s that had given Internet2 (I2) the more-specific edge for CDN traffic steering were retired entirely. I2 had been receiving the explicit /16 plus all 16 of its /20 subnets: full coverage at one additional bit of granularity beyond what either commodity ISP saw. Post-incident, the /20s came out. I2 now receives the /16 plus the same eight /19s carried by the commodity upstreams. With no peer holding a more-specific prefix than any other, a leak through I2 produces no preferential attraction. Longest prefix wins nothing it didn’t already win everywhere else.
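The prefix arithmetic behind the redesign can be checked with a few lines of Python. This is a sketch using a placeholder /16, not our real allocation, and it models only the more-specifics each upstream receives as described above.

```python
import ipaddress

aggregate = ipaddress.ip_network("10.20.0.0/16")  # placeholder for the real /16

slash_19s = list(aggregate.subnets(new_prefix=19))  # 2**(19-16) = 8 subnets
slash_20s = list(aggregate.subnets(new_prefix=20))  # 2**(20-16) = 16 subnets

def most_specific_length(advertisements):
    """Map each upstream to the longest prefix length it is sent."""
    return {peer: max(net.prefixlen for net in nets)
            for peer, nets in advertisements.items()}

# Before the incident: I2 alone held the /20s.
before = most_specific_length({
    "I2": [aggregate] + slash_20s,
    "commodity": slash_19s,
})

# After the redesign: every upstream tops out at /19.
after = most_specific_length({
    "I2": [aggregate] + slash_19s,
    "commodity": slash_19s,
})

# A leak can only attract traffic if some upstream holds a longer prefix than the rest.
leak_attracts_before = len(set(before.values())) > 1  # True
leak_attracts_after = len(set(after.values())) > 1    # False
```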
Three phases: triage, root cause closure, architectural redesign. The outage lasted hours. The fix is still standing.
Is your BGP advertisement policy designed around what happens if your peer’s customer doesn’t filter?