FIB Failures: When the Control Plane Is Right and Traffic Still Drops
Understanding the gap between RIB convergence and FIB programming in production networks.
Most large networks don't fail because the design is wrong.
They fail because the Forwarding Information Base (FIB) hits limits that architecture reviews never model.
📋 What Design & Config Checks Validate
Design and config checks validate:
- ✔ Routing correctness
- ✔ Features like PIC, TI-LFA, SR, Add-Path
- ✔ Timers and best practices
All necessary.
Still insufficient.
Because forwarding is constrained by silicon, not by intent.
⚠️ RIB Converged ≠ Forwarding Correct
A familiar production pattern:
- ✓ Control plane status: routing protocols converged, RIB fully populated
- ✗ Forwarding reality: traffic to affected prefixes still drops
This is not a control-plane issue.
It's a FIB programming failure.
🔍 Common Real-World FIB Failure Patterns
1. TCAM Exhaustion & Fragmentation
- Asymmetric programming across line cards
- Fragmentation blocks new entries
- Prefixes exist in RIB but never reach hardware
Often triggered by combined scale: Internet routes + ACLs + QoS + SR
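The combined-scale trigger is easy to reason about with back-of-the-envelope math. Here is a minimal sketch; the capacity, per-feature counts, and fragmentation overhead are illustrative assumptions, not vendor data:

```python
# Hypothetical sketch: estimate combined TCAM pressure across features.
# All numbers below are illustrative assumptions, not vendor figures.
def tcam_headroom(capacity, consumers, frag_overhead=0.15):
    """Return usable free entries after feature consumption,
    discounting a slice of free space stranded by fragmentation."""
    used = sum(consumers.values())
    effective_free = (capacity - used) * (1 - frag_overhead)
    return int(effective_free)

consumers = {
    "bgp_internet": 950_000,  # full table plus growth margin
    "acl": 60_000,
    "qos": 20_000,
    "sr_labels": 40_000,
}
print(f"effective free entries: {tcam_headroom(1_200_000, consumers)}")
```

The point of the exercise: each feature looks comfortable in isolation, but the sum, minus fragmentation waste, can leave far less headroom than any single-feature review suggests.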
2. PIC Edge Timing Gaps
- Software switches next-hops instantly
- Hardware lags under scale
- Micro-blackholes, stale adjacencies, VRF-specific loss
PIC works.
Forwarding timing doesn't always match.
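The gap is a simple rate problem: software flips the next-hop once, but hardware must rewrite every affected prefix. A rough sketch, with assumed (not measured) update rates:

```python
# Illustrative sketch: why an "instant" PIC switchover can still blackhole
# traffic. The prefix count and hardware update rate are assumptions.
def blackhole_window_ms(prefixes, hw_updates_per_sec, rib_switch_ms=1):
    """Time until the LAST affected prefix is reprogrammed in hardware,
    measured against the near-instant software next-hop switch."""
    hw_done_ms = prefixes / hw_updates_per_sec * 1000
    return max(0.0, hw_done_ms - rib_switch_ms)

# 500k affected prefixes at 100k hardware updates/sec:
# a multi-second tail of stale forwarding after "convergence".
print(round(blackhole_window_ms(500_000, 100_000)), "ms")
```

Until that tail drains, prefixes still pointing at the old adjacency are the micro-blackholes and VRF-specific loss described above.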
3. Segment Routing / TI-LFA Scale Pressure
- Node SIDs, Adj-SIDs, repair paths, policies all compete for FIB
- Backup paths compute correctly
- Only partially program in hardware
Failures surface during large topology events—exactly when protection is needed.
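A quick budget calculation shows how SR protection multiplies FIB demand. The topology numbers below are hypothetical:

```python
# Rough FIB-entry budget for an SR/TI-LFA deployment.
# Node count, adjacency fan-out, and prefix scale are illustrative.
def sr_fib_entries(nodes, avg_adjacencies, protected_prefixes, repair_depth=2):
    node_sids = nodes
    adj_sids = nodes * avg_adjacencies
    # Each protected prefix carries repair-path entries on top of its primary.
    repair_entries = protected_prefixes * repair_depth
    return node_sids + adj_sids + repair_entries

print(sr_fib_entries(nodes=500, avg_adjacencies=4, protected_prefixes=200_000))
```

Note that the repair entries dominate: protecting every prefix at depth 2 doubles the per-prefix cost, which is exactly the pressure that surfaces during large topology events.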
❌ Why Design & Config Audits Miss This
| Audit Type | What It Answers |
|---|---|
| Design Review | Should this work? |
| Config Audit | Is it enabled? |
| ❓ Missing Question | Can the hardware sustain worst-case churn, scale, and recovery simultaneously? |
FIB failures are stress-induced, incremental, and often invisible until failure conditions align.
✅ Post-Incident FIB Audit Checklist
After every major incident, check:
| Audit Area | What to Check |
|---|---|
| RIB vs FIB | • Prefixes present in RIB but missing in hardware • Per-line-card inconsistencies |
| TCAM Health | • Utilization and fragmentation • Feature-wise consumption (BGP, ACL, QoS, SR) |
| Failover Reality | • PIC trigger time vs actual forwarding switchover • Micro-blackholes during convergence |
| SR / Labels | • Repair paths actually installed in FIB • Label space pressure or partial installs |
| Programming Performance | • FIB update latency during failure • Hardware programming drops or queueing |
| Asymmetry & Churn | • Uneven FIB pressure across cards • Route churn volume during the event |
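The first audit area, RIB vs FIB, automates well: diff the prefix sets per line card. A minimal sketch, assuming you have already parsed the CLI output into plain prefix lists (the parsing itself is platform-specific and omitted):

```python
# Minimal sketch of a RIB-vs-FIB consistency check. Assumes prefixes have
# already been extracted from CLI output; line-card names are examples.
def fib_gaps(rib_prefixes, fib_by_linecard):
    """Return, per line card, the prefixes present in RIB but absent
    from that card's hardware FIB."""
    rib = set(rib_prefixes)
    return {lc: sorted(rib - set(fib)) for lc, fib in fib_by_linecard.items()}

rib = ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"]
fib = {
    "0/0/CPU0": rib,                 # fully programmed
    "0/3/CPU0": ["10.0.0.0/24"],     # under-programmed card
}
print(fib_gaps(rib, fib))
```

Any non-empty gap that persists is exactly the "RIB converged, forwarding broken" signature this checklist is hunting for.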
💡 The Hard Truth
Most "random" outages are not bugs.
They are hardware scale limits discovered during failure.
The control plane did exactly what it should.
The silicon couldn't keep up.
🔧 Diagnostic Commands for FIB Validation
Cisco IOS-XR

```
# RIB vs FIB consistency
show route
show cef
show cef inconsistency

# TCAM utilization
show controllers npu resources all location all
show controllers fia diagshell 0 "diag cosq stat" location all

# Per-line-card FIB
show cef location 0/0/CPU0
show adjacency location 0/0/CPU0
```
Cisco IOS-XE / NX-OS

```
# RIB vs FIB consistency
show ip route
show ip cef
show ip cef inconsistency

# TCAM health
show platform hardware fed active fwd-asic resource tcam utilization
show hardware capacity
```
Juniper Junos

```
# RIB vs FIB consistency
show route
show route forwarding-table

# FIB programming
show pfe statistics traffic
show chassis forwarding
```
📊 Real-World Scenario: When Everything "Works" But Traffic Drops
Incident Timeline:

| Time | Event |
|---|---|
| T+0 | Link failure triggers PIC Edge |
| T+50ms | RIB updates complete, next-hops switched |
| T+200ms | FIB programming starts on line cards |
| T+2s | Line card 3 TCAM full, drops 1,200 prefixes |
| T+5s | Monitoring shows "BGP converged" ✓ |
| Impact | Traffic to 1,200 prefixes blackholed for 8 minutes until manual intervention |
Root cause: TCAM fragmentation + scale. No config error. No design flaw. Hardware couldn't sustain the churn.
🛠️ Preventive Measures
- **Baseline TCAM utilization across all line cards**
  - Track per-feature consumption (routing, ACLs, QoS, SR labels)
  - Monitor fragmentation levels
  - Set alerts at 70%, not 90%
- **Test FIB programming under failure conditions**
  - Simulate link failures during peak routing table size
  - Measure actual FIB update latency, not just RIB convergence
  - Validate per-line-card consistency
- **Implement FIB monitoring in production**
  - Compare RIB vs FIB prefix counts continuously
  - Alert on inconsistencies that persist > 30 seconds
  - Track hardware programming queue depth
- **Right-size SR/TI-LFA deployments**
  - Not every prefix needs backup path protection
  - Limit repair path depth
  - Test combined scale: Internet + SR + ACLs
- **Include FIB validation in change windows**
  - Post-change: verify RIB/FIB consistency
  - Check TCAM utilization trends
  - Document FIB programming timing
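The "persist > 30 seconds" rule matters because transient RIB/FIB skew during normal convergence is expected. A sketch of the debounce logic, with hypothetical prefix names:

```python
# Sketch of the "alert only on inconsistencies persisting > 30 s" rule.
# Tracks when each prefix was first seen missing from hardware and
# suppresses alerts inside the grace period. Prefixes are examples.
import time

GRACE_SEC = 30
first_seen_missing = {}  # prefix -> timestamp first observed missing

def check(missing_now, now=None):
    now = now if now is not None else time.time()
    alerts = []
    for p in missing_now:
        first_seen_missing.setdefault(p, now)
        if now - first_seen_missing[p] > GRACE_SEC:
            alerts.append(p)
    # Clear prefixes that reappeared in hardware.
    for p in list(first_seen_missing):
        if p not in missing_now:
            del first_seen_missing[p]
    return alerts

print(check({"10.0.1.0/24"}, now=0))   # first sighting: suppressed
print(check({"10.0.1.0/24"}, now=45))  # still missing after 45 s: alert
```

The same debounce pattern applies to any of the polled metrics above (TCAM utilization spikes, programming queue depth): alert on sustained state, not on convergence noise.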
🎯 Final Thought
"Your network is defined not by what the RIB converges to, but by what the FIB can sustain under stress."
If you don't audit the FIB after incidents,
you're debugging symptoms—not root cause.
And hope is not an operational strategy.
📚 Key Takeaways:
- RIB convergence ≠ Forwarding correctness — Always verify FIB programming
- TCAM exhaustion is silent — Until failure strikes during churn
- PIC timing gaps are real — Software and hardware don't always sync
- SR/TI-LFA scale matters — Protection paths compete for limited resources
- Post-incident FIB audits are mandatory — Not optional
- Design reviews miss hardware limits — Test under stress, not just steady-state
