FIB Failures: When the Control Plane Is Right and Traffic Still Drops
Understanding the gap between RIB convergence and FIB programming in production networks.
Most large networks don't fail because the design is wrong.
They fail because the Forwarding Information Base (FIB) hits limits that architecture reviews never model.
📋 What Design & Config Checks Validate
Design and config checks validate:
- ✔ Routing correctness
- ✔ Features like PIC, TI-LFA, SR, Add-Path
- ✔ Timers and best practices
All necessary.
Still insufficient.
Because forwarding is constrained by silicon, not by intent.
⚠️ RIB Converged ≠ Forwarding Correct
A familiar production pattern:
- ✓ Control plane status: routing protocols converged, RIB fully populated
- ✗ Forwarding reality: traffic to affected prefixes still drops
This is not a control-plane issue.
It's a FIB programming failure.
🔍 Common Real-World FIB Failure Patterns
1. TCAM Exhaustion & Fragmentation
- Asymmetric programming across line cards
- Fragmentation blocks new entries
- Prefixes exist in RIB but never reach hardware
Often triggered by combined scale: Internet routes + ACLs + QoS + SR
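The combined-scale trigger is easy to reason about with back-of-the-envelope math. Here is a minimal sketch; the capacity, per-feature counts, and fragmentation overhead are illustrative assumptions, not vendor data:

```python
# Hypothetical sketch: estimate combined TCAM pressure across features.
# All numbers below are illustrative assumptions, not vendor figures.
def tcam_headroom(capacity, consumers, frag_overhead=0.15):
    """Return usable free entries after feature consumption,
    discounting a slice of free space stranded by fragmentation."""
    used = sum(consumers.values())
    effective_free = (capacity - used) * (1 - frag_overhead)
    return int(effective_free)

consumers = {
    "bgp_internet": 950_000,  # full table plus growth margin
    "acl": 60_000,
    "qos": 20_000,
    "sr_labels": 40_000,
}
print(f"effective free entries: {tcam_headroom(1_200_000, consumers)}")
```

The point of the exercise: each feature looks comfortable in isolation, but the sum, minus fragmentation waste, can leave far less headroom than any single-feature review suggests.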
2. PIC Edge Timing Gaps
- Software switches next-hops instantly
- Hardware lags under scale
- Micro-blackholes, stale adjacencies, VRF-specific loss
PIC works.
Forwarding timing doesn't always match.
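The gap is a simple rate problem: software flips the next-hop once, but hardware must rewrite every affected prefix. A rough sketch, with assumed (not measured) update rates:

```python
# Illustrative sketch: why an "instant" PIC switchover can still blackhole
# traffic. The prefix count and hardware update rate are assumptions.
def blackhole_window_ms(prefixes, hw_updates_per_sec, rib_switch_ms=1):
    """Time until the LAST affected prefix is reprogrammed in hardware,
    measured against the near-instant software next-hop switch."""
    hw_done_ms = prefixes / hw_updates_per_sec * 1000
    return max(0.0, hw_done_ms - rib_switch_ms)

# 500k affected prefixes at 100k hardware updates/sec:
# a multi-second tail of stale forwarding after "convergence".
print(round(blackhole_window_ms(500_000, 100_000)), "ms")
```

Until that tail drains, prefixes still pointing at the old adjacency are the micro-blackholes and VRF-specific loss described above.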
3. Segment Routing / TI-LFA Scale Pressure
- Node SIDs, Adj-SIDs, repair paths, policies all compete for FIB
- Backup paths compute correctly
- Only partially program in hardware
Failures surface during large topology events—exactly when protection is needed.
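A quick budget calculation shows how SR protection multiplies FIB demand. The topology numbers below are hypothetical:

```python
# Rough FIB-entry budget for an SR/TI-LFA deployment.
# Node count, adjacency fan-out, and prefix scale are illustrative.
def sr_fib_entries(nodes, avg_adjacencies, protected_prefixes, repair_depth=2):
    node_sids = nodes
    adj_sids = nodes * avg_adjacencies
    # Each protected prefix carries repair-path entries on top of its primary.
    repair_entries = protected_prefixes * repair_depth
    return node_sids + adj_sids + repair_entries

print(sr_fib_entries(nodes=500, avg_adjacencies=4, protected_prefixes=200_000))
```

Note that the repair entries dominate: protecting every prefix at depth 2 doubles the per-prefix cost, which is exactly the pressure that surfaces during large topology events.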
❌ Why Design & Config Audits Miss This
| Audit Type | What It Answers |
|---|---|
| Design Review | Should this work? |
| Config Audit | Is it enabled? |
| ❓ Missing Question | Can the hardware sustain worst-case churn, scale, and recovery simultaneously? |
FIB failures are stress-induced, incremental, and often invisible until failure conditions align.
✅ Post-Incident FIB Audit Checklist
After every major incident, check:
| Audit Area | What to Check |
|---|---|
| RIB vs FIB | • Prefixes present in RIB but missing in hardware • Per-line-card inconsistencies |
| TCAM Health | • Utilization and fragmentation • Feature-wise consumption (BGP, ACL, QoS, SR) |
| Failover Reality | • PIC trigger time vs actual forwarding switchover • Micro-blackholes during convergence |
| SR / Labels | • Repair paths actually installed in FIB • Label space pressure or partial installs |
| Programming Performance | • FIB update latency during failure • Hardware programming drops or queueing |
| Asymmetry & Churn | • Uneven FIB pressure across cards • Route churn volume during the event |
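The first audit area, RIB vs FIB, automates well: diff the prefix sets per line card. A minimal sketch, assuming you have already parsed the CLI output into plain prefix lists (the parsing itself is platform-specific and omitted):

```python
# Minimal sketch of a RIB-vs-FIB consistency check. Assumes prefixes have
# already been extracted from CLI output; line-card names are examples.
def fib_gaps(rib_prefixes, fib_by_linecard):
    """Return, per line card, the prefixes present in RIB but absent
    from that card's hardware FIB."""
    rib = set(rib_prefixes)
    return {lc: sorted(rib - set(fib)) for lc, fib in fib_by_linecard.items()}

rib = ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"]
fib = {
    "0/0/CPU0": rib,                 # fully programmed
    "0/3/CPU0": ["10.0.0.0/24"],     # under-programmed card
}
print(fib_gaps(rib, fib))
```

Any non-empty gap that persists is exactly the "RIB converged, forwarding broken" signature this checklist is hunting for.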
💡 The Hard Truth
Most "random" outages are not bugs.
They are hardware scale limits discovered during failure.
The control plane did exactly what it should.
The silicon couldn't keep up.
🔧 Diagnostic Commands for FIB Validation
Cisco IOS-XR

```
# RIB vs FIB consistency
show route
show cef
show cef inconsistency

# TCAM utilization
show controllers npu resources all location all
show controllers fia diagshell 0 "diag cosq stat" location all

# Per-line-card FIB
show cef location 0/0/CPU0
show adjacency location 0/0/CPU0
```
Cisco IOS-XE / NX-OS

```
# RIB vs FIB consistency
show ip route
show ip cef
show ip cef inconsistency

# TCAM health
show platform hardware fed active fwd-asic resource tcam utilization
show hardware capacity
```
Juniper Junos

```
# RIB vs FIB consistency
show route
show route forwarding-table

# FIB programming
show pfe statistics traffic
show chassis forwarding
```
📊 Real-World Scenario: When Everything "Works" But Traffic Drops
Incident Timeline:

| Time | Event |
|---|---|
| T+0 | Link failure triggers PIC Edge |
| T+50ms | RIB updates complete, next-hops switched |
| T+200ms | FIB programming starts on line cards |
| T+2s | Line card 3 TCAM full, drops 1,200 prefixes |
| T+5s | Monitoring shows "BGP converged" ✓ |
| Impact | Traffic to 1,200 prefixes blackholed for 8 minutes until manual intervention |
Root cause: TCAM fragmentation + scale. No config error. No design flaw. Hardware couldn't sustain the churn.
🛠️ Preventive Measures
- **Baseline TCAM utilization across all line cards**
  - Track per-feature consumption (routing, ACLs, QoS, SR labels)
  - Monitor fragmentation levels
  - Set alerts at 70%, not 90%
- **Test FIB programming under failure conditions**
  - Simulate link failures during peak routing table size
  - Measure actual FIB update latency, not just RIB convergence
  - Validate per-line-card consistency
- **Implement FIB monitoring in production**
  - Compare RIB vs FIB prefix counts continuously
  - Alert on inconsistencies that persist > 30 seconds
  - Track hardware programming queue depth
- **Right-size SR/TI-LFA deployments**
  - Not every prefix needs backup path protection
  - Limit repair path depth
  - Test combined scale: Internet + SR + ACLs
- **Include FIB validation in change windows**
  - Post-change: verify RIB/FIB consistency
  - Check TCAM utilization trends
  - Document FIB programming timing
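The "persist > 30 seconds" rule matters because transient RIB/FIB skew during normal convergence is expected. A sketch of the debounce logic, with hypothetical prefix names:

```python
# Sketch of the "alert only on inconsistencies persisting > 30 s" rule.
# Tracks when each prefix was first seen missing from hardware and
# suppresses alerts inside the grace period. Prefixes are examples.
import time

GRACE_SEC = 30
first_seen_missing = {}  # prefix -> timestamp first observed missing

def check(missing_now, now=None):
    now = now if now is not None else time.time()
    alerts = []
    for p in missing_now:
        first_seen_missing.setdefault(p, now)
        if now - first_seen_missing[p] > GRACE_SEC:
            alerts.append(p)
    # Clear prefixes that reappeared in hardware.
    for p in list(first_seen_missing):
        if p not in missing_now:
            del first_seen_missing[p]
    return alerts

print(check({"10.0.1.0/24"}, now=0))   # first sighting: suppressed
print(check({"10.0.1.0/24"}, now=45))  # still missing after 45 s: alert
```

The same debounce pattern applies to any of the polled metrics above (TCAM utilization spikes, programming queue depth): alert on sustained state, not on convergence noise.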
🎯 Final Thought
"Your network is defined not by what the RIB converges to, but by what the FIB can sustain under stress."
If you don't audit the FIB after incidents,
you're debugging symptoms—not root cause.
And hope is not an operational strategy.
📚 Key Takeaways:
- RIB convergence ≠ Forwarding correctness — Always verify FIB programming
- TCAM exhaustion is silent — Until failure strikes during churn
- PIC timing gaps are real — Software and hardware don't always sync
- SR/TI-LFA scale matters — Protection paths compete for limited resources
- Post-incident FIB audits are mandatory — Not optional
- Design reviews miss hardware limits — Test under stress, not just steady-state
