It has almost been a year since we turned 25, and transferred a “whole universe of data” at Supercomputing 2011 – and that was over a single 100G link between NERSC and Seattle. Now we are close to the end of building out the fifth generation of our network, ESnet5.
In order to minimize the downtime for the sites, we are building ESnet5 parallel to ESnet4, with just a configuration-driven switch of traffic from one network to the other. Since the scientific community we serve depends on the network to be up, it’s important to have assurance that the transition is not disruptive in anyway. The question we have heard over and over again from some of our users – when you switch the ESnet4 production traffic to ESnet5, how confident are you that the whole network will work, and not collapse?
In this blog post, I’d like to introduce an innovative testing concept the ESnet network engineering team (with special kudos to Chris Tracy) developed and implemented to address this very problem.
The goal of our testing was to ensure that the entire set of backbone network ports would perform solidly at full 100 Gbps saturation with no packet loss, over a 24 hour period. However we had some limitations. With only one Ixia test-set with 100 GE cards at hand to generate and receive packets and not enough time to ship that equipment to every PoP and test each link, we had to create a test scenario that would generate confidence that all the deployed routers and optical hardware, optics, the fiber connections, and the underlying fiber would performing flawlessly in production.
This implied creating a scenario where the 100 Gbps traffic stream being generated by the Ixia would be carried bi-directionally over every router interface deployed in ESnet5, traverse it only once and cover the entire ESnet5 topology before being directed back to the test hardware. A creative traffic loop was created that traversed the entire footprint, and we called it the ‘Snake Test’. Even though the first possible solution was used to create the ‘snake’, I am wondering if this could be framed as a NP-hard theoretical computer science and optimization approach known as the traveling salesman problem for more complex topologies?
The diagram below illustrates the test topology:
So after sending around 1.2 petabytes of data in 24 hours, and accounting for surprise fiber maintenance events that caused the link to flap, the engineering team was happy to see a zero loss situation.
Here’s a sample portion of the data collected:
Automation is key – utility scripts had been built to do things like load/unload the config from the routers, poll the firewall counters (to check for loss ingress/egress at every interface), clear stats, parse the resulting log files and turn them into CSV (a snapshot you see in the picture) for analysis.
Phew! – the transition from ESnet4 to ESnet5 continues without a hitch. Watch out for the completion news, it may come quicker than you think…..
On behalf of the ESnet Engineering team