High-speed intelligent Research and Education Networks (RENs), such as the one we are building as part of the ESnet6 program, will require a greater ability to understand and manage traffic flows. One research program underway to provide this capability is the High Touch effort: a programmable, scalable, and expressive hardware and software solution that produces and analyzes per-packet telemetry with nanosecond-accurate timing. Along with Zhang Liu, Bruce Mah, Yatish Kumar, and Chin Guok, I have just released a presentation for the Proceedings of the 2021 Virtual Meeting on Systems and Network Telemetry and Analytics, describing work underway to create a programmable, very-high-speed packet monitoring and telemetry capability as part of bringing High Touch to life.
In the previous post we discussed deploying ZoMbis (Zeek on Management based information system) on ESnet6's management network to monitor the traffic traversing the network and to provide visibility into what's happening there. This blog post discusses traffic sinkholes, a way of redirecting traffic so that it can be captured and analyzed. Unlike our usual passive data collection methods (e.g., tapping or port mirroring), sinkholing actively redirects traffic to network monitoring systems such as Zeek. Network sensors can then perform various levels of in-depth analysis on the traffic, which can help detect misconfigurations, identify hostile traffic, or even perform automated mitigations for certain attacks.
Sinkholes are an important tool in the arsenal of network operators—they support network cyber defense by providing a way to redirect packets sent to or from unallocated (so-called “bogon”) addresses or other unexpected IP addresses. Additionally, they can help protect against reconnaissance or vulnerability scanning. If an attack does slip through these defenses, the damage can be limited, or the malicious traffic can be analyzed by network defenders to determine the source and methods being used.
As part of the ESnet6 security architecture, a sinkhole service will be deployed on the production management network to redirect both internal management traffic and externally sourced Internet traffic destined for the management network. Using the Border Gateway Protocol (BGP), each sinkhole advertises itself as the destination for IP ranges of the management network, so that matching traffic is redirected to it. In our network, the management plane address set fits within a “supernet” (a collection of subnets), which can be advertised with the sinkhole as its destination. We will use this advertised supernet to redirect all traffic from external sources on the Internet away from the management network and into the external sinkhole.
An internal sinkhole will also advertise this management supernet for “inside” resources, but in this case legitimate traffic will match a more specific route for its destination and not go to the sinkhole. This way, only traffic destined for an invalid subnet will be redirected to the internal sinkhole. This design should be extremely useful in identifying possible misconfigurations or other unexpected behaviors in the ESnet6 management network: if everything is behaving as expected, we should never see any traffic to the catch-all destination of the sinkhole.
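The longest-prefix-match behavior behind this design can be sketched in a few lines of Python. All addresses here are made up for illustration, and real route selection of course happens in the routers via BGP best-path selection; the sketch only shows why a more-specific POP route wins over the sinkhole's supernet advertisement.

```python
import ipaddress

# Hypothetical addresses for illustration: a management supernet advertised
# by the sinkhole, and a more-specific subnet advertised by a real POP router.
SINKHOLE_SUPERNET = ipaddress.ip_network("10.10.0.0/16")  # sinkhole catch-all
POP_SUBNET = ipaddress.ip_network("10.10.5.0/24")         # legitimate POP subnet

def next_hop(dst: str) -> str:
    """Return which advertisement wins for a destination, using
    longest-prefix match as a router would."""
    addr = ipaddress.ip_address(dst)
    candidates = [n for n in (POP_SUBNET, SINKHOLE_SUPERNET) if addr in n]
    if not candidates:
        return "no route"
    # The most-specific (longest prefix) matching route wins.
    best = max(candidates, key=lambda n: n.prefixlen)
    return "POP" if best == POP_SUBNET else "sinkhole"

print(next_hop("10.10.5.7"))    # legitimate subnet: reaches the POP
print(next_hop("10.10.99.1"))   # unallocated subnet: caught by the sinkhole
```

Traffic to the valid subnet follows the /24 toward the POP; anything else inside the supernet falls through to the sinkhole's /16, which is exactly the catch-all behavior described above.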
The following diagram, taken from a ZeekWeek 2020 presentation by ESnet security engineer Scott Campbell, shows the basic design of the two kinds of sinkholes:
In the external sinkhole conceptual diagram above, routers R1 and R2 will be advertising the management address ranges to external sources. If any traffic destined to the management network is received from the Internet, it will instead be redirected to the sinkhole.
The external use case is a bit simpler than the internal sinkhole, which is diagrammed below. In the latter case, there will be some legitimate connections, such as between two ESnet points of presence (POPs), or between a POP and our data center. Any unwanted, misconfigured, or hostile scanning traffic will end up in the internal sinkhole. Hence, internal sinkholes can be thought of both as network “garbage cans” and as intrusion sensors, helping to detect changes in normal management traffic patterns.
The ESnet Security Team will use Zeek to analyze traffic at the application level for both types of sinkholes. The logs generated by Zeek will be collected centrally, providing useful insight into what kind of unwanted traffic is being directed at our management plane from both internal and external sources, and helping better protect ESnet6 from attackers.
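As a rough illustration of the kind of centralized analysis this enables, the sketch below tallies which management-plane addresses attract sinkholed connections from Zeek's tab-separated conn.log output. The sample log text and addresses are invented; a real pipeline would read the log files Zeek writes on the sinkhole sensors and likely feed them into a central log platform rather than a script like this.

```python
from collections import Counter

# A tiny, hand-made sample in Zeek's default TSV conn.log format
# (field names follow Zeek's conn.log; the records are invented).
SAMPLE_CONN_LOG = """\
#fields\tts\tuid\tid.orig_h\tid.orig_p\tid.resp_h\tid.resp_p\tproto
1609459200.0\tC1\t192.0.2.10\t50000\t10.10.99.1\t22\ttcp
1609459201.0\tC2\t192.0.2.11\t50001\t10.10.99.1\t23\ttcp
1609459202.0\tC3\t198.51.100.7\t50002\t10.10.42.9\t445\ttcp
"""

def top_sinkhole_targets(log_text: str) -> Counter:
    """Tally destination addresses seen in sinkholed connections."""
    fields = []
    counts = Counter()
    for line in log_text.splitlines():
        if line.startswith("#fields"):
            # The #fields header names the columns for the records below.
            fields = line.split("\t")[1:]
        elif line and not line.startswith("#"):
            rec = dict(zip(fields, line.split("\t")))
            counts[rec["id.resp_h"]] += 1
    return counts

print(top_sinkhole_targets(SAMPLE_CONN_LOG).most_common(1))
# -> [('10.10.99.1', 2)]
```

Even a simple tally like this surfaces which invalid destinations are being probed most, which is the kind of signal that distinguishes a misconfiguration from a scan.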
The ESnet6 2021 Annual Status Review was a great success, and the Review Committee, led by DOE, concluded that the ESnet6 Project is being managed and executed well!
Given that the project’s budget, scope, and schedule were approved in February 2020, this was the first official Annual Status Review – and what a year it has been! The 2021 Review was a major milestone, allowing the Project to formally present the project performance over the past year and, consequently, during the COVID-19 pandemic. I continue to be amazed by the entire project team, and I felt very honored to be the one to introduce the astounding progress we made during an extremely challenging year. Not only that, it was all done while operating the current ESnet5 production network at the same time.
The project execution continued at full speed while some of us started carving out time over the past several months to prepare for the Review. Pulling together all of the information required, synthesizing it into a clear and concise set of briefings and documents, and presenting it to leaders in our field is a monumental task under any circumstances, but the pandemic made this especially difficult. However, the project team, backed by strong support across LBNL (Procurement, Project Management Office, Project Management Advisory Board members, and many others) made everything appear seamless. The impressive level of teamwork did not go unnoticed and was specifically mentioned repeatedly during the Closeout session. I am grateful for, and proud of, all of the members of the team who contributed to this terrific success.
The Review Committee consisted of three Subcommittees (Technical; Cost & Schedule; and Project Management & Environment, Safety & Health), all charged with answering a set of questions to determine if we were on schedule, achieving scope, within budget, and performing all tasks safely. The answer to every charge question: Yes! It was an all-encompassing couple of days, but we really couldn't have asked for a better result. In short, there were no formal recommendations, so we'll be considering how best to implement several of the Review Committee's extremely helpful comments as we proceed onward. Our hard work, not only on the Review itself, paid off!
With the formal Review complete for the year, we’re all back to our daily project plan of execution, while keeping the network “lights on” in the process, of course.
Advancing our strategy and shaping our position on the board. Some thoughts from Inder on the year-that-was.
Dear Friends, Well-wishers, Colleagues, and all of ESnet,
Chess! 2020 has been much more challenging than this game. It's also been a year where we communicated through the squares on our Zoom screens, filled with the faces of our colleagues, collaborators, and loved ones.
In January, Research and Education leaders came together in Hawaii at the Pacific Telecommunications Council meeting to discuss the future of networking across the oceans. It was impossible to imagine then that we would not be able to see each other again for such a long time. Though thanks to those underwater cables, we have been able to communicate seamlessly across the globe.
Looking back at 2020, we not only established a solid midgame position on our ESnet chessboard, but succeeded in ‘winning positions’ despite the profound challenges. The ESnet team successfully moved our network operations to be fully remote (and 24/7) and accomplished several strategic priorities.
ESnet played some really interesting gambits this year:
Tackled COVID-related network growth and teleworking issues for the DOE complex
We saw a 4x spike in remote traffic and worked closely across several Labs to upgrade their connectivity. We continue to address the ever-growing demand in a timely manner.
As we all shifted to teleworking from home, ESnet engineers developed an impromptu guide that proved valuable for troubleshooting our home connectivity issues.
Progressed greatly on implementing our next-generation network, ESnet6
We deployed and transitioned to the ESnet6 optical backbone network: 300 new site installations and hundreds of 100G waves provisioned in just six months of effort, all while following pandemic safety constraints. I am grateful to our partners Infinera (Carahsoft) and Lumen for working with our engineers to make this happen. Check out below how we decommissioned the ESnet5 optical network and lit up the ESnet6 network.
Installed a brand new management network and security infrastructure upgrades along with significant performance improvements.
We awarded the new ESnet6 router RFP (Congratulations Nokia and IMPRES!); the installs start soon.
Issued another RFP for optical transponders, and will announce the winner shortly.
Took initiative on several science collaborations to address current and future networking needs
We brainstormed new approaches with the Rubin Observatory project team, Amlight, and DOE and NSF program managers to meet the performance and security goals for traffic originating in Chile. That traffic moves across several countries in South America before reaching the continental U.S. in Florida (Amlight), and eventually the U.S. Data Facility at SLAC via ESnet.
Drew insights through deep engagement between ESnet engineers and High Energy Physics program physicists on expediently serving the data needs of their current and planned experiments. Due to the pandemic, a two-day immersive in-person meeting turned into a multi-week series of Zoom meetings, breakouts, and discussions.
When an instrument produces tons of data, how do you build the data pipeline reliably? ESnet engineers took on this challenge, and worked closely with the GRETA team to define and develop the networking architecture and data movement design for this instrument. This contributed to a successful CD 2/3 review of the project—a challenging enough milestone during normal times, and particularly tough when done remotely.
Exciting opening positions were created with EMSL, FRIB, DUNE/SURF, LCLS-II…these games are still in progress, more will be shared soon.
Innovated to build a strong technology portfolio with a series of inspired moves
We demonstrated Netpredict, a tool using deep learning models and real-time traffic statistics to predict when and where the network will be congested. Mariam’s web page showcases some of the other exciting investigations in progress.
Richard and his collaborators published Real-time flow classification by applying AI/ML to detailed network telemetry.
High-touch ESnet6 project
Ever dream of having the ability to look at every packet, a “packetscope”, at your fingertips? An ability to create new ways to troubleshoot, performance engineer, and gain application insights? We demonstrated a working prototype of that vision at the SC20 XNET workshop.
We deployed a beta version of software that gives science applications the ability to orchestrate large data flows across administrative domains securely. What started as a small research project five years ago (Thanks ASCR!) is now part of the AutoGOLE initiative, in addition to being used by ExaFEL, an Exascale Computing Project (ECP) effort.
Initiated the Q-Factor project this year, a research collaboration with Amlight, funded by NSF. The project will enable ultra-high-speed data transfer optimization by TCP parameter tuning through the use of programmable dataplane telemetry: https://q-factor.io/
Executed on the vision and design of a nationwide at-scale research testbed, working alongside a superstar multi-university team.
With the new FAB grant, FABRIC went international with plans to put nodes in Bristol, Amsterdam, Tokyo and Geneva. More locations and partners are possibilities for the future.
Created a prototype FPGA-based edge-computing platform for data-intensive science instruments in collaboration with the Computational Research Division and Xilinx. Look for exciting news on the blog as we complete the prototype deployment of this platform.
What are the benefits of widespread deployment of 5G technology on science research? We contributed to the development of this important vision at a DOE workshop. New and exciting pilots are emerging that will change the game on how science is conducted. Stay tuned.
Growth certainly has its challenges. But, as we grew, we evolved from our old game into an adept new playing style. I am thankful for the trust that all of you placed in ESnet leadership, vital for our numerous, parallel successes. Our 2020 reminds me of the scene in Queen’s Gambit where the young Beth Harmon played all the members of a high-school chess team at the same time.
Several achievements could not make it to this blog, but are important pieces on the ESnet chess board. They required immense support from all parts of ESnet, CS Area staff, Lab procurement, Finance, HR, IT, Facilities, and Communications partners.
I am especially grateful to the DOE Office of Science, Advanced Scientific Computing Research leadership, NSF, and our program manager Ben Brown, whose unwavering support has enabled us to adapt and execute swiftly despite blockades.
All this has only been possible due to the creativity, resolve, and resilience of ESnet staff — I am truly proud of each one of you. I am appreciative of the new hires who trusted their careers with us and joined us remotely — without shaking hands or even setting foot in the lab.
My wish is for all to stay safe this holiday season, celebrate your successes, and enjoy that extra time with your immediate family. In 2021, I look forward to killer moves on the ESnet chessboard, while humanity checkmates the virus.
Derek Howard is a software developer from Columbia, MO. Prior to joining ESnet, Derek worked as an HPC system administrator for the University of Missouri. Derek also created Augur (https://github.com/chaoss/augur) which is part of the Linux Foundation’s CHAOSS group (https://chaoss.community/), a working group focused on measuring the health and sustainability of open source software.
Derek is part of the Network Services Automation group under John MacAuley, where he will be working primarily on our internal ESnet Database (ESDB).
Question 1: What brought you to ESnet?
I worked with George Robb at the University of Missouri; he joined ESnet a while ago, and it seemed like a great place to work. I asked him whether there were any positions at ESnet that might be a good fit for me, and he referred me to the position I am in now. I'm really happy I joined; it is as great as I expected!
Question 2: What is the most exciting thing going on in your field right now?
With so much work underway for ESnet6, exciting changes are happening every day. We are pushing to get features out for all of our software as fast as possible. Right now, I am working on a feature in ESDB to make it easier for network engineers to verify that hardware was installed correctly during router installs.
As far as the broader field goes, I am excited about DDR5 memory becoming commercially available soon.
Question 3: What book would you recommend?
Randall Munroe’s “What If?” – It’s a wonderful collection of serious answers to silly questions by the creator of XKCD.
Three years ago, ESnet unveiled its plan to build ESnet6, its next-generation network dedicated to serving the Department of Energy (DOE) national lab complex and overseas collaborators. With a projected early finish in 2023, ESnet6 will feature an entirely new software-driven network design that enhances the ability to rapidly invent, test, and deploy new innovations. The design includes:
State-of-the-art optical, core and service edge equipment deployed on ESnet’s dedicated fiber optic cable backbone
A scalable switching core architecture coupled with a programmable services edge to facilitate high-speed data movement
100–400Gbps optical channels, with up to eight times the potential capacity compared to ESnet5
Services that monitor and measure the network 24/7/365 to ensure it is operating at peak performance, and
Advanced cybersecurity capabilities to protect the network, assist its connected sites, and defend its devices in the event of a cyberattack
Later this month, ESnet staff will present an online update on ESnet6 to the ESnet Site Coordinators Committee (ESCC). Despite the challenges of deploying new equipment at over 300 distinct sites across the country and lighting up approximately 15,000 miles of dark fiber during a pandemic, the team is making great progress, according to ESnet6 Project Director Kate Mace.
“We’ve had some delays, but our first priority is making sure the work is being done safely,” Mace said. “We have a lot of subcontractors and we are working closely with them to make sure they’re safe, they’re following local pandemic rules and they’re getting the access they need for installs.
“The bottom line is that we have a lot of pretty amazing people putting in a lot of hours and hard work to keep the project moving forward,” Mace said.
When completed in 2023, ESnet6 will provide the DOE science community with a dedicated backbone capable of carrying at least 400 Gigabits per second (Gbps), with some spans capable of carrying more than 1 Terabit per second.
The current network, known as ESnet5, comprises a series of interconnected backbone rings, each with 100Gbps or higher bandwidth. ESnet5 operates on a fiber footprint owned by and shared with Internet2. Once the switch is complete, Internet2 will take over ESnet’s share of the fiber spectrum to provide more bandwidth to the U.S. education community.
“We’re almost done with the optical layer, which is a big deal,” Mace said. “It’s been a major procurement of new optical line equipment from Infinera to light up the new optical footprint.”
Mapping the road to ESnet6
Back in 2011, using Recovery Act funds for its Advanced Networking Initiative, ESnet secured the long-term rights to a pair of fibers on a national fiber network that had been built, but not yet used. Because there was a surplus of installed fiber cable at the time, ESnet was able to negotiate advantageous terms for the network.
As part of the ESnet6 project, ESnet and its subcontractors began installing optical equipment along the ESnet fiber footprint starting in November 2019. The optical network consists of seven large fiber rings east to west across the U.S., and smaller “metro” rings in the Chicago and San Francisco Bay areas.
At this point, Infinera has completed the installation of the equipment at all locations. The four large eastern-most rings have passed ESnet’s rigorous testing and verification process ensuring that they are configured and working as designed, and most ESnet services in these areas have been transitioned over to the new optical system.
Infinera has turned over the other three large rings and is working closely with ESnet staff to address a number of minor issues identified during testing.
ESnet and Infinera are collaborating on turning up, testing, and rolling services to the new network in the Chicago and Bay Area rings. The installation in these areas is more complex because it is re-using the ESnet5 fiber going into the DOE Laboratories.
“The ESnet and Infinera teams have worked really well together to overcome all of the typical challenges we expected on a network build of this scale, as well as some unexpected obstacles,” said Joe Metzger, the ESnet6 Implementation Lead.
The expected challenges included installing thousands of perfectly clean (microscopically verified) fiber connections; the unexpected ones included engineers driving for hours to reach a remote, isolated location to install equipment, only to find the access road drifted in with snow or the lock changed.
Most of the unexpected challenges were related to COVID-19.
“It was amazing to see how the facility providers, including the DOE Laboratories, ESnet and Infinera teams worked together to find safe, workable solutions to the COVID-19-related access constraints that we encountered during the installation,” said Metzger.
The team expects the optical system build to be fully accepted and all services transitioned over to it by Oct. 1, completing what they are calling ESnet5.5, the first major step in the transition from ESnet5 to ESnet6.
To get to this point, ESnet's network engineers needed extensive hands-on training on the new Infinera equipment, so a specialized test lab was built at Berkeley Lab. Engineers take a weeklong session there, learning how to configure, operate, and troubleshoot the equipment deployed in the field.
The next major step will be the installation of new routers for the packet layer, which is expected to begin in early 2021, Mace said.
And of course, this is all being carried out while ESnet keeps its production network and services in regular operation and with the undercurrent of stress from the COVID-19 pandemic.
While ESnet staff are known for building an ever-evolving network that’s super fast and super reliable, along with specialized tools to help researchers make effective use of the bandwidth, there is also a side of the organization where things are pushed, tested, broken and rebuilt: ESnet’s testbed.
For example, in conjunction with the rollout of its nationwide 100Gbps backbone network, the staff opened up a 100Gbps testbed in 2009 with Advanced Networking Initiative funding through the American Reinvestment and Recovery Act. This allowed scientists to test their ideas on a separate but equally fast network, so if something crashed, ESnet traffic would continue to flow unimpeded. Six years later, ESnet upped the ante and launched a 400Gbps network — the first science network to hit this speed — to help NERSC move its massive data archive from Oakland to Berkeley Lab.
Eric Pouyoul is the principal investigator for the testbed and the things he’s learned on past projects can be applied to others. His most recent project also pushed the boundaries of what the organization does in supporting DOE science. With funding from the lab’s Nuclear Physics Division, Pouyoul developed a pair of uniquely specialized data processing systems for the GRETA experiment, short for Gamma Ray Energy Tracking Array. The gamma ray detector will be installed at DOE’s Facility for Rare Isotope Beams (FRIB) located at Michigan State University in East Lansing.
When an early version of GRETA goes online at the end of 2023, it will house an array of 120 detectors that will produce up to 480,000 messages per second—totaling 4 gigabytes of data per second—and send them through a computing cluster for analysis. Not only did Pouyoul write the software for the first stage, which will reduce the amount of data by an order of magnitude in real time; he also designed the physics simulation software that generates realistic data to test the system.
For the second data handling phase of GRETA, called the Global Event Builder, he wrote the software that will take all of the data from the first phase and, using the timestamps, aggregate them in order, as well as sort them by event. This data will then be stored for future analysis.
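A minimal sketch of that event-building idea: merge the time-ordered per-detector streams into one global order, then group hits that fall within a coincidence window into a single event. The timestamps, detector IDs, and 20 ns window here are all invented for illustration; the real Global Event Builder operates on GRETA's actual message formats and rates.

```python
import heapq

# Hypothetical per-detector message streams: (timestamp_ns, detector_id, payload).
# Each stream is already time-ordered, as the first-stage output would be.
stream_a = [(100, 1, "hitA1"), (250, 1, "hitA2")]
stream_b = [(105, 2, "hitB1"), (400, 2, "hitB2")]

def build_events(streams, window_ns=20):
    """Merge time-ordered streams and group hits whose timestamps fall
    within `window_ns` of the first hit into one event."""
    merged = heapq.merge(*streams)  # global timestamp order, streamed
    events, current, t0 = [], [], None
    for ts, det, payload in merged:
        if t0 is None or ts - t0 > window_ns:
            # Hit is outside the current window: close the event, open a new one.
            if current:
                events.append(current)
            current, t0 = [], ts
        current.append((ts, det, payload))
    if current:
        events.append(current)
    return events

events = build_events([stream_a, stream_b])
print(len(events))  # 3 events: {100, 105}, {250}, {400}
```

`heapq.merge` keeps the aggregation streaming rather than sorting everything in memory, which matters when the input arrives at hundreds of thousands of messages per second.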
Even though he designed and built the systems to simulate the behavior of the nuclear physics that will occur inside the detector, “don’t expect me to understand it,” Pouyoul said. “I never did anything like this before.”
GRETA is the first of its kind in that it will track the positions of the scattering paths of the gamma rays using an algorithm specifically developed for the project. This capability will help scientists understand the structure of nuclei, which is not only important for understanding the synthesis of heavy elements in stellar environments, but also for applied-science topics in nuclear energy, nuclear forensics, and stockpile stewardship.
“This has been my most exciting project and it only could have happened here,” he said. “I think it takes me back to the origins of the Lab when scientists and engineers worked together to create new physics. We know it will work, but we don’t even know how the results will turn out, we don’t know what will be discovered.”
Before joining ESnet at Berkeley Lab 11 years ago, he had worked in the private sector. At one point in his career, he wrote code for control systems for nuclear power plants. Looking back, he estimates that maybe three lines of his code made it into the final library. He’s quick to point out that he doesn’t consider himself a software engineer, nor does he think of himself as a network engineer. At ESnet, those engineers are responsible for designing and deploying robust systems that keep the data moving in support of DOE’s research missions.
“I really like to work with prototypes, one-time projects like in the testbed,” he said. “I know how to build stuff.”
He developed that skill as a high school student in Paris, where he preferred to roam the sidewalks looking for discarded electronics he could take home, repair, and sell. He did manage to attend classes often enough to pass his exams and graduate. That was the only diploma he's ever received.
Since then, he’s learned by working on things, not sitting in lecture halls. Some of it he picked up working for a supercomputing startup company. He learned how to tune networks for maximum performance by tweaking data transfer nodes, the equipment that takes in data from experiments, observations, or computations and speeds them on their way to end-users.
He sees the GRETA project as a pilot, and it's already drawing interest from other researchers. The idea is that if ESnet can work with scientists from the start, it will be more efficient and effective than trying to tack on the networking components afterward. Pouyoul is looking forward to the next one.
“I’m really not specialized, but I do understand different aspects of projects,” he said. “I only have fun when I’m not in my comfort zone — and I had a lot of fun working on GRETA.”