High-speed intelligent Research and Educational Networks (RENs), such as the one we’re building as part of the ESnet6 program, will require a greater ability to understand and manage traffic flows. One research program underway to provide this capability is the High Touch effort, a programmable, scalable, and expressive hardware and software solution that produces and analyzes per-packet telemetry information with nanosecond-accurate timing. Along with Zhang Liu, Bruce Mah, Yatish Kumar, and Chin Guok, I have just released a presentation for the Proceedings of the 2021 Virtual Meeting on Systems and Network Telemetry and Analytics, describing work underway to create a programmable, very high-speed packet monitoring and telemetry capability as part of bringing High Touch to life.
Two graduate students working with ESnet have published their papers recently in IEEE and ACM workshops.
Bibek Shrestha, a graduate student at the University of Nevada, Reno, and his advisor Engin Arslan worked with Richard Cziva from ESnet to publish a paper on “INT Based Network-Aware Task Scheduling for Edge Computing”. In the paper, Bibek investigated the use of in-band network telemetry (INT) for real-time in-network task scheduling. Bibek’s experimental analysis using various workload types and network congestion scenarios revealed that enhancing task scheduling of edge computing with high-precision network telemetry can lead to a reduction of up to 40% in data transfer times and up to 30% in total task execution times by favoring edge servers in uncongested (or mildly congested) sections of the network when scheduling tasks. The paper will appear in the 3rd Workshop on Parallel AI and Systems for the Edge (PAISE), held in conjunction with the IEEE IPDPS 2021 conference on May 21st, 2021, in Portland, Oregon.
Zhang Liu, a former ESnet intern and a current graduate student at the University of Colorado at Boulder, worked with the ESnet High Touch Team – Chin Guok, Bruce Mah, Yatish Kumar, and Richard Cziva – on fastcapa-ng, ESnet’s telemetry processing software. In the paper “Programmable Per-Packet Network Telemetry: From Wire to Kafka at Scale,” Zhang showed the scaling and performance characteristics of fastcapa-ng, and highlighted the most critical performance considerations that allow it to push 10.4 million telemetry packets per second to Kafka with only 5 CPU cores, which is more than enough to handle 170 Gbit/s of original traffic with a 1512B MTU. This paper will appear in the 4th International Workshop on Systems and Network Telemetry and Analytics (SNTA 2021), held at the ACM HPDC 2021 conference in Stockholm, Sweden, on 21-25 June 2021.
Congratulations Bibek and Zhang!
If you are a networked systems research student looking to collaborate with us on network measurements, please reach out to Richard Cziva. If you are interested in a summer internship with ESnet, please visit this page.
In March 2020, the U.S. Government Office of Management and Budget (OMB) released a draft memo outlining a required migration to IPv6 only. Memorandum M-21-07 was made official on November 19, 2020. Among other things, this memo mandates that 80% of IP-enabled assets on Federal networks are operating in IPv6-only environments by the end of FY 2025.
ESnet is in the process of planning this transition now, to ensure that we provide our users with the support and resources they need to continue their work uninterrupted and unimpeded by the transition. Practically speaking, for ESnet this means that by 2025 all of our nodes will be transitioned to IPv6 address space, and we will not support dual-stacking with IPv4 and IPv6 addresses.
Transitioning to an IPv6-only network has been over a quarter-century in the making for ESnet. Here’s a look back at our history with IPv6.
IPv6: Past and Present
ESnet’s history of helping to develop, support, and operationalize new protocols begins well before the advent of IPv6.
In the early 1990s, Cathy Aronson, an employee of Lawrence Livermore National Laboratory working on ESnet, helped establish a production implementation and support plan for the Open Systems Interconnect (OSI) Connectionless-mode Network Service (CLNS) suite of network protocols. Crucially, Aronson developed a scalable network addressing plan that provided a model for the utilization of the kinds of massive address spaces that OSI CLNS and, later, IPv6 would come to use. CLNS itself was a logical progression from DECnet which had been embraced and supported by ESnet’s precursors (MFEnet and HEPnet).
As the IPv6 draft standard (RFC2460) developed in the 1990s, ESnet staff created an operational support model for the new protocol. The stakes were high; if IPv6 were to succeed in supplanting IPv4, and prevent the ill effects of IPv4 address exhaustion, it would need a smooth roll-out. Bob Fink, Tony Hain, and Becca Nitzan spearheaded early IPv6 adoption processes, and their efforts reached far beyond ESnet and the Department of Energy (DOE). The trio were instrumental in establishing a set of operational practices and testbeds under the auspices of the Internet Engineering Task Force–the body where IPv6 was standardized–and this led to the development of a worldwide collaboration known as the 6bone. 6bone was a set of tunnels that allowed IPv6 “islands” to be connected, forming a global overlay network. More importantly, it was a collaboration that brought together commercial and research networks, vendors, and scientists, all with the goal of creating a robust internet protocol for the future.
Not only were Fink, Hain, and Nitzan critical in this development of what would become a production IPv6 network (their names appear on a number of IETF RFCs), they would also spearhead the adoption of the protocol within ESnet and DOE. In the summer of 1996, ESnet was officially connected to the 6bone; by 1999, the Regional Internet Registries had received their production allocations of IPv6 address space. Just one month later, the first US allocation of that space was made–to ESnet. ESnet has the distinction of being the first IPv6 allocation from ARIN – assigned on August 3, 1999, with the prefix 2001:0400::/32.
Nitzan continued her pioneering work, establishing native IPv6 support on ESnet, and placing what we believe was the first workstation on a production IPv6 network. This was part of becoming the first production network in North America to adopt IPv6 in tandem with IPv4 via the use of an IPv6 “dual-stack.” As US Government requirements and mandates developed in 2005, 2012, and 2014, the ESnet team met these requirements for increased IPv6 adoption, while also providing support and consultation for the DOE community.
Although Aronson, Fink, Hain, and Nitzan have all moved on from ESnet, a new generation of staff has continued the spirit of innovation and early adoption. In the early 2010s, ESnet’s internal routing protocols were consolidated around the use of multi-topology Intermediate System to Intermediate System (IS-IS). This allowed for the deployment of flexible and disparate IPv4 and IPv6 topologies, paving the way for the creation of IPv6-only portions of ESnet and allowing the use of optimized routing protocols for the entire network. ESnet’s acquisition strategy has long emphasized IPv6 support and feature parity between IPv4 and IPv6.
For our customers and those connected to us, here’s what this means:
ESnet will be ready, willing, and able to support connectors, constituents, and partners in their journey to deploying IPv6-only across our international network.
ESnet planning and architecture team members have been included in the Department of Energy Integration and Product Team (DOE IPT) for migration to IPv6-only, and are supporting planning and documentation efforts for the DOE Complex.
We look forward to supporting our customers and users, as we all make this change to IPv6 together.
Zeek is a powerful open source network security monitoring tool used extensively by ESnet. Zeek (formerly called Bro) was initially developed by researchers at Berkeley Lab; it allows users to identify and manage cyber threats by tracking and logging network traffic activity. Zeek operates as a passive monitor, providing a holistic view of what is transpiring on the network and visibility into all network traffic.
In a previous post, I presented some of our efforts in approaching WAN security using Zeek for general network monitoring, along with the successes and challenges found during the process. In this blog post I’ll focus on our efforts in using Zeek as part of security monitoring for the ESnet6 management network – ZoMbis (Zeek on Management based information system).
ZoMbis on the ESnet6 management network:
Most research and educational networks employ a dedicated management network as a best practice. The management network provides a configuration command and control layer, as well as conduits for all of the inter-routing communications between the devices used to move our critical customer data. Because of the sensitive nature of these communications, the management network needs to be protected from external and general user network traffic (websites, file transfers, etc.), and our staff needs to have detailed visibility on management network activity.
At ESnet, we typically use real IP addresses for all internal network resources, and our management network is allocated a fairly large address space block advertised in our global routing table, to help protect against opportunistic hijacking attacks. By isolating our management network from user data streams, the amount of routine background noise is vastly reduced, making the use of Zeek, or any network security monitoring capability, much more effective.
The above diagram shows an overview of the deployment strategy of Zeek on the ESnet6 management network. The blue dots in the diagram show the locations that will have equipment running Zeek instances to monitor traffic on the management network. The traffic from the routers at those locations is mirrored to the Zeek instances using a spanning port, and the Zeek logs generated are then aggregated in our central security information and event management (SIEM) system.
ESnet6’s new management network will use only IPv6. From a monitoring perspective, this change from traditional IPv4 poses a number of interesting challenges; in particular, IPv6 employs more multicast and link-local traffic for local subnet communications. Accordingly, we are in the process of adjusting and adding to Zeek’s policy-based detection scripts to support these changes in network patterns. These enhancements and custom scripts, written by our cybersecurity team to support IPv6, will be of interest to other Zeek users, and we will release them to the entire Zeek community soon.
The set of Zeek policies created for this project can be broken out into two general groups. The first of these is protocol mechanics – particularly looking closer at the boundary between layers 2 and 3, where there are a number of interesting security behaviors with IPv6. A subset of the notices that these protocol mechanics policies will provide are:
ICMP6_RtrSol_NotMulticat – Router solicitation not multicast
ICMP6_RtrAnn_NotMulticat – Router announcement should be a multicast request
ICMP6_RtrAnn_NewMac – Router announcement from an unknown MAC
ICMP6_MacIPChange – If the MAC <-> IP mapping changes
ICMP6_NbrAdv_NotRouter – Advertisement comes from non-router
ICMP6_NbrAdv_UnSolicit – Advertisement is not solicited
ICMP6_NbrAdv_OverRide – Advertisement without override
ICMP6_NbrAdv_NoRequest – Advertisement without known request
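To make the flavor of these checks concrete, here is a minimal Python sketch of the idea behind a notice such as ICMP6_RtrSol_NotMulticat. It is not the Zeek policy itself: the function name is hypothetical, and the only facts it relies on are the ICMPv6 message type numbers from RFC 4861 and Python's standard ipaddress module.

```python
import ipaddress

# ICMPv6 message types defined in RFC 4861 (Neighbor Discovery).
ROUTER_SOLICITATION = 133
ROUTER_ADVERTISEMENT = 134


def rtr_sol_not_multicast(icmp6_type: int, dst: str) -> bool:
    """Return True when a router solicitation targets a non-multicast address.

    Hypothetical helper: it mirrors the intent of the
    ICMP6_RtrSol_NotMulticat notice, not its actual Zeek implementation.
    """
    if icmp6_type != ROUTER_SOLICITATION:
        return False
    return not ipaddress.IPv6Address(dst).is_multicast


# ff02::2 is the all-routers link-local multicast group, so it is not flagged.
print(rtr_sol_not_multicast(ROUTER_SOLICITATION, "ff02::2"))
# A solicitation sent to a unicast link-local address is flagged.
print(rtr_sol_not_multicast(ROUTER_SOLICITATION, "fe80::1"))
```

The other notices in the list would need state (for example, a table of known router MAC addresses), which is where a stateful monitor such as Zeek earns its keep.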
The second set of Zeek policies developed in support of ZoMbis takes advantage of predictable management network behavioral patterns – we build policy to model anticipated behaviors and let us know if something is amiss. For example, by looking at DNS and NTP behavior we can identify unexpected hosts and data volumes, since we know which systems are supposed to be communicating with one another, and what patterns traffic between these components should follow.
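The behavioral approach can be sketched in a few lines of Python. Everything in this example is an assumption made for illustration: the addresses come from the IPv6 documentation prefix, and the allowed pairs, byte ceilings, and function name are hypothetical rather than taken from the ZoMbis policies.

```python
from typing import List, NamedTuple


class Flow(NamedTuple):
    src: str
    dst: str
    service: str
    bytes_sent: int


# Hypothetical model of expected behavior: which (client, server, service)
# combinations are allowed, and a per-flow byte ceiling for each service.
EXPECTED_PAIRS = {
    ("2001:db8::10", "2001:db8::53", "dns"),
    ("2001:db8::10", "2001:db8::123", "ntp"),
}
MAX_BYTES = {"dns": 4096, "ntp": 512}


def check_flow(flow: Flow) -> List[str]:
    """Return the reasons, if any, that a flow departs from the model."""
    problems = []
    if (flow.src, flow.dst, flow.service) not in EXPECTED_PAIRS:
        problems.append("unexpected host pair")
    if flow.bytes_sent > MAX_BYTES.get(flow.service, 0):
        problems.append("unexpected data volume")
    return problems


print(check_flow(Flow("2001:db8::10", "2001:db8::53", "dns", 200)))
print(check_flow(Flow("2001:db8::99", "2001:db8::53", "dns", 10000)))
```

Because the management network is isolated from user traffic, a model this small can stay accurate; the same allowlist on a general-purpose network would drown in exceptions.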
Stay tuned for Part II of this blog post, where I will discuss ways of using sinkholing, together with ZoMbis, to provide better understanding and visibility of unwanted traffic on the management network.
As a Network Engineer at ESnet, I am no stranger to the importance of designing and maintaining a robust fiber-optic network. To operate a network that will “enable and accelerate scientific discovery by delivering unparalleled network infrastructure, capabilities, and tools,” ESnet has acquired an impressive US continental footprint of more than 21,000 kilometers of leased fiber-optic cable. We spend a great deal of effort designing and sourcing redundant fiber-optic paths to support network data connectivity between scores of DOE Office of Science facilities and research collaborators across the country.
But network data transfer is only one of the uses for fiber-optic cable. What about using buried fiber-optic cable for some truly “ground-shaking” science? The answer is “Yes, absolutely!” – and I was fortunate to play a part in exploring new uses for fiber-optic cable networks this past year.
Back in 2017, the majority of our 21,000 km fiber footprint was still considered “dark fiber,” meaning it was not yet in use. At that time, ESnet was actively working on the design to upgrade from our current production network “ESnet5” to our next-generation network “ESnet6,” but we hadn’t yet put our fiber into production.
At the same time, Dr. Jonathan Ajo-Franklin, then-graduate students Nate Lindsey and Shan Dou, and the Berkeley Lab’s Earth and Environmental Science Area (EESA) were exploring the use of distributed acoustic sensing (DAS) technology to detect seismic waves by using laser pulses across buried fiber-optic cable. The timing was perfect to try to expand on the short-range tests that Dr. Ajo-Franklin and his team had been performing at the University of California’s Richmond Field Station by using a section of the unused ESnet dark fiber footprint in the West Sacramento area for more extensive testing. ESnet’s own Chris Tracy worked with Dr. Ajo-Franklin and team to demonstrate how the underground fiber-optic cables running from West Sacramento northwest toward Woodland in California’s Central Valley made an excellent sensor platform for early earthquake detection, monitoring groundwater, and mapping new sources of potential geothermal energy.
Fast forward to May 2019, and Dr. Ajo-Franklin was heading up a new collaborative scientific research project for the DOE’s Geothermal Technology Office based on his prior DAS experimentation successes using ESnet fiber. The intent was to map potential geothermal energy locations in the California Imperial Valley south of the Salton Sea, near Calipatria and El Centro. The team, including scientists in EESA, Lawrence Livermore National Laboratory (LLNL), and Rice University needed a fiber path to conduct the experiment. It would make sense to assume that ESnet’s fiber footprint, which runs through that area, would be an excellent candidate for this experiment. Fortunately for ESnet’s other users, but unfortunately for the DAS team, by 2018 the ESnet6 team was already “lighting” this previously dark fiber.
However, just because ESnet fiber in the Imperial Valley was no longer a candidate for DAS-based experiments, that didn’t mean there weren’t ways to gain access to unused dark fiber. For every piece of fiber that has been put into production to support ESnet6, there are dozens if not hundreds of other fibers running right alongside it. When fiber-optic providers install new fiber paths, they pull large cables consisting of many individual fibers to lease or sell to as many customers as possible. Because the ESnet fiber footprint was running right through the Imperial Valley, we knew that there was likely unused fiber in the ground, and only had to find a provider that would be willing to lease a small section to Berkeley Lab for Dr. Ajo-Franklin’s experiment.
Making the search a little more complicated, the DAS equipment utilized for this experiment has an effective sensing range that is limited to less than 30 kilometers. Most fiber providers expect to lease long sections of fiber connecting metropolitan areas. For example, the fiber circuits that run through the Imperial Valley are actually intended to connect metropolitan areas of Arizona to large cities in Southern California. Finding a provider that would be willing to break up a continuous 600 km circuit connecting Phoenix to Los Angeles just to sell a 30 km piece for a year-long research project would be a difficult task.
One of my contributions to the ESnet6 project was sourcing new dark fiber circuits and data center colocation spaces to “fill out” our existing footprint and get ready for our optical system deployments. Because of those efforts, I knew that there were often entire sections of fiber that had been damaged across the country and would likely not be repaired until there was a new customer that wanted to lease the fiber. I was asked to assist Dr. Ajo-Franklin and his team to engineer a new fiber solution for the experiment. I just had to find someone willing to lease us one of these small damaged sections.
After I spoke with many providers in the area, the communications company Zayo was able to find a section of fiber starting in Calipatria, heading south through El Centro and then west to Plaster City, that was a great candidate for DAS use. This section of fiber had been accidentally cut near Plaster City and was considered unusable for networking purposes. Working with Zayo, we were able to negotiate a lease on this “broken” fiber span along with a small amount of rack space and power to house the DAS equipment that Dr. Ajo-Franklin’s team would need to move forward with their research.
This cut fiber segment was successfully “turned up” for the project on November 10, 2020 by a team including Co-PI Veronica Rodriguez Tribaldos, Michelle Robertson, and Todd Wood (EESA/LBNL), and seismic data collection equipment is now up and running. The figure above (D) shows some great initial data recorded on the array, a small earthquake many miles to the north. There will be many more articles and reports from the Imperial Valley Dark Fiber Team as they continue to gather data and perform their experiments, and I’m sure we’ll begin to see fiber across the country put to use for this type of sensing and research.
I’ve had a great experience working with the different groups that were assembled for this project. By seeing how new technologies and methods are being developed to use fiber-optic cable for important research outside of computing science, I’ve developed a greater appreciation for how our labs and universities are tackling some of our biggest energy and public safety challenges.
ESnet has recently completed an experiment testing high-performance, file-based data transfers using Data Transfer Nodes (DTNs) on the 100G ESnet Testbed. Within ESnet, new ways to provide optimized, on-demand data movement tools to our network users are being prototyped. One such potential new data movement tool is offered by Zettar, Inc. Zettar’s “zx” product integrates with several storage technologies with an API for automation. This ESnet data movement experiment allowed us to test the use of tools like zx on our network.
Two 100Gbps-capable DTNs were deployed on the ESnet Testbed for this work, each with 8 x NVMe SSDs for fast disk-to-disk transfers, and connected using an approximately 90ms round trip time network path. As many readers are aware, this combination of fast storage and fast networking requires careful tuning from both a file I/O and network protocol standpoint to achieve expected end-to-end transfer rates, and this evaluation was no exception. With the help of a storage throughput baseline achieved using the freely available elbencho tool, a single tuning profile for zx was found that struck an impressive performance balance when moving a sweep of hyperscale data sets (>1TB total size or >1M total files or both, see figure below) between the testbed DTNs.
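To see why a 100Gbps path with a 90ms round trip time needs careful protocol tuning, it helps to work out the bandwidth-delay product, the amount of data that must be in flight to keep the path full. The short Python sketch below uses only the two figures quoted above; the function name is illustrative, and the result is a back-of-the-envelope estimate rather than the tuning profile used in the evaluation.

```python
def bdp_bytes(rate_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return rate_bps * rtt_s / 8


# 100 Gbps link rate and an approximately 90 ms round trip time.
bdp = bdp_bytes(100e9, 0.090)
print(f"{bdp / 1e9:.3f} GB in flight")
```

That works out to roughly 1.1 GB that the sending side must be able to keep outstanding, spread across however many TCP streams are in use, before storage speed even enters the picture.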
To keep things interesting, the DTN software under evaluation was configured and launched within Docker containers to understand any performance and management impacts, and to establish a potential use case for more broadly deploying DTNs as-a-Service using containerization approaches. Spoiler: the testing was a great success! When configured appropriately, our evaluation has shown that modern container namespaces using performance-oriented Linux networking impart little to no impact on achievable storage and network performance at the 100Gbps scale while enabling a great deal of potential for distributed deployment of DTNs. More critically, the problem of service orchestration and automation becomes the next great challenge when considering any large-scale deployment of dynamic data movement endpoints.
When properly provisioned and configured, a containerized environment has a high potential to provide an optimized, on-demand data movement service.
Data movers such as zx demonstrate that when modern TCP is used efficiently to move data at scale and speed, network latency becomes less of a factor – the same level of data rates is attainable over LAN, Metro, and WAN as long as packet loss rates can be effectively kept low.
Finally, creating a holistic data movement solution demands integrated consideration of storage, computing, networking, and highly concurrent and intrinsically scale-out data mover software that incorporates a proper understanding of the variety in data movement scenarios.
For more information, a project report detailing the testing environment, performance comparisons, and best practices may be found here.
ESnet’s first 40 Gb/s public data transfer node (DTN) has been deployed and is now available for community testing. This new DTN is the first of a new generation of publicly available networking test units, provided by ESnet to the global research and engineering network community as part of promoting high-speed scientific data mobility. This 40G DTN will provide four times the speed of previous-generation DTN test units, as well as the opportunity to test a variety of network transfer tools and calibrated data sets.
The 40G DTN server, located at ESnet’s El Paso location, is based on an updated reference implementation of our Science DMZ architecture. This new DTN (and others that will soon follow in other locations) will allow our collaborators throughout the global research and engineering network community to test high speed, large, demanding data transfers as part of improving their own network performance. The deployment provides a resource enabling the global science community to reach levels of data networking performance first demonstrated in 2017 as part of the ESnet Petascale DTN project.
The El Paso 40G DTN has Globus installed for GridFTP and parallel file transfer testing. Additional data transfer applications may be installed in the future. To facilitate users’ evaluation of their own network capabilities, ESnet Data Mobility Exhibition (DME) test data sets will be loaded on this new 40G DTN shortly.
All ESnet DTN public servers can be found at https://app.globus.org/file-manager. ESnet will continue to support existing 10G DTNs located at Sunnyvale, Starlight, New York, and CERN.
The full 40G DTN reference architecture and more information on the design of these new DTNs can be found here:
A second 40G DTN will be available in the next few weeks, and will be deployed in Boston. It will feature Google’s bottleneck bandwidth and round-trip propagation time (BBR2) software, allowing improved round-trip-time measurement and the ability for users to explore BBR2 enhancements to standard TCP congestion control algorithms.
In an upcoming blog post, I will describe the Boston/BBR2-enabled 40G DTN and perfSONAR servers. In the meantime, ESnet and the deployment team hope that the new El Paso DTN will be of great use to the global research community!
Scientific discovery increasingly relies on the ability to perform large data transfers across networks operated by many different providers (including ESnet) around the globe. But what happens when a researcher initiates one of these large data transfers and data movement is slow? What does “slow” even mean? These can be surprisingly complex questions and it is important to have the right tools to help answer them. perfSONAR is an open source software tool designed to measure network performance and pinpoint issues that occur as data travels across many different networks on the way to a destination.
perfSONAR has been around for more than 15 years and is primarily maintained today by a collaboration of ESnet, GEANT, Indiana University, Internet2, and the University of Michigan. perfSONAR has an active community that extends well beyond the five core organizations that maintain the software, with more than 2000 public deployments that span six continents and hundreds of organizations. perfSONAR deployments are capable of scheduling and running tests that calculate metrics including (but not limited to) how fast a transfer can be performed (throughput), whether a unit of information makes it to a desired destination (packet loss), how long it took if so (latency), and what path it took to get there (traceroute). What is novel about perfSONAR is not just these metrics, but the set of tools to feature these metrics in dashboards built by multiple collaborating organizations. These dashboards aim to clearly identify patterns that signify potential issues and provide the means to drill down into graphs that give more information.
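For readers new to these metrics, the Python sketch below shows how packet loss, latency, and throughput fall out of raw probe results. It is an illustration only and does not use any perfSONAR API: the function names and the convention of marking a lost probe with None are assumptions.

```python
from statistics import mean
from typing import List, Optional


def summarize_probes(delays_ms: List[Optional[float]]) -> dict:
    """Summarize probe results; None marks a probe that never arrived."""
    received = [d for d in delays_ms if d is not None]
    return {
        "packet_loss": 1 - len(received) / len(delays_ms),
        "mean_latency_ms": mean(received) if received else None,
    }


def throughput_gbps(bytes_moved: int, seconds: float) -> float:
    """Average transfer rate in gigabits per second."""
    return bytes_moved * 8 / seconds / 1e9


print(summarize_probes([10.0, None, 30.0, 20.0]))
print(throughput_gbps(1_250_000_000, 1.0))
```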
While perfSONAR has had great success in providing the current set of capabilities, there is more that can be done. For example, perfSONAR is very good at correlating metrics it collects with the other perfSONAR metrics with at least one similar endpoint. But what if we want to correlate the metrics by location, intermediate network or with non-perfSONAR collected statistics like flow statistics and interface counters? These are all key questions the perfSONAR project is looking to answer.
Building upon a strong foundation
perfSONAR has the ability to add analytics from other software tools using a plug-in framework. Recently, we have begun to use Elasticsearch via this framework to ingest log data and enable improved search and analytics on perfSONAR data.
For example, traditionally perfSONAR has viewed an individual measurement as something between a pair of IP addresses. But what do these IP addresses represent and where are they located? Using the off-the-shelf tools Elasticsearch and Logstash in combination, perfSONAR is able to answer questions like “What geographic areas are showing the most packet loss?”
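The enrichment idea can be shown without any of the real tooling. In the Python sketch below, a small dictionary stands in for the geographic lookup that the ingest pipeline would perform; the addresses are documentation ranges and the table, field names, and function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for the geographic enrichment step.
GEO = {
    "198.51.100.7": "Europe",
    "203.0.113.9": "Asia",
}


def loss_by_region(results):
    """Aggregate packet loss per region from per-measurement counters."""
    sent = defaultdict(int)
    lost = defaultdict(int)
    for r in results:
        region = GEO.get(r["dst"], "unknown")
        sent[region] += r["sent"]
        lost[region] += r["lost"]
    # Skip regions with no packets sent to avoid dividing by zero.
    return {region: lost[region] / sent[region] for region in sent if sent[region]}


results = [
    {"dst": "198.51.100.7", "sent": 100, "lost": 5},
    {"dst": "198.51.100.7", "sent": 100, "lost": 15},
    {"dst": "203.0.113.9", "sent": 200, "lost": 0},
]
print(loss_by_region(results))
```

Once each record carries a region, "which areas show the most loss" becomes an ordinary group-by rather than a question about individual address pairs.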
Additionally, we can apply this same principle to traceroute (and similar tools) that yield a list of IP addresses giving an idea of the path a measurement takes between source and destination. Each IP address is a key to more information about the path, including not only geographic information but also the organization at each point. This means you can ask questions such as “What is the throughput of all results that transit a given organization?” Previously, a user would not only have to know the exact IP address, but that address would also have to be the first (source) or last (destination) address in the path.
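The same trick applied to paths might look like the following Python sketch. The hop-to-organization table, the organization names, and the record layout are invented for illustration; in practice this mapping would come from the enrichment step rather than a hard-coded dictionary.

```python
from statistics import mean

# Hypothetical mapping from traceroute hop address to operating organization.
ORG = {
    "192.0.2.1": "Site A",
    "198.51.100.1": "Transit Net X",
    "203.0.113.1": "Site B",
}


def throughput_via(results, org):
    """Mean throughput of results whose path crosses the given organization."""
    matches = [
        r["gbps"]
        for r in results
        if any(ORG.get(hop) == org for hop in r["path"])
    ]
    return mean(matches) if matches else None


results = [
    {"gbps": 8.0, "path": ["192.0.2.1", "198.51.100.1", "203.0.113.1"]},
    {"gbps": 6.0, "path": ["198.51.100.1", "203.0.113.1"]},
    {"gbps": 9.5, "path": ["192.0.2.1", "203.0.113.1"]},
]
print(throughput_via(results, "Transit Net X"))
```

Note that the organization matches anywhere in the path, not only at the first or last hop, which is exactly the limitation described above.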
Integration with non-perfSONAR data is another area the project is looking to expand. By putting perfSONAR data in a well-established data store like Elasticsearch, the door is open to leverage other off-the-shelf open source tools like Grafana to display results. What’s interesting about this platform is not only its ability to build new visualizations, but also the diverse set of backends it is capable of querying. If data such as host metrics, network interface counters and flow statistics are kept in any of the supported data stores, then there is a means to present this information alongside perfSONAR data.
These efforts are very much still in their early stages of development, but initial indicators are promising. Leveraging the perfSONAR architecture in conjunction with the wealth of off-the-shelf open source tools available today creates opportunities to gain new insights from the network, like those described above, that were not previously possible with the traditional perfSONAR tools.
Getting involved and learning more
The perfSONAR project will continue to provide updates as this work progresses. You can also see the perfSONAR web site for updates and more information on keeping in touch through our mailing lists. The perfSONAR project looks forward to working with the community to provide exciting new network measurement capabilities.
In my previous post, we discussed use of the open-source Zeek software to support network security monitoring at ESnet. In this post, I’ll talk a little about work underway to improve Zeek’s ability to support network traffic monitoring when faced with stream asymmetry.
This comes from recent work by two of my colleagues on the ESnet Security team.
Some of the significant findings and results from this presentation are highlighted below:
Phase I: Initial Zeek Node Design Considerations
Select locations that provide an interesting network vantage point – in the case of our ESnet network, we deployed Zeek nodes on our commodity internet peerings (eqx-sj, eqx-chi, eqx-ash) since they represent the interface to the vast majority of hostile traffic.
Identifying easy traffic to test with and using spanning ports to forward traffic destined to the stub network on each of the routers used for collection.
Phase I: Initial Lessons learned from testing and results
Some misconfigurations were found in the ACL prefix lists.
We increased visibility into our WAN side traffic through implementation of new background methods.
Establishing a new process for end-to-end testing, installing and verifying Zeek system reporting.
Phase II: Prove there is more useful data to be seen
For phase II we moved from statistical-sampling-based techniques towards collection of full peer connection records. We started running Zeek on traffic crossing the interfaces that connect ESnet network peers to the internet, from the AS (autonomous system) responsible for most notices.
To get high fidelity connection information without being crushed by data volume, define a subset of packets that are interesting – zero length control packets (Syn/Syn-Ack/Fin/Rst) from peerings.
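The packet selection described above can be expressed as a simple predicate. The Python sketch below uses the standard TCP flag bit values; the function name is hypothetical, and in a real deployment this selection would be done in the router or capture filter rather than in Python.

```python
# Standard TCP flag bits from the TCP header.
FIN, SYN, RST, PSH, ACK = 0x01, 0x02, 0x04, 0x08, 0x10


def is_control_packet(tcp_flags: int, payload_len: int) -> bool:
    """True for zero-length packets carrying SYN, FIN, or RST.

    SYN-ACK is covered because it has the SYN bit set.
    """
    return payload_len == 0 and bool(tcp_flags & (SYN | FIN | RST))


print(is_control_packet(SYN | ACK, 0))      # SYN-ACK: kept
print(is_control_packet(ACK, 0))            # bare ACK: dropped
print(is_control_packet(PSH | ACK, 1448))   # data segment: dropped
```

Keeping only these packets preserves the start and end of every connection, which is what a connection record needs, while discarding the bulk data that makes up nearly all of the volume.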
Phase II: Results
We discovered a lot of interesting activity, such as information leakage in syslogs and logins (and attempted logins) using poorly secured authentication protocols. Analysis of the volume of asymmetric traffic patterns also gave us valuable insight into the asymmetric traffic problem.
Ongoing Phase III: Expanding the reach of traffic collection on WAN
We are currently in the process of deploying Zeek nodes at another three WAN locations for monitoring commodity internet peering – PNWG (peering at Seattle, WA), AMS-IX (peering at Amsterdam), and LOND (peering at London).
As our use of Zeek on the WAN side of ESnet continues to grow, the next phase to the ZoW pilot is currently being defined. We’re working to incorporate these lessons learned on how to handle traffic asymmetry into these next phases of effort.
Some (not all) solutions being taken into consideration include:
Aggregating traffic streams at a central location to make sense out of the asymmetric packet streams and then run Zeek on the aggregated traffic, or
Running Zeek on the individual asymmetric streams and then aggregating the resulting Zeek streams by 5-tuple, which aggregates connection metadata rather than the connection stream itself.
We are currently exploring these WAN solutions as part of providing better solutions to both ESnet and connected sites.
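The second option, aggregating metadata by 5-tuple, can be sketched as follows. This Python example is one possible reading of that idea and not a description of the ESnet design: the record layout and function names are assumptions, and the sensor names are borrowed from the peering locations mentioned earlier. The key step is normalizing the 5-tuple so that both directions of a connection map to the same key.

```python
from collections import defaultdict


def canonical_key(src, sport, dst, dport, proto):
    """Order the two endpoints so both directions yield the same key."""
    a, b = (src, sport), (dst, dport)
    return (proto,) + (a + b if a <= b else b + a)


def merge_half_connections(records):
    """Combine one-directional records seen at different sensors."""
    merged = defaultdict(lambda: {"sensors": set(), "packets": 0})
    for r in records:
        key = canonical_key(r["src"], r["sport"], r["dst"], r["dport"], r["proto"])
        merged[key]["sensors"].add(r["sensor"])
        merged[key]["packets"] += r["packets"]
    return dict(merged)


records = [
    {"sensor": "eqx-sj", "src": "192.0.2.10", "sport": 40000,
     "dst": "198.51.100.5", "dport": 443, "proto": "tcp", "packets": 12},
    {"sensor": "eqx-chi", "src": "198.51.100.5", "sport": 443,
     "dst": "192.0.2.10", "dport": 40000, "proto": "tcp", "packets": 9},
]
for key, summary in merge_half_connections(records).items():
    print(key, summary["packets"], sorted(summary["sensors"]))
```

The trade-off against the first option is that only metadata crosses the WAN, which is far cheaper than hauling raw packets to a central point, but each Zeek instance still analyzes only half of the conversation.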
Zeek is open source network security monitoring software used extensively by ESnet. Zeek (formerly called Bro) was initially developed by researchers at Berkeley Lab. Zeek allows users to identify and manage cyber threats by monitoring network traffic. It acts as a passive network security monitor (NSM) that gives a holistic view of what is transpiring in the network and provides visibility into the network traffic.
In order to better understand network behavior and provide flexible security services, we use Zeek as an important part of our data center security architecture and are experimenting with placing Zeek clusters at various high-value points on the WAN. This is providing technical insights as well as significant challenges.
In this post we present some of our efforts to approach WAN security using Zeek for network monitoring, along with the successes and challenges we hit during the process and the interesting things we learned.
Zeek on the ESnet LAN:
Monitoring local area and data center networks is a familiar and less complex network traffic monitoring design, and ESnet is no different. The traffic flowing through the LAN networks is currently monitored using two Zeek clusters, one at Brookhaven National Lab and another for the west coast at Berkeley Lab. We have implemented BHR (black hole routing) functionality on our data center routers to block external actors which violate our established policies based on Zeek detections on both IPv4 and IPv6 protocol stacks.
Apart from network security monitoring using “standard” Zeek detection scripts, many enhancements and custom scripts written by ESnet Security team members play a vital role in detecting various kinds of suspicious activity. Recently, a Zeek package, Zeek-Known-outbound, contributed by Michael “Dop” Dopheide, won first prize in the Zeek Package Contest-2 held in May 2020. The package provides the ability to track and alert on outbound service usage to a list of ‘watched’ countries. It also adds the country codes for the origin and recipient hosts to conn.log, the log file in which Zeek records all the connection attempts seen on the network. The motivation behind this work came from the discovery, during routine log analysis, of a few systems contacting hosts in foreign countries for package updates and DNS services.
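To make the idea of watching outbound service usage concrete, here is a small Python sketch of the logic applied to already-parsed conn.log records. It is not the package's actual Zeek script code. The `resp_cc` field name and the placeholder country codes are assumptions made for the example; `local_orig`, `orig_h`, `resp_h`, and `service` are standard conn.log fields.

```python
# Placeholder country codes; a real deployment would choose its own list.
WATCHED_COUNTRIES = {"XX", "YY"}


def watched_outbound(conn_records, watched=WATCHED_COUNTRIES):
    """Return outbound connections whose responder is in a watched country.

    Each record is a dict holding a few conn.log fields. "resp_cc" is an
    assumed name for the responder country code added to the log.
    """
    alerts = []
    for rec in conn_records:
        outbound = rec.get("local_orig") is True
        if outbound and rec.get("resp_cc") in watched:
            alerts.append(
                (rec["orig_h"], rec["resp_h"], rec["resp_cc"], rec.get("service"))
            )
    return alerts
```

Restricting the check to locally originated connections is what separates a local system reaching out for updates or DNS from routine inbound scanning.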
Zeek on the ESnet WAN:
To augment our LAN efforts on a wider scale, we have been experimenting with monitoring the network traffic on the WAN side of the network using Zeek in order to get more visibility and to provide improved security and network services. Most of this work is experimental, with iterative design changes as we apply what we learn from stage 1 to stage 3 and beyond.
Some notable differences and challenges compared with a typical LAN network:
Data Volume: There are a large number of WAN links that run at 1-400 Gb/s.
Data Encapsulation: Data with variable length headers is problematic, so we have been employing a load balancer to address this problem.
Asymmetric Data Flows: This is a hard problem to solve, especially when the network is distributed across the country. When the packets corresponding to the inbound and outbound flows between two network nodes follow different paths, it can be challenging to reconcile conversation activities as part of network monitoring.
Technical Integration: Coordinating activities between teams distributed geographically introduces challenges, which we are developing ways to overcome.
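One way to see the asymmetric flow problem in practice is through the `history` field of Zeek's conn.log, where upper-case letters record events from the originator and lower-case letters record events from the responder. The sketch below is an illustration with hypothetical helper names, not part of ESnet's tooling. It flags connections for which a sensor saw only one direction.

```python
def seen_directions(history: str):
    """Return (originator_seen, responder_seen) for a conn.log history string."""
    letters = [c for c in history if c.isalpha()]  # skip markers such as "^"
    originator_seen = any(c.isupper() for c in letters)
    responder_seen = any(c.islower() for c in letters)
    return originator_seen, responder_seen


def is_one_sided(history: str) -> bool:
    """True when exactly one direction of the conversation was observed."""
    originator_seen, responder_seen = seen_directions(history)
    return originator_seen != responder_seen
```

A history such as `ShADadFf` shows both sides of a normal TCP conversation, while a history made up only of upper-case or only of lower-case letters suggests the other direction took a path the sensor could not see.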
At ESnet we strive to push the boundaries and try innovative ways to address challenges, and Zeek on the WAN is an example of that. In my next article I will discuss some of the ways we have been experimenting with addressing the complex problems noted above, going into detail on the research being done to address Asymmetric Data Flows on the WAN.