Re-imagining perfSONAR to gain new network insights

Scientific discovery increasingly relies on the ability to perform large data transfers across networks operated by many different providers (including ESnet) around the globe. But what happens when a researcher initiates one of these large data transfers and data movement is slow? What does “slow” even mean? These can be surprisingly complex questions and it is important to have the right tools to help answer them. perfSONAR is an open source software tool designed to measure network performance and pinpoint issues that occur as data travels across many different networks on the way to a destination.

perfSONAR has been around for more than 15 years and is primarily maintained today by a collaboration of ESnet, GEANT, Indiana University, Internet2, and the University of Michigan. Its active community extends well beyond these five core organizations, with more than 2000 public deployments spanning six continents and hundreds of organizations. perfSONAR deployments can schedule and run tests that calculate metrics including (but not limited to) how fast a transfer can be performed (throughput), whether a unit of information reaches its destination (packet loss), how long it takes to get there (latency), and what path it takes (traceroute). What is novel about perfSONAR is not just these metrics, but the set of tools for presenting them in dashboards built by multiple collaborating organizations. These dashboards aim to clearly identify patterns that signify potential issues and provide the means to drill down into graphs that give more information.
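To make the metric definitions above concrete, here is a minimal Python sketch of how packet loss, latency, and throughput could be derived from raw test data. It is only an illustration of the definitions: the `Probe` record and the function names are hypothetical and are not part of perfSONAR or its measurement tools.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Probe:
    """One test packet (hypothetical record): times are in seconds."""
    sent_at: float
    received_at: Optional[float]  # None means the packet never arrived


def packet_loss(probes: Sequence[Probe]) -> float:
    """Fraction of probes that never reached the destination."""
    if not probes:
        raise ValueError("at least one probe is required")
    lost = sum(1 for p in probes if p.received_at is None)
    return lost / len(probes)


def mean_latency_ms(probes: Sequence[Probe]) -> Optional[float]:
    """Mean one-way delay in milliseconds over the probes that arrived."""
    delays = [
        (p.received_at - p.sent_at) * 1000.0
        for p in probes
        if p.received_at is not None
    ]
    return sum(delays) / len(delays) if delays else None


def throughput_mbps(bytes_transferred: int, seconds: float) -> float:
    """Average transfer rate in megabits per second."""
    return bytes_transferred * 8 / seconds / 1e6
```

For example, four probes of which one is lost give a packet loss of 0.25, and moving 125,000,000 bytes in 10 seconds works out to 100 Mbps.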

Example perfSONAR dashboard grid highlighting packet loss to an ANL test node (top). Example line graphs that further illustrate aspects of the problem (bottom).

While perfSONAR has had great success in providing the current set of capabilities, there is more that can be done. For example, perfSONAR is very good at correlating the metrics it collects with other perfSONAR metrics that share at least one endpoint. But what if we want to correlate metrics by location, by intermediate network, or with statistics collected outside perfSONAR, such as flow statistics and interface counters? These are all key questions the perfSONAR project is looking to answer.

Building upon a strong foundation

perfSONAR can add analytics from other software tools through a plug-in framework. Recently, we have begun to use Elasticsearch via this framework to ingest log data and enable improved search and analytics on perfSONAR data.

For example, traditionally perfSONAR has viewed an individual measurement as something between a pair of IP addresses. But what do these IP addresses represent and where are they located? Using the off-the-shelf tools Elasticsearch and Logstash, perfSONAR is able to answer questions like “What geographic areas are showing the most packet loss?”

Example map showing packet loss hotspots to different locations around the globe. It also contains a menu to filter results by intermediate network.
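The idea behind the geographic question can be sketched in a few lines of Python: tag each result with a location looked up from its IP address, then aggregate by that location. This is a simplified illustration, not how the Elasticsearch and Logstash pipeline is implemented. The `GEO_BY_IP` table, the record fields, and the documentation-range addresses are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical lookup table standing in for a real GeoIP database.
GEO_BY_IP = {
    "198.51.100.10": "US",
    "198.51.100.20": "US",
    "203.0.113.5": "DE",
}


def loss_by_region(results, geo_by_ip):
    """Return the packet loss fraction per region of the destination IP.

    Each result is a dict with hypothetical keys "dest_ip",
    "packets_sent" and "packets_lost". Addresses missing from the
    lookup table are grouped under "unknown".
    """
    sent = defaultdict(int)
    lost = defaultdict(int)
    for r in results:
        region = geo_by_ip.get(r["dest_ip"], "unknown")
        sent[region] += r["packets_sent"]
        lost[region] += r["packets_lost"]
    return {region: lost[region] / sent[region]
            for region in sent if sent[region] > 0}
```

Summing packets before dividing weights each region by how much was actually sent, rather than averaging per-test percentages.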

Additionally, we can apply this same principle to traceroute (and similar tools), which yield a list of IP addresses giving an idea of the path a measurement takes between source and destination. Each IP address is a key to more information about the path, including not only geographic information but also the organization at each point. This means you can ask questions such as “What is the throughput of all results that transit a given organization?” Previously, a user would have had to know the exact IP address, and it would have had to be the first (source) or last (destination) address in the path.
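A rough Python sketch of the transit question follows. It assumes each result already carries its traceroute path as a list of hop IPs, and that a hypothetical `org_by_ip` mapping (for example, built from an IP-to-organization database) is available. None of these names come from perfSONAR itself.

```python
from statistics import mean


def results_transiting(results, org, org_by_ip):
    """Return results whose path has a hop in `org`, at any position."""
    return [
        r for r in results
        if any(org_by_ip.get(hop) == org for hop in r["path"])
    ]


def mean_throughput_via(results, org, org_by_ip):
    """Mean throughput (Mbps) of results transiting `org`, or None."""
    matching = results_transiting(results, org, org_by_ip)
    if not matching:
        return None
    return mean(r["throughput_mbps"] for r in matching)
```

The key point is that the match is made on any hop in the path, not only on the source or destination address.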

Integration with non-perfSONAR data is another area the project is looking to expand. By putting perfSONAR data in a well-established data store like Elasticsearch, the door is open to leverage other off-the-shelf open source tools like Grafana to display results. What’s interesting about this platform is not only its ability to build new visualizations, but also the diverse set of backends it is capable of querying. If data such as host metrics, network interface counters and flow statistics are kept in any of the supported data stores, then there is a means to present this information alongside perfSONAR data.

Example of perfSONAR statistics combined with host statistics from a completely different database being displayed in Grafana
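Conceptually, showing data from two different stores side by side comes down to aligning both series on a shared time axis. The Python sketch below shows that alignment step in isolation. It is not Grafana's implementation, and the function names and the 60-second bucket width are assumptions chosen for the example.

```python
from collections import defaultdict


def bucket(ts: int, width: int = 60) -> int:
    """Round a Unix timestamp down to the start of its time bucket."""
    return ts - ts % width


def align(series_a, series_b, width: int = 60):
    """Join two (timestamp, value) series on shared time buckets.

    Values that fall in the same bucket are averaged. Only buckets
    present in both series are returned, in time order.
    """
    def averaged(series):
        groups = defaultdict(list)
        for ts, value in series:
            groups[bucket(ts, width)].append(value)
        return {b: sum(v) / len(v) for b, v in groups.items()}

    a, b = averaged(series_a), averaged(series_b)
    return [(t, a[t], b[t]) for t in sorted(a.keys() & b.keys())]
```

With throughput in one series and host CPU load in the other, each returned row pairs the two values for the same minute, which is the kind of view that makes a correlation visible.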

These efforts are still in the early stages of development, but initial indicators are promising. Leveraging the perfSONAR architecture in conjunction with the wealth of off-the-shelf open source tools available today creates opportunities to gain new insights from the network, like those described above, that were not previously possible with the traditional perfSONAR tools.

Getting involved and learning more

The perfSONAR project will continue to provide updates as this work progresses. You can also see the perfSONAR web site for updates and more information on keeping in touch through our mailing lists. The perfSONAR project looks forward to working with the community to provide exciting new network measurement capabilities.

Three questions with Derek Howard

Three questions with a new ESnet staff member!  

Derek Howard is a software developer from Columbia, MO. Prior to joining ESnet, Derek worked as an HPC system administrator for the University of Missouri. Derek also created Augur (https://github.com/chaoss/augur) which is part of the Linux Foundation’s CHAOSS group (https://chaoss.community/), a working group focused on measuring the health and sustainability of open source software. 


Derek is part of the Network Services Automation group under John MacAuley, where he will be working primarily on our internal ESnet Database (ESDB).

Question 1: What brought you to ESnet?

I worked with George Robb at the University of Missouri and he joined ESnet a while ago and it seemed like a great place to work. I asked him if there were any positions at ESnet he thought might be a good fit for me and he referred me to the position I am in now. I’m really happy I joined; it is as great as I expected!

Question 2: What is the most exciting thing going on in your field right now?

With so much work underway for ESnet6, exciting changes are happening every day. We are pushing to get features out for all of our software as fast as possible. Right now, I am working on a feature in ESDB to make it easier for network engineers to verify that hardware was installed correctly during router installs.

As far as the broader field goes, I am excited about DDR5 memory becoming commercially available soon. 

Question 3: What book would you recommend?

Randall Munroe’s “What If?” – It’s a wonderful collection of serious answers to silly questions by the creator of XKCD.

Zeek and stream asymmetry research at ESnet

In my previous post, we discussed use of the open-source Zeek software to support network security monitoring at ESnet.  In this post, I’ll talk a little about work underway to improve Zeek’s ability to support network traffic monitoring when faced with stream asymmetry.

This comes from recent work by two of my colleagues on the ESnet Security team.

Scott Campbell and Sam Oehlert presented ‘Running Zeek on the WAN: Experiences and solutions for large scale flow asymmetry’ at a workshop held last year at CERN in Geneva. The presentation explained the phases and deployment of the Zeek-on-the-WAN (ZoW) pilot in detail.

Scott Campbell at CERN presenting ‘Running Zeek on the WAN’

The asymmetry problem on a WAN (example)

Some of the significant findings and results from this presentation are highlighted below:

  • Phase I: Initial Zeek Node Design Considerations 
    • Select locations that provide an interesting network vantage point – in the case of our ESnet network, we deployed Zeek nodes on our commodity internet peerings (eqx-sj, eqx-chi, eqx-ash) since they represent the interface to the vast majority of hostile traffic.
    • Identify easy traffic to test with, and use spanning ports to forward traffic destined to the stub network on each of the routers used for collection.
  • Phase I: Initial Lessons learned from testing and results
    • Some misconfigurations were found in the ACL prefix lists. 
    • We increased visibility into our WAN side traffic through implementation of new background methods.
    • We established a new process for end-to-end testing, installing, and verifying Zeek system reporting.
  • Phase II:  Prove there is more useful data to be seen
    • For Phase II we moved from statistical sampling-based techniques toward collection of full peer connection records. We started running Zeek on traffic crossing the interfaces that connect ESnet network peers to the internet, from the AS (autonomous system) responsible for most notices.
    • To get high-fidelity connection information without being crushed by data volume, define a subset of packets that are interesting: zero-length control packets (SYN/SYN-ACK/FIN/RST) from peerings.
  • Phase II: Results
    • A lot of interesting activity was discovered, such as information leakage in syslogs and logins (and attempted logins) using poorly secured authentication protocols. Analysis of the amount of asymmetric traffic also gave valuable insight into the asymmetric traffic problem.
  • Ongoing Phase III: Expanding the reach of traffic collection on WAN
    • We are currently in the process of deploying Zeek nodes at another three WAN locations for monitoring commodity internet peering: PNWG (peering at Seattle, WA), AMS-IX (peering at Amsterdam), and LOND (peering at London).

Locations for the ZoW systems; pink shows the ongoing Phase III deployment
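The Phase II idea of keeping only zero-length TCP control packets can be expressed as a simple predicate. The Python sketch below is an illustration under assumptions: it takes the TCP flags byte and payload length as already-parsed inputs, and it says nothing about how the filtering was actually configured on the ESnet routers or Zeek nodes. The flag bit values are the standard ones from the TCP header.

```python
# TCP flag bits as defined in the TCP header (RFC 9293).
FIN = 0x01
SYN = 0x02
RST = 0x04


def is_control_packet(tcp_flags: int, payload_len: int) -> bool:
    """Return True for zero-length segments with SYN, FIN or RST set.

    A SYN-ACK matches because its SYN bit is set. A bare ACK and any
    segment that carries payload do not match.
    """
    return payload_len == 0 and bool(tcp_flags & (SYN | FIN | RST))
```

Because only connection setup and teardown packets pass the filter, the monitor still sees when every connection starts and ends while ignoring the bulk of the data volume.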

As our use of Zeek on the WAN side of ESnet continues to grow, the next phase of the ZoW pilot is currently being defined. We’re working to incorporate the lessons learned on handling traffic asymmetry into these next phases of effort.

Some (not all) solutions being taken into consideration include: 

  • Aggregating traffic streams at a central location to make sense out of the asymmetric packet streams and then run Zeek on the aggregated traffic, or
  • Running Zeek on the individual asymmetric streams and then aggregating the resulting Zeek streams by 5-tuple, which aggregates connection metadata rather than the connection stream itself.
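The second option, aggregating by 5-tuple, can be sketched as follows. Each sensor sees only one direction of a connection, so the two half-records carry mirrored source and destination fields. Sorting the two endpoints gives a key that is the same for both directions. The record fields below are hypothetical and do not follow Zeek's own log schema.

```python
from collections import defaultdict


def flow_key(rec):
    """Direction-independent key: protocol plus the two sorted endpoints."""
    a = (rec["src_ip"], rec["src_port"])
    b = (rec["dst_ip"], rec["dst_port"])
    return (rec["proto"],) + tuple(sorted((a, b)))


def merge_records(records):
    """Combine per-sensor, one-directional records into per-flow summaries."""
    merged = defaultdict(lambda: {"sensors": set(), "bytes": 0, "records": 0})
    for rec in records:
        entry = merged[flow_key(rec)]
        entry["sensors"].add(rec["sensor"])
        entry["bytes"] += rec["bytes"]
        entry["records"] += 1
    return dict(merged)
```

Only small metadata records need to reach the aggregation point in this approach, which is the trade-off against forwarding the raw packet streams to a central location.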

We are currently exploring these options to provide better WAN solutions for both ESnet and its connected sites.

Three Questions with Chris Cummings

Three questions with a new ESnet staff member!  

Chris has joined ESnet as a network engineer and is supporting the ESnet6 project and day-to-day operations. Based in Chicago, IL, he has many years of on-the-job experience designing, deploying, and managing networks.

Chris started his networking career working on wireless broadband internet at a Wireless Internet Service Provider (WISP) in Juneau, Alaska. He has worked in various network engineering roles, including in heavy industry and underground mining.

When not doing networking, you likely won’t be able to find Chris as he will be out in the woods camping and entirely off the grid.

Question 1: What brought you to ESnet?

I wanted to work at ESnet because I knew it would bring me an entirely different set of challenges than what I was used to and would place me in an environment with incredibly intelligent people who could help me sort through those challenges.

Question 2: What is the most exciting thing going on in your field right now?

I’d have to say that it is the explosion of resources and focus on network automation. Networking has typically lagged behind other IT disciplines in this regard, so I think it’s very exciting to see networking catch up and start to benefit from combining software development methodologies with traditional network engineering.

Question 3: What’s a book you recommend?

For networking specifically, I would highly recommend Computer Networking Problems and Solutions by Ethan Banks and Russ White. What I like about this book is that it closely follows the advice given in RFC 1925 rule 11, which states that “Every old idea will be proposed again with a different name and a different presentation, regardless of whether it works.” This high-level approach teaches you to think about networking problems in a more abstract manner so that when you are approached with a new problem you can apply a common framework to the solution rather than having to reinvent the wheel every time.

For something more leisurely, I would recommend reading The Dresden Files, which is a series of contemporary fantasy/detective/mystery books written from the perspective of a wizard who lives in modern-day Chicago.