Using perfSONAR to Find Network Anomalies

Last week ESnet and Berkeley Lab Computing Sciences sponsored a talk on network anomaly detection by Prasad Calyam, Ph.D., of the Ohio Supercomputer Center/OARnet and The Ohio State University. For the last year and a half, Calyam has worked on a DOE-funded perfSONAR-related project. He emphasized that accurate measurement is necessary for troubleshooting across multiple domains and layers. Calyam’s group is developing metrics and adaptive performance-sampling techniques to analyze network performance across multiple layers, with the goals of better network status awareness, improved performance, and optimal path selection for large data sets.

Active measurement requires intelligent sampling, Calyam says, but the necessary measurements are difficult to carry out because of policies and other constraints.

Calyam’s group is currently developing two new perfSONAR tools:

  • The OnTimeDetect Tool detects network anomalies. It leverages the perfSONAR lookup services to query for projects or sites, then pulls data for the path of interest from the perfSONAR Measurement Archives and uses that data to detect anomalies accurately and rapidly as they occur (a rough sketch of this query-and-detect loop follows the list below).
  • The OnTimeSample Tool does intelligent forecasting to help plan and manage network infrastructure, validating its forecasts with enterprise monitoring data from the E-Center portal led by Fermilab as well as ESnet’s more than 60 perfSONAR deployments.

The next step is to integrate the tools to present information in a more user-friendly way. Useful network performance information can be collected from user logs, anomaly alerts, and measurement data from applications such as perfSONAR, PingER, and ESxSNMP (ESnet’s SNMP database).
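To make the query-and-detect flow concrete, here is a minimal, illustrative sketch. It is not the OnTimeDetect code: the Measurement Archive fetch is stubbed out, and the sliding-window z-score test, window size, and threshold are simplifying assumptions standing in for the group’s adaptive detection algorithm.

```
from statistics import mean, stdev

def detect_anomalies(samples, window=20, threshold=3.0):
    """Flag samples that deviate sharply from the recent sliding window.

    samples: list of (timestamp, throughput_mbps) tuples, oldest first,
             e.g. as retrieved from a Measurement Archive query (stubbed here).
    """
    anomalies = []
    for i in range(window, len(samples)):
        recent = [value for _, value in samples[i - window:i]]
        mu, sigma = mean(recent), stdev(recent)
        ts, value = samples[i]
        # Skip flat windows (zero variance) to avoid dividing by zero.
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            anomalies.append((ts, value))
    return anomalies

# Synthetic example: steady ~900 Mbps with minor jitter, then a sudden dip.
history = [(t, 900.0 + (t % 5)) for t in range(50)] + [(50, 300.0)]
print(detect_anomalies(history))   # -> [(50, 300.0)]
```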
Measuring the layers of network intelligence

Active vs Passive Measurement

Calyam is also building a mechanism that will enable network infrastructure providers to continuously check networks using both active measurements of end-to-end performance and passive measurements collected by instrumentation deployed at strategic points in a network.

According to Calyam, one cannot accurately assess network status without adaptive and random sampling techniques. Calyam’s group is trying to determine the optimal sampling frequency and distribution to monitor networks and to forecast and detect anomalies.
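As an illustration of what adaptive sampling can look like (the post does not spell out the group’s actual scheme), one simple approach is to probe more often when recent measurements are volatile and back off when they are stable. The measure() probe, the interval bounds, and the volatility threshold below are all hypothetical.

```
import time
from statistics import pstdev

def adaptive_sampler(measure, min_interval=60, max_interval=600, window=10):
    """Run measure() forever, shortening the interval when results get noisy."""
    interval = max_interval
    recent = []
    while True:
        recent.append(measure())   # hypothetical probe: delay, loss, throughput...
        recent = recent[-window:]
        if len(recent) == window:
            volatility = pstdev(recent)
            average = sum(recent) / len(recent)
            # Halve the interval when volatile, stretch it when quiet.
            if average and volatility > 0.1 * average:
                interval = max(min_interval, interval // 2)
            else:
                interval = min(max_interval, interval * 2)
        time.sleep(interval)
```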

“You can do sampling in a particular domain that you control, but the real challenge is in multi-network domains controlled by multiple entities,” Calyam comments. “For this approach to work, you need a federation of ISPs to share network performance information.” For example, the perfSONAR measurement federation (e.g., ESnet, Internet2, and GEANT) can share measurement topologies, policies, and measurement exchange formats for mutual troubleshooting.

His group is working on an application for scientists who use instrumentation remotely and experience lag. Suitable sampling can indicate to users whether the lag in instrument control they are experiencing is due to delays in the movement of physical instrument components or due to network latency.

However, perfSONAR cannot yet handle strict sampling patterns. It is engineered for ease of use, but that means a trade-off in sampling precision and sophistication. Its current set of tools (ping, traceroute, owamp, and bwctl) can conflict with one another, or with other active measurement tools, when run concurrently. Calyam advocates a meta-scheduling function to control measurement tools, along with new regulation policies and semantic priorities. His group is also building some model user interfaces, including a GUI tool, a Twitter publishing API, the Google charts that perfSONAR uses, and Graphite charts.
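A meta-scheduler along these lines could be as simple as a priority queue that runs one active measurement at a time, so bandwidth tests never overlap with latency probes. The sketch below is illustrative only; the tool commands and priorities are assumptions, not a perfSONAR interface.

```
import heapq
import subprocess

class MeasurementScheduler:
    """Run active measurement commands one at a time, highest priority first."""

    def __init__(self):
        self._queue = []   # (priority, sequence, command) min-heap
        self._seq = 0

    def submit(self, priority, command):
        heapq.heappush(self._queue, (priority, self._seq, command))
        self._seq += 1

    def run_all(self):
        # Serializing the tests is the simplest conflict-avoidance policy:
        # a bandwidth test never runs on top of a latency probe.
        while self._queue:
            _, _, command = heapq.heappop(self._queue)
            subprocess.run(command, check=False)

sched = MeasurementScheduler()
sched.submit(1, ["ping", "-c", "4", "example.net"])          # lightweight probe first
sched.submit(5, ["bwctl", "-c", "example.net", "-t", "10"])  # bandwidth test last
sched.run_all()
```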

Calyam’s group was the first to query perfSONAR measurements on 480 paths and 65 sites worldwide. The group has so far developed an adaptive anomaly detection algorithm, demonstrated a new adaptive sampling scheme, and released a set of algorithms to the perfSONAR user and developer community (http://ontimedetect.oar.net). Tools are also available at the perfSONAR website.

ANI Testbed Departs Left Coast for Long Island

It's hard to part with hardware

In another milestone, ESnet’s testbed was dispatched to its more permanent home at Brookhaven National Laboratory. The testbed, part of LBL’s $62 million ARRA-funded Advanced Networking Initiative, was established so researchers could experiment and push the limits of network technologies. A number of researchers have taken advantage of its capabilities so far, and we are collecting proposals for new projects.

We handled the painstaking shutdown and packing procedures. We verified the internal wiring and the structural integrity of the hosts. We dealt with the intricacies of IP addressing. The testbed will be reassembled and open for new research projects in a couple of weeks. Note to Brookhaven: We meticulously counted every last screw and packed them all in plastic baggies.

Aussies’ Data Flows Literally over the Top

Here are some plots of data transfers posted by Alex Sim of the Scientific Data Management Group at Lawrence Berkeley National Lab, which is headed by Arie Shoshani. Sim et al. have been deeply engaged with data movements for the Earth System Grid (ESG) and Climate100 projects.

The Australian National University/NERSC data transfer rates are impressive. As you can see from this graph, about a third of a terabyte of data flows over the network in under 10 minutes. ESnet carries the data from the Bay Area to Pacific Wave in Seattle, and then it continues across the Pacific to Australia on AARNET. Three groups at ESnet are providing assistance with various aspects of this effort.
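For a rough sense of what that graph implies, a third of a terabyte in ten minutes works out to several gigabits per second sustained (a back-of-envelope calculation, not a measured figure):

```
bytes_moved = 1e12 / 3        # roughly a third of a terabyte
seconds = 10 * 60             # under ten minutes
gbps = bytes_moved * 8 / seconds / 1e9
print(f"{gbps:.1f} Gbps")     # about 4.4 Gbps sustained
```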

Notice here, the combined plot runs off the graph. We like to see that sort of thing as it sets a good example of what we are all striving for—data transfer performance of scientific utility. How do you get it? It is a matter of figuring out how to use systems correctly and optimize infrastructure.

The Australians were savvy in figuring out their system before they launched huge data flows. Initial qualification of the network path between the BAMAN sites and ANU was done using the ESnet disk performance tester, lbl-diskpt1.  Before the ESG nodes went live at NERSC, the Australians were testing against lbl-diskpt1 to qualify their network and storage system performance for long-distance transfers from the San Francisco Bay Area. So, by the time the NERSC ESG Gateway and Data Node came up, they knew the data transfer infrastructure was relatively clean.

Let’s just say that our test and measurement infrastructure is continuing to show its value…

The data flow from Livermore to NERSC is pretty impressive as well. Recent data movement from the British Atmospheric Data Centre (BADC) in the UK to LBNL/NERSC marked another milestone for the ESG and Climate100 projects.

These data replications were managed by Bulk Data Mover (BDM), a scalable data transfer management tool developed by the SDM group at LBNL under the ESG project. BDM manages efficient data transfers with an optimized transfer queue and concurrency management algorithms. GridFTP is used as the main underlying transfer protocol.
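To illustrate the general idea of a concurrency-managed transfer queue (this is not BDM’s implementation; the hostnames, file list, and concurrency level are made up), a minimal sketch might look like this:

```
import subprocess
from concurrent.futures import ThreadPoolExecutor

def gridftp_copy(src_url, dst_url):
    # globus-url-copy is the standard GridFTP command-line client.
    return subprocess.run(["globus-url-copy", src_url, dst_url]).returncode

def transfer_all(file_pairs, concurrency=8):
    # Keeping several transfers in flight hides per-file setup latency and
    # keeps the wide-area pipe full, which is the point of a managed queue.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda pair: gridftp_copy(*pair), file_pairs))

# Hypothetical (source, destination) URL pairs.
files = [
    ("gsiftp://source.example.org/esg/file0001.nc",
     "gsiftp://dest.example.org/esg/file0001.nc"),
    # ... more pairs
]
transfer_all(files)
```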

–Eli Dart