ESnet Machine Learning Researchers Win Best Paper at MLN 2021!

MLN 2021 Best Paper Award Notification

Sheng Shen, Mariam Kiran, and Bashir Mohammed have just been awarded the Best Paper award at the International Conference on Machine Learning for Networking (MLN). Sponsored by the Conservatoire National des Arts et Métiers (CNAM), the École Supérieure d’Ingénieurs en Électrotechnique et Électronique (ESIEE), and Laboratoire d’Informatique Gaspard-Monge (LIGM), MLN is being held virtually 1-3 December 2021.

The paper, “DynamicDeepFlow: An Approach for Identifying Changes in Network Traffic Flow Using Unsupervised Clustering,” combines a deep learning model (a variational autoencoder) with a shallow learning method (k-means clustering) to help identify unique traffic patterns across ESnet. These patterns can help indicate whether a new experiment has started or whether current network bandwidth is changing.
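As a rough illustration of the clustering half of that hybrid, the sketch below runs plain k-means over a set of low-dimensional vectors. It is not the authors' implementation: the variational autoencoder is not reproduced here, and the `latent` list is a synthetic stand-in for the embeddings an encoder might produce. The function names, the farthest-first seeding, and the two-blob test data are all assumptions made for the example.

```python
import math
import random


def farthest_first(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point that is farthest from every centroid chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iters=100):
    """Lloyd's k-means. Returns (centroids, labels)."""
    centroids = farthest_first(points, k)
    labels = []
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster where it was
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, labels


# Synthetic stand-in for encoder output: two well-separated traffic "patterns".
rng = random.Random(42)


def blob(center, n, spread=0.3):
    return [tuple(rng.gauss(c, spread) for c in center) for _ in range(n)]


latent = blob((0.0, 0.0), 20) + blob((5.0, 5.0), 20)
centroids, labels = kmeans(latent, k=2)
print(labels)
```

In a setup like this, a window of traffic whose vectors start landing in a different cluster than before would be one possible signal that the traffic pattern has changed.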

DynamicDeepFlow (DDF) model structure

“We’re very excited to receive this recognition and the conference was a wonderful opportunity to exchange thoughts and ideas with peers in France. MLN is a conference dedicated to discussing machine learning applications in networks. Our next task is to integrate DynamicDeepFlow with Netpredict to show real-time information in ESnet data,” said Mariam Kiran.

Papers from MLN will be published as post-proceedings in Springer’s Lecture Notes in Computer Science (LNCS).

ESnet Highlights from the National Science Foundation’s Cybersecurity Summit ’21

The National Science Foundation (NSF) Cybersecurity Center of Excellence, Trusted CI Project hosts a yearly cybersecurity summit, inviting people from various NSF-funded research organizations to share innovations and ideas. Here are some videos of ESnet presentations.

Scott Campbell presented “ESnet Security Group Impact on Network Architecture” where he discussed some of the social, technical, and architectural outcomes of the ESnet6 network upgrade that were beneficial to the organization. By being involved early, security design elements were incorporated into workflows at early stages and were both tightly integrated and vetted during the core design process. This early involvement also heightened the security group’s visibility, which led to a better understanding of how the various groups interact and their different methods of problem-solving and time management.

Eli Dart and Fatema Bannat Wala presented “Best practices for securing Science DMZ,” focusing on disentangling security policies and enforcement for science flows from traditional security approaches for business systems, and use of the Science DMZ model to protect high-performance science flows. They discussed thinking of the Science DMZ as a security architecture that provides useful and implementable security controls without impacting performance. 

ESnet Scientists awarded best paper at SC21 INDIS!

A combined team from ESnet and Lehigh University was awarded best paper for “Exploring the BBRv2 Congestion Control Algorithm for use on Data Transfer Nodes” at the 8th IEEE/ACM International Workshop on Innovating the Network for Data-Intensive Science (INDIS 2021), which was held in conjunction with the 2021 IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC21) on Monday, November 15, 2021.

The team consisted of:

  • Brian Tierney, Energy Sciences Network (ESnet)
  • Eli Dart, Energy Sciences Network (ESnet)
  • Ezra Kissel, Energy Sciences Network (ESnet)
  • Eashan Adhikarla, Lehigh University

The paper can be found here. Slides from the presentation are here. In this Q+A, ESnet spoke with the award-winning team about their research — answers are from the team as a whole.

INDIS 21 Best Paper Certificate

The paper is based on extensive testing and controlled experiments with the BBR (Bottleneck Bandwidth and Round-trip propagation time), BBRv2 and the Cubic Function Binary Increase Congestion Control (CUBIC) Transmission Control Protocol (TCP) Internet congestion algorithms. What was the biggest lesson from this testing?

BBRv2 represents a fundamentally different approach to TCP congestion control. CUBIC, along with Hamilton, Reno, and many others, is loss-based: these algorithms interpret packet loss as congestion and therefore require significant network engineering effort to achieve high performance. BBRv2 is different in that it measures the network path and builds a model of the path; it then paces itself to avoid loss and queueing. In practical terms, this means that BBRv2 is resilient to packet loss in a way that CUBIC is not. This comes through loud and clear in our data.
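To make the contrast concrete, here is a small sketch built on two textbook formulas rather than anything taken from the paper. The first is the Mathis et al. approximation for Reno-style loss-based TCP, throughput ≤ (MSS / RTT) · C / √p, which shows how quickly achievable throughput falls as the loss rate p rises. CUBIC's response to loss differs in detail, so treat this as a general illustration of loss-based behavior. The second is the bandwidth-delay product, the quantity a model-based algorithm estimates from its bandwidth and round-trip measurements. The function names and example values are assumptions.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=math.sqrt(1.5)):
    """Approximate upper bound on Reno-style TCP throughput in bits/s:
    (MSS / RTT) * C / sqrt(p), with C = sqrt(3/2) for periodic loss."""
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss_rate)


def bdp_bytes(bottleneck_bps, min_rtt_s):
    """Bandwidth-delay product: bytes the path holds in flight when full."""
    return bottleneck_bps * min_rtt_s / 8


# Example values (assumed): 1460-byte MSS on a 50 ms path.
for loss in (1e-6, 1e-4, 1e-2):
    mbps = mathis_throughput_bps(1460, 0.050, loss) / 1e6
    print(f"loss={loss:g}: about {mbps:.1f} Mbit/s per flow")

# A 10 Gbit/s bottleneck at 50 ms needs this much data in flight to stay full.
print(f"BDP: {bdp_bytes(10e9, 0.050) / 1e6:.1f} MB")
```

Under this model, a hundredfold increase in loss cuts the loss-based bound by a factor of ten, which is why even small loss rates matter so much on long, fast paths.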

What part of the testing was the most difficult and/or interesting?

We ran a large number of tests in a wide range of scenarios. It can be difficult to keep track of all the test configurations, so we wrote a “test harness” in Python that allowed us to keep track of all the testing parameters and resulting data sets.

The harness also allowed us to better compare results collected over real-world paths to those in our testbed environments. Managing the deployment of the testing environment through containers also allowed for rapid setup and improved reproducibility.
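The team's harness itself is not shown here, but the general idea of keeping parameters and results together can be sketched in a few lines. Everything below is hypothetical: the `RunConfig` fields, the parameter values, and the `fake_run` placeholder are illustrative assumptions, not the team's actual tool or test matrix.

```python
import csv
import io
import itertools
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    """One point in the test matrix (field names are illustrative)."""
    algorithm: str
    rtt_ms: int
    loss_pct: float
    streams: int


def build_matrix(algorithms, rtts_ms, losses_pct, streams):
    """Expand the parameter lists into every combination."""
    combos = itertools.product(algorithms, rtts_ms, losses_pct, streams)
    return [RunConfig(*combo) for combo in combos]


def run_all(configs, run_one):
    """Run each config and store its parameters and results in one row."""
    rows = []
    for test_id, cfg in enumerate(configs):
        row = {"test_id": test_id, **asdict(cfg)}
        row.update(run_one(cfg))
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialize rows so later analysis never loses track of the settings."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def fake_run(cfg):
    # Placeholder: a real harness would launch a transfer tool here
    # and parse its output.
    return {"throughput_gbps": None}


matrix = build_matrix(["cubic", "bbr2"], [10, 50], [0.0, 0.01], [1, 8])
rows = run_all(matrix, fake_run)
print(to_csv(rows).splitlines()[0])
```

Writing one row per run, with the configuration stored beside the measurement, is what makes results from different environments comparable afterward.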

You provide readers with links to great resources so they can do their own testing and learn more about BBRv2. What do you hope readers will learn?

We hope others will test BBRv2 in high-performance research and education environments. There are still some things that we don’t fully understand. For example, there are some cases where CUBIC outperforms BBRv2 on paths with very large buffers. It would be great for this to be better characterized, especially in R&E network environments.

What’s the next step for ESnet research into BBRv2? How will you top things next year?

We want to further explore how well BBRv2 performs at 100G and 400G. We would also like to spend additional time performing a deeper analysis of the current (and newly generated) results to gain insights into how BBRv2 performs compared to other algorithms across varied networking infrastructure. Ideally we would like to provide strongly substantiated recommendations on where it makes sense to deploy BBRv2 in the context of research and educational network applications.

Arecibo Support Wins SC21 HPCwire Readers’ Choice Award!

Arecibo dish after the collapse

As part of a team spanning 15 government, academic, and industrial partners, the Engagement and Performance Operations Center (EPOC), a collaboration between Indiana University and ESnet, was awarded the “Best HPC Collaboration (Academia/Government/Industry)” HPCwire Readers’ Choice award on Tuesday, Nov. 16. The award, which was presented at the High Performance Computing, Networking, Storage and Analysis (SC21) conference, recognizes the effort and collaboration required to move and safeguard irreplaceable data (over 50 years of astronomical observations) from the Arecibo Observatory following the structural collapse of this scientific resource in 2020.

At ESnet, Ken Miller, George Robb, and Jason Zurawski supported these efforts as both full members of EPOC and ESnet staff. Jason and Ken divide their time between EPOC and ESnet’s Science Engagement Team, while George is with ESnet’s Infrastructure Systems group. LightBytes caught up with Jason Zurawski to get his thoughts on the project and award, and an update on the Arecibo effort since our April 2021 post on this project.


Now that data from Arecibo has been migrated to the Texas Advanced Computing Center (TACC), what happens next, and how will this data be used?

The team at the University of Central Florida has been engaged with TACC on several ways to build up the capabilities for their data analysis and sharing requirements. They are working to deploy a portal that will allow researchers access to the data, as well as build workflows to investigate and process it using computation provided by TACC.

The team at Arecibo is also still going to process much older data that still resides on tape. Due to the delicate state of the media, it is carefully being read and transferred to on-island storage before being transmitted to TACC for archiving. This work will take several more months to complete.

What do you think the lessons from this effort are in terms of getting so many different organizations to work together to support this very challenging problem?

The collapse that Arecibo experienced sent ripples through the R&E community because researchers and technology professionals alike knew there was a limited window to act on replicating important observations gathered over the years. The partners in this effort were motivated to act, and that removed many barriers to putting some solutions in place. Everyone collaborated efficiently with their core competencies, and we continue to work together as the next steps for the scientific collaboration are planned.

Plans are starting to emerge for a “next generation” Arecibo based on the loss of this instrument. How might the next generation of data management resources be shaped by this collaboration?

Now that there has been some time to evaluate the work, it has also spurred UCF and Arecibo to plan for the future with respect to computation, storage, and network connectivity both in Puerto Rico and in Florida.  With these improvements planned, they will be well-positioned to serve the scientific data for years to come.  New instruments will no doubt increase the data demands by many orders of magnitude – addressing all aspects of the data pipeline now, and then gradually increasing the capabilities over time, will help to prepare for these emerging challenges. 

Congratulations to all of the organizations and staff who helped prevent the loss of this data!

Making the Research and Educational Community SAFER: Adam Slagell on the creation of a new global collaboration to combat cyberthreats.

Adam Slagell is ESnet’s Chief Security Officer and a founding member of the newly formed Security Assistance For Education & Research (SAFER) trust group.

SAFER is an operational security entity focused on fighting computer misuse and defending the academic, research, and education (R&E) mission globally.  SAFER brings together expertise and resources from organizations across the Research and Educational cybersecurity community, including CERN, DFN-CERT, ESET, ESnet, LBNL, STFC, and WLCG.

More information can be found at https://www.safer-trust.org/.


What motivates the creation of SAFER and what do you think success will look like for the community?

There are many cybersecurity trust groups out there, some even dedicated to R&E, like REN-ISAC or XSEDE’s trust group consisting of current and former TeraGrid and XSEDE site members. However, there really isn’t anything like this, both permanent and truly international, even though attacks are almost always transnational. So each time there is a new, major campaign, an international group connecting all these regional responders must be created again. What we are trying to do is create that permanent backbone with a core set of highly connected individuals who are a part of these regional and project-specific trust groups around the world.

If we are successful, we will see several things. First, I believe we will see more international cooperation and information sharing, leading to an earlier notice of new attack campaigns. Second, we will be able to activate a response more quickly, pulling in the expertise needed from a broad pool of SAFER members and their trusted colleagues. Finally, it is our hope that we can provide surge capabilities when a member is under attack. Many R&E organizations have limited resources and small teams. It is a tremendous asset if they can get help from their peers, maybe with unique expertise as they are facing a disruptive attack.

What kind of security resources will SAFER provide?

I alluded to some of the services when discussing what success will look like. But ultimately, our security resources will be determined by community needs. The founding members will serve as the steering committee for the first year until we elect the next steering committee. 

One of our first steps is setting up a Malware Information Sharing Platform (MISP) instance to share Indicators of Compromise, e.g., IP addresses, file hashes, domain names, etc. Usually, there is no requirement for members to share such data as the rules and regulations differ so much across organizations. But even on day one, we will have enough organizations that can contribute to making this service useful.
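For readers unfamiliar with Indicators of Compromise, the sketch below sorts raw indicator strings into the three kinds mentioned above. It uses only the Python standard library and does not use or imitate the MISP API. The function names, the type labels, and the simplified domain pattern are assumptions for illustration, and the sample values are documentation addresses and a well-known empty-input hash, not real threat data.

```python
import ipaddress
import re

# Hex digest lengths for common hash functions.
HASH_LENGTHS = {32: "md5", 40: "sha1", 64: "sha256"}

# Simplified hostname pattern; real-world validation is more involved.
DOMAIN_RE = re.compile(
    r"^(?=.{1,253}$)([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$"
)


def classify_indicator(value):
    """Return 'ip', 'md5', 'sha1', 'sha256', 'domain', or 'unknown'."""
    text = value.strip().lower()
    try:
        ipaddress.ip_address(text)  # accepts both IPv4 and IPv6
        return "ip"
    except ValueError:
        pass
    if re.fullmatch(r"[0-9a-f]+", text) and len(text) in HASH_LENGTHS:
        return HASH_LENGTHS[len(text)]
    if DOMAIN_RE.match(text):
        return "domain"
    return "unknown"


def group_indicators(values):
    """Group unique, normalized indicators by type."""
    groups = {}
    for value in values:
        kind = classify_indicator(value)
        groups.setdefault(kind, set()).add(value.strip().lower())
    return {kind: sorted(items) for kind, items in groups.items()}


sample = [
    "192.0.2.10",
    "2001:db8::1",
    "example.org",
    "EXAMPLE.org",
    "d41d8cd98f00b204e9800998ecf8427e",
]
print(group_indicators(sample))
```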

There is also a secure messaging and chat service using decentralized cryptography that all of our members can participate in. These ad hoc conversations about what people are seeing on their networks will hopefully help detect trends early.

Finally, many of the founding members come from large institutions with more resources, and I believe we can quickly help those projects and institutions that might struggle with an attack by providing our expertise while helping to train the next generation of security professionals.

What excites you most about this effort and what is the opportunity to do the most good?

I love the community-building aspect. In a past life, I created the Bro (now Zeek) Leadership Team and really worked hard to build a vibrant community around that software. I think this expertise is where I can be most helpful as I am less technical in my roles today.

I will also say, I am excited about getting young people involved, too. Organizations that contribute time from their teams will really benefit. There is no training for incident response like jumping in, and I expect the variety of issues we will see will prove very useful just from a training and development perspective.

LBL has a long history supporting cybersecurity research, from the early days of Clifford Stoll and The Cuckoo’s Egg to the creation of Bro.  What does the future of cybersecurity look like, and how will that shape the REN community?

Indeed, LBL’s security team is also a SAFER founding member. One of the things I love about working here and at ESnet is that our mission is outward-focused, and when we help the community we raise all boats, so to speak.

Fortune telling, however, is a dangerous game. We have anticipated some things, like cryptocurrency mining coming to HPCs. However, the threat landscape and tools available keep changing. That is part of what makes this job interesting. The important thing that I hope we keep in mind is that security is not done for its own sake, but to enable our mission of scientific research. To me, this means that we must always work to make risk-based security decisions, even when that might challenge pushes for compliance and simple one-size-fits-all solutions.

Next Generation ESnet6 Routers Installed and Accepted!

ESnet6 took a major step forward last week with the completed installation and acceptance of all 40 “greenfield” routers on the network backbone. These new routers will enable ESnet to operate at speeds up to 400 Gbps across our national fiber network, and provide the backbone infrastructure behind our next generation scientific data mobility capabilities.

A new ESnet6 backbone router in its native habitat.

The installation and acceptance process at each location across the continental US required careful coordination between subcontractors, colocation facility personnel, Lab site staff, and multiple teams across ESnet. Following local health regulations and access requirements, ESnet arranged physical access for the subcontractors at each location, and all parties participated in a turn-up conference call as the routers were installed and brought online.

In addition to networking capabilities, the ESnet6 team implemented new software automation capabilities simplifying the installation and acceptance process.  These capabilities included enhancements to the ESnet inventory system to support bulk planning data import, automatic bill of materials generation, automatic site survey generation, and automated generation of all backbone links within the network.  In addition, the team introduced new workflow orchestration, automated provisioning, and inventory discovery capabilities to help with the installation process.

The acceptance of the ESnet6 greenfield routers is a major milestone for the ESnet6 Project and the team has already migrated a significant portion of customer traffic onto the new routers. Despite the extra challenges presented by the COVID-19 pandemic, the project has made steady progress and is on track to finish ahead of schedule. 

Science begins as a Conversation! See how ESnet creates a world where conversations become discovery. Watch our new video now!

Ever want to know how big research data moves around the globe? ESnet plays a significant role in supporting the great scientific conversations, collaborations, and experiments underway, wherever and whenever they occur. We move exabytes of data around the world, creating a global laboratory that accelerates scientific discovery.

To meet the needs of scientists, we are constantly looking for opportunities to expand our capabilities with our next generation network ESnet6, intelligent edge analytics, advanced network testbeds, 5G wireless, quantum networking and more.

https://www.es.net/scienceconversation/

ESnet’s Data Mobility Exhibition: Moving to petascale with the research community

Research and Education Network (REN) capacity planning and user requirements differ from those faced by commodity internet service providers for home users. One key difference is that scientific workflows can require the REN to carry large, unscheduled, high-volume data transfers, or “bursts” of traffic. Experiments may be impossible to duplicate, and even one underperforming network link can cause the entire data transfer to fail.  Another set of challenges stems from the federated nature of scientific collaboration and networking. Because network performance standards cannot be centrally enforced, performance is obtained as a result of the entire REN community working together to identify best practices and resolve issues.  For example:

  • Data Transfer Nodes (DTNs), which connect network endpoints to local data storage systems, are owned by individual institutions, facilities, or labs. DTNs can be deployed with various equipment configurations, with local or networked storage configurations, and connected to internal networks in many different ways.
  • Research institutions have diverse levels of resources and varied data transfer requirements; DTNs and local networks are maintained and operated based on these local considerations.
  • Devising performance benchmarks for “how fast a data transfer should be” is difficult as capacity, flexibility, and general capabilities of networks linking scientists and resources constantly evolve and are not consistent across the entire research ecosystem.

ESnet has long been focused on developing ways to streamline workflows and reduce network operational burdens on scientific programs, researchers, and others, both for those we directly serve and on behalf of the entire R&E network community.  Building on the successful Science DMZ design pattern and the Petascale DTN project, the Data Mobility Exhibition (DME) was developed to improve the predictability of data movement between research sites and universities. Many sites use perfSONAR to test end-to-end network performance. The DME allows sites to take this a step further and test end-to-end data transfer performance.

DME is a resource that enables the calibration of data transfer performance for a site’s DTNs, using ESnet’s own test environment at scale, to ensure that they are performing well. As part of the DME, system/storage administrators and network engineers have a wide variety of resources available to analyze data transfer performance against ESnet’s standard DTNs, obtain help from ESnet Science Engagement (or, for universities, from the Engagement and Performance Operations Center) to tune equipment, and share performance data and network designs with the community to help others.  For instance, a 10 Gbps DTN should be capable of transferring, at a minimum, one terabyte per hour. However, we would like to see DTNs faster than 10G, or clusters of 10G DTNs, transfer at petascale rates of 6 TB/hr or 1 PB/week.
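The targets above are easier to compare once they are converted to a common unit. The short sketch below does that arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes) and ignoring protocol and storage overhead. The helper names are made up for this example.

```python
TB = 1e12            # bytes, decimal terabyte (assumed convention)
PB = 1e15            # bytes, decimal petabyte
HOUR = 3600          # seconds
WEEK = 7 * 24 * HOUR


def gbps_for(volume_bytes, seconds):
    """Sustained rate in Gbit/s needed to move a volume in a given time."""
    return volume_bytes * 8 / seconds / 1e9


def tb_per_hour(gbps):
    """Volume in TB moved in one hour at a sustained rate in Gbit/s."""
    return gbps * 1e9 / 8 * HOUR / TB


print(f"1 TB/hour  needs about {gbps_for(TB, HOUR):.2f} Gbit/s")
print(f"6 TB/hour  needs about {gbps_for(6 * TB, HOUR):.2f} Gbit/s")
print(f"1 PB/week  needs about {gbps_for(PB, WEEK):.2f} Gbit/s")
print(f"10 Gbit/s line rate moves at most {tb_per_hour(10):.2f} TB/hour")
```

Read this way, the 1 TB/hour minimum is a little under a quarter of what a 10 Gbps link could carry in theory, while the petascale targets need more than 13 Gbit/s sustained, which is why they call for faster DTNs or a cluster of 10G nodes.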

Currently, the DME has geographically dispersed benchmarking DTNs in three research locations:

  • Cornell Center for Advanced Computing in Ithaca, NY, connected through NYSERnet
  • NCAR GLADE in Boulder, CO, connected through Front Range Gigapop
  • Petrel system at Argonne National Lab, connected through ESnet

Benchmarking DTNs are also deployed in two commercial cloud environments: Google Drive and Box.  All five DME DTNs can be used for both upload and download testing, allowing users to calibrate and compare their network’s data transfer performance. Additional DTNs are being considered for future capacity. Next generation ESnet6 DTNs will be added in FY22-23, supporting this data transfer testing framework.

DME provides calibrated data sets ranging in size from 100MB to 5TB, so that performance of different sized transfers can be studied. 

DOE scientists or infrastructure engineers can use the DME testing framework, built from the Petascale DTN model, with their peers to better understand the performance that institutions are achieving in practice. Here are examples of how past Petascale DTN data mobility efforts have helped large scientific data transfers:

  1. 768 TB of DESI data was sent automatically via Globus over ESnet between OLCF and NERSC in 20 hours. Despite the interruption of a maintenance activity at ORNL, the transfer was seamlessly reconnected without any user involvement.
  2. Radiation-damage-free high-resolution SARS-CoV-2 main protease SFX structures obtained at near-physiological temperature offer invaluable information for immediate drug-repurposing studies for the treatment of COVID-19. This work required near-real-time collaboration and data movement between LCLS and NERSC via ESnet.
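As a quick check on the scale of the first example, the average rate implied by 768 TB in 20 hours can be worked out directly. The sketch assumes decimal terabytes and treats the 20 hours as continuous transfer time, which overstates the time actually spent moving data if the maintenance interruption is included.

```python
TB = 1e12  # bytes, decimal terabyte (assumed convention)


def average_rate(volume_bytes, hours):
    """Return (TB per hour, Gbit/s) averaged over the whole transfer."""
    tb_per_hr = volume_bytes / TB / hours
    gbps = volume_bytes * 8 / (hours * 3600) / 1e9
    return tb_per_hr, gbps


tb_hr, gbps = average_rate(768 * TB, 20)
print(f"{tb_hr:.1f} TB/hour, about {gbps:.1f} Gbit/s on average")
```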

To date, over 100 DTN operators have used DME benchmarking resources to tune their own data transfer performance. In addition, the DME has been added to the NSF-funded Engagement and Performance Operations Center (EPOC) program’s six main scientific networking consulting support services, bringing this capability to a wide set of US Research Universities. 

As the ESnet lead for this project, I invite you to contact me for more info (consult@es.net). We also have information up on our knowledge-base website fasterdata.es.net. DME is an easy, effective way to ensure your network, data transfer, and storage resources are operating at peak efficiency! 

Three questions with a new staff member! Please meet Rémy Doucet

Rémy comes to us from ByteDance/TikTok where they worked as a Systems Engineer responsible for large-scale server allocation and bare-metal OS deployment.  They have worked as a systems engineer for five years, with experience both in the Telecom industry and for large social media companies.  Rémy began their career as a software developer in Python but shifted when they realized a passion for infrastructure and systems.  

Rémy Doucet

What brought you to ESnet?

I have a long history of activism and also worked in the nonprofit sector prior to my engineering career. I became dissatisfied working only for social media giants and began seeking a career that married my passion for technology with my drive to make a positive impact on the world. Climate change is the most pressing issue humans are facing today, so I am excited to begin contributing to a place that not only has an impressive legacy of scientific discovery, but is continuing to make strides in areas such as renewable and clean energy.

What is the most exciting thing going on in your field right now?

Although it is not exactly under my purview, I have always been fascinated by artificial intelligence. Not only will it continue to transform our society in unimaginable ways, but I am also curious to see how it will come to be used for systems administration tasks such as monitoring and deployment. Currently, these processes are still largely human- and automation-driven, but I think we will start to see more AI incorporated into the process in the future. For my personal interests, I enjoy experiencing art or music created by AI.

What book would you recommend?

Simulacra and Simulation by Jean Baudrillard. It is a philosophical treatise that I think will become increasingly relevant in our society.