ESnet Scientists awarded best paper at SC21 INDIS!

2021-11-222021-11-22 ~ ESnetComms

A combined team from ESnet and Lehigh University was awarded the best paper for Exploring the BBRv2 Congestion Control Algorithm for use on Data Transfer Nodes at the 8th IEEE/ACM International Workshop on Innovating the Network for Data-Intensive Science (INDIS 2021), which was held in conjunction with the 2021 IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC21) on Monday, November 15, 2021.

The team was comprised of:

Brian Tierney, Energy Sciences Network (ESnet)
Eli Dart, Energy Sciences Network (ESnet)
Ezra Kissel, Energy Sciences Network (ESnet)
Eashan Adhikarla, Lehigh University

The paper can be found here. Slides from the presentation are here. In this Q+A, ESnet spoke with the award-winning team about their research — answers are from the team as a whole.

The paper is based on extensive testing and controlled experiments with the BBR (Bottleneck Bandwidth and Round-trip propagation time), BBRv2 and the Cubic Function Binary Increase Congestion Control (CUBIC) Transmission Control Protocol (TCP) Internet congestion algorithms. What was the biggest lesson from this testing?

BBRv2 represents a fundamentally different approach to TCP congestion control. CUBIC (as well as Hamilton, Reno, and many others) are loss-based, meaning that they interpret packet loss as congestion and therefore require significant network engineering effort to achieve high performance. BBRv2 is different in that it measures the network path and builds a model of the path – it then paces itself to avoid loss and queueing. In practical terms, this means that BBRv2 is resilient to packet loss in a way that CUBIC is not. This comes through loud and clear in our data.

What part of the testing was the most difficult and/or interesting?

We ran a large number of tests in a wide range of scenarios. It can be difficult to keep track of all the test configurations, so we wrote a “test harness” in python that allowed us to keep track of all the testing parameters and resulting data sets.

The harness also allowed us to better compare results collected over real-world paths to those in our testbed environments. Managing the deployment of the testing environment though containers also allowed for rapid setup and improved reproducibility.

You provide readers with links to great resources so they can do their own testing and learn more about BBRv2. What do you hope readers will learn?

We hope others will test BBRv2 in high-performance research and education environments. There are still some things that we don’t fully understand, for example there are some cases where CUBIC outperforms BBRv2 on paths with very large buffers. It would be great for this to be better characterized, especially in R&E network environments.

What’s the next step for ESnet research into BBRv2? How will you top things next year?

We want to further explore how well BBRv2 performs at 100G and 400G. We would also like to spend additional time performing a deeper analysis of the current (and newly generated) results to gain insights into how BBRv2 performs compared to other algorithms across varied networking infrastructure. Ideally we would like to provide strongly substantiated recommendations on where it makes sense to deploy BBRv2 in the context of research and educational network applications.

Arecibo Support Wins SC21 HPCwire Readers’ Choice Award!

2021-11-16 ~ ESnetComms

As part of a team spanning 15 government, academic, and industrial partners, the Engagement and Performance Operations Center (EPOC) – a collaboration between Indiana University and ESnet – was awarded the “Best HPC Collaboration (Academia/Government/Industry)” HPCwire Readers’ Choice award on Tuesday, Nov. 16. The award, which was made at the High Performance Computing, Networking, Storage and Analysis (SC21) conference, recognizes the effort and collaboration required to move and safeguard irreplaceable data (over 50 years of astronomical observations) from the Arecibo observatory following the structural collapse of this scientific resource in 2016.

At ESnet, Ken Miller, George Robb, and Jason Zurawski supported these efforts as both full members of EPOC and ESnet staff. Both Jason and Ken divide their time between ESnet’s Science Engagement Team, while George is with ESnet’s Infrastructure Systems group. LightBytes looped up with Jason Zurawski to get his thoughts on the project and award, and an update on the Arecibo effort since our post in April 2021 on this project.

Now that data from Arecibo has been migrated to the Texas Advanced Computing Center (TACC), what happens now, and how will this data be used?

The team at the University of Central Florida has been engaged with TACC on several ways to build up the capabilities for their data analysis and sharing requirements. They are working to deploy a portal that will allow researchers access to the data, as well as build workflows to investigate and process using computation provided by TACC.

The team at Arecibo is also still going to process much older data that still resides on tape. Due to the delicate state of the media, it is carefully being read and transferred to on-island storage before being transmitted to TACC for archiving. This work will take several more months to complete.

What do you think the lessons from this effort are in terms of getting so many different organizations to work together to support this very challenging problem?

The collapse that Arecibo experienced sent ripples through the R&E community because researchers and technology professionals alike knew there was a limited window to act on replicating important observations gathered over the years. The partners in this effort were motivated to act, and that removed many barriers to putting some solutions in place. Everyone collaborated efficiently with their core competencies, and we continue to work together as the next steps for the scientific collaboration are planned.

Plans are starting to emerge for a “next generation” Arecibo based on the loss of this instrument, how might the next generation of data management resources be shaped by this collaboration?

Now that there has been some time to evaluate the work, it has also spurred UCF and Arecibo to plan for the future with respect to computation, storage, and network connectivity both in Puerto Rico and in Florida. With these improvements planned, they will be well-positioned to serve the scientific data for years to come. New instruments will no doubt increase the data demands by many orders of magnitude – addressing all aspects of the data pipeline now, and then gradually increasing the capabilities over time, will help to prepare for these emerging challenges.

Congratulations to all of the organizations and staff who helped prevent the loss of this data!

Making the Research and Educational Community SAFER: Adam Slagell on the creation of a new global collaboration to combat cyberthreats.

2021-11-102021-11-10 ~ ESnetComms

Adam Slagell is ESnet’s Chief Security Officer and a founding member of the newly formed Security Assistance For Education & Research (SAFER) trust group.

SAFER is an operational security entity focused on fighting computer misuse and defending the academic, research, and education (R&E) mission globally. SAFER brings together expertise and resources from organizations across the Research and Educational cybersecurity community, including CERN, DFN-CERT, ESET, ESnet, LBNL, STFC, and WLCG.

More information can be found here https://www.safer-trust.org/.

What motivates the creation of SAFER and what do you think success will look like for the community?

There are many cybersecurity trust groups out there, some even dedicated to R&E like REN-ISAC or XSEDE’s trust group consisting of current and former Teragrid and XSEDE site members. However, there really isn’t anything like this—both permanent and truly international— even though attacks are almost always transnational. So each time there is a new, major campaign, an international group connecting all these regional responders must be created again. What we are trying to do is create that permanent backbone with a core set of highly connected individuals who are a part of these regional and project-specific trust groups around the world.

If we are successful, we will see several things. First, I believe we will see more international cooperation and information sharing, leading to an earlier notice of new attack campaigns. Second, we will be able to activate a response more quickly, pulling in the expertise needed from a broad pool of SAFER members and their trusted colleagues. Finally, it is our hope that we can provide surge capabilities when a member is under attack. Many R&E organizations have limited resources and small teams. It is a tremendous asset if they can get help from their peers, maybe with unique expertise as they are facing a disruptive attack.

What kind of security resources will SAFER provide?

I alluded to some of the services when discussing what success will look like. But ultimately, our security resources will be determined by community needs. The founding members will serve as the steering committee for the first year until we elect the next steering committee.

One of our first-steps is setting up a Malware Information Sharing Platform (MISP) instance to share Indicators of Compromise, e.g., IP addresses, file hashes, domain names, etc. Usually, there is no requirement for members to share such data as the rules and regulations differ so much across organizations. But even on day one, we will have enough organizations that can contribute to making this service useful.

There is also a secure messaging and chat service using decentralized cryptography that all of our members can participate in. These ad hoc conversations about what people are seeing on their networks will hopefully help detect trends early.

Finally, many of the founding members have more resources from these large institutions, and I believe we can quickly help those projects and institutions that might struggle with an attack by providing our expertise while helping to train the next generation of security professionals.

What excites you most about this effort and what is the opportunity to do the most good?

I love the community-building aspect. In a past life, I created the Bro (now Zeek) Leadership Team and really worked hard to build a vibrant community around that software. I think this expertise is where I can be most helpful as I am less technical in my roles today.

I will also say, I am excited about getting young people involved, too. Organizations who contribute time from their teams will really benefit. There is no training for an incident response like jumping in, and I expect the variety of issues we will see will prove very useful just from a training and development perspective.

LBL has a long history supporting cybersecurity research, from the early days of Clifford Stoll and The Cuckoo’s Egg to the creation of Bro. What does the future of cybersecurity look like, and how will that shape the REN community?

Indeed, LBL’s security team is also a SAFER founding member. One of the things I love about working here and at ESnet is that our mission is outward-focused and when we help the community we raise all boats so to speak.

Fortune telling however is a dangerous game. We have anticipated some things, like cryptocurrency mining coming to HPCs. However, the threat landscape and tools available keep changing. That is part of what makes this job interesting. The important thing that I hope we keep in mind is that security is not done for its own sake, but to enable our mission of scientific research. To me, this means that we must always work to make risk-based security decisions, even when that might challenge pushes for compliance and simple one-size-fits-all solutions.

Next Generation ESnet6 Routers Installed and Accepted!

2021-11-022021-11-02 ~ ESnetComms

ESnet6 took a major step forward last week with the completed installation and acceptance of all 40 “greenfield” routers on the network backbone. These new routers will enable ESnet to operate at speeds up to 400 Gbps across our national fiber network, and provide the backbone infrastructure behind our next generation scientific data mobility capabilities.

*A new ESnet6 backbone router in its native habitat.*

The installation and acceptance process at each location across the continental US required careful coordination between subcontractors, colocation facility personnel, Lab site staff, and multiple teams across ESnet. Following local health regulations and access requirements, ESnet arranged physical access for the subcontractors at each location and all parties participated in a turn-up conference call as the routers were installed and brought online..

In addition to networking capabilities, the ESnet6 team implemented new software automation capabilities simplifying the installation and acceptance process. These capabilities included enhancements to the ESnet inventory system to support bulk planning data import, automatic bill of materials generation, automatic site survey generation, and automated generation of all backbone links within the network. In addition, the team introduced new workflow orchestration, automated provisioning, and inventory discovery capabilities to help with the installation process.

The acceptance of the ESnet6 greenfield routers is a major milestone for the ESnet6 Project and the team has already migrated a significant portion of customer traffic onto the new routers. Despite the extra challenges presented by the COVID-19 pandemic, the project has made steady progress and is on track to finish ahead of schedule.