Today, the world’s most powerful particle accelerator, the Large Hadron Collider (LHC), begins Run 3, a new period of data taking! This comes after more than three years of upgrade and maintenance work.
ESnet is looking forward to continuing to support the US LHC community for Run 3 through its connectivity to CERN. ESnet carries all LHC traffic from Europe to the US, both to the two Tier 1 centers at Fermilab and Brookhaven National Lab and to the physicists at Tier 2 centers in universities. You can see the list of LHCONE collaborators at my.es.net/lhcone/list.
View the traffic on our trans-Atlantic links live at my.es.net.
The WINS program was developed in 2015 to combat the gender gap in the network engineering and high performance computing fields. WINS is a joint effort between the Department of Energy’s Energy Sciences Network (ESnet), the Keystone Initiative for Network Based Education and Research (KINBER), and the University Corporation for Atmospheric Research (UCAR) and works collaboratively with the SC program committee.
What the program entails
If selected for the program, you’ll be matched with a SCinet team and a high-profile mentor based on your interests and background. You’ll also get to dive in and work side-by-side with top engineers building SCinet.
Those selected for the program will also receive:
Travel funds for attending staging, setup, and live support of the SC conference as a SCinet volunteer.
Complimentary conference registration
Professional development support before, during, and after the conference
Who should apply
Early- and mid-career engineers and technologists who:
Want to work side-by-side with the world’s leading network, software, and systems engineers and top network technology vendors.
Identify as women at the time of application.
Are able to travel to Dallas, TX during the following dates (assuming COVID doesn’t interfere):
SCinet Staging: Oct. 20-28, 2022
SCinet Setup: Nov. 7-13, 2022
SCinet Live Operations/SC22 Conference and SCinet teardown: Nov. 13-19, 2022
WINS is especially interested in applications from historically underrepresented groups in the Information Technology field, including Black or African-American, American Indian or Alaska Native, Hispanic or Latinx women.
Learn more and submit your application here. Applications are due by January 21, 2022, at 11:59 pm. If you want to participate in SCinet but don’t fit the above criteria, you can contact SCinet to learn more about other volunteer opportunities.
As part of a team spanning 15 government, academic, and industrial partners, the Engagement and Performance Operations Center (EPOC) – a collaboration between Indiana University and ESnet – was awarded the “Best HPC Collaboration (Academia/Government/Industry)” HPCwire Readers’ Choice award on Tuesday, Nov. 16. The award, which was presented at the High Performance Computing, Networking, Storage and Analysis (SC21) conference, recognizes the effort and collaboration required to move and safeguard irreplaceable data (over 50 years of astronomical observations) from the Arecibo Observatory following the structural collapse of this scientific resource in 2020.
At ESnet, Ken Miller, George Robb, and Jason Zurawski supported these efforts as both full members of EPOC and ESnet staff. Jason and Ken are both with ESnet’s Science Engagement Team, while George is with ESnet’s Infrastructure Systems group. LightBytes caught up with Jason Zurawski to get his thoughts on the project and award, and an update on the Arecibo effort since our post in April 2021 on this project.
Now that data from Arecibo has been migrated to the Texas Advanced Computing Center (TACC), what happens next, and how will this data be used?
The team at the University of Central Florida has been engaged with TACC on several ways to build up the capabilities for their data analysis and sharing requirements. They are working to deploy a portal that will allow researchers access to the data, as well as build workflows to investigate and process using computation provided by TACC.
The team at Arecibo is also still going to process much older data that still resides on tape. Due to the delicate state of the media, it is carefully being read and transferred to on-island storage before being transmitted to TACC for archiving. This work will take several more months to complete.
What do you think the lessons from this effort are in terms of getting so many different organizations to work together to support this very challenging problem?
The collapse that Arecibo experienced sent ripples through the R&E community because researchers and technology professionals alike knew there was a limited window to act on replicating important observations gathered over the years. The partners in this effort were motivated to act, and that removed many barriers to putting some solutions in place. Everyone collaborated efficiently with their core competencies, and we continue to work together as the next steps for the scientific collaboration are planned.
Plans are starting to emerge for a “next generation” Arecibo following the loss of this instrument. How might the next generation of data management resources be shaped by this collaboration?
Now that there has been some time to evaluate the work, the effort has also spurred UCF and Arecibo to plan for the future with respect to computation, storage, and network connectivity, both in Puerto Rico and in Florida. With these improvements planned, they will be well-positioned to serve the scientific data for years to come. New instruments will no doubt increase the data demands by many orders of magnitude – addressing all aspects of the data pipeline now, and then gradually increasing the capabilities over time, will help to prepare for these emerging challenges.
Congratulations to all of the organizations and staff who helped prevent the loss of this data!
Ever want to know how big research data moves around the globe? ESnet plays a significant role in supporting the great scientific conversations, collaborations, and experiments underway, wherever and whenever they occur. We move Exabytes of data around the world creating a global laboratory that accelerates scientific discovery.
To meet scientists’ needs, we are constantly looking for opportunities to expand our capabilities with our next-generation network ESnet6, intelligent edge analytics, advanced network testbeds, 5G wireless, quantum networking, and more.
Across the physical sciences, new instruments and capabilities are driving relentless growth in data production and in the need for high-speed networking and analysis resources.
ESnet stays on top of these trends via the Network Requirements Review process, which for the past 15 years has been a remarkable and useful collaboration between the DOE Office of Advanced Scientific Computing Research (ASCR), ESnet, and science programs across the DOE Office of Science.
The latest Network Requirements Review for the Office of Science High Energy Physics program office (HEP) is now available — among many other findings, this review confirms that the exponential growth of scientific data generation will continue unabated as we proceed into what may well be a new golden age for high energy physics research. Some samples include:
⇾ The upcoming High Luminosity era for the Large Hadron Collider (beyond 2027, or Run-4) will require multi-Tbps network speeds to support globally dispersed “Tier 1” HPC resources. Scientists will use the LHC to uncover how the Higgs boson interacts and gives mass to other particles, and explore emerging evidence for particle behaviors not explained by current physics models. Each data-taking year, the ATLAS and CMS experiments combined are expected to accumulate roughly 1 EB of new data, and it is estimated that complete data set sizes may routinely exceed 100 PB.
⇾ Scientists at the Deep Underground Neutrino Experiment (DUNE) in South Dakota and at Fermilab in Illinois, will use high speed data transfer to identify supernova events, as part of ongoing measurement of neutrino interactions. Supernovae measured by DUNE will generate over 200TB of compressed data per event, and Research and Educational Networks (REN) must be able to supply highly reliable, predictable data transfer capabilities to provide telescope targeting data to global arrays.
⇾ The Cosmic Microwave Background, Stage 4 (CMB-S4) experiment will require data management and transfer capabilities in some of the most demanding locations on earth. Operating two observational locations, and multiple telescopes with a combined total of 500,000 cryogenically-cooled superconducting detectors at the South Pole and in the Chilean Atacama Desert, CMB-S4 will provide an unprecedented picture back into the start of the Universe. Operating for seven years in these conditions, 22 TB (~8 TB at the South Pole and ~14 TB in Chile) of data will be generated daily, leading to an accrual of 3 PB annually, and as much as 100 PB over the full program lifecycle.
Network Requirements Reviews analyze the current, near, and long-term needs of the HEP community, providing a network and data-centric understanding of the scientific process used by the researchers and scientists. These requirements reviews drive ESnet’s investments in new services and capabilities, and enable ESnet to build strong partnerships with Office of Science (SC) programs, PIs, and user facilities. More information on this ESnet requirements review process can be found here.
We would like to thank the 13 HEP projects, and all of the HEP & DOE Office of Science collaborators who generously gave of their time, expertise, and most importantly, their enthusiasm for the future of high energy physics, as part of creating this report.
We want to especially thank the entire Science Engagement team plus Kate Robinson, and Dale Carder with our Network Engineering group who all provided outstanding support and technical expertise.
5.5 Questions with Eli Dart (ESnet), C.S. Chang, and Michael Churchill (PPPL)
In 2025, when the International Thermonuclear Experimental Reactor (ITER) generates “first plasma”, it will be the culmination of almost 40 years of effort. First started in 1985, the project has grown to include the scientific talents of seven members (China, EU, India, Japan, Korea, Russia, and the US, with EU membership bringing the total to 35 countries) and if successful, will mark the first time that a large scale fusion reactor generates more thermal power than is used to heat isotopes of hydrogen gas to a plasma state.
ESnet is supporting this international scientific community as this dream of limitless, clean energy is pursued. When operational at full capacity, ITER will generate approximately a petabyte of data per day, much of which will need to be analyzed and fed back in near real-time to optimize the fusion reaction and to manage distribution of data to a federated framework of geographically distributed “remote control rooms” (RCRs). To prepare for this demanding combination of data distribution and analytics, ESnet’s Eli Dart and the Princeton Plasma Physics Laboratory’s (PPPL) Michael Churchill and C.S. Chang recently co-authored a study of a test exercise performed with collaborators at Pacific Northwest National Laboratory (PNNL), PPPL, Oak Ridge National Laboratory (ORNL), and with the Korean KREONET, KSTAR, National Fusion Research Institute, and the Ulsan National Institute of Science and Technology. This study (https://doi.org/10.1080/15361055.2020.1851073) successfully demonstrated the use of ESnet and the Science DMZ architecture for trans-Pacific large data transfer, and for near real-time movie creation and analysis of the KSTAR electron cyclotron emission images, over multiple network paths at high sustained speeds.
Q 1: This was a complex test, involving several sites and analytic workflows. Can you walk our readers through the end-to-end workflow?
Eli Dart: The data were streamed from a system at KSTAR, encoded into ADIOS format, streamed to PPPL, rendered into movie frames, and visualized at PPPL. One of the key attributes of this workflow is that it is a streaming workflow. Specifically, this means that the data passes through the workflow steps (encoding in ADIOS format, transfer, rendering movie frames, showing the movie) without being written to non-volatile storage. This allows for performance improvements, because no time is spent on storage I/O. It also removes the restriction of storage allocations from the operation of the workflow – only the final data products need to be stored (if desired).
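To make the streaming idea concrete, here is a minimal Python sketch of a pipeline in which each stage hands data to the next in memory, so nothing is written to disk between steps. It is only an illustration under stated assumptions: the stage functions (`read_samples`, `encode`, `transfer`, `render`) are hypothetical stand-ins, the "encoding" is plain `struct` packing rather than ADIOS, and the "transfer" is a pass-through rather than a real network hop.

```python
import struct
from typing import Iterable, Iterator, List


def read_samples(n_chunks: int, chunk_size: int) -> Iterator[List[float]]:
    """Hypothetical instrument source: yields chunks of samples as they are produced."""
    for i in range(n_chunks):
        yield [float(i * chunk_size + j) for j in range(chunk_size)]


def encode(chunks: Iterable[List[float]]) -> Iterator[bytes]:
    """Stand-in for the encoding step: pack each chunk into bytes, in memory."""
    for chunk in chunks:
        yield struct.pack(f"<{len(chunk)}d", *chunk)


def transfer(messages: Iterable[bytes]) -> Iterator[bytes]:
    """Stand-in for the wide-area transfer: a real system would send and receive here."""
    for message in messages:
        yield message


def render(messages: Iterable[bytes]) -> Iterator[str]:
    """Stand-in for rendering: turn each received message into one 'frame' summary."""
    for index, message in enumerate(messages):
        values = struct.unpack(f"<{len(message) // 8}d", message)
        yield f"frame {index}: mean={sum(values) / len(values):.1f}"


def run_pipeline(n_chunks: int = 3, chunk_size: int = 4) -> List[str]:
    """Chain the stages; each chunk flows through all stages before the next is read."""
    return list(render(transfer(encode(read_samples(n_chunks, chunk_size)))))


if __name__ == "__main__":
    for frame in run_pipeline():
        print(frame)
```

Because the stages are generators, only one chunk is held in memory at a time, and only the final product (here, the list of frame summaries) would need to be stored if desired.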
Q 2: A big portion of this research supports the idea of federated, near real-time analysis of data. In order to make these data transfers performant, flexible, and adaptable enough to meet the requirements for a future ITER RCR, you had to carefully engineer and coordinate with many parties. What was the hardest part of this experiment, and what lessons does it offer ITER?
Eli Dart: It is really important to ensure that the network path is clean. By “clean” I mean that the network needs to provide loss-free IP service for the experiment traffic. Because the fusion research community is globally distributed, the data transfers cover long distances, which greatly magnifies the negative impact of packet loss on transfer performance. Test and measurement (using perfSONAR) is very important to ensure that the network is clean, as is operational excellence to ensure that problems are fixed quickly if they arise. KREONET is an example of a well-run production network – their operational excellence contributed significantly to the success of this effort.
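One way to see why distance magnifies the cost of packet loss is the widely used Mathis et al. approximation for steady-state TCP throughput, rate ≤ (MSS / RTT) × (C / √p), where p is the loss probability and C ≈ √(3/2). The Python sketch below applies that formula; it is not taken from the study discussed here, and the segment size, round-trip times, and loss rate are illustrative assumptions rather than measurements of any real path.

```python
import math


def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss_rate: float) -> float:
    """Upper bound on steady-state TCP throughput (bits/s) per the Mathis approximation."""
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    c = math.sqrt(3 / 2)  # constant from the simplified derivation
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    mss = 1460   # assumed segment size in bytes
    loss = 1e-4  # assumed loss rate: 1 packet in 10,000
    # Assumed round-trip times, chosen only to contrast short and long paths.
    for label, rtt in [("1 ms (local)", 0.001),
                       ("50 ms (continental)", 0.050),
                       ("150 ms (intercontinental)", 0.150)]:
        mbps = mathis_throughput_bps(mss, rtt, loss) / 1e6
        print(f"RTT {label}: at most {mbps:,.1f} Mbit/s per flow")
```

For a fixed loss rate the bound falls in proportion to the round-trip time, so the same small amount of loss that is tolerable on a local path can cap a long-distance flow at a small fraction of the link speed.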
Q 3: One of the issues you had to work around was a firewall at one institution. What was involved in working with their site security, and how should those working with Science DMZ work through these issues?
Eli Dart: Building and operating a Science DMZ involves a combination of technical and organizational work. Different institutions have different policies, and they need different levels of assurance depending on the nature of the work being done on the Science DMZ. The key is to understand that security policy is there for a reason, and to work with the parties involved in the context that makes sense from their perspective. Then, it’s just a matter of working together to find a workable solution that preserves safety from a cybersecurity perspective and also allows the science mission to succeed.
Q 4: How did you build this collaboration, and how did you keep everyone on the same page? Is there any advice you can offer other experiments facing the same need to coordinate multi-national efforts?
Eli Dart: From my perspective, this result demonstrates the value of multi-institution, multi-disciplinary collaborations for achieving important scientific outcomes. Modern science is complex, and we are increasingly in a place where only teams can bring all the necessary expertise to bear on a complex problem. The members of this team have worked together in smaller groups on a variety of projects over the years – those relationships were very valuable in achieving this result.
Q 5: In the paper you present a model for a federated remote framework workflow. Looking beyond ITER, are there other applications you can see for the lessons learned from this experiment?
C.S. Chang: Lessons learned from this experiment can be applied to many other distributed scientific, industrial, and commercial applications which require collaborative data analysis and decision making. We do not need to look too far. Expensive scientific studies on exascale computers will most likely be collaborative efforts among geographically distributed scientists who want to analyze the simulation data and share/combine the findings in near-real-time for speedy scientific discovery and for steering of ongoing or next simulations. The lessons learned here can influence the remote collaboration workflow used in high energy physics, climate science, space physics, and others.
Q 5.5: What’s next? You mention quite a number of possible follow-on activities in the paper. Which of these most interest you, and what might follow?
Michael Churchill: Continued work by this group has led to the recently developed open-source Python framework, DELTA, for streaming data from experiments to remote compute centers. It uses ADIOS for streaming over wide-area networks and, on the receiver side, uses asynchronous Message Passing Interface (MPI) to process the data streams in parallel. We’ve used this to stream data from KSTAR to the NERSC Cori supercomputer and complete a spectral analysis in parallel in less than 10 minutes, a job that would normally take 12 hours in serial. Frameworks such as this, which connect experiments to remote high-performance computers, will expand the quality and quantity of analysis workflows that experimental scientists can run. It’s exciting to see how this can help accelerate the progress of science around the world.
Congratulations on your success! This is a significant step forward in building the data management capability that ITER will need.
Two graduate students working with ESnet have recently published papers in IEEE and ACM workshops.
Bibek Shrestha, a graduate student at the University of Nevada, Reno, and his advisor Engin Arslan worked with Richard Cziva from ESnet to publish a paper titled “INT Based Network-Aware Task Scheduling for Edge Computing”. In the paper, Bibek investigated the use of in-band network telemetry (INT) for real-time in-network task scheduling. Bibek’s experimental analysis using various workload types and network congestion scenarios revealed that enhancing task scheduling of edge computing with high-precision network telemetry can lead to up to a 40% reduction in data transfer times and up to a 30% reduction in total task execution times by favoring edge servers in uncongested (or mildly congested) sections of the network when scheduling tasks. The paper will appear in the 3rd Workshop on Parallel AI and Systems for the Edge (PAISE), co-located with the IEEE IPDPS 2021 conference to be held on May 21st, 2021, in Portland, Oregon.
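The general idea of favoring servers behind uncongested paths can be sketched in a few lines of Python. This is not the algorithm from the paper: the `EdgeServer` fields, the cost model (transfer time plus compute time), and all the numbers are hypothetical, chosen only to show how a telemetry-derived bandwidth estimate could change a scheduling decision.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EdgeServer:
    name: str
    available_bps: float  # hypothetical: path bandwidth estimated from network telemetry
    compute_s: float      # hypothetical: estimated execution time on this server


def estimated_completion_s(server: EdgeServer, task_bytes: float) -> float:
    """Simple cost model: time to move the input data plus time to run the task."""
    return task_bytes * 8 / server.available_bps + server.compute_s


def pick_server(servers: Sequence[EdgeServer], task_bytes: float) -> EdgeServer:
    """Choose the server with the lowest estimated completion time."""
    return min(servers, key=lambda s: estimated_completion_s(s, task_bytes))


if __name__ == "__main__":
    servers = [
        EdgeServer("fast-cpu-congested-path", available_bps=100e6, compute_s=1.0),
        EdgeServer("slow-cpu-clear-path", available_bps=1e9, compute_s=2.0),
    ]
    for size in (1e5, 1e8):  # a small task and a data-heavy task, in bytes
        print(f"{size:.0e} bytes -> {pick_server(servers, size).name}")
```

With these made-up values, the small task goes to the faster CPU despite its congested path, while the data-heavy task goes to the server behind the clear path because transfer time dominates.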
Zhang Liu, a former ESnet intern and a current graduate student at the University of Colorado at Boulder, worked with the ESnet High Touch Team – Chin Guok, Bruce Mah, Yatish Kumar, and Richard Cziva – on fastcapa-ng, ESnet’s telemetry processing software. In the paper “Programmable Per-Packet Network Telemetry: From Wire to Kafka at Scale,” Zhang showed the scaling and performance characteristics of fastcapa-ng, and highlighted the most critical performance considerations that allow the pushing of 10.4 million telemetry packets per second to Kafka with only 5 CPU cores, which is more than enough to handle 170 Gbit/s of original traffic with 1512B MTU. This paper will appear in the 4th International Workshop on Systems and Network Telemetry and Analytics (SNTA 2021), held at the ACM HPDC 2021 conference in Stockholm, Sweden, from June 21 to 25, 2021.
Congratulations Bibek and Zhang!
If you are a networked systems research student looking to collaborate with us on network measurements, please reach out to Richard Cziva. If you are interested in a summer internship with ESnet, please visit this page.
The dramatic collapse of the Arecibo Observatory Radio Telescope in Puerto Rico in December 2020 was a terrible loss for global science. The 305-meter dish had served for over 50 years, supporting a wide range of cosmic and earth science applications, including transmission of the famed “Arecibo Message” to globular star cluster M13 by a team led by Frank Drake and Carl Sagan in 1974.
When the 900-ton instrument platform crashed onto the observatory dish, the National Science Foundation was faced with a variety of challenges. Most immediately, how to ensure that several petabytes of historic (and now irreplaceable) data at the Arecibo Observatory (AO) data center, in the form of tapes, hard drives, and other physical media, could be preserved and moved off-site as an approximately $50M site cleanup and environmental remediation project begins to demobilize the iconic observatory.
This data recovery effort has required rapid mobilization of a team from the University of Central Florida (UCF), the Texas Advanced Computing Center (TACC), the University of Puerto Rico (UPR), the University of Chicago, and others. A more detailed description of this overall effort has just been released here. In this blog, I will describe the key role that ESnet and the Engagement and Performance Operations Center (EPOC) played in this effort to save valuable scientific data.
My colleagues Hans Addleman (Indiana University International Networks), George Robb, and I became part of science use case discussions with AO and UCF as part of an ESnet requirements review and EPOC Deep Dive support to Arecibo in early 2020. In the summer of 2020, these efforts became much more active after the first suspension cable failed and AO began activities to migrate data storage and processing to a commercial cloud. We provided support to the Arecibo team for data movement hardware and software deployment.
With the failure of a second cable in November 2020, it became apparent that the facility had become unstable; this increased pressure on the team to find a faster solution. The UCF site management team decided that migration to the commercial cloud over the available 1Gbps connection (a previous 10Gbps connection was damaged by Hurricane Maria in 2017) would not meet requirements, so another data migration strategy was needed.
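A rough back-of-the-envelope calculation shows why link speed mattered so much here. The Python sketch below estimates how long a bulk transfer takes at a given line rate. The 1 PB figure is an illustrative unit rather than the actual size of the Arecibo holdings, and the `efficiency` parameter is a hypothetical knob for protocol and storage overhead.

```python
PB = 10**15  # one petabyte in bytes (decimal)


def transfer_days(data_bytes: float, link_bps: float, efficiency: float = 1.0) -> float:
    """Days needed to move data_bytes over a link, given the usable fraction of line rate."""
    if link_bps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_bps must be positive and efficiency in (0, 1]")
    seconds = data_bytes * 8 / (link_bps * efficiency)
    return seconds / 86_400


if __name__ == "__main__":
    for label, rate in [("1 Gbps", 1e9), ("10 Gbps", 10e9)]:
        print(f"1 PB over {label}: about {transfer_days(PB, rate):.1f} days at full line rate")
```

Even at a perfectly sustained 1 Gbps, each petabyte takes roughly three months to move, while a 10 Gbps path cuts that to a little over nine days, and real transfers rarely sustain full line rate.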
By December, the team had developed an alternative data migration approach leveraging a timely offer of storage capacity at TACC. Because of the urgency, the team decided to move data using physical Network Attached Storage (NAS) appliances; data on tapes and other original sources were loaded onto the NAS at Arecibo. The NAS were then driven to data centers on the island: either at the UPR campuses located at Mayagüez or Río Piedras, or at a commercial data center on the island, each of which was connected to the global R&E network at 10Gbps. Using Globus data transfer software, the AO team then began the process of transferring the data to TACC. Using multiple devices, and by setting up a constantly moving supply line, they were able to fill a disk, transport it to the better-connected locations, start a transfer, take back a completed disk, and return to the AO data center to start the process all over.
EPOC team members (specifically George) spent a lot of time working with AO and Globus technical support to tune the NAS appliances (which are usually used in commercial/enterprise settings) to be able to transfer the data at higher rates of performance than the factory settings allowed. EPOC, AO, UCF, TACC, and UPR staff also ran perfSONAR tests to ensure the entire path was able to deliver on these faster speeds that were necessary. George will be presenting a talk at Globus World in May, and those interested in more information about how this networking and disk NAS tuning was done should plan to attend.
The data transfer operation started in late December 2020 and is expected to continue through the spring of 2021, as stored data (on disks and tapes at AO) is transferred to TACC. As data flows into TACC’s storage cluster from Arecibo’s holdings, ESnet and the entire collaboration team will ensure that it is made widely available to the scientific community to perform new studies with this valuable research data.
The destruction of the AO Radio Telescope was a catastrophe for global science; however, the quick response of the entire data recovery team helped prevent the loss of much of the valuable data collected by Arecibo over its lifetime. I’m very proud of this accomplishment: the work of the entire ESnet team and our data infrastructure ensured that we had the right capabilities at the right time to make a difference for science.
As a Network Engineer at ESnet, I am no stranger to the importance of designing and maintaining a robust fiber-optic network. To operate a network that will “enable and accelerate scientific discovery by delivering unparalleled network infrastructure, capabilities, and tools,” ESnet has acquired an impressive US continental footprint of more than 21,000 kilometers of leased fiber-optic cable. We spend a great deal of effort designing and sourcing redundant fiber-optic paths to support network data connectivity between scores of DOE Office of Science facilities and research collaborators across the country.
But network data transfer is only one of the uses for fiber-optic cable. Could buried fiber-optic cable be used for some truly “ground-shaking” science? The answer is “Yes, absolutely!” – and I was fortunate to play a part in exploring new uses for fiber-optic cable networks this past year.
Back in 2017, the majority of our 21,000 km fiber footprint was still considered “dark fiber,” meaning it was not yet in use. At that time, ESnet was actively working on the design to upgrade from our current production network “ESnet5” to our next-generation network “ESnet6,” but we hadn’t yet put our fiber into production.
At the same time, Dr. Jonathan Ajo-Franklin, then-graduate students Nate Lindsey and Shan Dou, and the Berkeley Lab’s Earth and Environmental Science Area (EESA) were exploring the use of distributed acoustic sensing (DAS) technology to detect seismic waves by using laser pulses across buried fiber-optic cable. The timing was perfect to try to expand on the short-range tests that Dr. Ajo-Franklin and his team had been performing at the University of California’s Richmond Field Station by using a section of the unused ESnet dark fiber footprint in the West Sacramento area for more extensive testing. ESnet’s own Chris Tracy worked with Dr. Ajo-Franklin and team to demonstrate how the underground fiber-optic cables running from West Sacramento northwest toward Woodland in California’s Central Valley made an excellent sensor platform for early earthquake detection, monitoring groundwater, and mapping new sources of potential geothermal energy.
Fast forward to May 2019, and Dr. Ajo-Franklin was heading up a new collaborative scientific research project for the DOE’s Geothermal Technology Office based on his prior DAS experimentation successes using ESnet fiber. The intent was to map potential geothermal energy locations in the California Imperial Valley south of the Salton Sea, near Calipatria and El Centro. The team, including scientists in EESA, Lawrence Livermore National Laboratory (LLNL), and Rice University needed a fiber path to conduct the experiment. It would make sense to assume that ESnet’s fiber footprint, which runs through that area, would be an excellent candidate for this experiment. Fortunately for ESnet’s other users, but unfortunately for the DAS team, by 2018 the ESnet6 team was already “lighting” this previously dark fiber.
However, just because ESnet fiber in the Imperial Valley was no longer a candidate for DAS-based experiments, that didn’t mean there weren’t ways to gain access to unused dark fiber. For every piece of fiber that has been put into production to support ESnet6, there are dozens if not hundreds of other fibers running right alongside it. When fiber-optic providers install new fiber paths, they pull large cables consisting of many individual fibers to lease or sell to as many customers as possible. Because the ESnet fiber footprint was running right through the Imperial Valley, we knew that there was likely unused fiber in the ground, and only had to find a provider that would be willing to lease a small section to Berkeley Lab for Dr. Ajo-Franklin’s experiment.
Making the search a little more complicated, the DAS equipment utilized for this experiment has an effective sensing range that is limited to less than 30 kilometers. Most fiber providers expect to lease long sections of fiber connecting metropolitan areas. For example, the fiber circuits that run through the Imperial Valley are actually intended to connect metropolitan areas of Arizona to large cities in Southern California. Finding a provider that would be willing to break up a continuous 600 km circuit connecting Phoenix to Los Angeles just to sell a 30 km piece for a year-long research project would be a difficult task.
One of my contributions to the ESnet6 project was sourcing new dark fiber circuits and data center colocation spaces to “fill out” our existing footprint and get ready for our optical system deployments. Because of those efforts, I knew that there were often entire sections of fiber that had been damaged across the country and would likely not be repaired until there was a new customer that wanted to lease the fiber. I was asked to assist Dr. Ajo-Franklin and his team to engineer a new fiber solution for the experiment. I just had to find someone willing to lease us one of these small damaged sections.
After speaking with many providers in the area, the communications company Zayo was able to find a section of fiber starting in Calipatria, heading south through El Centro and then west to Plaster City, that was a great candidate for DAS use. This section of fiber had been accidentally cut near Plaster City and was considered unusable for networking purposes. Working with Zayo, we were able to negotiate a lease on this “broken” fiber span along with a small amount of rack space and power to house the DAS equipment that Dr. Ajo-Franklin’s team would need to move forward with their research.
This cut fiber segment was successfully “turned up” for the project on November 10, 2020 by a team including Co-PI Veronica Rodriguez Tribaldos, Michelle Robertson, and Todd Wood (EESA/LBNL), and seismic data collection equipment is now up and running. The figure above (D) shows some great initial data recorded on the array, a small earthquake many miles to the north. There will be many more articles and reports from the Imperial Valley Dark Fiber Team as they continue to gather data and perform their experiments, and I’m sure we’ll begin to see fiber across the country put to use for this type of sensing and research.
I’ve had a great experience working with the different groups that were assembled for this project. By seeing how new technologies and methods are being developed to use fiber-optic cable for important research outside of computing science, I’ve developed a greater appreciation for how our labs and universities are tackling some of our biggest energy and public safety challenges.
“The original Science DMZ model provided a way of securing high-throughput data transfer applications without the use of enterprise firewalls,” says Dart. “You can protect data transfers using technical controls that don’t impose performance limitations.”