Eashan Adhikarla is pursuing a Ph.D. at Lehigh University and joined my group this summer to work on our “DTN as a Service” project. He contributed a lot of energy and novel insights into our work over the summer and I hope we have the opportunity to collaborate again in the near future. Here are some thoughts from Eashan on the summer student experience at ESnet.
This is my second internship at Berkeley Lab and my first at the Scientific Networking Division (SND). It has been full of excitement, thrills, challenges, and surprises, and it is a dream place to be.
This summer, I have been working at the intersection of machine learning and high-performance computing in data transfer nodes (DTNs). ESnet connects 40 DOE sites to 140 other networks and therefore handles a high volume of data transfers ranging from megabytes to petabytes. The team is designing DTN-as-a-Service (DTNaaS), whose goal is to deploy and optimize data movement across various sites. Managing transmission control protocol (TCP) flows is a key factor in achieving good transfer performance over a wide range of network infrastructure. My research helps automate DTN performance tuning via machine learning, thus improving the overall DTNaaS framework.
At present, most DTN software is deployed on bare-metal servers, limiting the flexibility for operational configuration changes and for automating transfer configurations. Manually inferring the best tuning parameters for a dynamic network is a challenge. To optimize throughput over a TCP flow, we currently often use a pacing-rate function to control packet inter-arrival time. Part of my work proposes two alternative approaches (supervised and sparse regression-based models) to better predict the pacing rate, as well as to automate changes to related DTN settings based on the nature of the transfers.
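On Linux, per-socket pacing of the kind described above can be applied with the SO_MAX_PACING_RATE socket option, which the fq queueing discipline enforces as inter-packet spacing. The sketch below is illustrative only, not the project's actual traffic control API; the helper name and the fixed 1 Gbit/s rate are assumptions for demonstration, and the numeric fallback 47 is the Linux value of the option for Python builds that do not expose the constant.

```python
import socket

# SO_MAX_PACING_RATE (Linux, SOL_SOCKET option 47) caps the kernel's
# transmit pacing rate for this socket, in bytes per second; the fq
# qdisc turns that cap into inter-packet spacing.
SO_MAX_PACING_RATE = getattr(socket, "SO_MAX_PACING_RATE", 47)

def set_pacing_rate(sock: socket.socket, rate_bytes_per_sec: int) -> int:
    """Apply a pacing cap and return the value the kernel reports back."""
    sock.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, rate_bytes_per_sec)
    return sock.getsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE)

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Pace the flow to ~1 Gbit/s (125 MB/s); in the DTNaaS setting a
    # model-predicted rate would be plugged in instead of a constant.
    applied = set_pacing_rate(s, 125_000_000)
    print(f"pacing rate applied: {applied} B/s")
    s.close()
```

Because the cap is per-socket, a predicted rate can be applied to each transfer flow independently without reconfiguring the host's qdisc.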
Overall, my summer research gave me experience across a broad set of networking areas:
Improving the DTN-as-a-Service agent traffic control API with profiles and setting pacing
Creating a method for statistics retrieval for the harness toolkit for dynamic data visualization and analysis, and preparing these statistics to train the pacing model
Developing a pacing prediction approach that eliminates much of the effort of manual pacing-rate configuration
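The pacing prediction step in the list above can be sketched as a small regression problem. This is a minimal illustration, not the project's actual model: the features (RTT, loss rate, concurrent flows) and the synthetic training data are assumptions, and plain ordinary least squares stands in for the supervised and sparse regression models mentioned earlier.

```python
import numpy as np

# Illustrative features: [RTT (ms), packet loss rate, concurrent flows].
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(1, 100, 200),    # RTT in ms
    rng.uniform(0, 1e-3, 200),   # observed packet loss rate
    rng.integers(1, 16, 200),    # concurrent flow count
])

# Synthetic "ground truth" pacing rate (bytes/s), for demonstration only:
# start near 1 Gbit/s and back off with RTT, loss, and flow contention.
true_w = np.array([-2e5, -5e9, -1e6])
y = 1.25e8 + X @ true_w + rng.normal(0, 1e4, 200)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_pacing_rate(rtt_ms: float, loss: float, flows: int) -> float:
    """Predicted pacing rate in bytes/s for one transfer's conditions."""
    return float(coef @ [1.0, rtt_ms, loss, flows])
```

A learned mapping like this replaces hand-tuned pacing tables: each new transfer's observed conditions go in, and a pacing rate for the traffic control API comes out.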
I was also able to contribute to a separate team’s project on exploring the use of network congestion control algorithms for DTNs; the resulting paper will be submitted to an SC21 workshop.
For me, one of the best things about ESnet is that summer interns get to work directly with accomplished research scientists and engineers in the lab, learning a variety of skills to tackle the most challenging problems at real-world scale. It's a place from which I always come out a better version of myself.
ESnet has recently completed an experiment testing high-performance, file-based data transfers using Data Transfer Nodes (DTNs) on the 100G ESnet Testbed. Within ESnet, we are prototyping new ways to provide optimized, on-demand data movement tools to our network users. One such potential new data movement tool is offered by Zettar, Inc. Zettar's "zx" product integrates with several storage technologies and provides an API for automation. This data movement experiment allowed us to test the use of tools like zx on our network.
Two 100Gbps-capable DTNs were deployed on the ESnet Testbed for this work, each with 8 x NVMe SSDs for fast disk-to-disk transfers, and connected by a network path with an approximately 90ms round trip time. As many readers are aware, this combination of fast storage and fast networking requires careful tuning from both a file I/O and a network protocol standpoint to achieve expected end-to-end transfer rates, and this evaluation was no exception. With the help of a storage throughput baseline established using the freely available elbencho tool, a single tuning profile for zx was found that struck an impressive performance balance when moving a sweep of hyperscale data sets (>1TB total size, >1M total files, or both; see figure below) between the testbed DTNs.
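The tuning challenge on this path can be quantified with the bandwidth-delay product: at 100 Gbit/s over ~90 ms RTT, TCP must keep over a gigabyte in flight to sustain line rate, which is why buffer and protocol tuning matter so much here. A quick back-of-the-envelope calculation:

```python
# Bandwidth-delay product for the testbed path described above.
# Sustaining line rate requires roughly this much unacknowledged data
# in flight, so TCP buffers must be sized to at least the BDP.
link_bps = 100e9   # 100 Gbit/s link
rtt_s = 0.090      # ~90 ms round trip time

bdp_bytes = link_bps / 8 * rtt_s
print(f"BDP: {bdp_bytes / 1e9:.3f} GB")  # ~1.125 GB in flight at line rate
```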
To keep things interesting, the DTN software under evaluation was configured and launched within Docker containers, both to understand any performance and management impacts and to establish a potential use case for more broadly deploying DTNs-as-a-Service using containerization approaches. Spoiler: the testing was a great success! When configured appropriately, our evaluation showed that modern container namespaces using performance-oriented Linux networking have little to no impact on achievable storage and network performance at the 100Gbps scale, while enabling a great deal of potential for distributed deployment of DTNs. More critically, service orchestration and automation become the next great challenge when considering any large-scale deployment of dynamic data movement endpoints.
When properly provisioned and configured, a containerized environment has a high potential to provide an optimized, on-demand data movement service.
Data movers such as zx demonstrate that when modern TCP is used efficiently to move data at scale and speed, network latency becomes less of a factor: the same data rates are attainable over LAN, metro, and WAN paths as long as packet loss rates can be kept effectively low.
Finally, creating a holistic data movement solution demands integrated consideration of storage, computing, networking, and highly concurrent and intrinsically scale-out data mover software that incorporates a proper understanding of the variety in data movement scenarios.
For more information, a project report detailing the testing environment, performance comparisons, and best practices may be found here.