The authors found that there is significant room for improvement in the data transfer capabilities currently in place for CMIP5, both in terms of workflow mechanics and in data transfer performance. In particular, the paper notes that performance improvements of at least an order of magnitude are within technical reach using current best practices.
ESnet’s Eli Dart (left), Salman Habib (center) of Argonne National Lab and Joel Brownstein of the University of Utah compare ideas during a workshop break.
ESnet and Internet2 hosted last week’s CrossConnects Workshop on “Improving Data Mobility & Management for International Cosmology,” a two-day meeting ESnet Director Greg Bell described as the best one yet in the series. More than 50 members of the cosmology and networking research community turned out for the event hosted at Lawrence Berkeley National Laboratory, while another 75 caught the live stream from the workshop.
The Feb. 10-11 workshop provided a forum for discussing the growing data challenges associated with the ever-larger cosmological and observational data sets, which are already reaching the petabyte scale. Speakers noted that network bandwidth is no longer the bottleneck into the major data centers, but storage capacity and performance from the network to storage remain a challenge. In addition, network connectivity to telescope facilities is often limited and expensive due to the remote location of the facilities. Science collaborations use a variety of techniques to manage these issues, but improved connectivity to telescope sites would have a significant scientific benefit in many cases.
In his opening keynote talk, Peter Nugent of Berkeley Lab’s Computational Research Division said that astrophysics is transforming from a data-starved to a data-swamped discipline. Today, when searching for supernovae, one object in the database consists of thousands of images, each 32 MB in size. That data needs to be processed and studied quickly so when an object of interest is found, telescopes around the world can begin tracking it in less than 24 hours, which is critical as the supernovae are at their most visible for just a few weeks. Specialized pipelines have been developed to handle this flow of images to and from NERSC.
Salman Habib of Argonne National Laboratory’s High Energy Physics and the Mathematics and Computer Science Divisions opened the second day of the workshop, focused on cosmology simulations and workflows. Habib leads DOE’s Computation-Driven Discovery for the Dark Universe project. Habib pointed out that large-scale simulations are critical for understanding observational data and that the size and scale of simulation datasets far exceed those of observational data. “To be able to observe accurately, we need to create accurate simulations,” he said. Simulations will soon create 100 petabyte sets of raw data, and the limiting factor for handling these will be the amount of available storage, so smaller “snapshots” of the datasets will need to be created. And while one person can run the simulation itself, analyzing the resulting data will involve the whole community.
Reijo Keskitalo of Berkeley Lab’s Computational Cosmology Center described how computational support for the Planck Telescope has relied on HPC to generate the largest and most complete simulation maps of the cosmic microwave background, or CMB. In 2006, the project was the first to run on all 6,000 CPUs of Seaborg, NERSC’s IBM flagship at the time. It took six hours on the machine to produce one map. Now, running on 32,000 CPUs on Edison, the project can generate 10,000 maps in just one hour.
Mike Norman, head of the San Diego Supercomputer Center, offered that high performance computing can become distorted by “chasing the almighty FLOP,” or floating point operations per second. “We need to focus on science outcomes, not TOP500 scores.”
Over the course of the workshop, ESnet Director Greg Bell noted that observation and simulation are no longer separate scientific endeavors.
The workshop drew a stellar group of participants. In addition to the leading lights mentioned above, attendees included Larry Smarr, founder of NCSA and current leader of the California Institute for Telecommunications and Information Technology, a $400 million academic research institution jointly run by the University of California, San Diego and UC Irvine; and Ian Foster, who leads the Computation Institute at the University of Chicago and is a senior scientist at Argonne National Lab. Foster is also recognized as one of the inventors of grid computing.
The next step for the workshop organizers is to publish a report and identify areas for further study and collaboration. Looming over them will be the thoughts of Steven T. Myers of the National Radio Astronomy Observatory after describing the data challenges coming with the Square Kilometer Array radio telescope: “The future is now. And the data is scary. Be afraid. But resistance is futile.”
ESnet and its collaborators successfully completed three days of demonstrating its End-to-End Circuit Service at Layer 2 (ECSEL) software at the Open Networking Summit held at Stanford a couple of weeks ago. Our goal is to build “zero-configuration circuits” to help science applications seamlessly use networks for optimized end-to-end data transport. ECSEL, developed in collaboration with NEC, Indiana University, and the University of Delaware builds on some exciting new conceptual thinking in networking.
Wrangling Big Data
To put ECSEL in context, the proliferating tide of scientific data flows – anticipated at 2 petabytes per second as planned large-scale experiments get in motion – is already challenging networks to be exponentially more efficient. Wide area networks have vastly increased bandwidth and enable flexible, distributed, scientific workflows that involve connecting multiple scientific labs to a supercomputing site, a university campus, or even a cloud data center.
The increasing adoption of distributed, service-oriented computing means that resource and vendor independence for service delivery is a key priority for users. Users expect seamless end-to-end performance and want the ability to send data flows on demand, no matter how many domains and service providers are involved. The hitch is that even though the Wide Area Network (WAN) can have turbocharged bandwidth, at these exponentially increasing rates of network traffic even a small blockage in the network can seriously impair the flow of data, trapping users in a situation resembling commute conditions on sluggish California freeways. These scientific data transport challenges that we and other R&E networks face are just a taste of what the commercial world will encounter with the increasing popularity of cloud computing and service-driven cloud storage.
Abstracting a solution
One of the key feedback from application developers, scientists and end-users is that they do not want to deal with the complexity at the infrastructure level while still accomplishing their mission. At ESnet, we are exploring various ways to make networks work better for users. A couple of concepts could be game-changers, according to Open Network Summit conference presenter and Berkeley professor Scott Shenker: 1) using abstraction to manage network complexity, and 2) extracting and exposing simplicity out of the network. Shenker himself cites Barbara Liskov’s Turing Lecture as inspiration.
ECSEL is leveraging OSCARS and OpenFlow within the Software Defined Networking (SDN) paradigm to elegantly prevent end-to-end network traffic jams. OpenFlow is an open standard to allow application-driven manipulation of network flows. ECSEL is using OSCARS-controlled MPLS virtual circuits with OpenFlow to dynamically stitch together a seamless data plane delivering services over multi-domain constructs. ECSEL also provides an additional level of simplicity to the application, as it can discover host-network interconnection points as necessary, removing the requirement of applications being “statically configured” with their network end-point connections. It also enables stitching of the paths end-to-end, while allowing each administrative entity to set and enforce its own policies. ECSEL can be easily enhanced to enable users to verify end-to-end performance, and dynamically select application-specific protocol forwarding rules in each domain.
The OpenFlow capabilities, whether it be in an enterprise/campus or within the data center, were demonstrated with the help of NEC’s ProgrammableFlow Switch (PFS) and ProgrammableFlow Controller (PFC). We leveraged a special interface developed by them to program a virtual path from ingress to egress of the OpenFlow domain. ECSEL accessed this special interface programmatically when executing the end-to-end path stitching workflow.
Our anticipated next step is to develop ECSEL as an end-to-end service by making it an integral part of a scientific workflow. The ECSEL software will essentially act as an abstraction layer, where the host (or virtual machine) doesn’t need to know how it is connected to the network–the software layer does all the work for it, mapping out the optimum topologies to direct data flow and make the magic happen. To implement this, ECSEL is leveraging the modular architecture and code of the new release of OSCARS 0.6. Developing this demonstration yielded sufficient proof that well-architected and modular software with simple APIs, like OSCARS 0.6, can speed up the development of new network services, which in turn validates the value-proposition of SDN. But we are not the only ones who think that ECSEL virtual circuits show promise as a platform for spurring further innovation. Vendors such as Brocade and Juniper, as well as other network providers attending the demo were enthusiastic about the potential of ECSEL.
But we are just getting started. We will reprise the ECSEL demo at SC11 in Seattle, this time with a GridFTP application using Remote Direct Memory Access (RDMA) which has been modified to include the XSP (eXtensible Session Protocol) that acts as a signaling mechanism enabling the application to become “network aware.” XSP, conceived and developed by Martin Swany and Ezra Kissel of Indiana University and University of Delaware, can directly interact with advanced network services like OSCARS – making the creation of virtual circuits transparent to the end user. In addition, once the application is network aware, it can then make more efficient use of scalable transport mechanisms like RDMA for very large data transfers over high capacity connections.
We look forward to seeing you there and exchanging ideas. Until Seattle, any questions or proposals on working together on this or other solutions to the “Big Data Problem,” don’t hesitate to contact me.
Eric Pouyoul, Vertika Singh (summer intern), Brian Tierney: ESnet