NSDF Distinguished Speaker Series

Date 12:30 PM ET May 26 2022

Title A Global Research Data Platform: How Globus Services Enable Scientific Discovery

Speaker Ian Foster, University of Chicago and Argonne National Laboratory

Seminar Recording

Abstract: The Globus team has spent more than a decade developing software-as-a-service methods for research data management, available at globus.org. Globus transfer, sharing, search, publication, identity and access management (IAM), automation, and other services enable reliable, secure, and efficient managed access to exabytes of scientific data on tens of thousands of storage systems. For developers, flexible and open platform APIs greatly reduce the cost of developing and operating customized data distribution, sharing, and analysis applications. With 200,000 registered users at more than 2,000 institutions, more than 1.5 exabytes and 100 billion files handled, and hundreds of registered applications and services, the services that make up the Globus platform have become essential infrastructure for many researchers, projects, and institutions. I describe the design of the Globus platform, present illustrative applications, and discuss lessons learned for cyberinfrastructure software architecture, dissemination, and sustainability.

Bio: Dr. Ian Foster is Senior Scientist and Distinguished Fellow, and director of the Data Science and Learning Division, at Argonne National Laboratory, and the Arthur Holly Compton Distinguished Service Professor of Computer Science at the University of Chicago. Ian received a BSc degree from the University of Canterbury, New Zealand, and a PhD from Imperial College, United Kingdom, both in computer science. His research deals with distributed, parallel, and data-intensive computing technologies, and innovative applications of those technologies to scientific problems in such domains as materials science, climate change, and biomedicine. Foster is a fellow of the AAAS, ACM, BCS, and IEEE, and an Office of Science Distinguished Scientist Fellow.

Date 12:30 PM ET April 28 2022

Title Pangeo Forge - Crowdsourcing Analysis Ready Data in the Cloud

Speaker Ryan Abernathey, Columbia University, Department of Earth and Environmental Science

Seminar Recording

Abstract: Analysis-ready, cloud-optimized (ARCO) scientific data is essential for scalable big data analytics in the cloud. ARCO can massively accelerate statistical analysis, visualization, and machine learning workflows on large-scale scientific datasets. However, most scientific data is distributed in archival formats that are not optimized for large-scale analysis.

Pangeo Forge (https://pangeo-forge.org/) is an open source framework for data Extraction, Transformation, and Loading (ETL) of scientific data. The goal of Pangeo Forge is to make it easy to extract data from traditional data archives and deposit it in cloud object storage in ARCO format.

Pangeo Forge is made of two main components:

  • Pangeo Forge Recipes: an open source Python package, which allows you to create ETL pipelines (“recipes”) and run them on your own computer.
  • Pangeo Forge Cloud: a cloud-based automation framework which executes these recipes in the cloud from code stored in GitHub and deposits the data into cloud object storage.
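As a rough illustration of the recipe idea described above, the following minimal sketch expresses an ETL pipeline as three plain Python functions. It uses only the standard library and does not use the pangeo-forge-recipes API. All names here (`extract`, `transform`, `load`, `run_recipe`, the CSV input layout, and the JSON chunk and manifest files) are hypothetical stand-ins chosen for the example; a local directory plays the role of cloud object storage.

```python
from __future__ import annotations

import json
from pathlib import Path

# Toy "recipe" sketch. None of these names come from pangeo-forge-recipes;
# they are assumptions made purely for illustration.


def extract(archive_dir: Path) -> list[Path]:
    """Extract: list the archival input files in a stable order."""
    return sorted(archive_dir.glob("*.csv"))


def transform(path: Path) -> dict:
    """Transform: parse one two-column archival file into a chunk."""
    lines = path.read_text().strip().splitlines()
    header, rows = lines[0].split(","), lines[1:]
    # Assumes the second column holds the numeric variable of interest.
    values = [float(row.split(",")[1]) for row in rows]
    return {"source": path.name, "variable": header[1], "values": values}


def load(chunks: list[dict], target_dir: Path) -> Path:
    """Load: write one object per chunk plus a manifest of source files."""
    target_dir.mkdir(parents=True, exist_ok=True)
    for i, chunk in enumerate(chunks):
        (target_dir / f"chunk-{i}.json").write_text(json.dumps(chunk))
    manifest = target_dir / "manifest.json"
    manifest.write_text(
        json.dumps(
            {
                "n_chunks": len(chunks),
                # Recording the inputs keeps a simple provenance trail.
                "sources": [chunk["source"] for chunk in chunks],
            }
        )
    )
    return manifest


def run_recipe(archive_dir: Path, target_dir: Path) -> Path:
    """Run the three stages in order and return the manifest path."""
    chunks = [transform(path) for path in extract(archive_dir)]
    return load(chunks, target_dir)
```

Calling `run_recipe(Path("archive"), Path("store"))` on a directory of two-column CSV files would write one JSON object per input file plus a `manifest.json` listing the inputs. The same function could be run on a laptop or by an automated service, which is the split between the two components listed above.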

By storing data recipes in version-controlled GitHub repositories, we can maintain perfect provenance information from archival repository to ARCO copy. Using Pangeo Forge, we are collaboratively populating a petabyte-scale library of open ARCO climate data distributed across multiple cloud storage services, including the Open Storage Network.

Pangeo Forge is inspired directly by Conda Forge, a community-led collection of recipes for building conda packages. We hope that Pangeo Forge can eventually play the same role for datasets, encouraging open, interdisciplinary collaboration around data curation.

Bio: Ryan is a computational physical oceanographer who leads the Ocean Transport Group, whose mission is to advance scientific understanding of how stuff moves around the ocean and how this transport influences Earth’s large-scale climate and ecosystems. This research involves working with satellite data, numerical simulations, and observational datasets. Ryan is an enthusiastic advocate for open source scientific software and is an active contributor to the Pangeo Project, a community platform for Big Data geoscience.

This material is based upon work supported by the National Science Foundation under Grant No. 2138811.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Copyright © 2021 National Science Data Fabric