Data-intensive scientific workflows are at a pivotal point: traditional local computing resources can no longer meet the storage or computing demands of scientists. In the Earth Sciences, we are facing an explosion of data volumes sourced from models, in-situ observations, and remote sensing platforms. Some agencies are starting to move data to commercial Cloud providers to facilitate access (e.g. NASA on Amazon Web Services). Fully leveraging these opportunities will require new approaches to how the scientific community handles data access, processing, and analysis. In particular, we need to stop downloading data and start uploading algorithms to wherever large archives reside. This session is targeted at researchers who are pioneering such “data-proximate” computing on commercial Cloud infrastructure. We hope to hear current success stories, as well as failures, and to identify ways to improve existing workflows.
Agenda
- 3:30 - 3:35 Scott Henderson (eScience Institute): Introduction to the session
  Slides: http://bit.ly/2YhbWnr
- 3:35 - 3:55 Aimee Barciauskas (Development Seed): The Multi-Mission Algorithm and Analysis Platform (MAAP)
  Slides: https://doi.org/10.6084/m9.figshare.8942108
- 3:55 - 4:15 Aji John (University of Washington): Analyzing satellite imagery on the Cloud to understand wildflower phenology at Mt Rainier
- 4:15 - 4:35 Julien Chastang (UCAR/Unidata): Deploying a Unidata JupyterHub on the NSF Jetstream Cloud, Lessons Learned and Challenges Going Forward
  Slides: https://doi.org/10.6084/m9.figshare.8944964
- 4:35 - 4:55 Rich Signell (USGS): Using the Pangeo ecosystem for model analysis and visualization
  Slides: https://doi.org/10.6084/m9.figshare.9115229
- 4:55 - 5:00 Wrapup discussion

Session recording is here.

Session Take-Aways
- A current challenge for cloud-based workflows is that datasets from different agencies are stored in different formats and different regions, and often have similar but slightly different access APIs.
- Platforms such as MAAP and Pangeo are very promising and exciting. They enable the benefits of scalable computing on datasets stored on the cloud.
- The cost model for scalable cloud computing is unclear. It remains an open question how to sustain these platforms into the future and how to regulate user access to cluster resources.