• “Detecting radio-astronomical ‘Fast Radio Transient Events’ via an OODT-based metadata processing pipeline”, Chris Mattmann, Andrew Hart, Luca Cinquini, David Thompson, Kiri Wagstaff, Shakeh Khudikyan. ApacheCon NA 2013, February 2013
Renaissance in Medicine - Strata - NoSQL and Genomics (Allen Day, PhD)
1) The document discusses the history and future of genomic sequencing technologies, from early discoveries in 1911 to current and future sequencing capabilities.
2) Next-generation sequencing technologies can now sequence 6 terabases of DNA per day at a lower cost, enabling new applications in medical diagnostics and treatment.
3) As sequencing costs continue to decline faster than even Moore's law would predict, the effort will shift from sequencing itself to downstream analytics and clinical applications involving large genomic datasets.
- The document discusses the Virtual Observatory (VO), which aims to make astronomical data and services interoperable.
- It describes various VO standards for data discovery, access, and sharing including web services, data models, and ontologies.
- It notes that upcoming large surveys will generate huge datasets and require distributed computing architectures like grids and clouds to handle storage and processing.
ADASS XXV: LSST DM - Building the Data System for the Era of Petascale Optica... (Mario Juric)
The document discusses the Large Synoptic Survey Telescope (LSST) data management system. It describes how LSST will image the entire visible sky every few nights over 10 years, generating 5 petabytes of data per year. It outlines the LSST data system, which will process and archive the data, producing catalogs and other data products that will be accessible to scientists. The ultimate goal is to transform the sky into a fully searchable database for astronomical research.
The science driving genomic analyses is rapidly changing, but the operational problems of processing data from DNA sequencers quickly and reliably are not new.
I present an analysis of the parallels in the fundamental limiting components of the '90s internet boom and the DNA sequencing boom that is currently underway, and illustrate how Hadoop, a proven application architecture used widely in BigData and commercial internet applications can be reused in the genomics sector.
This document discusses using cloud computing and virtualization for scientific research. Some key points:
- Scientists can access remote sensors, share data and workflows, and store personal data in the cloud. Beginners can click to code, while experts can build complex workflows.
- Services allow publishing, finding, and binding to distributed resources through registries. Data can be queried through standards like Simple Image Access Protocol.
- Distributed registries from various organizations harvest metadata to enable semantic search across sky regions, identifiers, tags, vocabularies, schemas, and service descriptions.
- Tools provide code/presentation environments and access to distributed data in the cloud. Services include astronomical cross-matching and event notification through Sky
Semantically-Enabling the Web of Things: The W3C Semantic Sensor Network Onto... (Laurent Lefort)
Presentation of the SSN XG results at eResearch Australia 2011 https://eresearchau.files.wordpress.com/2012/06/74-semantically-enabling-the-web-of-things-the-w3c-semantic-sensor-network-ontology.pdf
This talk discusses how e-Science tools are needed for the new data-intensive science, with specific focus on the Square Kilometre Array. Talk given at Special Symposium 15 on Data Intensive Astronomy, held during the General Assembly Meeting of the International Astronomical Union in Beijing, 2012.
The Square Kilometre Array is currently undergoing Preliminary Design Reviews for its component elements, and is thus at a critical point on its way to becoming ready for construction starting in 2018. In this talk we provide an overview of the SKA, its component elements, and their status, with emphasis on the Telescope Manager and the Science Data Processor (respectively the Monitoring & Control system and the Pipeline). We will see how they compare with their ALMA equivalents, and how the SKA is similar to and different from ALMA.
The document describes the iMarine data platform, which provides a hybrid data infrastructure combining over 500 software components into a centralized system. It addresses a variety of user needs such as hosting applications, maintaining databases, data analysis and delivery. The platform offers various services including storage, computing, data management and analysis tools to support tasks like biological data curation and spatial data processing. It utilizes bundles of grouped services and technologies to enable applications and solutions for collaborative work.
Hadoop Summit 2016 presentation.
As Yahoo continues to grow and diversify its mobile products, its platform team faces an ever-expanding, sophisticated group of analysts and product managers, who demand faster answers: to measure key engagement metrics like Daily Active Users, app installs and user retention; to run longitudinal analysis; and to re-calculate metrics over rolling time windows. All of this, quickly enough not to need a coffee break while waiting for results. The optimal solution for this use-case would have to take into account raw performance, cost, security implications and ease of data management. Could the benefits of Hive, ORC and Tez, coupled with a good data design provide the performance our customers crave? Or would it make sense to use more nascent, off-grid querying systems? This talk will examine the efficacy of using Hive for large-scale mobile analytics. We will quantify Hive performance on a traditional, shared, multi-tenant Hadoop cluster, and compare it with more specialized analytics tools on a single-tenant cluster. We will also highlight which tuning parameters yield maximum benefits, and analyze the surprisingly ineffectual ones. Finally, we will detail several enhancements made by Yahoo's Hive team (in split calculation, stripe elimination and the metadata system) to successfully boost performance.
This document discusses the development of data infrastructure for the Square Kilometre Array (SKA) radio telescope project. It provides timelines for SKA and its precursors, highlighting the exponential growth of data that will be produced. It outlines challenges of managing large data volumes and empowering users. It describes some existing and planned facilities for SKA regional science and data centers, including collaborations between South African and international institutions. These centers will be important for processing and analyzing data from pre-SKA telescopes like MeerKAT and distributing data to global research teams.
The document discusses how the Earth System Grid Federation (ESGF) leverages tools from Apache Solr and Apache Object Oriented Data Technology (OODT) to manage and distribute large amounts of climate science data. ESGF is an international collaboration that uses a distributed network of nodes running various software components to provide access to over 2.5 petabytes of climate model output and observational data. This infrastructure supports the research of the Intergovernmental Panel on Climate Change and projects like CMIP5, the largest coordinated climate modeling effort to date.
The story of how Globus helped the Arecibo Observatory save 50+ years of data for posterity and future research. Presented at the GlobusWorld 2021 conference by Julio Alvarado Negron.
GlobusWorld 2021: Arecibo Observatory Data Movement (Globus)
The story of how Globus helped move petabytes of data from the Arecibo Observatory to TACC, and thereby save 50+ years of data for posterity and future research. Presented at the GlobusWorld 2021 conference by George Robb III.
This document provides an overview of the NPOESS Program. NPOESS is a tri-agency program between NOAA, NASA, and the Department of Defense to develop the next generation of US polar-orbiting environmental satellites. The goal is to converge the DoD and NOAA satellite programs to achieve cost savings while incorporating new technologies. NPOESS will provide global environmental data for weather forecasting, climate monitoring, and other applications. The first NPOESS satellite is scheduled for launch in 2013 and the system is expected to operate through 2026. The NPP satellite launching in 2011 will help reduce risks for NPOESS.
Hadoop for Bioinformatics: Building a Scalable Variant Store (Uri Laserson)
Talk at Mount Sinai School of Medicine. Introduction to the Hadoop ecosystem, problems in bioinformatics data analytics, and a specific use case of building a genome variant store backed by Cloudera Impala.
From the Hadoop Summit 2015 Session with Ted Dunning:
Just when we thought the last mile problem was solved, the Internet of Things is turning the last mile problem of the consumer internet into the first mile problem of the industrial internet. This inversion impacts every aspect of the design of networked applications. I will show how to use existing Hadoop ecosystem tools, such as Spark, Drill and others, to deal successfully with this inversion. I will present real examples of how data from things leads to real business benefits and describe real techniques for how these examples work.
This document discusses NASA's big data challenges in climate science. It notes that by 2020, climate data holdings from simulations, observations, and reanalyses are projected to grow to hundreds of exabytes worldwide. It describes NASA's efforts to build "data analytics platforms" like NEX and Obs4MIPs using ESGF to enable analysis of large amounts of observational and modeling data without needing to download entire datasets locally. The challenges of remote data visualization, distributed data analysis, and data management for big climate data are also discussed.
Scalable Data Mining and Archiving in the Era of the Square Kilometre Array (Chris Mattmann)
Keynote presentation at the HPC User Forum 2012 in Dearborn, MI, September 19, 2012. http://www.hpcuserforum.com/registration/dearborn2012/dearbornagenda.pdf
Accelerating Science with Cloud Technologies in the ABoVE Science Cloud (Globus)
This document summarizes the use of the ABoVE Science Cloud (ASC) to support research for the Arctic-Boreal Vulnerability Experiment (ABoVE). The ASC provides researchers with large datasets, computing resources, and tools to process and analyze remote sensing and model data related to Alaska and northern Canada. Several examples are given of projects using the ASC, including analyzing satellite imagery to map forest structure, tracking surface water changes over time, characterizing fire history, and modeling future forest composition under climate change. The ASC aims to facilitate collaboration by allowing scientists to access common datasets and run computationally-intensive processes in the cloud without having to directly transfer large amounts of data.
This document discusses using web services and scientific workflows to analyze 3D astronomical data from archives. It describes the AMIGA project which studies isolated galaxies using data from multiple telescopes. Analyzing the large 3D data cubes requires intensive computation. The document proposes using distributed computing resources like grids and clouds along with a "cloud of services" to perform analysis without transferring entire datasets. Scientific workflows could automate and reproduce analyses by chaining web services from different astronomical archives. The Wf4Ever project aims to preserve digital experiments by curating all components of the research lifecycle.
Spark at NASA/JPL (Chris Mattmann, NASA/JPL) (Spark Summit)
The document discusses NASA's use of Apache Spark for big data analytics. It provides context on Chris Mattmann's involvement with Spark through his roles at NASA JPL and the Apache Software Foundation. It outlines some of NASA's big data challenges around handling large volumes of Earth observation data from instruments and simulations. NASA is interested in using Spark for tasks like data triage, archiving, and knowledge extraction to help address these challenges and enable new scientific insights.
Astronomical Data Processing on the LSST Scale with Apache Spark (Databricks)
The next decade promises to be exciting for both astronomy and computer science, with a number of large-scale astronomical surveys in preparation. One of the most important is the Large Synoptic Survey Telescope, or LSST. LSST will produce the first ‘video’ of the deep sky in history by continually scanning the visible sky and taking one 3.2-gigapixel image every 20 seconds. In this talk we will describe LSST’s unique design and how its image processing pipeline produces catalogs of astronomical objects. To process and quickly cross-match catalog data we built AXS (Astronomy Extensions for Spark), a system based on Apache Spark. We will explain its design and what is behind its great cross-matching performance.
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R... (Larry Smarr)
Invited Presentation
Symposium on Computational Biology and Bioinformatics:
Remembering John Wooley
National Institutes of Health
Bethesda, MD
July 29, 2016
The document provides information about an ITIC committee briefing at the Marshall Space Flight Center on November 29, 2012. It discusses the membership and activities of the ITIC committee. It also summarizes presentations and topics discussed at the briefing, including SPoRT weather modeling activities, mobile applications, high performance networking, and opportunities for IT innovation on the International Space Station.
The document discusses big data in astronomy and the LineA-DEXL case. It provides an outline and introduction to big data in science and hypothesis-driven research. It discusses data management techniques like data partitioning and parallel workflow processing. It then provides details on the Laboratório Nacional de Computação Científica (LNCC) and its role in supporting computational modeling and bioinformatics. It discusses astronomy surveys that generate large amounts of data like the Dark Energy Survey and challenges of data from the Large Synoptic Survey Telescope. Finally, it discusses the need for data infrastructure, metadata management, and distributed data management to support scientific research involving big data.
Mike Warren is the co-founder and CTO of Descartes Labs, a company that operates a geospatial analysis platform using multiple integrated satellite image datasets. The platform provides analysis-ready images with historical records for machine learning and allows users to find, measure, monitor changes over time, and predict future changes to minimize risk and optimize outcomes. It eliminates much of the data preparation time typically required by geospatial scientists by maintaining a growing archive of processed images and a robust pipeline for continuous updates as new images become available.
Toward a Global Interactive Earth Observing Cyberinfrastructure (Larry Smarr)
The document discusses the need for a new generation of cyberinfrastructure to support interactive global earth observation. It outlines several prototyping projects that are building examples of systems enabling real-time control of remote instruments, remote data access and analysis. These projects are driving the development of an emerging cyber-architecture using web and grid services to link distributed data repositories and simulations.
LambdaGrids--Earth and Planetary Sciences Driving High Performance Networks a... (Larry Smarr)
05.02.04
Invited Talk to the NASA Jet Propulsion Laboratory
Title: LambdaGrids--Earth and Planetary Sciences Driving High Performance Networks and High Resolution Visualizations
Pasadena, CA
The document discusses multidimensional data in astronomy and the Virtual Observatory (VO). It describes how the VO provides standards for data sharing and discovery. It also discusses challenges around discovering and accessing complex multidimensional datasets. The author proposes a Generic Dataset Service for discovery of associated multidimensional data collections. This includes inputs for querying data, and outputs returning data declarations and characteristics for access. Future work includes developing standards for virtual data generation and access, and using VO protocols and scientific workflows for multidimensional data analysis.
The Pacific Research Platform Two Years In (Larry Smarr)
This document provides an overview of the Pacific Research Platform (PRP) after two years of operation. It describes several science drivers that are using the PRP, including biomedical research on cancer genomics and microbiomes, earth sciences like earthquake modeling, and astronomy. It highlights how the PRP is connecting sites like UC San Diego, UC Santa Cruz, UC Berkeley to share and analyze large datasets using high-speed networks. The PRP is expanding to support new areas like deep learning, cultural heritage projects, and connecting additional UC campuses through network upgrades.
Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...Spark Summit
This document describes a project at Novartis to use Apache Spark for high-dimensional data analysis from drug screening. Large datasets from various screening technologies were analyzed using Spark pipelines for quality control, normalization, and classification. Visualizations were built using WebGL. The goals were to speed up multi-day batch jobs, create a unified analysis workflow, and build an application for scientists. Future work includes elastic infrastructure, supervised learning of cell phenotypes, and contributing methods to open source.
Cyberinfrastructure to Support Ocean ObservatoriesLarry Smarr
05.03.18
Invited Talk to the Ocean Studies Board
National Research Council
Title: Cyberinfrastructure to Support Ocean Observatories
University of California San Diego
Cyberinfrastructure to Support Ocean Observatories
ApacheCon NA 2013 VFASTR
1. Detecting radio-astronomical "Fast Radio Transient Events" via an OODT-based metadata processing pipeline
Chris Mattmann, Andrew Hart, Luca Cinquini, David Thompson, Kiri Wagstaff, Shakeh E. Khudikyan
NASA Jet Propulsion Laboratory, California Institute of Technology
Copyright 2013 California Institute of Technology. Government Sponsorship Acknowledged.
2. Radio Initiative: Archiving
• Initiative Lead: Dayton Jones; Champion: Robert Preston
• We will define the necessary data services and underlying substrate to position JPL to compete for and lead "big data" management efforts in astronomy, specifically SKA, HERA, SKA precursors, and NRAO.
• Perform prototyping and deployment to demonstrate JPL's leadership in the "big data" and astronomy space.
• Collaborate on Data Products and Algorithms from the Adaptive Data Processing task.
• Establish partnerships with major potential SKA sites and precursor efforts (South Africa, Australia).
20-Sep-12
4. JPL "Big Data" Initiative
• The Big Picture
  ‣ Astronomy, Earth science, planetary science, life/physical science all drowning in data
  ‣ Fundamental technologies and emerging techniques in archiving and data science
  ‣ Largely center around open source communities and related systems
• Research challenges (adapted from NSF)
  ‣ More data is being collected than we can store
  ‣ Many data sets are too large to download
  ‣ Many data sets are too poorly organized to be useful
  ‣ Many data sets are heterogeneous in type, structure
  ‣ Data utility is limited by our ability to use it
• Our Focus: Big Data Archiving
  ‣ Research methods for integrating intelligent algorithms for data triage, subsetting, summarization
  ‣ Construct technologies for smart data movement
  ‣ Evaluate cloud computing for storage/processing
  ‣ Construct data/metadata translators ("Babel Fish")
[Chart: Quantity of Global Digital Data (Exabytes), 2005-2015. Source: EMC/IDC Digital Universe Study 2011]
http://www.sciencemag.org/site/special/data/
BIGDATA Webinar. Direct questions to bigdata@nsf.gov. Vasant Honavar, May 2012
5. Some "Big Data" Grand Challenges
• How do we handle 700 TB/sec of data coming off the wire when we actually have to keep it around?
  – Required by the Square Kilometre Array; will talk about data triage here
• Joe scientist says: "I've got an IDL or Matlab algorithm that I will not change, and I need to run it on 10 years of data from the Colorado River Basin and store and disseminate the output products"
  – Required by the Western Snow Hydrology project
• How do we compare petabytes of climate model output data in a variety of formats (HDF, NetCDF, GRIB, etc.) with petabytes of remote sensing data to improve climate models for the next IPCC assessment?
  – Required by the 5th IPCC assessment and the Earth System Grid and NASA
• How do we catalog all of NASA's current planetary science data?
  – Required by the NASA Planetary Data System
Copyright 2012. Jet Propulsion Laboratory, California Institute of Technology. US Government Sponsorship Acknowledged.
Image Credit: http://www.jpl.nasa.gov/news/news.cfm?release=2011-295
6. VFASTR
VFASTR ("VLBA Fast Radio Transients") is a project that aims at detecting short radio pulses (approx. a few milliseconds) from extra-terrestrial sources within the large amounts of data collected by the VLBA ("Very Long Baseline Array").
Fast Radio Transients may be generated by known and yet unknown sources:
• Pulsars
• Merging neutron stars
• Intermittent pulsars
• Annihilating black holes
• X-Ray binaries
• ET signals?
• Supernovae
• New deep space objects
VFASTR is one of a new generation of Radio Astronomy experiments that aim at analyzing the "dynamic radio sky", as opposed to mapping and inspecting known static sources.
7. VLBA
The VLBA ("Very Long Baseline Array") is a group of 10 large radio telescopes (25 m diameter) distributed across the U.S.A. from Hawaii to the Virgin Islands.
• No two antennas are within each other's local horizon
• The overall array has baselines of up to ~8,600 km => milliarcsecond resolution
8. VFASTR Commensal Approach
VFASTR employs a commensal (a.k.a. "passive") approach by analyzing data that is collected during normal VLBA operations for other scientific purposes:
• Raw voltages from VLBA antennas are transferred to NRAO, time-correlated, corrected for dispersion through the interstellar medium ("de-dispersion") and separated from instrument noise
• Candidate events are staged on disk and remain available for a limited time
• The VFASTR team must review tens of candidates daily, archive the promising ones and disregard the others
[Diagram: the (Commensal) Transient Detection Pipeline. Voltages from the 10 VLBA antennas enter the DiFX/VLBA correlator, producing sky images and radio spectra (128 channels x 10 antennas). The V-FASTR commensal analysis dedisperses the incoming STA data (100 DMs x 10 antennas), applies a matched filter with pulse injection and self-tuning, excises RFI (kurtosis filter), detects transient events, saves out baseband/raw voltage data for events, and emails candidates to humans for review.]
VFASTR Science Team: Astron (The Netherlands), ICRAR (Australia), JPL (U.S.A.), NRAO (U.S.A.)
9. VFASTR Data System Overview
• The software engineering team at JPL has developed an end-to-end data system in support of VFASTR activities, with two major goals:
  ‣ provide a web-based platform for easy and timely review of candidate events by the science team
  ‣ enable the automatic identification of interesting events by a self-trained machine agent
• The system is composed of three major components:
  ‣ Data processing pipeline: responsible for data transfer from NRAO and archiving at JPL, and for metadata extraction and cataloging
  ‣ Web portal: easily accessible application for display of product data and metadata, and for selection and tagging of interesting events
  ‣ Data mining algorithm: analyzes the event pool and tags candidates with characteristics similar to sample interesting events
10. Apache OODT Overview
The VFASTR data processing pipeline was built using Apache OODT in combination with other Apache and open source technologies.
OODT (Object Oriented Data Technology) is a framework for management, discovery and access of distributed data resources. Main features:
• Modularity: an eco-system of standalone components that can be deployed in various configurations to fulfill a project's specific requirements
• Configurability: each component can be easily configured to invoke alternate out-of-the-box functionality or deployment options
• Extensibility: components can be extended by providing alternate implementations of their core APIs (expressed as Java interfaces) or by configuring custom plugins
11. Apache OODT Adoption
OODT is used operationally to manage scientific data by several projects in disparate scientific domains:
• Earth Sciences:
  ‣ NASA satellite missions (SMAP, ...) are using OODT components as the base for their data processing pipelines, for generation and archiving of products from raw observations
  ‣ ESGF (Earth System Grid Federation) used OODT to build and publish observational data products in support of climate change research
• Health Sciences: EDRN (Early Detection Research Network) uses OODT to collect, tag and distribute data products to support research in early cancer detection
• Planetary Science: PDS (Planetary Data System) is developing data transformation and delivery services based on OODT as part of its world-wide product access infrastructure
• Radio Astronomy: several projects (ALMA, Haystack, ...) are adopting OODT based on the successful VFASTR example
12. VFASTR Data System Architecture
Data products are continuously generated by the VLBA ground and processing system and stored in a temporary cache at NRAO. Data products are transferred to JPL, where metadata is extracted, products are made available for review by scientists, sub-selected for further analysis, and tagged.
13. Data System Design Considerations
• Some of the architectural decisions that factored into the data system design were motivated by specific project constraints:
  ‣ Minimize impact on NRAO resources: because VFASTR is a guest project at NRAO, attention had to be paid to limiting use of disk storage, network bandwidth and CPU resources
  ‣ Security: all NRAO resources were exposed as read-only; no action initiated at JPL could result in any modification of the original products (or compromise the NRAO system)
• The architecture evolved over time as a result of new requirements, such as increased data volumes and higher-frequency updates:
  ‣ Use of different OODT and Apache components (Lucene vs. MySQL data store back-ends, Solr for fast metadata retrieval)
  ‣ Development of new OODT functionality (RESTful API for metadata updates)
14. VFASTR Data Products
VFASTR data is logically organized into three levels: jobs, scans and events.
• Job: a batch of data that is processed at one time and stored together on physical disk. Jobs are associated with a specific investigator scientist. Each contains 1-100+ scans.
• Scan: a physical telescope pointing, i.e. a period where the antennas are all directed to a common point on the sky. Scans have durations of 1-100+ seconds.
• Event: a time segment that the system thinks is interesting. Duration is usually about 1-2 seconds. Most scans have no such events, but some have a dozen or more. The interesting part of an event is much shorter: 5-50 milliseconds.
VFASTR Data Product: a directory tree containing all data recorded for a single job (e.g. tns_bmp360p2_44), approximately 1-100 GB in size:
• Job calibration files
• Scan output files
• Event raw voltages
• Event reconstructed images
• ...and other files...
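The three-level job/scan/event hierarchy above can be sketched as a simple data model. This is illustrative only; the class and field names are assumptions, not VFASTR's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Event:
    # A candidate time segment; typically ~1-2 s, with a 5-50 ms core of interest
    start: str            # ISO timestamp of the event start
    duration_sec: float
    tags: List[str] = field(default_factory=list)

@dataclass
class Scan:
    # A single telescope pointing, lasting 1-100+ seconds; most scans have no events
    scan_id: int
    events: List[Event] = field(default_factory=list)

@dataclass
class Job:
    # A batch processed at one time and stored together on disk, e.g. "tns_bmp360p2_44"
    name: str
    investigator: str
    scans: List[Scan] = field(default_factory=list)

    def event_count(self) -> int:
        # Total candidate events across all scans in this job
        return sum(len(scan.events) for scan in self.scans)
```

A job with one scan containing one event would then report `event_count() == 1`.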
15. Pulsar PSR J0826+2637 Signal
Telescope signal processing:
• Time correlation
• "De-dispersion" (i.e. correction for dispersion in the interstellar medium)
• "Adaptive excision" (some telescopes are disregarded based on a self-learning algorithm)
18. Rsync
Rsync: a freely available utility for Unix systems that can be used to synchronize the contents of directory trees between two hosts with minimal human intervention. Features:
• Easy deployment
• Extensive range of configuration options
• High performance: only file changes are transferred (delta encoding) between subsequent invocations, plus optional compression
• Optional recursion into sub-directories
• Reliability: turn-key toolkit
19. Rsync
VFASTR deployment:
• An rsync server daemon was deployed at NRAO to make VFASTR products available for download
  ‣ Configured for read-only mode
  ‣ Limited to requests coming from JPL IPs
• An rsync client runs at JPL as a system cron job to pull data every hour
  ‣ Configured to only transfer a subset of the product files (images, output, calibration files)
Measured data transfer rates:
• ~2 MB/sec between NRAO and JPL
• Approximately 10-20 products per day
• Average volume for a transferred product: ~50 MB (reduced from 50 GB)
• Can transfer all (reduced) daily products in a few minutes!
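The hourly pull described above can be sketched as a small wrapper that builds the rsync command with include/exclude filters. The host name, module name and file patterns here are assumptions for illustration, not the project's actual configuration:

```python
import subprocess  # used when actually invoking the command

def build_rsync_pull(src="rsync://nrao.example.org/vfastr/",
                     dest="/data/vfastr/staging/"):
    """Construct an rsync command that mirrors only a subset of each
    product tree (images, output and calibration files), relying on
    rsync's delta encoding so repeated runs transfer only changes."""
    return [
        "rsync", "-az",     # archive mode (recursive, preserve attrs) + compression
        "--include=*/",     # descend into all product sub-directories
        "--include=*.png",  # detection images (pattern is an assumption)
        "--include=*.out",  # scan output files (pattern is an assumption)
        "--include=*.cal",  # calibration files (pattern is an assumption)
        "--exclude=*",      # skip everything else, e.g. raw voltage data
        src, dest,
    ]

# From the hourly cron job one would then execute:
# subprocess.run(build_rsync_pull(), check=True)
```

Because rsync applies the first matching filter, the trailing `--exclude=*` drops every file not explicitly included, which is what reduces a ~50 GB product to the ~50 MB actually transferred.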
20. CAS Crawler
The CAS Crawler is an OODT component that can be used to list the contents of a staging area and submit products for ingestion to the CAS File Manager. It is typically used for automatic detection of new products transferred from a remote source.
VFASTR deployment:
• Run as a daemon every 300 seconds
• In-place archiving of products (no movement)
• Preconditions:
  ‣ Product must be complete
  ‣ Product must be no older than 10 days
  ‣ Product must not exist in the catalog already
• Post-ingest action on success:
  ‣ Trigger metadata harvesting by the Solr script
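The three preconditions amount to checks like the following. This is an illustrative sketch, not OODT's actual precondition-comparator API; the completeness sentinel file and the catalog lookup are assumptions:

```python
import os
import time

MAX_AGE_DAYS = 10  # products older than this are skipped (from the slide)

def is_complete(product_dir):
    # Assumption: a sentinel file marks the end of the rsync transfer,
    # so partially transferred products are never ingested.
    return os.path.exists(os.path.join(product_dir, ".complete"))

def is_recent(product_dir, now=None):
    # Compare the directory's modification time against the 10-day window.
    now = now if now is not None else time.time()
    age_days = (now - os.path.getmtime(product_dir)) / 86400.0
    return age_days <= MAX_AGE_DAYS

def passes_preconditions(product_dir, catalog_ids):
    # catalog_ids: set of product ids already ingested into the File Manager catalog
    product_id = os.path.basename(product_dir)
    return (is_complete(product_dir)
            and is_recent(product_dir)
            and product_id not in catalog_ids)
```

A crawler daemon would evaluate `passes_preconditions` for each staged directory on every 300-second pass and submit only the products that pass all three checks.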
21. CAS File Manager
The CAS File Manager is an OODT service for cataloging, archiving and delivery of data products (files and directories) and associated metadata. It is used as the core data management component in most OODT-based data systems.
VFASTR deployment:
• Policy files: define a single VFASTR metadata type to capture ALL information associated with a single product (job, scans and events)
  ‣ The full metadata for a product can be retrieved by a client with one request
  ‣ Metadata keys must be named dynamically to capture job-scan-event references
  ‣ Example: key=EventStartDateTime_s6, values=2013-01-12T15:48:21.800-0800, 2013-01-12T15:48:22.830-0800 (scan 6 contains 2 events)
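The dynamic key-naming scheme can be sketched as follows; only the `EventStartDateTime_s<scan>` pattern comes from the slides, while the helper names and input shape are assumptions:

```python
def scan_key(base_field, scan_number):
    """Build a per-scan metadata key, e.g. EventStartDateTime_s6."""
    return f"{base_field}_s{scan_number}"

def flatten_metadata(scan_events):
    """Flatten a {scan_number: [event start times]} mapping into the single
    multi-valued key/value metadata record ingested for the whole product,
    so a client can fetch everything about a job in one request."""
    metadata = {}
    for scan_number, start_times in scan_events.items():
        if start_times:  # scans without events contribute no keys
            metadata[scan_key("EventStartDateTime", scan_number)] = list(start_times)
    return metadata
```

For example, `flatten_metadata({6: [t1, t2]})` yields `{"EventStartDateTime_s6": [t1, t2]}`, matching the two-event scan shown above.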
22. CAS File Manager
• Validation layer: no validation applied, as metadata fields are not known a priori
  ‣ Back-end catalog implementations had to be extended to allow for optional lenient behavior for ingested metadata
• Metadata extractors: custom metadata extractors were written to parse information for jobs, scans and events from the directory structure, calibration and output files, and to assign detection images to the events that generated them
• Metadata catalog: used both Lucene and MySQL back-ends
  ‣ Switched to MySQL to support high-frequency updates
  ‣ The Lucene File Manager implementation has now been fixed to support high frequencies
• Data transfer protocol: archive products in place
  ‣ Otherwise they would be re-transferred by rsync
23. CAS Curator
The CAS Curator is a web application for interacting with the File Manager (i.e. a web-based client for the File Manager service):
• Submit data product ingestion jobs
• Inspect, add and update product metadata (curation)
Features: provides two interfaces for interacting with the File Manager:
• Web User Interface: used by humans to manually interact with the system
  ‣ Drag-and-drop selection of files from the staging area
  ‣ Selection of metadata extractor and versioner from the available pool
  ‣ Submission of jobs for bulk ingestion to the File Manager
  ‣ Widget for display and update of product metadata
24. CAS Curator
• Web RESTful API: used by programs and scripts for machine-machine interaction
  ‣ Based on the Apache JAX-RS project (for RESTful web services)
  ‣ Allows annotating existing products with enhanced metadata
  ‣ Example HTTP POST invocation:
    curl --data "id=<product_id>&metadata.<name>=<value>" http://<hostname>/curator/services/metadata/update
VFASTR deployment:
• REST API used by the Web Portal and a MatLab script to tag interesting events
• Updated product metadata is submitted to the File Manager via an XML/RPC request
  ‣ The VFASTR Curator was wired with a JAX-RS ResponseHandler (a servlet filter invoked before the response is sent back to the client) to invoke the script for updating the Solr metadata
• Example tag update: metadata.event_s0_e1=pulsar|machine|date
25. Apache Solr
Solr is a high-performance web-enabled search engine built on top of Lucene:
• Used in many e-commerce web sites
• Free text searches (w/ stemming, stop words, ...)
• Faceted searches (w/ facet counts)
• Other features: highlighting, word completion, ...
• Flat metadata model: (key, value+) pairs
[Diagram: metadata (XML) is posted into Apache Solr, running in Tomcat/Jetty on top of a Lucene index; clients query via HTTP GET and receive XML/JSON responses.]
Scalability: Solr includes features to scale to 10-100 M records:
• Multiple cores to partition records into distinct indexes
• Multiple shards to distribute a query across complementary indexes
• Replicated indexes for high availability and low latency
26. Solr Indexing Script
VFASTR deployment: Solr is used to enable high-performance metadata querying by clients (the Web Portal and the MatLab script):
• The Solr web application is deployed within the same Tomcat container as the CAS Curator
• A Python indexing script harvests metadata from the CAS catalog into the Solr index
  ‣ Triggered by the CAS Crawler when a product is first ingested
  ‣ Triggered by the CAS Curator when the product metadata is updated
• The VFASTR Solr schema specifies name and data type for all metadata fields
  ‣ A type=job/scan/event field is used to discriminate among records
Examples of VFASTR queries to the Solr index:
  ‣ List of latest products by date
  ‣ Full metadata for a given job, scan or event
  ‣ All events that were assigned a given tag
  ‣ All tags assigned to all events
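The example queries above map naturally onto Solr's standard HTTP query interface. The sketch below only builds the query URLs; the host/port, core path, and field names (`type`, `tag`, `IngestDate`) are assumptions for illustration, not the actual VFASTR schema:

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8080/solr/select"  # assumed deployment URL

def solr_url(**params):
    """Build a Solr select URL; wt=json asks for JSON output,
    rows limits how many records are returned."""
    query = {"wt": "json", "rows": 20}
    query.update(params)
    return SOLR_SELECT + "?" + urlencode(query)

# Latest products by date, newest first
latest_jobs = solr_url(q="type:job", sort="IngestDate desc")

# All events that were assigned a given tag
pulsar_events = solr_url(q="type:event AND tag:pulsar")

# All tags assigned to all events, as facet counts on the tag field
tag_counts = solr_url(q="type:event", rows=0, facet="true",
                      **{"facet.field": "tag"})
```

Issuing an HTTP GET on any of these URLs (e.g. with `urllib.request.urlopen`) returns the matching records, which is how both the Web Portal and the MatLab script could retrieve metadata without touching the File Manager directly.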
27. Review Data Portal
What is it?
• Web-based view of the metadata associated with nightly observations
• Collaborative environment for review by the V-FASTR science team
28. Review Data Portal
Why does it exist?
• Provide the distributed science team with convenient access to metadata
29. Review Data Portal
How does it fit?
• Focal point for end-user access to the data pipeline
[Diagram: Product Generation feeds the Rsync Server Daemon at NRAO; the Rsync Client Script (cron, hourly) pulls products into Staging; the Crawler performs Metadata Extraction and ingestion into the Archive and Metadata Catalog via the File Manager; the Web Portal serves the cataloged metadata to users.]
30. Review Data Portal
What is it built with?
• Some of the technologies behind the data portal (shown as logos on the slide)
36. Review Data Portal
What are tags?
• Descriptive metadata associated with an event
• Enable classification and filtering
• Serve as training for AI (now)
• Serve as a guide for what to archive (soon)
38. Review Data Portal
What can users do?
• Event tagging (now)
  ‣ Nightly classification, viewable by all users
  ‣ Used as training input to the automatic candidate detection
• Job archiving (soon)
  ‣ Initiate archival of a job on NRAO hardware based upon the contents of the tags and/or other metrics
39. Review Data Portal
What have users thought?
• An interface with all event imagery on-screen is an improvement over command-line methods (more efficient)
• Organization of the interface should support rapid evaluation of an entire job (minimize clicks)
• Improved accessibility of the information is a big plus (mobile access)
40. Review Data Portal
What is next?
• Provide more comprehensive job/event search capability
  ‣ Facet by tag values, index metadata
• Continued efficiency improvements
  ‣ Bulk tagging of all events in a job
• Implement front-end capability to initiate the back-end archive process
41. Questions?
Acknowledgments: Walter F. Brisken, Sarah Burke-Spolaor, Cathryn Trott, Adam T. Deller, Dayton L. Jones, Walid A. Majid, Divya Palaniswamy, Steven J. Tingay, Randall B. Wayth