Serverless Workflows for
Indexing Large Scientific Data
Tyler J. Skluzacek, Ryan Chard, Ryan Wong, Zhuozhao Li, Yadu Babuji,
Logan Ward, Ben Blaiszik, Kyle Chard, Ian Foster
Data are big, diverse, and distributed
Big: petabytes → exabytes
Diverse: thousands → millions of unique file extensions
Distributed: IoT (edge), HPC, cloud; from many individuals
Generally, scientific data are not FAIR
Findable, Accessible, Interoperable, Reusable
Root of the problem: files lack descriptive metadata
Root of the root of the problem: humans are lazy, metadata are hard
[Figure: pipeline with labels Files, Humans, Metadata Extraction, Ingestion, Search Index; example metadata: {location, physical attributes, derived information, provenance}]
We need an automated
metadata extraction system
Ideally, to cancel* humans
[Figure: pipeline with labels Files, HTTPS, Metadata Extraction Service, Ingestion, Search Index; example metadata: {location, physical attributes, derived information, provenance}]
We need a flexible, decentralized, scalable
metadata extraction system
1. Send metadata extraction functions to data
No need to ship big data
2. Decentralized
Extract metadata from data in their natural habitats (e.g., edge)
3. Scalable
Run many concurrent metadata extraction processes
[Figure: many concurrent invocations: wc -l $FILE1, wc -l $FILE2, . . ., wc -l $FILE600000]
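A minimal sketch of the "many concurrent extraction processes" idea, using a local thread pool in place of remote compute. The helper names (`count_lines`, `count_lines_many`, `make_demo_files`) are hypothetical and only illustrate the pattern of applying one small function to many files at once.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def count_lines(path):
    """Python stand-in for `wc -l`: count newline characters in one file."""
    count = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            count += chunk.count(b"\n")
    return count


def count_lines_many(paths, max_workers=8):
    """Run count_lines over many files concurrently; returns {path: line_count}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(count_lines, paths)))


def make_demo_files(directory, n_files=5):
    """Create n_files small files; file i holds i + 1 lines (illustrative data only)."""
    paths = []
    for i in range(n_files):
        path = os.path.join(directory, f"file{i + 1}.txt")
        with open(path, "w") as handle:
            handle.write("x\n" * (i + 1))
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path, n in count_lines_many(make_demo_files(tmp)).items():
            print(os.path.basename(path), n)
```

Threads suit this I/O-bound example; a real deployment would spread the same function over remote workers rather than one local pool.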
funcX for FaaS anywhere
Enable secure, isolated, on-demand function
serving on myriad compute resources
(cloud, HPC, laptops, edge)
Abstract away underlying infrastructure via the Parsl
parallel scripting library
Users deploy endpoints on available compute
resources, use Globus Auth for access control, and
access a library of containers for running functions
[Figure: funcX architecture with labels Function (wc -l $FILE1), funcX service, Compute Resource, funcX Endpoint, Worker, Container, Result]
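The register-then-route flow in the figure can be sketched with a toy, in-process model. This is not the funcX SDK: `ToyService`, `ToyEndpoint`, and their methods are hypothetical names, and a local thread pool stands in for containers and remote workers.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class ToyEndpoint:
    """Hypothetical stand-in for an endpoint: a small pool of local workers."""

    def __init__(self, workers=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def execute(self, func, *args):
        # Hand the function to a worker and return a Future for its result.
        return self._pool.submit(func, *args)


class ToyService:
    """Hypothetical stand-in for the central service: registers and routes."""

    def __init__(self):
        self._functions = {}
        self._endpoints = {}

    def register_function(self, func):
        function_id = str(uuid.uuid4())
        self._functions[function_id] = func
        return function_id

    def register_endpoint(self, endpoint):
        endpoint_id = str(uuid.uuid4())
        self._endpoints[endpoint_id] = endpoint
        return endpoint_id

    def run(self, function_id, endpoint_id, *args):
        """Route a registered function to a registered endpoint."""
        endpoint = self._endpoints[endpoint_id]
        return endpoint.execute(self._functions[function_id], *args)


def line_count(text):
    """Example function: count newline characters, like `wc -l`."""
    return text.count("\n")


if __name__ == "__main__":
    service = ToyService()
    fid = service.register_function(line_count)
    eid = service.register_endpoint(ToyEndpoint())
    print(service.run(fid, eid, "a\nb\nc\n").result())
```

The point of the sketch is the separation of concerns: callers hold only identifiers, while the service decides where the function runs.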
Metadata Extractor = Function
Metadata Extractor: Instructions to create a mapping from input file to
output JSON, i.e., it looks like a function
Function: Python/Bash metadata extraction instructions
Payload: File or group of files from which to extract metadata
Function Containers: Containers holding all execution dependencies
[Figure: example output metadata {location, physical attributes, derived information, provenance}, with the extractor shown as "?"]
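A minimal sketch of "extractor = function": a Python function that maps an input file to JSON-serializable metadata, plus a helper that applies it to a payload of several files. The names `line_count_extractor` and `extract_group` and the output keys are assumptions made for illustration.

```python
import json
import os
import tempfile


def line_count_extractor(path):
    """Hypothetical extractor: map one input file to a metadata dict."""
    with open(path, "rb") as handle:
        data = handle.read()
    return {
        "path": path,
        "extractor": "line_count",
        "metadata": {"bytes": len(data), "lines": data.count(b"\n")},
    }


def extract_group(paths, extractor):
    """Apply one extractor to a payload (a group of files); returns JSON text."""
    return json.dumps([extractor(p) for p in paths], sort_keys=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.txt")
        with open(sample, "wb") as handle:
            handle.write(b"one\ntwo\n")
        print(extract_group([sample], line_count_extractor))
```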
Xtract: the serverless metadata extraction system
Built atop funcX
Deploy endpoints at heterogeneous compute resources
on cloud, laptops, HPC, scientific instruments
Central web service determines extractors to send to
endpoints
Send extractors to data, receive results, determine future
extractors
Secure
Use Globus Auth for access control on data collections and
compute
Crawls any Globus-connected endpoints
Recursively generates file groups dir-by-dir
Prototype
Xtract: the serverless metadata extraction system
Site A: Compute at data
Site B: Compute elsewhere
Consider how one could automatically move data to available endpoints to suit constraints
Step 1: Crawl the File System
Recursively crawls all files in all nested dirs located on Globus Endpoint
Generate an initial metadata index for each file/file-group
Extracts physical metadata (path, size, extension, last-edited)
User Authentication: Globus Auth (OAuth 2)
[Figure: crawler with a directory queue, a file queue, and three crawl workers]
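The crawl step can be sketched on a local file system, which stands in here for a remote Globus endpoint. The function below is a sequential, queue-driven crawl that groups files dir-by-dir and records the physical metadata listed above; the function name and the dictionary keys are assumptions.

```python
import os
import tempfile
from collections import deque


def crawl(root):
    """Breadth-first crawl; returns one file group per directory holding files."""
    directory_queue = deque([root])
    file_groups = []
    while directory_queue:
        current = directory_queue.popleft()
        group = []
        with os.scandir(current) as entries:
            for entry in sorted(entries, key=lambda e: e.name):
                if entry.is_dir(follow_symlinks=False):
                    # Nested directories are queued and visited later.
                    directory_queue.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    info = entry.stat(follow_symlinks=False)
                    group.append({
                        "path": entry.path,
                        "size": info.st_size,
                        "extension": os.path.splitext(entry.name)[1].lstrip("."),
                        "last_edited": info.st_mtime,
                    })
        if group:
            file_groups.append({"directory": current, "files": group})
    return file_groups


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        os.makedirs(os.path.join(tmp, "a", "b"))
        with open(os.path.join(tmp, "a", "x.csv"), "w") as handle:
            handle.write("1,2\n")
        for file_group in crawl(tmp):
            print(file_group["directory"], len(file_group["files"]))
```

Several crawl workers could pull from the same directory queue; the single loop here keeps the sketch short.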
Step 2: File Type Identification
Need to “guess” a file’s type
Impractical to apply all extractors to all files (most yield no metadata)
Applying an incorrect extractor to a file can waste significant time
Random forest model trained on 5% of the files in a given repository
Features: 512 bytes from header
Training labels:
A file's type is the first metadata extractor that applies to the file
Feasible because extractors can find other applicable extractors
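A sketch of the feature side of file type identification: read the first 512 bytes of each file as a fixed-length vector and draw a 5% training sample. Zero-padding short files, the seed, and the helper names are assumptions; the classifier itself is left out, since any random forest implementation could consume these vectors.

```python
import os
import random
import tempfile

HEADER_BYTES = 512


def header_features(path, n_bytes=HEADER_BYTES):
    """Return the first n_bytes of a file as a list of ints in 0..255.

    Files shorter than n_bytes are zero-padded (an assumption) so that
    every feature vector has the same length.
    """
    with open(path, "rb") as handle:
        head = handle.read(n_bytes)
    return list(head) + [0] * (n_bytes - len(head))


def sample_training_set(paths, fraction=0.05, seed=0):
    """Pick a random fraction of the files (at least one) for training."""
    paths = list(paths)
    if not paths:
        return []
    k = max(1, round(len(paths) * fraction))
    return random.Random(seed).sample(paths, k)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "archive.zip")
        with open(path, "wb") as handle:
            handle.write(b"PK\x03\x04")
        print(header_features(path)[:8])
```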
Step 3: Metadata Extractor Orchestration
Xtract uses file type identity to choose the first appropriate extractor
Extractors return results to service and may immediately deploy
additional extractors to endpoint. This can be done recursively.
One file will likely receive multiple metadata extraction functions
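The orchestration loop can be sketched as a work queue: run the first extractor chosen from the file type, then run whatever extractors it suggests, each at most once. The two placeholder extractors and the `(metadata, suggestions)` return convention are assumptions made for this sketch.

```python
from collections import deque


def keyword_extractor(path):
    """Placeholder extractor: returns metadata and no follow-up suggestions."""
    return {"keywords": ["placeholder"]}, []


def tabular_extractor(path):
    """Placeholder extractor: suggests keyword extraction as a follow-up."""
    return {"columns": ["placeholder"]}, ["keyword"]


EXTRACTORS = {"keyword": keyword_extractor, "tabular": tabular_extractor}


def orchestrate(path, first_extractor, extractors=EXTRACTORS):
    """Run the first extractor, then any suggested ones, each at most once."""
    pending = deque([first_extractor])
    seen = set()
    metadata = {}
    while pending:
        name = pending.popleft()
        if name in seen or name not in extractors:
            continue
        seen.add(name)
        result, suggestions = extractors[name](path)
        metadata[name] = result
        pending.extend(suggestions)
    return metadata


if __name__ == "__main__":
    print(orchestrate("data.csv", "tabular"))
```

Tracking `seen` keeps two extractors that suggest each other from looping forever.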
Step 4: Ingest Metadata Document
Currently Xtract supports ingesting JSON directly to Globus Search
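A sketch of wrapping extracted metadata in a JSON ingest document. The field names (`ingest_type`, `ingest_data`, `gmeta`, `subject`, `visible_to`, `content`) follow the GMetaList layout as recalled here and should be treated as an assumption to check against the Globus Search documentation; the function name and the subject string are hypothetical.

```python
import json


def build_ingest_document(subject, metadata, visible_to=("public",)):
    """Wrap one file's metadata in a list-style ingest document (assumed layout)."""
    entry = {
        "subject": subject,              # identifier for the indexed file
        "visible_to": list(visible_to),  # who may see this entry
        "content": metadata,             # the extracted metadata itself
    }
    return {"ingest_type": "GMetaList", "ingest_data": {"gmeta": [entry]}}


if __name__ == "__main__":
    document = build_ingest_document(
        "globus://example-endpoint/data/file.csv",  # hypothetical subject
        {"extension": "csv", "size": 1024},
    )
    print(json.dumps(document, indent=2))
```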
Diverse, Plentiful Data in Materials Science
The Materials Data Facility (MDF):
• is a centralized hub for publishing,
storing, and discovering materials data
• stores many terabytes of data from
myriad research groups
• spans tens of millions of files
• is co-hosted by ANL and NCSA (at UIUC)
Thus, manual metadata curation is difficult
The Materials Extractor
Atomistic simulations, crystal structures, density
functional theory (DFT) calculations, electron
microscopy outputs, images, papers, tabular data,
abstracts, . . .
MaterialsIO is a library of tools to generate
summaries of materials science data files
We developed a ‘materials extractor’ to return
these summaries as metadata
https://materialsio.readthedocs.io/en/latest/
Extractor Library
We operate a (growing!) suite of metadata extractors, including:
Extractor         Description
File Type         Generate hints to guide extractor selection
Images            SVM analysis to determine image type (map, plot, photo, etc.)
Semi-Structured   Extract headings and compute attribute-level metadata
Keyword           Extract keyword tags from text
Materials         Extract information from identifiable materials science formats
Hierarchical      Extract and derive attributes from hierarchical files (NetCDF, HDF)
Tabular           Column-level metadata and aggregates, nulls, and headers
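As an illustration of one library entry, here is a simplified sketch of a tabular extractor that reports headers, null counts, and numeric aggregates per column. It assumes CSV text with a header row; the function name and output layout are hypothetical and do not describe the actual Xtract extractor.

```python
import csv
import io


def tabular_metadata(text):
    """Column-level metadata for CSV text with a header row (simplified)."""
    reader = csv.DictReader(io.StringIO(text))
    headers = list(reader.fieldnames or [])
    nulls = {h: 0 for h in headers}
    values = {h: [] for h in headers}
    numeric = {h: True for h in headers}
    for row in reader:
        for h in headers:
            raw = row.get(h)
            if raw is None or raw.strip() == "":
                nulls[h] += 1  # empty or missing cell
                continue
            try:
                values[h].append(float(raw))
            except ValueError:
                numeric[h] = False  # column holds non-numeric text
    summary = {"headers": headers, "columns": {}}
    for h in headers:
        column = {"nulls": nulls[h]}
        if numeric[h] and values[h]:
            column["min"] = min(values[h])
            column["max"] = max(values[h])
            column["mean"] = sum(values[h]) / len(values[h])
        summary["columns"][h] = column
    return summary


if __name__ == "__main__":
    print(tabular_metadata("temp,label\n1.0,a\n,b\n3.0,c\n"))
```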
Experimental Machinery
Xtract Service
AWS EC2 t2.small instance (Intel Xeon; 1 vCPU, 2GB RAM)
Endpoint
funcX deployed at ANL’s PetrelKube
14-node Kubernetes cluster
Data
Stored on the Petrel data service (3 PB, Globus-accessible
endpoint at ANL)
255,000 randomly selected files from Materials Data Facility
We evaluate Xtract on the
following dimensions:
1. Crawling Performance
2. File Type Training
3. Extractor Latency
Future work will evaluate:
4. Metadata quality
5. Tradeoff optimization (transfer or move if nonuniform resource usage)
1. Crawling Performance
Sequential crawling: 2.2 million files in ~5.2 hours
Parallelization? Soon. The remote ls command was previously rate-limited,
and a majority of directories have 0 or 1 files.
[Figure: crawler with a directory queue, a file queue, and three crawl workers]
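The sequential crawl figure (2.2 million files in about 5.2 hours) implies a throughput that is easy to work out. The projection to 30 million files assumes the rate stays constant, which is only an assumption; the function names are hypothetical.

```python
def files_per_second(n_files, hours):
    """Average crawl throughput in files per second."""
    return n_files / (hours * 3600)


def projected_hours(target_files, measured_files, measured_hours):
    """Hours to crawl target_files if the measured rate stayed constant."""
    return target_files / measured_files * measured_hours


if __name__ == "__main__":
    rate = files_per_second(2_200_000, 5.2)
    print(f"{rate:.1f} files/s, {1000 / rate:.1f} ms per file")
    print(f"{projected_hours(30_000_000, 2_200_000, 5.2):.1f} hours for 30 million files")
```

That works out to roughly 118 files per second, or about 71 hours for 30 million files at the same sequential rate.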
2. File Type Training
Train file type identification model on 110,900 files in MDF
Total time: 5.3 hours (one-time cost)
Label generation: 5.3 hours
Feature collection + random forests training: 45 seconds
Accuracy: 97%
Precision: 97%
Recall: 91%
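For reference, the three reported metrics can be computed from confusion counts as sketched below. The counts in the example are hypothetical and are not the MDF results; the sketch covers the binary case only.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from binary confusion counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical counts chosen only to exercise the formulas.
    print(classification_metrics(tp=9, fp=1, fn=3, tn=7))
```

Precision above recall, as in the slide, means the model rarely assigns a wrong type but misses some files of a given type.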
3. Extraction Performance
[Figure: Batching]
Extractor Latency
Extractor    # Files   Avg. Size (MB)   Avg. Extract Time (ms)   Avg. Stage Time (ms)
File Type    255,132   1.52             3.48                     714
Images        76,925   4.17             19.30                    1,198
Semi-Str.     29,850   0.38             8.97                     412
Keyword       25,997   0.06             0.20                     346
Materials     95,434   0.001            24                       1,760
Hierarch.      3,855   695              1.90                     9,150
Tabular        1,227   1.03             113                      625
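The latency figures suggest that staging, not extraction, dominates per-file time. The sketch below computes the staging share for each extractor from the averages in the table, assuming (as a simplification) that stage and extract times simply add.

```python
# (extractor, avg. extract time in ms, avg. stage time in ms), from the table.
LATENCIES = [
    ("File Type", 3.48, 714),
    ("Images", 19.30, 1198),
    ("Semi-Str.", 8.97, 412),
    ("Keyword", 0.20, 346),
    ("Materials", 24, 1760),
    ("Hierarch.", 1.90, 9150),
    ("Tabular", 113, 625),
]


def staging_share(extract_ms, stage_ms):
    """Fraction of combined time spent staging, assuming the two times add."""
    return stage_ms / (extract_ms + stage_ms)


if __name__ == "__main__":
    for name, extract_ms, stage_ms in LATENCIES:
        print(f"{name:<10} {staging_share(extract_ms, stage_ms):.1%}")
```

Under that assumption, staging accounts for more than 84% of the combined time for every extractor, with the tabular extractor the lowest.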
Conclusion
Data are big, diverse and distributed and are not FAIR (by default)
Xtract is a prototype that enables scalable, distributed metadata
extraction on heterogeneous data stores and compute resources
Future work centers on taking advantage of heterogeneous,
distributed resources subject to a number of usage and cost constraints
Next up: index the full 30+ million file Materials Data Facility
Learn more about future work
at the Doctoral Symposium
skluzacek@uchicago.edu
Doctoral Symposium Article:
“Dredging a Data Lake: Decentralized Metadata Extraction”. Tyler J. Skluzacek. Middleware ’19

WoSC19: Serverless Workflows for Indexing Large Scientific Data

  • 1. Serverless Workflows for Indexing Large Scientific Data Tyler J. Skluzacek, Ryan Chard, Ryan Wong, Zhuozhao Li, Yadu Babuji, Logan Ward, Ben Blaiszik, Kyle Chard, Ian Foster
  • 2. Data are big, diverse, and distributed Big: petabytes → exabytes Diverse: thousands → millions of unique file extensions Distributed: IoT (edge), HPC, cloud; from many individuals
  • 3. Generally, scientific data are not FAIR: Findable, Accessible, Interoperable, Reusable. Root of the problem: files lack descriptive metadata. Root of the root of the problem: humans are lazy, metadata are hard. [Figure labels: Files, Humans, Metadata Extraction, Ingestion, Search Index; example metadata: {location, physical attributes, derived information, provenance}]
  • 4. We need an automated metadata extraction system. Ideally, to cancel* humans. [Figure labels: Files, HTTPS, Metadata Extraction Service, Ingestion, Search Index; example metadata: {location, physical attributes, derived information, provenance}]
  • 5. We need a flexible, decentralized, scalable metadata extraction system. 1. Send metadata extraction functions to data: no need to ship big data. 2. Decentralized: extract the data in their natural habitats (e.g., edge). 3. Scalable: run many concurrent metadata extraction processes (e.g., wc -l $FILE1, wc -l $FILE2, . . ., wc -l $FILE600000).
  • 6. funcX for FaaS anywhere. Enable secure, isolated, on-demand function serving on myriad compute resources (cloud, HPC, laptops, edge). Abstract away underlying infrastructure via the Parsl parallel scripting language. Users deploy endpoints on available compute resources, use Globus Auth for access control, and access a library of containers for running functions. [Figure labels: Function (wc -l $FILE1), funcX service, Compute Resource, funcX Endpoint, Worker, Container, Result]
  • 7. Metadata Extractor = Function. Metadata Extractor: instructions to create a mapping from input file to output JSON, i.e., it looks like a function. Function: Python/BASH metadata extraction instruction. Payload: file or group of files from which to extract metadata. Function Containers: containers containing all execution dependencies. [Figure: example output metadata {location, physical attributes, derived information, provenance}]
  • 8. Xtract: the serverless metadata extraction system (prototype). Built atop funcX. Deploy endpoints at heterogeneous compute resources on cloud, laptops, HPC, scientific instruments. Central web service determines extractors to send to endpoints. Send extractors to data, receive results, determine future extractors. Secure. Use Globus Auth for access control on data collections and compute. Crawls any Globus-connected endpoints. Recursively generates file groups dir-by-dir.
  • 9. Xtract: the serverless metadata extraction system Site A: Compute at data Site B: Compute elsewhere Consider how one could automatically move data to available endpoints to suit constraints
  • 10. Step 1: Crawl the File System. Recursively crawls all files in all nested dirs located on a Globus endpoint. Generates an initial metadata index for each file/file-group. Extracts physical metadata (path, size, extension, last-edited). User authentication: Globus Auth (OAuth 2). [Figure labels: Directory queue, Crawl Workers, File queue]
  • 11. Step 2: File Type Identification. Need to “guess” a file’s type. Impractical to apply all extractors to all files (most yield no metadata). Applying an incorrect extractor to a file can waste significant time. Random forests model trained on 5% of files in a given repo. Features: 512 bytes from header. Training: a file’s type is determined by the first metadata extractor applicable to the file. Feasible because extractors can find other applicable extractors.
  • 12. Step 3: Metadata Extractor Orchestration. Xtract uses file type identity to choose the first appropriate extractor. Extractors return results to the service and may immediately deploy additional extractors to the endpoint; this can be done recursively. One file will likely receive multiple metadata extraction functions.
  • 13. Step 4: Ingest Metadata Document Currently Xtract supports ingesting JSON directly to Globus Search
  • 14. Diverse, Plentiful Data in Materials Science. The Materials Data Facility (MDF): is a centralized hub for publishing, storing, and discovering materials data; stores many terabytes of data from myriad research groups; is spread across tens of millions of files; is co-hosted by ANL and NCSA (at UIUC). Thus, manual metadata curation is difficult.
  • 15. The Materials Extractor. Atomistic simulations, crystal structures, density functional theory (DFT) calculations, electron microscopy outputs, images, papers, tabular data, abstracts, . . . MaterialsIO is a library of tools to generate summaries of materials science data files. We developed a ‘materials extractor’ to return the summary as metadata. https://materialsio.readthedocs.io/en/latest/
  • 16. Extractor Library. We operate a (growing!) suite of metadata extractors, including:
      Extractor        Description
      File Type        Generate hints to guide extractor selection
      Images           SVM analysis to determine image type (map, plot, photo, etc.)
      Semi-Structured  Extract headings and compute attribute-level metadata
      Keyword          Extract keyword tags from text
      Materials        Extract information from identifiable materials science formats
      Hierarchical     Extract and derive attributes from hierarchical files (NetCDF, HDF)
      Tabular          Column-level metadata and aggregates, nulls, and headers
  • 17. Experimental Machinery. Xtract Service: AWS EC2 t2.small instance (Intel Xeon; 1 vCPU, 2GB RAM). Endpoint: funcX deployed at ANL’s PetrelKube 14-node Kubernetes cluster. Data: stored on the Petrel data service (3 PB, Globus-accessible endpoint at ANL); 255,000 randomly selected files from Materials Data Facility.
  • 18. We evaluate Xtract on the following dimensions: 1. Crawling Performance 2. File Type Training 3. Extractor Latency Future work will evaluate: 4. Metadata quality 5. Tradeoff optimization (transfer or move if nonuniform resource usage)
  • 19. 1. Crawling Performance. Sequential crawling: 2.2 million files in ~5.2 hours. Parallelization? Soon. The remote ls command was previously rate-limited, and a majority of directories have 0 or 1 files. [Figure labels: Crawler, Directory queue, Crawl Workers, File queue]
  • 20. 2. File Type Training. Train file type identification model on 110,900 files in MDF. Total time: 5.3 hours (one-time cost). Label generation: 5.3 hours. Feature collection + random forests training: 45 seconds. Accuracy: 97%. Precision: 97%. Recall: 91%.
  • 21. 3. Extraction Performance. [Figure panel: Batching] Extractor Latency:
      Extractor   # Files   Avg. Size (MB)   Avg. Extract Time (ms)   Avg. Stage Time (ms)
      File Type   255,132   1.52             3.48                     714
      Images      76,925    4.17             19.30                    1,198
      Semi-Str.   29,850    0.38             8.97                     412
      Keyword     25,997    0.06             0.20                     346
      Materials   95,434    0.001            24                       1,760
      Hierarch.   3,855     695              1.90                     9,150
      Tabular     1,227     1.03             113                      625
  • 22. Conclusion. Data are big, diverse, and distributed, and are not FAIR (by default). Xtract is a prototype that enables scalable, distributed metadata extraction on heterogeneous data stores and compute resources. Future work is predicated on taking advantage of heterogeneous, distributed resources subject to a number of usage and cost constraints. Next up: index the full 30+ million file Materials Data Facility.
  • 23. Learn more about future work at the Doctoral Symposium. skluzacek@uchicago.edu. Doctoral Symposium article: “Dredging a Data Lake: Decentralized Metadata Extraction”. Tyler J. Skluzacek. Middleware ’19
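
Steps 1 and 2 of the transcript above (the queue-based crawl that records physical metadata, and the 512-byte header features used for file type identification) can be illustrated with a minimal local sketch. This is not Xtract's code: it walks a local directory instead of a Globus endpoint, runs a single worker rather than several, and the function names `crawl` and `header_features` and the dictionary keys are hypothetical choices made for this example.

```python
import os
from collections import deque
from datetime import datetime, timezone


def crawl(root):
    """Crawl a local directory tree breadth-first using a directory queue.

    Yields one physical-metadata dict per file with the fields named on the
    crawl slide: path, size, extension, last-edited. (Assumption: a local
    file system stands in for a Globus endpoint.)
    """
    dir_queue = deque([root])
    while dir_queue:
        current = dir_queue.popleft()
        with os.scandir(current) as entries:
            # Sort by name so the crawl order is deterministic.
            for entry in sorted(entries, key=lambda e: e.name):
                if entry.is_dir(follow_symlinks=False):
                    dir_queue.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    stat = entry.stat()
                    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
                    yield {
                        "path": entry.path,
                        "size": stat.st_size,
                        "extension": os.path.splitext(entry.name)[1].lower(),
                        "last_edited": modified.isoformat(),
                    }


def header_features(path, n_bytes=512):
    """Return the first n_bytes of a file as a fixed-length list of ints.

    Files shorter than n_bytes are zero-padded so every feature vector has
    the same length (the padding scheme is an assumption of this sketch).
    """
    with open(path, "rb") as f:
        head = f.read(n_bytes)
    return list(head) + [0] * (n_bytes - len(head))
```

Each record from `crawl` could seed the initial metadata index for a file, and the vectors from `header_features` have the fixed length a classifier such as a random forest expects as input. Labels for training would still have to come from running extractors, as the transcript describes.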