Serverless Workflows for
Indexing Large Scientific Data
Tyler J. Skluzacek, Ryan Chard, Ryan Wong, Zhuozhao Li, Yadu Babuji,
Logan Ward, Ben Blaiszik, Kyle Chard, Ian Foster
Data are big, diverse, and distributed
Big: petabytes → exabytes
Diverse: thousands → millions of unique file extensions
Distributed: IoT (edge), HPC, cloud; from many individuals
Generally, scientific data are not FAIR
Findable, Accessible, Interoperable, Reusable
Root of the problem: files lack descriptive metadata
Root of the root of the problem: humans are lazy, metadata are hard
[Diagram: Files → Humans → Metadata Extraction → Ingestion → Search Index, yielding metadata such as {location, physical attributes, derived information, provenance}]
We need an automated
metadata extraction system
Ideally, taking humans out of the loop
[Diagram: Files → (HTTPS) → Metadata Extraction Service → Ingestion → Search Index, yielding metadata such as {location, physical attributes, derived information, provenance}]
We need a flexible, decentralized, scalable
metadata extraction system
1. Send metadata extraction functions to data
No need to ship big data
2. Decentralized
Extract the data in their natural habitats (e.g., edge)
3. Scalable
Run many concurrent metadata extraction processes
wc -l $FILE1
wc -l $FILE1
wc -l $FILE2
wc -l $FILE1
wc -l $FILE2
. . .
wc -l $FILE600000
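As a rough illustration of the fan-out idea (not Xtract code), the sketch below runs a trivial line-count "extractor" over many files in parallel with a local process pool; the file paths and pool size are placeholders.

# Illustrative sketch: fanning out a line-count "extractor" over many files.
# Paths and worker count are hypothetical placeholders.
from concurrent.futures import ProcessPoolExecutor

def line_count_extractor(path):
    """Return minimal metadata for one file: its path and line count."""
    with open(path, "rb") as f:
        return {"path": path, "line_count": sum(1 for _ in f)}

if __name__ == "__main__":
    files = [f"/data/file{i}.txt" for i in range(1, 601)]  # placeholder paths
    with ProcessPoolExecutor(max_workers=8) as pool:
        metadata = list(pool.map(line_count_extractor, files))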
funcX for FaaS anywhere
Enable secure, isolated, on-demand function
serving on myriad compute resources
(cloud, HPC, laptops, edge)
Abstract away underlying infrastructure via Parsl
parallel scripting language
Users deploy endpoints on available compute
resources, use Globus Auth for access control, and
access a library of containers for running functions
[Diagram: a client submits a function (e.g., wc -l $FILE1) to the funcX service, which routes it to a funcX endpoint on a compute resource; a worker executes it in a container and returns the result.]
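A minimal sketch of that flow using the funcX Python SDK is shown below; the endpoint UUID is a placeholder and client call details may vary by SDK version.

# Hedged sketch: register a function with funcX and run it on an endpoint.
from funcx.sdk.client import FuncXClient

def count_lines(path):
    with open(path) as f:
        return {"path": path, "line_count": sum(1 for _ in f)}

fxc = FuncXClient()                               # authenticates via Globus Auth
func_id = fxc.register_function(count_lines)
task_id = fxc.run("/data/sample.txt",             # payload: a file local to the endpoint
                  endpoint_id="ENDPOINT-UUID",    # placeholder endpoint ID
                  function_id=func_id)
result = fxc.get_result(task_id)                  # e.g., {"path": ..., "line_count": ...}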
Metadata Extractor = Function
Metadata Extractor: instructions that map an input file to output JSON; in other words, it behaves like a function
Function: a Python/Bash metadata extraction routine
Payload: the file or group of files from which to extract metadata
Function Containers: containers bundling all execution dependencies
[Diagram: input file → extractor function → metadata such as {location, physical attributes, derived information, provenance}]
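For concreteness, a hypothetical extractor of this shape might look like the sketch below: one file in, one JSON-serializable metadata document out.

# Hedged sketch of a metadata extractor "function" (file path -> JSON-able dict).
import os

def physical_metadata_extractor(path):
    """Extract basic physical metadata from a single file."""
    stat = os.stat(path)
    return {
        "path": path,
        "size_bytes": stat.st_size,
        "extension": os.path.splitext(path)[1],
        "last_modified": stat.st_mtime,
    }

# Example: physical_metadata_extractor("example.csv") ->
# {"path": "example.csv", "size_bytes": ..., "extension": ".csv", "last_modified": ...}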
Xtract: the serverless metadata extraction system
Built atop funcX
Deploy endpoints at heterogeneous compute resources
on cloud, laptops, HPC, scientific instruments
Central web service determines extractors to send to
endpoints
Send extractors to data, receive results, determine future
extractors
Secure
Use Globus Auth for access control on data collections and
compute
Crawls any Globus-connected endpoints
Recursively generates file groups dir-by-dir
Prototype
Xtract: the serverless metadata extraction system
Site A: Compute at data
Site B: Compute elsewhere
Consider how one could automatically move data to available endpoints to suit constraints
Step 1: Crawl the File System
Recursively crawls all files in all nested dirs located on Globus Endpoint
Generate an initial metadata index for each file/file-group
Extracts physical metadata (path, size, extension, last-edited)
User Authentication: Globus Auth (OAuth 2)
[Diagram: crawler architecture; a directory queue and a file queue feed multiple parallel crawl workers.]
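A simplified, local-filesystem analogue of this crawl step is sketched below (the real crawler lists directories over Globus rather than with os.scandir); the root path is a placeholder.

# Hedged sketch: breadth-first crawl with a directory queue, emitting
# an initial physical-metadata record per file.
import os
from collections import deque

def crawl(root):
    dir_queue = deque([root])
    while dir_queue:
        current = dir_queue.popleft()
        for entry in os.scandir(current):
            if entry.is_dir(follow_symlinks=False):
                dir_queue.append(entry.path)
            else:
                st = entry.stat()
                yield {"path": entry.path,
                       "size_bytes": st.st_size,
                       "extension": os.path.splitext(entry.name)[1],
                       "last_modified": st.st_mtime}

# for record in crawl("/data"): ...  # each record seeds the metadata index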
Step 2: File Type Identification
Need to “guess” a file’s type
Impractical to apply all extractors to all files (most yield no metadata)
Applying an incorrect extractor to a file can waste significant time
Random Forests model trained on 5% of files in a given repo
Features: 512 bytes from header
Training:
A file's type label is the first metadata extractor that successfully applies to it
This is feasible because extractors can identify other applicable extractors
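A minimal sketch of such a model, assuming scikit-learn and placeholder training files and labels, could look like this:

# Hedged sketch: random forest over the first 512 header bytes of each file.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def header_features(path, n=512):
    with open(path, "rb") as f:
        head = f.read(n)
    head = head.ljust(n, b"\x00")            # pad short files to a fixed length
    return np.frombuffer(head, dtype=np.uint8)

paths = ["a.csv", "b.json"]                  # placeholder training files
labels = ["tabular", "semi-structured"]      # placeholder extractor-derived labels
X = np.stack([header_features(p) for p in paths])
clf = RandomForestClassifier(n_estimators=100).fit(X, labels)
# clf.predict([header_features("unknown.bin")]) -> predicted file type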
Step 3: Metadata Extractor Orchestration
Xtract uses file type identity to choose the first appropriate extractor
Extractors return results to service and may immediately deploy
additional extractors to endpoint. This can be done recursively.
One file will likely receive multiple metadata extraction functions
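One hypothetical way to structure that recursive dispatch (illustrative only; extractor names and return shapes are assumptions) is sketched below.

# Hedged sketch of extractor orchestration: start from the type hint,
# apply an extractor, then follow any extractors it suggests next.
def orchestrate(file_record, extractors, first):
    """extractors: name -> callable returning (metadata, [next extractor names])."""
    metadata, queue, seen = {}, [first], set()
    while queue:
        name = queue.pop(0)
        if name in seen or name not in extractors:
            continue
        seen.add(name)
        result, next_extractors = extractors[name](file_record)
        metadata.update(result)
        queue.extend(next_extractors)   # extractors can trigger further extractors
    return metadata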
Step 4: Ingest Metadata Document
Currently Xtract supports ingesting JSON directly to Globus Search
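One way this ingestion might look, assuming the globus-sdk SearchClient (the index ID, authorizer setup, and document fields below are placeholders, and the exact ingest document format may differ):

# Hedged sketch: ingest one extracted metadata document into Globus Search.
import globus_sdk

def ingest_metadata(search_client, index_id, subject, metadata):
    doc = {
        "ingest_type": "GMetaEntry",
        "ingest_data": {
            "subject": subject,            # e.g., the file's Globus path
            "visible_to": ["public"],
            "content": metadata,           # the extracted JSON metadata
        },
    }
    return search_client.ingest(index_id, doc)

# sc = globus_sdk.SearchClient(authorizer=...)   # authorizer setup omitted
# ingest_metadata(sc, "INDEX-UUID", "globus://endpoint/path/file.csv", md)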
Diverse, Plentiful Data in Materials Science
The Materials Data Facility (MDF):
• is a centralized hub for publishing,
storing, discovering materials data
• stores many terabytes of data from
myriad research groups
• is spread across tens of millions of files
• is co-hosted by ANL and NCSA (at UIUC)
Thus, manual metadata curation is difficult
The Materials Extractor
Atomistic simulations, crystal structures, density
functional theory (DFT) calculations, electron
microscopy outputs, images, papers, tabular data,
abstracts, . . .
MaterialsIO is a library of tools to generate
summaries of materials science data files
We developed a ‘materials extractor’ that returns these
summaries as metadata
https://materialsio.readthedocs.io/en/latest/
Extractor Library
We operate a (growing!) suite of metadata extractors, including:
Extractor | Description
File Type | Generate hints to guide extractor selection
Images | SVM analysis to determine image type (map, plot, photo, etc.)
Semi-Structured | Extract headings and compute attribute-level metadata
Keyword | Extract keyword tags from text
Materials | Extract information from identifiable materials science formats
Hierarchical | Extract and derive attributes from hierarchical files (NetCDF, HDF)
Tabular | Column-level metadata and aggregates, nulls, and headers
Experimental Machinery
Xtract Service
AWS EC2 t2.small instance (Intel Xeon; 1 vCPU, 2GB RAM)
Endpoint
funcX deployed at ANL’s PetrelKube
14-node Kubernetes cluster
Data
Stored on the Petrel data service (3 PB, Globus-accessible
endpoint at ANL)
255,000 randomly selected files from Materials Data Facility
We evaluate Xtract on the
following dimensions:
1. Crawling Performance
2. File Type Training
3. Extractor Latency
Future work will evaluate:
4. Metadata quality
5. Tradeoff optimization (e.g., transferring or moving data when resource usage is nonuniform)
1. Crawling Performance
Sequential crawling: 2.2 million files in ~5.2 hours
Parallelization? Soon. The remote ls command was previously rate-limited, and a majority of directories have 0 or 1 files.
[Diagram: crawler architecture (directory queue, file queue, parallel crawl workers), as in Step 1.]
2. File Type Training
Train file type identification model on 110,900 files in MDF
Total time: 5.3 hours (one-time cost)
Label generation: 5.3 hours
Feature collection + random forests training: 45 seconds
Accuracy: 97%
Precision: 97%
Recall: 91%
3. Extraction Performance
Extractor Latency

Extractor | # Files | Avg. Size (MB) | Avg. Extract Time (ms) | Avg. Stage Time (ms)
File Type | 255,132 | 1.52 | 3.48 | 714
Images | 76,925 | 4.17 | 19.30 | 1,198
Semi-Str. | 29,850 | 0.38 | 8.97 | 412
Keyword | 25,997 | 0.06 | 0.20 | 346
Materials | 95,434 | 0.001 | 24 | 1,760
Hierarch. | 3,855 | 695 | 1.90 | 9,150
Tabular | 1,227 | 1.03 | 113 | 625
Conclusion
Data are big, diverse, and distributed, and are not FAIR by default
Xtract is a prototype that enables scalable, distributed metadata
extraction on heterogeneous data stores and compute resources
Future work focuses on taking advantage of heterogeneous,
distributed resources subject to usage and cost constraints
Next up: index the full 30+ million file Materials Data Facility
Learn more about future work
at the Doctoral Symposium
skluzacek@uchicago.edu
Doctoral Symposium Article:
“Dredging a Data Lake: Decentralized Metadata Extraction”. Tyler J. Skluzacek. Middleware ‘19