SlideShare a Scribd company logo
Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying 
Mohammad Rashti – RNET Technologies 1
Outline 
 
Data Loading in HPCC Systems 
 
Potential Optimizations 
 
iNFORMER - The Big Picture of the SBIR project 
 
iNFORMER Modules 
 
The HPCC on AWS Case – S3 Parallel Data Access 
 
Feasibility study and experiments 
 
Ongoing and Future Work 2
Data Loading in HPCC Systems 
Potential Optimizations 
3
Data Loading in HPCC Systems 4 
 
In Thor, the data needs to be transferred into a single drop zone (Landing Zone) 
 
Landing zone node sprays the data over the DFS on the Thor nodes 
 
Imposes overhead on data loading 
 
Landing zone and spraying can become a bottleneck 
 
Especially for large data sets or partial loads
Optimizations for Spraying Process 
 
Eliminate the landing zone bottleneck 
 
Read data records directly from their original data format in their original place 
 
Use distributed data handling 
 
Use multiple Thor nodes to load (and potentially decompress) the data 
 
Each Thor node can read its own data 
 
Load the data directly into destination files on DFS 
 
Optimize partial data loads 
 
Add support for in-memory processing 
 
Skip trips to disk for iterative computing 5
iNFORMER 
The Big Picture of the SBIR Project 6
SBIR Project 
 
This work is part of an SBIR project from the Department of Energy 
 
Carried on in two phases 
 
Phase I (9 months) 
 
Feasibility study of the ideas and commercial potential 
 
Design of the prototype 
 
Phase II (2 years) 
 
Prototype implementation 
 
Commercialization of the product 
 
Competitive renewal proposal submission 
 
The Phase I is currently in progress (ending in Dec.) 
 
Phase II may start in 2Q 2015 7
iNFORMER’s General Architecture 
 
iNFORMER is the ultimate product developed in this project 
 
It is designed to provide: 
 
Efficient batch and in-situ data analytics 
 
Low-overhead native-format data access through a unified API 
 
iNFORMER components can be integrated into HPCC Systems 
 
Next slide shows a big-picture diagram of: 
 
iNFORMER components (the dark boxes) 
 
Where it can fit in HPCC systems. 8
iNFORMER in HPCC – The Big Picture 
DFS 
Any FS 
Various data stores 
Thor Cluster 
iNFORMER MATE 
data transformation 
Access API 
iNFORMER Virtual Data Integrator 
ECL Program 
Big Picture view of iNFORMER components and potential integration with HPCC 
new module 
existing link 
new link 
Landing Zone 9
iNFORMER MATE 
 
MATE is meant to replace MapReduce 
 
Using the concept of Reduction Object 
 
Replace map and reduce with generalized reduction 
 
Avoid intermediate shuffle/sort of key/value pairs 
 
Main benefits of the Reduction Object 
 
Not tied to the limitations of the key/value model 
 
More efficient data processing with less data movement 
 
Lower memory requirements 
 
Performance improvement: 1.50 to 20 fold 10
Virtual Data Integrator (VDI) 
 
VDI offers a unified API for native-format data access 
 
Access the data in their original storage format 
 
Improves data access experience over various file- systems/data formats 
 
Parallel File Systems: Thor DFS, HDFS, PVFS, GPFS, etc. 
 
Data Formats: Plain Text, XML, NetCDF, HDF5, etc. 
 
Eliminates the need to transform the entire data into a platform-specific format (e.g., DFS) 
 
Especially for in-memory processing 
 
VDI has two parts: 
 
Data Storage Adapter (DSA) 
 
Virtual Storage API (VSA) 11
iNFORMER Virtual Data Integrator (VDI) with HPCC 
Data Storage Adapter (DSA) Modules 
data records 
configure with FS/format specifics 
Virtual Storage API (VSA) 
FS/format customized 
modules 
Fixed API 
HPCC Systems’ Thor 
Any FS 
(e.g., S3) 
HPCC DFS 
data copy 
iNOFORMER VDI and potential integration with HPCC Systems 
new data flow 
existing data flow 
Drop (Landing) Zone 
data transformation 12
Data Storage Adapter (DSA) 
 
A set of adapters that interface with various storage formats 
 
Access the respective data records in their original format 
 
Consolidates data integration process 
 
Allows for partial data load/access 
 
Eliminates the need for excessive data transformation 
 
Data records on the distributed FS will be extracted and forwarded to the analysis engine and back. 13
Some Internals of the DSA Component 
Internal Architecture of the DSA File Format Adapter 14
Virtual Storage API (VSA) 
 
VSA is a unified FS/format agnostic interface: 
 
The data access calls are not bound to a format 
 
The storage format specifics will guide the VSA implementation to choose the right DSA module 
 
VSA will have well-defined hooks for easy implementation of modules for new formats 
 
A set of pre-implemented FS/format adapters will be developed. 15
The HPCC on AWS Case 
S3 Parallel Data Access 16
Phase I - HPCC Feasibility Study 17 
 
Efficient loading of AWS S3-resident compressed files into Thor 
 
Data loading / spraying optimizations in this project 
 
Distributed data load / decompress / read from S3 into the memory of Thor nodes 
 
Distributed data Spray to DFS 
 
Each node should load its own portion of data to avoid shuffling (communication) 
 
Use a zip index file to facilitate this 
 
Study the potential for in-memory processing 
 
Avoid multiple disk accesses
S3 to EC2 Distributed Data Load 18 
 
Two potential plans for the proposed distributed S3 loading module 
 
Test data: Elsevier sample data sets hosted on S3 
 
XML-based records 
 
Multi-entry zip files 
Amazon S3 
Thor @EC2 
S3 Parallel Load / Decompress 
Kafka Producer 
Kafka to DFS 
Spray 
Memory to DFS Spray 
1 
2 
Landing Zone 
Decompress & Spray 
S3 Load 
Current approach
Development Steps 19 
 
Step 1: Distributed Spraying (currently implemented) 
 
Parallel loading, decompression and spraying of XML files from an S3 archive 
 
Each process extracts an XML file data directly from the archive 
 
One process per node 
 
Load balancing applied among the processes 
 
Message Passing Interface (MPI) used for inter-node communication and shuffling 
 
Step II: Skipping the Shuffling and the Landing Zone (next step) 
 
Each process reads it own relevant portion of the file directly from the archive and decompresses on the fly. 
 
Allows for partial data load from S3 
 
May need full compression index for the archives 
 
A separate module to be developed to do this
Step I – Distributed Spraying 20 
Extract XML file 
Decompress 
Spray (shuffle) 
DFS 
Processing on four (4) EC2-based Thor Nodes 
Multi-file Zip archive 
Amazon S3
Step I Results 21 
Comparing Sequential and Parallel Load/Spray from S3 into Thor
Ongoing and Future Work 22
Future Development Steps 
 
Next immediate steps (SBIR Phase I) 
 
Complete spraying into Thor DFS 
 
Evaluation using more data sets 
 
Fully eliminate the scattering process by partial file reading 
 
Design the zip indexing module 
 
Full VDI module prototype design 
 
Future (SBIR Phase II) Plans 
 
S3 adapter module as part of the VDI 
 
Expansion of the adapter module to support general non-S3 cases 
 
Full integration with HPCC to bypass the landing zone 
 
Support for in-memory spraying and data processing 
23
24 
Contact Info 
Mohammad Rashti 
240 W. Elmwood Dr., Dayton, OH 
(937) – 433 - 2886 
mrashti@RNET-Tech.com

More Related Content

Viewers also liked

BDT301 High Performance Computing in the Cloud - AWS re: Invent 2012
BDT301 High Performance Computing in the Cloud - AWS re: Invent 2012BDT301 High Performance Computing in the Cloud - AWS re: Invent 2012
BDT301 High Performance Computing in the Cloud - AWS re: Invent 2012
Amazon Web Services
 
Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sec...
Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sec...Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sec...
Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sec...
Amazon Web Services
 
HPC in the Cloud
HPC in the CloudHPC in the Cloud
HPC in the Cloud
Amazon Web Services
 
Probabilidad y estadistica
Probabilidad y estadisticaProbabilidad y estadistica
Probabilidad y estadistica
David Rodriguez
 
Telemedicine SW Clinical Society
Telemedicine SW Clinical Society Telemedicine SW Clinical Society
Telemedicine SW Clinical Society
David Voran
 
HPCC Systems Brings Deeper Customer Understanding and More Exact Social Marke...
HPCC Systems Brings Deeper Customer Understanding and More Exact Social Marke...HPCC Systems Brings Deeper Customer Understanding and More Exact Social Marke...
HPCC Systems Brings Deeper Customer Understanding and More Exact Social Marke...
HPCC Systems
 
My favorite anecdote
My favorite anecdoteMy favorite anecdote
My favorite anecdote
LORE OSORIO
 
Partnership and open data as enablers of INSPIREd innovative services
Partnership and open data as enablers of INSPIREd innovative servicesPartnership and open data as enablers of INSPIREd innovative services
Partnership and open data as enablers of INSPIREd innovative servicessmespire
 
Deploying office 2010 via group policy
Deploying office 2010 via group policyDeploying office 2010 via group policy
Deploying office 2010 via group policy
Naresh Gotad
 
Tugas imk
Tugas imkTugas imk
Tugas imk
roji muhidin
 
20130410 het abc van sociale media kalken
20130410 het abc van sociale media kalken20130410 het abc van sociale media kalken
20130410 het abc van sociale media kalkenkwb_eensgezind
 
The Complete Web Design Strategy Checklist
The Complete Web Design Strategy ChecklistThe Complete Web Design Strategy Checklist
The Complete Web Design Strategy ChecklistBrightIdeas.co
 
Ricerca sui fregi
Ricerca sui fregiRicerca sui fregi
Ricerca sui fregichristian98
 
LR новогодний каталог2012 часть2
LR новогодний каталог2012 часть2LR новогодний каталог2012 часть2
LR новогодний каталог2012 часть2
t575ae
 
Real estate craig feigin
Real estate  craig feiginReal estate  craig feigin
Real estate craig feigin
Craig Feigin
 
Aaaaa english trabajo 11 de maritza,tatiana ,sofia,maria,leydi
Aaaaa english trabajo 11  de maritza,tatiana ,sofia,maria,leydiAaaaa english trabajo 11  de maritza,tatiana ,sofia,maria,leydi
Aaaaa english trabajo 11 de maritza,tatiana ,sofia,maria,leydiMaritza Campaz Valencia
 

Viewers also liked (20)

BDT301 High Performance Computing in the Cloud - AWS re: Invent 2012
BDT301 High Performance Computing in the Cloud - AWS re: Invent 2012BDT301 High Performance Computing in the Cloud - AWS re: Invent 2012
BDT301 High Performance Computing in the Cloud - AWS re: Invent 2012
 
Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sec...
Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sec...Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sec...
Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sec...
 
HPC in the Cloud
HPC in the CloudHPC in the Cloud
HPC in the Cloud
 
Probabilidad y estadistica
Probabilidad y estadisticaProbabilidad y estadistica
Probabilidad y estadistica
 
Telemedicine SW Clinical Society
Telemedicine SW Clinical Society Telemedicine SW Clinical Society
Telemedicine SW Clinical Society
 
HPCC Systems Brings Deeper Customer Understanding and More Exact Social Marke...
HPCC Systems Brings Deeper Customer Understanding and More Exact Social Marke...HPCC Systems Brings Deeper Customer Understanding and More Exact Social Marke...
HPCC Systems Brings Deeper Customer Understanding and More Exact Social Marke...
 
My favorite anecdote
My favorite anecdoteMy favorite anecdote
My favorite anecdote
 
Il condizionale-Basico 2
Il condizionale-Basico 2Il condizionale-Basico 2
Il condizionale-Basico 2
 
Partnership and open data as enablers of INSPIREd innovative services
Partnership and open data as enablers of INSPIREd innovative servicesPartnership and open data as enablers of INSPIREd innovative services
Partnership and open data as enablers of INSPIREd innovative services
 
Whats the point of learning this?
Whats the point of learning this?Whats the point of learning this?
Whats the point of learning this?
 
4
44
4
 
Deploying office 2010 via group policy
Deploying office 2010 via group policyDeploying office 2010 via group policy
Deploying office 2010 via group policy
 
Tugas imk
Tugas imkTugas imk
Tugas imk
 
20130410 het abc van sociale media kalken
20130410 het abc van sociale media kalken20130410 het abc van sociale media kalken
20130410 het abc van sociale media kalken
 
The Complete Web Design Strategy Checklist
The Complete Web Design Strategy ChecklistThe Complete Web Design Strategy Checklist
The Complete Web Design Strategy Checklist
 
Ricerca sui fregi
Ricerca sui fregiRicerca sui fregi
Ricerca sui fregi
 
LR новогодний каталог2012 часть2
LR новогодний каталог2012 часть2LR новогодний каталог2012 часть2
LR новогодний каталог2012 часть2
 
Real estate craig feigin
Real estate  craig feiginReal estate  craig feigin
Real estate craig feigin
 
Aaaaa english trabajo 11 de maritza,tatiana ,sofia,maria,leydi
Aaaaa english trabajo 11  de maritza,tatiana ,sofia,maria,leydiAaaaa english trabajo 11  de maritza,tatiana ,sofia,maria,leydi
Aaaaa english trabajo 11 de maritza,tatiana ,sofia,maria,leydi
 
Ac 3 averescu5
Ac 3  averescu5Ac 3  averescu5
Ac 3 averescu5
 

Similar to HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying

Hortonworks Data Platform with IBM Spectrum Scale
Hortonworks Data Platform with IBM Spectrum ScaleHortonworks Data Platform with IBM Spectrum Scale
Hortonworks Data Platform with IBM Spectrum Scale
Abhishek Sood
 
Large Scale On-Demand Image Processing For Disaster Relief
Large Scale On-Demand Image Processing For Disaster ReliefLarge Scale On-Demand Image Processing For Disaster Relief
Large Scale On-Demand Image Processing For Disaster Relief
Robert Grossman
 
Sector Cloudcom Tutorial
Sector Cloudcom TutorialSector Cloudcom Tutorial
Sector Cloudcom Tutorial
lilyco
 
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
Robert Grossman
 
Mainframe Architecture & Product Overview
Mainframe Architecture & Product OverviewMainframe Architecture & Product Overview
Mainframe Architecture & Product Overviewabhi1112
 
Accelerate Spark Workloads on S3
Accelerate Spark Workloads on S3Accelerate Spark Workloads on S3
Accelerate Spark Workloads on S3
Alluxio, Inc.
 
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
EUDAT
 
Unit-3.pptx
Unit-3.pptxUnit-3.pptx
Unit-3.pptx
JasmineMichael1
 
Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The Field
Rob Gillen
 
Cloud Computing_ICT Concepts & Trends.pptx
Cloud Computing_ICT Concepts & Trends.pptxCloud Computing_ICT Concepts & Trends.pptx
Cloud Computing_ICT Concepts & Trends.pptx
ssuser6063b0
 
DB2 for z/O S Data Sharing
DB2 for z/O S  Data  SharingDB2 for z/O S  Data  Sharing
DB2 for z/O S Data Sharing
Surekha Parekh
 
Datastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobsDatastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobs
shanker_uma
 
Compact, Compress, De-Duplicate (DAOS)
Compact, Compress, De-Duplicate (DAOS)Compact, Compress, De-Duplicate (DAOS)
Compact, Compress, De-Duplicate (DAOS)
Ulrich Krause
 
DAOS Middleware overview
DAOS Middleware overviewDAOS Middleware overview
DAOS Middleware overview
Andrey Kudryavtsev
 
HDF Data in the Cloud
HDF Data in the CloudHDF Data in the Cloud
Achieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloadsAchieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloads
Alluxio, Inc.
 
A cloud environment for backup and data storage
A cloud environment for backup and data storageA cloud environment for backup and data storage
A cloud environment for backup and data storage
IGEEKS TECHNOLOGIES
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big Data
Robert Grossman
 
A cloud enviroment for backup and data storage
A cloud enviroment for backup and data storageA cloud enviroment for backup and data storage
A cloud enviroment for backup and data storage
IGEEKS TECHNOLOGIES
 

Similar to HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying (20)

Hortonworks Data Platform with IBM Spectrum Scale
Hortonworks Data Platform with IBM Spectrum ScaleHortonworks Data Platform with IBM Spectrum Scale
Hortonworks Data Platform with IBM Spectrum Scale
 
Large Scale On-Demand Image Processing For Disaster Relief
Large Scale On-Demand Image Processing For Disaster ReliefLarge Scale On-Demand Image Processing For Disaster Relief
Large Scale On-Demand Image Processing For Disaster Relief
 
Sector Cloudcom Tutorial
Sector Cloudcom TutorialSector Cloudcom Tutorial
Sector Cloudcom Tutorial
 
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
 
Mainframe Architecture & Product Overview
Mainframe Architecture & Product OverviewMainframe Architecture & Product Overview
Mainframe Architecture & Product Overview
 
Accelerate Spark Workloads on S3
Accelerate Spark Workloads on S3Accelerate Spark Workloads on S3
Accelerate Spark Workloads on S3
 
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
 
Unit-3.pptx
Unit-3.pptxUnit-3.pptx
Unit-3.pptx
 
Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The Field
 
Cloud Computing_ICT Concepts & Trends.pptx
Cloud Computing_ICT Concepts & Trends.pptxCloud Computing_ICT Concepts & Trends.pptx
Cloud Computing_ICT Concepts & Trends.pptx
 
DB2 for z/O S Data Sharing
DB2 for z/O S  Data  SharingDB2 for z/O S  Data  Sharing
DB2 for z/O S Data Sharing
 
Datastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobsDatastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobs
 
Compact, Compress, De-Duplicate (DAOS)
Compact, Compress, De-Duplicate (DAOS)Compact, Compress, De-Duplicate (DAOS)
Compact, Compress, De-Duplicate (DAOS)
 
DAOS Middleware overview
DAOS Middleware overviewDAOS Middleware overview
DAOS Middleware overview
 
HDF Data in the Cloud
HDF Data in the CloudHDF Data in the Cloud
HDF Data in the Cloud
 
optimizing_ceph_flash
optimizing_ceph_flashoptimizing_ceph_flash
optimizing_ceph_flash
 
Achieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloadsAchieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloads
 
A cloud environment for backup and data storage
A cloud environment for backup and data storageA cloud environment for backup and data storage
A cloud environment for backup and data storage
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big Data
 
A cloud enviroment for backup and data storage
A cloud enviroment for backup and data storageA cloud enviroment for backup and data storage
A cloud enviroment for backup and data storage
 

More from HPCC Systems

Natural Language to SQL Query conversion using Machine Learning Techniques on...
Natural Language to SQL Query conversion using Machine Learning Techniques on...Natural Language to SQL Query conversion using Machine Learning Techniques on...
Natural Language to SQL Query conversion using Machine Learning Techniques on...
HPCC Systems
 
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsImproving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
HPCC Systems
 
Towards Trustable AI for Complex Systems
Towards Trustable AI for Complex SystemsTowards Trustable AI for Complex Systems
Towards Trustable AI for Complex Systems
HPCC Systems
 
Welcome
WelcomeWelcome
Welcome
HPCC Systems
 
Closing / Adjourn
Closing / Adjourn Closing / Adjourn
Closing / Adjourn
HPCC Systems
 
Community Website: Virtual Ribbon Cutting
Community Website: Virtual Ribbon CuttingCommunity Website: Virtual Ribbon Cutting
Community Website: Virtual Ribbon Cutting
HPCC Systems
 
Path to 8.0
Path to 8.0 Path to 8.0
Path to 8.0
HPCC Systems
 
Release Cycle Changes
Release Cycle ChangesRelease Cycle Changes
Release Cycle Changes
HPCC Systems
 
Geohashing with Uber’s H3 Geospatial Index
Geohashing with Uber’s H3 Geospatial Index Geohashing with Uber’s H3 Geospatial Index
Geohashing with Uber’s H3 Geospatial Index
HPCC Systems
 
Advancements in HPCC Systems Machine Learning
Advancements in HPCC Systems Machine LearningAdvancements in HPCC Systems Machine Learning
Advancements in HPCC Systems Machine Learning
HPCC Systems
 
Docker Support
Docker Support Docker Support
Docker Support
HPCC Systems
 
Expanding HPCC Systems Deep Neural Network Capabilities
Expanding HPCC Systems Deep Neural Network CapabilitiesExpanding HPCC Systems Deep Neural Network Capabilities
Expanding HPCC Systems Deep Neural Network Capabilities
HPCC Systems
 
Leveraging Intra-Node Parallelization in HPCC Systems
Leveraging Intra-Node Parallelization in HPCC SystemsLeveraging Intra-Node Parallelization in HPCC Systems
Leveraging Intra-Node Parallelization in HPCC Systems
HPCC Systems
 
DataPatterns - Profiling in ECL Watch
DataPatterns - Profiling in ECL Watch DataPatterns - Profiling in ECL Watch
DataPatterns - Profiling in ECL Watch
HPCC Systems
 
Leveraging the Spark-HPCC Ecosystem
Leveraging the Spark-HPCC Ecosystem Leveraging the Spark-HPCC Ecosystem
Leveraging the Spark-HPCC Ecosystem
HPCC Systems
 
Work Unit Analysis Tool
Work Unit Analysis ToolWork Unit Analysis Tool
Work Unit Analysis Tool
HPCC Systems
 
Community Award Ceremony
Community Award Ceremony Community Award Ceremony
Community Award Ceremony
HPCC Systems
 
Dapper Tool - A Bundle to Make your ECL Neater
Dapper Tool - A Bundle to Make your ECL NeaterDapper Tool - A Bundle to Make your ECL Neater
Dapper Tool - A Bundle to Make your ECL Neater
HPCC Systems
 
A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...
A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...
A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...
HPCC Systems
 
Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...
Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...
Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...
HPCC Systems
 

More from HPCC Systems (20)

Natural Language to SQL Query conversion using Machine Learning Techniques on...
Natural Language to SQL Query conversion using Machine Learning Techniques on...Natural Language to SQL Query conversion using Machine Learning Techniques on...
Natural Language to SQL Query conversion using Machine Learning Techniques on...
 
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsImproving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
 
Towards Trustable AI for Complex Systems
Towards Trustable AI for Complex SystemsTowards Trustable AI for Complex Systems
Towards Trustable AI for Complex Systems
 
Welcome
WelcomeWelcome
Welcome
 
Closing / Adjourn
Closing / Adjourn Closing / Adjourn
Closing / Adjourn
 
Community Website: Virtual Ribbon Cutting
Community Website: Virtual Ribbon CuttingCommunity Website: Virtual Ribbon Cutting
Community Website: Virtual Ribbon Cutting
 
Path to 8.0
Path to 8.0 Path to 8.0
Path to 8.0
 
Release Cycle Changes
Release Cycle ChangesRelease Cycle Changes
Release Cycle Changes
 
Geohashing with Uber’s H3 Geospatial Index
Geohashing with Uber’s H3 Geospatial Index Geohashing with Uber’s H3 Geospatial Index
Geohashing with Uber’s H3 Geospatial Index
 
Advancements in HPCC Systems Machine Learning
Advancements in HPCC Systems Machine LearningAdvancements in HPCC Systems Machine Learning
Advancements in HPCC Systems Machine Learning
 
Docker Support
Docker Support Docker Support
Docker Support
 
Expanding HPCC Systems Deep Neural Network Capabilities
Expanding HPCC Systems Deep Neural Network CapabilitiesExpanding HPCC Systems Deep Neural Network Capabilities
Expanding HPCC Systems Deep Neural Network Capabilities
 
Leveraging Intra-Node Parallelization in HPCC Systems
Leveraging Intra-Node Parallelization in HPCC SystemsLeveraging Intra-Node Parallelization in HPCC Systems
Leveraging Intra-Node Parallelization in HPCC Systems
 
DataPatterns - Profiling in ECL Watch
DataPatterns - Profiling in ECL Watch DataPatterns - Profiling in ECL Watch
DataPatterns - Profiling in ECL Watch
 
Leveraging the Spark-HPCC Ecosystem
Leveraging the Spark-HPCC Ecosystem Leveraging the Spark-HPCC Ecosystem
Leveraging the Spark-HPCC Ecosystem
 
Work Unit Analysis Tool
Work Unit Analysis ToolWork Unit Analysis Tool
Work Unit Analysis Tool
 
Community Award Ceremony
Community Award Ceremony Community Award Ceremony
Community Award Ceremony
 
Dapper Tool - A Bundle to Make your ECL Neater
Dapper Tool - A Bundle to Make your ECL NeaterDapper Tool - A Bundle to Make your ECL Neater
Dapper Tool - A Bundle to Make your ECL Neater
 
A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...
A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...
A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...
 
Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...
Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...
Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...
 

Recently uploaded

Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 

Recently uploaded (20)

Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 

HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying

  • 1. Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying Mohammad Rashti – RNET Technologies 1
  • 2. Outline  Data Loading in HPCC Systems  Potential Optimizations  iNFORMER - The Big Picture of the SBIR project  iNFORMER Modules  The HPCC on AWS Case – S3 Parallel Data Access  Feasibility study and experiments  Ongoing and Future Work 2
  • 3. Data Loading in HPCC Systems Potential Optimizations 3
  • 4. Data Loading in HPCC Systems 4  In Thor, the data needs to be transferred into a single drop zone (Landing Zone)  Landing zone node sprays the data over the DFS on the Thor nodes  Imposes overhead on data loading  Landing zone and spraying can become a bottleneck  Especially for large data sets or partial loads
  • 5. Optimizations for Spraying Process  Eliminate the landing zone bottleneck  Read data records directly from their original data format in their original place  Use distributed data handling  Use multiple Thor nodes to load (and potentially decompress) the data  Each Thor node can read its own data  Load the data directly into destination files on DFS  Optimize partial data loads  Add support for in-memory processing  Skip trips to disk for iterative computing 5
  • 6. iNFORMER The Big Picture of the SBIR Project 6
  • 7. SBIR Project  This work is part of an SBIR project from the Department of Energy  Carried on in two phases  Phase I (9 months)  Feasibility study of the ideas and commercial potential  Design of the prototype  Phase II (2 years)  Prototype implementation  Commercialization of the product  Competitive renewal proposal submission  The Phase I is currently in progress (ending in Dec.)  Phase II may start in 2Q 2015 7
  • 8. iNFORMER’s General Architecture  iNFORMER is the ultimate product developed in this project  It is designed to provide:  Efficient batch and in-situ data analytics  Low-overhead native-format data access through a unified API  iNFORMER components can be integrated into HPCC Systems  Next slide shows a big-picture diagram of:  iNFORMER components (the dark boxes)  Where it can fit in HPCC systems. 8
  • 9. iNFORMER in HPCC – The Big Picture DFS Any FS Various data stores Thor Cluster iNFORMER MATE data transformation Access API iNFORMER Virtual Data Integrator ECL Program Big Picture view of iNFORMER components and potential integration with HPCC new module existing link new link Landing Zone 9
  • 10. iNFORMER MATE  MATE is meant to replace MapReduce  Using the concept of Reduction Object  Replace map and reduce with generalized reduction  Avoid intermediate shuffle/sort of key/value pairs  Main benefits of the Reduction Object  Not tied to the limitations of the key/value model  More efficient data processing with less data movement  Lower memory requirements  Performance improvement: 1.50 to 20 fold 10
  • 11. Virtual Data Integrator (VDI)  VDI offers a unified API for native-format data access  Access the data in their original storage format  Improves data access experience over various file- systems/data formats  Parallel File Systems: Thor DFS, HDFS, PVFS, GPFS, etc.  Data Formats: Plain Text, XML, NetCDF, HDF5, etc.  Eliminates the need to transform the entire data into a platform-specific format (e.g., DFS)  Especially for in-memory processing  VDI has two parts:  Data Storage Adapter (DSA)  Virtual Storage API (VSA) 11
  • 12. iNFORMER Virtual Data Integrator (VDI) with HPCC Data Storage Adapter (DSA) Modules data records configure with FS/format specifics Virtual Storage API (VSA) FS/format customized modules Fixed API HPCC Systems’ Thor Any FS (e.g., S3) HPCC DFS data copy iNOFORMER VDI and potential integration with HPCC Systems new data flow existing data flow Drop (Landing) Zone data transformation 12
  • 13. Data Storage Adapter (DSA)  A set of adapters that interface with various storage formats  Access the respective data records in their original format  Consolidates data integration process  Allows for partial data load/access  Eliminates the need for excessive data transformation  Data records on the distributed FS will be extracted and forwarded to the analysis engine and back. 13
  • 14. Some Internals of the DSA Component Internal Architecture of the DSA File Format Adapter 14
  • 15. Virtual Storage API (VSA)  VSA is a unified FS/format agnostic interface:  The data access calls are not bound to a format  The storage format specifics will guide the VSA implementation to choose the right DSA module  VSA will have well-defined hooks for easy implementation of modules for new formats  A set of pre-implemented FS/format adapters will be developed. 15
  • 16. The HPCC on AWS Case S3 Parallel Data Access 16
  • 17. Phase I - HPCC Feasibility Study 17  Efficient loading of AWS S3-resident compressed files into Thor  Data loading / spraying optimizations in this project  Distributed data load / decompress / read from S3 into the memory of Thor nodes  Distributed data Spray to DFS  Each node should load its own portion of data to avoid shuffling (communication)  Use a zip index file to facilitate this  Study the potential for in-memory processing  Avoid multiple disk accesses
  • 18. S3 to EC2 Distributed Data Load 18  Two potential plans for the proposed distributed S3 loading module  Test data: Elsevier sample data sets hosted on S3  XML-based records  Multi-entry zip files Amazon S3 Thor @EC2 S3 Parallel Load / Decompress Kafka Producer Kafka to DFS Spray Memory to DFS Spray 1 2 Landing Zone Decompress & Spray S3 Load Current approach
  • 19. Development Steps 19  Step 1: Distributed Spraying (currently implemented)  Parallel loading, decompression and spraying of XML files from an S3 archive  Each process extracts an XML file data directly from the archive  One process per node  Load balancing applied among the processes  Message Passing Interface (MPI) used for inter-node communication and shuffling  Step II: Skipping the Shuffling and the Landing Zone (next step)  Each process reads it own relevant portion of the file directly from the archive and decompresses on the fly.  Allows for partial data load from S3  May need full compression index for the archives  A separate module to be developed to do this
  • 20. Step I – Distributed Spraying 20 Extract XML file Decompress Spray (shuffle) DFS Processing on four (4) EC2-based Thor Nodes Multi-file Zip archive Amazon S3
  • 21. Step I Results 21 Comparing Sequential and Parallel Load/Spray from S3 into Thor
  • 23. Future Development Steps  Next immediate steps (SBIR Phase I)  Complete spraying into Thor DFS  Evaluation using more data sets  Fully eliminate the scattering process by partial file reading  Design the zip indexing module  Full VDI module prototype design  Future (SBIR Phase II) Plans  S3 adapter module as part of the VDI  Expansion of the adapter module to support general non-S3 cases  Full integration with HPCC to bypass the landing zone  Support for in-memory spraying and data processing 23
  • 24. 24 Contact Info Mohammad Rashti 240 W. Elmwood Dr., Dayton, OH (937) – 433 - 2886 mrashti@RNET-Tech.com