SlideShare a Scribd company logo
Kai Rothauge, Alex Gittens, Michael W. Mahoney
An Apache Spark ⇔ MPI Interface
#Res1SAIS
Thanks to NERSC and Cray Inc. for help and support!
2#Res1SAIS
• MPI = Message Passing Interface
• A specification for the developers and users of message passing libraries
• Message-Passing Parallel Programming Model:
• cooperative operations between processes
• data moved from address space of one process to that of another
• Dominant model in high-performance computing
• Requires installation of MPI implementation on system
• Popular implementations: MPICH, Open MPI
What is MPI?
More on MPI
3#Res1SAIS
• Generally regarded as “low-level” for purposes of distributed computing
• Efficient implementations of collective operations
• Works with distributed memory, shared memory, GPUs
• Communication between MPI processes:
• via TCP/IP sockets, or
• optimized for underlying interconnects (InfiniBand, Cray Aries, Intel Omni-Path,
etc.)
• Communicator objects connect groups of MPI processors
• Con: No fault tolerance or elasticity
• Numerical linear algebra (NLA) using Spark vs. MPI
• Why do linear algebra in Spark?
Spark for data-centric workloads and scientific
analysis
Characterization of linear algebra in Spark
Customers demand Spark; want to understand
performance concerns
4#Res1SAIS
Case Study: Spark vs. MPI
5#Res1SAIS
Case Study: Spark vs. MPI
• Numerical linear algebra (NLA) using Spark vs. MPI
• Why do linear algebra in Spark?
Pros:
Faster development, easier reuse
Simple dataset abstractions (RDDs, DataFrames, DataSets)
An entire ecosystem that can be used before and after NLA computations
Spark can take advantage of available local linear algebra codes
Automatic fault-tolerance, elasticity, out-of-core support
Con:
Classical MPI-based linear algebra implementations will be faster and
more efficient
6#Res1SAIS
Case Study: Spark vs. MPI
• Numerical linear algebra (NLA) using Spark vs. MPI
• Computations performed on NERSC supercomputer Cori Phase 1, a
Cray XC40
• 2,388 compute nodes
• 128 GB RAM/node, 32 2.3GHz Haswell cores/node
• Lustre storage system, Cray Aries interconnect
A. Gittens et al. “Matrix factorizations at scale: A comparison of scientific data analytics in
Spark and C+MPI using three case studies”, 2016 IEEE International Conference on Big Data
(Big Data), pages 204–213, Dec 2016.
7#Res1SAIS
A. Gittens et al. “Matrix factorizations at scale: A comparison of scientific data analytics in
Spark and C+MPI using three case studies”, 2016 IEEE International Conference on Big Data
(Big Data), pages 204–213, Dec 2016.
Case Study: Spark vs. MPI
• Numerical linear algebra (NLA) using Spark vs. MPI
• Matrix factorizations considered include truncated Singular Value
Decomposition (SVD)
• Data sets include
• Ocean temperature data - 2.2 TB
• Atmospheric data - 16 TB
8#Res1SAIS
Case Study: Spark vs. MPI
Rank 20 SVD of
2.2TB ocean
temperature data
9#Res1SAIS
Rank 20 SVD of
16TB atmospheric
data using 48K+
cores
Case Study: Spark vs. MPI
10#Res1SAIS
Case Study: Spark vs. MPI
Lessons learned:
• With favorable data (tall and skinny) and well-adapted algorithms, linear
algebra in Spark is 2x-26x slower than MPI when I/O is included
• Spark’s overheads:
• Can be order of magnitude higher than the actual computation times
• Anti-scale
• The gaps in performance suggest it may be better to interface with
MPI-based codes from Spark
11#Res1SAIS
• Alchemist interfaces between Apache Spark and existing or custom
MPI-based libraries for large-scale linear algebra, machine learning, etc.
• Idea:
• Use Spark for regular data analysis workflow
• When computationally intensive calculations are required, call relevant
MPI-based codes from Spark using Alchemist, send results to Spark
• Combine high productivity of Spark with high performance of MPI
12#Res1SAIS
Target users:
• Scientific community: Use Spark for analysis of large scientific datasets
by calling existing MPI-based libraries where appropriate
• Machine learning practitioners and data analysts:
• Better performance of a wide range of large-scale,
computationally intensive ML and data analysis algorithms
• For instance, SVD for principal component analysis,
recommender systems, leverage scores, etc.
13#Res1SAIS
• Alchemist: Acts as bridge between Spark and MPI-based libraries
• Alchemist-Client Interface: API for user, communicates with Alchemist
via TCP/IP sockets
• Alchemist-Library Interface: Shared object, imports MPI library,
provides generic interface for Alchemist to communicate with library
Basic Framework
14#Res1SAIS
Basic workflow:
• Spark application:
• Sends distributed dataset from RDD (IndexedRowMatrix) to Alchemist
• Tells Alchemist what MPI-based code should be called
• Alchemist loads relevant MPI-based library, calls function, sends results
to Spark
Basic Framework
15#Res1SAIS
• Alchemist can also load data from file
• Alchemist needs to store distributed data in appropriate format that can
be used by MPI-based libraries:
• Candidates: ScaLAPACK, Elemental, PLAPACK
• Alchemist currently uses Elemental, support for ScaLAPACK under
development
•
Basic Framework
16#Res1SAIS
Alchemist Architecture
17#Res1SAIS
Sample API
import alchemist.{Alchemist, AlMatrix}
import alchemist.libA.QRDecomposition // libA is sample MPI lib
// other code here ...
// sc is instance of SparkContext
val ac = new Alchemist.AlchemistContext(sc, numWorkers)
ac.registerLibrary("libA", ALIlibALocation)
// maybe other code here ...
val alA = AlMatrix(A) // A is IndexedRowMatrix
// routine returns QR factors of A as AlMatrix objects
val (alQ, alR) = QRDecomposition(alA)
// send data from Alchemist to Spark once ready
val Q = alQ.toIndexedRowMatrix() // convert AlMatrix alQ to RDD
val R = alR.toIndexedRowMatrix() // convert AlMatrix alR to RDD
// maybe other code here ...
ac.stop() // release resources once no longer required
18#Res1SAIS
Example: Matrix Multiplication
• Requires expensive shuffles in Spark, which is impractical:
• Matrices/RDDs are row-partitioned
• One matrix must be converted to be column-partitioned
• Requires an all-to-all shuffle that often fails once the matrix is distributed
19#Res1SAIS
Example: Matrix Multiplication
GB/nodes Spark+Alchemist Spark
Send (s) Multiplication (s) Receive (s) Total (s) Total (s)
0.8/1 5.90±2.17 6.60±0.07 2.19±0.58 14.68±2.69 160.31±8.89
12/1 16.66±0.88 75.69±0.42 19.43±0.45 111.78±1.26 809.31±51.90
56/2 32.50±2.88 178.68±24.8 55.83±0.37 267.02±27.38 -Failed-
144/4 69.40±1.85 171.73±0.81 66.80±3.46 307.94±4.57 -Failed-
• Generated random matrices and used same number of Spark and
Alchemist nodes
• Take-away: Spark’s matrix multiply is slow even on one executor, and
unreliable once there are more
20#Res1SAIS
Example: Truncated SVD
Use Alchemist and MLlib to get rank 20
truncated SVD
• Spark: 22 nodes; Alchemist: 8 nodes
• A: m-by-10K, where m = 5M, 2.5M,
1.25M, 625K, 312.5K
• Ran jobs for at most 30 minutes
(1800 s)
Experiment Setup
21#Res1SAIS
Example: Truncated SVD
• 2.2TB (6,177,583-by-46,752) ocean
temperature data read in from HDF5 file
Experiment Setup
• Data replicated column-wise
22#Res1SAIS
Upcoming Features
• PySpark, SparkR ⇔ MPI Interface
• Interface for Python => PySpark support
• Future work: Interface for R
• More Functionality
• Support for sparse matrices
• Support for MPI-based libraries built on ScaLAPACK
• Alchemist and Containers
• Alchemist running in Docker and Kubernetes
• Will enable Alchemist on clusters and the cloud
23#Res1SAIS
Limitations and Constraints
• Two copies of data in memory, more during a relayout during
computation
• Data transfer overhead between Spark and Alchemist when data on
different nodes
• Subject to network disruptions and overload
• MPI is not fault tolerant or elastic
• Lack of MPI-based libraries for machine learning
• No equivalent to MLlib currently available, closest is MaTEx
• On Cori, need to run Alchemist and Spark on separate nodes -> more
data transfer over interconnects -> larger overheads
24#Res1SAIS
Future Work
• Apache Spark ⇔ X Interface
• Interest in connecting Spark with other libraries for distributed computing
(e.g. Cray Chapel, Apache REEF)
• Reduce communication costs
• Exploit locality
• Reduce number of messages
• Use communication protocols designed for underlying network
infrastructure
• Run as network service
• MPI computations with (basic) fault tolerance and elasticity
25#Res1SAIS
github.com/project-alchemist/
Thanks to Cray Inc., DARPA and NSF for financial support
References
● A. Gittens, K. Rothauge, M. W. Mahoney, et al., “Alchemist: Accelerating Large-Scale Data
Analysis by offloading to High-Performance Computing Libraries”, 2018, Proceedings of the
24th ACM SIGKDD International Conference, Aug 2018, to appear
● A. Gittens, K. Rothauge, M. W. Mahoney, et al., “Alchemist: An Apache Spark ⇔ MPI
Interface”, 2018, to appear in CCPE Special Issue on Cray User Group Conference 2018

More Related Content

What's hot

Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
Rahul Kumar
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Helena Edelson
 
Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...
Spark Summit
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For Scale
Helena Edelson
 
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
confluent
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
HostedbyConfluent
 
Data Integration
Data IntegrationData Integration
Data Integration
Datio Big Data
 
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...
StreamNative
 
Streaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleStreaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For Scale
Helena Edelson
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
Databricks
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Spark
datamantra
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
datamantra
 
Cooperative Data Exploration with iPython Notebook
Cooperative Data Exploration with iPython NotebookCooperative Data Exploration with iPython Notebook
Cooperative Data Exploration with iPython Notebook
DataWorks Summit/Hadoop Summit
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
datamantra
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Helena Edelson
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
Evan Chan
 
Pulsar summit asia 2021 apache pulsar with mqtt for edge computing
Pulsar summit asia 2021   apache pulsar with mqtt for edge computingPulsar summit asia 2021   apache pulsar with mqtt for edge computing
Pulsar summit asia 2021 apache pulsar with mqtt for edge computing
Timothy Spann
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache spark
Rahul Kumar
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016
Gwen (Chen) Shapira
 
MLconf 2022 NYC Event-Driven Machine Learning at Scale.pdf
MLconf 2022 NYC Event-Driven Machine Learning at Scale.pdfMLconf 2022 NYC Event-Driven Machine Learning at Scale.pdf
MLconf 2022 NYC Event-Driven Machine Learning at Scale.pdf
Timothy Spann
 

What's hot (20)

Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For Scale
 
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
 
Data Integration
Data IntegrationData Integration
Data Integration
 
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...
 
Streaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleStreaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For Scale
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Spark
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Cooperative Data Exploration with iPython Notebook
Cooperative Data Exploration with iPython NotebookCooperative Data Exploration with iPython Notebook
Cooperative Data Exploration with iPython Notebook
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
 
Pulsar summit asia 2021 apache pulsar with mqtt for edge computing
Pulsar summit asia 2021   apache pulsar with mqtt for edge computingPulsar summit asia 2021   apache pulsar with mqtt for edge computing
Pulsar summit asia 2021 apache pulsar with mqtt for edge computing
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache spark
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016
 
MLconf 2022 NYC Event-Driven Machine Learning at Scale.pdf
MLconf 2022 NYC Event-Driven Machine Learning at Scale.pdfMLconf 2022 NYC Event-Driven Machine Learning at Scale.pdf
MLconf 2022 NYC Event-Driven Machine Learning at Scale.pdf
 

Similar to Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rothauge

Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Spark Summit
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
Miklos Christine
 
Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...
Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...
Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...
Lviv Startup Club
 
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
confluent
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
Jose Quesada (hiring)
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Databricks
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
Girish Khanzode
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and Microservices
Alexis Seigneurin
 
Kafka presentation
Kafka presentationKafka presentation
Kafka presentation
Mohammed Fazuluddin
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
betalab
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
Ganesan Narayanasamy
 
BBL KAPPA Lesfurets.com
BBL KAPPA Lesfurets.comBBL KAPPA Lesfurets.com
BBL KAPPA Lesfurets.com
Cedric Vidal
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
Shravan (Sean) Pabba
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
Karan Alang
 
A Collaborative Data Science Development Workflow
A Collaborative Data Science Development WorkflowA Collaborative Data Science Development Workflow
A Collaborative Data Science Development Workflow
Databricks
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
Marco Quartulli
 
Spark streaming high level overview
Spark streaming high level overviewSpark streaming high level overview
Spark streaming high level overview
Avi Levi
 
Apache Spark
Apache SparkApache Spark
Apache Spark
masifqadri
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Ahmet Bulut
 

Similar to Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rothauge (20)

Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
 
Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...
Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...
Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...
 
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and Microservices
 
Kafka presentation
Kafka presentationKafka presentation
Kafka presentation
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
BBL KAPPA Lesfurets.com
BBL KAPPA Lesfurets.comBBL KAPPA Lesfurets.com
BBL KAPPA Lesfurets.com
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
 
A Collaborative Data Science Development Workflow
A Collaborative Data Science Development WorkflowA Collaborative Data Science Development Workflow
A Collaborative Data Science Development Workflow
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
Spark streaming high level overview
Spark streaming high level overviewSpark streaming high level overview
Spark streaming high level overview
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
u86oixdj
 
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptxData_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
AnirbanRoy608946
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
2023240532
 

Recently uploaded (20)

一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
 
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptxData_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
 

Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rothauge

  • 1. Kai Rothauge, Alex Gittens, Michael W. Mahoney An Apache Spark ⇔ MPI Interface #Res1SAIS Thanks to NERSC and Cray Inc. for help and support!
  • 2. 2#Res1SAIS • MPI = Message Passing Interface • A specification for the developers and users of message passing libraries • Message-Passing Parallel Programming Model: • cooperative operations between processes • data moved from address space of one process to that of another • Dominant model in high-performance computing • Requires installation of MPI implementation on system • Popular implementations: MPICH, Open MPI What is MPI?
  • 3. More on MPI 3#Res1SAIS • Generally regarded as “low-level” for purposes of distributed computing • Efficient implementations of collective operations • Works with distributed memory, shared memory, GPUs • Communication between MPI processes: • via TCP/IP sockets, or • optimized for underlying interconnects (InfiniBand, Cray Aries, Intel Omni-Path, etc.) • Communicator objects connect groups of MPI processors • Con: No fault tolerance or elasticity
  • 4. • Numerical linear algebra (NLA) using Spark vs. MPI • Why do linear algebra in Spark? Spark for data-centric workloads and scientific analysis Characterization of linear algebra in Spark Customers demand Spark; want to understand performance concerns 4#Res1SAIS Case Study: Spark vs. MPI
  • 5. 5#Res1SAIS Case Study: Spark vs. MPI • Numerical linear algebra (NLA) using Spark vs. MPI • Why do linear algebra in Spark? Pros: Faster development, easier reuse Simple dataset abstractions (RDDs, DataFrames, DataSets) An entire ecosystem that can be used before and after NLA computations Spark can take advantage of available local linear algebra codes Automatic fault-tolerance, elasticity, out-of-core support Con: Classical MPI-based linear algebra implementations will be faster and more efficient
  • 6. 6#Res1SAIS Case Study: Spark vs. MPI • Numerical linear algebra (NLA) using Spark vs. MPI • Computations performed on NERSC supercomputer Cori Phase 1, a Cray XC40 • 2,388 compute nodes • 128 GB RAM/node, 32 2.3GHz Haswell cores/node • Lustre storage system, Cray Aries interconnect A. Gittens et al. “Matrix factorizations at scale: A comparison of scientific data analytics in Spark and C+MPI using three case studies”, 2016 IEEE International Conference on Big Data (Big Data), pages 204–213, Dec 2016.
  • 7. 7#Res1SAIS A. Gittens et al. “Matrix factorizations at scale: A comparison of scientific data analytics in Spark and C+MPI using three case studies”, 2016 IEEE International Conference on Big Data (Big Data), pages 204–213, Dec 2016. Case Study: Spark vs. MPI • Numerical linear algebra (NLA) using Spark vs. MPI • Matrix factorizations considered include truncated Singular Value Decomposition (SVD) • Data sets include • Ocean temperature data - 2.2 TB • Atmospheric data - 16 TB
  • 8. 8#Res1SAIS Case Study: Spark vs. MPI Rank 20 SVD of 2.2TB ocean temperature data
  • 9. 9#Res1SAIS Rank 20 SVD of 16TB atmospheric data using 48K+ cores Case Study: Spark vs. MPI
  • 10. 10#Res1SAIS Case Study: Spark vs. MPI Lessons learned: • With favorable data (tall and skinny) and well-adapted algorithms, linear algebra in Spark is 2x-26x slower than MPI when I/O is included • Spark’s overheads: • Can be order of magnitude higher than the actual computation times • Anti-scale • The gaps in performance suggest it may be better to interface with MPI-based codes from Spark
  • 11. 11#Res1SAIS • Alchemist interfaces between Apache Spark and existing or custom MPI-based libraries for large-scale linear algebra, machine learning, etc. • Idea: • Use Spark for regular data analysis workflow • When computationally intensive calculations are required, call relevant MPI-based codes from Spark using Alchemist, send results to Spark • Combine high productivity of Spark with high performance of MPI
  • 12. 12#Res1SAIS Target users: • Scientific community: Use Spark for analysis of large scientific datasets by calling existing MPI-based libraries where appropriate • Machine learning practitioners and data analysts: • Better performance of a wide range of large-scale, computationally intensive ML and data analysis algorithms • For instance, SVD for principal component analysis, recommender systems, leverage scores, etc.
  • 13. 13#Res1SAIS • Alchemist: Acts as bridge between Spark and MPI-based libraries • Alchemist-Client Interface: API for user, communicates with Alchemist via TCP/IP sockets • Alchemist-Library Interface: Shared object, imports MPI library, provides generic interface for Alchemist to communicate with library Basic Framework
  • 14. 14#Res1SAIS Basic workflow: • Spark application: • Sends distributed dataset from RDD (IndexedRowMatrix) to Alchemist • Tells Alchemist what MPI-based code should be called • Alchemist loads relevant MPI-based library, calls function, sends results to Spark Basic Framework
  • 15. 15#Res1SAIS • Alchemist can also load data from file • Alchemist needs to store distributed data in appropriate format that can be used by MPI-based libraries: • Candidates: ScaLAPACK, Elemental, PLAPACK • Alchemist currently uses Elemental, support for ScaLAPACK under development • Basic Framework
  • 17. 17#Res1SAIS Sample API import alchemist.{Alchemist, AlMatrix} import alchemist.libA.QRDecomposition // libA is sample MPI lib // other code here ... // sc is instance of SparkContext val ac = new Alchemist.AlchemistContext(sc, numWorkers) ac.registerLibrary("libA", ALIlibALocation) // maybe other code here ... val alA = AlMatrix(A) // A is IndexedRowMatrix // routine returns QR factors of A as AlMatrix objects val (alQ, alR) = QRDecomposition(alA) // send data from Alchemist to Spark once ready val Q = alQ.toIndexedRowMatrix() // convert AlMatrix alQ to RDD val R = alR.toIndexedRowMatrix() // convert AlMatrix alR to RDD // maybe other code here ... ac.stop() // release resources once no longer required
  • 18. 18#Res1SAIS Example: Matrix Multiplication • Requires expensive shuffles in Spark, which is impractical: • Matrices/RDDs are row-partitioned • One matrix must be converted to be column-partitioned • Requires an all-to-all shuffle that often fails once the matrix is distributed
  • 19. 19#Res1SAIS Example: Matrix Multiplication GB/nodes Spark+Alchemist Spark Send (s) Multiplication (s) Receive (s) Total (s) Total (s) 0.8/1 5.90±2.17 6.60±0.07 2.19±0.58 14.68±2.69 160.31±8.89 12/1 16.66±0.88 75.69±0.42 19.43±0.45 111.78±1.26 809.31±51.90 56/2 32.50±2.88 178.68±24.8 55.83±0.37 267.02±27.38 -Failed- 144/4 69.40±1.85 171.73±0.81 66.80±3.46 307.94±4.57 -Failed- • Generated random matrices and used same number of Spark and Alchemist nodes • Take-away: Spark’s matrix multiply is slow even on one executor, and unreliable once there are more
  • 20. 20#Res1SAIS Example: Truncated SVD Use Alchemist and MLlib to get rank 20 truncated SVD • Spark: 22 nodes; Alchemist: 8 nodes • A: m-by-10K, where m = 5M, 2.5M, 1.25M, 625K, 312.5K • Ran jobs for at most 30 minutes (1800 s) Experiment Setup
  • 21. 21#Res1SAIS Example: Truncated SVD • 2.2TB (6,177,583-by-46,752) ocean temperature data read in from HDF5 file Experiment Setup • Data replicated column-wise
  • 22. 22#Res1SAIS Upcoming Features • PySpark, SparkR ⇔ MPI Interface • Interface for Python => PySpark support • Future work: Interface for R • More Functionality • Support for sparse matrices • Support for MPI-based libraries built on ScaLAPACK • Alchemist and Containers • Alchemist running in Docker and Kubernetes • Will enable Alchemist on clusters and the cloud
  • 23. 23#Res1SAIS Limitations and Constraints • Two copies of data in memory, more during a relayout during computation • Data transfer overhead between Spark and Alchemist when data on different nodes • Subject to network disruptions and overload • MPI is not fault tolerant or elastic • Lack of MPI-based libraries for machine learning • No equivalent to MLlib currently available, closest is MaTEx • On Cori, need to run Alchemist and Spark on separate nodes -> more data transfer over interconnects -> larger overheads
  • 24. 24#Res1SAIS Future Work • Apache Spark ⇔ X Interface • Interest in connecting Spark with other libraries for distributed computing (e.g. Cray Chapel, Apache REEF) • Reduce communication costs • Exploit locality • Reduce number of messages • Use communication protocols designed for underlying network infrastructure • Run as network service • MPI computations with (basic) fault tolerance and elasticity
  • 25. 25#Res1SAIS github.com/project-alchemist/ Thanks to Cray Inc., DARPA and NSF for financial support References ● A. Gittens, K. Rothauge, M. W. Mahoney, et al., “Alchemist: Accelerating Large-Scale Data Analysis by offloading to High-Performance Computing Libraries”, 2018, Proceedings of the 24th ACM SIGKDD International Conference, Aug 2018, to appear ● A. Gittens, K. Rothauge, M. W. Mahoney, et al., “Alchemist: An Apache Spark ⇔ MPI Interface”, 2018, to appear in CCPE Special Issue on Cray User Group Conference 2018