Power Software Development with Apache Spark
Josiah Samuel
IBM Systems Development Labs – Bangalore
Agenda
§ Motivation
§ Introduction to Apache Spark
§ Spark SQL
§ Spark Internals
§ Spark ML Pipelines
Big-Data	Era
§ Collect,	Store	&	Process	information	at	scale	
§ Rise	of	Open	Source	Software
– Leverage	clusters	of	commodity	computers	to	process	the	data
§ Data	Science
– Bridges the gap between the data and the tools
– Starts with running simple queries
§ Placing a schema on the data and running SQL queries
– R, Octave, Python scikit-learn
§ Data is partitioned and spread across nodes (HDFS)
– Algorithms with wide data dependencies suffer from network delays
– The probability of node failure increases
Examples	of	Big-Data	Processing
§ Build	a	model	to	detect	credit	card	fraud	using	thousands	of	features	and	
billions	of	transactions.
§ Intelligently	recommend	millions	of	products	to	millions	of	users.
§ Estimate	financial	risk	through	simulations	of	portfolios	including	millions	of	
instruments.
§ Easily	manipulate	data	from	thousands	of	human	genomes	to	detect	genetic	
associations	with	disease.
Parallel Systems
§ Tightly coupled systems
§ Multiple processors share the same memory address space
§ Scale-up servers
§ High-Performance Computing (HPC)
§ Disadvantages:
– Limited scalability
– Expensive

Distributed Systems
§ Loosely coupled systems
§ Communicate with each other over a network
§ Scale-out servers
§ Capable of collaborating to complete a task
§ Disadvantages:
– Difficulty in developing distributed software
– Network problems
– Reliability & fault tolerance
Apache	Hadoop
§ Hadoop	emerges	as	a	leader
– Filesystem	Abstraction
– M/R	programming	model
– Linear	scalability
– Automatic	failure	recovery
– Cheaper	solution
§ Challenges
– Transformational APIs for feature engineering are missing
– Not well suited for ML modeling
§ ML algorithms make multiple passes over the same data sets, and M/R writes results to disk between passes
Apache	Spark
§ An analytics operating system
§ No need to write intermediate results to disk
§ All transformations are represented in a DAG
– A directed acyclic graph of operators
§ Results are passed directly to the next step in the pipeline
§ In-memory processing
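The DAG idea can be illustrated with a toy lazy-evaluation sketch (plain Python, not Spark's implementation; the `LazyDataset` class and its methods are made up for illustration): transformations only record operators, and an action triggers a single pass in which each record flows through the whole chain in memory.

```python
# Toy sketch (NOT Spark's implementation): transformations build a
# chain of operators lazily; nothing runs until an action is called,
# and intermediate results stay in memory instead of going to disk.

class LazyDataset:
    def __init__(self, source, ops=()):
        self.source = source          # in-memory input
        self.ops = ops                # recorded chain (DAG path) of operators

    def map(self, f):
        return LazyDataset(self.source, self.ops + (("map", f),))

    def filter(self, p):
        return LazyDataset(self.source, self.ops + (("filter", p),))

    def collect(self):                # the "action" that triggers execution
        out = []
        for x in self.source:         # each record flows through the whole
            keep = True               # operator chain; no intermediate
            for kind, f in self.ops:  # collection is written anywhere
                if kind == "map":
                    x = f(x)
                elif kind == "filter" and not f(x):
                    keep = False
                    break
            if keep:
                out.append(x)
        return out

ds = LazyDataset(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(ds.collect())  # [0, 4, 16]
```

Note that building `ds` does no work at all; only `collect()` executes the recorded operators, which mirrors how Spark defers computation until an action.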
Performance Aspect
Ecosystem Aspect
Apache Spark APIs:
• Easy-to-use APIs
• Improve productivity when operating on large datasets
• Intuitive and expressive
• RDDs, DataFrames/Datasets
v The APIs allow seamless movement between DataFrames/Datasets and RDDs
v DataFrames and Datasets are built on top of RDDs
v REPL – interactive analytics
Development Aspect
Hardware	Trends
            2010             2017             Rate of Increase
Storage     50+ MB/s (HDD)   500+ MB/s (SSD)  10X
Network     1 Gbps           10 Gbps          10X
CPU         ~3 GHz           ~3 GHz           ??
Spark	Software	Stack	Trend
§ IO	has	been	optimized
– Reduce	IO	by	pruning	input	data	that	is	not	needed
– New	shuffle	and	network	implementations	
(2014	sort	record	– Ref:	http://sortbenchmark.org/)
§ Data	formats	have	improved
– E.g.	Parquet	is	a	“dense”	columnar	format
§ CPU	increasingly	the	bottleneck;	trend	expected	to	continue
Project	Tungsten
§ Phase	1	– Foundation – Spark	1.6.1
– Memory	Management
– Code	Generation
– Cache-aware	Algorithms
§ Phase	2 - Spark	2.0.1
– Whole-stage	Codegen
– Vectorization
Project	Tungsten:	Phase	1
§ Perform explicit memory management instead of relying on Java objects
– Reduce memory footprint
– Eliminate garbage collection overheads
– Use sun.misc.Unsafe-based rows (UnsafeRow) and off-heap memory
§ Code generation for expression evaluation
– Reduce virtual function calls and interpretation overhead
§ Cache-conscious sorting
– Avoid bad memory access patterns
Ref:
https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
https://www.youtube.com/watch?v=5ajs8EIPWGI&feature=youtu.be
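The expression code-generation idea can be sketched in plain Python (Tungsten actually generates Java bytecode; the expression-tree encoding below is invented for illustration): instead of walking an expression tree for every row, emit source for the whole expression once and compile it.

```python
# Conceptual sketch of expression code generation (Python stand-in
# for Tungsten's generated Java code).

def interpret(node, row):
    """Tree-walking interpreter: one dispatch per node, per row."""
    op = node[0]
    if op == "col":
        return row[node[1]]
    if op == "lit":
        return node[1]
    if op == "+":
        return interpret(node[1], row) + interpret(node[2], row)
    if op == "*":
        return interpret(node[1], row) * interpret(node[2], row)
    raise ValueError(op)

def codegen(node):
    """Emit a single Python expression string for the whole tree."""
    op = node[0]
    if op == "col":
        return f"row[{node[1]!r}]"
    if op == "lit":
        return repr(node[1])
    return f"({codegen(node[1])} {op} {codegen(node[2])})"

# (a + 1) * b
expr = ("*", ("+", ("col", "a"), ("lit", 1)), ("col", "b"))
compiled = eval("lambda row: " + codegen(expr))  # compiled once, reused per row

row = {"a": 3, "b": 5}
assert interpret(expr, row) == compiled(row) == 20
```

The compiled function pays no per-node dispatch cost per row, which is the overhead the slide's "virtual function calls and interpretation" bullet refers to.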
Project	Tungsten:	Phase	1:	Memory	Management	
String str = “abcd”;
In Java, this 4-character string takes 48 bytes (object header, hash field, and the backing char array).
In Spark’s Tungsten binary format, the entire object fits in 80 bytes
Note: this applies only to Datasets/DataFrames, as the schema is known
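A fixed-layout binary row can be sketched with Python's `struct` module (a conceptual stand-in, NOT Spark's UnsafeRow; the schema and helper names are invented): because the schema is known, fields live at fixed offsets in raw bytes instead of being boxed objects with headers and pointers.

```python
# Conceptual sketch of a schema-aware, fixed-layout binary row.
import struct

ROW = struct.Struct("<qd8s")     # schema: (long, double, 8-byte string slot)

def pack_row(uid, score, name):
    return ROW.pack(uid, score, name.encode("utf-8").ljust(8, b"\0"))

def unpack_row(buf):
    uid, score, raw = ROW.unpack(buf)
    return uid, score, raw.rstrip(b"\0").decode("utf-8")

buf = pack_row(42, 3.5, "abcd")
print(len(buf))            # 24 bytes for the whole row, no per-field objects
print(unpack_row(buf))     # (42, 3.5, 'abcd')
```

Packing also means a row can be compared, hashed, or spilled as raw bytes, which is why eliminating Java object overhead helps both memory footprint and GC.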
Project	Tungsten:	Phase	2
(Figure: the Volcano iterator model vs. the simplicity of hand-written code)
Project	Tungsten:	Phase	2
§ Shortcomings of the Volcano iterator model
– Too many virtual function calls
– Intermediate data in memory (or L1/L2/L3 cache)
– Can’t take advantage of modern CPU features:
§ Loop unrolling
§ SIMD
§ Pipelining
§ Prefetching
§ Branch prediction
Project	Tungsten:	Phase	2
§ Fuse operators together so the generated code looks like hand-optimized code
§ Identify chains of operators (“stages”)
§ Compile each stage into a single function
§ Result: the performance of hand-written code with the functionality of a general-purpose execution engine
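The contrast can be sketched in plain Python (a conceptual stand-in; whole-stage codegen actually emits Java code inside Spark): the Volcano model chains one iterator per operator and pays a call per row per operator, while fusing the stage produces a single tight loop.

```python
# Conceptual sketch of whole-stage code generation.
data = list(range(10))

# Volcano-style: each operator is its own generator, pulled row by row.
def scan(rows):   yield from rows
def filt(rows):   yield from (r for r in rows if r % 2 == 0)
def proj(rows):   yield from (r * r for r in rows)
volcano_result = list(proj(filt(scan(data))))

# Fused stage: one generated function, one loop, no per-operator calls.
fused_src = """
def stage(rows):
    out = []
    for r in rows:              # single loop over the input
        if r % 2 == 0:          # filter inlined
            out.append(r * r)   # projection inlined
    return out
"""
ns = {}
exec(fused_src, ns)             # the "code generation" step: compile once
fused_result = ns["stage"](data)

assert volcano_result == fused_result == [0, 4, 16, 36, 64]
```

The fused loop is exactly the kind of code a compiler and CPU can optimize with unrolling, SIMD, and good branch prediction, which the iterator chain defeats.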
Challenges	with	Data	Science
§ The majority of the work lies in preprocessing the data
– Feature engineering
– Choosing the algorithms
– Converting data to vectors for ML algorithms
§ Iteration
– Scans over the input vectors until the model converges
– Results are based on experimentation
§ Putting models into production
– Evaluate their accuracy over time
– Rebuild the models periodically
The system should support flexible transformations
Repeated data access from disk should be handled efficiently
Model creation should be easy and suitable for production use
SparkML – High	level	functionality
• Built on top of DataFrames
• The RDD-based org.apache.spark.mllib.* API is deprecated
SparkML – TF-IDF
§ Term Frequency – Inverse Document Frequency
§ Used to build search engines
– The score indicates how important a word is to a document within a collection of documents
§ If a word appears frequently in a document, it’s important
§ But if a word appears in many documents (stop-words like “the”, “and”, “of”), the word is not
meaningful, so its score is lowered
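The scoring rule above can be written out in a few lines of plain Python (a minimal sketch, not Spark's `HashingTF`/`IDF` implementation; the example documents are made up):

```python
# Minimal TF-IDF sketch: tf rewards words frequent in a document;
# idf down-weights words that appear in many documents (stop-words).
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog ate the bone".split(),
    "cat and dog play".split(),
]

def tf(word, doc):
    return doc.count(word) / len(doc)

def idf(word, docs):
    n_containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n_containing)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "the" occurs in most documents -> low idf drags its score down;
# "mat" is distinctive to one document -> higher score.
assert tfidf("mat", docs[0], docs) > tfidf("the", docs[0], docs)
```

Spark's `HashingTF` replaces the explicit vocabulary with feature hashing so term frequencies fit in fixed-size vectors, but the scoring intuition is the same.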
SparkML – KMeans Clustering
§ Unsupervised learning
§ Groups items into K different clusters
§ Randomly initialize the centroid of each cluster
§ Compute the Euclidean distance between each data point and the centroids to assign it to a cluster
§ Recompute each centroid from all the data points that belong to its cluster
§ Repeat until the centroid movement is negligible
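The loop described above can be sketched in plain single-machine Python (Spark's k-means is distributed; the data points here are made up):

```python
# Plain-Python sketch of the k-means assign/update loop.
import math, random

def kmeans(points, k, iters=20, seed=0):
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)            # random initialization
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[i].append(p)
        # Update step: recompute each centroid from its members.
        for i, g in enumerate(groups):
            if g:
                centroids[i] = tuple(sum(x) / len(g) for x in zip(*g))
    return centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents = sorted(kmeans(pts, 2))
print(cents)  # one centroid near (0.33, 0.33), one near (10.33, 10.33)
```

A fixed iteration count stands in for the "centroid movement is negligible" stopping test to keep the sketch short.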
Spark	ML	Pipeline
§ DataFrame:
– uses the DataFrame from Spark SQL as the ML dataset; different columns can store text, feature
vectors, true labels and predictions
§ Transformer:
– an algorithm which can transform one DataFrame into another DataFrame
(example: an ML model is a Transformer that transforms a DF with features into a DF with predictions)
§ Estimator:
– an algorithm which can be fit on a DF to produce a Model
(example: a learning algorithm is an Estimator which trains on a DF and produces a model)
§ Pipeline:
– chains multiple Transformers and Estimators together to specify an ML workflow
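These four abstractions can be mimicked in a toy plain-Python form (NOT the Spark API; here a "DataFrame" is just a list of dicts and all class names are illustrative):

```python
# Toy sketch of the Transformer / Estimator / Pipeline abstractions.

class Transformer:
    def transform(self, df):            # DataFrame -> DataFrame
        raise NotImplementedError

class Tokenizer(Transformer):
    def transform(self, df):
        return [{**row, "words": row["text"].split()} for row in df]

class Model(Transformer):               # a fitted model is itself a Transformer
    def __init__(self, vocab):
        self.vocab = vocab
    def transform(self, df):
        return [{**row, "features": [row["words"].count(w) for w in self.vocab]}
                for row in df]

class Estimator:
    def fit(self, df):                  # DataFrame -> Model (a Transformer)
        vocab = sorted({w for row in df for w in row["words"]})
        return Model(vocab)

class Pipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, df):                  # fit Estimators, transform as we go
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(df)   # Estimator produces a Transformer
            df = stage.transform(df)
            fitted.append(stage)
        return fitted                   # the fitted stages form the "model"

df = [{"text": "spark is fast"}, {"text": "spark is easy"}]
fitted = Pipeline([Tokenizer(), Estimator()]).fit(df)
```

The key design point carried over from Spark ML is that `fit()` turns every Estimator in the chain into a Transformer, so the fitted pipeline can be applied uniformly to new data.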
Spark	ML	Pipeline
(Figure: Pipeline.fit() on a Pipeline (Estimator) chaining Tokenizer → HashingTF → KMeans; raw text is tokenized into words, hashed into feature vectors, and fit to produce a KMeansModel)
Backup
GPU	Acceleration
§ Target computation-heavy Spark applications
• Machine learning algorithms like linear regression, logistic regression, etc.
• The same lambda function is applied to a huge set of rows
§ Rising need to offload CPU work, as the CPU has become the bottleneck in Spark
§ The goal is to shorten the execution time of long-running, computation-heavy Spark applications
§ Approach
– Accelerate a Spark application by using GPUs effectively and transparently
– Minimal changes to the user’s Spark program
– No change to the existing Spark code base
GPUEnabler in	Catalyst
§ Put	GPU	kernel	launcher	and	code	generator	into	Catalyst
(Figure: architecture — the user’s Spark program uses the DataFrame/Dataset APIs; inside Catalyst, the GPU code generator and GPU kernel launcher sit alongside the logical optimizer and CPU code generator; Tungsten’s memory manager moves off-heap UnsafeRow and columnar data to and from columnar GPU device memory)
How	to	use	GPUEnabler Plugin
Available	at	 https://github.com/IBMSparkGPU/GPUEnabler
Build the package:
This installs the package to the local Maven repository.
To include this package in your Spark application, add the dependency to the application’s pom.xml file:
More information on the APIs and sample programs can be found in the GitHub repository.
API	usage	compared	to	Spark	APIs
§ map & reduce are replaced with mapExtFunc & reduceExtFunc
§ Handles which hold the mapping to the CUDA kernels are passed to them
§ The handles also specify the input and output parameter mappings
