2017 DataWorks Summit, San Jose, California
Sandra Guija and Snehal Sakhare
Speed it up and Spark it up
at Intel
IT@Intel
About Sandra
• Capability Engineer on the Big Data team @ Intel IT
• Past 5 years spent extending Big Data capabilities beyond Hadoop
• Master of Science in Computer Science, CSU Sacramento
• Specialization in in-memory parallel processing and distributed systems
About Snehal
• 3+ years working with Hadoop as an Application Developer @ Intel IT, Intel Corporation
• Master of Science in Computer Science, CSU Sacramento
• Publication: Power Efficient MapReduce Workload Acceleration using Integrated-GPU
• Zumba lover
Objective
• Share our experience and key learnings from building a data ingestion framework with Spark
• How to speed up application development with a reusable ingestion framework
• How to improve job performance
Reusable Framework.
Speed it up
1. Rapid Data Ingestion
2. Variety of data sources
3. Reusable solution
4. Skill Challenges
5. Increase productivity
Reusable Framework with Spark
• Spark takes us to the next level
• Data is distributed in memory
• Lazy computation – the job is optimized before executing
• Efficient pipelining – avoids data hitting the hard disk
• Uniform data access using Spark SQL
• Spark SQL is the core of the ingestion framework
Video
Spark SQL Ingestion: Code snippets
JDBC extract:
Google Analytics extract:
SFTP extract:
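A minimal PySpark sketch of the JDBC case (the code on the slide was shown as a screenshot; the URL, driver, table, credentials, and target path below are placeholders, not the framework's actual values):

    # Minimal JDBC extract through Spark SQL (Spark 1.x/2.x style, as used in the deck).
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="jdbc_extract")
    sqlContext = SQLContext(sc)

    df = (sqlContext.read
          .format("jdbc")
          .option("url", "jdbc:mysql://db-host:3306/sales")   # placeholder URL
          .option("driver", "com.mysql.jdbc.Driver")          # placeholder driver
          .option("dbtable", "orders")                        # placeholder table
          .option("user", "etl_user")                         # placeholder credentials
          .option("password", "********")
          .load())

    # Land the extract for downstream consumers (placeholder target path).
    df.write.mode("overwrite").parquet("/staging/orders")

The idea for the Google Analytics and SFTP extracts is the same: a source-specific reader produces a Spark SQL DataFrame, which the framework then handles uniformly.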
BDRF: Seamless Spark job deployment
• Job deployment is easier than ever in local, yarn-client, or yarn-cluster mode.
spark_mode: yarn-cluster
• Dependency management in yarn-cluster mode:

# Run Spark batch job
if [ "$SPARK_MODE" = 'yarn-cluster' ]; then
  spark-submit --master ${SPARK_MODE} \
    --conf "spark.driver.extraClassPath=${SPARK_CLUSTER_EXTRACLASSPATH}" \
    --conf "spark.executor.extraClassPath=${SPARK_CLUSTER_EXTRACLASSPATH}" \
    --conf "spark.executor.memory=${EXEC_MEM}" \
    --conf "spark.yarn.dist.files=${SPARK_YARN_DIST_FILES},${INI_FILE}" \
    --py-files "${SPARK_PYTHON_FILES}" \
    ${SPARK_SCRIPT} $(basename ${INI_FILE}) ${HDENV} ${NUM_EXEC} ${GATEWAY}
fi

• Provide a framework utilization dashboard for projects.
Spark it Up
Spark @ Intel IT
Spark it up
Better resource utilization

Lessons learned:
1. JDBC single executor
2. Tune-up executor memory
3. Large datasets/numerous files
4. Custom package cluster mode

Spark it up
Better resource utilization
1. Parallelism
2. Challenge with big datasets
3. Deploy dependencies
4. Optimization
Parallelism
How do we ensure that data and processing are distributed evenly across the worker nodes?
• Increase the number of executors
• Use transformations:
  – reduceByKey(), repartition(), coalesce()
• Set properties:
  – spark.sql.shuffle.partitions
  – spark.default.parallelism
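A small sketch of these knobs in PySpark (the property values and the input path are illustrative only, not the framework's settings):

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    conf = (SparkConf()
            .set("spark.default.parallelism", "200")       # RDD-side parallelism
            .set("spark.sql.shuffle.partitions", "200"))   # Spark SQL shuffle partitions
    sc = SparkContext(conf=conf, appName="parallelism_demo")
    sqlContext = SQLContext(sc)

    df = sqlContext.read.parquet("/data/events")   # placeholder input

    df = df.repartition(200)   # redistribute data before a wide operation
    df = df.coalesce(50)       # cheaply reduce partitions after heavy filtering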
JDBC Single Executor
By default, a JDBC read ingests the data into a single partition:
df = sqlContext.read.format('jdbc').options(...).load()

Solution: set the partition options
options.put("partitionColumn", "emp_no");
options.put("lowerBound", "10001");
options.put("upperBound", "499999");
options.put("numPartitions", "10");
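The same partition options expressed in PySpark, so the read is split across executors instead of a single JDBC connection (the URL and table name are placeholders; the column and bounds are the illustrative values from the slide):

    jdbc_url = "jdbc:mysql://db-host:3306/employees"   # placeholder

    df = (sqlContext.read
          .format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", "employees")               # placeholder table
          .option("partitionColumn", "emp_no")
          .option("lowerBound", "10001")
          .option("upperBound", "499999")
          .option("numPartitions", "10")
          .load())

    print(df.rdd.getNumPartitions())   # ~10 partitions, one JDBC query per partition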
Data Skewed
Solution:
• Tune the number of partitions:
  options.put("numPartitions", "10");  →  options.put("numPartitions", "1000");
• Tune the number of executors: 1 → 10
Performance Comparison: num-executors & partitions
Ingest 4,587,847 records

# executors | # cores per executor | memory per executor | # partitions | Processing time | Performance
1           | 3                    | 4 GB                 | 1            | 18m 47s         | 1.0x
5           | 3                    | 4 GB                 | 1000         | 9m 15s          | 2.0x
10          | 3                    | 4 GB                 | 1000         | 8m 52s          | 2.1x
10          | 3                    | 4 GB                 | 500          | 4m 44s          | 3.9x
Performance Comparison: filtering the number of columns
Ingest 11,513,057 records with 121 columns

# executors | # cores per executor | memory per executor | # partitions | # columns | Processing time | Performance
10          | 3                    | 4 GB                 | 500          | 121       | 14m 23s         | 1.0x
10          | 3                    | 4 GB                 | 500          | 90        | 9m 44s          | ~1.5x
10          | 3                    | 4 GB                 | 500          | 60        | 6m 12s          | ~2.3x
10          | 3                    | 4 GB                 | 500          | 30        | 4m 29s          | ~3.2x
Challenge with Big datasets
• Failures when running queries on large inputs:
  – > 2.3 TB
  – > 6 billion records

WARN servlet.ServletHandler: Error for /jobs/job/
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2367)
ERROR server.TransportRequestHandler: Error sending result RpcResponse{requestId=8891220538372697062,
body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=166 cap=166]}} to node019:40175; closing connection
org.apache.spark.SparkException: Error sending message
Caused by: java.nio.channels.ClosedChannelException
ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM   <-- container is killed
Challenge with Big datasets
Solution: use a formula to set executor memory and cores

memory available to each task =
    (spark.executor.memory * shuffle.memFraction * shuffle.safetyFraction) / spark.executor.cores

Example 1: (8 * 1024 MB * 0.2 * 0.8) / 6 cores ≈ 218 MB per task
Example 2: (6 * 1024 MB * 0.2 * 0.8) / 4 cores ≈ 245 MB per task
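A throwaway helper for the same arithmetic (the 0.2 and 0.8 values match the legacy Spark 1.x defaults for spark.shuffle.memoryFraction and spark.shuffle.safetyFraction):

    def memory_per_task_mb(executor_memory_gb, executor_cores,
                           mem_fraction=0.2, safety_fraction=0.8):
        """Rough estimate of the shuffle memory available to each task, in MB."""
        return executor_memory_gb * 1024 * mem_fraction * safety_fraction / executor_cores

    print(memory_per_task_mb(8, 6))   # ~218 MB
    print(memory_per_task_mb(6, 4))   # ~245 MB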
Deploy dependencies on cluster
• The Spark Python API is a non-JVM language:
  – no platform independence
  – no JAR files or simple code packaging
• Managing and deploying dependencies on a cluster can be a pain:
  – executors need access to third-party or custom libraries
  – Python must be set up on each node
• How to deploy packages?
  – create a conda environment
  – install the libraries
  – zip the conda environment and load it into HDFS
  – set the Python environment variables and run the PySpark job

Solution:
Deploy dependencies on cluster
1. Create a conda environments directory
   conda config --add envs_dirs /path_to_conda_dir/
2. Create the conda env (env-name=py35)
   conda create -n py35 --copy -y -q python=3.5
3. Install your favorite Python packages
   conda install -c conda-forge fuzzywuzzy -n py35
4. Zip the conda environment and ship it
   zip -r py35.zip py35
   ln -sf "/path_to_conda_dir/py35.zip" "PY35"
   hdfs dfs -put py35.zip /my_hdfs_path/.
5. Run the command to launch the PySpark job in cluster mode
   PYSPARK_DRIVER_PYTHON=./PY35/py35/bin/python spark-submit \
     --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON="./PY35/py35/bin/python" \
     --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON="./PY35/py35/bin/python" \
     --master yarn-cluster \
     --archives hdfs://path_to_conda_dir/py35.zip#PY35 my_conda_test.py
Performance Comparison: deploy dependencies
– Custom libraries: fuzzywuzzy

BroadcastJoin | Processing time | Performance
Local-mode    | 40m 16s         | 1.0x
Cluster-mode  | 6m 23s          | 6.3x
Optimization
• spark.sql.autoBroadcastJoinThreshold
  – Enabled  → SET spark.sql.autoBroadcastJoinThreshold=20000000
  – Disabled → SET spark.sql.autoBroadcastJoinThreshold=-1
• explain()

Enabled:
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=20000000")
joined.explain()

Disabled:
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
joined.explain()
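A sketch of toggling the threshold around a join and inspecting the physical plan (table names and the join key are placeholders; assumes an existing sqlContext as in the earlier sketches; the threshold is in bytes):

    big = sqlContext.table("warehouse.big_fact")      # placeholder: multi-TB fact table
    small = sqlContext.table("warehouse.small_dim")   # placeholder: a few hundred KB

    sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=20000000")
    joined = big.join(small, "dim_key")               # placeholder join key
    joined.explain()   # plan should show the small side being broadcast

    sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
    big.join(small, "dim_key").explain()   # falls back to a shuffle join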
Performance Comparison: set autoBroadcastJoinThreshold
• Big table:   size 2.3 TB, > 6 billion records
• Small table: size 324 KB, 1,707 records

BroadcastJoin  | Processing time | Performance
Default        | 14m 56s         | 1.0x
set=20000000   | 9m 19s          | 1.6x
Optimization
Use DataFrame caching
• Cache DataFrames that are repeatedly loaded, transformed, or joined
• Avoids re-processing time
• df.cache()
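A minimal caching sketch (the table name is a placeholder; assumes an existing sqlContext):

    df = sqlContext.table("staging.events")   # placeholder source

    df.cache()    # mark the DataFrame for in-memory caching
    df.count()    # first action materializes the cache

    daily = df.groupBy("event_date").count()   # reuses the cached data
    by_user = df.groupBy("user_id").count()    # reuses the cached data

    df.unpersist()   # release the memory when the DataFrame is no longer needed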
... In summary
Run your Spark job → Check logs!!! → Find bottlenecks → Spark it up → (repeat)
Legal Disclaimer
THE INFORMATION PROVIDED IN THIS PAPER IS INTENDED TO BE GENERAL IN NATURE AND IS
NOT SPECIFIC GUIDANCE. RECOMMENDATIONS (INCLUDING POTENTIAL COST SAVINGS) ARE
BASED UPON INTEL’S EXPERIENCE AND ARE ESTIMATES ONLY. INTEL DOES NOT GUARANTEE
OR WARRANT OTHERS WILL OBTAIN SIMILAR RESULTS.
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS AND
SERVICES. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY
INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED
IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO
LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY,
RELATING TO SALE AND/OR USE OF INTEL PRODUCTS AND SERVICES INCLUDING LIABILITY
OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY,
OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Intel, the Intel logo, Intel. Experience What’s Inside, the Intel. Experience What’s Inside logo, and Xeon
are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2017 Intel Corporation. All rights reserved.
Q & A