ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics

ETL to ML: Use Apache
Spark as an end to end tool
for Advanced Analytics
Miklos Christine
Solutions Architect
mwc@databricks.com, @Miklos_C

About Me: Miklos Christine
Solutions Architect @ Databricks
- Assist customers architect big data platforms
- Help customers understand big data best practices
Previously:
- Systems Engineer @ Cloudera
- Supported customers running a few of the largest clusters in the
world
- Software Engineer @ Cisco

We are Databricks, the company behind Spark
Founded by the creators of
Apache Spark in 2013
Share of Spark code
contributed by Databricks
in 2014
75%
3
Data Value
Created Databricks on top of Spark to make big data simple.

…
Apache Spark Engine
Spark Core
Spark
Streaming
Spark SQL MLlib GraphX
Unified engine across diverse workloads &
environments
Scale out, fault tolerant
Python, Java, Scala, and R
APIs
Standard libraries

History of Spark APIs
RDD
(2011)
DataFrame
(2013)
Distribute collection
of JVM objects
Functional Operators (map,
filter, etc.)
Distribute collection
of Row objects
Expression-based
operations and UDFs
Logical plans and optimizer
Fast/efficient internal
representations
DataSet
(2015)
Internally rows, externally
JVM objects
Almost the “Best of both
worlds”: type safe + fast
But slower than DF
Not as good for interactive
analysis, especially Python

Apache Spark 2.0 API
DataSet
(2016)
• DataFrame = Dataset[Row]
• Convenient for interactive analysis
• Faster
DataFrame
DataSet
Untyped API
Typed API
• Optimized for data engineering
• Fast

Benefit of Logical Plan:
Performance Parity Across Languages
DataFram
e
RDD

Spark Community Growth
• Spark Survey 2015
Highlights
• End of Year Spark Highlights

Spark adoption is
growing rapidly
Spark use is growing
beyond Hadoop
Spark is increasing
access to big data
Spark Survey Report 2015 Highlights
TOP 3 APACHE SPARK TAKEAWAYS

HOW RESPONDENTS ARE RUNNING SPARK
51%
on a public cloud
TOP ROLES USING SPARK
of respondents identify
themselves as Data Engineers
41%
of respondents identify
themselves as Data Scientists
22%

NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN
FRANCISCO
Source: Slide 5 of Spark Community Update

Large-Scale Usage
Largest cluster:
8000 Nodes (Tencent)
Largest single job:
1 PB (Alibaba, Databricks)
Top Streaming
Intake:
1 TB/hour (HHMI
Janelia Farm)
2014 On-Disk Sort Record
Fastest Open Source
Engine for sorting a PB

Source: How Spark is Making an Impact at Goldman Sachs
• Started with Hadoop, RDBMS, Hive, HBASE, PIG, Java MR, etc.
• Challenges: Java, Debugging PIG, Code-Compile-Deploy-Debug
• Solution: Spark
• Language Support: Scala, Java, Python, R
• In-Memory: Faster than other solutions
• SQL, Stream Processing, ML, Graph
• Both Batch and stream processing

• Scale:
• > 1000 nodes (20,000 cores, 100TB RAM)
• Daily Jobs: 2000-3000
• Supports: Ads, Search, Map, Commerce, etc.
• Cool project: Enabling Interactive Queries with Spark and
Tachyon
• >50X acceleration of Big Data Analytics workloads

ETL: Extract, Transform, Load
● Key factor for big data platforms
● Provides Speed Improvements in All Workloads
● Typically Executed by Data Engineers

File Formats
● Text File Formats
○ CSV
○ JSON
● Avro Row Format
● Parquet Columnar Format

File Formats + Compression
● File Formats
○ JSON
○ CSV
○ Avro
○ Parquet
● Compression Codecs
○ No compression
○ Snappy
○ Gzip
○ LZO

● Industry Standard File Format: Parquet
○ Write to Parquet:
df.write.format(“parquet”).save(“namesAndAges.parquet”)
df.write.format(“parquet”).saveAsTable(“myTestTable”)
○ For compression:
spark.sql.parquet.compression.codec = (gzip, snappy)
Spark Parquet Properties

Small Files Problem
● Small files problem still exists
● Metadata loading
● APIs:
df.coalesce(N)
df.repartition(N)
Ref:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame

● RDD / DataFrame Partitions
df.rdd.getNumPartitions()
● SparkSQL Shuffle Partitions
spark.sql.shuffle.partitions
● Table Level Partitions
df.write.partitionBy(“year”).
save(“data.parquet”)
All About Partitions

Common ETL Problems
● Malformed JSON Records
sqlContext.sql("SELECT _corrupt_record FROM jsonTable WHERE
_corrupt_record IS NOT NULL")
● Mismatched DataFrame Schema
○ Null Representation vs Schema DataType
● Many Small Files / No Partition Strategy
○ Parquet Files: ~128MB - 256MB Compressed
Ref: https://databricks.gitbooks.io/databricks-spark-knowledge-
base/content/best_practices/dealing_with_bad_data.html

Debugging Spark
Spark Driver Error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 362.0 failed
4 times, most recent failure: Lost task 1.3 in stage 362.0 (TID 275202, ip-10-111-225-98.ec2.internal):
java.nio.channels.ClosedChannelException
Spark Executor Error:
16/04/13 20:02:16 ERROR DefaultWriterContainer: Aborting task.
java.text.ParseException: Unparseable number: "N"
at java.text.NumberFormat.parse(NumberFormat.java:385)
at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply$mcD$sp(TypeCast.scala:58)
at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply(TypeCast.scala:58)
at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply(TypeCast.scala:58)
at scala.util.Try.getOrElse(Try.scala:77)
at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:58)

SparkSQL Best Practices
● DataFrames and SparkSQL are synonyms
● Use builtin functions instead of custom UDFs
○ import pyspark.sql.functions
○ import org.apache.spark.sql.functions
● Examples:
○ to_date()
○ get_json_object()
○ regexp_extract()
○ hour() / minute()
Ref:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

SparkSQL Best Practices
● Large Table Joins
○ Largest Table on LHS
○ Increase Spark Shuffle Partitions
○ Leverage “cluster by” API included in Spark 1.6
sqlCtx.sql("select * from large_table_1 cluster by num1")
.registerTempTable("sorted_large_table_1");
sqlCtx.sql(“cache table sorted_large_table_1”);

Machine Learning: What and Why?
ML uses data to identify patterns and make decisions.
Core value of ML is automated decision making
• Especially important when dealing with TB or PB of data
Many Use Cases including:
• Marketing and advertising optimization
• Security monitoring / fraud detection
• Operational optimizations

Why Spark MLlib
Provide general purpose ML algorithms on top of Spark
• Let Spark handle the distribution of data and queries; scalability
• Leverage its improvements (e.g. DataFrames, Datasets, Tungsten)
Advantages of MLlib’s Design:
• Simplicity
• Scalability
• Streamlined end-to-end
• Compatibility

Spark ML Best Practices
● Spark MLLib vs SparkML
○ Understand the differences
● Don’t Pipeline Too Many Stages
○ Check Results Between Stages

Source: Toyota Customer 360 Insights on Apache Spark and MLlib
• Performance
• Original batch job: 160 hours
• Same Job re-written using Apache Spark: 4 hours
• Categorize
• Prioritize incoming social media in real-time using Spark MLlib (differentiate
campaign, feedback, product feedback, and noise)
• ML life cycle: Extract features and train:
• V1: 56% Accuracy -> V9: 82% Accuracy
• Remove False Positives and Semantic Analysis (distance similarity between concepts)

Thanks!
Sign Up For Databricks Community Edition!
http://go.databricks.com/databricks-community-
edition-beta-waitlist

ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics

Similar to ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics (20)

Recently uploaded

Recently uploaded (20)

ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics