The Nitty Gritty of Advanced
Analytics
Using Apache Spark in Python
Miklos Christine
Solutions Architect
mwc@databricks.com, @Miklos_C
About Me
Miklos Christine
Solutions Architect @ Databricks
- mwc@databricks.com
- @Miklos_C on Twitter
Systems Engineer @ Cloudera
Supported a few of the largest clusters in the world
Software Engineer @ Cisco
UC Berkeley Graduate
We are Databricks, the company behind Spark
Founded by the creators of
Apache Spark in 2013
75% share of Spark code contributed by Databricks in 2014
Created Databricks on top of Spark to make big data simple.
Apache Spark Engine
Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, and R APIs
Standard libraries
2010: research paper
2012: started @ Berkeley
2013: Databricks started & donated to ASF
2014: Spark 1.0 & libraries (SQL, ML, GraphX)
2015: DataFrames, Tungsten, ML Pipelines
2016: Spark 2.0
Spark Community Growth
• Spark Survey 2015 Highlights
• End of Year Spark Highlights
2015: A Great Year for Spark
Most active open source project in (big) data
• 1000+ code contributors
New language: R
Widespread industry support & adoption
HOW RESPONDENTS ARE RUNNING SPARK
51% on a public cloud
TOP ROLES USING SPARK
41% of respondents identify themselves as Data Engineers
22% of respondents identify themselves as Data Scientists
Spark User Highlights
NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO
Source: Slide 5 of Spark Community Update
Large-Scale Usage
Largest cluster: 8,000 nodes (Tencent)
Largest single job: 1 PB (Alibaba, Databricks)
Top streaming intake: 1 TB/hour (HHMI Janelia Farm)
2014 on-disk sort record: fastest open source engine for sorting a PB
Spark API Performance
History of Spark APIs
RDD (2011)
• Distributed collection of JVM objects
• Functional operators (map, filter, etc.)
DataFrame (2013)
• Distributed collection of Row objects
• Expression-based operations and UDFs
• Logical plans and optimizer
• Fast/efficient internal representations
Dataset (2015)
• Internally rows, externally JVM objects
• Almost the “best of both worlds”: type safe + fast
• But slower than DataFrames, and not as good for interactive analysis, especially in Python
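To make the contrast concrete, here is a minimal sketch (rdd, df, and the column names are hypothetical): the RDD version hides its logic inside opaque Python lambdas, while the DataFrame version expresses the same filter and projection as operations the optimizer can inspect.

# RDD API: functional operators over Python tuples of (name, age)
adults = rdd.filter(lambda row: row[1] > 21).map(lambda row: row[0])

# DataFrame API: expression-based operations backed by a logical plan
adults_df = df.filter(df.age > 21).select(df.name)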
Benefit of Logical Plan:
Performance Parity Across Languages
[Chart: DataFrame vs. RDD runtime across languages]
ETL with Spark
ETL: Extract, Transform, Load
● Key factor for big data platforms
● Well-designed ETL (formats, compression, partitioning) speeds up all downstream workloads
● Typically Executed by Data Engineers
File Formats
● Text File Formats
○ CSV
○ JSON
● Avro Row Format
● Parquet Columnar Format
File Formats + Compression
● File Formats
○ JSON
○ CSV
○ Avro
○ Parquet
● Compression Codecs
○ No compression
○ Snappy
○ Gzip
○ LZO
Spark Parquet Properties
● Industry Standard File Format: Parquet
○ Write to Parquet:
df.write.format("parquet").save("namesAndAges.parquet")
df.write.format("parquet").saveAsTable("myTestTable")
○ For compression, set the spark.sql.parquet.compression.codec property (gzip or snappy)
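A minimal sketch of setting the codec from PySpark (Spark 1.x SQLContext; the output path is hypothetical):

# Use Snappy for subsequent Parquet writes ("gzip" is the other common choice).
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
df.write.format("parquet").save("/tmp/names_snappy.parquet")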
Small Files Problem
● The small files problem still exists in Spark
● Metadata loading overhead grows with the number of files
● APIs to control file counts:
df.coalesce(N)
df.repartition(N)
Ref: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
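A minimal sketch of the two calls, assuming df currently holds many small partitions:

df.rdd.getNumPartitions()    # e.g. 2000 tiny partitions

# coalesce(N) narrows to N partitions without a full shuffle (only shrinks).
df_small = df.coalesce(16)

# repartition(N) performs a full shuffle; it can grow or shrink the count
# and rebalances skewed data at the cost of moving it.
df_even = df.repartition(16)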
All About Partitions
● RDD / DataFrame Partitions
df.rdd.getNumPartitions()
● SparkSQL Shuffle Partitions
spark.sql.shuffle.partitions
● Table Level Partitions
df.write.partitionBy("year").save("data.parquet")
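For example, the shuffle partition count can be read and set through the SQLContext (a sketch; 200 is the Spark 1.x default):

sqlContext.getConf("spark.sql.shuffle.partitions", "200")   # current value
sqlContext.setConf("spark.sql.shuffle.partitions", "400")   # more, smaller shuffle tasks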
PySpark ETL APIs - Text Formats
# CSV (via the spark-csv package)
df = (sqlContext.read
      .format('com.databricks.spark.csv')
      .options(header='true', inferSchema='true')
      .load('/path/to/data'))

# JSON
df = sqlContext.read.json("/tmp/test.json")
df.write.json("/tmp/test_output.json")
PySpark ETL APIs - Container Formats
# Binary Container Formats
# Avro (via the spark-avro package)
df = (sqlContext.read
      .format("com.databricks.spark.avro")
      .load("/path/to/files/"))

# Parquet
df = sqlContext.read.parquet("/path/to/files/")
df.write.parquet("/path/to/files/")
PySpark ETL APIs
● Manage Number of Files
○ These APIs control the number of files written per directory
(df.repartition(80)
   .write
   .parquet("/path/to/parquet/"))

(df.repartition(80)
   .write
   .partitionBy("year")
   .parquet("/path/to/parquet/"))
Common ETL Problems
● Malformed JSON Records
sqlContext.sql("SELECT _corrupt_record FROM jsonTable WHERE _corrupt_record IS NOT NULL")
● Mismatched DataFrame Schema
○ Null Representation vs Schema DataType
● Many Small Files / No Partition Strategy
○ Parquet Files: ~128MB - 256MB Compressed
Ref: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dealing_with_bad_data.html
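A minimal end-to-end sketch (the input path is hypothetical): the JSON reader keeps unparseable lines in the _corrupt_record column, so registering the DataFrame as a table makes them easy to pull out.

df = sqlContext.read.json("/data/events/")
df.registerTempTable("jsonTable")
bad = sqlContext.sql(
    "SELECT _corrupt_record FROM jsonTable WHERE _corrupt_record IS NOT NULL")
bad.show(20, False)   # print the malformed lines without truncation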
Debugging Spark
Spark Driver Error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 362.0 failed
4 times, most recent failure: Lost task 1.3 in stage 362.0 (TID 275202, ip-10-111-225-98.ec2.internal):
java.nio.channels.ClosedChannelException
Spark Executor Error:
16/04/13 20:02:16 ERROR DefaultWriterContainer: Aborting task.
java.text.ParseException: Unparseable number: "N"
at java.text.NumberFormat.parse(NumberFormat.java:385)
at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply$mcD$sp(TypeCast.scala:58)
at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply(TypeCast.scala:58)
at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply(TypeCast.scala:58)
at scala.util.Try.getOrElse(Try.scala:77)
at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:58)
SQL with Spark
SparkSQL Best Practices
● DataFrames and SparkSQL are two interfaces to the same engine and optimizer
● Use built-in functions instead of custom UDFs
○ import pyspark.sql.functions
● Examples:
○ to_date()
○ get_json_object()
○ regexp_extract()
○ hour() / minute()
Ref: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
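A minimal sketch of a few of these functions (the payload and ts columns are hypothetical):

from pyspark.sql import functions as F

df2 = df.select(
    F.get_json_object(df.payload, '$.user.id').alias('user_id'),  # field from a JSON string
    F.to_date(df.ts).alias('day'),
    F.hour(df.ts).alias('hr'))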
SparkSQL Best Practices
● Large Table Joins
○ Largest Table on LHS
○ Increase Spark Shuffle Partitions
○ Leverage the “cluster by” clause included in Spark 1.6
sqlCtx.sql("select * from large_table_1 cluster by num1") \
      .registerTempTable("sorted_large_table_1")
sqlCtx.sql("cache table sorted_large_table_1")
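Note: “cluster by” is shorthand for “distribute by” plus “sort by”: it shuffles rows by the join key and sorts them within each partition, so the cached table is already laid out the way the join wants it.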
PySpark API Best Practices
● User Defined Functions (UDFs)
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

add_n = F.udf(lambda x, y: x + y, IntegerType())

# Add a UDF-computed column, casting the id column to an integer first.
df = df.withColumn('id_offset',
                   add_n(F.lit(1000), df.id.cast(IntegerType())))
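Python UDFs like add_n ship each row between the JVM and a Python worker, so when a built-in column expression can do the same work (here, df.id.cast(IntegerType()) + F.lit(1000)), it is usually much faster.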
PySpark API Best Practices
● Built-in Functions
corpus_df = df.select(
    F.lower(F.col('body')).alias('corpus'),
    F.monotonicallyIncreasingId().alias('id'))

corpus_df = df.select(
    F.date_format(
        F.from_utc_timestamp(
            F.from_unixtime(F.col('created_utc')), "PST"),
        'EEEE').alias('dayofweek'))
Ref: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
PySpark API Best Practices
● User Defined Functions (UDFs)
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

def squared(s):
    return s * s

sqlContext.udf.register("squaredWithPython", squared, LongType())  # for SQL
squared_udf = udf(squared, LongType())                             # for the DataFrame API
display(df.select("id", squared_udf("id").alias("id_squared")))
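Once registered, the same function is also callable from SQL (table name hypothetical):

sqlContext.sql("SELECT id, squaredWithPython(id) AS id_squared FROM test_table")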
ML with Spark
Data Science Time
Why Spark ML
Provide general purpose ML algorithms on top of Spark
• Let Spark handle the distribution of data and queries; scalability
• Leverage its improvements (e.g. DataFrames, Datasets, Tungsten)
Advantages of MLlib’s Design:
• Simplicity
• Scalability
• Streamlined end-to-end
• Compatibility
High-level functionality in MLlib
Learning tasks
• Classification
• Regression
• Recommendation
• Clustering
• Frequent itemsets
Workflow utilities
• Model import/export
• Pipelines
• DataFrames
• Cross validation
Data utilities
• Feature extraction & selection
• Statistics
• Linear algebra
Machine Learning: What and Why?
ML uses data to identify patterns and make decisions.
Core value of ML is automated decision making
• Especially important when dealing with TB or PB of data
Many Use Cases including:
• Marketing and advertising optimization
• Security monitoring / fraud detection
• Operational optimizations
Algorithm coverage in MLlib
Classification
• Logistic regression w/ elastic net
• Naive Bayes
• Streaming logistic regression
• Linear SVMs
• Decision trees
• Random forests
• Gradient-boosted trees
• Multilayer perceptron
• One-vs-rest
Regression
• Least squares w/ elastic net
• Isotonic regression
• Decision trees
• Random forests
• Gradient-boosted trees
• Streaming linear methods
Recommendation
• Alternating Least Squares
Frequent itemsets
• FP-growth
• Prefix span
Clustering
• Gaussian mixture models
• K-Means
• Streaming K-Means
• Latent Dirichlet Allocation
• Power Iteration Clustering
Statistics
• Pearson correlation
• Spearman correlation
• Online summarization
• Chi-squared test
• Kernel density estimation
Linear algebra
• Local dense & sparse vectors & matrices
• Distributed matrices
• Block-partitioned matrix
• Row matrix
• Indexed row matrix
• Coordinate matrix
• Matrix decompositions
Model import/export
Pipelines
Feature extraction &
selection
• Binarizer
• Bucketizer
• Chi-Squared selection
• CountVectorizer
• Discrete cosine transform
• ElementwiseProduct
• Hashing term frequency
• Inverse document frequency
• MinMaxScaler
• Ngram
• Normalizer
• One-Hot Encoder
• PCA
• PolynomialExpansion
• RFormula
• SQLTransformer
• Standard scaler
• StopWordsRemover
• StringIndexer
• Tokenizer
• VectorAssembler
• VectorIndexer
• VectorSlicer
• Word2Vec
List based on Spark 1.5
Spark ML Best Practices
● Spark MLlib vs Spark ML
○ Understand the differences: the RDD-based pyspark.mllib API vs the DataFrame-based pyspark.ml Pipelines API
● Don’t Pipeline Too Many Stages
○ Check Results Between Stages (see the sketch below)
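A minimal sketch of checking results between stages (it reuses the hypothetical corpus_df from the earlier built-in functions slide): run each stage on its own and inspect the output before committing to the full Pipeline.

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF

tokenizer = Tokenizer(inputCol="corpus", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")

# Run the first stage by itself and eyeball its output before chaining.
tokenizer.transform(corpus_df).select("words").show(5)

pipeline = Pipeline(stages=[tokenizer, hashingTF])
model = pipeline.fit(corpus_df)
features_df = model.transform(corpus_df)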
PySpark ML API Best Practices
● DataFrame to RDD Mapping
# Assumes NLTK's word_tokenize plus predefined PUNCTUATION, STOPWORDS,
# and STEMMER objects (e.g. from nltk.corpus / nltk.stem).
from nltk.tokenize import word_tokenize

def tokenize(text):
    tokens = word_tokenize(text)
    lowercased = [t.lower() for t in tokens]
    no_punctuation = []
    for word in lowercased:
        punct_removed = ''.join([letter for letter in word if letter not in PUNCTUATION])
        no_punctuation.append(punct_removed)
    no_stopwords = [w for w in no_punctuation if w not in STOPWORDS]
    stemmed = [STEMMER.stem(w) for w in no_stopwords]
    return [w for w in stemmed if w]

rdd = wordsDataFrame.map(lambda x: (x['id'], tokenize(x['corpus'])))
Learning more about MLlib
Guides & examples
• Example workflow using ML Pipelines (Python)
• The above links are part of the Databricks Guide, which contains many more
examples and references.
References
• Apache Spark MLlib User Guide
• The MLlib User Guide contains code snippets for almost all algorithms, as well as links to API
documentation.
• Meng et al. “MLlib: Machine Learning in Apache Spark.” 2015. http://arxiv.org/abs/1505.06807 (academic paper)
Spark Demo
Thanks!
Sign Up For Databricks Community Edition!
http://go.databricks.com/databricks-community-edition-beta-waitlist