SlideShare a Scribd company logo
Frustration-Reduced
Spark
DataFrames and the Spark Time-Series Library
Ilya Ganelin
Why are we here?
 Spark for quick and easy batch ETL (no streaming)
 Actually using data frames
 Creation
 Modification
 Access
 Transformation
 Time Series analysis
 https://github.com/cloudera/spark-timeseries
 http://blog.cloudera.com/blog/2015/12/spark-ts-a-new-library-
for-analyzing-time-series-data-with-apache-spark/
Why Spark?
 Batch/micro-batch processing of large datasets
 Easy to use, easy to iterate, wealth of common
industry-standard ML algorithms
 Super fast if properly configured
 Bridges the gap between the old (SQL, single machine
analytics) and the new (declarative/functional
distributed programming)
Why not Spark?
 Breaks easily with poor usage or improperly specified
configs
 Scaling up to larger datasets 500GB -> TB scale
requires deep understanding of internal configurations,
garbage collection tuning, and Spark mechanisms
 While there are lots of ML algorithms, a lot of them
simply don’t work, don’t work at scale, or have poorly
defined interfaces / documentation
Scala
 Yes, I recommend Scala
 Python API is underdeveloped, especially for ML Lib
 Java (until Java 8) is a second class citizen as far as
convenience vs. Scala
 Spark is written in Scala – understanding Scala helps you
navigate the source
 Can leverage the spark-shell to rapidly prototype new
code and constructs
 http://www.scala-
lang.org/docu/files/ScalaByExample.pdf
Why DataFrames?
 Iterate on datasets MUCH faster
 Column access is easier
 Data inspection is easier
 groupBy, join, are faster due to under-the-hood
optimizations
 Some chunks of ML Lib now optimized to use data
frames
Why not DataFrames?
 RDD API is still much better developed
 Getting data into DataFrames can be clunky
 Transforming data inside DataFrames can be clunky
 Many of the algorithms in ML Lib still depend on RDDs
Creation
 Read in a file with an embedded header
 http://tinyurl.com/zc5jzb2
 Create a DF
 Option A – Map schema to Strings, convert to Rows
 Option B – default type (case-classes or tuples)
DataFrame Creation
 Option C – Define the schema explicitly
 Check your work with df.show()
DataFrame Creation
Column Manipulation
 Selection
 GroupBy
 Confusing! You get a GroupedData object, not an RDD or
DataFrame
 Use agg or built-ins to get back to a DataFrame.
 Can convert to RDD with dataFrame.rdd
Custom Column Functions
 Option A: Add a column with a custom function:
 http://stackoverflow.com/questions/29483498/append-a-
column-to-data-frame-in-apache-spark-1-3
 Option B: Match the Row, get explicit names (yields
RDD, not DataFrame!)
Row Manipulation
 Filter
 Range:
 Equality:
 Column functions
Joins
 Option A (inner join)
 Option B (explicit)
 Join types: “inner”, “outer”, “left_outer”, “right_outer”,
“leftsemi”
 DataFrame joins benefit from Tungsten optimizations
Null Handling
 Built in support for handling nulls in data frames.
 Drop, fill, replace
 https://spark.apache.org/docs/1.6.0/api/java/org/apache
/spark/sql/DataFrameNaFunctions.html
Spark-TS
 https://github.com/cloudera/spark-timeseries
 Uses Java 8 ZonedDateTime as of 0.2 release:
Dealing with timestamps
Why Spark TS?
 Each row of the TimeSeriesRDD is a keyed vector of
doubles (indexed by your time index)
 Easily and efficiently slice data-sets by time:
 Generate statistics on the data:
Why Spark TS?
 Feature generation
 Moving averages over time
 Outlier detection (e.g. daily activity > 2 std-dev from
moving average)
 Constant time lookups in RDD by time vs. default O(m),
where m is the partition size
What doesn’t work?
 Cannot have overlapping entries per time index, e.g. data
with identical date time (e.g. same day for DayFrequency)
 If time zones are not aligned in your data, data may not
show up in the RDD
 Limited input format: must be built from two columns => Key
(K) and a Double
 Documentation/Examples not up to date with v0.2
 0.2 version => There will be bugs
 But it’s open source! Go fix them 
How do I use it?
 Download the binary (version 0.2 with dependencies)
 http://tinyurl.com/z6oo823
 Add it as a jar dependency when launching Spark:
 spark-shell --jars sparkts-0.2.0-SNAPSHOT-jar-with-
dependencies_ilya_0.3.jar
What else?
 Save your work => Write completed datasets to file
 Work on small data first, then go to big data
 Create test data to capture edge cases
 LMGTFY
By popular demand:
screen spark-shell
--driver-memory 100g 
--num-executors 60 
--executor-cores 5 
--master yarn-client 
--conf "spark.executor.memory=20g” 
--conf "spark.io.compression.codec=lz4" 
--conf "spark.shuffle.consolidateFiles=true" 
--conf "spark.dynamicAllocation.enabled=false" 
--conf "spark.shuffle.manager=tungsten-sort" 
--conf "spark.akka.frameSize=1028" 
--conf "spark.executor.extraJavaOptions=-Xss256k -XX:MaxPermSize=128m -
XX:PermSize=96m -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=6 -
XX:+UseParNewGC -XX:+UseConcMarkSweepGC 
-XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=75 -
XX:+UseCMSInitiatingOccupancyOnly -XX:+AggressiveOpts -
XX:+UseCompressedOops"
Any Spark on YARN
 E.g. Deploy Spark 1.6 on CDH 5.4
 Download your Spark binary to the cluster and untar
 In $SPARK_HOME/conf/spark-env.sh:
 export
HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/co
nf
 This tells Spark where Hadoop is deployed, this also gives it the
link it needs to run on YARN
 export SPARK_DIST_CLASSPATH=$(/usr/bin/hadoop
classpath)
 This defines the location of the Hadoop binaries used at run
time
References
 http://spark.apache.org/docs/latest/programming-guide.html
 http://spark.apache.org/docs/latest/sql-programming-guide.html
 http://tinyurl.com/leqek2d (Working With Spark, by Ilya Ganelin)
 http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-
spark-jobs-part-1/ (by Sandy Ryza)
 http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-
spark-jobs-part-2/ (by Sandy Ryza)
 http://www.slideshare.net/ilganeli/frustrationreduced-spark-
dataframes-and-the-spark-timeseries-library

More Related Content

What's hot

Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and Dataset
Kazuaki Ishizaki
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Databricks
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
Hakka Labs
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
Databricks
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
Spark Summit
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
Russell Spitzer
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache Spark
Josef Adersberger
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
Road to Analytics
Road to AnalyticsRoad to Analytics
Road to Analytics
Datio Big Data
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
Databricks
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Legacy Typesafe (now Lightbend)
 
20140908 spark sql & catalyst
20140908 spark sql & catalyst20140908 spark sql & catalyst
20140908 spark sql & catalyst
Takuya UESHIN
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
Vincent Poncet
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Chris Fregly
 
Operational Tips for Deploying Spark
Operational Tips for Deploying SparkOperational Tips for Deploying Spark
Operational Tips for Deploying Spark
Databricks
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
Databricks
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
Databricks
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
Jen Aman
 
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallAdding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug Grall
Spark Summit
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
Dean Chen
 

What's hot (20)

Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and Dataset
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache Spark
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Road to Analytics
Road to AnalyticsRoad to Analytics
Road to Analytics
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
 
20140908 spark sql & catalyst
20140908 spark sql & catalyst20140908 spark sql & catalyst
20140908 spark sql & catalyst
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
 
Operational Tips for Deploying Spark
Operational Tips for Deploying SparkOperational Tips for Deploying Spark
Operational Tips for Deploying Spark
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
 
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallAdding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug Grall
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 

Similar to Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library

Quick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsQuick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skills
Ravindra kumar
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
datastack
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/Kudu
Chris George
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You Think
Ike Ellis
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
Databricks
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
Spark Summit
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
Databricks
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Databricks
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
Anyscale
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
Anyscale
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptx
Dori Waldman
 
Oracle Database 12c "New features"
Oracle Database 12c "New features" Oracle Database 12c "New features"
Oracle Database 12c "New features" Anar Godjaev
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
Databricks
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
C4Media
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
javier ramirez
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
Walaa Hamdy Assy
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
Richard Kuo
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
Joud Khattab
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on Databricks
Databricks
 

Similar to Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library (20)

Quick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsQuick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skills
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/Kudu
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You Think
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptx
 
Oracle Database 12c "New features"
Oracle Database 12c "New features" Oracle Database 12c "New features"
Oracle Database 12c "New features"
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on Databricks
 

Recently uploaded

CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
karthi keyan
 
space technology lecture notes on satellite
space technology lecture notes on satellitespace technology lecture notes on satellite
space technology lecture notes on satellite
ongomchris
 
ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
Vijay Dialani, PhD
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
Osamah Alsalih
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
SamSarthak3
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
Kamal Acharya
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
BrazilAccount1
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
gdsczhcet
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
bakpo1
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
fxintegritypublishin
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
AhmedHussein950959
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
manasideore6
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
TeeVichai
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
R&R Consult
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
Divya Somashekar
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
AafreenAbuthahir2
 

Recently uploaded (20)

CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
 
space technology lecture notes on satellite
space technology lecture notes on satellitespace technology lecture notes on satellite
space technology lecture notes on satellite
 
ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
 

Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library

  • 1. Frustration-Reduced Spark DataFrames and the Spark Time-Series Library Ilya Ganelin
  • 2. Why are we here?  Spark for quick and easy batch ETL (no streaming)  Actually using data frames  Creation  Modification  Access  Transformation  Time Series analysis  https://github.com/cloudera/spark-timeseries  http://blog.cloudera.com/blog/2015/12/spark-ts-a-new-library- for-analyzing-time-series-data-with-apache-spark/
  • 3. Why Spark?  Batch/micro-batch processing of large datasets  Easy to use, easy to iterate, wealth of common industry-standard ML algorithms  Super fast if properly configured  Bridges the gap between the old (SQL, single machine analytics) and the new (declarative/functional distributed programming)
  • 4. Why not Spark?  Breaks easily with poor usage or improperly specified configs  Scaling up to larger datasets 500GB -> TB scale requires deep understanding of internal configurations, garbage collection tuning, and Spark mechanisms  While there are lots of ML algorithms, a lot of them simply don’t work, don’t work at scale, or have poorly defined interfaces / documentation
  • 5. Scala  Yes, I recommend Scala  Python API is underdeveloped, especially for ML Lib  Java (until Java 8) is a second class citizen as far as convenience vs. Scala  Spark is written in Scala – understanding Scala helps you navigate the source  Can leverage the spark-shell to rapidly prototype new code and constructs  http://www.scala- lang.org/docu/files/ScalaByExample.pdf
  • 6. Why DataFrames?  Iterate on datasets MUCH faster  Column access is easier  Data inspection is easier  groupBy, join, are faster due to under-the-hood optimizations  Some chunks of ML Lib now optimized to use data frames
  • 7. Why not DataFrames?  RDD API is still much better developed  Getting data into DataFrames can be clunky  Transforming data inside DataFrames can be clunky  Many of the algorithms in ML Lib still depend on RDDs
  • 8. Creation  Read in a file with an embedded header  http://tinyurl.com/zc5jzb2
  • 9.  Create a DF  Option A – Map schema to Strings, convert to Rows  Option B – default type (case-classes or tuples) DataFrame Creation
  • 10.  Option C – Define the schema explicitly  Check your work with df.show() DataFrame Creation
  • 11. Column Manipulation  Selection  GroupBy  Confusing! You get a GroupedData object, not an RDD or DataFrame  Use agg or built-ins to get back to a DataFrame.  Can convert to RDD with dataFrame.rdd
  • 12. Custom Column Functions  Option A: Add a column with a custom function:  http://stackoverflow.com/questions/29483498/append-a- column-to-data-frame-in-apache-spark-1-3  Option B: Match the Row, get explicit names (yields RDD, not DataFrame!)
  • 13. Row Manipulation  Filter  Range:  Equality:  Column functions
  • 14. Joins  Option A (inner join)  Option B (explicit)  Join types: “inner”, “outer”, “left_outer”, “right_outer”, “leftsemi”  DataFrame joins benefit from Tungsten optimizations
  • 15. Null Handling  Built in support for handling nulls in data frames.  Drop, fill, replace  https://spark.apache.org/docs/1.6.0/api/java/org/apache /spark/sql/DataFrameNaFunctions.html
  • 16.
  • 19. Why Spark TS?  Each row of the TimeSeriesRDD is a keyed vector of doubles (indexed by your time index)  Easily and efficiently slice data-sets by time:  Generate statistics on the data:
  • 20. Why Spark TS?  Feature generation  Moving averages over time  Outlier detection (e.g. daily activity > 2 std-dev from moving average)  Constant time lookups in RDD by time vs. default O(m), where m is the partition size
  • 21. What doesn’t work?  Cannot have overlapping entries per time index, e.g. data with identical date time (e.g. same day for DayFrequency)  If time zones are not aligned in your data, data may not show up in the RDD  Limited input format: must be built from two columns => Key (K) and a Double  Documentation/Examples not up to date with v0.2  0.2 version => There will be bugs  But it’s open source! Go fix them 
  • 22. How do I use it?  Download the binary (version 0.2 with dependencies)  http://tinyurl.com/z6oo823  Add it as a jar dependency when launching Spark:  spark-shell --jars sparkts-0.2.0-SNAPSHOT-jar-with- dependencies_ilya_0.3.jar
  • 23. What else?  Save your work => Write completed datasets to file  Work on small data first, then go to big data  Create test data to capture edge cases  LMGTFY
  • 24. By popular demand: screen spark-shell --driver-memory 100g --num-executors 60 --executor-cores 5 --master yarn-client --conf "spark.executor.memory=20g” --conf "spark.io.compression.codec=lz4" --conf "spark.shuffle.consolidateFiles=true" --conf "spark.dynamicAllocation.enabled=false" --conf "spark.shuffle.manager=tungsten-sort" --conf "spark.akka.frameSize=1028" --conf "spark.executor.extraJavaOptions=-Xss256k -XX:MaxPermSize=128m - XX:PermSize=96m -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=6 - XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=75 - XX:+UseCMSInitiatingOccupancyOnly -XX:+AggressiveOpts - XX:+UseCompressedOops"
  • 25. Any Spark on YARN  E.g. Deploy Spark 1.6 on CDH 5.4  Download your Spark binary to the cluster and untar  In $SPARK_HOME/conf/spark-env.sh:  export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/co nf  This tells Spark where Hadoop is deployed, this also gives it the link it needs to run on YARN  export SPARK_DIST_CLASSPATH=$(/usr/bin/hadoop classpath)  This defines the location of the Hadoop binaries used at run time
  • 26. References  http://spark.apache.org/docs/latest/programming-guide.html  http://spark.apache.org/docs/latest/sql-programming-guide.html  http://tinyurl.com/leqek2d (Working With Spark, by Ilya Ganelin)  http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache- spark-jobs-part-1/ (by Sandy Ryza)  http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache- spark-jobs-part-2/ (by Sandy Ryza)  http://www.slideshare.net/ilganeli/frustrationreduced-spark- dataframes-and-the-spark-timeseries-library