Introduction to Big Data Analytics using Apache Spark on HDInsight on Azure (PaaS) and/or HDP on Azure (IaaS)
This workshop will provide an introduction to Big Data Analytics using Apache Spark on HDInsight on Azure (PaaS) and/or an HDP deployment on Azure (IaaS). There will be a short lecture that includes an introduction to Spark and its components.
Spark is a unified framework for big data analytics. Spark provides one integrated API for use by developers, data scientists, and analysts to perform diverse tasks that would have previously required separate processing engines such as batch analytics, stream processing and statistical modeling. Spark supports a wide range of popular languages including Python, R, Scala, SQL, and Java. Spark can read from diverse data sources and scale to thousands of nodes.
The lecture will be followed by a demo. There will be a short lecture on Hadoop and how Spark and Hadoop interact and complement each other. You will learn how to move data into HDFS using Spark APIs, create Hive tables, explore the data with Spark and SQL, transform the data, and then issue SQL queries. We will be using Scala and/or PySpark for the labs.
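A minimal PySpark sketch of the lab flow described above, in the Spark 1.6-era API the workshop targets; the HDFS path, table name, and column names are hypothetical placeholders.

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="intro-lab")
sqlContext = HiveContext(sc)  # Hive-aware context, as used in the labs

# Move raw JSON data from HDFS into a DataFrame (schema is inferred)
raw = sqlContext.read.json("hdfs:///tmp/events/raw")

# Persist it as a Hive table so it is queryable outside this session
raw.write.mode("overwrite").saveAsTable("events")

# Explore and transform with SQL
top = sqlContext.sql("""
    SELECT user_id, COUNT(*) AS n
    FROM events
    GROUP BY user_id
    ORDER BY n DESC
    LIMIT 10
""")
top.show()
```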
Workshop - How to Build a Recommendation Engine using Spark 1.6 and HDP
a) Hands-on - Build a data analytics application using Spark, Hortonworks, and Zeppelin. The session explains RDD concepts, DataFrames, and sqlContext, shows how to use Spark SQL to work with DataFrames, and explores the graphing abilities of Zeppelin.
b) Follow along - Build a recommendation engine. This shows how to build a predictive analytics (MLlib) recommendation engine with scoring, giving a better understanding of architecture and coding in Spark for ML, as sketched below.
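A hedged sketch of the MLlib collaborative-filtering approach such a follow-along typically uses (ALS from the Spark 1.6 MLlib API); the input path, field layout, and hyperparameters are assumptions for illustration.

```python
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="reco-engine")

# Each input line is assumed to be "userId,productId,rating"
lines = sc.textFile("hdfs:///tmp/ratings.csv")
ratings = lines.map(lambda l: l.split(",")) \
               .map(lambda p: Rating(int(p[0]), int(p[1]), float(p[2])))

# Train an ALS model (rank and iteration count are illustrative)
model = ALS.train(ratings, rank=10, iterations=10)

# Score: top-5 product recommendations for user 42
for r in model.recommendProducts(42, 5):
    print(r.user, r.product, r.rating)
```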
Transitioning Compute Models: Hadoop MapReduce to Spark (Slim Baltagi)
This presentation is an analysis of the observed trends in the transition from the Hadoop ecosystem to the Spark ecosystem. The related talk took place at the Chicago Hadoop User Group (CHUG) meetup held on February 12, 2015.
This document discusses Apache Zeppelin, an open-source web-based notebook that allows for interactive data analytics. It can be used for data exploration, visualization, collaboration and publishing. Zeppelin has deep integration with Apache Spark and supports multiple languages including Scala, Python, and SQL. It provides a Spark interpreter that allows users to analyze data using Spark without having to configure Spark themselves. The document demonstrates Zeppelin's functionality through examples and encourages readers to try it out and get involved in the community.
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin (Alex Zeltov)
This workshop will provide an introduction to Big Data Analytics using Apache Spark and Apache Zeppelin.
https://github.com/zeltovhorton/intro_spark_zeppelin_meetup
There will be a short lecture that includes an introduction to Spark and its components.
Spark is a unified framework for big data analytics. Spark provides one integrated API for use by developers, data scientists, and analysts to perform diverse tasks that would have previously required separate processing engines such as batch analytics, stream processing and statistical modeling. Spark supports a wide range of popular languages including Python, R, Scala, SQL, and Java. Spark can read from diverse data sources and scale to thousands of nodes.
The lecture will be followed by a demo. There will be a short lecture on Hadoop and how Spark and Hadoop interact and complement each other. You will learn how to move data into HDFS using Spark APIs, create Hive tables, explore the data with Spark and SQL, transform the data, and then issue SQL queries. We will be using Scala and/or PySpark for the labs.
This document discusses Apache Zeppelin, an open-source web-based notebook that enables interactive data analytics. It provides an overview of Zeppelin's history and architecture, including how interpreters and notebook storage are pluggable. The document also outlines Zeppelin's roadmap for improving enterprise support through features like multi-tenancy, impersonation, job management and frontend performance.
A Big Data Lake Based on Spark for BBVA Bank (Oscar Mendez, STRATIO), Spark Summit
This document describes BBVA's implementation of a Big Data Lake using Apache Spark for log collection, storage, and analytics. It discusses:
1) Using Syslog-ng for log collection from over 2,000 applications and devices, distributing logs to Kafka.
2) Storing normalized logs in HDFS and performing analytics using Spark, with outputs to analytics, compliance, and indexing systems.
3) Choosing Spark because it allows interactive, batch, and stream processing with one system using RDDs, SQL, streaming, and machine learning.
Spark is an open-source software framework for rapid calculations on in-memory datasets. It uses Resilient Distributed Datasets (RDDs) that can be recreated if lost and supports transformations and actions on RDDs. Spark is useful for batch, interactive, and real-time processing across various problem domains like SQL, streaming, and machine learning via MLlib.
Archiving, E-Discovery, and Supervision with Spark and Hadoop with Jordan Volz (Databricks)
This document discusses using Hadoop for archiving, e-discovery, and supervision. It outlines the key components of each task and highlights traditional shortcomings. Hadoop provides strengths like speed, ease of use, and security. An architectural overview shows how Hadoop can be used for ingestion, processing, analysis, and machine learning. Examples demonstrate surveillance use cases. While some obstacles remain, partners can help address areas like user interfaces and compliance storage.
This document discusses best practices for running Spark in production. It begins with introductions from the presenters and an overview of Spark deployment modes on YARN. The main topics covered are Spark security using Kerberos authentication and authorization, communication channels and encryption in YARN cluster mode, common issues, and performance tuning. For performance, it recommends choosing executor and task sizes to balance efficiency and overhead, and increasing task parallelism to mitigate data skew problems. The goal is to understand workload patterns and monitor behavior to effectively tune Spark for different situations.
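A minimal sketch of the sizing and parallelism advice summarized above, using Spark 1.6-era configuration keys; the specific values are illustrative only, not recommendations for any particular cluster.

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("tuned-job")
        .set("spark.executor.memory", "4g")       # balance per-executor heap vs. overhead
        .set("spark.executor.cores", "4")         # moderate task slots per executor
        .set("spark.default.parallelism", "400")) # more tasks to smooth out data skew
sc = SparkContext(conf=conf)

# Repartitioning a skewed dataset increases task parallelism
rdd = sc.textFile("hdfs:///tmp/skewed-input").repartition(400)
print(rdd.count())
```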
Intro to Big Data Analytics using Microsoft Machine Learning Server with Spark (Alex Zeltov)
Alex Zeltov - Intro to Big Data Analytics using Microsoft Machine Learning Server with Spark
By combining enterprise-scale R analytics software with the power of Apache Hadoop and Apache Spark, Microsoft R Server for HDP or HDInsight gives you the scale and performance you need. Multi-threaded math libraries and transparent parallelization in R Server handle up to 1000x more data and up to 50x faster speeds than open-source R, which helps you to train more accurate models for better predictions. R Server works with the open-source R language, so all of your R scripts run without changes.
Microsoft Machine Learning Server is your flexible enterprise platform for analyzing data at scale, building intelligent apps, and discovering valuable insights across your business with full support for Python and R. Machine Learning Server meets the needs of all constituents of the process – from data engineers and data scientists to line-of-business programmers and IT professionals. It offers a choice of languages and features algorithmic innovation that brings the best of open source and proprietary worlds together.
R support is built on a legacy of Microsoft R Server 9.x and Revolution R Enterprise products. Significant machine learning and AI capabilities enhancements have been made in every release. In 9.2.1, Machine Learning Server adds support for the full data science lifecycle of your Python-based analytics.
This meetup will NOT be a data science intro or an R programming intro. It is about working with data and big data on MLS.
- How to scale R
- Working with R and Hadoop + Spark
- Demo of MLS on an HDP/HDInsight server with RStudio
- How to operationalize model deployment using the MLS web-service operationalization features, on MLS Server or on the cloud Azure ML (PaaS) offering.
Speaker Bio:
Alex Zeltov is a Big Data Solutions Architect / Software Engineer / Programmer Analyst / Data Scientist with over 19 years of industry experience in Information Technology, most recently in Big Data and Predictive Analytics. He currently works as a Global Black Belt Technical Specialist at Microsoft, where he concentrates on Big Data and Advanced Analytics use cases. Prior to joining Microsoft he worked as a Sr. Solutions Engineer at Hortonworks, where he specialized in the HDP and HDF platforms.
This document provides an overview of SK Telecom's use of big data analytics and Spark. Some key points:
- SKT collects around 250 TB of data per day which is stored and analyzed using a Hadoop cluster of over 1400 nodes.
- Spark is used for both batch and real-time processing due to its performance benefits over other frameworks. Two main use cases are described: real-time network analytics and a network enterprise data warehouse (DW) built on Spark SQL.
- The network DW consolidates data from over 130 legacy databases to enable thorough analysis of the entire network. Spark SQL, dynamic resource allocation in YARN, and integration with BI tools help meet requirements for timely processing and quick query response.
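A sketch of enabling the YARN dynamic resource allocation mentioned above; the executor bounds are illustrative, and the external shuffle service is assumed to be enabled on the NodeManagers, as dynamic allocation requires.

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("network-dw-query")
        .setMaster("yarn-client")  # Spark 1.x-style YARN client mode
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true")   # required by dynamic allocation
        .set("spark.dynamicAllocation.minExecutors", "2")
        .set("spark.dynamicAllocation.maxExecutors", "200"))
sc = SparkContext(conf=conf)
```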
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data... (DataWorks Summit)
In the last few years, the DevOps movement has introduced groundbreaking approaches to the way we manage the lifecycle of software development and deployment. Today organisations aspire to fully automate the deployment of microservices and web applications with tools such as Chef, Puppet and Ansible. However, the deployment of data-processing pipelines remains a relic from the dark ages of software development.
Processing large-scale data pipelines is the main engineering task of the Big Data era, and it should be treated with the same respect and craftsmanship as any other piece of software. That is why we created Apache Amaterasu (Incubating) - an open source framework that takes care of the specific needs of Big Data applications in the world of continuous delivery.
In this session, we will take a close look at Apache Amaterasu (Incubating) a simple and powerful framework to build and dispense pipelines. Amaterasu aims to help data engineers and data scientists to compose, configure, test, package, deploy and execute data pipelines written using multiple tools, languages and frameworks.
We will see what Amaterasu provides today, how it can help existing Big Data applications, and demo some of the new bits that are coming in the near future.
Speaker:
Yaniv Rodenski, Senior Solutions Architect, Couchbase
The document summarizes the past, present, and future of Hadoop at LinkedIn. It describes how LinkedIn initially implemented PYMK (People You May Know) on Oracle in 2006, then moved to Hadoop in 2008 with 20 nodes, scaling up to over 10,000 nodes and 1,000 users by 2016 running various big data frameworks. It discusses the challenges of scaling hardware and processes, and how LinkedIn developed tools like HDFS Dynamometer, Dr. Elephant, Byte-Ray and SoakCycle to help with scaling, performance tuning, dependency management and integration testing of Hadoop clusters. The future may include the Dali project to make data more accessible through different views.
Sa introduction to big data pipelining with cassandra & spark west mins... (Simon Ambridge)
This document provides an overview and outline of a 1-hour introduction to building a big data pipeline using Docker, Cassandra, Spark, Spark-Notebook and Akka. The introduction is presented as a half-day workshop at Devoxx November 2015. It uses a data pipeline environment from Data Fellas and demonstrates how to use scalable distributed technologies like Docker, Spark, Spark-Notebook and Cassandra to build a reactive, repeatable big data pipeline. The key takeaway is understanding how to construct such a pipeline.
Data Science in the Cloud with Spark, Zeppelin, and Cloudbreak (DataWorks Summit)
This document discusses Apache Zeppelin, an open-source web-based notebook that allows for interactive data analytics. It can be used for data exploration, visualization, collaboration and publishing. Zeppelin has deep integration with Apache Spark and supports multiple languages including Scala, Python, and SQL. It provides a modern data science studio environment and allows users to easily share code and results. The document demonstrates Zeppelin's capabilities through examples and encourages readers to join the open source community to help shape its development.
Spark Summit EU talk by Stephan Kessler (Spark Summit)
This document summarizes a talk given by Stephan Kessler at the Spark Summit Europe 2016 about integrating business functionality and specialized engines into Apache Spark using SAP HANA Vora. Key points discussed include using currency conversion and time series query capabilities directly in Spark by pushing computations to the relevant data sources via Spark extensions. SAP HANA Vora allows moving parts of the Spark logical query plan to various data sources like HANA, graph and document stores to perform analysis close to the data.
The document discusses tools and techniques used by Uber's Hadoop team to make their Spark and Hadoop platforms more user-friendly and efficient. It introduces tools like SCBuilder to simplify Spark context creation, Kafka dispersal to distribute RDD results, and SparkPlug to provide templates for common jobs. It also describes a distributed log debugger called SparkChamber to help debug Spark jobs and techniques like building a spatial index to optimize geo-spatial joins. The goal is to abstract out infrastructure complexities and enforce best practices to make the platforms more self-service for users.
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs (Timothy Spann)
The document discusses transformations and actions that can be performed on Resilient Distributed Datasets (RDDs) in Apache Spark. It defines RDD transformations as operations that return pointers to new RDDs without losing the lineage, while actions return final values by running computations on the datasets. The document then proceeds to describe various RDD transformations like map, filter, flatMap, sample, union, join, cogroup and their meanings and provides code examples. It also covers RDD actions like collect, count, take, etc.
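A short PySpark illustration of the transformation/action split the deck describes: transformations only record lineage lazily, while actions trigger computation and return values to the driver.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")
words = sc.parallelize(["spark", "zeppelin", "spark", "hdfs"])

# Transformations: nothing executes yet, only lineage is recorded
pairs = words.map(lambda w: (w, 1))
spark_only = pairs.filter(lambda kv: kv[0] == "spark")

# Actions: these run the computation and return results to the driver
print(pairs.count())          # 4
print(spark_only.collect())   # [('spark', 1), ('spark', 1)]
print(words.take(2))          # first two elements
```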
Zeppelin Interpreters
PSQL (became JDBC in 0.6.x)
Geode
SpringXD
Apache Ambari
Zeppelin Service
Geode, HAWQ and Spring XD services
Webpage Embedder View
This talk will address new architectures emerging for large-scale streaming analytics: some based on Spark, Mesos, Akka, Cassandra and Kafka (SMACK), and other newer streaming analytics platforms and frameworks using Apache Flink or GearPump. Popular architectures like Lambda separate the layers of computation and delivery and require many technologies with overlapping functionality. Some of this results in duplicated code, untyped processes, or high operational overhead, let alone the cost (e.g. of ETL).
I will discuss the problem domain and what is needed in terms of strategies, architecture and application design and code to begin leveraging simpler data flows. We will cover how the particular set of technologies addresses common requirements and how collaboratively they work together to enrich and reinforce each other.
Stream All Things—Patterns of Modern Data Integration with Gwen Shapira (Databricks)
This document discusses patterns for modern data integration using streaming data. It outlines an evolution from data warehouses to data lakes to streaming data. It then describes four key patterns: 1) Stream all things (data) in one place, 2) Keep schemas compatible and process data on, 3) Enable ridiculously parallel single message transformations, and 4) Perform streaming data enrichment to add additional context to events. Examples are provided of using Apache Kafka and Kafka Connect to implement these patterns for a large hotel chain integrating various data sources and performing real-time analytics on customer events.
Spark Summit EU talk by Christos Erotocritou (Spark Summit)
This document discusses Apache Ignite and how it can be used with Apache Spark for fast data applications. It provides an overview of Ignite's in-memory data fabric capabilities, how it compares to Spark, and how Ignite can be integrated with Spark to provide shared resilient storage and distributed computing. Examples are given of reading and writing data between Ignite and Spark and using Ignite's in-memory file system and SQL support from Spark.
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters... (Databricks)
At the end of the day, the only thing that data scientists want is tabular data for their analysis. They do not want to spend hours or days preparing data. How does a data engineer handle the massive amount of data that is being streamed at them from IoT devices and apps, and at the same time add structure to it so that data scientists can focus on finding insights and not preparing data? By the way, you need to do this within minutes (sometimes seconds). Oh… and there are a lot of other data sources that you need to ingest, and the current providers of data are changing their structure.
GoPro has massive amounts of heterogeneous data being streamed from their consumer devices and applications, and they have developed the concept of “dynamic DDL” to structure their streamed data on the fly using Spark Streaming, Kafka, HBase, Hive and S3. The idea is simple: Add structure (schema) to the data as soon as possible; allow the providers of the data to dictate the structure; and automatically create event-based and state-based tables (DDL) for all data sources to allow data scientists to access the data via their lingua franca, SQL, within minutes.
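A loose sketch of the "dynamic DDL" idea described above: let the incoming records dictate the schema and register a queryable table as soon as data lands. Paths and table names are hypothetical, and GoPro's actual pipeline (Spark Streaming, Kafka, HBase, Hive, S3) is considerably more involved than this batch simplification.

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="dynamic-ddl")
sqlContext = HiveContext(sc)

# Schema is inferred from the JSON events themselves, so a new field
# sent by a device simply shows up as a new column.
batch = sqlContext.read.json("hdfs:///landing/device-events/2018-06-01")
batch.printSchema()

# Persist the inferred structure so analysts can query it in SQL
batch.write.mode("append").saveAsTable("device_events")
sqlContext.sql("SELECT COUNT(*) FROM device_events").show()
```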
MLflow: Infrastructure for a Complete Machine Learning Life Cycle (Databricks)
ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but even these platforms are limited to a few supported algorithms and to each company’s internal infrastructure.
In this talk, we will present MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.
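A minimal sketch of MLflow's tracking abstraction described above; the parameter, metric, and artifact file are toy stand-ins for a real training run.

```python
import mlflow

with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)      # record a hyperparameter for reproducibility
    mlflow.log_metric("rmse", 0.87)     # record a result so runs can be compared
    mlflow.log_artifact("model.pkl")    # attach an output file (assumed to exist) to the run
```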
This document provides an overview of Apache Spark, including:
- Apache Spark is a next generation data processing engine for Hadoop that allows for fast in-memory processing of huge distributed and heterogeneous datasets.
- Spark offers tools for data science and components for data products and can be used for tasks like machine learning, graph processing, and streaming data analysis.
- Spark improves on MapReduce by being faster, allowing parallel processing, and supporting interactive queries. It works on both standalone clusters and Hadoop clusters.
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi (Databricks)
At Apple we rely on processing large datasets to power key components of Apple’s largest production services. Spark is continuing to replace and augment traditional MR workloads with its speed and low barrier to entry. Our current analytics infrastructure consists of over an exabyte of storage and close to a million cores. Our footprint is also growing further with the addition of new elastic services for streaming, ad hoc, and interactive analytics.
In this talk we will cover the challenges of working at scale with tricks and lessons learned managing large multi-tenant clusters. We will also discuss designing and building a self-service elastic analytics platform on Mesos.
Tim Spann will present on learning Apache Spark. He is a senior solutions architect who previously worked as a senior field engineer and startup engineer. airis.DATA, where Spann works, specializes in machine learning and graph solutions using Spark, H2O, Mahout, and Flink on petabyte datasets. The agenda includes an overview of Spark, an explanation of MapReduce, and hands-on exercises to install Spark, run a MapReduce job locally, and build a project with IntelliJ and SBT.
This document discusses end-to-end processing of 3.7 million telemetry events per second using a lambda architecture at Symantec. It provides an overview of Symantec's security data lake infrastructure, the telemetry data processing architecture using Kafka, Storm and HBase, tuning targets for the infrastructure components, and performance benchmarks for Kafka, Storm and Hive.
Rob Peglar - Introduction to Analytics and Big Data with Hadoop (Ghassan Al-Yafie)
This document provides an introduction to analytics and big data using Hadoop. It discusses the growth of digital data and challenges of big data. Hadoop is presented as a solution for storing and processing large, unstructured datasets across commodity servers. The key components of Hadoop - HDFS for distributed storage and MapReduce for distributed processing - are described at a high level. Examples of industries using big data analytics are also listed.
Git is a version control system that allows developers to track changes in code and collaborate on projects. GitHub is a hosting service for Git repositories that offers collaboration features like code review and branching workflows. The document introduces Git and GitHub basics and outlines the GitHub Flow for collaborating via feature branching, pull requests, and code review before merging changes into the master branch. It concludes with reminders for good version control practices and sources for further information.
AWS User Group Presentation.
Hosted by PolarSeven - http://polarseven.com
5th October 2016
Session 1: Presentation
Jason Umiker:
Art of PaaS - Lessons learned from running Micros, a platform for hundreds of microservices on AWS
This document provides biographical information about the members of the band Led Zeppelin, including Jimmy Page, Robert Plant, John Bonham, and John Paul Jones. It also summarizes some of their most famous songs like "Stairway to Heaven" and "Moby Dick" and albums. The document concludes that Led Zeppelin is considered the world's best band because of their good music, band history, and members' dedication to the group.
How to build a hybrid cloud in IaaS or SaaS mode and bring the best of... (Microsoft Technet France)
Hitachi Data Systems session: Today, data centers are transforming to meet new needs with ever more agility and performance while, in parallel, CIOs are looking at optimizing and reducing costs. Hitachi Data Systems offers new hybrid-cloud solutions capable of meeting these challenges. Through the Hitachi Unified Compute Platform converged solutions for Microsoft Private Cloud, you can easily build an IaaS hybrid cloud by relying on software-defined management of your data center (SDDC) and the integration packs for Microsoft Azure. Then, with our solutions, you can also deliver mobility and synchronization services to Windows users in private-cloud mode, while using Azure for archiving your data. You thus use the best of both worlds, private cloud and public cloud, for your users while reducing your operational costs.
Introduction to Big Data Analytics on Apache Hadoop (Avkash Chauhan)
The document discusses Hadoop and big data. It defines Hadoop as an open source, scalable, and fault tolerant platform for storing and processing large amounts of unstructured data distributed across machines. It describes Hadoop's core components like HDFS for data storage and MapReduce/YARN for data processing. It also discusses how Hadoop fits into big data scenarios and landscapes, applying Hadoop to save money, the concept of data lakes, Hadoop in the cloud, and big data analytics with Hadoop.
4. Building a Data Product using Apache Zeppelin - Apache Kylin Meetup @Shanghai (Luke Han)
The document discusses building a data product using Apache Zeppelin (incubating) that sends email notifications about new open source projects on GitHub. It outlines the steps to download data from the GitHub archive, explore and filter the data to focus on interesting companies, join additional data from the GitHub API, generate an HTML template to visualize the results, and schedule sending the email notifications.
The document discusses the Windows Azure platform, which provides an internet-scale, highly available cloud fabric hosted in Microsoft's globally distributed data centers. It offers compute, storage, data, integration, access control, and other services to build applications that can automatically scale out and integrate on-premises systems. The document outlines different application models, architectural patterns, and benefits of building on the Windows Azure platform.
This document compares cloud platforms Amazon Web Services (AWS) and Microsoft Azure. It finds that AWS is more oriented toward infrastructure as a service (IaaS) while Azure is more platform as a service (PaaS) oriented, though both platforms offer services across IaaS and PaaS. The document also compares specific cloud storage, databases, networking, deployment, middleware, tools and high availability/disaster recovery features between AWS and Azure.
Introducing Kafka Streams: Large-scale Stream Processing with Kafka, Neha Narkhede (confluent)
The concept of stream processing has been around for a while and most software systems continuously transform streams of inputs into streams of outputs. Yet the idea of directly modeling stream processing in infrastructure systems is just coming into its own after a few decades on the periphery.
At its core, stream processing is simple: read data in, process it, and maybe emit some data out. So why are there so many stream processing frameworks that all define their own terminology? And are the components of each even comparable? Why do I need to know about spouts or DStreams just to process a simple sequence of records? Depending on your application’s requirements, you may not need a framework.
This talk will be delivered by one of the creators of the popular stream data systems Apache Kafka and will abstract away the details of individual frameworks while describing the key features they provide. These core features include scalability and parallelism through data partitioning, fault tolerance and event processing order guarantees, support for stateful stream processing, and handy stream processing primitives such as windowing. Based on our experience building and scaling Kafka to handle streams that captured hundreds of billions of records per day — this presentation will help you understand how to map practical data problems to stream processing and how to write applications that process streams of data at scale.
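In the spirit of the talk's point that a simple read-process-emit pipeline may not need a full framework, here is a deliberately framework-free sketch using the confluent-kafka Python client; the broker address, group id, and topic names are hypothetical.

```python
from confluent_kafka import Consumer, Producer

consumer = Consumer({"bootstrap.servers": "localhost:9092",
                     "group.id": "uppercaser",
                     "auto.offset.reset": "earliest"})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["raw-events"])

while True:
    msg = consumer.poll(1.0)                # read data in
    if msg is None or msg.error():
        continue
    out = msg.value().upper()               # process it
    producer.produce("clean-events", out)   # maybe emit some data out
    producer.poll(0)                        # serve delivery callbacks
```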
Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...WSO2
In this webinar, Srinath Perera, director of research at WSO2, will discuss
Big data landscape: concepts, use cases, and technologies
Real-time analytics with WSO2 CEP
Batch analytics with WSO2 BAM
Combining batch and real-time analytics
Introducing WSO2 Machine Learner
This document provides an introduction to GitHub. It defines Git as a version control system that records changes to files and allows users to revert files to earlier versions. GitHub is described as a hosting service for Git repositories that provides a graphical interface and collaboration features. The document outlines key GitHub concepts like repositories, branches, commits, forking, pull requests and issues. It also summarizes the typical GitHub workflow and includes a link to download GitHub Desktop for a demo.
View the recording:
http://hortonworks.com/webinar/accelerating-real-time-data-ingest-hadoop/
Hadoop didn’t disrupt the data center. The exploding amounts of data did. But, let’s face it, if you can’t move your data to Hadoop, then you can’t use it in Hadoop. The experts from Hortonworks, the #1 leader in Hadoop development, and Attunity, a leading data management software provider, cover:
- How to ingest your most valuable data into Hadoop using Attunity Replicate
- About how customers are using Hortonworks DataFlow (HDF) powered by Apache NiFi
- How to combine the real-time change data capture (CDC) technology with connected data platforms from Hortonworks
We discuss how Attunity Replicate and Hortonworks Data Flow (HDF) work together to move data into Hadoop.
Introduction to Hortonworks Data Cloud for AWS (Yifeng Jiang)
Hortonworks Data Cloud is a new cloud product from Hortonworks that offers pay-as-you-go pricing for launching and managing Hadoop clusters on AWS. It handles common big data use cases and focuses on ease of use by providing prescriptive cluster types. The product aims to improve enterprise readiness in the cloud by providing scalable storage, security and governance features, and reliability through auto-recovery of unhealthy nodes. It also matches Hadoop with cloud capabilities like scalable storage, customizability, and cost-effective compute.
This document provides an overview of installing and programming with Apache Spark on the Hortonworks Data Platform (HDP). It discusses how Spark fits within HDP and can be used for batch processing, streaming, SQL queries and machine learning. The document outlines how to install Spark on HDP using Ambari and describes Spark programming with Resilient Distributed Datasets (RDDs), transformations, actions and caching/persistence. It provides examples of Spark APIs and programming patterns.
This document provides an overview of installing and programming with Apache Spark on Hortonworks Data Platform (HDP). It introduces Spark and its components, benefits over other frameworks, and Hortonworks' commitment to Spark. The document outlines an example Spark programming workflow using Resilient Distributed Datasets (RDDs) in Scala, and covers common RDD transformations, actions, and persistence methods. It also discusses Spark deployment modes like standalone and on YARN, and reference HDP architectures using Spark.
Hortonworks tech workshop: in-memory processing with Spark (Hortonworks)
Apache Spark offers unique in-memory capabilities and is well suited to a wide variety of data processing workloads including machine learning and micro-batch processing. With HDP 2.2, Apache Spark is a fully supported component of the Hortonworks Data Platform. In this session we will cover the key fundamentals of Apache Spark and operational best practices for executing Spark jobs along with the rest of Big Data workloads. We will also provide a working example to showcase micro-batch and machine learning processing using Apache Spark.
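A classic micro-batch word count matching the Spark-on-HDP era this session covers, using the DStream API; the socket source is just a stand-in for a real ingest channel.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="micro-batch-demo")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()   # print each micro-batch's counts

ssc.start()
ssc.awaitTermination()
```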
Unit II Real Time Data Processing tools.pptx (Rahul Borate)
Apache Spark is a lightning-fast cluster computing framework designed for real-time processing. It overcomes limitations of Hadoop by running 100 times faster in memory and 10 times faster on disk. Spark uses resilient distributed datasets (RDDs) that allow data to be partitioned across clusters and cached in memory for faster processing.
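A small sketch of the in-memory caching that underlies the speed advantage described above; the input path and partition count are illustrative.

```python
from pyspark import SparkContext

sc = SparkContext(appName="cache-demo")
data = sc.textFile("hdfs:///tmp/big-input", minPartitions=8)

cached = data.cache()   # keep partitions in executor memory after first use
print(cached.count())   # first pass reads from disk and populates the cache
print(cached.count())   # second pass is served from memory
```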
This document provides an overview of the Apache Spark framework. It covers Spark fundamentals including the Spark execution model using Resilient Distributed Datasets (RDDs), basic Spark programming, and common Spark libraries and use cases. Key topics include how Spark improves on MapReduce by operating in-memory and supporting general graphs through its directed acyclic graph execution model. The document also reviews Spark installation and provides examples of basic Spark programs in Scala.
http://hortonworks.com/hadoop/spark/
Recording:
https://hortonworks.webex.com/hortonworks/lsr.php?RCID=03debab5ba04b34a033dc5c2f03c7967
As the ratio of memory to processing power rapidly evolves, many within the Hadoop community are gravitating towards Apache Spark for fast, in-memory data processing. And with YARN, they use Spark for machine learning and data science use cases alongside other workloads simultaneously. This is a continuation of our YARN Ready Series, aimed at helping developers learn the different ways to integrate with YARN and Hadoop. Tools and applications that are YARN Ready have been verified to work within YARN.
Spark is a fast, general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R for distributed tasks including SQL, streaming, and machine learning. Spark improves on MapReduce by keeping data in-memory, allowing iterative algorithms to run faster than disk-based approaches. Resilient Distributed Datasets (RDDs) are Spark's fundamental data structure, acting as a fault-tolerant collection of elements that can be operated on in parallel.
Spark is an open-source cluster computing framework that can run analytics applications much faster than Hadoop by keeping data in memory rather than on disk. While Spark can access Hadoop's HDFS storage system and is often used as a replacement for Hadoop's MapReduce, Hadoop remains useful for batch processing and Spark is not expected to fully replace it. Spark provides speed, ease of use, and integration of SQL, streaming, and machine learning through its APIs in multiple languages.
- Apache Spark is an open-source cluster computing framework that is faster than Hadoop for batch processing and also supports real-time stream processing.
- Spark was created to be faster than Hadoop for interactive queries and iterative algorithms by keeping data in-memory when possible.
- Spark consists of Spark Core for the basic RDD API and also includes modules for SQL, streaming, machine learning, and graph processing. It can run on several cluster managers including YARN and Mesos.
http://bit.ly/1BTaXZP – As organizations look for even faster ways to derive value from big data, they are turning to Apache Spark, an in-memory processing framework that offers lightning-fast big data analytics, providing speed, developer productivity, and real-time processing advantages. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for fast, iterative in-memory and streaming analysis. This talk will give an introduction to the Spark stack, explain how Spark achieves its lightning-fast results, and show how it complements Apache Hadoop. By the end of the session, you’ll come away with a deeper understanding of how you can unlock deeper insights from your data, faster, with Spark.
This session covers how to work with the PySpark interface to develop Spark applications, from loading and ingesting data to applying transformations. The session covers how to work with different data sources, apply transformations, and follow Python best practices when developing Spark apps. The demo covers integrating Apache Spark apps, in-memory processing capabilities, working with notebooks, and integrating analytics tools into Spark applications.
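A hedged sketch of the load-transform-save flow such a session walks through; the file locations and column names are hypothetical, and the CSV reader assumes the spark-csv package used with Spark 1.x.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="pyspark-etl")
sqlContext = SQLContext(sc)

# Load from one data source (CSV via the spark-csv package in Spark 1.x)
df = (sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true").option("inferSchema", "true")
      .load("hdfs:///tmp/sales.csv"))

# Apply transformations with the DataFrame API
cleaned = df.filter(df["amount"] > 0).withColumnRenamed("amount", "amount_usd")

# Write to a different, columnar data source
cleaned.write.mode("overwrite").parquet("hdfs:///tmp/sales_parquet")
```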
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R... (Dataconomy Media)
What is Big Data? What is Hadoop? What is MapReduce? How do other components such as Oozie, Hue, Hive, and Impala work? Which are the main Hadoop distributions? What is Spark? What are the differences between batch and streaming processing? And which Business Intelligence solutions are available, illustrated through business cases?
Spark example (ShidrokhGoudarzi1)
Spark is a fast, general-purpose engine for large-scale data processing. It has advantages over MapReduce in speed, ease of use, and the ability to run everywhere. Spark supports SQL querying, streaming, machine learning, and graph processing, and can be programmed in Scala, Java, or Python. A Spark application consists of a driver and executors that run tasks over RDDs and shared variables. The Spark shell provides an interactive way to learn the API and analyze data.
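The "shared variables" mentioned here are broadcast variables and accumulators; a minimal sketch (the data is illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="SharedVariables")

stopwords = sc.broadcast({"the", "a", "an"})  # read-only copy shipped to executors
skipped = sc.accumulator(0)                   # counter aggregated back to the driver

def keep(word):
    if word in stopwords.value:
        skipped.add(1)
        return False
    return True

words = sc.parallelize(["the", "quick", "brown", "fox", "a", "dog"])
print(words.filter(keep).collect())  # ['quick', 'brown', 'fox', 'dog']
print(skipped.value)                 # 2, meaningful only after the action has run

sc.stop()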
Apache Spark is an open-source distributed processing engine that is up to 100 times faster than Hadoop for processing data stored in memory and 10 times faster for data stored on disk. It provides high-level APIs in Java, Scala, Python and SQL and supports batch processing, streaming, and machine learning. Spark runs on Hadoop, Mesos, Kubernetes or standalone and can access diverse data sources using its core abstraction called resilient distributed datasets (RDDs).
Big Data Hoopla Simplified - TDWI Memphis 2014 (Rajan Kanitkar)
The document provides an overview and quick reference guide to big data concepts including Hadoop, MapReduce, HDFS, YARN, Spark, Storm, Hive, Pig, HBase and NoSQL databases. It discusses the evolution of Hadoop from versions 1 to 2, and new frameworks like Tez and YARN that allow different types of processing beyond MapReduce. The document also summarizes common big data challenges around skills, integration and analytics.
In this one-day workshop, we will introduce Spark in a high-level context. Spark is fundamentally different from writing MapReduce jobs, so no prior Hadoop experience is needed. You will learn how to interact with Spark on the command line and conduct rapid in-memory data analyses. We will then work on writing Spark applications to perform large cluster-based analyses, including SQL-like aggregations, machine learning applications, and graph algorithms. The course will be conducted in Python using PySpark.
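As a taste of the "SQL-like aggregations" portion, a hedged PySpark sketch (schema and values are illustrative):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="Aggregations")
sqlContext = SQLContext(sc)

sales = sqlContext.createDataFrame(
    [("east", 100.0), ("west", 250.0), ("east", 75.0)],
    ["region", "amount"])

# Equivalent to: SELECT region, SUM(amount) FROM sales GROUP BY region
sales.groupBy("region").sum("amount").show()

sc.stop()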
This is an introductory tutorial to Apache Spark from the Lagos Scala Meetup II. We discussed the basics of the Spark processing engine and how it relates to Hadoop MapReduce, with a little hands-on at the end of the session.
An engine to process big data in a faster (than MapReduce), easier, and extremely scalable way. An open-source, parallel, in-memory processing, cluster computing framework. A solution for loading, processing, and end-to-end analysis of large-scale data. Iterative and interactive, with Scala, Java, Python, and R APIs and a command-line interface.
This document provides an overview of Apache Spark, including what it is, its evolution and features, its components, and the differences between Spark and Hadoop. Spark was originally developed in 2009 as a fast and general engine for large-scale data processing. It has since become a top-level Apache project and is designed to be up to 100 times faster than Hadoop in memory and 10 times faster on disk. Spark supports SQL, streaming, machine learning, and graph processing through components built on its core engine.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
Van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
TrustArc Webinar - 2024 Global Privacy Survey (TrustArc)
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your costs through an optimized configuration and keep them low going forward.
These topics will be covered:
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
GraphRAG for Life Science to Increase LLM Accuracy (Tomaz Bratanic)
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers.
UiPath Test Automation using UiPath Test Suite series, part 5 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series, part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Communications Mining Series - Zero to Hero - Session 1 (DianaGray10)
This session provides an introduction to UiPath Communications Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communications Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How it can help today’s businesses, and its benefits
• Phases in Communications Mining
• Demo on Platform overview
• Q/A
Driving Business Innovation: Latest Generative AI Advancements & Success Story (Safe Software)
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack (shyamraj55)
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence (IndexBug)
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... (SOFTTECHHUB)
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.