Walaa Assy
Giza Systems
Software Developer
SPARK
LIGHTNING-FAST UNIFIED ANALYTICS ENGINE
HOW DO WE HANDLE EVER-GROWING DATA THAT HAS BECOME BIG DATA?
Basics of Spark
Core API
 Cluster Managers
Spark Maintenance
Libraries
 - SQL
 - Streaming
 - MLlib
 GraphX
Troubleshooting /
Future of Spark
AGENDA
 Readability
 Expressiveness
 Fast
 Testability
 Interactive
 Fault Tolerant
 Unify Big Data
Spark officially set a new record in large-scale sorting. Spark can compute on data on disk as well as make use of data cached in memory.
WHY SPARK? TINIER CODE LEADS TO ..
 MapReduce has a very narrow scope, mainly batch processing
 Each new problem needed a new API to solve it
EXPLOSION OF MAPREDUCE
A UNIFIED PLATFORM FOR BIG DATA
SPARK PROGRAMMING LANGUAGES
 The most basic abstraction of Spark
 Spark operations fall into two main categories:
 Transformations [lazily evaluated, only storing the intent]
 Actions [which trigger execution]
 val textFile = sc.textFile("file:///spark/README.md")
 textFile.first // action
RDD [RESILIENT DISTRIBUTED DATASET]
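Expanding the snippet above into a minimal sketch (assuming a spark-shell session, where sc is the provided SparkContext, and that README.md exists at the path shown):

// Transformations only record intent in the lineage; nothing runs yet
val textFile = sc.textFile("file:///spark/README.md")
val sparkLines = textFile.filter(line => line.contains("Spark"))

// Actions trigger the actual computation
println(sparkLines.count())   // number of matching lines
println(sparkLines.first())   // first matching line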
HELLO BIG DATA
 sudo yum install wget
 sudo wget https://downloads.lightbend.com/scala/2.13.0-M4/scala-2.13.0-M4.tgz
 tar xvf scala-2.13.0-M4.tgz
 sudo mv scala-2.13.0-M4 /usr/lib
 sudo ln -s /usr/lib/scala-2.13.0-M4 /usr/lib/scala
 export PATH=$PATH:/usr/lib/scala/bin
SCALA INSTALLATION STEPS
 sudo wget https://www.apache.org/dyn/closer.lua/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
 tar xvf spark-2.3.1-bin-hadoop2.7.tgz
 ln -s spark-2.3.1-bin-hadoop2.7 spark
 export SPARK_HOME=$HOME/spark-2.3.1-bin-hadoop2.7
 export PATH=$PATH:$SPARK_HOME/bin
SPARK INSTALLATION – CENTOS 7
SPARK MECHANISM
 A collection of elements partitioned across the nodes of the cluster that can be operated on in parallel
 A collection similar to a list or an array from a user level
 Processed in parallel to speed up computation time without losing fault tolerance
 An RDD is immutable
 Transformations are lazy and stored in a DAG
 Actions trigger DAGs
 A DAG is a lineage graph of tasks
 Each action triggers a fresh execution of the graph (unless results are cached), as sketched below
RDD
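A small sketch of how laziness, the DAG, and caching interact (hypothetical file path; assumes a spark-shell session):

val words = sc.textFile("file:///spark/README.md")
  .flatMap(_.split(" "))           // transformation: recorded in the DAG
  .filter(_.nonEmpty)              // transformation: still nothing executed

words.cache()                      // ask Spark to keep the computed partitions in memory

println(words.count())             // first action: executes the DAG and fills the cache
println(words.distinct().count())  // second action: reuses the cached partitions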
INPUT DATASET TYPES
 map
 flatMap
 filter
 distinct
 sample
 union
 intersection
 subtract
 cartesian
Transformations return RDDs; see the sketch below
TRANSFORMATIONS IN SPARK
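A sketch of these transformations on two small RDDs (the values are chosen only for illustration):

val nums  = sc.parallelize(Seq(1, 2, 2, 3, 4))
val other = sc.parallelize(Seq(3, 4, 5))

nums.map(_ * 2)                                        // 2, 4, 4, 6, 8
nums.flatMap(n => Seq(n, n))                           // every element duplicated
nums.filter(_ % 2 == 0)                                // 2, 2, 4
nums.distinct()                                        // 1, 2, 3, 4
nums.sample(withReplacement = false, fraction = 0.5)   // random subset
nums.union(other)                                      // all elements of both RDDs
nums.intersection(other)                               // 3, 4
nums.subtract(other)                                   // 1, 2, 2
nums.cartesian(other)                                  // every (nums, other) pair

Each call above returns a new RDD, and nothing executes until an action is applied.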
 collect()
 count()
 take(num)
 takeOrdered(num)(ordering)
 reduce(function)
 aggregate(zeroValue)(seqOp, combOp)
 foreach(function)
 Actions return different types depending on the action (see the sketch below)
saveAsObjectFile(path)
saveAsTextFile(path) // saves as a text file
External connectors:
foreach(T => Unit) // one object at a time
 - foreachPartition(Iterator[T] => Unit) // one partition at a time
ACTIONS IN SPARK
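A sketch of the same actions on a small RDD (the output path is hypothetical):

val nums = sc.parallelize(Seq(5, 3, 1, 4, 2))

nums.collect()                        // Array(5, 3, 1, 4, 2), pulled to the driver
nums.count()                          // 5
nums.take(2)                          // Array(5, 3)
nums.takeOrdered(2)                   // Array(1, 2), using the implicit ordering
nums.reduce(_ + _)                    // 15
nums.aggregate(0)(_ + _, _ + _)       // 15, with an explicit zero value
nums.foreach(n => println(n))         // runs on the executors, one element at a time
nums.saveAsTextFile("/tmp/nums-out")  // hypothetical output directory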
 SQL-like pairing
 join
 fullOuterJoin
 leftOuterJoin
 rightOuterJoin
 Pair saving
 saveAs(NewAPI)HadoopFile
 - path
 - keyClass
 - valueClass
 - outputFormatClass
 saveAs(NewAPI)HadoopDataset
 - conf
 saveAsSequenceFile
 Pair saving
 - saveAsHadoopFile(path, keyClass, valueClass, SequenceFileOutputFormat)
PAIR METHODS - CONTD
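A sketch of the pair methods above on two tiny key-value RDDs (the names and paths are hypothetical):

val users  = sc.parallelize(Seq((1, "Walaa"), (2, "Ahmed")))
val orders = sc.parallelize(Seq((1, "laptop"), (1, "phone"), (3, "desk")))

users.join(orders).collect()           // only keys present on both sides
users.leftOuterJoin(orders).collect()  // every user, with an Option for the order
users.fullOuterJoin(orders).collect()  // every key from either side

// Pair saving: write the pairs as a Hadoop SequenceFile
users.saveAsSequenceFile("/tmp/users-seq")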
 Works like a distributed kernel
 Built-in basic Spark manager (standalone)
 Hadoop cluster manager (YARN)
 Apache Mesos
PRIMARY CLUSTER MANAGERS
SPARK-SUBMIT DEMO
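Since the demo itself is not reproduced here, below is a hypothetical minimal application of the kind spark-submit launches; the object name, paths, and arguments are illustrative only:

import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountApp")
    val sc   = new SparkContext(conf)

    val counts = sc.textFile(args(0))      // input path passed on the command line
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile(args(1))         // output path passed on the command line
    sc.stop()
  }
}

Packaged into a jar, it would be launched with spark-submit, passing --class WordCountApp and a --master URL that selects one of the cluster managers above.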
SPARK SQL
 Spark SQL is Apache Spark's module for working with structured or semi-structured data.
 It is meant to be usable by non-big-data users
 As Spark continues to grow, we want to enable wider
audiences beyond “Big Data” engineers to leverage the power
of distributed processing.
Databricks blog (http://bit.ly/17NM70s)
SPARK SQL
 Seamlessly mix SQL queries with Spark programs
Spark SQL lets you query structured data inside Spark programs,
using either SQL or a familiar DataFrame API
 Connect to any data source the same way.
 It executes SQL queries.
 We can read data from an existing Hive installation using Spark SQL.
 When we run SQL within another programming language we get the result back as a Dataset/DataFrame, as in the sketch below.
SPARK SQL FEATURES
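A sketch of mixing SQL and the DataFrame API (assumes Spark 2.x and a hypothetical people.json with name and age fields):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-demo").getOrCreate()
import spark.implicits._

val people = spark.read.json("file:///tmp/people.json")
people.createOrReplaceTempView("people")

// The same question asked two ways: SQL text and the DataFrame API
val viaSql = spark.sql("SELECT name FROM people WHERE age > 21")
val viaApi = people.filter($"age" > 21).select("name")

viaSql.show()
viaApi.show()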
DataFrames and SQL provide a common way to access a variety
of data sources, including Hive, Avro, Parquet, ORC, JSON, and
JDBC. You can even join data across these sources.
 Run SQL or HiveQL queries on existing warehouses. [Hive Integration]
 Connect through JDBC or ODBC. [Standard Connectivity]
 It is included with Spark.
DATAFRAMES
 Introduced in the Spark 1.3 release. It is a distributed collection of data organized into named columns. Conceptually it is equivalent to a table in a relational database or a data frame in R/Python.
We can create a DataFrame using:
 Structured data files
 Tables in Hive
 External databases
 An existing RDD
SPARK DATAFRAME IS
DataFrames = SchemaRDDs (the pre-1.3 name)
EXAMPLES
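As an example, a sketch of the creation paths listed above (the file name, table, connection string, and case class are hypothetical; assumes a SparkSession called spark):

import spark.implicits._

// 1. From a structured data file
val fromJson = spark.read.json("file:///tmp/people.json")

// 2. From a Hive table (requires Hive support on the session)
// val fromHive = spark.sql("SELECT * FROM my_hive_table")

// 3. From an external database over JDBC
// val fromJdbc = spark.read.format("jdbc")
//   .option("url", "jdbc:postgresql://host/db")
//   .option("dbtable", "people")
//   .load()

// 4. From an existing RDD of case classes
case class Person(name: String, age: Int)
val fromRdd = sc.parallelize(Seq(Person("Walaa", 30))).toDF()
fromRdd.printSchema()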
SPARK SQL COMPETITION
 Hive
 Parquet
 JSON
 Avro
 Amazon Redshift
 CSV
 Others
Spark SQL is recommended as a starting point for any Spark application, as it adds the following (sketched below):
 Predicate pushdown
 Column pruning
 Can use both SQL & RDDs
SPARK SQL DATA SOURCES
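A sketch of what predicate pushdown and column pruning look like when reading a columnar source (the Parquet path and schema are hypothetical):

// Only the selected columns are read, and the filter is pushed into the Parquet reader
val events = spark.read.parquet("file:///tmp/events.parquet")
val recent = events.select("userId", "ts").filter($"ts" > "2018-01-01")

recent.explain()  // the physical plan should list the pushed filters for the Parquet scan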
SPARK STREAMING
 Big & fast data
 Gigabytes per second
 Real-time fraud detection
 Marketing
 Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications; a minimal word count is sketched below.
SPARK STREAMING
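A minimal word-count sketch over a socket source (the host, port, and batch interval are illustrative; assumes an existing SparkContext sc):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc   = new StreamingContext(sc, Seconds(5))      // 5-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)   // e.g. fed by `nc -lk 9999`

val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.print()          // print the counts for each micro-batch
ssc.start()
ssc.awaitTermination()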
SPARK STREAMING COMPETITORS
Streaming data:
• Kafka
• Flume
• Twitter
• Hadoop HDFS
• Others (live logs, system telemetry data, IoT device data, etc.)
SPARK MLLIB
 MLlib is a standard component of Spark providing machine learning primitives on top of Spark.
SPARK MLLIB
 MATLAB
 R
EASY TO USE BUT NOT SCALABLE
 MAHOUT
 GRAPHLAB
Scalable, but at the cost of ease of use
 org.apache.spark.mllib
RDD-based algorithms
 org.apache.spark.ml
 Pipeline API built on top of DataFrames
SPARK MLLIB COMPETITION
 Loading the data
 Extracting features
 Training the model
 Testing the model
 The new pipeline API allows tuning, testing, and early failure detection; a minimal pipeline is sketched below
MACHINE LEARNING FLOW
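A minimal pipeline sketch using the DataFrame-based spark.ml API (the training and test DataFrames, with text and label columns, are hypothetical):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

// The stages chain feature extraction, transformation, and training in one object
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

val model  = pipeline.fit(training)   // training: DataFrame(text, label)
val scored = model.transform(test)    // test: held-out DataFrame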
 Algorithms
Classification, e.g. Naïve Bayes
Regression
Linear
Logistic
Collaborative filtering by ALS (alternating least squares)
Clustering by k-means (sketched below)
Dimensionality reduction by SVD (singular value decomposition)
 Feature extraction and transformations
TF-IDF: term frequency-inverse document frequency
ALGORITHMS IN MLLIB
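As one concrete example from the list, a k-means sketch with the RDD-based API (the points and k are illustrative only):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2),
  Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 8.9)))

val model = KMeans.train(points, 2, 20)            // k = 2 clusters, 20 iterations
model.clusterCenters.foreach(println)
println(model.predict(Vectors.dense(0.05, 0.1)))   // cluster index for a new point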
 Spam filtering
 Fraud detection
 Recommendation analysis
 Speech recognition
PRACTICAL USE
 Word2Vec algorithm
 This algorithm takes an input text and outputs a set of vectors representing a dictionary of words [to see word similarity]
 We cache the RDDs because MLlib makes multiple passes over the same data, so this in-memory cache can reduce processing time a lot
 Breeze is the numerical processing library used inside Spark
 It has the ability to perform mathematical operations on vectors
MLLIB DEMO
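Since the demo code itself is not included in the deck, here is a hedged Word2Vec sketch with the RDD-based API (the corpus path is hypothetical):

import org.apache.spark.mllib.feature.Word2Vec

// Each element of the RDD is one sentence, represented as a sequence of words
val corpus = sc.textFile("file:///tmp/corpus.txt").map(_.split(" ").toSeq)
corpus.cache()   // MLlib makes several passes over the data, so keep it in memory

val model = new Word2Vec().fit(corpus)
model.findSynonyms("spark", 5).foreach(println)   // (word, similarity) pairs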
SPARK GRAPHX
 GraphX is Apache Spark's API for graphs and graph-parallel
computation.
 Page ranking
 Producing evaluations
 It can be used in genetic analysis
 ALGORITHMS
 PageRank
 Connected components
 Label propagation
 SVD++
 Strongly connected components
 Triangle count
GRAPHX - FROM A TABLE-STRUCTURED TO A GRAPH-STRUCTURED WORLD
COMPETITORS
End-to-end PageRank performance (20 iterations, 3.7B edges)
 Vertices (the joints) each have a unique ID
 Each vertex can have properties of a user-defined type and store metadata
ARCHITECTURE
 Arrows are relations, known as edges, that can store metadata; each edge connects vertex IDs of type Long
 A graph is built of two RDDs: one containing the collection of edges and one containing the collection of vertices
 Another component is the edge triplet, an object which exposes the relation between each edge and its vertices, containing all the information for each connection; a construction sketch follows
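A small construction sketch matching the description above (the vertex and edge values are illustrative only):

import org.apache.spark.graphx.{Edge, Graph}

// Vertices: (unique Long id, user-defined property)
val vertices = sc.parallelize(Seq((1L, "Walaa"), (2L, "Ahmed"), (3L, "Sara")))

// Edges: relations between vertex ids, carrying their own attribute
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges)

// An edge triplet exposes both endpoints together with the edge attribute
graph.triplets.collect().foreach(t => println(s"${t.srcAttr} ${t.attr} ${t.dstAttr}"))

// One of the built-in algorithms listed above
graph.pageRank(0.001).vertices.collect().foreach(println)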
WHO IS USING SPARK?
 http://spark.apache.org
 Tutorials: http://ampcamp.berkeley.edu
 Spark Summit: http://spark-summit.org
 GitHub: https://github.com/apache/spark
 https://data-flair.training/blogs/spark-sql-tutorial/
REFERENCES
