Big Data Processing With
Scala and Spark
Twitter @edurekaIN, Facebook /edurekaIN, use #askEdureka for Questions
Objectives of this Session
• What is Big Data?
• What is Spark?
• Why Spark?
• Spark Ecosystem
• A note about Scala
• Why Scala?
• Hello Spark!
For queries during the session and class recording:
• Post on Twitter @edurekaIN: #askEdureka
• Post on Facebook /edurekaIN
Big Data
• Lots of data (terabytes or petabytes)
• Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications
• The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization
What is Spark?
• Apache Spark is a general-purpose, in-memory cluster computing system
• Provides high-level APIs in Java, Scala, and Python, and an optimized engine that supports general execution graphs
• Provides various high-level tools, such as Spark SQL for structured data processing and MLlib for machine learning
Why Spark?
Cluster manager deployment
• The Spark framework can be deployed through Apache Mesos, through Apache Hadoop via YARN, or through Spark's own standalone cluster manager.
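
As a rough illustration of these three deployment options, each cluster manager corresponds to a different master URL that a Spark application is pointed at. The sketch below is plain Scala with no Spark dependency; the `Deploy` object, its case classes, and the host name `master-host` are hypothetical names chosen for illustration. The URL schemes and default ports (7077 for standalone, 5050 for Mesos) follow Spark's documented conventions; older releases used `yarn-client` or `yarn-cluster` where current ones use `yarn`.

```scala
// Hypothetical helper (not part of Spark): models the three cluster
// managers named on the slide and the master URL each one is addressed by.
object Deploy {
  sealed trait ClusterManager
  // Spark's own standalone cluster manager; 7077 is its default port.
  final case class Standalone(host: String, port: Int = 7077) extends ClusterManager
  // Apache Mesos; 5050 is the default Mesos master port.
  final case class Mesos(host: String, port: Int = 5050) extends ClusterManager
  // Hadoop YARN; the cluster location comes from the Hadoop configuration,
  // so the master URL carries no host or port.
  case object Yarn extends ClusterManager

  // Builds the string that would be passed as the application's master URL.
  def masterUrl(cm: ClusterManager): String = cm match {
    case Standalone(host, port) => s"spark://$host:$port"
    case Mesos(host, port)      => s"mesos://$host:$port"
    case Yarn                   => "yarn"
  }
}
```

The same application code can then target any of the three managers by changing only this one string, which is the practical meaning of the deployment flexibility described above.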
Why Spark?
Polyglot
• The Spark framework is polyglot: it can be programmed in several programming languages (currently Scala, Java, and Python are supported).
Why Spark?
• A fully Apache Hive-compatible data warehousing system that can run up to 100x faster than Hive.
• Up to 100x faster for certain applications.
Why Spark?
• Provides powerful caching and disk persistence capabilities
• Interactive data analysis
• Faster batch processing
• Iterative algorithms
• Real-time stream processing
• Faster decision-making
Spark Community is Super Active!
Spark Ecosystem
• Spark Core Engine
• Shark (SQL)
• Spark Streaming (Streaming)
• MLlib (Machine learning)
• GraphX (Graph computation)
• SparkR (R on Spark)
• BlinkDB (Approximate SQL)
Legend: Alpha/Pre-alpha
Spark Ecosystem (Contd.)
Components built on the Spark Core Engine:
• Shark (SQL): Used for structured data. Can run unmodified Hive queries on an existing Hadoop deployment.
• Spark Streaming (Streaming): Enables analytical and interactive apps for live streaming data.
• MLlib (Machine learning): A machine learning library being built on top of Spark. It provides for many machine learning algorithms, with speeds up to 100 times faster than MapReduce.
• GraphX (Graph computation): A graph computation engine (similar to Giraph).
• SparkR (R on Spark): A package for the R language that lets R users leverage Spark's power from the R shell.
• BlinkDB (Approximate SQL): An approximate query engine that runs over the Spark Core Engine.
A Note on Scala
• Scala is a general-purpose programming language designed to express common programming patterns in a concise, elegant, and type-safe way
• Scala supports both object-oriented programming and functional programming
• Scala is woven into the fabric of present and future big data frameworks such as Scalding, Spark, and Akka
  » All Spark examples in class will be written in Scala
  » Scala will be covered before Spark as part of the course
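
To give a feel for what "concise and type-safe" can look like in practice, here is a minimal sketch in plain Scala (no Spark required). The `Tweet` type, the `Concise` object, and the `hashtagCounts` function are hypothetical names invented for this illustration, not part of any library or of the course material.

```scala
object Concise {
  // Hypothetical record type; a case class gets equality, toString,
  // and pattern-matching support without any extra code.
  final case class Tweet(user: String, text: String)

  // Counts how often each hashtag appears across all tweets.
  // Only the method signature is annotated; the types of the
  // intermediate collections are inferred and checked by the compiler.
  def hashtagCounts(tweets: Seq[Tweet]): Map[String, Int] =
    tweets
      .flatMap(_.text.split("\\s+"))   // split every tweet into words
      .filter(_.startsWith("#"))       // keep only the hashtags
      .groupBy(identity)               // group identical hashtags together
      .map { case (tag, hits) => tag -> hits.size }
}
```

For example, `Concise.hashtagCounts(Seq(Concise.Tweet("a", "#spark is fast")))` yields a `Map[String, Int]`; treating that result as, say, a `List[String]` would be rejected at compile time rather than failing at run time.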
Why Scala?
• Scala is a pure object-oriented language. Conceptually, every value is an object and every operation is a method call. The language supports advanced component architectures through classes and traits
• Scala is also a functional language. It supports first-class functions and immutable data structures, and prefers immutability over mutation
• Seamlessly integrated with Java
• Heavily used in big data and web development frameworks such as Spark, Akka, Scalding, and Play
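
The points above can be sketched in a few lines of plain Scala. This is only an illustration under stated assumptions: the `WhyScala` object and everything inside it (`Greeter`, `Framework`, `exclaim`, `applyTwice`, `javaListSize`) are hypothetical names made up for this example; only `java.util.ArrayList` is a real library class.

```scala
import java.util.{ArrayList => JArrayList}

object WhyScala {
  // Every operation is a method call: `1 + 2` is sugar for calling `+` on 1.
  val sumInfix: Int = 1 + 2
  val sumMethod: Int = (1).+(2)

  // A trait is a reusable component that classes can mix in.
  trait Greeter {
    def name: String
    def greet: String = s"Hello, $name!"
  }
  final case class Framework(name: String) extends Greeter

  // Immutable data: `copy` builds a new value and leaves the original unchanged.
  val spark: Framework = Framework("Spark")
  val akka: Framework = spark.copy(name = "Akka")

  // Functions are values and can be passed to other functions.
  val exclaim: String => String = _ + "!"
  def applyTwice(f: String => String, s: String): String = f(f(s))

  // Java integration: a java.util collection used directly from Scala.
  def javaListSize(items: String*): Int = {
    val list = new JArrayList[String]()
    items.foreach(item => list.add(item))
    list.size()
  }
}
```

Here `WhyScala.spark.greet` uses the method inherited from the trait, and `WhyScala.applyTwice(WhyScala.exclaim, "Hello Spark")` passes a function as an ordinary argument.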
Hello Spark!
Questions?
Buy the Spark course at: www.edureka.in