In Memory Analytics with Apache Spark


This presentation covers the capabilities of in-memory analytics using Apache Spark: an overview of Apache Spark, its programming model, cluster mode with Mesos, supported operations, and a comparison with Hadoop MapReduce, followed by the Apache Spark stack extensions Shark, Spark Streaming, MLlib, and GraphX.

  • In Memory Analytics with Apache Spark

    1. In-Memory Analytics with Apache Spark (Ravi)
    2. Agenda
       • Overview of Spark
       • Spark with Hadoop MapReduce
       • Spark Elements and Operations
       • Spark Cluster Overview
       • Spark Examples
       • Spark Stack Extensions: Shark, Streaming, MLlib, GraphX
    3. In-Memory Analytics
       • In-memory analytics is an approach to querying data when it resides in a computer's random access memory (RAM), as opposed to querying data stored on physical disks.
       • This results in vastly shortened query response times, allowing business intelligence (BI) and analytic applications to support faster business decisions.
       • As the cost of RAM declines, in-memory analytics is becoming feasible for many businesses.
       • BI and analytic applications have long supported caching data in RAM, but older 32-bit operating systems provided only 4 GB of addressable memory.
       • Newer 64-bit operating systems, with up to 1 terabyte (TB) of addressable memory (and perhaps more in the future), have made it possible to cache large volumes of data, potentially an entire data warehouse or data mart, in a computer's RAM.
    4. What is Spark? Lightning-Fast Cluster Computing
       • Not a modified version of Hadoop
       • A separate, fast, MapReduce-like engine
       • In-memory data storage for very fast iterative queries
       • General execution graphs and powerful optimizations
       • Up to 40x faster than Hadoop
       • Spark beats Hadoop by providing primitives for in-memory cluster computing, thereby avoiding the I/O bottleneck between the individual jobs of an iterative MapReduce workflow that repeatedly performs computations on the same working set.
       • Compatible with Hadoop's storage APIs: can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
    5. Quick Recap: Hadoop Ecosystem
    6. Spark Programming Model
       • Key idea: the Resilient Distributed Dataset (RDD) (see the short sketch below)
         - Distributed collections of objects that can be cached in memory across cluster nodes
         - Manipulated through various parallel operations
         - Automatically rebuilt on failure
       • Types of RDD:
         - Parallelized collections: take an existing Scala collection and run functions on it in parallel
             scala> val distData = sc.parallelize(data)
             distData: spark.RDD[Int] = spark.ParallelCollection@10d13e3e
         - Hadoop datasets: run functions on each record of a file in the Hadoop distributed file system, or any other storage system supported by Hadoop
             scala> val distFile = sc.textFile("data.txt")
             distFile: spark.RDD[String] = spark.HadoopRDD@1d4cee08
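        A minimal sketch of these ideas, assuming an existing SparkContext named sc; the HDFS path is a placeholder:

            // Build an RDD from a file, keep the filtered data in memory, and run parallel operations on it
            val lines  = sc.textFile("hdfs://.../logs.txt")
            val errors = lines.filter(line => line.contains("ERROR")).cache()

            // Both actions below reuse the cached partitions; if a node fails,
            // lost partitions are rebuilt automatically from the RDD's lineage.
            val total  = errors.count()
            val byHost = errors.map(line => (line.split(" ")(0), 1))
                               .reduceByKey(_ + _)
                               .collect()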
    7. Automatic Parallelization of Complex Flows
       • When constructing a complex pipeline of MapReduce jobs, the task of correctly parallelizing the sequence of jobs is left to you; a scheduler tool such as Apache Oozie is often required to carefully construct that sequence.
       • With Spark, a whole series of individual tasks is expressed as a single program flow that is lazily evaluated, so the system has a complete picture of the execution graph.
       • This approach allows the core scheduler to correctly map the dependencies across the different stages of the application and automatically parallelize the flow of operators without user intervention.
       • For example, consider a job of the following shape (sketched below): filter one RDD for records containing "ERROR", join it by key with a second RDD, and take the first 10 results.
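        A hedged reconstruction of that flow; the RDD names rdd1 and rdd2 and the choice of key (the first whitespace-separated field) are illustrative assumptions:

            // Transformations are lazy: nothing runs until the final action
            val errors = rdd1.filter(line => line.contains("ERROR"))      // lazy
                             .map(line => (line.split(" ")(0), line))     // key by first field, still lazy
            val keyed2 = rdd2.map(line => (line.split(" ")(0), line))     // also lazy
            val sample = keyed2.join(errors).take(10)                     // take(10) is the action that
                                                                          // triggers the whole graph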
    8. Spark vs Hadoop: Spark is a high-speed cluster computing system, compatible with Hadoop, that can outperform it by up to 100x thanks to its ability to perform computations in memory.
    9. Transformations (e.g. map, filter, groupBy): create a new dataset from an existing one.
       Actions (e.g. count, collect, save): return a value to the driver program after running a computation on the dataset.
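        A small contrast between the two, assuming an existing SparkContext sc; the output path is a placeholder:

            val nums    = sc.parallelize(1 to 1000)
            val evens   = nums.filter(_ % 2 == 0)            // transformation: nothing runs yet
            val squares = evens.map(n => n * n)              // transformation: nothing runs yet
            val count   = squares.count()                    // action: returns a value to the driver
            val all     = squares.collect()                  // action: materialises the dataset on the driver
            squares.saveAsTextFile("hdfs://.../squares")     // action: writes the dataset out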
    10. Spark Elements
        • Application: user program built on Spark; consists of a driver program and executors on the cluster.
        • Driver program: the process running the main() function of the application and creating the SparkContext.
        • Cluster manager: an external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
        • Worker node: any node that can run application code in the cluster.
        • Executor: a process launched for an application on a worker node that runs tasks and keeps data in memory or disk storage across them; each application has its own executors.
        • Task: a unit of work that will be sent to one executor.
        • Job: a parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
        • Stage: each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
    11. Spark Cluster Overview: Cluster Manager Types
        • Standalone: a simple cluster manager included with Spark that makes it easy to set up a cluster.
        • Apache Mesos: a general cluster manager that can also run Hadoop MapReduce and service applications.
        • Hadoop YARN: the resource manager in Hadoop 2.
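        As an illustration, the cluster manager is usually selected through the master URL the driver program passes when creating the SparkContext; host names and ports below are placeholders:

            import org.apache.spark.{SparkConf, SparkContext}

            val standalone = new SparkConf().setAppName("demo").setMaster("spark://master-host:7077")  // Spark standalone
            val onMesos    = new SparkConf().setAppName("demo").setMaster("mesos://mesos-host:5050")   // Apache Mesos
            // Hadoop YARN deployments typically set the master through the submission tooling and
            // Hadoop configuration rather than a host URL in code.

            val sc = new SparkContext(standalone)   // the driver program creates the SparkContext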
    12. Mesos (Dynamic Resource Sharing for Clusters): Run Modes
        • Spark can run over Mesos in two modes: "fine-grained" and "coarse-grained".
        • Fine-grained mode, the default, runs each Spark task as a separate Mesos task. This allows multiple instances of Spark (and other frameworks) to share machines at a very fine granularity, with each application getting more or fewer machines as it ramps up, but it comes with additional overhead in launching each task, which may be inappropriate for low-latency applications (e.g. interactive queries or serving web requests).
        • Coarse-grained mode instead launches only one long-running Spark task on each Mesos machine and dynamically schedules its own "mini-tasks" within it. The benefit is much lower startup overhead, at the cost of reserving the Mesos resources for the complete duration of the application.
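        A minimal sketch of switching from the default fine-grained mode to coarse-grained mode via the spark.mesos.coarse property; the Mesos master URL is a placeholder:

            import org.apache.spark.{SparkConf, SparkContext}

            val conf = new SparkConf()
              .setAppName("mesos-demo")
              .setMaster("mesos://mesos-host:5050")
              .set("spark.mesos.coarse", "true")    // one long-running Spark task per Mesos machine
            val sc = new SparkContext(conf)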
    13. Task Scheduler
        • Runs general DAGs
        • Pipelines functions within a stage
        • Cache-aware data reuse and locality
        • Partitioning-aware to avoid shuffles
    14. Spark Stack Extensions
        Spark powers a stack of high-level tools, which you can combine seamlessly in the same application:
        • Shark for SQL
        • MLlib for machine learning
        • GraphX
        • Spark Streaming
    15. Shark
        Shark makes Hive faster and more powerful.
        • Shark is a data analysis system that marries query processing with complex analytics on large clusters.
        • Shark is an open-source distributed SQL query engine for Hadoop data. It brings state-of-the-art performance and advanced analytics to Hive users.
        • Speed: runs Hive queries up to 100x faster in memory, or 10x faster on disk.
    16. Spark Streaming
        Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications.
        • Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs. It supports both Java and Scala.
        • Spark Streaming lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.
        • Spark Streaming can read data from HDFS, Flume, Kafka, Twitter and ZeroMQ.
        • Since Spark Streaming is built on top of Spark, users can apply Spark's built-in machine learning algorithms (MLlib) and graph processing algorithms (GraphX) to data streams.
        Counting tweets on a sliding window:
            TwitterUtils.createStream(...)
              .filter(_.getText.contains("Spark"))
              .countByWindow(Seconds(5))
        Finding words with higher frequency than in historic data:
            stream.join(historicCounts).filter {
              case (word, (curCount, oldCount)) => curCount > oldCount
            }
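        A self-contained sketch of the same batch-like style: a network word count over 5-second batches, assuming text arrives on localhost:9999:

            import org.apache.spark.SparkConf
            import org.apache.spark.streaming.{Seconds, StreamingContext}

            val conf = new SparkConf().setAppName("streaming-demo").setMaster("local[2]")
            val ssc  = new StreamingContext(conf, Seconds(5))

            val lines  = ssc.socketTextStream("localhost", 9999)
            val counts = lines.flatMap(_.split(" "))
                              .map(word => (word, 1))
                              .reduceByKey(_ + _)     // written exactly like the equivalent batch job
            counts.print()

            ssc.start()
            ssc.awaitTermination()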
    17. MLlib
        MLlib is Apache Spark's scalable machine learning library.
        • MLlib fits into Spark's APIs and interoperates with NumPy in Python (starting in Spark 0.9). You can use any Hadoop data source (e.g. HDFS, HBase, or local files), making it easy to plug into Hadoop workflows.
        Calling MLlib in Scala:
            val points = spark.textFile("hdfs://...")
                              .map(parsePoint)
            val model = KMeans.train(points)
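        A fuller hedged sketch of the same k-means call, assuming a Spark 1.x-era MLlib API, an existing SparkContext sc, and a placeholder HDFS file with one space-separated numeric vector per line; k and the iteration count are arbitrary:

            import org.apache.spark.mllib.clustering.KMeans
            import org.apache.spark.mllib.linalg.Vectors

            val data   = sc.textFile("hdfs://.../points.txt")
            val points = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble))).cache()

            val model = KMeans.train(points, 2, 20)   // k = 2 clusters, 20 iterations
            val cost  = model.computeCost(points)     // within-cluster sum of squared distances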
    18. GraphX: Unifying Graphs and Tables
        GraphX extends Spark's distributed, fault-tolerant collections API and interactive console with a new graph API that leverages recent advances in graph systems (e.g. GraphLab) to let users easily and interactively build, transform, and reason about graph-structured data at scale.
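        A minimal sketch of the GraphX API, assuming an existing SparkContext sc and a placeholder edge-list file ("srcId dstId" per line); PageRank is one of GraphX's built-in algorithms:

            import org.apache.spark.graphx.GraphLoader

            val graph = GraphLoader.edgeListFile(sc, "hdfs://.../edges.txt")
            val ranks = graph.pageRank(0.0001).vertices   // iterate until ranks change by less than the tolerance

            ranks.take(10).foreach { case (id, rank) => println(s"$id -> $rank") }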
    19. BDAS, the Berkeley Data Analytics Stack
        BDAS is an open-source software stack that integrates software components being built by the AMPLab to make sense of Big Data.
    20. Software and Research Projects
        • Shark: Hive and SQL on top of Spark
        • MLbase: machine learning project on top of Spark
        • BlinkDB: a massively parallel, approximate query engine built on top of Shark and Spark
        • GraphX: a graph processing and analytics framework on top of Spark (GraphX has been merged into Spark 0.9)
        • Apache Mesos: cluster management system that supports running Spark
        • Tachyon: in-memory storage system that supports running Spark
        • Apache MRQL: a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, and Spark
        • OpenDL: a deep learning algorithm library based on the Spark framework; just getting started
        • SparkR: R frontend for Spark
        • Spark Job Server: REST interface for managing and submitting Spark jobs on the same cluster
    21. Conclusion
        • "Big data" is moving beyond one-pass batch jobs to low-latency applications that need data sharing.
        • RDDs offer fault-tolerant sharing at memory speed.
        • Spark uses them to combine streaming, batch, and interactive analytics in one system.