
Spark with Couchbase to Electrify Your Data Processing: Couchbase Connect 2015


Apache Spark is a fast, general-purpose engine for both large-scale data processing and stream processing. Mix its built-in machine learning with Couchbase Server and you have a Swiss Army knife for real-time data analytics. In this session you will learn about Apache Spark and how it fits into the Couchbase ecosystem. You will learn how to leverage core Spark components as well as higher-level integrations like Spark SQL and Spark Streaming. And since all talk and no play makes Jack a dull boy, there will be plenty of code and demos!



  1. SPARK WITH COUCHBASE TO ELECTRIFY YOUR DATA PROCESSING. Michael Nitschinger, Couchbase
  2. What is Spark?
  3. Introduction: Apache Spark is a fast and general engine for large-scale data processing. (©2015 Couchbase Inc.)
  4. More Facts: Over 450 contributors; a very active Apache Big Data project. Huge public interest. Source: http://www.google.com/trends/explore?hl=en-US#q=apache%20spark,%20apache%20hadoop&cmpt=q
  5. Community: Ecosystem growing fast; Hadoop; RDBMS; NoSQL; Package Repository: http://spark-packages.org/; Connectors; Utility Libraries
  6. Components: Spark Core. Resilient Distributed Datasets; Clustering; Execution
  7. Components: Spark SQL. Structured Data Frames; Distributed querying with SQL
  8. Components: Spark Streaming. Fault-tolerant streaming applications
  9. Components: Spark MLlib. Built-in machine learning algorithms
  10. Components: Spark GraphX. Graph processing and graph-parallel computations
  11. How does it work? Resilient Distributed Datasets paper: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf. Diagram, from left to right: code such as rdd1.join(rdd2).groupBy(…).filter(…) creates RDD Objects (build the DAG) → [DAG] → DAGScheduler (splits the graph into stages of tasks and submits each stage as ready; agnostic to operators) → [TaskSet] → TaskScheduler (launches tasks via the cluster manager and retries failed or straggling tasks; doesn't know about stages) → [Task] → Worker (executes tasks in threads; its block manager stores and serves blocks). A "stage failed" signal flows back from the TaskScheduler to the DAGScheduler.
  12. Why should you care?
  13. Spark Benefits: Linearly scalable to 1000+ worker nodes; simpler to use than Hadoop MapReduce; only partial recompute on failure; for developers and data scientists (machine learning, R integration); tight but not mandatory Hadoop integration (sources, sinks, scheduler)
  14. Spark vs Hadoop: Spark is RAM-bound, while Hadoop is mainly HDFS (disk) bound; fully compatible with Hadoop input/output; easier to develop against thanks to functional composition; Hadoop is certainly more mature, but the Spark ecosystem is growing fast
  15. Ecosystem Flexibility (diagram labels): RDBMS, Streams, Web APIs, DCP, KV, N1QL, Views, Batching, Data Archive, OLTP Data
  16. Infrastructure Consolidation
  17. The Couchbase Spark Connector
  18. Couchbase Connector. Spark Core: automatic cluster and resource management; creating and persisting RDDs; Java APIs in addition to Scala (planned before GA). Spark SQL: easy JSON handling and querying; tight N1QL integration (partially in dp2, fully planned before GA). Spark Streaming: persisting DStreams; DCP source (partially in dp2, fully planned before GA)
  19. Facts: Current version: 1.0.0-dp2; beta in July, GA in Q3 (tentative); code: https://github.com/couchbaselabs/couchbase-spark-connector; docs until GA: https://github.com/couchbaselabs/couchbase-spark-connector/wiki
  20. Connection Management
  21. Connection Management
  22. Creating RDDs
  23. Persisting RDDs
  24. Spark SQL Integration
  25. Spark Streaming with DCP
  26. Questions?
  27. Thank you.
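
The "How does it work?" and "Spark Benefits" slides describe two related ideas: transformations only build a graph until an action runs, and a lost partition can be recomputed from that graph without redoing the whole job. The sketch below is a toy, single-process model of those two ideas in plain Python. It is not Spark's API and not the Couchbase connector: `ToyRDD`, `lineage`, and `compute_partition` are hypothetical names invented for illustration, and only narrow transformations (`map`, `filter`) are modelled, so shuffles such as `join` or `groupBy` and the split into stages are out of scope.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not Spark's real class)."""

    def __init__(self, partitions=None, parent=None, fn=None, op="source"):
        self._source = partitions  # list of partitions, set only on the root
        self.parent = parent       # upstream ToyRDD, None for the root
        self.fn = fn               # function applied to one whole partition
        self.op = op               # label used when printing the lineage

    # Transformations: record a new node in the graph, compute nothing.
    def map(self, f):
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part], op="map")

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)],
                      op="filter")

    def lineage(self):
        """Return the chain of operations from the source to this node."""
        ops, node = [], self
        while node is not None:
            ops.append(node.op)
            node = node.parent
        return list(reversed(ops))

    def num_partitions(self):
        node = self
        while node.parent is not None:
            node = node.parent
        return len(node._source)

    def compute_partition(self, i):
        """Rebuild partition i alone by replaying the lineage from the source."""
        if self.parent is None:
            return list(self._source[i])
        return self.fn(self.parent.compute_partition(i))

    # Action: this is the point where work actually happens.
    def collect(self):
        out = []
        for i in range(self.num_partitions()):
            out.extend(self.compute_partition(i))
        return out


calls = []  # records every element that the map function really touches


def double(x):
    calls.append(x)
    return x * 2


source = ToyRDD(partitions=[[1, 2, 3], [4, 5, 6]])
result = source.map(double).filter(lambda x: x > 4)

calls_before_action = list(calls)   # still empty: only the graph exists so far
collected = result.collect()        # the action triggers the computation

calls.clear()
recovered = result.compute_partition(1)  # pretend partition 1 was lost
calls_for_recovery = list(calls)         # only partition 1's inputs are touched

print(result.lineage(), collected, recovered, calls_for_recovery)
```

In this model, recovering partition 1 replays `map` and `filter` over that partition's three source elements only, which is the sense in which recomputation is partial. Real Spark adds stages, shuffles, caching, and distributed scheduling on top of the same lineage idea.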
