Using Spark in a Couchbase Environment: Couchbase Connect 2015

Many enterprises looking to leverage their data with more advanced analytics turn to Spark as a standard solution. Leveraging Couchbase data with Spark is also simpler than it looks. Databricks, founded by the creators of Spark, will present how they see Spark evolving to address new use cases, and the simple mechanisms that make it possible to use Spark with Couchbase data immediately.


  1. 1. Spark @ Couchbase Connect. John Tripier, John@databricks.com; Michael Nitschinger, Couchbase. June 2015
  2. 2. What is Apache Spark? A fast and general engine for big data processing, with libraries for advanced analytics. The most active open source project in big data.
  3. 3. About Databricks: Founded in 2013 by the creators of Spark. Most active organization contributing to Spark: 3/4 of the code in 2014. Created Databricks Cloud, a cloud-based big data platform on top of Spark to make big data simple.
  4. 4. 2014: an Amazing Year for Spark. Total contributors: 150 => 500. Lines of code: 190K => 370K. 500+ active production deployments.
  5. 5. Most active project in big data. [Chart: Contributors per Month to Spark, 2011 to 2015]
  6. 6. On-Disk Sort Record: time to sort 100 TB. 2013 record: Hadoop, 2100 machines, 72 minutes. 2014 record: Spark, 207 machines, 23 minutes. Source: Daytona GraySort benchmark, sortbenchmark.org. 2015: Project Tungsten, memory and CPU for Spark applications.
  7. 7. Ecosystem: Distributions, Applications
  8. 8. Spark platform: Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX
  9. 9. New Directions in 2015. Data Science: high-level interfaces similar to single-machine tools. Platform Interfaces: plug in data sources and algorithms.
  10. 10. DataFrames: A distributed collection of data grouped into named columns. Similar API to data frames in R and Pandas. Automatically optimized via Spark SQL. Makes it faster and easier for Spark developers to work with structured data by providing simplified methods for filtering, aggregating, and projecting over large datasets. [Chart: running time for Python, Scala, and DataFrame]
  11. 11. Machine Learning Pipelines: High-level API inspired by SciKit-Learn. Featurization, evaluation, parameter search. Code: tokenizer = Tokenizer(); tf = HashingTF(numFeatures=1000); lr = LogisticRegression(); pipe = Pipeline(stages=[tokenizer, tf, lr]); model = pipe.fit(df). [Diagram: DataFrame, tokenizer, TF, LR, model]
  12. 12. R Interface (SparkR): Targeting Spark 1.4 (June). Exposes DataFrames, RDDs, and ML library in R. Code: df = jsonFile("tweets.json"); summarize(group_by(df[df$user == "matei",], "date"), sum("retweets"))
  13. 13. New Directions in 2015. Data Science: high-level interfaces similar to single-machine tools. Platform Interfaces: plug in data sources and algorithms.
  14. 14. External Data Sources: Platform API to plug smart data sources into Spark. Returns DataFrames usable in Spark apps or SQL. Pushes logic into sources. [Diagram labels: Spark, {JSON}]
  15. 15. External Data Sources: Platform API to plug smart data sources into Spark. Returns DataFrames usable in Spark apps or SQL. Pushes logic into sources. Query in Spark: SELECT * FROM mysql_users u JOIN hive_logs h WHERE u.lang = "en". Query pushed to the source: SELECT * FROM users WHERE lang="en". [Diagram labels: Spark, {JSON}]
  16. 16. [Diagram: Spark stack] {JSON}, Data Sources, Spark Core, DataFrames, ML Pipelines, Spark Streaming, Spark SQL, MLlib, GraphX
  17. 17. [Diagram: Spark stack] {JSON}, Data Sources, Spark Core, DataFrames, ML Pipelines, Spark Streaming, Spark SQL, MLlib, GraphX, ?
  18. 18. Spark Packages: community index of third-party packages (spark-packages.org). Example: bin/spark-shell --packages databricks/spark-csv:0.2
  24. 24. Demo
  25. 25. Ecosystem Flexibility: RDBMS, Streams, Web APIs, DCP, KV, N1QL, Views, Batching, Data Archive, OLTP Data
  26. 26. Infrastructure Consolidation
  27. 27. To Learn More: Two free massive open online courses (MOOCs) on Big Data and Spark (http://databricks.com/moocs). Couchbase Spark Package (http://spark-packages.org/?q=couchbase). Try Databricks Cloud (databricks.com). Email me at john@databricks.com
  28. 28. Thank you
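
Slides 14 and 15 say that external data sources "push logic into sources". The following is a minimal plain-Python sketch of that idea, not Spark's Data Sources API and not the Couchbase connector: `ToyUserSource`, `scan`, `rows_shipped`, and the two query helpers are hypothetical names invented for illustration, and only equality filters are modeled. It compares how many rows leave the source when the engine filters on its own side versus when the filter is handed to the source.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Tuple

# A filter is (column, value), meaning "column == value".
# Real engines support richer predicates; equality keeps the sketch short.
EqualsFilter = Tuple[str, Any]
Row = Dict[str, Any]


@dataclass
class ToyUserSource:
    """Hypothetical stand-in for an external store (not a real Spark or Couchbase API)."""

    rows: List[Row]
    rows_shipped: int = 0  # number of rows that left the source on the last scan

    def scan(self, filters: Iterable[EqualsFilter] = ()) -> List[Row]:
        """Return the rows matching every filter, evaluated inside the source."""
        wanted = list(filters)
        result = [r for r in self.rows if all(r.get(c) == v for c, v in wanted)]
        self.rows_shipped = len(result)
        return result


def query_without_pushdown(source: ToyUserSource, column: str, value: Any) -> List[Row]:
    """The engine pulls every row, then filters on its own side."""
    return [r for r in source.scan() if r.get(column) == value]


def query_with_pushdown(source: ToyUserSource, column: str, value: Any) -> List[Row]:
    """The engine hands the filter to the source, so only matching rows are shipped."""
    return source.scan(filters=[(column, value)])


# Made-up sample data mirroring the slide's `lang = "en"` filter.
users = ToyUserSource(rows=[
    {"name": "ana", "lang": "en"},
    {"name": "bo", "lang": "de"},
    {"name": "cy", "lang": "en"},
])

engine_side = query_without_pushdown(users, "lang", "en")
shipped_without_pushdown = users.rows_shipped

source_side = query_with_pushdown(users, "lang", "en")
shipped_with_pushdown = users.rows_shipped

print(shipped_without_pushdown, shipped_with_pushdown)
```

Both helpers return the same rows; the difference is only in how much data crosses the boundary between the source and the engine, which is the benefit the slides attribute to smart data sources.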
