
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couchbase


Abstract:-
Tracking user events as they happen can challenge anyone providing real-time user interaction. It can demand both huge scale and a lot of processing to support dynamic adjustment to targeting products and services. As the operational data store, Couchbase data services are capable of processing tens of millions of updates a day. By streaming through systems such as Apache Spark and Kafka into Hadoop, information about these key events can be turned into deeper knowledge. We will review Lambda architectures deployed at sites like PayPal, LivePerson and LinkedIn that leverage a Couchbase Data Pipeline.

Bio:-

Justin Michaels. With over 20 years of experience in deploying mission-critical systems, Justin Michaels' background covers capacity planning, architecture and industry vertical experience. Justin brings his passion for architecting, implementing and improving Couchbase to the community as a Solution Architect. His expertise covers both conventional application platforms and distributed data management systems. He regularly engages with existing and new Couchbase customers in performance reviews, architecture planning and best practice guidance.

Published in: Technology


  1. Couchbase Data Pipeline: Stream your Operational Data w/ Apache Spark & Kafka into Hadoop using Couchbase
  2. Agenda (©2016 Couchbase Inc.)
     • Couchbase Overview
     • Couchbase Data Pipeline
     • Kafka – Demo: https://github.com/couchbase/couchbase-kafka-connector
     • Spark – Demo: https://github.com/justinmichaels006/CouchbaseSpark
     • Q&A
  3. Couchbase Overview
     Couchbase is the company behind Couchbase Server & Couchbase Mobile
     • Open source JSON database
     • Founded 2010
     • 400+ enterprise customers globally
     Some of our customers: [not captured in this transcript]
     Couchbase Server can be deployed as: Document database, Key-value store, Distributed cache
  4. Couchbase Overview
     • The first NoSQL database that enables you to develop with agility and operate at any scale.
     Managed Cache, Key-Value Store, Document Database, Embedded Database, Sync Management
  5. Couchbase Architecture
     • Data Service – builds and maintains Distributed secondary indexes (MapReduce Views)
     • Indexing Engine – builds and maintains Global Secondary Indexes
     • Query Engine – plans, coordinates, and executes queries against either Global or Distributed indexes
     • Cluster Manager – configuration, heartbeat, statistics, RESTful Management interface
  6. Storing And Retrieving Documents
  7. Online Linear Scalability
     [Diagram: Couchbase Servers 1 to 5, each holding ACTIVE and REPLICA shards (shards 1 to 9 shown); label: READ/WRITE/UPDATE]
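The shard diagram on slide 7 can be illustrated with a minimal sketch of how a document key might be routed to a shard and then to the server holding the active copy. Everything here is an assumption for illustration: the nine-shard count comes from the diagram, the CRC32-modulo hash and round-robin placement are simplifications, and the function names are hypothetical. Couchbase's real key hashing and rebalance logic differ in detail.

```python
import zlib

NUM_SHARDS = 9  # assumption: the slide's diagram shows shards 1 to 9
SERVERS = ["Couchbase Server 1", "Couchbase Server 2", "Couchbase Server 3"]


def shard_for_key(key, num_shards=NUM_SHARDS):
    """Hash a document key to a shard number in the range 0..num_shards-1."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def build_shard_map(servers, num_shards=NUM_SHARDS):
    """Place each shard's active copy round-robin, with the replica on the next server."""
    shard_map = {}
    for shard in range(num_shards):
        active = servers[shard % len(servers)]
        replica = servers[(shard + 1) % len(servers)]
        shard_map[shard] = {"active": active, "replica": replica}
    return shard_map


def server_for_key(key, shard_map, num_shards=NUM_SHARDS):
    """Reads, writes, and updates go to the server holding the active shard."""
    return shard_map[shard_for_key(key, num_shards)]["active"]


# Three servers first, then five: the key-to-shard hash does not change,
# only the shard-to-server map is rebuilt.
three_node_map = build_shard_map(SERVERS)
five_node_map = build_shard_map(SERVERS + ["Couchbase Server 4", "Couchbase Server 5"])
print(server_for_key("user::42", three_node_map))
print(server_for_key("user::42", five_node_map))
```

The design point of the sketch is the two-step lookup: because keys hash to shards rather than directly to servers, adding servers only requires moving whole shards.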
  8. Cross Datacenter Replication (XDCR): Unidirectional or Bidirectional Replication
     Unidirectional
     • Hot spare / Disaster Recovery
     • Development/Testing copies
     • Connectors (Solr, Elasticsearch)
     • Integrate to custom consumer
     Bidirectional
     • Multiple Active Masters
     • Disaster Recovery
     • Datacenter Locality
  9. Couchbase Data Pipeline
  10. DCP
      • Database Change Protocol
        – Internal standard for handling changes since Couchbase Server 3.x
        – Clients: Intra-Cluster Replication, Indexing, XDCR
      • Mutation
        – Event raised when a document is created, updated, or deleted
        – Each mutation that occurs in a vBucket has a sequence number
      • Core of the 2.x Java SDK
        – Can consume DCP streams
        – Important: API not yet exposed, but used to implement the Connectors provided by Couchbase
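The mutation and sequence-number idea on slide 10 can be sketched as a toy model. This is not the DCP wire protocol or any Couchbase SDK API; `ToyBucket`, `Mutation`, and `stream` are hypothetical names used only to show how per-vBucket sequence numbers let a consumer resume from where it left off.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutation:
    vbucket: int
    seqno: int
    key: str
    op: str  # "create", "update" or "delete"


class ToyBucket:
    """Hypothetical stand-in for a bucket that records every change per vBucket."""

    def __init__(self):
        self._log = defaultdict(list)  # vbucket -> ordered list of mutations

    def mutate(self, vbucket, key, op):
        # Sequence numbers increase independently inside each vBucket.
        seqno = len(self._log[vbucket]) + 1
        mutation = Mutation(vbucket, seqno, key, op)
        self._log[vbucket].append(mutation)
        return mutation

    def stream(self, vbucket, after_seqno=0):
        """Yield the mutations of one vBucket whose seqno is greater than after_seqno."""
        for mutation in self._log[vbucket]:
            if mutation.seqno > after_seqno:
                yield mutation


bucket = ToyBucket()
bucket.mutate(0, "user::1", "create")
bucket.mutate(0, "user::1", "update")
bucket.mutate(1, "user::2", "create")

# A consumer that already processed seqno 1 of vBucket 0 resumes after it.
for m in bucket.stream(0, after_seqno=1):
    print(m)
```

A connector built this way only needs to remember one number per vBucket to restart without reprocessing earlier changes.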
  11. Couchbase Data Pipeline
      • Couchbase is primarily an online operational NoSQL datastore: low latency, scalable
      • Data Source – Example: pulling user profiles into Hadoop for deep analytics
      • Data Sink – Example: training machine learning models that are then cached / served from Couchbase
      [Diagram labels: NoSQL, Spark, Hadoop; Web, Mobile, IoT; Analytics, Discovery, Prediction]
  12. Couchbase Data Pipeline (Couchbase = OPERATIONAL; Spark and Hadoop (Hive) = ANALYTICAL)
      Attribute | Couchbase | Spark | Hadoop (Hive)
      Use cases | Operational; Web / Mobile | Analytics; Machine Learning | Analytics; Machine Learning
      Processing mode | Online; Ad Hoc | Ad Hoc; Batch; Streaming (+/-) | Batch; Ad Hoc (+/-)
      Low latency = | < 1ms ops | Seconds | Minutes
      Performance | Highly predictable | Variable | Variable
      Users are typically… | Millions of customers | 100's of analysts or data scientists | 100's of analysts or data scientists
      Memory vs. disk | Memory-centric | Memory-centric | Disk-centric
      Big data = | 10s of Terabytes | Petabytes (?) | Petabytes
  13. Couchbase Data Pipeline: Operational Velocity, Analytical Volume, Analytical Velocity
  14. Couchbase Data Pipeline
      [Diagram labels: Batch Layer, Serving Layer, Speed Layer; New Data Stream; All Data; Precompute Views (Map Reduce); Process Stream; Incremental Views; Partial Aggregate (three times); Real-Time Data; Batch Recompute; Batch Views; Real-Time Views; Real-Time Increment; Merge; Merged View; Couchbase Hadoop Connector (Sqoop)]
  15. Couchbase Data Pipeline
      [Same diagram as slide 14, annotated with: Stream / Data Ingestion; Store Incremental Data / Stream processing; Serving merged results / responses]
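The batch, speed, and serving layers named in the diagrams on slides 14 and 15 can be sketched in a few lines. This is a minimal illustration that assumes events are plain dictionaries with a `user` field and that the view is a per-user event count; the function names are hypothetical and no Couchbase, Spark, or Hadoop API is involved.

```python
from collections import Counter


def batch_view(all_events):
    """Batch layer: recompute the view from the full event history."""
    return Counter(event["user"] for event in all_events)


def realtime_view(new_events):
    """Speed layer: incremental view over events that arrived since the last batch run."""
    return Counter(event["user"] for event in new_events)


def merged_view(batch, realtime):
    """Serving layer: answer queries from the merge of both views."""
    return batch + realtime


history = [{"user": "u1"}, {"user": "u2"}, {"user": "u1"}]
since_last_batch = [{"user": "u1"}, {"user": "u3"}]

view = merged_view(batch_view(history), realtime_view(since_last_batch))
print(view["u1"], view["u2"], view["u3"])
```

When the next batch recompute finishes, the real-time increments it now covers can be discarded, which keeps the speed layer small.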
  16. Couchbase Connectors
      [Diagram labels: up to 10^10 application users; NoSQL Database; 10^1 - 10^2 data scientists / engineers; Kafka, Hadoop, Spark, Elasticsearch, Storm; DCP, XDCR, Sqoop]
  17. [Diagram labels: Complex Event Processing; Real Time Repository; Perpetual Store; Analytical DB; Business Intelligence; Monitoring; Chat/Voice System; Batch Track; Real-Time Track; Dashboard]
  18. PayPal Use Case (© 2015 PayPal Inc. All rights reserved. Confidential and proprietary.)
      http://developer.couchbase.com/documentation/server/4.1/connectors/kafka-2.0/kafka-intro.html
      https://github.com/paypal/couchbasekafka
      [Diagram labels: DCP Stream; Couchbase Kafka Adapter {DCP Client + Kafka Producer}; Camus, MR Jobs; steps [1] to [7]]
  19. Kafka Connector
  20. Kafka
      • Publish-Subscribe System
        – Model in which Publishers distribute information to multiple Subscribers, which must sign up to receive that data
      • Message Queue System
        – Messages are put (stored) in a queue until the recipient can retrieve them
      • Specifics
        – Commit log based
        – Distributed & partitioned
        – Failover mechanism
  21. Kafka
      • ZooKeeper
        – Coordination between Kafka Brokers
        – Stores Coordination Data, Status Information, Configurations
      • Broker
        – One or more Services that process messages
        – Can store messages
        – Failover: Leader vs. Follower
      • Topic
        – Distributed/partitioned message queue
      • Consumer
        – Applications/processes/threads that are subscribed to the topic
        – Can be grouped in order to process messages in parallel
      • Producer
        – Publishes data/messages to the topic
  22. Kafka
      • Data broker w/ publish / subscribe system
      • Decouples producers of data from consumers
      • Massively scalable
      • Messages queued until the recipient can retrieve them
      [Diagram labels: Producers; Kafka; Consumers; HDFS]
      Couchbase can be consumer and producer
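The Kafka concepts on slides 20 to 22 (commit log, partitions, consumer groups, messages kept until the recipient retrieves them) can be sketched with a toy in-memory topic. This is an illustration only and not the Kafka client API; `ToyTopic`, `publish`, and `poll` are hypothetical names, and the CRC32-modulo partitioner is an assumption.

```python
import zlib


class ToyTopic:
    """Hypothetical in-memory topic: one append-only log per partition."""

    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]
        self.offsets = {}  # (consumer group, partition) -> next offset to read

    def publish(self, key, value):
        """Append a message; the same key always lands in the same partition."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1

    def poll(self, group, partition, max_messages=10):
        """Return the next unread messages for a consumer group and advance its offset."""
        start = self.offsets.get((group, partition), 0)
        batch = self.partitions[partition][start:start + max_messages]
        self.offsets[(group, partition)] = start + len(batch)
        return batch


topic = ToyTopic(num_partitions=2)
partition, _ = topic.publish("user::1", "login")
topic.publish("user::1", "click")

# Two independent consumer groups each see every message in the partition.
print(topic.poll("hdfs-loader", partition))
print(topic.poll("alerting", partition))
```

Because messages stay in the log and each group tracks its own offset, producers never wait for consumers, which is the decoupling the slides describe.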
  23. Couchbase/Kafka Use Cases
      • Polyglot Processing
        – Activities
        – Events
        – Monitoring Metrics
        – Sensor Data
      • Typical Use Cases
        – Messaging: Decouple data processing from data producers
        – Log Aggregation: A log as stream of messages
        – Stream Processing: Consume data from one topic and put the filtered/transformed data into another one
        – Click Stream Analysis: Page views/searches as real-time publish-subscribe feeds
        – Real-time Data Integration: Extract from Couchbase, transform and load data in real time
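The stream processing use case on slide 23 (consume from one topic, filter and transform, write to another) reduces to a small generator. This sketch uses plain Python lists in place of topics; `process_stream` and the event fields are hypothetical and chosen only for illustration.

```python
def process_stream(source_messages, keep, transform):
    """Yield transformed copies of the messages that pass the filter."""
    for message in source_messages:
        if keep(message):
            yield transform(message)


# Assumption: the source topic carries click-stream events as dictionaries.
source_topic = [
    {"type": "page_view", "url": "/home"},
    {"type": "search", "query": "couchbase"},
    {"type": "page_view", "url": "/pricing"},
]

# Keep only page views and reduce each one to its URL for the target topic.
target_topic = list(
    process_stream(
        source_topic,
        keep=lambda m: m["type"] == "page_view",
        transform=lambda m: m["url"],
    )
)
print(target_topic)
```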
  24. Couchbase Kafka Connector
      Available Now: 2.0 GA
      • Kafka Producer or Consumer
      • Stream events
      • Filters
      • Transform events
      • Sample Producer & Consumer
      Monthly Releases
      • Kafka Connect (Apache Kafka 0.9)
      • Merge code for Storm connector
      • ???
      Code: https://github.com/couchbase/couchbase-kafka-connector/
      Issues: https://issues.couchbase.com/projects/KAFKAC
      Docs: http://developer.couchbase.com/documentation/server/4.1/connectors/kafka-1.2/kafka-intro.html
  25. Learn More - Couchbase Kafka Connector
      Confluent's Ewen Cheslack-Postava at Couchbase Connect 2015
      • https://youtu.be/fFPVwYKUTHs
      Couchbase and Kafka - Up and Running in 10 Minutes
      • http://blog.couchbase.com/2015/november/kafka-and-couchbase-up-and-running-in-10-minutes
      Product docs
      • http://developer.couchbase.com/documentation/server/4.1/connectors/kafka-1.2/kafka-intro.html
      Avalon Consulting blog and Github repo
      • http://blogs.avalonconsult.com/blog/big-data/purchase-transaction-alerting-with-couchbase-and-kafka/
      • https://github.com/Avalon-Consulting-LLC/couchbase-kafka
  26. Couchbase Spark Connector
  27. Spark
      [Diagram labels: DCP, KV, N1QL, Views]
      Source: Databricks https://www.brighttalk.com/webcast/12891/196891
  28. Spark
      • Fast and general engine for big data processing with libraries for advanced analytics
      Spark Core:
      • task scheduling
      • memory management
      • fault recovery
      • interacting with storage systems
  29. Spark
      • Resilient Distributed Dataset: Core Spark data abstraction
        – Distributed collection of elements
        – Read-only (immutable)
        – Fault tolerant: Can recover from loss of a partition
          • Re-computed, not stored
      • Operations
        – Transformation: Lazy operation creating new RDD
          • e.g. map(), filter()
        – Action: Return a result or save it
          • e.g. take(), save()
  30. Spark
      Example:
      • Create: Read a log file, sc.textFile("server.log")
      • Filter: Keep lines starting with "ERROR"
      • Filter: Keep lines containing "system5"
      • Action: Write result to storage
      RDD lifecycle:
      • RDD is created by either:
        – Loading an external dataset or RDD
      • RDD is transformed:
        – Result: A new RDD
        – Sequence of transformations
      • Data is eventually extracted
        – By an action on an RDD
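The lazy transformation and action model on slides 29 and 30 can be mimicked in plain Python. This is not Spark and not the RDD API: `LazyDataset` is a hypothetical single-process stand-in whose only purpose is to show that transformations merely record steps, while an action triggers the actual read, and that results are recomputed from the source rather than stored.

```python
from itertools import islice


class LazyDataset:
    """Hypothetical, single-process mimic of an RDD's lazy evaluation."""

    def __init__(self, source, steps=()):
        self._source = source      # zero-argument callable that returns an iterable
        self._steps = tuple(steps)  # recorded transformations, not yet applied

    def map(self, fn):
        # Transformation: returns a new dataset, reads nothing.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: returns a new dataset, reads nothing.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        items = iter(self._source())
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def take(self, n):
        # Action: evaluates the pipeline and returns the first n results.
        return list(islice(self._run(), n))

    def collect(self):
        # Action: evaluates the pipeline and returns all results.
        return list(self._run())


def read_log():
    # Assumption: stands in for reading "server.log" from storage.
    return ["ERROR system5 disk full", "INFO system5 ok", "ERROR system2 timeout"]


errors = (
    LazyDataset(read_log)
    .filter(lambda line: line.startswith("ERROR"))
    .filter(lambda line: "system5" in line)
)
print(errors.collect())
```

Calling `collect()` a second time runs `read_log` again, which mirrors the "re-computed, not stored" recovery model on slide 29.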
  31. DataFrames (SparkSQL)
      • Distributed collection of data organized in named columns
      • DataFrame = "RDD + Schema"
      • Perform SQL Queries on top of your data
      From https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
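The "RDD + Schema" idea on slide 31 can be illustrated without Spark: once plain rows are paired with column names, they can be queried with SQL. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for Spark SQL; the table name, columns, and data are invented for illustration.

```python
import sqlite3

# The "RDD": untyped rows with no column names attached.
rows = [("alice", "US", 34), ("bob", "DE", 27), ("carol", "US", 41)]

# The "Schema": named columns that give the rows structure.
schema = ("name", "country", "age")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users ({})".format(", ".join(schema)))
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)

# With a schema in place, the data can be queried by column name.
avg_age_by_country = conn.execute(
    "SELECT country, AVG(age) FROM users GROUP BY country ORDER BY country"
).fetchall()
print(avg_age_by_country)
```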
  32. Datasets
      • Typesafe programming on top of DataFrames
      • Initial support in Spark 1.6, likely to be extended in 2.0
      • Higher performance & less memory usage
      • Encoding/Decoding for semi-structured data
      https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html
  33. Couchbase Spark Connector
      [Diagram labels: application users; DCP; data scientists & data engineers]
      Features
      • Create RDDs from KV, N1QL, Views
      • Create DStreams from DCP feeds
      • Persist RDDs and DStreams
      • Support for DataFrames and SparkSQL
  34. Couchbase Spark Use Case: Data Analysis
      • Query across data in many systems using one language & runtime
        – Separate Couchbase clusters for workload isolation
        – Results streamed back as needed to support applications
      [Diagram labels: Operational Data Store; XDCR; RDBMS; s3; hdfs; Elastic]
  35. Couchbase Spark Use Case: Machine Learning
      • Data scientists train machine learning models
        – Load results into Couchbase so end users can interact with them online
        – Recommendations for content and products, fraud detection or spam filter
      [Diagram labels: DCP; Hadoop; Machine Learning Models; Data Warehouse; Historical Data]
  36. Couchbase Spark Connector 1.2
      Upcoming Release (planned): 1.2
      • Spark 1.6 Support (including Datasets)
      • Bugfixes
      • Enhanced Java APIs
      • Updated DCP Functionality
      github.com/couchbaselabs/couchbase-spark-connector
      https://issues.couchbase.com/projects/SPARKC
  37. Learn More - Couchbase Spark Connector
      Github Repo
      • https://github.com/couchbaselabs/couchbase-spark-connector
      Spark with Couchbase to Electrify your Data Processing
      • https://youtu.be/sBnAf7gAfLc
      Docs
      • http://developer.couchbase.com/documentation/server/current/connectors/spark-1.0/spark-intro.html
      Avalon Consulting blog and Github repo (Market Basket Analysis)
      • http://blogs.avalonconsult.com/blog/big-data/combining-operational-and-analytical-big-data-using-couchbase-and-spark-a-market-basket-analysis-example/
      • https://github.com/Avalon-Consulting-LLC/couchbase-spark-mba
  38. Q&A justin@couchbase.com
  39. Dev Guide
  40. Anatomy of a Spark Application
      [Diagram labels: Driver Program; SparkContext; Cluster Manager; Spark Worker; Executor; Task; Couchbase Data Node (two shown)]
  41. Connection Management
  42. Connection Management
  43. Creating RDDs
  44. Persisting RDDs
  45. RDD N1QL Query
  46. Spark SQL - Schema
  47. Spark SQL – Dataframe Query
  48. Demo of Dataset (Spark 1.6)
  49. Demo of Dataset (Spark 1.6)
  50. Demo of Dataset (Spark 1.6)
  51. Spark Streaming with DCP
