CouchbasetoHadoop_Matt_Michael_Justin v4

  1. Couchbase to Hadoop at LinkedIn: Kafka is Enabling the Big Data Pipeline
  2. Agenda
     • Define Problem Domain (Justin Michaels | Solution Architect, Couchbase)
     • Use Case at LinkedIn (Michael Kehoe | Site Reliability Engineer, LinkedIn)
     • Supporting Technology Overview and Demo (Matt Ingenthron | Senior Director, Couchbase)
     • Q&A
  3. Lambda Architecture (diagram: stages 1–5 across the DATA, BATCH, SPEED, SERVE, and QUERY layers)
  4. Lambda Architecture for Interactive and Real-Time Applications (diagram: the same five stages, with HADOOP as the batch layer, STORM as the speed layer, and COUCHBASE serving; Kafka producers feed a broker cluster with ordered subscriptions and a spout per topic)
  5. Definitions
     • Hadoop … an open-source framework for distributed storage and distributed processing of very large data sets on commodity hardware
     • Kafka … an append-only write-ahead log that records messages to a persistent store and allows subscribers to read and apply those changes to their own stores in an appropriate time frame
     • Storm … a distributed framework that uses custom-created "spouts" and "bolts" to define information sources and manipulations for processing streaming data
     • Couchbase … an open-source, distributed NoSQL document-oriented database optimized for interactive applications, with an integrated data cache and incremental map-reduce facility
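Kafka's append-only log model described above can be sketched in a few lines of Python. This is an illustrative stand-in, not LinkedIn's or Kafka's actual code: the topic is an append-only list of records, and each subscriber tracks its own offset so it can apply changes to its own store at its own pace.

```python
class CommitLog:
    """Minimal in-memory stand-in for one Kafka topic partition."""

    def __init__(self):
        self._records = []  # append-only: records are never mutated or removed

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset):
        """Return all records at or after `offset`."""
        return self._records[offset:]


class Subscriber:
    """Each consumer keeps its own offset, so slow readers never block fast ones."""

    def __init__(self, log):
        self._log = log
        self._offset = 0
        self.store = []  # the subscriber's own materialized state

    def poll(self):
        batch = self._log.read_from(self._offset)
        self.store.extend(batch)
        self._offset += len(batch)
        return batch


log = CommitLog()
log.append({"event": "page_view", "page": "profile"})
log.append({"event": "click", "item": "job_posting"})

hadoop_loader = Subscriber(log)  # e.g. the batch pipeline
storm_spout = Subscriber(log)    # e.g. the real-time pipeline
hadoop_loader.poll()

log.append({"event": "page_view", "page": "jobs"})
hadoop_loader.poll()
storm_spout.poll()  # catches up independently from offset 0
```

Because every subscriber replays from its own offset, both readers converge on the same state regardless of when they poll.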
  6. (diagram: a batch track and a real-time track; complex event processing feeds a real-time repository, a perpetual store, and an analytical DB, which drive business intelligence, monitoring, dashboards, and chat/voice systems)
  7. (diagram: a tracking and collection tier feeding filter, metrics, and REST components for analysis and visualization)
  8. Use Case at LinkedIn
  9. Michael Kehoe
     • Site Reliability Engineer (SRE) at LinkedIn
     • SRE for Profile & Higher-Education
     • Member of LinkedIn’s CBVT
     • B.E. (Electrical Engineering) from the University of Queensland, Australia
  10. Kafka @ LinkedIn
     • Kafka was created by LinkedIn
     • Kafka is a publish-subscribe system built as a distributed commit log
     • Processes 500+ TB/day (~500 billion messages) at LinkedIn
  11. LinkedIn’s uses of Kafka
     • Monitoring: InGraphs
     • Traditional messaging (pub-sub)
     • Analytics: Who Viewed My Profile, experiment reports, executive reports
     • Building block for distributed (log-based) applications: Pinot, Espresso
  12. Use Case: Kafka to Hadoop (Analytics)
     • LinkedIn tracks data to better understand how members use our products
     • Events such as which page was viewed and which content was clicked on are sent to a Kafka cluster in each data center
     • Some of these events are centrally collected and pushed onto our Hadoop grid for analysis and daily report generation
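The daily-report step above amounts to a batch aggregation over tracked events. A toy sketch (the event shapes here are hypothetical; the real pipeline runs as Hadoop jobs over centrally collected Kafka topics):

```python
from collections import Counter

# Hypothetical tracking events as they might arrive from a Kafka topic.
events = [
    {"type": "page_view", "page": "/profile"},
    {"type": "page_view", "page": "/jobs"},
    {"type": "click", "item": "job-123"},
    {"type": "page_view", "page": "/profile"},
]


def daily_report(events):
    """Count page views per page, as a daily batch report job might."""
    return Counter(e["page"] for e in events if e["type"] == "page_view")


report = daily_report(events)
```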
  13. Couchbase @ LinkedIn
     • About 25 separate services, with one or more clusters in multiple data centers
     • Up to 100 servers in a cluster
     • Single- and multi-tenant clusters
  14. Use Case: Jobs Cluster
     • Read scaling: Couchbase serves ~80k QPS from 24-server cluster(s)
     • Hadoop pre-builds the data by partition
     • Couchbase meets 99th-percentile latency targets
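Pre-building data by partition depends on knowing which partition (vBucket) each key maps to. A sketch of the CRC32-based mapping commonly documented for Couchbase SDKs, assuming the default 1,024 vBuckets (treat the exact bit arithmetic as an assumption to verify against your SDK):

```python
import zlib
from collections import defaultdict

NUM_VBUCKETS = 1024  # Couchbase's default vBucket count


def vbucket_id(key: bytes, num_vbuckets: int = NUM_VBUCKETS) -> int:
    # CRC32 of the key; SDKs commonly use the upper bits of the checksum.
    return ((zlib.crc32(key) >> 16) & 0x7FFF) % num_vbuckets


def group_by_vbucket(keys):
    """Bucket keys by partition, so an offline job can emit one file per vBucket."""
    groups = defaultdict(list)
    for k in keys:
        groups[vbucket_id(k)].append(k)
    return groups


groups = group_by_vbucket([b"job::1", b"job::2", b"member::42"])
```

Grouping offline output by vBucket lets a loader stream each partition's data to the node that owns it, rather than scattering writes across the cluster.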
  15. Hadoop to Couchbase
     • Our primary use case for Hadoop → Couchbase is building (warming) / recovering Couchbase buckets
     • LinkedIn built its own in-house solution to work with our ETL processes, cache-invalidation procedures, etc.
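A bucket-warming loader along these lines might look like the following sketch. A plain dict stands in for the Couchbase bucket, and `warm_bucket` and the record shapes are hypothetical; a real loader would call the SDK's upsert and handle retries:

```python
# Records pre-built offline (e.g. by a Hadoop job) to warm or recover a bucket.
prebuilt_records = {
    "member::1": {"name": "A", "views": 10},
    "member::2": {"name": "B", "views": 7},
    "member::3": {"name": "C", "views": 3},
}


def warm_bucket(bucket, records, batch_size=2):
    """Upsert pre-built records in fixed-size batches; return the count written."""
    written = 0
    items = list(records.items())
    for i in range(0, len(items), batch_size):
        for key, doc in items[i:i + batch_size]:
            bucket[key] = doc  # stand-in for bucket.upsert(key, doc)
            written += 1
    return written


bucket = {}
count = warm_bucket(bucket, prebuilt_records)
```

Batching matters in practice because a warming load can be millions of documents; the batch size is where a real loader would apply throttling so rebuild traffic does not starve live reads.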

Editor's Notes

  • Note: Remove the logos from the animation and speed up the build.
    Distributed user communities relying on interactive applications require distributed systems. As a result, data is created in a variety of forms and places, and as the complexity of the problems to be solved increases, applications demand a variety of development environments for tackling different problems. Reliably storing, providing access to, and analyzing this data landscape leads to the Polyglot Persistence of data.
  • Users and consumers of information increasingly demand always-on, low-latency access to their data, while businesses need a framework for understanding what is happening in real time and for managing Polyglot Persistence. The Lambda Architecture, a conceptual framework for generic data processing coined by Nathan Marz, evolved out of work at Twitter. In a way, the architecture is an extended event-sourced system, but it aims to accommodate streaming data at large scale.

    1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
    2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
    3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
    4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
    5. Any incoming query can be answered by merging results from batch views and real-time views.
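Steps 1–5 above can be condensed into a toy query-time merge (illustrative numbers only, not real metrics):

```python
# Toy lambda-architecture merge: the batch view is pre-computed over the
# master dataset, the real-time view covers only events since the last
# batch run, and a query merges the two (step 5).
batch_view = {"profile": 100, "jobs": 40}  # batch layer: stale but complete
realtime_view = {"profile": 3, "feed": 2}  # speed layer: fresh but partial


def query(page):
    """Answer a page-view-count query by merging batch and real-time views."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)
```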
  • Hadoop is engineered for storage and analysis.
    It can store petabytes of data, and it can be deployed to thousands of servers. It started with map/reduce, then added Hive. Today, we see efforts like Impala and Drill along with Hortonworks’ Stinger Initiative and Tez. Some Hadoop distributions bundle Storm and/or Spark. The analytical capabilities of Hadoop continue to evolve and improve. However, it is not well suited to operational workloads: it is not intended to serve as a backend for enterprise, mobile, or web applications, or to provide interactive data access.

  • The data generated by users is published to Apache Kafka.
    Next, it’s pulled into Apache Storm for real time analysis and processing as well as into Hadoop.
    Finally, Storm writes the data to Couchbase Server for real-time access by LivePerson agents while the data in Hadoop is eventually accessed via HP Vertica and MicroStrategy for offline business intelligence and analysis.
  • The data is first collected by a tracking and collection service. Next, Storm pulls the data in for filtering, enrichment, and statistical analysis. The raw data is written to one Couchbase Server cluster while the processed data is written to a separate Couchbase Server cluster. The processed data is accessed by a front end for visualization and analysis. In addition, the raw data is copied from Couchbase Server to Hadoop, where it is combined with additional data, and the whole is moved into HBase for ad hoc analysis. PayPal was able to handle both the volume and the velocity of data and to meet both operational and analytical requirements. They relied on data capture, stream processing, NoSQL, and Hadoop to do so.