CouchbasetoHadoop_Matt_Michael_Justin v4

  1. Couchbase to Hadoop at LinkedIn: Kafka is Enabling the Big Data Pipeline
  2. Agenda
       • Define Problem Domain: Justin Michaels | Solution Architect, Couchbase
       • Use case at LinkedIn: Michael Kehoe | Site Reliability Engineer, LinkedIn
       • Supporting Technology Overview and Demo: Matt Ingenthron | Senior Director, Couchbase
       • Q&A
  3. Lambda Architecture (diagram stages: 1 DATA, 2 BATCH, 3 SPEED, 4 SERVE, 5 QUERY)
  4. Lambda Architecture: Interactive and Real Time Applications (diagram stages: 1 DATA, 2 BATCH, 3 SPEED, 4 SERVE, 5 QUERY; component labels: HADOOP, COUCHBASE, STORM, COUCHBASE, Broker Cluster, Spout for Topic, Kafka Producers, Ordered Subscriptions)
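
The five stages named on the Lambda Architecture slides can be illustrated with a minimal in-memory sketch. It assumes a page-view counting workload; the function names (`batch_view`, `speed_view`, `query`) and the event shape are hypothetical and are not taken from the talk. Plain Python counters stand in for Hadoop, Storm, and Couchbase.

```python
from collections import Counter

def batch_view(master_events):
    """Batch layer: recompute counts from the full event history."""
    return Counter(event["page"] for event in master_events)

def speed_view(recent_events):
    """Speed layer: count only events the last batch run has not seen yet."""
    return Counter(event["page"] for event in recent_events)

def query(page, batch, speed):
    """Serve/query step: merge the batch and real-time views for one key."""
    return batch[page] + speed[page]

# Hypothetical data: three events already processed in batch, one still in flight.
history = [{"page": "/jobs"}, {"page": "/jobs"}, {"page": "/profile"}]
recent = [{"page": "/jobs"}]

batch = batch_view(history)
speed = speed_view(recent)
print(query("/jobs", batch, speed))
```

In this sketch, the query reads both views and adds them, so results stay current between batch runs without recomputing the full history on every request.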
  5. Supporting technologies
       • Hadoop: an open-source framework for distributed storage and distributed processing of very large data sets on commodity hardware
       • Kafka: an append-only write-ahead log that records messages to a persistent store and allows subscribers to read and apply these changes to their own stores in an appropriate time frame
       • Storm: a distributed framework that uses custom-created "spouts" and "bolts" to define information sources and manipulations for processing streaming data
       • Couchbase: an open-source, distributed, document-oriented NoSQL database that is optimized for interactive applications, with an integrated data cache and an incremental map-reduce facility
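
The slide's description of Kafka as an append-only log that subscribers read at their own pace can be sketched in a few lines. This is a toy in-memory model under stated assumptions, not Kafka's API: the class names `AppendOnlyLog` and `Subscriber` and their methods are hypothetical, and persistence, partitioning, and replication are left out.

```python
class AppendOnlyLog:
    """Toy log: records are only ever appended, never updated in place."""

    def __init__(self):
        self._records = []

    def append(self, message):
        """Append a message and return its offset (position in the log)."""
        self._records.append(message)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records messages starting at offset."""
        return self._records[offset:offset + max_records]


class Subscriber:
    """Tracks its own offset and applies messages to its own store."""

    def __init__(self, log):
        self.log = log
        self.offset = 0
        self.store = []

    def poll(self, max_records=10):
        """Read the next batch, apply it locally, and advance the offset."""
        batch = self.log.read(self.offset, max_records)
        self.store.extend(batch)
        self.offset += len(batch)
        return len(batch)


log = AppendOnlyLog()
for message in ["view:/jobs", "click:/jobs", "view:/profile"]:
    log.append(message)

fast, slow = Subscriber(log), Subscriber(log)
fast.poll()               # reads everything available
slow.poll(max_records=1)  # reads a single record and can catch up later
print(fast.offset, slow.offset)
```

Because each subscriber owns its offset, a slow consumer does not hold back a fast one, which is the property the slide points to with "in an appropriate time frame".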
  6. Diagram labels: COMPLEX EVENT PROCESSING, Real Time, REPOSITORY, PERPETUAL STORE, ANALYTICAL DB, BUSINESS INTELLIGENCE, MONITORING, CHAT/VOICE SYSTEM, BATCH TRACK, REAL-TIME TRACK, DASHBOARD
  7. Diagram labels: TRACKING and COLLECTION, ANALYSIS AND VISUALIZATION, REST, FILTER, METRICS
  8. Use Case at LinkedIn
  9. Michael Kehoe
       • Site Reliability Engineer (SRE) at LinkedIn
       • SRE for Profile & Higher-Education
       • Member of LinkedIn’s CBVT
       • B.E. (Electrical Engineering) from the University of Queensland, Australia
  10. Kafka @ LinkedIn
       • Kafka was created by LinkedIn
       • Kafka is a publish-subscribe system built as a distributed commit log
       • Processes 500+ TB/day (~500 billion messages) @ LinkedIn
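
The two throughput figures on this slide imply an average message size and a sustained rate. The arithmetic below is a back-of-the-envelope sketch that assumes decimal terabytes (10^12 bytes), takes both figures as daily totals at their stated lower bounds, and assumes traffic is spread evenly over the day; none of the derived numbers appear in the talk.

```python
BYTES_PER_TB = 10**12          # assumption: decimal terabytes
SECONDS_PER_DAY = 24 * 60 * 60

daily_bytes = 500 * BYTES_PER_TB    # "500+ TB/day" from the slide
daily_messages = 500 * 10**9        # "~500 billion messages" from the slide

avg_message_bytes = daily_bytes / daily_messages
messages_per_second = daily_messages / SECONDS_PER_DAY
bytes_per_second = daily_bytes / SECONDS_PER_DAY

print(f"average message size: {avg_message_bytes:.0f} bytes")
print(f"average rate: {messages_per_second / 1e6:.2f} million messages/s")
print(f"average volume: {bytes_per_second / 1e9:.2f} GB/s")
```

Under these assumptions the average message is about 1 KB, at roughly 5.8 million messages per second; real peak rates would be higher than this daily average.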
  11. LinkedIn’s uses of Kafka
       • Monitoring
           • InGraphs
       • Traditional Messaging (Pub-Sub)
       • Analytics
           • Who Viewed my Profile
           • Experiment reports
           • Executive reports
       • Building block for (log) distributed applications
           • Pinot
           • Espresso
  12. Use Case: Kafka to Hadoop (Analytics)
       • LinkedIn tracks data to better understand how members use our products
       • Information such as which page was viewed and which content was clicked is sent into a Kafka cluster in each data center
       • Some of these events are centrally collected and pushed onto our Hadoop grid for analysis and daily report generation
  13. Couchbase @ LinkedIn
       • About 25 separate services with one or more clusters in multiple data centers
       • Up to 100 servers in a cluster
       • Single and multi-tenant clusters
  14. Use Case: Jobs Cluster
       • Read scaling: Couchbase at ~80k QPS on 24-server cluster(s)
       • Hadoop to pre-build data by partition
       • Couchbase 99th percentile latencies
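
For readers unfamiliar with the "99th percentile latencies" metric on this slide, here is a minimal sketch of how such a figure can be computed from raw samples. It uses the nearest-rank method, which is one common definition and not necessarily the one LinkedIn's monitoring uses; the sample values are made up for illustration.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in milliseconds: mostly fast, with two slow outliers.
latencies_ms = [1.0] * 98 + [5.0, 40.0]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(p50, p99)
```

The 99th percentile ignores only the slowest 1% of requests, so it exposes tail latency that an average or median would hide.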
  15. Hadoop to Couchbase
       • Our primary use case for Hadoop → Couchbase is building (warming) and recovering Couchbase buckets
       • LinkedIn built its own in-house solution to work with our ETL processes, cache invalidation procedures, etc.
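
The idea of pre-building data by partition and then warming a bucket from it can be sketched as follows. This is not LinkedIn's in-house solution, which the talk does not describe in detail: the partition count, the CRC32-based `partition_for` function, and the use of a plain dict in place of a Couchbase bucket are all assumptions made for illustration, and the hash shown is not Couchbase's own key-to-partition mapping.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical; chosen small for readability

def partition_for(key):
    """Stable, illustrative hash partitioning (not Couchbase's actual mapping)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

def prebuild_partitions(records):
    """Offline step: group key/value pairs by target partition."""
    partitions = {p: [] for p in range(NUM_PARTITIONS)}
    for key, value in records:
        partitions[partition_for(key)].append((key, value))
    return partitions

def warm_bucket(bucket, partitions, batch_size=2):
    """Load step: write each partition into the bucket in small batches."""
    loaded = 0
    for records in partitions.values():
        for start in range(0, len(records), batch_size):
            batch = records[start:start + batch_size]
            bucket.update(batch)  # stand-in for a bulk write to the real store
            loaded += len(batch)
    return loaded

# Hypothetical records; a dict stands in for the bucket being warmed.
records = [(f"job::{i}", {"id": i}) for i in range(5)]
bucket = {}
loaded = warm_bucket(bucket, prebuild_partitions(records))
print(loaded, len(bucket))
```

Grouping by partition ahead of time means each batch of writes targets one partition, and an empty or lost bucket can be rebuilt by replaying the same pre-built files.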