OPEN SOURCE LAMBDA ARCHITECTURE
KAFKA · HADOOP · SAMZA · DRUID
FANGJIN YANG · GIAN MERLINO · DRUID COMMITTERS

Open Source Lambda Architecture with Hadoop, Kafka, Samza and Druid

Hadoop Summit 2015

  1. OPEN SOURCE LAMBDA ARCHITECTURE
     KAFKA · HADOOP · SAMZA · DRUID
     FANGJIN YANG · GIAN MERLINO · DRUID COMMITTERS
  2. OVERVIEW
     ‣ PROBLEM: DEALING WITH EVENT DATA
     ‣ MOTIVATION: EVOLUTION OF A “REAL-TIME” STACK
     ‣ ARCHITECTURE: THE “RAD”-STACK
     ‣ NEXT STEPS: TRY IT OUT FOR YOURSELF
  3. THE PROBLEM
  4. 2013 THE PROBLEM
     ‣ Arbitrary and interactive exploration of time series data
       • Ad-tech, system/app metrics, network/website traffic analysis
     ‣ Multi-tenancy: lots of concurrent users
     ‣ Scalability: 10+ TB/day, ad-hoc queries on trillions of events
     ‣ Recency matters! Real-time analysis
  5. FINDING A SOLUTION
     ‣ Load all your data into Hadoop. Query it. Done!
     ‣ Good job guys, let’s go home
  6. FINDING A SOLUTION
     [Diagram: Event Streams → Hadoop → Insight]
  7. PROBLEMS WITH THE NAIVE SOLUTION
     ‣ MapReduce can handle almost every distributed computing problem
     ‣ MapReduce over your raw data is flexible but slow
     ‣ Hadoop is not optimized for query latency
     ‣ To optimize queries, we need a query layer
  8. FINDING A SOLUTION
     [Diagram: Event Streams → Hadoop (pre-processing and storage) → Query Layer → Insight]
  9. A FASTER QUERY LAYER
  10. MAKE QUERIES FASTER
      ‣ What types of queries to optimize for?
        • Revenue over time broken down by demographic
        • Top publishers by clicks over the last month
        • Number of unique visitors broken down by any dimension
        • Not dumping the entire dataset
        • Not examining individual events
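
One of the query shapes listed on the slide above, "top publishers by clicks over the last month", can be written as a Druid topN query. The following is a minimal sketch that builds the query body as a Python dict. The datasource name `ad_events`, the interval, and the threshold are assumptions for illustration; the field names follow Druid's native JSON query format and are not taken from the slides.

```python
import json


def top_publishers_query(datasource, interval, limit=10):
    """Build a topN query body: publishers ranked by summed clicks."""
    return {
        "queryType": "topN",
        "dataSource": datasource,      # hypothetical datasource name
        "dimension": "publisher",      # dimension to rank
        "metric": "clicks",            # rank by this aggregated metric
        "threshold": limit,            # how many publishers to return
        "granularity": "all",          # one result bucket for the whole interval
        "aggregations": [
            {"type": "longSum", "name": "clicks", "fieldName": "clicks"}
        ],
        "intervals": [interval],       # ISO-8601 start/end
    }


query = top_publishers_query("ad_events", "2011-01-01/2011-02-01")
print(json.dumps(query, indent=2))
```

The body would typically be POSTed to a broker node; the endpoint is left out here because the slides do not show it.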
  11. FINDING A SOLUTION
      [Diagram: Event Streams → Hadoop (pre-processing and storage) → RDBMS → Insight]
  12. FINDING A SOLUTION
      [Diagram: Event Streams → Hadoop (pre-processing and storage) → NoSQL K/V Stores → Insight]
  13. FINDING A SOLUTION
      [Diagram: Event Streams → Hadoop (pre-processing and storage) → Commercial Databases → Insight]
  14. DRUID AS A QUERY LAYER
  15. DRUID
      ‣ Druid project started in 2011, went open source in 2012
      ‣ Designed for low latency ingestion and ad-hoc aggregations
      ‣ Designed for keeping around a lot of history (years are ok)
      ‣ Growing Community
        • ~100 contributors
        • Used in production at numerous large and small organizations
  16. 2014 DRUID IN PRODUCTION
      REALTIME INGESTION
      >500K EVENTS / SECOND AVERAGE
      >1M EVENTS / SECOND PEAK
      10 – 100K EVENTS / SECOND / CORE
  17. DRUID IN PRODUCTION
      QUERY LATENCY (500MS AVERAGE)
      90% < 1S
      95% < 5S
      99% < 10S
      [Figure: Query latency percentiles (90%ile, 95%ile, 99%ile); x-axis: time (Feb 03 to Feb 24); y-axis: query time (seconds); series: datasource a to h]
  18. RAW DATA
      timestamp             publisher          advertiser  gender  country  click  price
      2011-01-01T01:01:35Z  bieberfever.com    google.com  Male    USA      0      0.65
      2011-01-01T01:03:63Z  bieberfever.com    google.com  Male    USA      0      0.62
      2011-01-01T01:04:51Z  bieberfever.com    google.com  Male    USA      1      0.45
      ...
      2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Female  UK       0      0.87
      2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  UK       0      0.99
      2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  UK       1      1.53
  19. ROLLUP DATA
      timestamp             publisher          advertiser  gender  country  impressions  clicks  revenue
      2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Male    USA      1800         25      15.70
      2011-01-01T01:00:00Z  bieberfever.com    google.com  Male    USA      2912         42      29.18
      2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Male    UK       1953         17      17.31
      2011-01-01T02:00:00Z  bieberfever.com    google.com  Male    UK       3194         170     34.01
      ‣ Truncate timestamps
      ‣ GroupBy over string columns (dimensions)
      ‣ Aggregate numeric columns (metrics)
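
The three rollup steps on the slide above (truncate timestamps, group by dimensions, aggregate metrics) can be sketched in a few lines of Python. This is an illustration of the idea, not Druid's implementation. It assumes hourly truncation, that `impressions` is a count of raw rows, and that `revenue` is the sum of `price`; the slides show the columns but do not state these definitions.

```python
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%SZ"
DIMENSIONS = ("publisher", "advertiser", "gender", "country")


def rollup(events):
    """Roll raw events up to one row per (hour, dimension values)."""
    totals = defaultdict(lambda: {"impressions": 0, "clicks": 0, "revenue": 0.0})
    for e in events:
        # Step 1: truncate the timestamp to the hour.
        ts = datetime.strptime(e["timestamp"], FMT).replace(minute=0, second=0)
        # Step 2: group by the truncated time plus all string dimensions.
        key = (ts.strftime(FMT),) + tuple(e[d] for d in DIMENSIONS)
        # Step 3: aggregate the numeric columns.
        row = totals[key]
        row["impressions"] += 1           # assumption: one raw row = one impression
        row["clicks"] += e["click"]
        row["revenue"] += e["price"]      # assumption: revenue = sum of price
    return dict(totals)


raw = [
    {"timestamp": "2011-01-01T01:01:35Z", "publisher": "bieberfever.com",
     "advertiser": "google.com", "gender": "Male", "country": "USA",
     "click": 0, "price": 0.65},
    {"timestamp": "2011-01-01T01:04:51Z", "publisher": "bieberfever.com",
     "advertiser": "google.com", "gender": "Male", "country": "USA",
     "click": 1, "price": 0.45},
]
rolled = rollup(raw)
```

Both sample rows share an hour and the same dimension values, so they collapse into a single rolled-up row.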
  20. PARTITION DATA
      timestamp             publisher          advertiser  gender  country  impressions  clicks  revenue
      2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Male    USA      1800         25      15.70
      2011-01-01T01:00:00Z  bieberfever.com    google.com  Male    USA      2912         42      29.18
      2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Male    UK       1953         17      17.31
      2011-01-01T02:00:00Z  bieberfever.com    google.com  Male    UK       3194         170     34.01
      ‣ Shard data by time
      ‣ Immutable chunks of data called “segments”
      Segment 2011-01-01T02/2011-01-01T03
      Segment 2011-01-01T01/2011-01-01T02
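
Sharding by time, as on the slide above, amounts to mapping each row to the interval that contains its timestamp. A minimal sketch, assuming hourly segments and the `start/end` interval naming shown on the slide; the function names are hypothetical, and tuples are used only to suggest immutability.

```python
from collections import defaultdict
from datetime import datetime, timedelta

FMT = "%Y-%m-%dT%H:%M:%SZ"


def segment_interval(timestamp):
    """Return the hourly interval 'start/end' that contains the timestamp."""
    start = datetime.strptime(timestamp, FMT).replace(minute=0, second=0)
    end = start + timedelta(hours=1)
    return f"{start:%Y-%m-%dT%H}/{end:%Y-%m-%dT%H}"


def partition(rows):
    """Group rows into per-interval segments; each segment is a tuple."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_interval(row["timestamp"])].append(row)
    return {interval: tuple(chunk) for interval, chunk in segments.items()}


rows = [
    {"timestamp": "2011-01-01T01:00:00Z", "publisher": "ultratrimfast.com"},
    {"timestamp": "2011-01-01T01:00:00Z", "publisher": "bieberfever.com"},
    {"timestamp": "2011-01-01T02:00:00Z", "publisher": "ultratrimfast.com"},
    {"timestamp": "2011-01-01T02:00:00Z", "publisher": "bieberfever.com"},
]
segments = partition(rows)
```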
  21. IMMUTABLE SEGMENTS
      ‣ Fundamental storage unit in Druid
      ‣ Read consistency
      ‣ One thread scans one segment
      ‣ Multiple threads can access same underlying data
      ‣ Segment sizes -> computation completes in ms
      ‣ Simplifies distribution & replication
  22. COLUMN ORIENTATION
      timestamp             publisher          advertiser  gender  country  impressions  clicks  revenue
      2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Male    USA      1800         25      15.70
      2011-01-01T01:00:00Z  bieberfever.com    google.com  Male    USA      2912         42      29.18
      ‣ Scan/load only what you need
      ‣ Compression!
      ‣ Indexes!
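
The three benefits on the slide above can be shown with a toy column store. This sketch is not Druid's segment format: dictionary encoding stands in for "compression" and a value-to-row-positions map stands in for "indexes", both chosen here as simple, well-known techniques. All function names are hypothetical.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    return {name: [r[name] for r in rows] for name in rows[0]}


def dictionary_encode(values):
    """Replace each string with a small integer id into a sorted dictionary."""
    dictionary = sorted(set(values))
    ids = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [ids[value] for value in values]


def inverted_index(values):
    """Map each distinct value to the row positions where it occurs."""
    index = {}
    for position, value in enumerate(values):
        index.setdefault(value, []).append(position)
    return index


rows = [
    {"publisher": "ultratrimfast.com", "impressions": 1800, "clicks": 25},
    {"publisher": "bieberfever.com", "impressions": 2912, "clicks": 42},
]
columns = to_columns(rows)
dictionary, encoded = dictionary_encode(columns["publisher"])
index = inverted_index(columns["publisher"])

# Answer "clicks for bieberfever.com" by touching only two columns:
# the publisher index to find rows, and the clicks column to sum them.
clicks = sum(columns["clicks"][p] for p in index["bieberfever.com"])
```

The query at the end never reads the `impressions` column, which is the "scan/load only what you need" point.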
  23. DRUID INGESTION
      ‣ Must have denormalized, flat data
      ‣ Druid cannot do stateful processing at ingestion time
        ‣ …like stream-stream joins
        ‣ …or user session reconstruction
        ‣ …or a bunch of other useful things!
      ‣ Many Druid users need an ETL pipeline
  24. DRUID REAL-TIME INGESTION
      [Diagram: Data Source; Druid Realtime Workers; Druid Historical Nodes; Druid Broker Nodes; User queries. Edge labels: Immediate, Periodic]
  25. DRUID REAL-TIME INGESTION
      [Diagram: Data Source; Druid Realtime Workers; Druid Historical Nodes; Druid Broker Nodes; User queries. Edge label: Periodic]
  26. DRUID REAL-TIME INGESTION
      [Diagram: Data Source; Stream Processor; Druid Realtime Workers; Druid Historical Nodes; Druid Broker Nodes; User queries. Edge labels: Immediate, Periodic]
  27. DRUID REAL-TIME INGESTION
      [Diagram: Druid Realtime Workers; Druid Historical Nodes; Druid Broker Nodes; User queries. Edge labels: Immediate, Periodic]
  28. STREAMING DATA PIPELINES
  29. AN EXAMPLE: ONLINE ADS
      ‣ Input data: impressions, clicks, ID-to-name mappings
      ‣ Output: enhanced impressions
      ‣ Steps
        ‣ Join impressions with clicks -> “clicks”
        ‣ Look up IDs to names -> “advertiser”, “publisher”, …
        ‣ Geocode -> “country”, …
        ‣ Lots of other additions
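
The lookup and geocode steps listed above turn a joined impression into the flat, denormalized event that Druid expects. A minimal sketch follows. The input field names (`advertiser_id`, `publisher_id`, `ip`), the lookup tables, and the prefix-based geocoder are all hypothetical stand-ins; the slides name the output fields but not the inputs.

```python
# Hypothetical lookup tables; in practice these would be built from the
# ID-to-name mapping input mentioned on the slide.
ADVERTISER_NAMES = {"a-17": "google.com"}
PUBLISHER_NAMES = {"p-42": "bieberfever.com"}
COUNTRY_BY_IP_PREFIX = {"203.0.113.": "USA"}  # toy stand-in for a geocoder


def geocode(ip):
    """Return a country for an IP address using a toy prefix table."""
    for prefix, country in COUNTRY_BY_IP_PREFIX.items():
        if ip.startswith(prefix):
            return country
    return "unknown"


def enhance(joined):
    """Turn a joined impression into a flat event with names, not IDs."""
    return {
        "key": joined["key"],
        "clicks": 1 if joined["is_clicked"] else 0,
        "advertiser": ADVERTISER_NAMES.get(joined["advertiser_id"], "unknown"),
        "publisher": PUBLISHER_NAMES.get(joined["publisher_id"], "unknown"),
        "country": geocode(joined["ip"]),
    }


event = enhance({
    "key": "186bd591-9442-48f0",
    "is_clicked": True,
    "advertiser_id": "a-17",
    "publisher_id": "p-42",
    "ip": "203.0.113.9",
})
```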
  30. PIPELINE
      [Diagram: Impressions, Clicks, ?, Druid]
  31. PIPELINE
      Impressions
        Partition 0
          {key: 186bd591-9442-48f0, publisher: foo, …}
          {key: 9b5e2cd2-a8ac-4232, publisher: qux, …}
          …
        Partition 1
          {key: 1079026c-7151-4871, publisher: baz, …}
          …
      Clicks
        Partition 0
          …
        Partition 1
          {key: 186bd591-9442-48f0}
          …
  32. PIPELINE
      [Diagram: Impressions, Clicks, Druid]
  33. PIPELINE
      [Diagram: Impressions, Clicks, Shuffle, Shuffled, Druid]
  34. PIPELINE
      Shuffled
        Partition 0
          {type: impression, key: 186bd591-9442-48f0, publisher: foo, …}
          {type: impression, key: 1079026c-7151-4871, publisher: baz, …}
          {type: click, key: 186bd591-9442-48f0}
          …
        Partition 1
          {type: impression, key: 9b5e2cd2-a8ac-4232, publisher: qux, …}
          …
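
The point of the shuffle stage is that an impression and its click land in the same partition, so a single task can join them locally. A minimal sketch, assuming partitions are chosen by hashing the join key; CRC32 is used here only because it is deterministic across runs, and the slides do not say which partitioner the pipeline uses.

```python
import zlib


def partition_for(key, num_partitions):
    """Pick a partition deterministically from the join key."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(impressions, clicks, num_partitions=2):
    """Merge both streams into one, tagged by type and partitioned by key."""
    shuffled = [[] for _ in range(num_partitions)]
    for msg in impressions:
        target = partition_for(msg["key"], num_partitions)
        shuffled[target].append({"type": "impression", **msg})
    for msg in clicks:
        target = partition_for(msg["key"], num_partitions)
        shuffled[target].append({"type": "click", **msg})
    return shuffled


impressions = [
    {"key": "186bd591-9442-48f0", "publisher": "foo"},
    {"key": "9b5e2cd2-a8ac-4232", "publisher": "qux"},
    {"key": "1079026c-7151-4871", "publisher": "baz"},
]
clicks = [{"key": "186bd591-9442-48f0"}]
shuffled = shuffle(impressions, clicks)
```

Which partition a given key maps to depends on the hash, so the assignment here will not necessarily match the one drawn on the slide; only the co-location property matters.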
  35. PIPELINE
      [Diagram: Impressions, Clicks, Shuffle, Shuffled, Druid]
  36. PIPELINE
      [Diagram: Impressions, Clicks, Shuffle, Shuffled, Join, Joined, Druid]
  37. PIPELINE
      Joined
        Partition 0
          {key: 186bd591-9442-48f0, is_clicked: true, publisher: foo, …}
          {key: 1079026c-7151-4871, is_clicked: false, publisher: baz, …}
          …
        Partition 1
          {key: 9b5e2cd2-a8ac-4232, is_clicked: false, publisher: qux, …}
          …
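
Once the streams are co-partitioned, the join for one partition reduces to keeping impressions in local state and flagging them when a matching click arrives. The sketch below is a simplification under stated assumptions: it processes a finite list and emits everything at the end, whereas a real streaming join would use a time window and evict state. The function name is hypothetical.

```python
def join_partition(messages):
    """Join clicks onto impressions within one shuffled partition."""
    state = {}  # key -> joined record; stands in for the task's local state
    for msg in messages:
        if msg["type"] == "impression":
            record = {k: v for k, v in msg.items() if k != "type"}
            record["is_clicked"] = False
            state[msg["key"]] = record
        elif msg["type"] == "click" and msg["key"] in state:
            state[msg["key"]]["is_clicked"] = True
        # A click with no buffered impression is dropped here; this is the
        # kind of imprecision a windowed streaming join has to tolerate.
    return list(state.values())


partition_0 = [
    {"type": "impression", "key": "186bd591-9442-48f0", "publisher": "foo"},
    {"type": "impression", "key": "1079026c-7151-4871", "publisher": "baz"},
    {"type": "click", "key": "186bd591-9442-48f0"},
]
joined = join_partition(partition_0)
```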
  38. PIPELINE
      [Diagram: Impressions, Clicks, Shuffle, Shuffled, Join, Joined, Druid]
  39. PIPELINE
      [Diagram: Impressions, Clicks, Shuffle, Shuffled, Join, Joined, Enhance & Output, Druid]
  40. ALTERNATIVE PIPELINE
      [Diagram: Impressions, Clicks, Shuffle, Shuffled, Join, Joined, Enhance, Enhanced, Druid]
  41. REPROCESSING
  42. WHY REPROCESS DATA?
      ‣ Bugs in processing code
      ‣ Imprecise streaming operations
        ‣ …like using short join windows
      ‣ Limitations of current software
        ‣ …Kafka 0.8.x, Samza 0.9.x can generate duplicate messages
        ‣ …Druid 0.7.x streaming ingestion is best-effort
  43. LAMBDA ARCHITECTURES
      ‣ Hybrid batch/streaming data pipeline
      ‣ Batch technologies
        • Hadoop MapReduce
        • Spark
      ‣ Streaming technologies
        • Samza
        • Storm
        • Spark Streaming
  44. LAMBDA ARCHITECTURES
      ‣ Advantages?
        • Works as advertised
        • Works with a huge variety of open software
        • Druid supports batch-replace-by-time-range through Hadoop
  45. LAMBDA ARCHITECTURES
      ‣ Disadvantages?
        ‣ Need code to run on two very different systems
        ‣ Maintaining two codebases is perilous
          ‣ …productivity loss
          ‣ …code drift
          ‣ …difficulty training new developers
  46. LAMBDA ARCHITECTURES
      [Diagram: Data, streaming]
  47. LAMBDA ARCHITECTURES
      [Diagram: Data, batch]
  48. LAMBDA ARCHITECTURES
      [Diagram: Data, streaming, batch]
  49. KAPPA ARCHITECTURE
      ‣ Pure streaming
      ‣ Reprocess data by replaying the input stream
      ‣ Doesn’t require operating two systems
      ‣ Doesn’t overcome software limitations
      ‣ I don’t have much experience with this
      ‣ http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
  50. OPERATIONS
  51. NICE THINGS ABOUT KAFKA
      ‣ Scalable, replicated pub/sub
      ‣ Replayable message logs
      ‣ New consumers can read all old messages
      ‣ Existing consumers can reprocess all old messages
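
The replay properties above follow from one design choice: messages stay in an append-only log, and each consumer only tracks an offset into it. The toy in-memory model below illustrates that idea. It is not the Kafka client API; the class and method names are hypothetical.

```python
class MessageLog:
    """Append-only log with an independent read offset per consumer."""

    def __init__(self):
        self._messages = []
        self._offsets = {}  # consumer name -> next offset to read

    def append(self, message):
        self._messages.append(message)

    def poll(self, consumer):
        """Return all messages the consumer has not seen yet."""
        offset = self._offsets.get(consumer, 0)  # new consumers start at 0
        batch = self._messages[offset:]
        self._offsets[consumer] = len(self._messages)
        return batch

    def rewind(self, consumer, offset=0):
        """Move a consumer back so that it reprocesses old messages."""
        self._offsets[consumer] = offset


log = MessageLog()
log.append("impression-1")
log.append("click-1")

first_read = log.poll("stream-job")       # existing consumer reads everything
nothing_new = log.poll("stream-job")      # and is then caught up
late_joiner = log.poll("new-consumer")    # a new consumer still sees all of it
log.rewind("stream-job")
replayed = log.poll("stream-job")         # the existing consumer reprocesses
```

Reading never removes messages, which is what lets a fixed job be pointed back at old data.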
  52. NICE THINGS ABOUT SAMZA
      ‣ Multi-tenancy: one main thread per container
      ‣ Robustness: isolated containers limit slowness and failure
      ‣ Visibility
      ‣ Multistage jobs, lots of metrics per stage
      ‣ Can inspect the message queue in Kafka
      ‣ State is simple
      ‣ Logging and restoring handled for you
      ‣ Single-threaded programming
  53. NICE THINGS ABOUT DRUID
      ‣ Fast ingestion, fast queries
      ‣ Seamlessly merge stream-ingested and batch-ingested data
      ‣ Batch loads can “replace” stream loads for the same time range
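
The "replace" behaviour above is what makes the batch layer able to correct the streaming layer. One way to picture it is a timeline keyed by time interval in which a later version of a segment hides the earlier one. The sketch below models only that idea; the class, the integer versions, and the `source` tag are assumptions for illustration, and the slides do not describe how Druid implements replacement internally.

```python
class SegmentTimeline:
    """Keeps, per time interval, only the highest-versioned segment."""

    def __init__(self):
        self._segments = {}  # interval -> (version, source, rows)

    def publish(self, interval, version, source, rows):
        current = self._segments.get(interval)
        if current is None or version > current[0]:
            self._segments[interval] = (version, source, tuple(rows))

    def source_of(self, interval):
        return self._segments[interval][1]

    def query(self, intervals):
        """Return rows from the visible segment of each requested interval."""
        return [row
                for interval in intervals if interval in self._segments
                for row in self._segments[interval][2]]


timeline = SegmentTimeline()
hour_1 = "2011-01-01T01/2011-01-01T02"
hour_2 = "2011-01-01T02/2011-01-01T03"

# Streaming ingestion publishes first; here it missed one click in hour 1.
timeline.publish(hour_1, version=1, source="stream", rows=[{"clicks": 41}])
timeline.publish(hour_2, version=1, source="stream", rows=[{"clicks": 17}])

# A later batch job reprocesses hour 1 and replaces the streamed segment.
timeline.publish(hour_1, version=2, source="batch", rows=[{"clicks": 42}])

result = timeline.query([hour_1, hour_2])
```

A query spanning both hours sees batch data for hour 1 and streamed data for hour 2, which is the seamless merge the slide refers to.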
  54. NICE THINGS ABOUT HADOOP
      ‣ Solid batch processing system
      ‣ Easy to partition and reprocess data by time range
      ‣ Jobs can process all data, or a pre-partitioned slice
  55. MONITORING
      ‣ Kafka partition availability
      ‣ Kafka log cleaner
      ‣ Samza consumer offsets
      ‣ Druid ingestion process rate
      ‣ Druid ingestion drop rate
      ‣ Druid query latency
      ‣ System metrics: CPU, network, disk
      ‣ Event counts at various stages
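
Comparing event counts between consecutive stages is a simple way to notice where a pipeline loses data. A minimal sketch follows; the stage names and numbers are made up, and the comparison is only meaningful between stages that are expected to emit one output event per input event.

```python
def stage_loss(counts):
    """Return the fraction of events lost between consecutive stages.

    `counts` maps stage name to event count, in pipeline order.
    """
    stages = list(counts.items())
    report = {}
    for (prev_name, prev_count), (name, count) in zip(stages, stages[1:]):
        lost = 0.0 if prev_count == 0 else (prev_count - count) / prev_count
        report[f"{prev_name}->{name}"] = lost
    return report


# Hypothetical counts for one time window.
counts = {"joined": 1000, "enhanced": 1000, "druid_ingested": 990}
report = stage_loss(counts)
alerts = [hop for hop, lost in report.items() if lost > 0.005]
```

The 0.5% alert threshold is arbitrary; a sensible value depends on how much best-effort loss the pipeline is allowed before reprocessing is triggered.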
  56. STREAM METRICS
  57. STREAM METRICS
  58. DO TRY THIS AT HOME
  59. CORNERSTONES
      ‣ Druid - druid.io - @druidio
      ‣ Samza - samza.apache.org - @samzastream
      ‣ Kafka - kafka.apache.org - @apachekafka
      ‣ Hadoop - hadoop.apache.org
  60. GLUE
      [Diagram labels: Tranquility; Camus / Secor; Druid Hadoop indexer]
  61. GLUE
      [Diagram labels: Camus / Secor; Druid Hadoop indexer; druid-kafka-eight]
  62. TAKEAWAYS
      ‣ Consider Kafka for making your streams available
      ‣ Consider Samza for streaming data integration
      ‣ Consider Druid for interactive exploration of streams
      ‣ Metrics, metrics, metrics
      ‣ Have a reprocessing strategy if you’re interested in historical data
  63. THANK YOU
