Aggregated queries with Druid on terabytes and petabytes of data


An overview of Druid.io: its features, pros and cons, querying, and more.
Druid
Rostislav Pashuto
November, 2015
The pattern
● we have to scale; current storage is no longer able to support our growth
  ○ horizontal scaling for data which is doubling, quadrupling, …
  ○ compression
  ○ cost effective, please
● we want near real-time reports
  ○ sub-second queries
  ○ multi-tenancy
● we have to do real-time ingestion
  ○ insights on events immediately after they occur
● we need something stable and maintained
  ○ highly available
  ○ open source solution with an active community
Once upon a time
“Over the last twelve months, we tried and failed to achieve scale and speed with relational databases (Greenplum, InfoBright, MySQL) and NoSQL offerings (HBase). So instead we did something crazy: we rolled our own database. Druid is the distributed, in-memory OLAP data store that resulted.” © by Eric Tschetter · April, 2011
Started in 2011, open sourced in 2012, under the Apache 2.0 license since Feb 20, 2015.
Druid is a fast, column-oriented, distributed, not-only-in-memory data store designed for low-latency ingestion, ad-hoc aggregations, and keeping a history for years.
Druid
Pros
● aggregate operations in sub-second time for most use cases
● real-time streaming ingestion and batch ingestion
● denormalized data
● horizontal scalability with linear performance
● active community
Cons
● lack of real joins
● limited query power compared to SQL/MDX
Druid: checklist
You need
● fast aggregations and exploratory analytics
● sub-second queries for near real-time analysis
● a data store with no SPoF
● to store a lot of events (trillions of events, petabytes of data) which you can define as a set of dimensions
● to process denormalized data, which is not completely unstructured data
● basic search is OK for you (regexp included)
Druid in production
Existing production clusters, according to the druid.io whitepaper:
● 3+ trillion events/month
● 3M+ events/sec through Druid's real-time ingestion
● 100+ PB of raw data
● 50+ trillion events
● thousands of queries per second for applications used by thousands of users
● tens of thousands of cores
Case: GumGum
GumGum, a digital marketing platform, reported about 3 billion events per day in real time => 5 TB of new data per day, with:
● Brokers – 2 m4.xlarge (round-robin DNS)
● Coordinators – 2 c4.large
● Historical (cold) – 2 m4.2xlarge (1 x 1000GB EBS SSD)
● Historical (hot) – 4 m4.2xlarge (1 x 250GB EBS SSD)
● Middle Managers – 15 c4.4xlarge (1 x 300GB EBS SSD)
● Overlords – 2 c4.large
● Zookeeper – 3 c4.large
● MySQL – RDS – db.m3.medium
More: http://goo.gl/tKKmw5
Case: GumGum
Production
Netflix
Netflix engineers use Druid to aggregate multiple data streams, ingesting up to two terabytes per hour, with the ability to query data as it's being ingested. They use Druid to pinpoint anomalies within their infrastructure, endpoint activity, and content flow.
PayPal
The Druid production deployment at PayPal processes a very large volume of data and is used for internal exploratory analytics by business analytics teams.
Xiaomi
Xiaomi uses Druid as an analytics tool to analyze online advertising data.
More: http://druid.io/druid-powered.html
Sample data
Wikipedia “edit” events
Druid cluster and the flow of data through the cluster
Components
● Realtime Node
● Historical Node
● Broker Node
● Coordinator Node
● Indexing Service
Real-time nodes
Ingest and query event streams. Events indexed via these nodes are immediately available for querying.
Buffer incoming events in an in-memory index, which is regularly persisted to disk. On a periodic basis, persisted indexes are merged together before getting handed off. Queries hit both the in-memory and persisted indexes.
Later, during the handoff stage, a real-time node uploads the segment to a permanent backup storage called “deep storage”, typically a distributed file system like S3 or HDFS.
Real-time nodes leverage Zookeeper for coordination with the rest of the Druid cluster.
One of the largest production Druid clusters is able to consume raw data at approximately 500 MB/s (150,000 events/s or 2 TB/hour).
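The buffer-then-persist behavior described above can be sketched in a few lines of Python. This is a toy model for illustration only: the class name and row limit are invented, and Druid's actual in-memory index is columnar and far more involved.

```python
import json
import os
import tempfile

class RealtimeBuffer:
    """Toy model of a real-time node's ingest path: events accumulate in an
    in-memory index, which is persisted to disk once a row limit is hit;
    queries consult both the in-memory and the persisted indexes."""

    def __init__(self, persist_dir, max_rows=3):
        self.persist_dir = persist_dir
        self.max_rows = max_rows
        self.in_memory = []        # stand-in for the in-memory index
        self.persisted_files = []

    def ingest(self, event):
        self.in_memory.append(event)
        if len(self.in_memory) >= self.max_rows:
            self._persist()

    def _persist(self):
        # Flush the in-memory index to disk (merging and handoff to deep
        # storage would follow in a real node).
        path = os.path.join(self.persist_dir,
                            "index-%d.json" % len(self.persisted_files))
        with open(path, "w") as f:
            json.dump(self.in_memory, f)
        self.persisted_files.append(path)
        self.in_memory = []

    def query(self):
        # Queries hit both the in-memory and persisted indexes.
        rows = list(self.in_memory)
        for path in self.persisted_files:
            with open(path) as f:
                rows.extend(json.load(f))
        return rows

with tempfile.TemporaryDirectory() as d:
    buf = RealtimeBuffer(d)
    for i in range(5):
        buf.ingest({"page": "Druid", "delta": i})
    n_files, n_rows = len(buf.persisted_files), len(buf.query())
print(n_files, n_rows)  # 1 5
```

After five events with a limit of three, one index has been persisted and two events remain in memory, yet all five are queryable, which is the point of the design.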
Historical nodes
Encapsulate the functionality to load and serve the immutable blocks of data (segments) created by real-time nodes. The nodes share nothing and only know how to load, drop, and serve immutable segments.
Historical nodes announce their online state and the data they are serving in Zookeeper. Instructions to load and drop segments are sent over Zookeeper.
Before a historical node downloads a particular segment from deep storage, it first checks a local cache that maintains information about what segments already exist on the node. Once processing is complete, the segment is announced in Zookeeper. At this point, the segment is queryable.
Segments must be loaded in memory before they can be queried. Druid supports memory-mapped files.
On a Zookeeper outage, historical nodes are still able to respond to query requests for the data they are currently serving, but are no longer able to serve new data or drop outdated data.
Broker nodes
Query routers to historical and real-time nodes.
Merge partial results from historical and real-time nodes.
Understand what segments are queryable and where those segments are located.
On a ZK failure, brokers use the last known view of the cluster.
Coordinator nodes
In charge of data management and distribution on historical nodes. Tell historical nodes to load new data, drop outdated data, replicate data, and move data to load balance.
Undergo a leader-election process that determines a single node that runs the coordinator functionality; the other nodes act as redundant backups.
On ZK downtime, coordinators are no longer able to send instructions.
Indexing service
Consists of
● Overlord (manages task distribution to middle managers)
● Middle Manager (creates peons for running tasks)
● Peons (run a single task in a single JVM)
Creates and destroys segments.
Components overview
Ingestion
Streaming data (does not guarantee exactly-once processing)
● stream processor (like Apache Samza or Apache Storm)
● Kafka support
You can use the Tranquility library to send event streams to Druid.
Batch data
● Hadoop-based indexing
● index task
Lambda architecture
Druid staff recommend running a streaming real-time pipeline to run queries over events as they are occurring, and a batch pipeline to perform periodic cleanups of data.
Data Formats
● JSON
● CSV
● a custom delimited form such as TSV
● Protobuf
Multi-value dimensions are supported.
Storage format
● data tables called data sources are collections of timestamped events and are partitioned into a set of segments
● each segment is typically 5–10 million rows
● the storage format is highly optimized for linear scans
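Segments are time-chunked: each one covers an interval determined by the segment granularity. As a sketch, the DAY-granularity chunk an event falls into can be computed like this (the function name is mine; real Druid segment identifiers also carry a version and a partition number):

```python
from datetime import datetime, timedelta, timezone

def day_segment_interval(ts: datetime) -> str:
    """Return the DAY-granularity interval (ISO-8601 start/end) that an
    event timestamp falls into, i.e. which time chunk its segment covers."""
    start = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    end = start + timedelta(days=1)
    fmt = "%Y-%m-%dT%H:%M:%S.000Z"
    return "%s/%s" % (start.strftime(fmt), end.strftime(fmt))

# An edit event at 13:42 UTC lands in the June 9 day chunk:
interval = day_segment_interval(
    datetime(2015, 6, 9, 13, 42, tzinfo=timezone.utc))
print(interval)  # 2015-06-09T00:00:00.000Z/2015-06-10T00:00:00.000Z
```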
Replication
● replication and distribution are done at a segment level
● druid’s data distribution is segment-based and leverages a highly available “deep” storage such as S3 or HDFS. Scaling up (or down) does not require massive copy actions or downtime; in fact, losing any number of historical nodes does not result in data loss, because new historical nodes can always be brought up by reading data from “deep” storage.
QUERYING
Key differences
Limited support for joins (through query-time lookups): replace a dimension value with another value.
No official SQL support; 3rd-party drivers exist.
Immutable dimension data: re-index a specific data segment to change it.
No specific “>” and “<” support in query filters; filters support AND, OR, NOT, REG_EXP, JavaScript (Rhino), IN (as lookups), partial match.
Workaround for “more”/“less”:
{
  "type" : "javascript",
  "dimension" : "age",
  "function" : "function(x) { return(x >= '21' && x <= '35') }"
}
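To sanity-check what that workaround matches, here is the same predicate applied locally in Python to sample dimension values. Note the comparison is lexicographic on strings (e.g. "210" would also fall between "21" and "35"), so zero-padding numeric dimensions is advisable with this approach.

```python
# Same predicate as the JavaScript filter above, applied to sample values.
def age_filter(x: str) -> bool:
    return "21" <= x <= "35"   # string comparison, as in the JS snippet

values = ["18", "21", "30", "35", "40"]
matched = [v for v in values if age_filter(v)]
print(matched)  # ['21', '30', '35']
```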
Timeseries
{
  "queryType": "timeseries",
  "dataSource": "city",
  "granularity": "day",
  "dimensions": ["income"],
  "aggregations": [
    { "type": "count", "name": "total", "fieldName": "userId" }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2016-01-04T00:00:00.000" ],
  "filter": {
    "type" : "and",
    "fields" : [
      { "type": "javascript", "dimension": "income",
        "function": "function(x) { return x > 100 }" }
    ]
  }
}
Result:
[
  { "timestamp": "2015-06-09T00:00:00.000Z", "result": { "total": 112 } },
  { "timestamp": "2015-06-10T00:00:00.000Z", "result": { "total": 117 } }
]
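Queries are plain JSON POSTed to a broker node, so a small helper (my own sketch, not a Druid client API) is enough to assemble a body like the one above:

```python
import json

def timeseries_query(data_source, intervals, aggregations,
                     granularity="day", filter_spec=None):
    """Assemble a Druid timeseries query body; the dict mirrors the
    query JSON shown above."""
    q = {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": granularity,
        "aggregations": aggregations,
        "intervals": intervals,
    }
    if filter_spec is not None:
        q["filter"] = filter_spec
    return q

q = timeseries_query(
    "city",
    ["2012-01-01T00:00:00.000/2016-01-04T00:00:00.000"],
    [{"type": "count", "name": "total", "fieldName": "userId"}],
)
body = json.dumps(q)  # POST this JSON to a broker node
print(q["queryType"])  # timeseries
```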
TopN
TopNs are much faster and more resource-efficient than “GroupBy” for this use case.
{
  "queryType": "topN",
  "dataSource": "city",
  "granularity": "day",
  "dimension": "income",
  "threshold": 3,
  "metric": "count",
  "aggregations": [...],
  ...
}
Result:
[
  {
    "timestamp": "2015-06-09T00:00:00.000Z",
    "result": [
      { "who": "Bob", "count": 100 },
      { "who": "Alice", "count": 40 },
      { "who": "Jane", "count": 15 }
    ]
  }
]
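The semantics per time bucket amount to "rank by the metric, keep the top threshold entries"; Druid computes this distributed across nodes (and the ranking can be approximate), but a local sketch makes the shape of the result clear:

```python
def top_n(rows, metric, threshold):
    """Rank entries by a metric and keep the top `threshold` of them,
    mirroring what a topN returns for one time bucket."""
    return sorted(rows, key=lambda r: r[metric], reverse=True)[:threshold]

rows = [
    {"who": "Jane", "count": 15},
    {"who": "Bob", "count": 100},
    {"who": "Eve", "count": 7},
    {"who": "Alice", "count": 40},
]
result = top_n(rows, "count", 3)
print([r["who"] for r in result])  # ['Bob', 'Alice', 'Jane']
```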
GroupBy
● Use “Timeseries” to do straight aggregates for some time range.
● Use “TopN” for an ordered groupBy over a single dimension.
{
  "queryType": "groupBy",
  "dataSource": "twitterstream",
  "granularity": "all",
  "dimensions": ["lang", "utc_offset"],
  "aggregations": [
    { "type": "count", "name": "rows" },
    { "type": "doubleSum", "fieldName": "tweets", "name": "tweets" }
  ],
  "filter": { "type": "selector", "dimension": "lang", "value": "en" },
  "intervals": ["2012-10-01T00:00/2020-01-01T00"]
}
Result:
[{
  "version": "v1",
  "timestamp": "2012-10-01T00:00:00.000Z",
  "event": { "utc_offset": "-10800", "tweets": 90, "lang": "en", "rows": 81 }
}...
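What the query above asks for, in local terms: bucket rows by the tuple of dimension values, keeping a row count and a doubleSum per bucket. A sketch of those semantics (Druid, of course, does this over segments, not Python lists):

```python
from collections import defaultdict

def group_by(rows, dimensions, sum_field):
    """Bucket rows by a tuple of dimension values, computing a row count
    and a sum of `sum_field` per bucket (local groupBy sketch)."""
    buckets = defaultdict(lambda: {"rows": 0, sum_field: 0.0})
    for r in rows:
        key = tuple(r[d] for d in dimensions)
        buckets[key]["rows"] += 1
        buckets[key][sum_field] += r[sum_field]
    return dict(buckets)

rows = [
    {"lang": "en", "utc_offset": "-10800", "tweets": 90},
    {"lang": "en", "utc_offset": "-10800", "tweets": 10},
    {"lang": "en", "utc_offset": "+0100", "tweets": 5},
]
out = group_by(rows, ["lang", "utc_offset"], "tweets")
print(out[("en", "-10800")])  # {'rows': 2, 'tweets': 100.0}
```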
Other
Time Boundary. Returns the earliest and latest data points of a data set.
Segment Metadata. Returns per-segment information: cardinality, byte size, type of columns, segment intervals, etc.
Data Source Metadata. Returns the timestamp of the last ingested event.
Search. Returns dimension values that match the search specification.
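These metadata-style queries need very little in their bodies. Sketches of minimal request bodies, based on the documented query types ("city" and the dimension name are placeholder values from the earlier examples):

```python
# Minimal bodies for the metadata-style query types above.
time_boundary = {"queryType": "timeBoundary", "dataSource": "city"}

segment_meta = {"queryType": "segmentMetadata", "dataSource": "city",
                "intervals": ["2012-01-01/2016-01-01"]}

search_query = {"queryType": "search", "dataSource": "city",
                "granularity": "day",
                "searchDimensions": ["who"],
                "query": {"type": "insensitive_contains", "value": "bo"},
                "intervals": ["2012-01-01/2016-01-01"]}

print(sorted(q["queryType"] for q in (time_boundary, segment_meta, search_query)))
# ['search', 'segmentMetadata', 'timeBoundary']
```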
Filters
● exact match (“=”)
● and
● not
● or
● reg_exp (Java reg_exp)
● JavaScript
● extraction (similar to “in”)
● search (captures partial search match)
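Filters compose as nested JSON objects: "and"/"or" take a list of fields, "not" wraps a single field. Tiny helpers (my own naming, not a Druid API) make the composition concrete:

```python
def selector(dimension, value):
    """Exact-match ("=") filter."""
    return {"type": "selector", "dimension": dimension, "value": value}

def and_(*fields):
    return {"type": "and", "fields": list(fields)}

def not_(field):
    return {"type": "not", "field": field}

# lang = 'en' AND NOT (utc_offset = '+0100')
f = and_(selector("lang", "en"), not_(selector("utc_offset", "+0100")))
print(f["type"], len(f["fields"]))  # and 2
```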
Aggregations
● count
● min/max
● JavaScript (all JavaScript functions must return numerical values)
● cardinality (by value and by row)
● hyperUnique aggregator (uses HyperLogLog to compute the estimated cardinality of a dimension that has been aggregated as a "hyperUnique" metric at indexing time)
● filtered aggregator (wraps any given aggregator, but only aggregates the values for which the given dimension filter matches)
Post Aggregations
● arithmetic (applies the provided function to the given fields from left to right)
● field accessor (returns the value produced by the specified aggregator)
● constant (always returns the specified value)
● JavaScript
● hyperUniqueCardinality (wraps a hyperUnique object so that it can be used in post aggregations)
...
"aggregations" : [
  { "type" : "count", "name" : "rows" },
  { "type" : "doubleSum", "name" : "tot", "fieldName" : "total" }
],
"postAggregations" : [{
  "type" : "arithmetic",
  "name" : "average",
  "fn" : "*",
  "fields" : [
    { "type" : "arithmetic", "name" : "div", "fn" : "/",
      "fields" : [
        { "type" : "fieldAccess", "name" : "tot", "fieldName" : "tot" },
        { "type" : "fieldAccess", "name" : "rows", "fieldName" : "rows" }
      ] },
    ...
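A post-aggregation spec is a small expression tree evaluated over the aggregation results of each row. A local interpreter sketch (my own helper, not part of Druid) shows the left-to-right semantics:

```python
def eval_post_agg(spec, agg_values):
    """Evaluate an arithmetic/fieldAccess/constant post-aggregation tree
    against one row of aggregation results."""
    t = spec["type"]
    if t == "fieldAccess":
        return agg_values[spec["fieldName"]]
    if t == "constant":
        return spec["value"]
    if t == "arithmetic":
        ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
               "*": lambda a, b: a * b, "/": lambda a, b: a / b}
        vals = [eval_post_agg(f, agg_values) for f in spec["fields"]]
        result = vals[0]
        for v in vals[1:]:
            result = ops[spec["fn"]](result, v)  # applied left to right
        return result
    raise ValueError("unsupported post-aggregation type: %s" % t)

# The inner "div" node from the example above: average = tot / rows.
average = eval_post_agg(
    {"type": "arithmetic", "name": "div", "fn": "/",
     "fields": [{"type": "fieldAccess", "fieldName": "tot"},
                {"type": "fieldAccess", "fieldName": "rows"}]},
    {"tot": 450.0, "rows": 9},
)
print(average)  # 50.0
```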
Party
Druid in a party: Spark
Druid is designed to power analytic applications and focuses on the latencies to ingest data and serve queries over that data. If you were to build an application where users could arbitrarily explore data, the latencies seen by using Spark will likely be too slow for an interactive experience.
Druid in a party: SQL on Hadoop
Druid was designed to
1. be an always-on service
2. ingest data in real time
3. handle slice-n-dice style ad-hoc queries
SQL-on-Hadoop engines generally sidestep Map/Reduce, instead querying data directly from HDFS or, in some cases, other storage systems. Some of these engines (including Impala and Presto) can be collocated with HDFS data nodes and coordinate with them to achieve data locality for queries.
What does this mean? We can talk about it in terms of three general areas:
1. queries
2. data ingestion
3. query flexibility
Resources
● http://druid.io/
● Druid in a nutshell: http://static.druid.io/docs/druid.pdf
● Druid API: https://github.com/druid-io/druid-api
● Analytic UI: http://imply.io/
● 3rd-party SQL interface: https://github.com/srikalyc/Sql4D
THANK YOU
