Thank you all for joining us today. I am Nishant; I recently joined the Business Intelligence team at Hortonworks after working for a couple of years at Metamarkets. I am also a Druid committer and Druid PMC member.
I am here with Slim, who is also a Druid committer and recently joined Hortonworks after working for a few years with the Yahoo Druid team. Today we are going to be talking about Druid; the title of this talk is Scalable Realtime Analytics using Druid.
The initial use case was to power ad tech analytics products at Metamarkets. The first lines of code were written in 2011, and it was open sourced in 2012 under the GPL license, switching to the Apache license last year. Back in 2011, developers at Metamarkets were looking at creating an interactive customer dashboard that could analyze ad tech events. They needed a highly scalable, flexible query system that could support the low latency queries an interactive dashboard demands.
Suppose I am running an ad campaign and I want to understand: how many impressions am I getting? What is my click-through rate? How many users decided to purchase my services? We may have a user activity stream and want to know how users are behaving. We may have a stream of firewall events and want to detect anomalies in those streams in realtime. Also, for very large distributed clusters there is a need to answer questions about application performance: how is each individual node in my cluster behaving? Are there any anomalies in query response times? All of the above use cases involve data streams that can be huge in volume, depending on the scale of the business. How do I analyze this information? How do I get insights from these streams of events in realtime?
Developers at Metamarkets started by evaluating traditional RDBMS solutions: they built a star schema in PostgreSQL with aggregate tables. What they found was that as the data volume increased, queries became slow. A page load from their product dashboard, which issues 15-20 concurrent queries, was taking around 20 seconds. They added a query caching layer in front of PostgreSQL; this helped for the queries that were cached, but ad hoc queries were still slow. Since their product allows users to issue arbitrary queries, the user experience was not really interactive even with caching, so they dropped PostgreSQL and started looking into other solutions.
The next solution they tried was key/value stores, which were pretty fast. They pre-aggregated all dimensional combinations and stored them in HBase. Query speed improved drastically, and there was a significant improvement in user experience. But as the data volume grew, the precomputation time increased exponentially (with n dimensions there are 2^n possible dimensional groupings, so the work blows up): for a precomputation job with 14 dimensions, it was taking around 9 hours to precompute the dimensional combinations. This approach was also not cost effective at scale.
After looking at these solutions, they started working on an in-house product named Druid to solve the problem.
Let me talk a bit about Druid. Druid is a column oriented, distributed datastore. It is quite fast, with sub-second average query times. It can ingest data from realtime streams and make each event queryable as soon as it is ingested. It can do arbitrary slicing and dicing of data. It performs automatic summarisation (rollup) of data at ingestion time. It supports approximate algorithms like HyperLogLog and theta sketches. It scales up to petabytes of data and is highly available.
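To make the summarisation (rollup) idea concrete, here is a minimal Python sketch, not Druid's actual implementation: events that share the same truncated timestamp and dimension values collapse into one aggregated row. The field names (page, country, clicks) are invented for the example.

    from collections import defaultdict
    from datetime import datetime

    # Hypothetical raw events; field names are invented for the example.
    events = [
        {"ts": "2016-06-27T10:00:04", "page": "home", "country": "US", "clicks": 1},
        {"ts": "2016-06-27T10:00:35", "page": "home", "country": "US", "clicks": 2},
        {"ts": "2016-06-27T10:01:10", "page": "docs", "country": "FR", "clicks": 1},
    ]

    def rollup(events):
        # Collapse events with identical (truncated time, dimension values).
        summary = defaultdict(int)
        for e in events:
            ts = datetime.fromisoformat(e["ts"]).replace(second=0, microsecond=0)
            key = (ts.isoformat(), e["page"], e["country"])
            summary[key] += e["clicks"]  # longSum-style aggregation
        return summary

    for key, clicks in rollup(events).items():
        print(key, "->", clicks)  # the first two events collapse into one row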
This slide shows some of the production users. I can talk about some of the large ones, which have common use cases. Alibaba and eBay use Druid for ecommerce and user behavior analytics.
Cisco has a realtime analytics product for analyzing network flows
Yahoo uses Druid for user behavior analytics and realtime cluster monitoring, an interesting use case Slim will discuss in more detail later.
Hulu does interactive analysis of user and application behavior.
PayPal and SK Telecom use Druid for business analytics.
Druid is able to handle up to petabytes of data and billions of events. Just to give you a sense of the scale Druid can work at, the largest Druid cluster (which is at Metamarkets) has around …........... These numbers are taken from the Druid whitepaper, which you can look at if you need more details.
For any interactive user application, fast response time is critical. These are again some numbers from the Druid whitepaper on query times in the production cluster at Metamarkets. With 1000s of concurrent queries …......
Druid is also able to do arbitrary slicing and dicing of data. What that means is that you can filter and group by any combination of dimensions and aggregate any combination of metrics. Any ad hoc query involving any combination of dimensions and metrics can be issued without specifying the combinations at ingestion time.
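As an illustration, here is a sketch of what such an ad hoc query can look like against Druid's native JSON query API, POSTed to a Broker from Python. The Broker address and the datasource, dimension, metric, and filter names are assumptions made up for the example.

    import json
    import requests  # assumes the requests package is available

    # Broker address and all datasource/field names are assumptions.
    BROKER = "http://localhost:8082/druid/v2/"

    query = {
        "queryType": "groupBy",
        "dataSource": "ad_events",
        "granularity": "hour",
        "dimensions": ["country", "device"],  # any combination of dimensions
        "aggregations": [
            {"type": "longSum", "name": "clicks", "fieldName": "clicks"},
            {"type": "longSum", "name": "impressions", "fieldName": "impressions"},
        ],
        "filter": {"type": "selector", "dimension": "campaign", "value": "summer_sale"},
        "intervals": ["2016-06-01/2016-06-02"],
    }

    resp = requests.post(BROKER, json=query)
    print(json.dumps(resp.json(), indent=2))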
The next key feature is realtime queries, or data freshness, which is very important for applications where we need to make decisions in realtime. For example, while analysing firewall events or network flows, it is important to detect anomalies as soon as possible and take action based on them; if we only act on an anomaly a few hours later, it may be too late to matter. Another use case Druid serves is realtime application monitoring, where we also need to take corrective measures as soon as possible.
Realtime ingestion maintains an in-memory, row-oriented key-value store: data is kept on the heap within a map, indexed by time and dimension values, and is persisted to disk when a row-count threshold is reached or a time limit elapses.
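A toy sketch of that structure, assuming a plain dict keyed by (timestamp, dimension values) and a row-count threshold for persisting; Druid's actual realtime implementation is in Java, where tuning configs like maxRowsInMemory and intermediatePersistPeriod play the threshold and elapsed-time roles.

    class InMemoryIndex:
        # Toy row-oriented in-heap store, indexed by (time, dimension values).
        def __init__(self, max_rows=50000):
            self.rows = {}  # (timestamp, dims) -> aggregated metric
            self.max_rows = max_rows

        def add(self, timestamp, dims, value):
            key = (timestamp, tuple(sorted(dims.items())))
            self.rows[key] = self.rows.get(key, 0) + value  # aggregate in place
            if len(self.rows) >= self.max_rows:  # row-count threshold
                self.persist()

        def persist(self):
            # Druid would write an immutable columnar segment here; this toy
            # just clears the heap. A real node also persists on a time limit.
            print("persisting", len(self.rows), "rows to disk")
            self.rows.clear()

    index = InMemoryIndex()
    index.add("2016-06-27T10:00", {"country": "US"}, 1)
    index.add("2016-06-27T10:00", {"country": "US"}, 2)  # same key: aggregates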
I will be talking about Druid in practice: what you can solve with it, and how you can run it and monitor it.
In fact, this is a problem because the old way of doing things was heavily based on MapReduce jobs that compute the aggregates for every possible set of dimensions and load them into complex layers of databases, then repeat this every hour, day, week, or year. At this scale, aggregating over a week can take up to 3 days, and a month is on the order of days. The takeaway is that tools like Hadoop, Spark, and Tez are great, but they are not the best choice for this kind of problem.
Doing analytics on user engagement would not be a problem if we were not tracking more than 2 million devices, or if we did not need to ingest 20 billion new events every day, or if Flurry SDK users stopped asking questions that involve petabytes of data. Data growth is outpacing standard computing systems.
These metrics tell us how fast data is being ingested, how fast queries are running, and the general health of the components of the stack. An example event emitted by our production Druid cluster when a user issues a query looks something like this. The event includes a "timestamp" key to indicate when the event occurred. It also contains numerous tags (or dimensions, if you are familiar with OLAP terminology) that describe attributes of the query being issued: the query type, interval, aggregators, id, and many other useful attributes. The query completion time is stored in the "value" metric. Different metric events have different tags, and different services emit different metric events.
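For these notes, a representative query metric event might look roughly like the following; the shape follows what is described above (timestamp, tags, value), but the concrete values are invented for illustration.

    # Illustrative query metric event; concrete values are invented.
    metric_event = {
        "timestamp": "2016-06-27T10:08:02.123Z",  # when the event occurred
        "service": "druid/broker",
        "host": "broker-01.example.com:8082",
        "metric": "query/time",                   # query completion time
        "value": 142,                             # milliseconds
        # tags / dimensions describing the query:
        "dataSource": "ad_events",
        "type": "groupBy",
        "interval": ["2016-06-01/2016-06-02"],
        "id": "8a1c2f3e-example-query-id",
    }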
The metric collector is a piece of software open sourced by Yahoo. It works in a Jetty "dump to Kafka" style: it accepts events and hands them off without blocking, dropping events if necessary. To scale this, you add more metric collectors or realtime nodes.
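A minimal sketch of that "don't block, drop if necessary" hand-off, using a bounded in-process buffer in place of a real Kafka producer; the names and sizes here are illustrative, not the collector's actual code.

    import queue
    import threading

    buffer = queue.Queue(maxsize=10000)  # bounded buffer, stands in for Kafka
    dropped = 0

    def receive(event):
        # Called on the request-handling thread; must never block.
        global dropped
        try:
            buffer.put_nowait(event)  # enqueue without waiting
        except queue.Full:
            dropped += 1              # under pressure, drop rather than block

    def drain():
        # Background thread that forwards buffered events onward.
        while True:
            event = buffer.get()
            # a real collector would do: kafka_producer.send("metrics", event)
            buffer.task_done()

    threading.Thread(target=drain, daemon=True).start()
    receive({"metric": "query/time", "value": 142})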
Scalable Realtime Analytics using Druid
Nishant Bangarwa and Slim Bouguerra