Building an Enterprise-Scale Dashboarding/Analytics Platform Powered by the Capabilities of Druid
September 2, 2020
Jeremy Woelfel, Principal Engineer, Target
About Target
• US-based retailer specializing in General
Merchandise and Grocery sales
• 1800+ Stores across all 50 states
• Top 10 Ecommerce Site in US
• 300,000+ team members
• Business Partners & Suppliers across the globe
What is an Enterprise Analytics Platform?
What's the purpose of an Enterprise Analytics Platform?
• Make data accessible for broad, high-performance consumption
• Allow non-technical users to discover data and generate insights
• Not limited to a single business domain; leveraged by nearly all business units across the organization
• Serve as a cache of "certified" or "gold" datasets
[Diagram: Past: a Data Warehouse consumed through Reporting Tool(s) and SQL Tools. Today: Hadoop/Hive consumed through the Enterprise Analytics Platform.]
Why Build a Custom Enterprise Analytics Platform?
There's no shortage of available options:
• Well-established vendor products
• Ever-growing open source market

What makes a good Analytics Tool?
Speed (to Market)
• Enable self-service, fast development. "Idea-to-Insight" measured in minutes or hours, not days or weeks.
• Keep up with users as they ask "follow-up" business questions of the data.
• Rapidly share insights.
Discovery & Collaboration
• Enable users to ingest data into the platform in an intuitive and easy manner.
• Allow users to find and explore ready-to-consume data.
• Integrate with existing in-house data management products.
• Provide insights where users need them (i.e., embedded analytics).
Scalability
• Allow any Target team member or business partner to utilize insights.
• Respond to "surges" in demand in a cost-effective manner.
Architecture of an Enterprise Analytics Platform
[Diagram: a React UI plus internal (back-office), mobile, and external applications consume a REST API. Behind the API, a Query Generation Engine runs queries against Druid, Hadoop/Hive, Postgres, Mongo DB, and BYOD (Bring Your Own Database) sources; a Business Datastore holds platform metadata.]
Query Generation Engine
• Translates business logic into complex expressions
• Enforces security
• Enforces system limitations
Web-based self-service tool (React UI)
• Provides an intuitive interface to help users generate insights
Business Datastore
• Stores platform metadata: details about datasets, users, groups, etc.
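The Query Generation Engine's role can be sketched roughly as follows. This is a minimal, hypothetical illustration: the function name, the allow-list, and the row limit are assumptions for the sketch, not Target's actual implementation.

```python
# Hypothetical sketch: translate a business request into a Druid native
# groupBy query while enforcing access control and a system row limit.
# All names and limits here are illustrative assumptions.

ALLOWED_DATASOURCES = {"sales", "inventory"}   # per-user ACL (assumed)
MAX_ROWS = 100_000                             # system limitation (assumed)

def build_groupby_query(datasource, dimensions, metric, intervals, limit=MAX_ROWS):
    # Enforce security: only permitted datasources may be queried.
    if datasource not in ALLOWED_DATASOURCES:
        raise PermissionError(f"datasource '{datasource}' not permitted")
    # Enforce system limitations: cap the result size.
    return {
        "queryType": "groupBy",
        "dataSource": datasource,
        "dimensions": list(dimensions),
        "aggregations": [{"type": "doubleSum", "name": metric, "fieldName": metric}],
        "granularity": "all",
        "intervals": list(intervals),
        "limitSpec": {"type": "default", "limit": min(limit, MAX_ROWS)},
    }
```

The resulting dict is the kind of payload that would be POSTed to a Druid Broker.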
Why Druid?
"Druid is just a timeseries database…"
(Nearly) all data is (or can be) timeseries-ish data.
Druid is a great choice to power an Enterprise Analytics Platform:
• It's fast.
• It specializes in functions that are common (yet critical) for analytics: aggregating, filtering, and summarizing highly-granular data
• Horizontally scalable
• It has facilities that allow complex aggregation operations to run
• Large and dedicated open-source community
• And… it's fast
Druid Cluster Details
• On-premises, bare-metal cluster running Druid 0.18
• Nodes:
  o Hundreds of historical nodes
  o Multiple brokers (load-balanced)
• 10,000+ vCores
• 100s of TBs of RAM
• HDFS for deep storage
• MySQL for metadata storage
• Over 1 PB of available storage (RAM + SSD)
• 3,500+ datasources
• Over 3 trillion rows of data loaded

Dataset size tiers:
• Tiny: < 1M rows
• Small: < 1B rows
• Medium: < 10B rows
• Big: > 10B rows
But Druid Can't Do Joins!
Most data loaded to the platform is "analytics-ready":
• Denormalized
• Pre-joined
• Clearly-defined metrics and dimensions
However, Druid offers features that allow some join-like operations:
• unions: fact-to-fact joins
• lookups: fact-to-dimension joins
Note: newer versions of Druid have begun to support some JOIN capabilities.
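For reference, the JOIN support in newer Druid releases (0.18+) is expressed as a join datasource in the native query language. A minimal sketch follows; the table, lookup, and column names are illustrative only, not from the talk.

```python
# Sketch of a Druid native "join" datasource (Druid 0.18+).
# Joins a hypothetical fact table against a lookup; a lookup datasource
# exposes its entries as columns "k" (key) and "v" (value), which appear
# here under the configured rightPrefix.
join_datasource = {
    "type": "join",
    "left": "sales",                                   # fact table (illustrative)
    "right": {"type": "lookup", "lookup": "product_category"},  # dimension side
    "rightPrefix": "cat.",
    "condition": 'product_id == "cat.k"',
    "joinType": "LEFT",
}
```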
Unions
Unions can be used to enable fact-to-fact joins, answering queries such as:

SELECT sales_table.Sales_Date, sales_table.Location, sum(Sales) as Sales, sum(Forecast) as Forecast
FROM sales_table
JOIN sales_forecast_table ON (
  sales_table.Sales_Date = sales_forecast_table.Sales_Date AND
  sales_table.Location = sales_forecast_table.Location AND
  sales_table.Item = sales_forecast_table.Item
)
GROUP BY sales_table.Sales_Date, sales_table.Location

[Diagram: union-compatible datasources sharing common dimension columns:
• Sales (Date, Location, Item, [Metrics])
• Sales Forecast (Date, Location, Item, [Metrics])
• Inventory (Date, Location, Item, [Metrics])
• Orders (Date, Location, Item, [Metrics])
• Weather (Date, Location, [Metrics])
• Demographics (Date, Location, [Metrics])
• Shipping (Date, Item, [Metrics])]

Keys to successfully using unions:
• Data architecture discipline
  o columns must match
  o dimension data values must match
• Ensure the query only includes the datasets it actually requires
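In Druid's native query language, this fact-to-fact pattern is expressed with a union datasource. A hedged sketch, with illustrative datasource and column names:

```python
# Sketch of a Druid native query over a "union" datasource, combining two
# fact tables whose dimension columns match (a requirement noted above).
# A row contributes only the metric columns it actually has.
union_query = {
    "queryType": "groupBy",
    "dataSource": {
        "type": "union",
        "dataSources": ["sales", "sales_forecast"],   # illustrative names
    },
    "dimensions": ["Location"],
    "aggregations": [
        {"type": "doubleSum", "name": "Sales", "fieldName": "Sales"},
        {"type": "doubleSum", "name": "Forecast", "fieldName": "Forecast"},
    ],
    "granularity": "day",
    "intervals": ["2020-02-02/2020-08-30"],
}
```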
Lookups
"Analytics-ready" data offers great performance, but comes at a significant cost: restatement.
Example: Analysis by product category is almost always done using "today's" definition of the product hierarchy, across billions of rows spanning multiple years of data.
Example: Analysis of custom product groupings results in a many-to-many relationship; 1000s of custom groupings are in use, and they can change daily.
Lookups (cont'd)
Denormalized data wasn't going to work, so we evaluated Druid's built-in lookup capabilities:
• Needed to support large dimension lookup tables
• Needed to support many-to-many data relationships
We created a custom lookup solution by relying on 2 key aspects of Druid:
• Druid is fast: running multi-step queries is acceptable
• Druid is good at filtering, even when the list of items in an IN clause is large

A logical query such as:

SELECT Custom_Grouping, sum(Sales) as Sales
FROM sales JOIN custom_groupings ON (sales.product_id = custom_groupings.product_id)
WHERE Custom_Grouping IN ('Back to College', 'Movie Night')
GROUP BY Custom_Grouping

is executed in two steps:

Step 1:
SELECT product, custom_grouping
FROM custom_groupings
WHERE custom_grouping IN ('Back to College', 'Movie Night')

Step 2:
SELECT 'Back to College', sum(Sales)
FROM sales
WHERE product IN ('M&Ms', 'Pillow')

SELECT 'Movie Night', sum(Sales)
FROM sales
WHERE product IN ('M&Ms')
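The two-step execution can be sketched in Python over in-memory data; the lists below are toy stand-ins for the two Druid queries, with illustrative values.

```python
from collections import defaultdict

# Toy stand-ins for the two datasets (illustrative values only).
custom_groupings = [  # many-to-many: a product can be in several groupings
    {"product": "M&Ms", "custom_grouping": "Back to College"},
    {"product": "M&Ms", "custom_grouping": "Movie Night"},
    {"product": "Pillow", "custom_grouping": "Back to College"},
]
sales = [
    {"product": "M&Ms", "Sales": 10.0},
    {"product": "Pillow", "Sales": 5.0},
]

def sales_by_grouping(wanted):
    # Step 1: resolve each requested grouping to its product list
    # (the dimension-table query).
    products = defaultdict(set)
    for row in custom_groupings:
        if row["custom_grouping"] in wanted:
            products[row["custom_grouping"]].add(row["product"])
    # Step 2: one filtered aggregation per grouping. Large IN filters are
    # cheap for Druid, which is what makes this pattern viable.
    return {g: sum(r["Sales"] for r in sales if r["product"] in ps)
            for g, ps in products.items()}
```

Because a product can belong to several groupings, the same sales row may be counted under more than one grouping, which is exactly the many-to-many behavior the built-in lookups couldn't provide.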
Other Key Druid Features
Multiple Data Ingestion Options
• Initially used the Hadoop indexer to ingest all datasets
• Evolved to optionally leverage the native indexer for smaller datasets
• Now using the Kafka indexer to address growing demand for real-time ingestion
Cardinality Aggregators
• Provide "introspection" capabilities the query engine uses to make execution decisions
"Time" as a Central Concept
• The arbitrary granularity extension supports Target's fiscal calendar
• The ability to provide an array of intervals enables powerful period-over-period analysis, e.g.:
"intervals": [
"2020-02-02T00:00:00+00:00/2020-08-30T00:00:00+00:00"
],
"granularity": {
"intervals": [
"2020-02-02T00:00:00+00:00/2020-03-01T00:00:00+00:00",
"2020-03-01T00:00:00+00:00/2020-04-05T00:00:00+00:00",
"2020-04-05T00:00:00+00:00/2020-05-03T00:00:00+00:00",
"2020-05-03T00:00:00+00:00/2020-05-31T00:00:00+00:00",
"2020-05-31T00:00:00+00:00/2020-07-05T00:00:00+00:00",
"2020-07-05T00:00:00+00:00/2020-08-02T00:00:00+00:00",
"2020-08-02T00:00:00+00:00/2020-08-30T00:00:00+00:00"
],
"type": "arbitrary"
},
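The fiscal-month intervals above follow a 4-5-4-style pattern of whole weeks. Generating such an interval array can be sketched as follows; the week pattern is inferred from the intervals shown, not an official calendar definition.

```python
from datetime import date, timedelta

def fiscal_intervals(start, weeks_per_period):
    """Build ISO interval strings for consecutive periods of whole weeks."""
    intervals, cur = [], start
    for weeks in weeks_per_period:
        nxt = cur + timedelta(weeks=weeks)
        intervals.append(f"{cur.isoformat()}T00:00:00+00:00/"
                         f"{nxt.isoformat()}T00:00:00+00:00")
        cur = nxt
    return intervals

# Week pattern inferred from the granularity intervals shown above
periods = fiscal_intervals(date(2020, 2, 2), [4, 5, 4, 4, 5, 4, 4])
```

The resulting list can be dropped directly into the `"intervals"` array of an arbitrary-granularity spec.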
Other Key Druid Features (cont'd)
Metrics
• Druid's native metric emitter provides detailed insight into platform health
• Adopted a similar metric collection scheme for other parts of the platform stack
JSON Query Language
• The JSON (native) query language allows us to express complex expressions (filtered aggregations, etc.)
QuantilesSketch & ThetaSketch
• Easily support median, p90, COUNT DISTINCT, etc.
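A filtered aggregation and a theta sketch can be combined in a single native query. A minimal illustrative sketch; the datasource, dimension, and metric names are assumptions, not from the talk.

```python
# Sketch of a native timeseries query combining a filtered aggregation
# (sales in one state) with a theta sketch (approximate distinct users,
# via the druid-datasketches extension). Names are illustrative.
query = {
    "queryType": "timeseries",
    "dataSource": "sales",
    "granularity": "day",
    "intervals": ["2020-02-02/2020-08-30"],
    "aggregations": [
        {   # filtered aggregation: only sum sales where state = 'MN'
            "type": "filtered",
            "filter": {"type": "selector", "dimension": "state", "value": "MN"},
            "aggregator": {"type": "doubleSum", "name": "mn_sales",
                           "fieldName": "Sales"},
        },
        {   # theta sketch for approximate COUNT DISTINCT of users
            "type": "thetaSketch", "name": "unique_users",
            "fieldName": "user_id",
        },
    ],
}
```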
Platform Scalability
• Over 4M queries/day run on Druid
  o Peaks of ~200 queries/second
  o Another 32M queries/day are answered from a custom query cache layer, saving valuable Druid compute for "interesting, non-repetitive questions"
  o Preparing for an (at least) 2x increase in query volume in the next 2 quarters
• 70K daily active users (DAU) & 250K monthly active users (MAU)
  o Usage from dozens of different business units within the organization
  o A dozen+ external applications using the platform's REST API
• Over 3,500 datasources (and growing)
  o 50B rows/day loaded (or reloaded)
• Average Druid query response times
  o groupBy: 600ms
  o All other query types: 300ms or less
Future Evolution of the Analytics Platform
• Performance optimizations
  o Evolve data engineering practices to take advantage of emerging Druid capabilities and improve query performance of our largest datasets
  o Develop more sophisticated workload management capabilities to allow varied query "profiles" to co-exist in a multi-tenant environment
• Enable more complex queries to be written
  o Utilize window functions
  o Provide analytical capabilities by other timeseries-like dimensions
  o Explore Druid-native joins
• Continue to push culture change to embrace event-style data
  o Evangelize the power of the platform
@woel0007
Apache Druid is an independent project of The Apache Software Foundation. More information can be found at https://druid.apache.org.
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
druidsummit.org
