Building an Enterprise-Scale Dashboarding/Analytics Platform Powered by the Capabilities of Druid
September 2, 2020
Jeremy Woelfel, Principal Engineer, Target
About Target
• US-based retailer specializing in General
Merchandise and Grocery sales
• 1800+ Stores across all 50 states
• Top 10 Ecommerce Site in US
• 300,000+ team members
• Business Partners & Suppliers across the globe
What is an Enterprise Analytics Platform?
What's the purpose of an Enterprise Analytics Platform?
• Make data accessible for broad, high-performance consumption
• Allow non-technical users to discover data and generate insights
• Not limited to a single business domain; leveraged by nearly all business units across the organization
• Serve as a cache of "certified" or "gold" datasets
[Diagram: Past: a Data Warehouse consumed through Reporting Tool(s) and SQL Tools. Today: Hadoop/Hive consumed through the Enterprise Analytics Platform.]
Why Build a Custom Enterprise Analytics Platform?
There's no shortage of available options:
• Well-established vendor products
• Ever-growing open source market

What makes a good Analytics Tool?
Speed (to Market)
• Enable self-service, fast development. "Idea-to-Insight" measured in minutes or hours, not days or weeks.
• Keep up with users as they ask "follow-up" business questions of the data.
• Rapidly share insights.
Discovery & Collaboration
• Enable users to ingest data into the platform in an intuitive and easy manner.
• Allow users to find and explore ready-to-consume data.
• Integrate with existing in-house data management products.
• Provide insights where users need them (i.e., embedded analytics).
Scalability
• Allow any Target team member or business partner to utilize insights.
• Respond to "surges" in demand in a cost-effective manner.
Architecture of an Enterprise Analytics Platform
[Diagram: a React UI plus internal (back-office), mobile, and external applications consume a REST API. Behind the API, a Query Generation Engine runs queries against Druid, Hadoop/Hive, Postgres, Mongo DB, and BYOD (Bring Your Own Database) sources; a Business Datastore holds platform metadata.]
Query Generation Engine
• Translates business logic into complex expressions
• Enforces security
• Enforces system limitations
Web-based self-service tool (React UI)
• Provides an intuitive interface to help users generate insights
Business Datastore
• Stores platform metadata: details about datasets, users, groups, etc.
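The Query Generation Engine's role can be sketched roughly as follows. This is a minimal, hypothetical illustration: the function name, the allow-list, and the row limit are assumptions for the sketch, not Target's actual implementation.

```python
# Hypothetical sketch: translate a business request into a Druid native
# groupBy query while enforcing access control and a system row limit.
# All names and limits here are illustrative assumptions.

ALLOWED_DATASOURCES = {"sales", "inventory"}   # per-user ACL (assumed)
MAX_ROWS = 100_000                             # system limitation (assumed)

def build_groupby_query(datasource, dimensions, metric, intervals, limit=MAX_ROWS):
    # Enforce security: only permitted datasources may be queried.
    if datasource not in ALLOWED_DATASOURCES:
        raise PermissionError(f"datasource '{datasource}' not permitted")
    # Enforce system limitations: cap the result size.
    return {
        "queryType": "groupBy",
        "dataSource": datasource,
        "dimensions": list(dimensions),
        "aggregations": [{"type": "doubleSum", "name": metric, "fieldName": metric}],
        "granularity": "all",
        "intervals": list(intervals),
        "limitSpec": {"type": "default", "limit": min(limit, MAX_ROWS)},
    }
```

The resulting dict is the kind of payload that would be POSTed to a Druid Broker.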
Why Druid?
"Druid is just a timeseries database…"
(Nearly) all data is (or can be) timeseries-ish data.
Druid is a great choice to power an Enterprise Analytics Platform:
• It's fast.
• It specializes in functions that are common (yet critical) for analytics: aggregating, filtering, and summarizing highly-granular data
• Horizontally scalable
• It has facilities that allow complex aggregation operations to run
• Large and dedicated open-source community
• And… it's fast
Druid Cluster Details
• On-premises, bare-metal cluster running Druid 0.18
• Nodes:
  o Hundreds of historical nodes
  o Multiple brokers (load-balanced)
• 10,000+ vCores
• 100s of TBs of RAM
• HDFS for deep storage
• MySQL for metadata storage
• Over 1 PB of available storage (RAM + SSD)
• 3,500+ datasources
• Over 3 trillion rows of data loaded

Dataset size tiers:
• Tiny: < 1M rows
• Small: < 1B rows
• Medium: < 10B rows
• Big: > 10B rows
But Druid Can't Do Joins!
Most data loaded to the platform is "analytics-ready":
• Denormalized
• Pre-joined
• Clearly-defined metrics and dimensions
However, Druid offers features that allow some join-like operations:
• unions: fact-to-fact joins
• lookups: fact-to-dimension joins
Note: newer versions of Druid have begun to support some JOIN capabilities.
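For reference, the JOIN support in newer Druid releases (0.18+) is expressed as a join datasource in the native query language. A minimal sketch follows; the table, lookup, and column names are illustrative only, not from the talk.

```python
# Sketch of a Druid native "join" datasource (Druid 0.18+).
# Joins a hypothetical fact table against a lookup; a lookup datasource
# exposes its entries as columns "k" (key) and "v" (value), which appear
# here under the configured rightPrefix.
join_datasource = {
    "type": "join",
    "left": "sales",                                   # fact table (illustrative)
    "right": {"type": "lookup", "lookup": "product_category"},  # dimension side
    "rightPrefix": "cat.",
    "condition": 'product_id == "cat.k"',
    "joinType": "LEFT",
}
```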
Unions
Unions can be used to enable fact-to-fact joins, answering queries such as:

SELECT sales_table.Sales_Date, sales_table.Location, sum(Sales) as Sales, sum(Forecast) as Forecast
FROM sales_table
JOIN sales_forecast_table ON (
  sales_table.Sales_Date = sales_forecast_table.Sales_Date AND
  sales_table.Location = sales_forecast_table.Location AND
  sales_table.Item = sales_forecast_table.Item
)
GROUP BY sales_table.Sales_Date, sales_table.Location

[Diagram: union-compatible datasources sharing common dimension columns:
• Sales (Date, Location, Item, [Metrics])
• Sales Forecast (Date, Location, Item, [Metrics])
• Inventory (Date, Location, Item, [Metrics])
• Orders (Date, Location, Item, [Metrics])
• Weather (Date, Location, [Metrics])
• Demographics (Date, Location, [Metrics])
• Shipping (Date, Item, [Metrics])]

Keys to successfully using unions:
• Data architecture discipline
  o columns must match
  o dimension data values must match
• Ensure the query only includes the datasets it actually requires
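In Druid's native query language, this fact-to-fact pattern is expressed with a union datasource. A hedged sketch, with illustrative datasource and column names:

```python
# Sketch of a Druid native query over a "union" datasource, combining two
# fact tables whose dimension columns match (a requirement noted above).
# A row contributes only the metric columns it actually has.
union_query = {
    "queryType": "groupBy",
    "dataSource": {
        "type": "union",
        "dataSources": ["sales", "sales_forecast"],   # illustrative names
    },
    "dimensions": ["Location"],
    "aggregations": [
        {"type": "doubleSum", "name": "Sales", "fieldName": "Sales"},
        {"type": "doubleSum", "name": "Forecast", "fieldName": "Forecast"},
    ],
    "granularity": "day",
    "intervals": ["2020-02-02/2020-08-30"],
}
```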
Lookups
"Analytics-ready" data offers great performance, but comes at a significant cost: restatement.
Example: Analysis by product category is almost always done using "today's" definition of the product hierarchy, across billions of rows spanning multiple years of data.
Example: Analysis of custom product groupings results in a many-to-many relationship; 1000s of custom groupings are in use, and they can change daily.
Lookups (cont'd)
Denormalized data wasn't going to work, so we evaluated Druid's built-in lookup capabilities:
• Needed to support large dimension lookup tables
• Needed to support many-to-many data relationships
We created a custom lookup solution by relying on 2 key aspects of Druid:
• Druid is fast: running multi-step queries is acceptable
• Druid is good at filtering, even when the list of items in an IN clause is large

A logical query such as:

SELECT Custom_Grouping, sum(Sales) as Sales
FROM sales JOIN custom_groupings ON (sales.product_id = custom_groupings.product_id)
WHERE Custom_Grouping IN ('Back to College', 'Movie Night')
GROUP BY Custom_Grouping

is executed in two steps:

Step 1:
SELECT product, custom_grouping
FROM custom_groupings
WHERE custom_grouping IN ('Back to College', 'Movie Night')

Step 2:
SELECT 'Back to College', sum(Sales)
FROM sales
WHERE product IN ('M&Ms', 'Pillow')

SELECT 'Movie Night', sum(Sales)
FROM sales
WHERE product IN ('M&Ms')
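The two-step execution can be sketched in Python over in-memory data; the lists below are toy stand-ins for the two Druid queries, with illustrative values.

```python
from collections import defaultdict

# Toy stand-ins for the two datasets (illustrative values only).
custom_groupings = [  # many-to-many: a product can be in several groupings
    {"product": "M&Ms", "custom_grouping": "Back to College"},
    {"product": "M&Ms", "custom_grouping": "Movie Night"},
    {"product": "Pillow", "custom_grouping": "Back to College"},
]
sales = [
    {"product": "M&Ms", "Sales": 10.0},
    {"product": "Pillow", "Sales": 5.0},
]

def sales_by_grouping(wanted):
    # Step 1: resolve each requested grouping to its product list
    # (the dimension-table query).
    products = defaultdict(set)
    for row in custom_groupings:
        if row["custom_grouping"] in wanted:
            products[row["custom_grouping"]].add(row["product"])
    # Step 2: one filtered aggregation per grouping. Large IN filters are
    # cheap for Druid, which is what makes this pattern viable.
    return {g: sum(r["Sales"] for r in sales if r["product"] in ps)
            for g, ps in products.items()}
```

Because a product can belong to several groupings, the same sales row may be counted under more than one grouping, which is exactly the many-to-many behavior the built-in lookups couldn't provide.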
Other Key Druid Features
Multiple Data Ingestion Options
• Initially used the Hadoop indexer to ingest all datasets
• Evolved to optionally leverage the native indexer for smaller datasets
• Now using the Kafka indexer to address growing demand for real-time ingestion
Cardinality Aggregators
• Provide "introspection" capabilities the query engine uses to make execution decisions
"Time" as a Central Concept
• The arbitrary granularity extension supports Target's fiscal calendar
• The ability to provide an array of intervals enables powerful period-over-period analysis, e.g.:
"intervals": [
"2020-02-02T00:00:00+00:00/2020-08-30T00:00:00+00:00"
],
"granularity": {
"intervals": [
"2020-02-02T00:00:00+00:00/2020-03-01T00:00:00+00:00",
"2020-03-01T00:00:00+00:00/2020-04-05T00:00:00+00:00",
"2020-04-05T00:00:00+00:00/2020-05-03T00:00:00+00:00",
"2020-05-03T00:00:00+00:00/2020-05-31T00:00:00+00:00",
"2020-05-31T00:00:00+00:00/2020-07-05T00:00:00+00:00",
"2020-07-05T00:00:00+00:00/2020-08-02T00:00:00+00:00",
"2020-08-02T00:00:00+00:00/2020-08-30T00:00:00+00:00"
],
"type": "arbitrary"
},
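The fiscal-month intervals above follow a 4-5-4-style pattern of whole weeks. Generating such an interval array can be sketched as follows; the week pattern is inferred from the intervals shown, not an official calendar definition.

```python
from datetime import date, timedelta

def fiscal_intervals(start, weeks_per_period):
    """Build ISO interval strings for consecutive periods of whole weeks."""
    intervals, cur = [], start
    for weeks in weeks_per_period:
        nxt = cur + timedelta(weeks=weeks)
        intervals.append(f"{cur.isoformat()}T00:00:00+00:00/"
                         f"{nxt.isoformat()}T00:00:00+00:00")
        cur = nxt
    return intervals

# Week pattern inferred from the granularity intervals shown above
periods = fiscal_intervals(date(2020, 2, 2), [4, 5, 4, 4, 5, 4, 4])
```

The resulting list can be dropped directly into the `"intervals"` array of an arbitrary-granularity spec.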
Other Key Druid Features (cont'd)
Metrics
• Druid's native metric emitter provides detailed insight into platform health
• Adopted a similar metric collection scheme for other parts of the platform stack
JSON Query Language
• The JSON (native) query language allows us to express complex expressions (filtered aggregations, etc.)
QuantilesSketch & ThetaSketch
• Easily support median, p90, COUNT DISTINCT, etc.
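A filtered aggregation and a theta sketch can be combined in a single native query. A minimal illustrative sketch; the datasource, dimension, and metric names are assumptions, not from the talk.

```python
# Sketch of a native timeseries query combining a filtered aggregation
# (sales in one state) with a theta sketch (approximate distinct users,
# via the druid-datasketches extension). Names are illustrative.
query = {
    "queryType": "timeseries",
    "dataSource": "sales",
    "granularity": "day",
    "intervals": ["2020-02-02/2020-08-30"],
    "aggregations": [
        {   # filtered aggregation: only sum sales where state = 'MN'
            "type": "filtered",
            "filter": {"type": "selector", "dimension": "state", "value": "MN"},
            "aggregator": {"type": "doubleSum", "name": "mn_sales",
                           "fieldName": "Sales"},
        },
        {   # theta sketch for approximate COUNT DISTINCT of users
            "type": "thetaSketch", "name": "unique_users",
            "fieldName": "user_id",
        },
    ],
}
```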
Platform Scalability
• Over 4M queries/day run on Druid
  o Peaks of ~200 queries/second
  o Another 32M queries/day are answered from a custom query cache layer, saving valuable Druid compute for "interesting, non-repetitive questions"
  o Preparing for an (at least) 2x increase in query volume in the next 2 quarters
• 70K daily active users (DAU) & 250K monthly active users (MAU)
  o Usage from dozens of different business units within the organization
  o A dozen+ external applications using the platform's REST API
• Over 3,500 datasources (and growing)
  o 50B rows/day loaded (or reloaded)
• Average Druid query response times
  o groupBy: 600ms
  o All other query types: 300ms or less
Future Evolution of the Analytics Platform
• Performance optimizations
  o Evolve data engineering practices to take advantage of emerging Druid capabilities and improve query performance of our largest datasets
  o Develop more sophisticated workload management capabilities to allow varied query "profiles" to co-exist in a multi-tenant environment
• Enable more complex queries to be written
  o Utilize window functions
  o Provide analytical capabilities by other timeseries-like dimensions
  o Explore Druid-native joins
• Continue to push culture change to embrace event-style data
  o Evangelize the power of the platform
@woel0007
Apache Druid is an independent project of The Apache Software Foundation. More information can be found at https://druid.apache.org.
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
druidsummit.org
