SlideShare a Scribd company logo
1 of 31
Download to read offline
1© Cloudera, Inc. All rights reserved.
Simplifying Analytic Workloads via Complex Schemas
Marcel Kornacker
Data Modeling for Data Science
2© Cloudera, Inc. All rights reserved.
• Relational databases are optimized
to work with flat schemas
• Much of the useful information in
the world is stored using complex
schemas (a.k.a., the “document
model”)
• Log files
• NoSQL data stores
• REST APIs
On Complex Schemas
3© Cloudera, Inc. All rights reserved.
Handling Complex Schemas in SQL
4© Cloudera, Inc. All rights reserved.
Complex Schemas and Data Science
5© Cloudera, Inc. All rights reserved.
Think Like A Data Scientist
6© Cloudera, Inc. All rights reserved.
Building Data Science Teams: A Moneyball Approach
7© Cloudera, Inc. All rights reserved.
Leveraging The Power of SQL Engines…
8© Cloudera, Inc. All rights reserved.
…While Managing the Cost of SQL Engines
9© Cloudera, Inc. All rights reserved.
Intentional Complex Schemas
10© Cloudera, Inc. All rights reserved.
Better Resource Utilization
11© Cloudera, Inc. All rights reserved.
Higher Data Quality
12© Cloudera, Inc. All rights reserved.
Simplifying Analytic Queries
13© Cloudera, Inc. All rights reserved.
Data Science For Everyone
14© Cloudera, Inc. All rights reserved.
Nested Types in Impala
Support for Nested Data Types: STRUCT, MAP, ARRAY
Natural Extensions to SQL
Full Expressiveness of SQL with Nested Types
15© Cloudera, Inc. All rights reserved.
Example TPCH-like Schema
CREATE TABLE Customers {
id BIGINT,
address STRUCT<
city: STRING,
zip: INT
>
orders ARRAY<STRUCT<
date: TIMESTAMP,
price: DECIMAL(12,2),
cc: BIGINT,
items: ARRAY<STRUCT<
item_no: STRING,
price: DECIMAL(9,2)
>>
>>
}
16© Cloudera, Inc. All rights reserved.
Impala Syntax Extensions
• Path expressions extend column references to scalars (nested structs)
• Can appear anywhere a conventional column reference is used
• Collections (maps and arrays) are exposed like sub-tables
• Use FROM clause to specify which collections to read like conventional tables
• Can use JOIN conditions to express join relationship (default is INNER JOIN)
Find the ids and city of customers who live in the zip code 94305:
SELECT id, address.city FROM customers WHERE address.zip = 94305
Find all orders that were paid for with a customer’s preferred credit card:
SELECT o.txn_id FROM customers c, c.orders o WHERE o.cc = c.preferred_cc
17© Cloudera, Inc. All rights reserved.
Referencing Arrays & Maps
• Basic idea: Flatten nested collections referenced in the FROM clause
• Can be thought of as an implicit join on the parent/child relationship
SELECT c.id, o.date FROM customers c, c.c_orders o
c.id o.date
100 2012-05-03
100 2013-07-08
100 2011-01-29
… …
101 2014-02-04
101 2015-11-15
… …
id of a customer
repeated for every order
18© Cloudera, Inc. All rights reserved.
Referencing Arrays & Maps
SELECT c.c_custkey, o.o_orderkey
FROM customers c, c.c_orders o
Returns customer/order data for customers that have at least one order
SELECT c.c_custkey, o.o_orderkey
FROM customers c LEFT OUTER JOIN c.c_orders o
Also returns customers with no orders (with order fields NULL)
SELECT c.c_custkey, o.o_orderkey
FROM customers c LEFT ANTI JOIN c.orders o
Find all customers that have no orders
SELECT c.c_custkey, o.o_orderkey
FROM customers c INNER JOIN c.c_orders o
19© Cloudera, Inc. All rights reserved.
Motivation for Advanced Querying Capabilities
• Count the number of orders per customer
• Count the number of items per order
• Impractical Requires unique id at every nesting level
• Information is already expressed in nesting relationship!
• What about even more interesting queries?
• Get the number of orders and the average item price per customer?
• “Group by” multiple nesting levels at the same time
SELECT COUNT(*) FROM customers c, c.orders o GROUP BY c.id
SELECT COUNT(*) FROM customers.orders o, o.items GROUP BY ???
Must be unique
20© Cloudera, Inc. All rights reserved.
Advanced Querying: Aggregates over Nested Collections
• Count the number of orders per customer
• Count the number of items per order
• Get the number of orders and the average item price per customer
SELECT id, COUNT(orders) FROM customers
SELECT date, COUNT(items) FROM customers.orders
SELECT id, COUNT(orders), AVG(orders.items.price) FROM customers
21© Cloudera, Inc. All rights reserved.
Advanced Querying: Relative Table References
• Full expressibility of SQL with nested collections
• Arbitrary SQL allowed in inline views and subqueries with relative table refs
• Exploits nesting relationship
• No need for stored unique ids at all interesting levels
SELECT id FROM customers c
WHERE (SELECT AVG(price) FROM c.orders.items) > 1000
SELECT id FROM customers c JOIN
(SELECT price FROM c.orders ORDER BY date DESC LIMIT 2) v
22© Cloudera, Inc. All rights reserved.
Impala Nested Types in Action
Josh Wills’ Blog post on analyzing misspelled queries
http://blog.cloudera.com/blog/2014/08/how-to-count-events-like-a-data-scientist/
• Goal: Rudimentary spell checker based on counting query/click events
• Problem: Cross referencing items in multiple nested collections
• Representative of many machine learning tasks (Josh tells me)
• Goal was not naturally achievable with Hive
• Josh implemented a custom Hive extension “WITHIN”
• How can Impala serve this use case?
23© Cloudera, Inc. All rights reserved.
Impala Nested Types in Action
account_id: bigint
search_events:
array<struct<
event_id: bigint
query: string
tstamp_sec: bigint
...
>>
install_events:
array<struct<
event_id: bigint
search_event_id: bigint
app_id: bigint
...
>>
SELECT bad.query, good.query, count(*) as cnt
FROM sessions s,
s.search_events bad,
s.search_events good,
WHERE bad.tstamp_sec < good.tstamp_sec
AND good.tstamp_sec - bad.tstamp_sec < 30
AND bad.event_id NOT IN (select search_event_id FROM s.install_events)
AND good.event_id IN (select search_event_id FROM s.install_events),
GROUP BY bad.query, good.query
Schema
24© Cloudera, Inc. All rights reserved.
Design Considerations for Complex Schemas
• Turn parent-child hierarchies into nested collections
• logical hierarchy becomes physical hierarchy
• nested structure = join index
physical clustering of child with parent
• For distributed, big data systems, this matter:
distributed join turns into local join
25© Cloudera, Inc. All rights reserved.
Flat TPCH vs. Nested TPCH
Part PartSupp
Supplier Lineitem
Customer Orders
Nation Region
Part
PartSupp
Suppliers
Lineitems
Customer
Orders
Nation
Region
1 N
1
N
1
N
1
N
N
1 N
1
N
1
N 1
N
1
1
N
N
1
26© Cloudera, Inc. All rights reserved.
Columnar Storage for Complex Schemas
• Columnar storage: a necessity for processing nested data at speed
• complex schemas = really wide tables
• row-wise storage: read lots of data you don’t care about
• Columnar formats effectively store a join index
{id: 17
orders: [{date: 2014-01-29, price: 10.2}, {date: ….} …]
}
{id: 18
orders: [{date: 2015-04-10, price: 2000.0}, {date: ….} …]
}
17
18
10.2
22.1
100
200
id orders.price
…
…
…
27© Cloudera, Inc. All rights reserved.
Columnar Storage for Complex Schemas
• A “join” between parent and child is essentially free:
• coordinated scan of parent and child columns
• data effectively pre-sorted on parent PK
• merge join beats hash join!
• Aggregating child data is cheap:
• the data is already pre-grouped by the parent
• Amenable to vectorized execution
• Local non-grouping aggregation <<< distributed grouping aggregation
28© Cloudera, Inc. All rights reserved.
Analytics on Nested Data: Cheap & Easy!
SELECT c.id, AVG(orders.items.price) avg_price
FROM customer
ORDER BY avg_price DESC LIMIT 10
Query: Find the 10 customers that buy the most expensive items on average
SELECT c.id, AVG(i.price) avg_price
FROM customer c, order o, item i
WHERE c.id = o.cid and o.id = i.oid
GROUP BY c.id
ORDER BY avg_price DESC LIMIT 10
Flat Nested
29© Cloudera, Inc. All rights reserved.
Scan Scan
Xch Xch
Join
Scan
Xch
Join
Agg
Xch
Agg
Scan
Unnest
Agg
Top-n
Xch
Top-n
Top-n
Xch
Top-n
Analytics on Nested Data: Cheap & Easy!
$$$
30© Cloudera, Inc. All rights reserved.
A Note to BI Tool Vendors
Please support nested types, it’s not that hard!
• you already take advantage of declared PK/FK relationships
• a nested collection isn’t really that different
• except you don’t need a join predicate
31© Cloudera, Inc. All rights reserved.
Thank you

More Related Content

What's hot

Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)Kent Graziano
 
Altis AWS Snowflake Practice
Altis AWS Snowflake PracticeAltis AWS Snowflake Practice
Altis AWS Snowflake PracticeSamanthaSwain7
 
SLC Snowflake User Group - Mar 12, 2020
SLC Snowflake User Group - Mar 12, 2020SLC Snowflake User Group - Mar 12, 2020
SLC Snowflake User Group - Mar 12, 2020Nathan Skousen
 
Delivering rapid-fire Analytics with Snowflake and Tableau
Delivering rapid-fire Analytics with Snowflake and TableauDelivering rapid-fire Analytics with Snowflake and Tableau
Delivering rapid-fire Analytics with Snowflake and TableauHarald Erb
 
Agile Data Warehousing: Using SDDM to Build a Virtualized ODS
Agile Data Warehousing: Using SDDM to Build a Virtualized ODSAgile Data Warehousing: Using SDDM to Build a Virtualized ODS
Agile Data Warehousing: Using SDDM to Build a Virtualized ODSKent Graziano
 
AWS Summit Singapore 2019 | Snowflake: Your Data. No Limits
AWS Summit Singapore 2019 | Snowflake: Your Data. No LimitsAWS Summit Singapore 2019 | Snowflake: Your Data. No Limits
AWS Summit Singapore 2019 | Snowflake: Your Data. No LimitsAWS Summits
 
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...Certus Solutions
 
Dataiku & Snowflake Meetup Berlin 2020
Dataiku & Snowflake Meetup Berlin 2020Dataiku & Snowflake Meetup Berlin 2020
Dataiku & Snowflake Meetup Berlin 2020Harald Erb
 
How to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on SnowflakeHow to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on SnowflakeAtScale
 
Demystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFWDemystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFWKent Graziano
 
A 30 day plan to start ending your data struggle with Snowflake
A 30 day plan to start ending your data struggle with SnowflakeA 30 day plan to start ending your data struggle with Snowflake
A 30 day plan to start ending your data struggle with SnowflakeSnowflake Computing
 
Optimize the performance, cost, and value of databases.pptx
Optimize the performance, cost, and value of databases.pptxOptimize the performance, cost, and value of databases.pptx
Optimize the performance, cost, and value of databases.pptxIDERA Software
 
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...Databricks
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksGrega Kespret
 
Get Savvy with Snowflake
Get Savvy with SnowflakeGet Savvy with Snowflake
Get Savvy with SnowflakeMatillion
 
Star schema my sql
Star schema   my sqlStar schema   my sql
Star schema my sqldeathsubte
 
Changing the game with cloud dw
Changing the game with cloud dwChanging the game with cloud dw
Changing the game with cloud dwelephantscale
 

What's hot (17)

Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)
 
Altis AWS Snowflake Practice
Altis AWS Snowflake PracticeAltis AWS Snowflake Practice
Altis AWS Snowflake Practice
 
SLC Snowflake User Group - Mar 12, 2020
SLC Snowflake User Group - Mar 12, 2020SLC Snowflake User Group - Mar 12, 2020
SLC Snowflake User Group - Mar 12, 2020
 
Delivering rapid-fire Analytics with Snowflake and Tableau
Delivering rapid-fire Analytics with Snowflake and TableauDelivering rapid-fire Analytics with Snowflake and Tableau
Delivering rapid-fire Analytics with Snowflake and Tableau
 
Agile Data Warehousing: Using SDDM to Build a Virtualized ODS
Agile Data Warehousing: Using SDDM to Build a Virtualized ODSAgile Data Warehousing: Using SDDM to Build a Virtualized ODS
Agile Data Warehousing: Using SDDM to Build a Virtualized ODS
 
AWS Summit Singapore 2019 | Snowflake: Your Data. No Limits
AWS Summit Singapore 2019 | Snowflake: Your Data. No LimitsAWS Summit Singapore 2019 | Snowflake: Your Data. No Limits
AWS Summit Singapore 2019 | Snowflake: Your Data. No Limits
 
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cl...
 
Dataiku & Snowflake Meetup Berlin 2020
Dataiku & Snowflake Meetup Berlin 2020Dataiku & Snowflake Meetup Berlin 2020
Dataiku & Snowflake Meetup Berlin 2020
 
How to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on SnowflakeHow to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on Snowflake
 
Demystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFWDemystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFW
 
A 30 day plan to start ending your data struggle with Snowflake
A 30 day plan to start ending your data struggle with SnowflakeA 30 day plan to start ending your data struggle with Snowflake
A 30 day plan to start ending your data struggle with Snowflake
 
Optimize the performance, cost, and value of databases.pptx
Optimize the performance, cost, and value of databases.pptxOptimize the performance, cost, and value of databases.pptx
Optimize the performance, cost, and value of databases.pptx
 
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
 
Get Savvy with Snowflake
Get Savvy with SnowflakeGet Savvy with Snowflake
Get Savvy with Snowflake
 
Star schema my sql
Star schema   my sqlStar schema   my sql
Star schema my sql
 
Changing the game with cloud dw
Changing the game with cloud dwChanging the game with cloud dw
Changing the game with cloud dw
 

Viewers also liked

Sumo Logic quickStart Webinar June 2016
Sumo Logic quickStart Webinar June 2016Sumo Logic quickStart Webinar June 2016
Sumo Logic quickStart Webinar June 2016Sumo Logic
 
Sumo Logic QuickStart Webinar Oct 2016
Sumo Logic QuickStart Webinar Oct 2016Sumo Logic QuickStart Webinar Oct 2016
Sumo Logic QuickStart Webinar Oct 2016Sumo Logic
 
Sumo Logic Webinar: Visibility into your Host Metrics
Sumo Logic Webinar: Visibility into your Host MetricsSumo Logic Webinar: Visibility into your Host Metrics
Sumo Logic Webinar: Visibility into your Host MetricsSumo Logic
 
Sumo Logic: Optimizing Scheduled Searches
Sumo Logic: Optimizing Scheduled SearchesSumo Logic: Optimizing Scheduled Searches
Sumo Logic: Optimizing Scheduled SearchesSumo Logic
 
Sumo Logic Quickstart - Nv 2016
Sumo Logic Quickstart - Nv 2016Sumo Logic Quickstart - Nv 2016
Sumo Logic Quickstart - Nv 2016Sumo Logic
 
How Hudl and Cloud Cruiser Leverage Sumo Logic's Unified Logs and Metrics
How Hudl and Cloud Cruiser Leverage Sumo Logic's Unified Logs and MetricsHow Hudl and Cloud Cruiser Leverage Sumo Logic's Unified Logs and Metrics
How Hudl and Cloud Cruiser Leverage Sumo Logic's Unified Logs and MetricsSumo Logic
 
Sumo Logic Quickstart - Jan 2017
Sumo Logic Quickstart - Jan 2017Sumo Logic Quickstart - Jan 2017
Sumo Logic Quickstart - Jan 2017Sumo Logic
 
Sumo Logic - Optimizing Your Search Experience (2016-08-17)
Sumo Logic - Optimizing Your Search Experience (2016-08-17)Sumo Logic - Optimizing Your Search Experience (2016-08-17)
Sumo Logic - Optimizing Your Search Experience (2016-08-17)Sumo Logic
 
Sumo Logic "How to" Webinar: Advanced Analytics
Sumo Logic "How to" Webinar: Advanced AnalyticsSumo Logic "How to" Webinar: Advanced Analytics
Sumo Logic "How to" Webinar: Advanced AnalyticsSumo Logic
 
"How to" Webinar: Sending Data to Sumo Logic
"How to" Webinar: Sending Data to Sumo Logic"How to" Webinar: Sending Data to Sumo Logic
"How to" Webinar: Sending Data to Sumo LogicSumo Logic
 
Sumo Logic QuickStart Webinar - Dec 2016
Sumo Logic QuickStart Webinar - Dec 2016Sumo Logic QuickStart Webinar - Dec 2016
Sumo Logic QuickStart Webinar - Dec 2016Sumo Logic
 
Standing Up an Effective Enterprise Data Hub -- Technology and Beyond
Standing Up an Effective Enterprise Data Hub -- Technology and BeyondStanding Up an Effective Enterprise Data Hub -- Technology and Beyond
Standing Up an Effective Enterprise Data Hub -- Technology and BeyondCloudera, Inc.
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubCloudera, Inc.
 
Bring your Graphite-compatible metrics into Sumo Logic
Bring your Graphite-compatible metrics into Sumo LogicBring your Graphite-compatible metrics into Sumo Logic
Bring your Graphite-compatible metrics into Sumo LogicSumo Logic
 
Enterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big DataEnterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big DataCloudera, Inc.
 
Memory Heap Analysis with AppDynamics - AppSphere16
Memory Heap Analysis with AppDynamics - AppSphere16Memory Heap Analysis with AppDynamics - AppSphere16
Memory Heap Analysis with AppDynamics - AppSphere16AppDynamics
 
Thousands of JVMs, Hundreds of Applications, and Two People: How Cerner Learn...
Thousands of JVMs, Hundreds of Applications, and Two People: How Cerner Learn...Thousands of JVMs, Hundreds of Applications, and Two People: How Cerner Learn...
Thousands of JVMs, Hundreds of Applications, and Two People: How Cerner Learn...AppDynamics
 
How the World Bank Standardized on AppDynamics as its Enterprise-Wide APM Sol...
How the World Bank Standardized on AppDynamics as its Enterprise-Wide APM Sol...How the World Bank Standardized on AppDynamics as its Enterprise-Wide APM Sol...
How the World Bank Standardized on AppDynamics as its Enterprise-Wide APM Sol...AppDynamics
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubCloudera, Inc.
 
AppDynamics Administration - AppSphere16
AppDynamics Administration - AppSphere16AppDynamics Administration - AppSphere16
AppDynamics Administration - AppSphere16AppDynamics
 

Viewers also liked (20)

Sumo Logic quickStart Webinar June 2016
Sumo Logic quickStart Webinar June 2016Sumo Logic quickStart Webinar June 2016
Sumo Logic quickStart Webinar June 2016
 
Sumo Logic QuickStart Webinar Oct 2016
Sumo Logic QuickStart Webinar Oct 2016Sumo Logic QuickStart Webinar Oct 2016
Sumo Logic QuickStart Webinar Oct 2016
 
Sumo Logic Webinar: Visibility into your Host Metrics
Sumo Logic Webinar: Visibility into your Host MetricsSumo Logic Webinar: Visibility into your Host Metrics
Sumo Logic Webinar: Visibility into your Host Metrics
 
Sumo Logic: Optimizing Scheduled Searches
Sumo Logic: Optimizing Scheduled SearchesSumo Logic: Optimizing Scheduled Searches
Sumo Logic: Optimizing Scheduled Searches
 
Sumo Logic Quickstart - Nv 2016
Sumo Logic Quickstart - Nv 2016Sumo Logic Quickstart - Nv 2016
Sumo Logic Quickstart - Nv 2016
 
How Hudl and Cloud Cruiser Leverage Sumo Logic's Unified Logs and Metrics
How Hudl and Cloud Cruiser Leverage Sumo Logic's Unified Logs and MetricsHow Hudl and Cloud Cruiser Leverage Sumo Logic's Unified Logs and Metrics
How Hudl and Cloud Cruiser Leverage Sumo Logic's Unified Logs and Metrics
 
Sumo Logic Quickstart - Jan 2017
Sumo Logic Quickstart - Jan 2017Sumo Logic Quickstart - Jan 2017
Sumo Logic Quickstart - Jan 2017
 
Sumo Logic - Optimizing Your Search Experience (2016-08-17)
Sumo Logic - Optimizing Your Search Experience (2016-08-17)Sumo Logic - Optimizing Your Search Experience (2016-08-17)
Sumo Logic - Optimizing Your Search Experience (2016-08-17)
 
Sumo Logic "How to" Webinar: Advanced Analytics
Sumo Logic "How to" Webinar: Advanced AnalyticsSumo Logic "How to" Webinar: Advanced Analytics
Sumo Logic "How to" Webinar: Advanced Analytics
 
"How to" Webinar: Sending Data to Sumo Logic
"How to" Webinar: Sending Data to Sumo Logic"How to" Webinar: Sending Data to Sumo Logic
"How to" Webinar: Sending Data to Sumo Logic
 
Sumo Logic QuickStart Webinar - Dec 2016
Sumo Logic QuickStart Webinar - Dec 2016Sumo Logic QuickStart Webinar - Dec 2016
Sumo Logic QuickStart Webinar - Dec 2016
 
Standing Up an Effective Enterprise Data Hub -- Technology and Beyond
Standing Up an Effective Enterprise Data Hub -- Technology and BeyondStanding Up an Effective Enterprise Data Hub -- Technology and Beyond
Standing Up an Effective Enterprise Data Hub -- Technology and Beyond
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data Hub
 
Bring your Graphite-compatible metrics into Sumo Logic
Bring your Graphite-compatible metrics into Sumo LogicBring your Graphite-compatible metrics into Sumo Logic
Bring your Graphite-compatible metrics into Sumo Logic
 
Enterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big DataEnterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big Data
 
Memory Heap Analysis with AppDynamics - AppSphere16
Memory Heap Analysis with AppDynamics - AppSphere16Memory Heap Analysis with AppDynamics - AppSphere16
Memory Heap Analysis with AppDynamics - AppSphere16
 
Thousands of JVMs, Hundreds of Applications, and Two People: How Cerner Learn...
Thousands of JVMs, Hundreds of Applications, and Two People: How Cerner Learn...Thousands of JVMs, Hundreds of Applications, and Two People: How Cerner Learn...
Thousands of JVMs, Hundreds of Applications, and Two People: How Cerner Learn...
 
How the World Bank Standardized on AppDynamics as its Enterprise-Wide APM Sol...
How the World Bank Standardized on AppDynamics as its Enterprise-Wide APM Sol...How the World Bank Standardized on AppDynamics as its Enterprise-Wide APM Sol...
How the World Bank Standardized on AppDynamics as its Enterprise-Wide APM Sol...
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data Hub
 
AppDynamics Administration - AppSphere16
AppDynamics Administration - AppSphere16AppDynamics Administration - AppSphere16
AppDynamics Administration - AppSphere16
 

Similar to Marcel Kornacker, Software Enginner at Cloudera - "Data modeling for data science: Simplify your workload with complex types"

Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...
Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...
Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...Cloudera, Inc.
 
Nested Types in Impala
Nested Types in ImpalaNested Types in Impala
Nested Types in ImpalaMyung Ho Yun
 
Nested Types in Impala
Nested Types in ImpalaNested Types in Impala
Nested Types in ImpalaCloudera, Inc.
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMark Kromer
 
Introduction 6.1 01_architecture_overview
Introduction 6.1 01_architecture_overviewIntroduction 6.1 01_architecture_overview
Introduction 6.1 01_architecture_overviewAnvith S. Upadhyaya
 
Using Mainframe Data in the Cloud: Design Once, Deploy Anywhere in a Hybrid W...
Using Mainframe Data in the Cloud: Design Once, Deploy Anywhere in a Hybrid W...Using Mainframe Data in the Cloud: Design Once, Deploy Anywhere in a Hybrid W...
Using Mainframe Data in the Cloud: Design Once, Deploy Anywhere in a Hybrid W...Precisely
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
 
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...Cloudera, Inc.
 
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Data Con LA
 
An Exploration of 3 Very Different ML Solutions Running on Accumulo
An Exploration of 3 Very Different ML Solutions Running on AccumuloAn Exploration of 3 Very Different ML Solutions Running on Accumulo
An Exploration of 3 Very Different ML Solutions Running on AccumuloAccumulo Summit
 
Big data journey to the cloud 5.30.18 asher bartch
Big data journey to the cloud 5.30.18   asher bartchBig data journey to the cloud 5.30.18   asher bartch
Big data journey to the cloud 5.30.18 asher bartchCloudera, Inc.
 
Anil Kumar_ 2YearsExp
Anil Kumar_ 2YearsExpAnil Kumar_ 2YearsExp
Anil Kumar_ 2YearsExpAnil Kumar
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
AWS Summit Seoul 2015 - AWS 최신 서비스 살펴보기 - Aurora, Lambda, EFS, Machine Learn...
AWS Summit Seoul 2015 -  AWS 최신 서비스 살펴보기 - Aurora, Lambda, EFS, Machine Learn...AWS Summit Seoul 2015 -  AWS 최신 서비스 살펴보기 - Aurora, Lambda, EFS, Machine Learn...
AWS Summit Seoul 2015 - AWS 최신 서비스 살펴보기 - Aurora, Lambda, EFS, Machine Learn...Amazon Web Services Korea
 
Anil Kumar_ 2YearsExp
Anil Kumar_ 2YearsExpAnil Kumar_ 2YearsExp
Anil Kumar_ 2YearsExpAnil Kumar
 
Druid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDruid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDataWorks Summit
 
Cloudera Altus: Big Data in the Cloud Made Easy
Cloudera Altus: Big Data in the Cloud Made EasyCloudera Altus: Big Data in the Cloud Made Easy
Cloudera Altus: Big Data in the Cloud Made EasyCloudera, Inc.
 
Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?
Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?
Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?TechWell
 

Similar to Marcel Kornacker, Software Enginner at Cloudera - "Data modeling for data science: Simplify your workload with complex types" (20)

Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...
Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...
Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...
 
Nested Types in Impala
Nested Types in ImpalaNested Types in Impala
Nested Types in Impala
 
Nested Types in Impala
Nested Types in ImpalaNested Types in Impala
Nested Types in Impala
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data Analytics
 
Introduction 6.1 01_architecture_overview
Introduction 6.1 01_architecture_overviewIntroduction 6.1 01_architecture_overview
Introduction 6.1 01_architecture_overview
 
Using Mainframe Data in the Cloud: Design Once, Deploy Anywhere in a Hybrid W...
Using Mainframe Data in the Cloud: Design Once, Deploy Anywhere in a Hybrid W...Using Mainframe Data in the Cloud: Design Once, Deploy Anywhere in a Hybrid W...
Using Mainframe Data in the Cloud: Design Once, Deploy Anywhere in a Hybrid W...
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
 
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
 
Couchbase 3.0.2 d1
Couchbase 3.0.2  d1Couchbase 3.0.2  d1
Couchbase 3.0.2 d1
 
Nithin(1)
Nithin(1)Nithin(1)
Nithin(1)
 
An Exploration of 3 Very Different ML Solutions Running on Accumulo
An Exploration of 3 Very Different ML Solutions Running on AccumuloAn Exploration of 3 Very Different ML Solutions Running on Accumulo
An Exploration of 3 Very Different ML Solutions Running on Accumulo
 
Big data journey to the cloud 5.30.18 asher bartch
Big data journey to the cloud 5.30.18   asher bartchBig data journey to the cloud 5.30.18   asher bartch
Big data journey to the cloud 5.30.18 asher bartch
 
Anil Kumar_ 2YearsExp
Anil Kumar_ 2YearsExpAnil Kumar_ 2YearsExp
Anil Kumar_ 2YearsExp
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
AWS Summit Seoul 2015 - AWS 최신 서비스 살펴보기 - Aurora, Lambda, EFS, Machine Learn...
AWS Summit Seoul 2015 -  AWS 최신 서비스 살펴보기 - Aurora, Lambda, EFS, Machine Learn...AWS Summit Seoul 2015 -  AWS 최신 서비스 살펴보기 - Aurora, Lambda, EFS, Machine Learn...
AWS Summit Seoul 2015 - AWS 최신 서비스 살펴보기 - Aurora, Lambda, EFS, Machine Learn...
 
Anil Kumar_ 2YearsExp
Anil Kumar_ 2YearsExpAnil Kumar_ 2YearsExp
Anil Kumar_ 2YearsExp
 
Druid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDruid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best Practices
 
Cloudera Altus: Big Data in the Cloud Made Easy
Cloudera Altus: Big Data in the Cloud Made EasyCloudera Altus: Big Data in the Cloud Made Easy
Cloudera Altus: Big Data in the Cloud Made Easy
 
Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?
Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?
Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?
 

More from Dataconomy Media

Data Natives Paris v 10.0 | "Blockchain in Healthcare" - Lea Dias & David An...
Data Natives Paris v 10.0 | "Blockchain in Healthcare" - Lea Dias & 	David An...Data Natives Paris v 10.0 | "Blockchain in Healthcare" - Lea Dias & 	David An...
Data Natives Paris v 10.0 | "Blockchain in Healthcare" - Lea Dias & David An...Dataconomy Media
 
Data Natives Frankfurt v 11.0 | "Competitive advantages with knowledge graphs...
Data Natives Frankfurt v 11.0 | "Competitive advantages with knowledge graphs...Data Natives Frankfurt v 11.0 | "Competitive advantages with knowledge graphs...
Data Natives Frankfurt v 11.0 | "Competitive advantages with knowledge graphs...Dataconomy Media
 
Data Natives Frankfurt v 11.0 | "Can we be responsible for misuse of data & a...
Data Natives Frankfurt v 11.0 | "Can we be responsible for misuse of data & a...Data Natives Frankfurt v 11.0 | "Can we be responsible for misuse of data & a...
Data Natives Frankfurt v 11.0 | "Can we be responsible for misuse of data & a...Dataconomy Media
 
Data Natives Munich v 12.0 | "How to be more productive with Autonomous Data ...
Data Natives Munich v 12.0 | "How to be more productive with Autonomous Data ...Data Natives Munich v 12.0 | "How to be more productive with Autonomous Data ...
Data Natives Munich v 12.0 | "How to be more productive with Autonomous Data ...Dataconomy Media
 
Data Natives meets DataRobot | "Build and deploy an anti-money laundering mo...
Data Natives meets DataRobot |  "Build and deploy an anti-money laundering mo...Data Natives meets DataRobot |  "Build and deploy an anti-money laundering mo...
Data Natives meets DataRobot | "Build and deploy an anti-money laundering mo...Dataconomy Media
 
Data Natives Munich v 12.0 | "Political Data Science: A tale of Fake News, So...
Data Natives Munich v 12.0 | "Political Data Science: A tale of Fake News, So...Data Natives Munich v 12.0 | "Political Data Science: A tale of Fake News, So...
Data Natives Munich v 12.0 | "Political Data Science: A tale of Fake News, So...Dataconomy Media
 
Data Natives Vienna v 7.0 | "Building Kubernetes Operators with KUDO for Dat...
Data Natives Vienna v 7.0  | "Building Kubernetes Operators with KUDO for Dat...Data Natives Vienna v 7.0  | "Building Kubernetes Operators with KUDO for Dat...
Data Natives Vienna v 7.0 | "Building Kubernetes Operators with KUDO for Dat...Dataconomy Media
 
Data Natives Vienna v 7.0 | "The Ingredients of Data Innovation" - Robbert de...
Data Natives Vienna v 7.0 | "The Ingredients of Data Innovation" - Robbert de...Data Natives Vienna v 7.0 | "The Ingredients of Data Innovation" - Robbert de...
Data Natives Vienna v 7.0 | "The Ingredients of Data Innovation" - Robbert de...Dataconomy Media
 
Data Natives Cologne v 4.0 | "The Data Lorax: Planting the Seeds of Fairness...
Data Natives Cologne v 4.0  | "The Data Lorax: Planting the Seeds of Fairness...Data Natives Cologne v 4.0  | "The Data Lorax: Planting the Seeds of Fairness...
Data Natives Cologne v 4.0 | "The Data Lorax: Planting the Seeds of Fairness...Dataconomy Media
 
Data Natives Cologne v 4.0 | "How People Analytics Can Reveal the Hidden Aspe...
Data Natives Cologne v 4.0 | "How People Analytics Can Reveal the Hidden Aspe...Data Natives Cologne v 4.0 | "How People Analytics Can Reveal the Hidden Aspe...
Data Natives Cologne v 4.0 | "How People Analytics Can Reveal the Hidden Aspe...Dataconomy Media
 
Data Natives Amsterdam v 9.0 | "Ten Little Servers: A Story of no Downtime" -...
Data Natives Amsterdam v 9.0 | "Ten Little Servers: A Story of no Downtime" -...Data Natives Amsterdam v 9.0 | "Ten Little Servers: A Story of no Downtime" -...
Data Natives Amsterdam v 9.0 | "Ten Little Servers: A Story of no Downtime" -...Dataconomy Media
 
Data Natives Amsterdam v 9.0 | "Point in Time Labeling at Scale" - Timothy Th...
Data Natives Amsterdam v 9.0 | "Point in Time Labeling at Scale" - Timothy Th...Data Natives Amsterdam v 9.0 | "Point in Time Labeling at Scale" - Timothy Th...
Data Natives Amsterdam v 9.0 | "Point in Time Labeling at Scale" - Timothy Th...Dataconomy Media
 
Data Natives Hamburg v 6.0 | "Interpersonal behavior: observing Alex to under...
Data Natives Hamburg v 6.0 | "Interpersonal behavior: observing Alex to under...Data Natives Hamburg v 6.0 | "Interpersonal behavior: observing Alex to under...
Data Natives Hamburg v 6.0 | "Interpersonal behavior: observing Alex to under...Dataconomy Media
 
Data Natives Hamburg v 6.0 | "About Surfing, Failing & Scaling" - Florian Sch...
Data Natives Hamburg v 6.0 | "About Surfing, Failing & Scaling" - Florian Sch...Data Natives Hamburg v 6.0 | "About Surfing, Failing & Scaling" - Florian Sch...
Data Natives Hamburg v 6.0 | "About Surfing, Failing & Scaling" - Florian Sch...Dataconomy Media
 
Data NativesBerlin v 20.0 | "Serving A/B experimentation platform end-to-end"...
Data NativesBerlin v 20.0 | "Serving A/B experimentation platform end-to-end"...Data NativesBerlin v 20.0 | "Serving A/B experimentation platform end-to-end"...
Data NativesBerlin v 20.0 | "Serving A/B experimentation platform end-to-end"...Dataconomy Media
 
Data Natives Berlin v 20.0 | "Ten Little Servers: A Story of no Downtime" - A...
Data Natives Berlin v 20.0 | "Ten Little Servers: A Story of no Downtime" - A...Data Natives Berlin v 20.0 | "Ten Little Servers: A Story of no Downtime" - A...
Data Natives Berlin v 20.0 | "Ten Little Servers: A Story of no Downtime" - A...Dataconomy Media
 
Big Data Frankfurt meets Thinkport | "The Cloud as a Driver of Innovation" - ...
Big Data Frankfurt meets Thinkport | "The Cloud as a Driver of Innovation" - ...Big Data Frankfurt meets Thinkport | "The Cloud as a Driver of Innovation" - ...
Big Data Frankfurt meets Thinkport | "The Cloud as a Driver of Innovation" - ...Dataconomy Media
 
Thinkport meets Frankfurt | "Financial Time Series Analysis using Wavelets" -...
Thinkport meets Frankfurt | "Financial Time Series Analysis using Wavelets" -...Thinkport meets Frankfurt | "Financial Time Series Analysis using Wavelets" -...
Thinkport meets Frankfurt | "Financial Time Series Analysis using Wavelets" -...Dataconomy Media
 
Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with ...
Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with ...Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with ...
Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with ...Dataconomy Media
 
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...Dataconomy Media
 

More from Dataconomy Media (20)

Data Natives Paris v 10.0 | "Blockchain in Healthcare" - Lea Dias & David An...
Data Natives Paris v 10.0 | "Blockchain in Healthcare" - Lea Dias & 	David An...Data Natives Paris v 10.0 | "Blockchain in Healthcare" - Lea Dias & 	David An...
Data Natives Paris v 10.0 | "Blockchain in Healthcare" - Lea Dias & David An...
 
Data Natives Frankfurt v 11.0 | "Competitive advantages with knowledge graphs...
Data Natives Frankfurt v 11.0 | "Competitive advantages with knowledge graphs...Data Natives Frankfurt v 11.0 | "Competitive advantages with knowledge graphs...
Data Natives Frankfurt v 11.0 | "Competitive advantages with knowledge graphs...
 
Data Natives Frankfurt v 11.0 | "Can we be responsible for misuse of data & a...
Data Natives Frankfurt v 11.0 | "Can we be responsible for misuse of data & a...Data Natives Frankfurt v 11.0 | "Can we be responsible for misuse of data & a...
Data Natives Frankfurt v 11.0 | "Can we be responsible for misuse of data & a...
 
Data Natives Munich v 12.0 | "How to be more productive with Autonomous Data ...
Data Natives Munich v 12.0 | "How to be more productive with Autonomous Data ...Data Natives Munich v 12.0 | "How to be more productive with Autonomous Data ...
Data Natives Munich v 12.0 | "How to be more productive with Autonomous Data ...
 
Data Natives meets DataRobot | "Build and deploy an anti-money laundering mo...
Data Natives meets DataRobot |  "Build and deploy an anti-money laundering mo...Data Natives meets DataRobot |  "Build and deploy an anti-money laundering mo...
Data Natives meets DataRobot | "Build and deploy an anti-money laundering mo...
 
Data Natives Munich v 12.0 | "Political Data Science: A tale of Fake News, So...
Data Natives Munich v 12.0 | "Political Data Science: A tale of Fake News, So...Data Natives Munich v 12.0 | "Political Data Science: A tale of Fake News, So...
Data Natives Munich v 12.0 | "Political Data Science: A tale of Fake News, So...
 
Data Natives Vienna v 7.0 | "Building Kubernetes Operators with KUDO for Dat...
Data Natives Vienna v 7.0  | "Building Kubernetes Operators with KUDO for Dat...Data Natives Vienna v 7.0  | "Building Kubernetes Operators with KUDO for Dat...
Data Natives Vienna v 7.0 | "Building Kubernetes Operators with KUDO for Dat...
 
Data Natives Vienna v 7.0 | "The Ingredients of Data Innovation" - Robbert de...
Data Natives Vienna v 7.0 | "The Ingredients of Data Innovation" - Robbert de...Data Natives Vienna v 7.0 | "The Ingredients of Data Innovation" - Robbert de...
Data Natives Vienna v 7.0 | "The Ingredients of Data Innovation" - Robbert de...
 
Data Natives Cologne v 4.0 | "The Data Lorax: Planting the Seeds of Fairness...
Data Natives Cologne v 4.0  | "The Data Lorax: Planting the Seeds of Fairness...Data Natives Cologne v 4.0  | "The Data Lorax: Planting the Seeds of Fairness...
Data Natives Cologne v 4.0 | "The Data Lorax: Planting the Seeds of Fairness...
 
Data Natives Cologne v 4.0 | "How People Analytics Can Reveal the Hidden Aspe...
Data Natives Cologne v 4.0 | "How People Analytics Can Reveal the Hidden Aspe...Data Natives Cologne v 4.0 | "How People Analytics Can Reveal the Hidden Aspe...
Data Natives Cologne v 4.0 | "How People Analytics Can Reveal the Hidden Aspe...
 
Data Natives Amsterdam v 9.0 | "Ten Little Servers: A Story of no Downtime" -...
Data Natives Amsterdam v 9.0 | "Ten Little Servers: A Story of no Downtime" -...Data Natives Amsterdam v 9.0 | "Ten Little Servers: A Story of no Downtime" -...
Data Natives Amsterdam v 9.0 | "Ten Little Servers: A Story of no Downtime" -...
 
Data Natives Amsterdam v 9.0 | "Point in Time Labeling at Scale" - Timothy Th...
Data Natives Amsterdam v 9.0 | "Point in Time Labeling at Scale" - Timothy Th...Data Natives Amsterdam v 9.0 | "Point in Time Labeling at Scale" - Timothy Th...
Data Natives Amsterdam v 9.0 | "Point in Time Labeling at Scale" - Timothy Th...
 
Data Natives Hamburg v 6.0 | "Interpersonal behavior: observing Alex to under...
Data Natives Hamburg v 6.0 | "Interpersonal behavior: observing Alex to under...Data Natives Hamburg v 6.0 | "Interpersonal behavior: observing Alex to under...
Data Natives Hamburg v 6.0 | "Interpersonal behavior: observing Alex to under...
 
Data Natives Hamburg v 6.0 | "About Surfing, Failing & Scaling" - Florian Sch...
Data Natives Hamburg v 6.0 | "About Surfing, Failing & Scaling" - Florian Sch...Data Natives Hamburg v 6.0 | "About Surfing, Failing & Scaling" - Florian Sch...
Data Natives Hamburg v 6.0 | "About Surfing, Failing & Scaling" - Florian Sch...
 
Data NativesBerlin v 20.0 | "Serving A/B experimentation platform end-to-end"...
Data NativesBerlin v 20.0 | "Serving A/B experimentation platform end-to-end"...Data NativesBerlin v 20.0 | "Serving A/B experimentation platform end-to-end"...
Data NativesBerlin v 20.0 | "Serving A/B experimentation platform end-to-end"...
 
Data Natives Berlin v 20.0 | "Ten Little Servers: A Story of no Downtime" - A...
Data Natives Berlin v 20.0 | "Ten Little Servers: A Story of no Downtime" - A...Data Natives Berlin v 20.0 | "Ten Little Servers: A Story of no Downtime" - A...
Data Natives Berlin v 20.0 | "Ten Little Servers: A Story of no Downtime" - A...
 
Big Data Frankfurt meets Thinkport | "The Cloud as a Driver of Innovation" - ...
Big Data Frankfurt meets Thinkport | "The Cloud as a Driver of Innovation" - ...Big Data Frankfurt meets Thinkport | "The Cloud as a Driver of Innovation" - ...
Big Data Frankfurt meets Thinkport | "The Cloud as a Driver of Innovation" - ...
 
Thinkport meets Frankfurt | "Financial Time Series Analysis using Wavelets" -...
Thinkport meets Frankfurt | "Financial Time Series Analysis using Wavelets" -...Thinkport meets Frankfurt | "Financial Time Series Analysis using Wavelets" -...
Thinkport meets Frankfurt | "Financial Time Series Analysis using Wavelets" -...
 
Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with ...
Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with ...Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with ...
Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with ...
 
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...
 

Recently uploaded

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 

Recently uploaded (20)

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 

Marcel Kornacker, Software Enginner at Cloudera - "Data modeling for data science: Simplify your workload with complex types"

  • 1. 1© Cloudera, Inc. All rights reserved. Simplifying Analytic Workloads via Complex Schemas Marcel Kornacker Data Modeling for Data Science
  • 2. 2© Cloudera, Inc. All rights reserved. • Relational databases are optimized to work with flat schemas • Much of the useful information in the world is stored using complex schemas (a.k.a., the “document model”) • Log files • NoSQL data stores • REST APIs On Complex Schemas
  • 3. 3© Cloudera, Inc. All rights reserved. Handling Complex Schemas in SQL
  • 4. 4© Cloudera, Inc. All rights reserved. Complex Schemas and Data Science
  • 5. 5© Cloudera, Inc. All rights reserved. Think Like A Data Scientist
  • 6. 6© Cloudera, Inc. All rights reserved. Building Data Science Teams: A Moneyball Approach
  • 7. 7© Cloudera, Inc. All rights reserved. Leveraging The Power of SQL Engines…
  • 8. 8© Cloudera, Inc. All rights reserved. …While Managing the Cost of SQL Engines
  • 9. 9© Cloudera, Inc. All rights reserved. Intentional Complex Schemas
  • 10. 10© Cloudera, Inc. All rights reserved. Better Resource Utilization
  • 11. 11© Cloudera, Inc. All rights reserved. Higher Data Quality
  • 12. 12© Cloudera, Inc. All rights reserved. Simplifying Analytic Queries
  • 13. 13© Cloudera, Inc. All rights reserved. Data Science For Everyone
  • 14. 14© Cloudera, Inc. All rights reserved. Nested Types in Impala Support for Nested Data Types: STRUCT, MAP, ARRAY Natural Extensions to SQL Full Expressiveness of SQL with Nested Types
  • 15. 15© Cloudera, Inc. All rights reserved. Example TPCH-like Schema CREATE TABLE Customers { id BIGINT, address STRUCT< city: STRING, zip: INT > orders ARRAY<STRUCT< date: TIMESTAMP, price: DECIMAL(12,2), cc: BIGINT, items: ARRAY<STRUCT< item_no: STRING, price: DECIMAL(9,2) >> >> }
  • 16. 16© Cloudera, Inc. All rights reserved. Impala Syntax Extensions • Path expressions extend column references to scalars (nested structs) • Can appear anywhere a conventional column reference is used • Collections (maps and arrays) are exposed like sub-tables • Use FROM clause to specify which collections to read like conventional tables • Can use JOIN conditions to express join relationship (default is INNER JOIN) Find the ids and city of customers who live in the zip code 94305: SELECT id, address.city FROM customers WHERE address.zip = 94305 Find all orders that were paid for with a customer’s preferred credit card: SELECT o.txn_id FROM customers c, c.orders o WHERE o.cc = c.preferred_cc
  • 17. 17© Cloudera, Inc. All rights reserved. Referencing Arrays & Maps • Basic idea: Flatten nested collections referenced in the FROM clause • Can be thought of as an implicit join on the parent/child relationship SELECT c.id, o.date FROM customers c, c.c_orders o c.id o.date 100 2012-05-03 100 2013-07-08 100 2011-01-29 … … 101 2014-02-04 101 2015-11-15 … … id of a customer repeated for every order
  • 18. 18© Cloudera, Inc. All rights reserved. Referencing Arrays & Maps SELECT c.c_custkey, o.o_orderkey FROM customers c, c.c_orders o Returns customer/order data for customers that have at least one order SELECT c.c_custkey, o.o_orderkey FROM customers c LEFT OUTER JOIN c.c_orders o Also returns customers with no orders (with order fields NULL) SELECT c.c_custkey, o.o_orderkey FROM customers c LEFT ANTI JOIN c.orders o Find all customers that have no orders SELECT c.c_custkey, o.o_orderkey FROM customers c INNER JOIN c.c_orders o
  • 19. 19© Cloudera, Inc. All rights reserved. Motivation for Advanced Querying Capabilities • Count the number of orders per customer • Count the number of items per order • Impractical Requires unique id at every nesting level • Information is already expressed in nesting relationship! • What about even more interesting queries? • Get the number of orders and the average item price per customer? • “Group by” multiple nesting levels at the same time SELECT COUNT(*) FROM customers c, c.orders o GROUP BY c.id SELECT COUNT(*) FROM customers.orders o, o.items GROUP BY ??? Must be unique
  • 20. 20© Cloudera, Inc. All rights reserved. Advanced Querying: Aggregates over Nested Collections • Count the number of orders per customer • Count the number of items per order • Get the number of orders and the average item price per customer SELECT id, COUNT(orders) FROM customers SELECT date, COUNT(items) FROM customers.orders SELECT id, COUNT(orders), AVG(orders.items.price) FROM customers
  • 21. 21© Cloudera, Inc. All rights reserved. Advanced Querying: Relative Table References • Full expressibility of SQL with nested collections • Arbitrary SQL allowed in inline views and subqueries with relative table refs • Exploits nesting relationship • No need for stored unique ids at all interesting levels SELECT id FROM customers c WHERE (SELECT AVG(price) FROM c.orders.items) > 1000 SELECT id FROM customers c JOIN (SELECT price FROM c.orders ORDER BY date DESC LIMIT 2) v
  • 22. 22© Cloudera, Inc. All rights reserved. Impala Nested Types in Action Josh Wills’ Blog post on analyzing misspelled queries http://blog.cloudera.com/blog/2014/08/how-to-count-events-like-a-data-scientist/ • Goal: Rudimentary spell checker based on counting query/click events • Problem: Cross referencing items in multiple nested collections • Representative of many machine learning tasks (Josh tells me) • Goal was not naturally achievable with Hive • Josh implemented a custom Hive extension “WITHIN” • How can Impala serve this use case?
  • 23. 23© Cloudera, Inc. All rights reserved. Impala Nested Types in Action account_id: bigint search_events: array<struct< event_id: bigint query: string tstamp_sec: bigint ... >> install_events: array<struct< event_id: bigint search_event_id: bigint app_id: bigint ... >> SELECT bad.query, good.query, count(*) as cnt FROM sessions s, s.search_events bad, s.search_events good, WHERE bad.tstamp_sec < good.tstamp_sec AND good.tstamp_sec - bad.tstamp_sec < 30 AND bad.event_id NOT IN (select search_event_id FROM s.install_events) AND good.event_id IN (select search_event_id FROM s.install_events), GROUP BY bad.query, good.query Schema
  • 24. 24© Cloudera, Inc. All rights reserved. Design Considerations for Complex Schemas • Turn parent-child hierarchies into nested collections • logical hierarchy becomes physical hierarchy • nested structure = join index physical clustering of child with parent • For distributed, big data systems, this matter: distributed join turns into local join
  • 25. 25© Cloudera, Inc. All rights reserved. Flat TPCH vs. Nested TPCH Part PartSupp Supplier Lineitem Customer Orders Nation Region Part PartSupp Suppliers Lineitems Customer Orders Nation Region 1 N 1 N 1 N 1 N N 1 N 1 N 1 N 1 N 1 1 N N 1
  • 26. 26© Cloudera, Inc. All rights reserved. Columnar Storage for Complex Schemas • Columnar storage: a necessity for processing nested data at speed • complex schemas = really wide tables • row-wise storage: read lots of data you don’t care about • Columnar formats effectively store a join index {id: 17 orders: [{date: 2014-01-29, price: 10.2}, {date: ….} …] } {id: 18 orders: [{date: 2015-04-10, price: 2000.0}, {date: ….} …] } 17 18 10.2 22.1 100 200 id orders.price … … …
  • 27. 27© Cloudera, Inc. All rights reserved. Columnar Storage for Complex Schemas • A “join” between parent and child is essentially free: • coordinated scan of parent and child columns • data effectively pre-sorted on parent PK • merge join beats hash join! • Aggregating child data is cheap: • the data is already pre-grouped by the parent • Amenable to vectorized execution • Local non-grouping aggregation <<< distributed grouping aggregation
  • 28. 28© Cloudera, Inc. All rights reserved. Analytics on Nested Data: Cheap & Easy! SELECT c.id, AVG(orders.items.price) avg_price FROM customer ORDER BY avg_price DESC LIMIT 10 Query: Find the 10 customers that buy the most expensive items on average SELECT c.id, AVG(i.price) avg_price FROM customer c, order o, item i WHERE c.id = o.cid and o.id = i.oid GROUP BY c.id ORDER BY avg_price DESC LIMIT 10 Flat Nested
  • 29. 29© Cloudera, Inc. All rights reserved. Scan Scan Xch Xch Join Scan Xch Join Agg Xch Agg Scan Unnest Agg Top-n Xch Top-n Top-n Xch Top-n Analytics on Nested Data: Cheap & Easy! $$$
  • 30. 30© Cloudera, Inc. All rights reserved. A Note to BI Tool Vendors Please support nested types, it’s not that hard! • you already take advantage of declared PK/FK relationships • a nested collection isn’t really that different • except you don’t need a join predicate
  • 31. 31© Cloudera, Inc. All rights reserved. Thank you