SlideShare a Scribd company logo
‹#›© Cloudera, Inc. All rights reserved.
Marcel Kornacker // Cloudera, Inc.
Friction-Free ELT:
Automating Data
Transformation with
Cloudera Impala
© Cloudera, Inc. All rights reserved.
Outline
• Evolution of ELT in the context of analytics
• traditional systems
• Hadoop today
• Cloudera’s vision for ELT
• make most of it disappear
• automate data transformation
2
© Cloudera, Inc. All rights reserved.
Traditional ELT
• Extract: physical extraction from source data store
• could be an RDBMS acting as an operational data store
• or log data materialized as json
• Load: the targeted analytic DBMS converts the transformed data into its
binary format (typically columnar)
• Transform:
• data cleansing and standardization
• conversion of naturally complex/nested data into a flat relational schema
3
© Cloudera, Inc. All rights reserved.
• Three aspects to the traditional ELT process:
1. semantic transformation such as data standardization/cleansing
» makes data more queryable, adds value
2. representational transformation: from source to target schema
(from complex/nested to flat relational)
» “lateral” transformation that doesn’t change semantics,
adds operational overhead
3. data movement: from source to staging area to target system
» adds yet more operational overhead
Traditional ELT
4
© Cloudera, Inc. All rights reserved.
• The goals of “analytics with no ELT”:
• simplify aspect 1
• eliminate aspects 2 and 3
Traditional ELT
5
© Cloudera, Inc. All rights reserved.
A typical ELT workflow with Hadoop looks like this:
raw source data initially lands in HDFS (examples: text/xml/json log files) …
Text, XML, JSON
ELT with Hadoop Today
6
© Cloudera, Inc. All rights reserved.
… map that data into a table to make it queryable …
Text, XML, JSON Original Data
CREATE TABLE RawLogData (…) ROW FORMAT DELIMITED FIELDS LOCATION
‘/raw-log-data/‘;
ELT with Hadoop Today
7
© Cloudera, Inc. All rights reserved.
… create the target table at a different location …
Text, XML, JSON Original Data Parquet
CREATE TABLE LogData (…) STORED AS PARQUET LOCATION ‘/log-data/‘;
ELT with Hadoop Today
8
© Cloudera, Inc. All rights reserved.
… convert the raw source data to the target format …
Text, XML, JSON Original Data Parquet
INSERT INTO LogData SELECT * FROM RawLogData;
ELT with Hadoop Today
9
© Cloudera, Inc. All rights reserved.
… the data is then available for batch reporting/analytics (via Impala, Hive,
Pig, Spark) or interactive analytics (via Impala, Search)
Text, XML, JSON Original Data Parquet
Impala
Hive
Spark
Pig
ELT with Hadoop Today
10
© Cloudera, Inc. All rights reserved.
• Compared to traditional ELT, this has several advantages:
• Hadoop acts as a centralized location for all data: raw source data lives
side by side with the transformed data
• data does not need to be moved between multiple platforms/clusters
• data in the raw source format is queryable as soon as it lands, although
at reduced performance, compared to an optimized columnar data
format
ELT with Hadoop Today
11
© Cloudera, Inc. All rights reserved.
• However, even this still has drawbacks:
• new data needs to be loaded periodically into the target table
• doing that reliably and within SLAs can be a challenge
• you now have two tables:
one with current but slow data
another with lagging but fast data
Text, XML, JSON Original Data Parquet
Impala
Hive
Spark
Pig
Another user-managed
data pipeline!
ELT with Hadoop Today
12
© Cloudera, Inc. All rights reserved.
• No explicit loading/conversion step to move raw data into a target table
• A single view of the data that is up-to-date and (mostly) in an efficient
columnar format
• automated with custom logic
Text, XML, JSON Parquet
Impala
Hive
Spark
Pig
automated
A Vision for Analytics with No ELT
13
© Cloudera, Inc. All rights reserved.
• support for complex/nested schemas
» avoid remapping of raw data into a flat relational schema
• background and incremental data conversion
» retain in-place single view of entire data set, with most data being in an efficient
format
• bonus: schema inference and schema evolution
» start analyzing data as soon as it arrives, regardless of its complexity
A Vision for Analytics with No ELT
14
© Cloudera, Inc. All rights reserved.
• Standard relational: all columns have scalar values:
CHAR(n), DECIMAL(p, s), INT, DOUBLE, TIMESTAMP, etc.
• Complex types: structs, arrays, maps
in essence, a nested relational schema
• Supported file formats:
Parquet, json, XML, Avro
• Design principle for SQL extensions: maintain SQL’s way of dealing with
multi-valued data
Support for Complex Schemas in Impala
15
© Cloudera, Inc. All rights reserved.
Sample schema CREATE TABLE Customers (
cid BIGINT,
address STRUCT {
street STRING,
zip INT,
city STRING
},
orders ARRAY<STRUCT {
oid BIGINT,
total DECIMAL(9, 2),
items ARRAY< STRUCT {
iid BIGINT, qty INT, price DECIMAL(9, 2) }>
}>
)
Support for Complex Schemas in Impala
16
© Cloudera, Inc. All rights reserved.
• Paths in expressions: generalization of column references to drill
through structs
• Can appear anywhere a conventional column reference is legal
SELECT address.zip, c.address.street
FROM Customers c
WHERE address.city = ‘Oakland’
SQL Syntax Extensions: Structs
17
© Cloudera, Inc. All rights reserved.
• Basic idea: nested collections are referenced like tables in the FROM
clause
SELECT SUM(i.price * i.qty)
FROM Customers.orders.items i
WHERE i.price > 10
SQL Syntax Extensions: Arrays and Maps
18
© Cloudera, Inc. All rights reserved.
• Parent/child relationships don’t require join predicates
identical to
SELECT c.cid, o.total
FROM Customers c, c.orders o
SELECT c.cid, o.total
FROM Customers c INNER JOIN c.orders o
SQL Syntax Extensions: Arrays and Maps
19
© Cloudera, Inc. All rights reserved.
• All join types are supported (INNER, OUTER, SEMI, ANTI)
• Advanced querying capabilities via correlated inline views and
subqueries
SELECT c.cid
FROM Customers c LEFT ANTI JOIN c.orders o
SELECT c.cid
FROM Customers c
WHERE EXISTS
(SELECT * FROM c.orders.items where iid = 117)
SQL Syntax Extensions: Arrays and Maps
20
© Cloudera, Inc. All rights reserved.
• Aggregates over collections are concise and easy to read
instead of
SELECT c.cid,
COUNT(c.orders), AVG(c.orders.items.price)
FROM Customers c
SELECT c.cid, a, b
FROM Customers c,
(SELECT COUNT(*) AS a FROM c.orders),
(SELECT AVG(price) AS b FROM c.orders.items)
SQL Syntax Extensions: Arrays and Maps
21
© Cloudera, Inc. All rights reserved.
• Aggregates over collections can replace subqueries
instead of
SELECT c.cid
FROM Customers c
WHERE COUNT(c.orders) > 10
SELECT c.cid
FROM Customers c
WHERE (SELECT COUNT(*) FROM c.orders) > 10
SQL Syntax Extensions: Arrays and Maps
22
© Cloudera, Inc. All rights reserved.
• Parquet optimizes storage of complex schemas by inlining structural data
into the columnar data
(repetition/definition levels)
• No performance penalty for accessing nested collections!
• Parquet translates logical hierarchy into physical collocation:
• you don’t need parent-child joins
• queries run faster!
Complex Schemas with Parquet
23
© Cloudera, Inc. All rights reserved.
• Automated data conversion takes care of
• format conversion: from text/JSON/Avro/… to Parquet
• compaction: multiple smaller files are coalesced into a single larger one
• Sample workflow:
• create table for data:
• attach data to table:
CREATE TABLE LogData (…) WITH CONVERSION TO PARQUET;
LOAD DATA INPATH ‘/raw-log-data/file1’ INTO LogData SOURCE FORMAT
JSON;
Automated Data Conversion
24
© Cloudera, Inc. All rights reserved.
Automated Data Conversion
25
JSON
Impala
Hive
Spark
Pig
PARQUETJSON
Single-Table View
© Cloudera, Inc. All rights reserved.
• Conversion process
• atomic: the switch from the source to the target data files is atomic from
the perspective of a running query (but any running query sees the full
data set)
• redundant: with option to retain original data
• incremental: Impala’s catalog service detects new data files that are not
in the target format automatically
Automated Data Conversion
26
© Cloudera, Inc. All rights reserved.
• Specify data transformation via “view” definition:
• derived table is expressed as Select, Project, Join, Aggregation
• custom logic incorporated via UDFs, UDAs
CREATE DERIVED TABLE CleanLogData AS
SELECT StandardizeName(name), StandardizeAddr(addr, zip), …
FROM LogData
STORED AS PARQUET;
Automating Data Transformation: Derived Tables
27
© Cloudera, Inc. All rights reserved.
• From the user’s perspective:
• table is queryable like any other table (but doesn’t allow INSERTs)
• reflects all data visible in source tables at time of query (not: at time of
CREATE)
• performance is close to that of a table created with CREATE TABLE …
AS SELECT (ie, that of a static snapshot)
Automating Data Transformation: Derived Tables
28
© Cloudera, Inc. All rights reserved.
• From the system’s perspective:
• table is Union of
• physically materialized data, derived from input tables as of some
point in the past
• view over yet-unprocessed data of input tables
• table is updated incrementally (and in the background) when new data is
added to input tables
Automating Data Transformation: Derived Tables
29
© Cloudera, Inc. All rights reserved.
Automating Data Transformation: Derived Tables
30
Impala
Hive
Spark
Pig
Delta Query Materialized Data
Base Table
Updates
Single-Table View
© Cloudera, Inc. All rights reserved.
• Schema inference from data files is useful to reduce the barrier to
analyzing complex source data
• as an example, log data often has hundreds of fields
• the time required to create the DDL manually is substantial
• Example: schema inference from structured data files
• available today:
• future formats: XML, json, Avro
CREATE TABLE LogData LIKE PARQUET ‘/log-data.pq’
Schema Inference and Schema Evolution
31
© Cloudera, Inc. All rights reserved.
• Schema evolution:
• a necessary follow-on to schema inference: every schema evolves over
time; explicit maintenance is as time-consuming as the initial creation
• algorithmic schema evolution requires sticking to generally safe schema
modifications: adding new fields
• adding new top-level columns
• adding fields within structs
Schema Inference and Schema Evolution
32
© Cloudera, Inc. All rights reserved.
• Example workflow:
• scans data to determine new columns/fields to add
• synchronous: if there is an error, the ‘load’ is aborted and the user
notified
LOAD DATA INPATH ‘/path’ INTO LogData SOURCE FORMAT JSON WITH SCHEMA EXPANSION;
Schema Inference and Schema Evolution
33
© Cloudera, Inc. All rights reserved.
• CREATE TABLE … LIKE <File>:
• available today for Parquet
• Impala 2.3+ for Avro, JSON, XML
• Nested types: Impala 2.3
• Background format conversion: Impala 2.4
• Derived tables: > Impala 2.4
Timeline of Features in Impala
34
© Cloudera, Inc. All rights reserved.
• Hadoop offers a number of advantages over traditional multi-platform ETL
solutions:
• availability of all data sets on a single platform
• data becomes accessible through SQL as soon as it lands
• However, this can be improved further:
• a richer analytic SQL that is extended to handle nested data
• an automated background conversion process that preserves an up-to-
date view of all data while providing BI-typical performance
• a declarative transformation process that focuses on application
semantics and removes operational complexity
• simple automation of initial schema creation and subsequent maintenance
that makes dealing with large, complex schemas less labor-intensive
Conclusion
35
‹#›© Cloudera, Inc. All rights reserved.
Thank you
Marcel Kornacker | marcel@cloudera.com
36

More Related Content

What's hot

Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
Scott Leberknight
 
Real-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaReal-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using Impala
Jason Shih
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
markgrover
 
Cloudera impala
Cloudera impalaCloudera impala
Cloudera impala
Swiss Big Data User Group
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
Swiss Big Data User Group
 
Data Migration with Spark to Hive
Data Migration with Spark to HiveData Migration with Spark to Hive
Data Migration with Spark to Hive
Databricks
 
Impala presentation
Impala presentationImpala presentation
Impala presentation
trihug
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris Tsirogiannis
Felicia Haggarty
 
Query Compilation in Impala
Query Compilation in ImpalaQuery Compilation in Impala
Query Compilation in ImpalaCloudera, Inc.
 
Incredible Impala
Incredible Impala Incredible Impala
Incredible Impala
Gwen (Chen) Shapira
 
Cloudera Impala Source Code Explanation and Analysis
Cloudera Impala Source Code Explanation and AnalysisCloudera Impala Source Code Explanation and Analysis
Cloudera Impala Source Code Explanation and Analysis
Yue Chen
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in Hadoop
Cloudera, Inc.
 
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera, Inc.
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache Hadoop
Chicago Hadoop Users Group
 
HBase and Impala Notes - Munich HUG - 20131017
HBase and Impala Notes - Munich HUG - 20131017HBase and Impala Notes - Munich HUG - 20131017
HBase and Impala Notes - Munich HUG - 20131017
larsgeorge
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQL
kristinferrier
 
Node labels in YARN
Node labels in YARNNode labels in YARN
Node labels in YARN
Wangda Tan
 
(Aaron myers) hdfs impala
(Aaron myers)   hdfs impala(Aaron myers)   hdfs impala
(Aaron myers) hdfs impalaNAVER D2
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache Hive
Will Du
 
End-to-end Data Governance with Apache Avro and Atlas
End-to-end Data Governance with Apache Avro and AtlasEnd-to-end Data Governance with Apache Avro and Atlas
End-to-end Data Governance with Apache Avro and Atlas
DataWorks Summit
 

What's hot (20)

Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
 
Real-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaReal-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using Impala
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
Cloudera impala
Cloudera impalaCloudera impala
Cloudera impala
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Data Migration with Spark to Hive
Data Migration with Spark to HiveData Migration with Spark to Hive
Data Migration with Spark to Hive
 
Impala presentation
Impala presentationImpala presentation
Impala presentation
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris Tsirogiannis
 
Query Compilation in Impala
Query Compilation in ImpalaQuery Compilation in Impala
Query Compilation in Impala
 
Incredible Impala
Incredible Impala Incredible Impala
Incredible Impala
 
Cloudera Impala Source Code Explanation and Analysis
Cloudera Impala Source Code Explanation and AnalysisCloudera Impala Source Code Explanation and Analysis
Cloudera Impala Source Code Explanation and Analysis
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in Hadoop
 
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache Hadoop
 
HBase and Impala Notes - Munich HUG - 20131017
HBase and Impala Notes - Munich HUG - 20131017HBase and Impala Notes - Munich HUG - 20131017
HBase and Impala Notes - Munich HUG - 20131017
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQL
 
Node labels in YARN
Node labels in YARNNode labels in YARN
Node labels in YARN
 
(Aaron myers) hdfs impala
(Aaron myers)   hdfs impala(Aaron myers)   hdfs impala
(Aaron myers) hdfs impala
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache Hive
 
End-to-end Data Governance with Apache Avro and Atlas
End-to-end Data Governance with Apache Avro and AtlasEnd-to-end Data Governance with Apache Avro and Atlas
End-to-end Data Governance with Apache Avro and Atlas
 

Similar to Friction-free ETL: Automating data transformation with Impala | Strata + Hadoop World London 2015

Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data Pipeline
Chester Chen
 
From Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLFrom Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETL
Cloudera, Inc.
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
Jeff Holoman
 
Democratizing Data
Democratizing DataDemocratizing Data
Democratizing Data
Databricks
 
Triple C - Centralize, Cloudify and Consolidate Dozens of Oracle Databases (O...
Triple C - Centralize, Cloudify and Consolidate Dozens of Oracle Databases (O...Triple C - Centralize, Cloudify and Consolidate Dozens of Oracle Databases (O...
Triple C - Centralize, Cloudify and Consolidate Dozens of Oracle Databases (O...
Lucas Jellema
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
DataStax
 
ORACLE 12C-New-Features
ORACLE 12C-New-FeaturesORACLE 12C-New-Features
ORACLE 12C-New-FeaturesNavneet Upneja
 
HBase Data Modeling and Access Patterns with Kite SDK
HBase Data Modeling and Access Patterns with Kite SDKHBase Data Modeling and Access Patterns with Kite SDK
HBase Data Modeling and Access Patterns with Kite SDK
HBaseCon
 
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Databricks
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
DataWorks Summit
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
Abdelkrim Hadjidj
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
Thejas Nair
 
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Michael Rys
 
Technical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdfTechnical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdf
Ilham31574
 
1696153390685.pdf
1696153390685.pdf1696153390685.pdf
1696153390685.pdf
AhmedAnwar784575
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
Mike Percy
 
The Holy Grail of Data Analytics
The Holy Grail of Data AnalyticsThe Holy Grail of Data Analytics
The Holy Grail of Data Analytics
Dan Lynn
 
Spark etl
Spark etlSpark etl
Spark etl
Imran Rashid
 
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
DataStax
 

Similar to Friction-free ETL: Automating data transformation with Impala | Strata + Hadoop World London 2015 (20)

Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data Pipeline
 
From Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLFrom Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETL
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Democratizing Data
Democratizing DataDemocratizing Data
Democratizing Data
 
Triple C - Centralize, Cloudify and Consolidate Dozens of Oracle Databases (O...
Triple C - Centralize, Cloudify and Consolidate Dozens of Oracle Databases (O...Triple C - Centralize, Cloudify and Consolidate Dozens of Oracle Databases (O...
Triple C - Centralize, Cloudify and Consolidate Dozens of Oracle Databases (O...
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
ORACLE 12C-New-Features
ORACLE 12C-New-FeaturesORACLE 12C-New-Features
ORACLE 12C-New-Features
 
HBase Data Modeling and Access Patterns with Kite SDK
HBase Data Modeling and Access Patterns with Kite SDKHBase Data Modeling and Access Patterns with Kite SDK
HBase Data Modeling and Access Patterns with Kite SDK
 
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 
Ado
AdoAdo
Ado
 
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
 
Technical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdfTechnical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdf
 
1696153390685.pdf
1696153390685.pdf1696153390685.pdf
1696153390685.pdf
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
The Holy Grail of Data Analytics
The Holy Grail of Data AnalyticsThe Holy Grail of Data Analytics
The Holy Grail of Data Analytics
 
Spark etl
Spark etlSpark etl
Spark etl
 
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
Cloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
Cloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
Cloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 

Recently uploaded (20)

DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 

Friction-free ETL: Automating data transformation with Impala | Strata + Hadoop World London 2015

  • 1. ‹#›© Cloudera, Inc. All rights reserved. Marcel Kornacker // Cloudera, Inc. Friction-Free ELT: Automating Data Transformation with Cloudera Impala
  • 2. © Cloudera, Inc. All rights reserved. Outline • Evolution of ELT in the context of analytics • traditional systems • Hadoop today • Cloudera’s vision for ELT • make most of it disappear • automate data transformation 2
  • 3. © Cloudera, Inc. All rights reserved. Traditional ELT • Extract: physical extraction from source data store • could be an RDBMS acting as an operational data store • or log data materialized as json • Load: the targeted analytic DBMS converts the transformed data into its binary format (typically columnar) • Transform: • data cleansing and standardization • conversion of naturally complex/nested data into a flat relational schema 3
  • 4. © Cloudera, Inc. All rights reserved. • Three aspects to the traditional ELT process: 1. semantic transformation such as data standardization/cleansing » makes data more queryable, adds value 2. representational transformation: from source to target schema (from complex/nested to flat relational) » “lateral” transformation that doesn’t change semantics, adds operational overhead 3. data movement: from source to staging area to target system » adds yet more operational overhead Traditional ELT 4
  • 5. © Cloudera, Inc. All rights reserved. • The goals of “analytics with no ELT”: • simplify aspect 1 • eliminate aspects 2 and 3 Traditional ELT 5
  • 6. © Cloudera, Inc. All rights reserved. A typical ELT workflow with Hadoop looks like this: raw source data initially lands in HDFS (examples: text/xml/json log files) … Text, XML, JSON ELT with Hadoop Today 6
  • 7. © Cloudera, Inc. All rights reserved. … map that data into a table to make it queryable … Text, XML, JSON Original Data CREATE TABLE RawLogData (…) ROW FORMAT DELIMITED FIELDS LOCATION ‘/raw-log-data/‘; ELT with Hadoop Today 7
  • 8. © Cloudera, Inc. All rights reserved. … create the target table at a different location … Text, XML, JSON Original Data Parquet CREATE TABLE LogData (…) STORED AS PARQUET LOCATION ‘/log-data/‘; ELT with Hadoop Today 8
  • 9. © Cloudera, Inc. All rights reserved. … convert the raw source data to the target format … Text, XML, JSON Original Data Parquet INSERT INTO LogData SELECT * FROM RawLogData; ELT with Hadoop Today 9
  • 10. © Cloudera, Inc. All rights reserved. … the data is then available for batch reporting/analytics (via Impala, Hive, Pig, Spark) or interactive analytics (via Impala, Search) Text, XML, JSON Original Data Parquet Impala Hive Spark Pig ELT with Hadoop Today 10
  • 11. © Cloudera, Inc. All rights reserved. • Compared to traditional ELT, this has several advantages: • Hadoop acts as a centralized location for all data: raw source data lives side by side with the transformed data • data does not need to be moved between multiple platforms/clusters • data in the raw source format is queryable as soon as it lands, although at reduced performance, compared to an optimized columnar data format ELT with Hadoop Today 11
  • 12. © Cloudera, Inc. All rights reserved. • However, even this still has drawbacks: • new data needs to be loaded periodically into the target table • doing that reliably and within SLAs can be a challenge • you now have two tables: one with current but slow data another with lagging but fast data Text, XML, JSON Original Data Parquet Impala Hive Spark Pig Another user-managed data pipeline! ELT with Hadoop Today 12
  • 13. © Cloudera, Inc. All rights reserved. • No explicit loading/conversion step to move raw data into a target table • A single view of the data that is up-to-date and (mostly) in an efficient columnar format • automated with custom logic Text, XML, JSON Parquet Impala Hive Spark Pig automated A Vision for Analytics with No ELT 13
  • 14. © Cloudera, Inc. All rights reserved. • support for complex/nested schemas » avoid remapping of raw data into a flat relational schema • background and incremental data conversion » retain in-place single view of entire data set, with most data being in an efficient format • bonus: schema inference and schema evolution » start analyzing data as soon as it arrives, regardless of its complexity A Vision for Analytics with No ELT 14
  • 15. © Cloudera, Inc. All rights reserved. • Standard relational: all columns have scalar values: CHAR(n), DECIMAL(p, s), INT, DOUBLE, TIMESTAMP, etc. • Complex types: structs, arrays, maps in essence, a nested relational schema • Supported file formats: Parquet, json, XML, Avro • Design principle for SQL extensions: maintain SQL’s way of dealing with multi-valued data Support for Complex Schemas in Impala 15
  • 16. © Cloudera, Inc. All rights reserved. Sample schema CREATE TABLE Customers ( cid BIGINT, address STRUCT { street STRING, zip INT, city STRING }, orders ARRAY<STRUCT { oid BIGINT, total DECIMAL(9, 2), items ARRAY< STRUCT { iid BIGINT, qty INT, price DECIMAL(9, 2) }> }> ) Support for Complex Schemas in Impala 16
  • 17. © Cloudera, Inc. All rights reserved. • Paths in expressions: generalization of column references to drill through structs • Can appear anywhere a conventional column reference is legal SELECT address.zip, c.address.street FROM Customers c WHERE address.city = ‘Oakland’ SQL Syntax Extensions: Structs 17
  • 18. © Cloudera, Inc. All rights reserved. • Basic idea: nested collections are referenced like tables in the FROM clause SELECT SUM(i.price * i.qty) FROM Customers.orders.items i WHERE i.price > 10 SQL Syntax Extensions: Arrays and Maps 18
  • 19. © Cloudera, Inc. All rights reserved. • Parent/child relationships don’t require join predicates identical to SELECT c.cid, o.total FROM Customers c, c.orders o SELECT c.cid, o.total FROM Customers c INNER JOIN c.orders o SQL Syntax Extensions: Arrays and Maps 19
  • 20. © Cloudera, Inc. All rights reserved. • All join types are supported (INNER, OUTER, SEMI, ANTI) • Advanced querying capabilities via correlated inline views and subqueries SELECT c.cid FROM Customers c LEFT ANTI JOIN c.orders o SELECT c.cid FROM Customers c WHERE EXISTS (SELECT * FROM c.orders.items where iid = 117) SQL Syntax Extensions: Arrays and Maps 20
  • 21. © Cloudera, Inc. All rights reserved. • Aggregates over collections are concise and easy to read instead of SELECT c.cid, COUNT(c.orders), AVG(c.orders.items.price) FROM Customers c SELECT c.cid, a, b FROM Customers c, (SELECT COUNT(*) AS a FROM c.orders), (SELECT AVG(price) AS b FROM c.orders.items) SQL Syntax Extensions: Arrays and Maps 21
  • 22. © Cloudera, Inc. All rights reserved. • Aggregates over collections can replace subqueries instead of SELECT c.cid FROM Customers c WHERE COUNT(c.orders) > 10 SELECT c.cid FROM Customers c WHERE (SELECT COUNT(*) FROM c.orders) > 10 SQL Syntax Extensions: Arrays and Maps 22
  • 23. © Cloudera, Inc. All rights reserved. • Parquet optimizes storage of complex schemas by inlining structural data into the columnar data (repetition/definition levels) • No performance penalty for accessing nested collections! • Parquet translates logical hierarchy into physical collocation: • you don’t need parent-child joins • queries run faster! Complex Schemas with Parquet 23
  • 24. © Cloudera, Inc. All rights reserved. • Automated data conversion takes care of • format conversion: from text/JSON/Avro/… to Parquet • compaction: multiple smaller files are coalesced into a single larger one • Sample workflow: • create table for data: • attach data to table: CREATE TABLE LogData (…) WITH CONVERSION TO PARQUET; LOAD DATA INPATH ‘/raw-log-data/file1’ INTO LogData SOURCE FORMAT JSON; Automated Data Conversion 24
  • 25. © Cloudera, Inc. All rights reserved. Automated Data Conversion 25 JSON Impala Hive Spark Pig PARQUETJSON Single-Table View
  • 26. © Cloudera, Inc. All rights reserved. • Conversion process • atomic: the switch from the source to the target data files is atomic from the perspective of a running query (but any running query sees the full data set) • redundant: with option to retain original data • incremental: Impala’s catalog service detects new data files that are not in the target format automatically Automated Data Conversion 26
  • 27. © Cloudera, Inc. All rights reserved. • Specify data transformation via “view” definition: • derived table is expressed as Select, Project, Join, Aggregation • custom logic incorporated via UDFs, UDAs CREATE DERIVED TABLE CleanLogData AS SELECT StandardizeName(name), StandardizeAddr(addr, zip), … FROM LogData STORED AS PARQUET; Automating Data Transformation: Derived Tables 27
  • 28. © Cloudera, Inc. All rights reserved. • From the user’s perspective: • table is queryable like any other table (but doesn’t allow INSERTs) • reflects all data visible in source tables at time of query (not: at time of CREATE) • performance is close to that of a table created with CREATE TABLE … AS SELECT (ie, that of a static snapshot) Automating Data Transformation: Derived Tables 28
  • 29. © Cloudera, Inc. All rights reserved. • From the system’s perspective: • table is Union of • physically materialized data, derived from input tables as of some point in the past • view over yet-unprocessed data of input tables • table is updated incrementally (and in the background) when new data is added to input tables Automating Data Transformation: Derived Tables 29
  • 30. © Cloudera, Inc. All rights reserved. Automating Data Transformation: Derived Tables 30 Impala Hive Spark Pig Delta Query Materialized Data Base Table Updates Single-Table View
  • 31. © Cloudera, Inc. All rights reserved. • Schema inference from data files is useful to reduce the barrier to analyzing complex source data • as an example, log data often has hundreds of fields • the time required to create the DDL manually is substantial • Example: schema inference from structured data files • available today: • future formats: XML, json, Avro CREATE TABLE LogData LIKE PARQUET ‘/log-data.pq’ Schema Inference and Schema Evolution 31
  • 32. © Cloudera, Inc. All rights reserved. • Schema evolution: • a necessary follow-on to schema inference: every schema evolves over time; explicit maintenance is as time-consuming as the initial creation • algorithmic schema evolution requires sticking to generally safe schema modifications: adding new fields • adding new top-level columns • adding fields within structs Schema Inference and Schema Evolution 32
  • 33. © Cloudera, Inc. All rights reserved. • Example workflow: • scans data to determine new columns/fields to add • synchronous: if there is an error, the ‘load’ is aborted and the user notified LOAD DATA INPATH ‘/path’ INTO LogData SOURCE FORMAT JSON WITH SCHEMA EXPANSION; Schema Inference and Schema Evolution 33
  • 34. © Cloudera, Inc. All rights reserved. • CREATE TABLE … LIKE <File>: • available today for Parquet • Impala 2.3+ for Avro, JSON, XML • Nested types: Impala 2.3 • Background format conversion: Impala 2.4 • Derived tables: > Impala 2.4 Timeline of Features in Impala 34
  • 35. © Cloudera, Inc. All rights reserved. • Hadoop offers a number of advantages over traditional multi-platform ETL solutions: • availability of all data sets on a single platform • data becomes accessible through SQL as soon as it lands • However, this can be improved further: • a richer analytic SQL that is extended to handle nested data • an automated background conversion process that preserves an up-to- date view of all data while providing BI-typical performance • a declarative transformation process that focuses on application semantics and removes operational complexity • simple automation of initial schema creation and subsequent maintenance that makes dealing with large, complex schemas less labor-intensive Conclusion 35
  • 36. ‹#›© Cloudera, Inc. All rights reserved. Thank you Marcel Kornacker | marcel@cloudera.com 36