SQL on Hadoop for the Oracle Professional
Michael Rainey
Collaborate 18
2
[Diagram: Gluent Transparent Data Virtualization connects big data sources (IoT data, MSSQL, Oracle) to App X, App Y, and App Z. Connect all data to all applications!]
3
Why Hadoop?
4
The evolution of data
Source: http://watalon.com/?p=720
5
What is so different about
Hadoop?
6
RDBMS architecture
[Diagram: DB App 1, DB App 2, ... DB App N all reach their data through shared SAN storage disks. Bottleneck!]
RDBMS requires data to be "moved" to the application for processing.
Running out of space? Expand the SAN.
7
HDFS scalability
[Diagram: Hadoop Distributed File System (HDFS). NameNode 1 and NameNode 2 hold the metadata, answering the client's "Where is my data?" with "Access your data!"; DataNodes run on Node 1, Node 2, Node 3, Node 4, ... Node N]
Running out of space? Add nodes.
8
HDFS scalability and performance
[Diagram: HDFS NameNodes (metadata) plus DataNodes on Node 1 through Node N, each with its own local disks]
Computation occurs where the data lives.
"Moving Computation is Cheaper than Moving Data"
9
How data storage works in HDFS
[Diagram: a data file is split into Block 1, Block 2, and Block 3, with copies spread across Node 1 through Node 5]
The NameNode determines where blocks will be stored.
Blocks that become unavailable will be replicated from other nodes, ensuring at least 3 copies exist.
10
• Scalability in Software!
• Open Data Formats
• Future-proof!
• One Data, Many Engines!
Scalable and open!
Hive Impala Spark Presto Giraph
SQL-on-Hadoop is only
one of many
applications of Hadoop
- Graph Processing
- Full text search
- Image Processing
- Log processing
- Streams
11
Data processing examples
[Diagram: processing stages - read data from disks; filter data ("where sal > 50000"); simple processing (aggregate, join, sort, ...); complex processing (business logic, analytics) - mapped onto three stacks: RDBMS + SAN (storage array, database server, app server), Oracle Exadata (storage cell, database server, app server), and Hadoop + data processing framework (app code + Spark* on HDFS)]
12
HDFS considerations
• HDFS is designed for "write once, read many" use cases
• Large sequential streaming reads (scans) of files
• Not optimal for random seeks and small IOs
• "Inserts" are appended to the end of the file
• No updates (by default)
• HDFS performs the replication ("mirroring") and is designed anticipating
frequent DataNode failures
Various other applications (HBase, Kudu, etc.) allow updates and random lookups, resolve some of the other "tradeoffs", and should be considered.
13
• Standard file formats
• Plain text (CSV, tab-delimited)
• Structured text (XML, JSON)
• Binary (images)
• Serialization formats
• Avro
• Columnar formats
• Parquet
• ORC
Data formats supported by Hadoop
14
Common SQL on Hadoop myths
15
• MapReduce is just one component of Hadoop
• In Hadoop v1, MapReduce was THE component for data retrieval and resource management
• Hadoop v2 introduced YARN, and the open source community has built up alternatives to MapReduce for data access
• MapReduce is not the only way to access data
• You don’t need an army of Java developers to use Hadoop
• SQL engines on Hadoop make use of a familiar syntax for accessing data
• MapReduce is rarely used in practice or by other software (e.g. Hive on Spark)
Myth #1: Hadoop is MapReduce
Hadoop != MapReduce
16
• There are many more SQL engines on Hadoop that can provide SQL-like
data access, just as Hive does
Myth #2: SQL on Hadoop is basically Hive
SQL on Hadoop is rapidly evolving!
17
• Hadoop is not a single product, but an ecosystem
• Mostly (Apache) open source
• Every component/layer has multiple alternatives
• Different vendor distributions prefer different components
• SQL-on-Hadoop: Hive, Impala, Drill, SparkSQL, Presto, Phoenix
• Storage subsystems: HDFS, Kudu, S3, MapR-FS, EMC Isilon
• Resource management: YARN, Mesos, Myriad, MapReduce v1
• Hadoop Ecosystem Table
• A good reference for looking up component names & what they do
• https://hadoopecosystemtable.github.io/
Hadoop ecosystem
18
SQL Processing in Hadoop
19
Example syntax: Impala SQL Language
[WITH name AS (select_expression) [, ...] ]
SELECT
[ALL | DISTINCT]
[STRAIGHT_JOIN]
expression [, expression ...]
FROM table_reference [, table_reference ...]
[[FULL | [LEFT | RIGHT] INNER | [LEFT | RIGHT] OUTER | [LEFT | RIGHT]
SEMI | [LEFT | RIGHT] ANTI | CROSS]
JOIN table_reference [ON join_equality_clauses | USING (col1[, col2 ...])] ...
WHERE conditions
GROUP BY { column | expression [, ...] }
HAVING conditions
ORDER BY { column | expression [ASC | DESC] [NULLS FIRST | NULLS LAST] [, ...] }
LIMIT expression [OFFSET expression]
[UNION [ALL] select_statement] ...]
SQL-like syntax, very
familiar to relational
database developers!
20
SQL Execution on Hadoop: MPP Distributed processing (ideal)
[Diagram: Node 1 through Node N each scan, filter, and aggregate data from their own disks at GB/s; Node 5 acts as the "query master"]

SELECT state, SUM(amount)
FROM sales
WHERE channel = 'Online'
GROUP BY state

Each node sends its intermediate resultset to the coordinator for re-aggregation (the final aggregate).
Apache Hive, Impala, and Spark all use the HiveServer2 Thrift protocol (the listener that accepts queries).
A global view of entire cluster data.
21
• SUM, COUNT, AVG, MIN, MAX
• Each cluster node can compute a SUM on its local data
• The coordinator performs a final global SUM (by summing together the local SUMs)
Scalability – Additive Aggregations

SELECT state, SUM(amount)
FROM sales
WHERE channel = 'Online'
GROUP BY state

[Diagram: local aggregation (parallel) against each node's disk produces intermediate results (Node 1: CA = 10, NY = 5, FL = 1; Node 2: CA = 7, SC = 4, NY = 3; ...); the "query master" runs the global/final aggregation (serial), producing the final results: CA = 17, NY = 8, FL = 1, SC = 4]
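The two-phase aggregation can be sketched in Python (a toy model of the parallel-local / serial-global split, not a real engine; the shard contents mirror the example values above):

```python
from collections import Counter

# Each "node" holds a shard of the sales table: (state, channel, amount) rows.
node_shards = [
    [("CA", "Online", 10), ("NY", "Online", 5), ("FL", "Online", 1)],
    [("CA", "Online", 7), ("SC", "Online", 4), ("NY", "Online", 3),
     ("CA", "Store", 99)],  # filtered out by the WHERE clause
]

def local_aggregate(shard):
    """Scan + filter + partial SUM: runs in parallel on each node."""
    acc = Counter()
    for state, channel, amount in shard:
        if channel == "Online":      # WHERE channel = 'Online'
            acc[state] += amount     # local SUM(amount) ... GROUP BY state
    return acc

# The coordinator ("query master") merges the partial sums: SUM is additive,
# so summing the local SUMs yields the correct global SUM.
final = Counter()
for partial in map(local_aggregate, node_shards):
    final.update(partial)          # Counter.update adds counts together

print(dict(final))  # {'CA': 17, 'NY': 8, 'FL': 1, 'SC': 4}
```

Only the small per-state partial sums cross the network, never the raw rows; that is what makes additive aggregations scale.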
22
SQL Execution on Hadoop: MPP Distributed processing (big queries)
[Diagram: Node 1 through Node N each scan, filter, and aggregate their local disk data at GB/s; the "query master" produces the final aggregate via the HiveServer2 Thrift protocol listener]

SELECT state, COUNT(DISTINCT prod_id)
FROM sales s, customers c
WHERE s.cust_id = c.cust_id
AND c.cust_age > 20
GROUP BY state

A global view of entire cluster data.
Nodes need to exchange data (shuffle) to see the full picture.
23
• In a distributed system with "random" physical placement of data in shards, a join must be able to compare all data from both of its sides (after any direct filters)
• If one dataset is small enough, it can be broadcast and kept in memory on all nodes for node-local joining (a broadcast join or replicated join)
Scalability – Broadcast Joins
[Diagram: a large fact table spread across Nodes 1 through 9 (logical layout); the small dimension table (logical layout) is broadcast to all nodes]
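A broadcast join can be sketched in a few lines of Python (table names and values are invented; each list stands in for one node's shard of the fact table, and the small dict is the broadcast copy of the dimension table):

```python
# Small dimension table, broadcast to every node: cust_id -> tier.
dim = {"C1": "Gold", "C2": "Silver"}   # fits comfortably in memory

fact_shards = [                         # fact rows spread across nodes
    [("C1", 100), ("C2", 50)],
    [("C1", 25), ("C3", 10)],           # C3 has no dimension row
]

def node_local_join(shard, broadcast_dim):
    """Inner join via hash lookup against the broadcast copy: no shuffle."""
    return [(cust, amt, broadcast_dim[cust])
            for cust, amt in shard if cust in broadcast_dim]

joined = [row for shard in fact_shards
          for row in node_local_join(shard, dim)]
print(joined)  # [('C1', 100, 'Gold'), ('C2', 50, 'Silver'), ('C1', 25, 'Gold')]
```

Because every node holds the full dimension table, the large fact table never moves: each node joins its own shard locally.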
24
• If both sides of the join are too big for broadcasting and keeping in each node's memory
Scalability – Large Fact-to-Fact Joins
[Diagram: a large fact table and another large table, each spread across Nodes 1 through 9 (logical layout)]
Hash and shuffle non-overlapping subsets of data to the other join side, using a hash function on the join column.
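The hash-and-shuffle step can be sketched as follows (a single-process simulation; `orders` and `returns` are invented stand-ins for two large fact tables). Both sides are partitioned with the same hash function on the join key, so matching keys always land in the same partition:

```python
def shuffle(rows, key_index, n_nodes):
    """Hash-partition rows on the join key; matching keys get the same node."""
    parts = [[] for _ in range(n_nodes)]
    for row in rows:
        parts[hash(row[key_index]) % n_nodes].append(row)
    return parts

orders  = [("C1", 100), ("C2", 50), ("C3", 70)]   # (cust_id, amount)
returns = [("C1", 10), ("C3", 5), ("C3", 7)]      # (cust_id, refund)

N = 3  # number of nodes
order_parts  = shuffle(orders, 0, N)
return_parts = shuffle(returns, 0, N)

# Each node joins only its own partition pair: because both sides used the
# same hash function, no key can match across different partitions.
joined = []
for o_part, r_part in zip(order_parts, return_parts):
    for cust, amount in o_part:
        for rcust, refund in r_part:
            if cust == rcust:
                joined.append((cust, amount, refund))

print(sorted(joined))  # [('C1', 100, 10), ('C3', 70, 5), ('C3', 70, 7)]
```

The cost being modeled is the shuffle itself: unlike a broadcast join, both large tables move across the network, once each.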
25
• Pre-join large fact tables (that are frequently joined)
• Denormalization!
• Hadoop storage is cheap – it’s affordable to keep multiple copies
• Columnar data formats
• Queries access & scan only narrow slices of the wide table
• De-normalized duplicate values compressed well thanks to columnar compression
• Approximate distinct operations
• HyperLogLog: NDV() instead of COUNT(DISTINCT)
• Partition-wise joins help with memory usage
• Hive calls it a bucketed map join (SMB map join) – shuffling will still happen
• On Impala, currently the SQL query must join individual partitions (UNION ALL)
How to improve/combat/fix these issues
26
• Provides high-performance, low-latency SQL queries
• Enables interactive exploration and fine-tuning of analytic queries
• Multiple open data formats (Parquet, Avro, CSV, etc) and storage systems
(HDFS, Kudu, S3, etc) can be accessed via Impala
• Uses the Hive metastore – the “data dictionary”
for Hadoop
• Stores table structure metadata (columns, datatypes)
and file location in HDFS
• Serves two purposes: data abstraction and data discovery
• Impala uses the same metadata (Hive metastore),
SQL syntax, drivers, and UI (Hue) as Hive
Apache Impala
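The metastore's "data dictionary" role can be illustrated with a toy model in Python (the table name, columns, and HDFS path are invented for illustration; the real metastore is a service backed by an RDBMS):

```python
# Toy model of the Hive metastore: it stores table structure (columns,
# datatypes), the storage format, and the file location on HDFS, while the
# data itself lives in open-format files. Any engine sharing the metastore
# (Hive, Impala, Spark) sees the same definitions.
metastore = {
    "sales": {
        "columns": [("state", "STRING"), ("channel", "STRING"),
                    ("amount", "DECIMAL(10,2)")],
        "format": "PARQUET",
        "location": "hdfs://namenode:8020/warehouse/sales",
    }
}

def describe(table):
    """Roughly what DESCRIBE <table> returns in any of those engines."""
    meta = metastore[table]
    return [f"{name} {dtype}" for name, dtype in meta["columns"]]

print(describe("sales"))  # ['state STRING', 'channel STRING', 'amount DECIMAL(10,2)']
```

This shared layer is what makes "one data, many engines" work: the files on HDFS stay put, and each engine resolves table names through the same dictionary.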
28
Explain plan - Impala
30
Graphical plans
32
SQL on Hadoop processing engines compared

|                  | Impala | Hive | Drill | Presto |
|------------------|--------|------|-------|--------|
| Data sources     | HDFS, Kudu, cloud storage | HDFS, cloud storage, HBase | HDFS, NoSQL, cloud storage | HDFS, NoSQL, RDBMS |
| Data model       | Relational | Relational | Schema-free JSON | Relational |
| Metadata         | Hive Metastore | Hive Metastore | Hive Metastore (optional) | Hive Metastore (remote) |
| Deployment model | Collocated with Hadoop | Collocated with Hadoop | Standalone or collocated with Hadoop | Collocated with Hadoop |
| SQL              | HiveQL syntax | HiveQL syntax | ANSI SQL | ANSI SQL |
| Use case         | Fast, interactive analytics; high concurrency; large datasets | Data warehouse / reporting analytics; data-intensive, complex queries | Self-service, SQL-based analytics; heterogeneous sources | Interactive analytics on large datasets |
33
• Transactional systems (OLTP)
• Simple - it’s possible
• Complex (ERP) - No (not anytime soon!)
• Data warehouse / Reporting / Analytics
• Traditional DW - maybe, but it will take time
• Big Data - Yes
• ETL and data integration
• Probably!
• Most ETL tools already support integration with
Big Data transformation technologies
Is Hadoop going to replace my database?
34
• Large “fact to fact” table joins
• DML & Updates
• SQL engine sophistication & syntax richness
Where SQL on Hadoop can improve
hive> SELECT SUM(duration)
> FROM call_detail_records
> WHERE
> type = 'INTERNET'
> OR phone_number IN ( SELECT phone_number
> FROM customer_details
> WHERE region = 'R04' );
FAILED: SemanticException [Error 10249]: Line 5:17
Unsupported SubQuery Expression 'phone_number':
Only SubQuery expressions that are top level
conjuncts are allowed
We RDBMS developers have
been spoiled with very
sophisticated SQL engines
for years :)
Remember,
Hadoop
evolves very
fast!
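A common workaround for the unsupported expression above is to rewrite the disjunctive IN-subquery as an outer join against the de-duplicated subquery result, keeping the OR in the WHERE clause. The sketch below verifies that the rewrite is semantically equivalent, using Python's sqlite3 standard library (the table contents are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE call_detail_records (phone_number TEXT, type TEXT, duration INTEGER)")
cur.execute("CREATE TABLE customer_details (phone_number TEXT, region TEXT)")
cur.executemany("INSERT INTO call_detail_records VALUES (?, ?, ?)", [
    ("555-0001", "INTERNET", 10),   # matches the type filter
    ("555-0002", "VOICE", 20),      # matches only via the R04 subquery
    ("555-0003", "VOICE", 40),      # matches neither condition
])
cur.executemany("INSERT INTO customer_details VALUES (?, ?)", [
    ("555-0002", "R04"),
    ("555-0003", "R01"),
])

# Rewrite: outer-join the de-duplicated subquery result, then express the
# OR condition with an IS NOT NULL test on the joined side.
rewritten = """
SELECT SUM(c.duration)
FROM call_detail_records c
LEFT JOIN (SELECT DISTINCT phone_number
           FROM customer_details
           WHERE region = 'R04') d
  ON c.phone_number = d.phone_number
WHERE c.type = 'INTERNET'
   OR d.phone_number IS NOT NULL
"""
total = cur.execute(rewritten).fetchone()[0]
print(total)  # 30  (10 from the INTERNET row + 20 from the R04 customer)
```

The DISTINCT in the derived table matters: without it, a customer with multiple matching rows would duplicate fact rows and inflate the SUM.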
35
Hadoop in Action
36
Store, and access, everything
37
• Challenges storing and processing genomics data constrained
the set of possible research questions
• Data discovery was limited
• In what is commonly referred to as "lamp-posting," cancer researchers could only "look where the light was," though the most interesting research questions had to do with discovering answers still "in the dark"
• Proprietary platforms limited scholarly collaboration and reduced research
time
• Data remained constrained in proprietary systems operated by individual
researchers
• This IT legacy of isolated data siloes hindered future collaboration
Genomics cancer research at Arizona State University
Source: https://hortonworks.com/customers/arizona-state-university
38
• Hortonworks Data Platform with Hadoop stores genomic data, making it
broadly accessible at an affordable cost, allowing ASU researchers to:
• store and process huge amounts of data
• make that data and tools accessible to others within and outside of the university
• do it all at a cost that wouldn’t escalate as genomics data grew to petabytes in their
cluster
Store and access all data with Hadoop
“Your average database of 20 billion rows is simply unapproachable
with traditional, standard technology,” said Dr. Kenneth Buetow.
“With our Hadoop infrastructure, we can run data-intensive queries
of these large-scale resources, and they return results in seconds.
This is transformational.”
“My epiphany came when I ran a relatively complex SQL query against
a table that had 20 billion rows in it. The query returned results in a
minute or two. I was dumbfounded. Prior to using this environment,
one never could comprehensively construct the networks. Basically,
you couldn’t or didn’t ask those comprehensive questions.”
Source: https://hortonworks.com/customers/arizona-state-university
39
The hybrid world
40
• Securus Technologies is a major provider of prison telecommunications
and technology services in the United States
• Satellite Tracking of People (STOP)
handles offender ankle bracelet
device tracking
• Application users are parole/probation
officers and law enforcement agencies
• Challenges
• Database becoming increasingly overloaded, restricting application users
• IoT geospatial data constantly streams into overloaded RDBMS
• Only 2 days of history available to query without extreme wait times
• Cost to re-platform or expand database too high
Securus Technologies
41
• Approach
• Use Gluent Offload to sync 100% of application
data to Hadoop with no ETL
• Use Gluent Transparent Query to allow legacy
applications to access data without code rewrite
• Results and benefits
• Significant decrease in application query response
time
• 365 days of historical data returned in just minutes!
• Analysts began to rethink what’s possible with the
historical data
• Predictive and forensic type analytics, etc.
• No immediate need to re-platform application
• Fixed performance issues with zero code changes!
Securus Technologies
Dramatic performance improvements, huge cost savings, new business opportunities
…without rewriting legacy applications
[Chart labels: "DNF*" (*did not finish) and "Using Impala - SQL on Hadoop!"]
SQL on Hadoop for the Oracle Professional
Thank you for attending!
43
• Email: mrainey@gluent.com
• Twitter: @mRainey
• Website: gluent.com
• Hadoop resources
• https://gluent.com/hadoop-resources
Reach out for more info!
44
”All enterprise data is just a query away”
• Gluent Data Platform
• Supports all major Hadoop distributions, on-premises or in the cloud
• Consolidates data into a centralized location in open data formats
• Transparent Data Virtualization provides simple data sharing across the enterprise
• You decide when to…
• …write ETL code
• …rewrite applications
• …learn new technologies
Gluent Data Platform
45
Gluent’s hybrid data world
No ETL Data Sync
Transparent Query
Gluent connects all data to all applications!
46
• Selected as a Gartner 2017 Cool Vendor in Data Management
• https://gluent.com/cool-vendor-2017
• Recognized in “The 10 Coolest Big Data Startups of 2017” on CRN
• https://gluent.com/gluent-recognized-in-top-10-coolest-startups-of-2017-on-crn/
• Featured in 2017 Strata Startup Showcase
• https://gluent.com/gluent-strata-startup-showcase/
Industry recognition for Gluent
Data Warehouse - Incremental Migration to the CloudData Warehouse - Incremental Migration to the Cloud
Data Warehouse - Incremental Migration to the Cloud
 
Continuous Data Replication into Cloud Storage with Oracle GoldenGate
Continuous Data Replication into Cloud Storage with Oracle GoldenGateContinuous Data Replication into Cloud Storage with Oracle GoldenGate
Continuous Data Replication into Cloud Storage with Oracle GoldenGate
 
Going Serverless - an Introduction to AWS Glue
Going Serverless - an Introduction to AWS GlueGoing Serverless - an Introduction to AWS Glue
Going Serverless - an Introduction to AWS Glue
 
Offload, Transform, and Present - the New World of Data Integration
Offload, Transform, and Present - the New World of Data IntegrationOffload, Transform, and Present - the New World of Data Integration
Offload, Transform, and Present - the New World of Data Integration
 
Oracle Data Integrator 12c - Getting Started
Oracle Data Integrator 12c - Getting StartedOracle Data Integrator 12c - Getting Started
Oracle Data Integrator 12c - Getting Started
 
Streaming with Oracle Data Integration
Streaming with Oracle Data IntegrationStreaming with Oracle Data Integration
Streaming with Oracle Data Integration
 
Oracle data integrator 12c - getting started
Oracle data integrator 12c - getting startedOracle data integrator 12c - getting started
Oracle data integrator 12c - getting started
 
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data StreamingOracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
 
A Walk Through the Kimball ETL Subsystems with Oracle Data Integration
A Walk Through the Kimball ETL Subsystems with Oracle Data IntegrationA Walk Through the Kimball ETL Subsystems with Oracle Data Integration
A Walk Through the Kimball ETL Subsystems with Oracle Data Integration
 
Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data StreamingOracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data Streaming
 
Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data StreamingOracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data Streaming
 
A Walk Through the Kimball ETL Subsystems with Oracle Data Integration - Coll...
A Walk Through the Kimball ETL Subsystems with Oracle Data Integration - Coll...A Walk Through the Kimball ETL Subsystems with Oracle Data Integration - Coll...
A Walk Through the Kimball ETL Subsystems with Oracle Data Integration - Coll...
 
Practical Tips for Oracle Business Intelligence Applications 11g Implementations
Practical Tips for Oracle Business Intelligence Applications 11g ImplementationsPractical Tips for Oracle Business Intelligence Applications 11g Implementations
Practical Tips for Oracle Business Intelligence Applications 11g Implementations
 
GoldenGate and Oracle Data Integrator - A Perfect Match- Upgrade to 12c
GoldenGate and Oracle Data Integrator - A Perfect Match- Upgrade to 12cGoldenGate and Oracle Data Integrator - A Perfect Match- Upgrade to 12c
GoldenGate and Oracle Data Integrator - A Perfect Match- Upgrade to 12c
 
Real-Time Data Replication to Hadoop using GoldenGate 12c Adaptors
Real-Time Data Replication to Hadoop using GoldenGate 12c AdaptorsReal-Time Data Replication to Hadoop using GoldenGate 12c Adaptors
Real-Time Data Replication to Hadoop using GoldenGate 12c Adaptors
 
Tame Big Data with Oracle Data Integration
Tame Big Data with Oracle Data IntegrationTame Big Data with Oracle Data Integration
Tame Big Data with Oracle Data Integration
 
Real-time Data Warehouse Upgrade – Success Stories
Real-time Data Warehouse Upgrade – Success StoriesReal-time Data Warehouse Upgrade – Success Stories
Real-time Data Warehouse Upgrade – Success Stories
 
A Picture Can Replace A Thousand Words
A Picture Can Replace A Thousand WordsA Picture Can Replace A Thousand Words
A Picture Can Replace A Thousand Words
 
GoldenGate and Oracle Data Integrator - A Perfect Match...
GoldenGate and Oracle Data Integrator - A Perfect Match...GoldenGate and Oracle Data Integrator - A Perfect Match...
GoldenGate and Oracle Data Integrator - A Perfect Match...
 
GoldenGate and ODI - A Perfect Match for Real-Time Data Warehousing
GoldenGate and ODI - A Perfect Match for Real-Time Data WarehousingGoldenGate and ODI - A Perfect Match for Real-Time Data Warehousing
GoldenGate and ODI - A Perfect Match for Real-Time Data Warehousing
 

Recently uploaded

一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
theahmadsaood
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
AlejandraGmez176757
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
correoyaya
 

Recently uploaded (20)

一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 

SQL on Hadoop for the Oracle Professional

  • 11. Data processing examples
    • Pipeline stages: Read Data from Disks → Filter Data ("where sal > 50000") → Simple Processing (Aggregate, Join, Sort, ...) → Complex Processing (Business Logic, Analytics)
    • RDBMS + SAN: data flows from the Storage Array through the Database Server to the App Server – every stage after the read runs away from the storage
    • Oracle Exadata: filtering is pushed down to the Storage Cell; the remaining stages run on the Database Server and App Server
    • Hadoop + data processing framework: app code + Spark* runs on HDFS, so all stages execute where the data lives
  • 12. HDFS considerations
    • HDFS is designed for "write once, read many" use cases
      • Large sequential streaming reads (scans) of files; not optimal for random seeks and small IOs
      • "Inserts" are appended to the end of the file; no updates (by default)
    • HDFS performs the replication ("mirroring") itself and is designed anticipating frequent DataNode failures
    • Various other applications (HBase, Kudu, etc.) allow updates and random lookups, resolving some of these tradeoffs, and should be considered
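The replication behavior described above can be sketched as a toy model. This is purely illustrative – the function names (`place_blocks`, `re_replicate`) and the round-robin placement policy are made up for the sketch and are not the HDFS API or its real placement algorithm:

```python
# Toy model of HDFS-style block replication (not real HDFS code).
REPLICATION_FACTOR = 3

def place_blocks(blocks, nodes):
    """Assign each block to REPLICATION_FACTOR distinct nodes, round-robin."""
    placement = {b: set() for b in blocks}
    i = 0
    for b in blocks:
        for _ in range(REPLICATION_FACTOR):
            placement[b].add(nodes[i % len(nodes)])
            i += 1
    return placement

def re_replicate(placement, failed_node, live_nodes):
    """After a DataNode failure, restore REPLICATION_FACTOR copies of each block
    by copying from the surviving replicas to other live nodes."""
    for b, holders in placement.items():
        holders.discard(failed_node)
        for n in live_nodes:
            if len(holders) >= REPLICATION_FACTOR:
                break
            holders.add(n)
    return placement

nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(["blk1", "blk2", "blk3"], nodes)
re_replicate(placement, "node1", [n for n in nodes if n != "node1"])
```

After the simulated failure of `node1`, every block is back to three replicas spread over the surviving nodes – the same invariant the NameNode maintains.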
  • 13. Data formats supported by Hadoop
    • Standard file formats: plain text (CSV, tab-delimited), structured text (XML, JSON), binary (images)
    • Serialization formats: Avro
    • Columnar formats: Parquet, ORC
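The row-versus-columnar distinction behind Parquet and ORC can be illustrated in plain Python. This is a toy sketch of the two layouts, not a real file-format reader – the dictionaries stand in for on-disk storage:

```python
# Toy illustration of row-oriented vs. column-oriented storage layouts.
rows = [
    {"id": 1, "state": "CA", "amount": 10},
    {"id": 2, "state": "NY", "amount": 5},
    {"id": 3, "state": "CA", "amount": 7},
]

# Row-oriented layout: all values of one record stored together.
row_store = [tuple(r.values()) for r in rows]

# Column-oriented layout: all values of one column stored together.
# A query that only needs `amount` scans a single contiguous list
# instead of every full record -- and repeated values like "CA"
# sit next to each other, which is why columnar compression works well.
col_store = {col: [r[col] for r in rows] for col in rows[0]}

total = sum(col_store["amount"])  # touches only the `amount` column
```

This is why analytic scans of a few columns over wide tables favor columnar formats, while whole-record reads and writes favor row formats.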
  • 14. 14 Common SQL on Hadoop myths
  • 15. Myth #1: Hadoop is MapReduce
    • MapReduce is just one component of Hadoop
      • In Hadoop v1, MapReduce was THE component for data retrieval and resource management
      • Hadoop v2 introduced YARN, and the open source community has built up alternatives to MapReduce for data access
    • MapReduce is not the only way to access data
    • You don't need an army of Java developers to use Hadoop – SQL engines on Hadoop use a familiar syntax for accessing data
    • MapReduce is rarely used in practice or by newer software (e.g. Hive on Spark)
    • Hadoop != MapReduce
  • 16. Myth #2: SQL on Hadoop is basically Hive
    • There are many more SQL engines on Hadoop that can provide SQL-like data access, just as Hive does
    • SQL on Hadoop is rapidly evolving!
  • 17. Hadoop ecosystem
    • Hadoop is not a single product, but an ecosystem – mostly (Apache) open source
    • Every component/layer has multiple alternatives, and different vendor distributions prefer different components
      • SQL-on-Hadoop: Hive, Impala, Drill, SparkSQL, Presto, Phoenix
      • Storage subsystems: HDFS, Kudu, S3, MapR-FS, EMC Isilon
      • Resource management: YARN, Mesos, Myriad, MapReduce v1
    • Hadoop Ecosystem Table – a good reference for looking up component names and what they do: https://hadoopecosystemtable.github.io/
  • 19. Example syntax: Impala SQL Language

    [WITH name AS (select_expression) [, ...] ]
    SELECT [ALL | DISTINCT] [STRAIGHT_JOIN] expression [, expression ...]
    FROM table_reference [, table_reference ...]
      [[FULL | [LEFT | RIGHT] INNER | [LEFT | RIGHT] OUTER |
        [LEFT | RIGHT] SEMI | [LEFT | RIGHT] ANTI | CROSS]
        JOIN table_reference [ON join_equality_clauses | USING (col1[, col2 ...])]] ...
    WHERE conditions
    GROUP BY { column | expression [, ...] }
    HAVING conditions
    ORDER BY { column | expression [ASC | DESC] [NULLS FIRST | NULLS LAST] [, ...] }
    LIMIT expression [OFFSET expression]
    [UNION [ALL] select_statement] ...

    SQL-like syntax, very familiar to relational database developers!
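Since the point is that this grammar is close to ANSI SQL, the same query shape runs on almost any engine. With no Impala cluster at hand, here is a small runnable demo using Python's built-in sqlite3 module as a stand-in – an illustration only, since the dialects differ in details (e.g. SQLite lacks STRAIGHT_JOIN, and NULLS FIRST/LAST support varies by version):

```python
import sqlite3

# In-memory database standing in for an Impala-accessible table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (state TEXT, channel TEXT, amount INT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("CA", "Online", 10), ("NY", "Online", 5), ("CA", "Store", 3),
     ("CA", "Online", 7), ("NY", "Store", 2)],
)

# The familiar SELECT ... WHERE ... GROUP BY ... HAVING ... ORDER BY ... LIMIT shape.
result = conn.execute(
    """SELECT state, SUM(amount) AS total
       FROM sales
       WHERE channel = 'Online'
       GROUP BY state
       HAVING SUM(amount) > 0
       ORDER BY total DESC
       LIMIT 10"""
).fetchall()
# result -> [('CA', 17), ('NY', 5)]
```

The exact same statement text (minus engine-specific clauses) would be submitted to Impala through impala-shell or a JDBC/ODBC driver.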
  • 20. SQL Execution on Hadoop: MPP distributed processing (ideal)
    • Example query: SELECT state, SUM(amount) FROM sales WHERE channel = 'Online' GROUP BY state
    • The query arrives at a "query master" node via the HiveServer2 Thrift protocol listener (Apache Hive, Impala, and Spark all use the HiveServer2 Thrift protocol)
    • Nodes 1..N each scan, filter, and aggregate their local data in parallel, reading local disks at GB/s
    • Each node sends its intermediate result set to the coordinator for final re-aggregation
    • The result is a global view of the entire cluster's data
  • 21. Scalability – Additive Aggregations
    • SUM, COUNT, AVG, MIN, MAX
    • Each cluster node can compute a SUM on its local data (local aggregation, in parallel)
    • The "query master" coordinator performs a final global SUM by summing the local SUMs together (global/final aggregation, serial)
    • Example: SELECT state, SUM(amount) FROM sales WHERE channel = 'Online' GROUP BY state
      • Intermediate results – node 1: CA = 10, NY = 5, FL = 1; node 2: CA = 7, SC = 4, NY = 3; ...
      • Final results: CA = 17, NY = 8, FL = 1, SC = 4
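The two-phase aggregation above can be sketched in a few lines of Python, reusing the slide's example numbers. A toy single-process model, not Impala internals – each list stands in for one node's local shard:

```python
from collections import Counter

# Each "node" holds a shard of the sales table (state, amount),
# matching the intermediate results on the slide.
shards = [
    [("CA", 10), ("NY", 5), ("FL", 1)],  # node 1's local data
    [("CA", 7), ("SC", 4), ("NY", 3)],   # node 2's local data
]

def local_aggregate(shard):
    """Phase 1 (parallel): each node sums only its own data."""
    agg = Counter()
    for state, amount in shard:
        agg[state] += amount
    return agg

# Phase 2 (serial, on the coordinator): a SUM of the local SUMs.
# This works because SUM is additive -- no raw rows need to move.
final = sum((local_aggregate(s) for s in shards), Counter())
```

Only the tiny per-node `Counter` objects travel to the coordinator, which is why additive aggregations scale so well.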
  • 22. SQL Execution on Hadoop: MPP distributed processing (big queries)
    • Example query: SELECT state, COUNT(DISTINCT prod_id) FROM sales s, customers c WHERE s.cust_id = c.cust_id AND c.cust_age > 20 GROUP BY state
    • Nodes 1..N still scan, filter, and aggregate their local data in parallel, but now they must exchange data (shuffle) to see the full picture
    • The query master produces the final aggregate – a global view of the entire cluster's data
  • 23. Scalability – Broadcast Joins
    • In a distributed system with "random" physical placement of data in shards, a join must be able to compare all data from both of its sides (after any direct filters)
    • If one dataset is small enough, it can be broadcast and kept in memory on all nodes for node-local joining (a broadcast join or replicated join)
    • Example: broadcast the small dimension table to all nodes holding partitions of the large fact table
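A broadcast join can be modeled in a few lines. This toy sketch (all names illustrative, single-process) copies the small dimension table to every "node", so each fact partition joins locally with no data exchange between nodes:

```python
# Small dimension table: broadcast -- every node gets the full copy.
dim = {1: "Electronics", 2: "Books"}  # prod_id -> category

# Large fact table, split across nodes: (state, prod_id, amount).
fact_partitions = [
    [("CA", 1, 100), ("NY", 2, 40)],  # node 1's partition
    [("CA", 2, 25), ("TX", 1, 60)],   # node 2's partition
]

def node_join(partition, broadcast_dim):
    """Node-local hash join of one fact partition against the broadcast copy."""
    return [(state, broadcast_dim[pid], amt)
            for state, pid, amt in partition
            if pid in broadcast_dim]

# Each node joins independently; results are simply concatenated.
joined = [row for part in fact_partitions for row in node_join(part, dim)]
```

The cost is one copy of the small table per node – cheap when the dimension fits in memory, which is exactly the condition the slide states.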
  • 24. Scalability – Large Fact-to-Fact Joins
    • Used when both sides of the join are too big for broadcasting and keeping in each node's memory
    • Hash and shuffle non-overlapping subsets of data to the other join side, using a hash function on the join column
    • Rows with the same join key land on the same node, so each node can join its subsets locally
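The hash-and-shuffle step can also be sketched as a toy model (illustrative names, single-process): both large tables are re-partitioned by a hash of the join key, so matching rows end up on the same "node" and join locally:

```python
NUM_NODES = 3

def shuffle(rows, key_index):
    """Re-partition rows by hashing the join column -- the 'shuffle'."""
    parts = [[] for _ in range(NUM_NODES)]
    for row in rows:
        parts[hash(row[key_index]) % NUM_NODES].append(row)
    return parts

sales = [(1, 100), (2, 40), (3, 25), (1, 60)]  # (cust_id, amount)
customers = [(1, 30), (2, 18), (3, 45)]        # (cust_id, age)

# Both sides shuffled with the same hash function on cust_id,
# so rows with equal keys land in the same partition index.
sales_parts = shuffle(sales, 0)
cust_parts = shuffle(customers, 0)

# Each node joins only its own pair of partitions (local hash join).
joined = []
for s_part, c_part in zip(sales_parts, cust_parts):
    ages = dict(c_part)
    joined.extend((cid, amt, ages[cid]) for cid, amt in s_part if cid in ages)
```

Unlike the broadcast case, here the network must carry a full pass over both tables – which is why large fact-to-fact joins are the expensive case the next slide tries to mitigate.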
  • 25. How to improve/combat/fix these issues
    • Pre-join large fact tables that are frequently joined – denormalization! Hadoop storage is cheap, so it's affordable to keep multiple copies
    • Columnar data formats: queries access and scan only narrow slices of the wide table, and denormalized duplicate values compress well thanks to columnar compression
    • Approximate distinct operations: HyperLogLog – COUNT(DISTINCT ...) vs. NDV()
    • Partition-wise joins help with memory usage
      • Hive calls this a bucketed map join (SMB map join) – shuffling will still happen
      • On Impala, currently the SQL query must join individual partitions (UNION ALL)
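Why COUNT(DISTINCT) needs special treatment – approximate sketches like HyperLogLog, or shipping value sets – while SUM does not, can be shown in a toy sketch (illustrative data, not a HyperLogLog implementation):

```python
# Each "node" holds a shard of product ids.
shards = [
    ["p1", "p2", "p3"],  # node 1
    ["p2", "p3", "p4"],  # node 2
]

# Per-node distinct counts are NOT additive: p2 and p3 appear on both
# nodes, so summing local counts double-counts them.
local_counts = [len(set(s)) for s in shards]  # [3, 3] -> sums to 6, wrong

# An exact answer requires a global union of the distinct *values*
# (or an approximate sketch such as HyperLogLog, which is what NDV()
# uses, trading a small error for a tiny fixed-size message per node).
global_distinct = len(set().union(*(set(s) for s in shards)))  # 4, correct
```

This is the asymmetry the slide points at: additive aggregates ship one number per group per node, whereas exact distinct counts must ship data proportional to the number of distinct values.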
  • 26. Apache Impala
    • Provides high-performance, low-latency SQL queries; enables interactive exploration and fine-tuning of analytic queries
    • Can access multiple open data formats (Parquet, Avro, CSV, etc.) and storage subsystems (HDFS, Kudu, S3, etc.)
    • Uses the Hive metastore – the "data dictionary" for Hadoop
      • Stores table structure metadata (columns, datatypes) and file locations in HDFS
      • Serves two purposes: data abstraction and data discovery
    • Impala uses the same metadata (Hive metastore), SQL syntax, drivers, and UI (Hue) as Hive
  • 29. SQL on Hadoop processing engines compared
    • Impala: data sources HDFS, Kudu, cloud storage; relational data model; Hive Metastore metadata; collocated with Hadoop; HiveQL syntax; use case: fast, interactive analytics with high concurrency on large datasets
    • Hive: data sources HDFS, cloud storage, HBase; relational data model; Hive Metastore metadata; collocated with Hadoop; HiveQL syntax; use case: data warehouse / reporting analytics, data-intensive, complex queries
    • Drill: data sources HDFS, NoSQL, cloud storage; schema-free JSON data model; Hive Metastore metadata (optional); standalone or collocated with Hadoop; ANSI SQL; use case: self-service, SQL-based analytics over heterogeneous sources
    • Presto: data sources HDFS, NoSQL, RDBMS; relational data model; Hive Metastore metadata (remote); collocated with Hadoop; ANSI SQL; use case: interactive analytics on large datasets
  • 30. Is Hadoop going to replace my database?
    • Transactional systems (OLTP)
      • Simple – it's possible
      • Complex (ERP) – no (not anytime soon!)
    • Data warehouse / reporting / analytics
      • Traditional DW – maybe, but it will take time
      • Big data – yes
    • ETL and data integration
      • Probably! Most ETL tools already support integration with big data transformation technologies
  • 31. Where SQL on Hadoop can improve
    • Large "fact to fact" table joins
    • DML & updates
    • SQL engine sophistication & syntax richness

    hive> SELECT SUM(duration)
        > FROM call_detail_records
        > WHERE
        >   type = 'INTERNET'
        >   OR phone_number IN ( SELECT phone_number
        >                        FROM customer_details
        >                        WHERE region = 'R04' );
    FAILED: SemanticException [Error 10249]: Line 5:17 Unsupported SubQuery Expression 'phone_number': Only SubQuery expressions that are top level conjuncts are allowed

    • We RDBMS developers have been spoiled with very sophisticated SQL engines for years :)
    • Remember, Hadoop evolves very fast!
  • 33. 36 Store, and access, everything
  • 34. Genomics cancer research at Arizona State University
    • Challenges storing and processing genomics data constrained the set of possible research questions
    • Data discovery was limited: in a pattern commonly referred to as "lamp-posting," cancer researchers could only "look where the light was" – though the most interesting research questions had to do with discovering answers still "in the dark"
    • Proprietary platforms limited scholarly collaboration and reduced research time
    • Data remained constrained in proprietary systems operated by individual researchers; this IT legacy of isolated data silos hindered future collaboration
    • Source: https://hortonworks.com/customers/arizona-state-university
  • 35. Store and access all data with Hadoop
    • Hortonworks Data Platform with Hadoop stores genomic data, making it broadly accessible at an affordable cost, allowing ASU researchers to:
      • store and process huge amounts of data
      • make that data and tools accessible to others within and outside of the university
      • do it all at a cost that wouldn't escalate as genomics data grew to petabytes in their cluster
    • "Your average database of 20 billion rows is simply unapproachable with traditional, standard technology," said Dr. Kenneth Buetow. "With our Hadoop infrastructure, we can run data-intensive queries of these large-scale resources, and they return results in seconds. This is transformational."
    • "My epiphany came when I ran a relatively complex SQL query against a table that had 20 billion rows in it. The query returned results in a minute or two. I was dumbfounded. Prior to using this environment, one never could comprehensively construct the networks. Basically, you couldn't or didn't ask those comprehensive questions."
    • Source: https://hortonworks.com/customers/arizona-state-university
  • 37. Securus Technologies
    • Securus Technologies is a major provider of prison telecommunications and technology services in the United States
    • Satellite Tracking of People (STOP) handles offender ankle bracelet device tracking; application users are parole/probation officers and law enforcement agencies
    • Challenges
      • Database becoming increasingly overloaded, restricting application users
      • IoT geospatial data constantly streams into the overloaded RDBMS
      • Only 2 days of history available to query without extreme wait times
      • Cost to re-platform or expand the database too high
  • 38. Securus Technologies
    • Approach
      • Use Gluent Offload to sync 100% of application data to Hadoop with no ETL
      • Use Gluent Transparent Query to let legacy applications access the data without a code rewrite
    • Results and benefits – using Impala, SQL on Hadoop!
      • Significant decrease in application query response time
      • 365 days of historical data returned in just minutes! (previously DNF* – *did not finish)
      • Analysts began to rethink what's possible with the historical data: predictive and forensic-type analytics, etc.
      • No immediate need to re-platform the application – fixed performance issues with zero code changes!
    • Dramatic performance improvements, huge cost savings, new business opportunities ...without rewriting legacy applications
  • 39. SQL on Hadoop for the Oracle Professional Thank you for attending!
  • 40. Reach out for more info!
    • Email: mrainey@gluent.com
    • Twitter: @mRainey
    • Website: gluent.com
    • Hadoop resources: https://gluent.com/hadoop-resources
  • 41. Gluent Data Platform – "All enterprise data is just a query away"
    • Supports all major Hadoop distributions, on-premises or in the cloud
    • Consolidates data into a centralized location in open data formats
    • Transparent Data Virtualization provides simple data sharing across the enterprise
    • You decide when to... write ETL code, rewrite applications, learn new technologies
  • 42. Gluent's hybrid data world
    • No-ETL Data Sync and Transparent Query
    • Gluent connects all data to all applications!
  • 43. Industry recognition for Gluent
    • Selected as a Gartner 2017 Cool Vendor in Data Management: https://gluent.com/cool-vendor-2017
    • Recognized in "The 10 Coolest Big Data Startups of 2017" on CRN: https://gluent.com/gluent-recognized-in-top-10-coolest-startups-of-2017-on-crn/
    • Featured in the 2017 Strata Startup Showcase: https://gluent.com/gluent-strata-startup-showcase/

Editor's Notes

  1. Hadoop is a “huge” topic. With one day, we had to choose what to explain: where it’s awesome, and where it’s not-so-awesome.
  2. Gluent is all about enterprise data sharing and transparent data virtualization.
  3. Why are we excited about hadoop?
  4. The evolution of big data. How did we get here and why does this training even exist? 10 yrs ago, what was the biggest DB? 5 TB Today, that’s not “exciting”. If you work on ERP systems, easy to think that all data lives in tables and is easy to store. Not always true, different businesses - different departments - have different needs.
  5. Why not just load up storage array?
  6. Data is moved to the computation for processing: records must move from storage to the processing tier, and access to storage is the bottleneck. Running out of space? Expand the SAN – which becomes expensive; SAN storage arrays are not cheap. Two common problems in db performance work: bad SQL statements (optimizer issues) and storage issues – usually not a config issue, but the pipes between the database nodes and SAN storage are very small.
  7. Describe NameNode & DataNodes at high-level NameNode - where does my file - file blocks - live? DataNode - allows access to locally stored data. Data doesn’t move for computation. Processing occurs where the data lives Running out of space? Add new nodes on commodity hardware. Physically Data nodes on commodity, inexpensive “pizza boxes”. Looks like one big filesystem. Get deeper on HDFS later on.
  8. Name node can be compared to ASM instance in Exadata. (TODO) Not just distributed file system, but distributed computation. Code is distributed as well.
  9. Data file is chunked into blocks 3 replicas of each block will be created If a node goes down, NameNode ensures each block still has 3 replicas Can also rebalance if needed.
  10. Software scalability - past, you had to buy many large, big iron hardware systems. Now, open source software. Hadoop ecosystem is organized around open data formats. Makes your system future proof. One data, many engines. DW on Hadoop, not stuck with Hive SQL - you can use any engine you like. Switching engines when using something like Oracle - export, load into new engine. Takes time.
  11. Workload processing - SQL-like processing in an Oracle example. RDBMS / enterprise database: read data, filter down to what you need, perform simple processing, then more complex processing. Exadata (similar to Netezza): push filtering to the storage level, so disk I/O (reads) and filtering happen there and less data moves from storage to the DB; aggregations etc. are still performed in the DB. Hadoop: not just a SQL engine, but a parallel processing framework that pushes application code close to where the data lives, so all operations and logic occur in the same place. Exadata is one step closer to Hadoop, but not quite there.
  12. Load data once, read it many times. Best for large, sequential file scans; not great for random access. MapR has its own filesystem, MapR-FS, with multi-level NameNodes (..more info needed..). Kudu can solve this problem as well - it doesn't use HDFS. You cannot modify a file, only append to the end. Even if you could - uncompress the file, update it, recompress? Not good anyway.
  13. You can really put any type of file there, but the main formats are these. Tutorials typically start with CSV. Sequence files are binary-like files. Avro serializes rows and columns. Columnar formats are used when extracting data for analysis - faster to extract and analyze, stored in big chunks, organized by column, and often compressed.
  14. Ten years ago, yes - Hadoop v1 was HDFS and MapReduce. Now MR is rarely used in new development. If you're starting with Hadoop, run SQL against it. It's very good.
  15. Hive was the first SQL engine, developed at Facebook. It converted SQL to MapReduce under the hood to run against the distributed filesystem. Now there are many engines! Even Hive has evolved to use Spark or Tez rather than MR. Hive will be described further later on.
  16. Hadoop has a large ecosystem, similar to Red Hat Linux, with different Hadoop vendors. It's community driven, not vendor driven. Ecosystem table: open it and show how many components there are.
  17. Show a little about how it works; discuss some fundamentals and where Hadoop fits - and where it doesn't work as well. The big data hype was that it would solve all of your problems and make lots of money. It works for folks like Facebook and Yahoo, but we don't want all of our SQL developers to learn Java and MapReduce. The fact that you can actually run SQL on Hadoop - that's the killer app for big data and Hadoop. For example, moving a lot of data off the expensive DW, or using Hadoop as a data-sharing back end.
  18. Impala syntax is pretty powerful, with UDFs that can be written in C++. DEMO: show Impala, add files to HDFS, create tables, etc.; show Hue as well.
  19. On the front end, it looks a lot like a database: the SQL statements are similar, etc. On the back end, it's a bit different. Every single node has its own local disk - there is no SAN storage. Hadoop does sequential reads and writes, not random I/O, so that I/O bottleneck is not there. What about processing? I/O throughput is great, so now the CPU might be the bottleneck. In the ideal MPP world: run a select statement across all the data, then filter and group by (group by state, so not a massive number of result records). The Thrift server picks one node to connect to - the "query master". The query master takes the SQL and sends it to all nodes in the cluster. Each node reads only local data (usually), so there are no network bottlenecks; the query master dishes out work so that unique data is read on each node. Each node does scanning and filtering on its own. A total sum, for an aggregate, can be pushed down to the nodes as well. Each node's result set is sent to the query master for final aggregation and returned to Hive. But life is not always ideal.
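The scatter-gather flow described above can be sketched in Python. A hypothetical three-node "cluster" holds partitions of a sales table; the query master fans out a filtered, partially aggregated scan and merges the results (table contents and node names are made up for illustration):

```python
from collections import defaultdict

# Toy cluster: each node holds its own partition of a (state, amount) table.
node_data = {
    "node1": [("TX", 100), ("CA", 50), ("TX", 25)],
    "node2": [("CA", 75), ("NY", 10)],
    "node3": [("NY", 40), ("TX", 5)],
}

def node_scan(rows, min_amount):
    """Runs on each data node: local scan, filter, and partial SUM by state."""
    partial = defaultdict(int)
    for state, amount in rows:
        if amount > min_amount:        # filtering pushed down to the node
            partial[state] += amount   # partial aggregation on the node
    return partial

def query_master(min_amount):
    """Fan the work out to every node, then merge the partial sums."""
    total = defaultdict(int)
    for rows in node_data.values():
        for state, subtotal in node_scan(rows, min_amount).items():
            total[state] += subtotal
    return dict(total)

print(query_master(20))  # {'TX': 125, 'CA': 125, 'NY': 40}
```

Only the small per-node subtotals cross the network, never the raw rows - which is exactly why this pattern scales, and why non-additive aggregates (next slide) break it.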
  20. Scalability of aggregations. Additive aggregates, like a sum, can be computed at a lower level (on each node) and then aggregated again at the query master. This works similarly to Oracle Parallel Query, at a different level: each parallel slave can do a sum on its own, then a later stage in the query plan sums the sums. But what about count distinct? Or median? Example: how many different products per customer transaction in a store? Distincts are not additive. That's where the scale-out "dream" breaks down. We'll get back to aggregations.
  21. Now we've added DISTINCT to the COUNT, so the nodes must exchange data to reach the final distinct count. Each node can order its local dataset in parallel, but the final merge join happens in the query master, performed serially. Nodes sharing data in order to see the full picture is called a 'shuffle'.
  22. Fundamentals about joins. Oracle, using shared storage, doesn't scale to a thousand nodes; the benefit is that all nodes "see" the storage, so any node can scan all of the data. Hadoop has a tradeoff: 9 nodes here, each holding a portion of the data. How do we do a join? We cannot just "join" node 1 to node 4 and hope all of the data is represented. First, read all data from the small table and broadcast it into memory on all of the nodes that will process the query. Not usually a problem when broadcasting smaller dimension tables to join a large fact table, but it could become one. I relate this to the issue of change data capture in ETL (?? maybe).
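A broadcast join can be sketched like this - a hypothetical small customers dimension is copied in full to every node holding fact rows, so each node can join locally with no data exchange (all table contents are illustrative):

```python
# Small dimension table plus a fact table spread across two nodes.
customers = {1: "Acme", 2: "Widgets Inc"}          # customer_id -> name
fact_partitions = {
    "node1": [(1, 100.0), (2, 50.0)],              # (customer_id, amount)
    "node2": [(2, 75.0), (1, 25.0)],
}

def broadcast_join():
    """Ship the entire small table to every node; each node then joins
    its local fact rows against the in-memory copy - no shuffling."""
    results = []
    for node, rows in fact_partitions.items():
        local_copy = dict(customers)   # the "broadcast": one full copy per node
        for cust_id, amount in rows:
            results.append((local_copy[cust_id], amount))
    return results

rows = broadcast_join()
assert len(rows) == 4 and ("Acme", 100.0) in rows
```

The cost is one copy of the small table per node - cheap for dimensions, prohibitive when both sides are large, which leads to the shuffle join on the next slide.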
  23. The biggest remaining scalability problem for Hadoop: big fact-to-fact joins probably won't run well. For a small table, the optimizer will broadcast it to each node, but for a large table the optimizer needs to partition the join. We can shuffle: read through the first table and run a hash function on the join key, then, based on the result, send each row selectively to a different node - a partitioned join. All nodes do the same. Unfortunately, that's only half of the story: the table on the other side must be shuffled as well. If the right side is sales and the left is customers, we do the same there. Only then can the local join occur - lots of CPU and network traffic. If Hadoop worked like Teradata, this wouldn't be needed: Teradata keeps rows with the same IDs on the same node, which gives better big-table joins. But then you have to redistribute data when adding another node.
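The shuffle (hash-partitioned) join can be sketched as follows. Both sides are hashed on the join key with the same function, so matching keys land on the same node and each node joins only its own partitions (tables and node count are made up for illustration):

```python
def shuffle(rows, key_index, n_nodes):
    """Hash-partition rows on the join key so that matching keys land
    on the same node. Both tables must be shuffled the same way."""
    partitions = [[] for _ in range(n_nodes)]
    for row in rows:
        partitions[hash(row[key_index]) % n_nodes].append(row)
    return partitions

customers = [(1, "Acme"), (2, "Widgets"), (3, "Initech")]   # (id, name)
sales = [(1, 100.0), (2, 50.0), (1, 25.0), (3, 10.0)]       # (id, amount)

N = 3  # number of nodes
cust_parts = shuffle(customers, 0, N)
sales_parts = shuffle(sales, 0, N)

# Each node now joins only its own pair of partitions, in parallel.
joined = []
for node in range(N):
    names = dict(cust_parts[node])
    joined += [(names[cid], amt) for cid, amt in sales_parts[node]]

assert sorted(joined) == [("Acme", 25.0), ("Acme", 100.0),
                          ("Initech", 10.0), ("Widgets", 50.0)]
```

Note that *every* row of *both* tables crosses the network once - that's the CPU and network cost the slide warns about.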
  24. How do we work around these issues? We can pre-join large fact tables, during or after load, and store the copy - a form of denormalization. With columnar formats, just scan through the columns you need. Approximate count distinct (there's a similar Oracle operation) uses a bitmap to store an approximate number of distinct values per customer, state, or whatever. The magic: these sketches are additive. Each node can count its distinct values, and the results can still be combined in the final aggregation at the query master. This gives roughly correct results - within about 1.5% of the true value - so you must be able to tolerate some deviation. Perfect for exploring data; then run the more intensive exact query.
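The key property - that approximate distinct-count sketches merge additively - can be demonstrated with a minimal Flajolet-Martin-style sketch. This is a simplified stand-in for what engines like Impala actually implement (they use HyperLogLog-family structures), shown only to make the mergeability point:

```python
import hashlib

def rank(h):
    """Position of the lowest set bit (trailing zeros + 1) of a hash."""
    return (h & -h).bit_length()

def sketch(values):
    """Flajolet-Martin-style sketch: remember the maximum trailing-zero
    rank seen across hashed values. Tiny, and mergeable with max()."""
    r = 0
    for v in values:
        h = int(hashlib.md5(str(v).encode()).hexdigest(), 16)
        r = max(r, rank(h))
    return r

node1 = range(0, 600)
node2 = range(400, 1000)   # overlaps node1 - exact counts would double-count

# Per-node sketches merge losslessly with max(), so the query master can
# combine them without ever re-reading or shuffling the raw rows.
merged = max(sketch(node1), sketch(node2))
assert merged == sketch(range(0, 1000))   # identical to sketching the union
estimate = 2 ** merged                    # rough distinct-count estimate
```

Unlike the per-node exact distinct counts from slide 20, these sketches combine correctly even when values overlap across nodes - at the price of an approximate answer.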
  25. The Hive metastore is used even by other SQL engines on Hadoop. It's like a data dictionary for Hadoop: how do we know which files in which directory to go after when running "SELECT *" in Hive? The Hive metastore. Impala, Presto, even Spark can plug into the Hive metastore. This is different from how SQL Server and Oracle work - you cannot share a data dictionary across those two engines. It's like having multiple engines for the same database. DEMO: Hive and Impala both show the same metadata!
  26. Visual of the next slide, since animations hide content.
  27. Hive is Hortonworks; Impala is Cloudera; Drill is now, more or less, MapR; Presto was created by Facebook, but Teradata now contributes. A very neat part of open source: companies like Facebook and Yahoo create these, and even companies like Teradata contribute. Use cases, with a little background: Hive used to be slow for short queries because it always used MapReduce behind the scenes. That's fixed on some platforms - Hive LLAP is low latency, so there's no slow startup overhead, and Hive can use Spark or Tez behind the scenes. Impala: faster analytics. Hive is more functional - its SQL syntax is richer, and it has partition-wise joins, which Impala doesn't. But it's Hadoop: you can use Hive for ETL and complex transformations with user-defined functions in a batch job, then use Impala for your BI users, giving them faster response times. "Use the best tool for the job." Drill is really good for exploratory analytics with its use of JSON files - no need for a "relational"-type schema. Presto is a more modern take on Hive. Preference for a Hadoop vendor will drive the decision of what to use: Hortonworks - Hive or Spark SQL; Cloudera - Impala or Hive. TODO - remove alternative vendor-based products like Drill / Presto?
  28. Today? No! There is too much to migrate, and some things cannot be migrated easily at all. But given the capabilities Hadoop will have in the next 4-5 years, in some cases yes. OLTP: MongoDB and Cassandra are already doing it for simple OLTP systems; for complex ones, not at all - the RDBMS is here to stay. DW: replacing traditional data warehouses is quite possible, but it might take some time to take over from Teradata, etc. It's not quite ready yet - big table joins are still an issue - but it's coming. Companies like Netflix are using a cloud-based DW with Parquet files in S3. Big data: yes. Unless a system of messages or event feeds started before the advent of big data, these IoT projects are going straight into Hadoop. ETL and DI: yeah, probably. A homegrown system is tough to convert to Hadoop, but ETL tools like Informatica or ODI can just point at Hadoop storage and processing apps. It won't happen overnight, but it's moving that way. The cloud does introduce an interesting shift: services such as Google BigQuery might not even be known to traditional enterprises running ERP systems, but marketing firms just dump data into the cloud and run analytics against it. On-site is more where Hadoop will rule.
  29. We've already discussed some of these, and Hadoop can improve as well - just to be realistic; we're not selling Hadoop licenses or anything! A subquery under an OR condition is not supported by Hive, or even Impala. You could rewrite such a query, but it's still an issue: you might not be able to just migrate all the data and switch from Oracle PL/SQL to HiveQL. DEMO - #3EU 3:00 - Same query in Impala
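The kind of rewrite meant here - splitting a subquery-under-OR into a UNION of two branches - can be illustrated with a toy example. SQLite is used below purely to verify the transformation produces the same rows; the tables and data are invented, and actual Hive/Impala syntax differs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, cust INTEGER, vip INTEGER);
    CREATE TABLE flagged (cust INTEGER);
    INSERT INTO orders VALUES (1, 10, 0), (2, 20, 1), (3, 30, 0);
    INSERT INTO flagged VALUES (10);
""")

# Original shape: a subquery under an OR condition - the pattern that
# Hive and Impala reject, so it must be rewritten before migrating.
original = """
    SELECT id FROM orders
    WHERE vip = 1 OR cust IN (SELECT cust FROM flagged)
"""

# Rewrite: split the OR into two branches and UNION them (UNION, not
# UNION ALL, so duplicate ids collapse as they would under the OR).
rewritten = """
    SELECT id FROM orders WHERE vip = 1
    UNION
    SELECT id FROM orders WHERE cust IN (SELECT cust FROM flagged)
"""

a = sorted(r[0] for r in conn.execute(original))
b = sorted(r[0] for r in conn.execute(rewritten))
assert a == b == [1, 2]
```

The rewrite is mechanical here, but in real PL/SQL-to-HiveQL migrations there can be hundreds of such statements - which is exactly the migration-cost point this slide makes.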
  30. Case study using Hortonworks on open source software?
  31. https://hortonworks.com/customers/arizona-state-university/
  32. https://hortonworks.com/customers/arizona-state-university/
  33. Gluent case study
  34. TODO add quote?
  35. FYI - cost savings: roughly 5 million (existing) vs. 2 million (Gluent).
  36. TODO https://www.slideshare.net/AmazonWebServices/how-to-build-a-data-lake-with-aws-glue-data-catalog-abd213r
  37. Introducing… the Gluent Data Platform. It supports the big 3 Hadoop vendors, on-premises and in the cloud. Many customers are cloud-first (recent customer example) or shifting to the cloud, and they look to us for help doing so.
  38. No-risk implementation.