SlideShare a Scribd company logo
DWH OVER HADOOPDWH OVER HADOOP
THETHE
BASICSBASICS
COLUMNAR FORMATS (ORC/PARQUET)COLUMNAR FORMATS (ORC/PARQUET)
Projection Push Down
Predicate Push Down
Excellent Compression Ratios
Column Indices
Max/Avg/Min values
Rows must be batched to benefit from these optimizations
PARQUETPARQUET
Strongly endorsed by Cloudera
One of the few formats Impala
supports (and the most optimal
for it)
Also supported by Hive, Spark,
Tajo, Drill & Presto.
Speaking from my
own personal experience a bit
more expensive to generate.
ORCORC
Endorsed by Hortonworks
Most optimal for Presto
Spark support was recently
introduced.
QUERYINGQUERYING
ENGINESENGINES
HIVEHIVE
Hive provides a SQL like interface of
accessing the data (files) called HiveQL.
The HQL is translated into
M/R code and executed immediately.
Batch Oriented
Fault tolerant and thus reliable
Not a DB!
Does not support updates & delete and has
no transaction (or does it ?)
LOWLOW
LATENCYLATENCY
SQLSQL
Map-Reduce can be compared to
a Tractor:
It's very strong and can plow a
field better than any other vehicle,
but it's also very slow.
As prices of memory dropped, a
demand emerged to better utilize
it for faster response times.
CLOUDERA IMPALACLOUDERA IMPALA
Writen in C++
Utilizes Hive's metadata
Very fast
Not fault tolerante
Doesn't support custom data
formats
Doesn't support complex data
types (maps/arrays/structs)
A bit complicated setup for non
CDH distributions
FACEBOOK PRESTOFACEBOOK PRESTO
Can connect to:
Cassandra
Hive
JMX Sources
Postgres & Mysql
Allows cross engine joins
Used in Facebook to serve online
dashboards
Easy to setup
SPARK SQLSPARK SQL
Not affiliated with any Hadoop
vendor
Support all of the optimized file
formats (ORC/Parquet/Avro)
Can auto discover schema
Aims to provide second/sub-
second latnecy
Still not very mature
THE USUAL DATA FLOWTHE USUAL DATA FLOW
Collect -> Store -> Convert -> Select
The Data Latency conflict - lots
of fragmented small files or big
optimized files with big latency
Processing efforts involved in
the conversion process should
be minimized
Example..
A BETTER DATA FLOWA BETTER DATA FLOW
Collec-tor-vert -> Select
Convert the data as it is being
collected where possible
Or convert the data as it is
being stored (streaming) but
without losing optimizations
How can this be achieved?
SQOOPSQOOP
Import data from RDBMS into
Hadoop
Create java classes and hive
tables on import
Export data back to RDBMS
Runs a "Map Only" job to
perform the task
Supports incremental imports
Now supports import right
away as Parquet
HIVE & ACIDHIVE & ACID
Recently a conceptual change has been
introduced into Hive: CRUD with ACID
Transactions.
It is not meant to replace your OLTP but
rather supply a better data modification
mechanism to a subset of the data.
Explanation on how it works
Demo simple insert
Still requires M/R :(
HIVE & STREAMING INGESTHIVE & STREAMING INGEST
With the new ACID capabilities it is now
possible to continously insert data into hive
Data apperas almost immediately
Data is optimized in a columnar format
Data is compacted by different triggers
Code snippet
FLUMEFLUME
Distributed
Durable
Scalable
Fault Tolerante
Serves for ingestion and basic
pre-processing of the data
Composed of
source -> channel -> Sink
(Draw Architecture)
Utilized Hive's ACID capabilities
to instantly stream data into hive
- demo

More Related Content

What's hot

How Amazon.com is Leveraging Amazon Redshift (DAT306) | AWS re:Invent 2013
How Amazon.com is Leveraging Amazon Redshift (DAT306) | AWS re:Invent 2013How Amazon.com is Leveraging Amazon Redshift (DAT306) | AWS re:Invent 2013
How Amazon.com is Leveraging Amazon Redshift (DAT306) | AWS re:Invent 2013
Amazon Web Services
 
Optimising your Amazon Redshift Cluster for Peak Performance
Optimising your Amazon Redshift Cluster for Peak PerformanceOptimising your Amazon Redshift Cluster for Peak Performance
Optimising your Amazon Redshift Cluster for Peak Performance
Amazon Web Services
 
AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722
Amazon Web Services
 
Apache sqoop
Apache sqoopApache sqoop
Apache sqoop
megrhi haikel
 
AWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMR
Amazon Web Services
 
Migration to Redshift from SQL Server
Migration to Redshift from SQL ServerMigration to Redshift from SQL Server
Migration to Redshift from SQL Server
joeharris76
 
Optimizing Your Amazon Redshift Cluster for Peak Performance - AWS Summit Syd...
Optimizing Your Amazon Redshift Cluster for Peak Performance - AWS Summit Syd...Optimizing Your Amazon Redshift Cluster for Peak Performance - AWS Summit Syd...
Optimizing Your Amazon Redshift Cluster for Peak Performance - AWS Summit Syd...
Amazon Web Services
 
Loading Data into Redshift: Data Analytics Week SF
Loading Data into Redshift: Data Analytics Week SFLoading Data into Redshift: Data Analytics Week SF
Loading Data into Redshift: Data Analytics Week SF
Amazon Web Services
 
tdtechtalk20160330johan
tdtechtalk20160330johantdtechtalk20160330johan
tdtechtalk20160330johan
Johan Gustavsson
 
AWS June Webinar Series - Getting Started: Amazon Redshift
AWS June Webinar Series - Getting Started: Amazon RedshiftAWS June Webinar Series - Getting Started: Amazon Redshift
AWS June Webinar Series - Getting Started: Amazon Redshift
Amazon Web Services
 
Hadoop in the cloud with AWS' EMR
Hadoop in the cloud with AWS' EMRHadoop in the cloud with AWS' EMR
Hadoop in the cloud with AWS' EMR
rICh morrow
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
Amazon Web Services
 
Amazon EMR Masterclass
Amazon EMR MasterclassAmazon EMR Masterclass
Amazon EMR Masterclass
Amazon Web Services
 
Storm over gearpump
Storm over gearpumpStorm over gearpump
Storm over gearpump
Tianlun Zhang
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2
Sujee Maniyam
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Amazon Web Services
 
Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021
Mark Kromer
 
Amazon Redshift Masterclass
Amazon Redshift MasterclassAmazon Redshift Masterclass
Amazon Redshift Masterclass
Amazon Web Services
 
Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2
ScyllaDB
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
Mark Kromer
 

What's hot (20)

How Amazon.com is Leveraging Amazon Redshift (DAT306) | AWS re:Invent 2013
How Amazon.com is Leveraging Amazon Redshift (DAT306) | AWS re:Invent 2013How Amazon.com is Leveraging Amazon Redshift (DAT306) | AWS re:Invent 2013
How Amazon.com is Leveraging Amazon Redshift (DAT306) | AWS re:Invent 2013
 
Optimising your Amazon Redshift Cluster for Peak Performance
Optimising your Amazon Redshift Cluster for Peak PerformanceOptimising your Amazon Redshift Cluster for Peak Performance
Optimising your Amazon Redshift Cluster for Peak Performance
 
AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722
 
Apache sqoop
Apache sqoopApache sqoop
Apache sqoop
 
AWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMR
 
Migration to Redshift from SQL Server
Migration to Redshift from SQL ServerMigration to Redshift from SQL Server
Migration to Redshift from SQL Server
 
Optimizing Your Amazon Redshift Cluster for Peak Performance - AWS Summit Syd...
Optimizing Your Amazon Redshift Cluster for Peak Performance - AWS Summit Syd...Optimizing Your Amazon Redshift Cluster for Peak Performance - AWS Summit Syd...
Optimizing Your Amazon Redshift Cluster for Peak Performance - AWS Summit Syd...
 
Loading Data into Redshift: Data Analytics Week SF
Loading Data into Redshift: Data Analytics Week SFLoading Data into Redshift: Data Analytics Week SF
Loading Data into Redshift: Data Analytics Week SF
 
tdtechtalk20160330johan
tdtechtalk20160330johantdtechtalk20160330johan
tdtechtalk20160330johan
 
AWS June Webinar Series - Getting Started: Amazon Redshift
AWS June Webinar Series - Getting Started: Amazon RedshiftAWS June Webinar Series - Getting Started: Amazon Redshift
AWS June Webinar Series - Getting Started: Amazon Redshift
 
Hadoop in the cloud with AWS' EMR
Hadoop in the cloud with AWS' EMRHadoop in the cloud with AWS' EMR
Hadoop in the cloud with AWS' EMR
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
 
Amazon EMR Masterclass
Amazon EMR MasterclassAmazon EMR Masterclass
Amazon EMR Masterclass
 
Storm over gearpump
Storm over gearpumpStorm over gearpump
Storm over gearpump
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021
 
Amazon Redshift Masterclass
Amazon Redshift MasterclassAmazon Redshift Masterclass
Amazon Redshift Masterclass
 
Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
 

Viewers also liked

Advance Hive, NoSQL Database (HBase) - Module 7
Advance Hive, NoSQL Database (HBase) - Module 7Advance Hive, NoSQL Database (HBase) - Module 7
Advance Hive, NoSQL Database (HBase) - Module 7
Rohit Agrawal
 
Pig and Pig Latin - Module 5
Pig and Pig Latin - Module 5Pig and Pig Latin - Module 5
Pig and Pig Latin - Module 5
Rohit Agrawal
 
Hadoop/HBase POC framework
Hadoop/HBase POC frameworkHadoop/HBase POC framework
Hadoop/HBase POC framework
Doug Chang
 
Oozie in Practice - Big Data Workflow Scheduler - Oozie Case Study
Oozie in Practice - Big Data Workflow Scheduler - Oozie Case StudyOozie in Practice - Big Data Workflow Scheduler - Oozie Case Study
Oozie in Practice - Big Data Workflow Scheduler - Oozie Case Study
FX Live Group
 
Oozie or Easy: Managing Hadoop Workloads the EASY Way
Oozie or Easy: Managing Hadoop Workloads the EASY WayOozie or Easy: Managing Hadoop Workloads the EASY Way
Oozie or Easy: Managing Hadoop Workloads the EASY Way
DataWorks Summit
 
HadoopFileFormats_2016
HadoopFileFormats_2016HadoopFileFormats_2016
HadoopFileFormats_2016
Jakub Wszolek, PhD
 
Oozie towards zero downtime
Oozie towards zero downtimeOozie towards zero downtime
Oozie towards zero downtime
DataWorks Summit
 
Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data Scientists
DataWorks Summit
 
Big data hbase
Big data hbase Big data hbase
Big data hbase
ANSHUL GUPTA
 
Transactions Over Apache HBase
Transactions Over Apache HBaseTransactions Over Apache HBase
Transactions Over Apache HBase
Cask Data
 
Everything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about OozieEverything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about Oozie
Chicago Hadoop Users Group
 
Hive ppt (1)
Hive ppt (1)Hive ppt (1)
Hive ppt (1)
marwa baich
 
HBase Operations and Best Practices
HBase Operations and Best PracticesHBase Operations and Best Practices
HBase Operations and Best Practices
Venu Anuganti
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Sudhir Mallem
 
Big Data Testing Approach - Rohit Kharabe
Big Data Testing Approach - Rohit KharabeBig Data Testing Approach - Rohit Kharabe
Big Data Testing Approach - Rohit Kharabe
ROHIT KHARABE
 
Apache hbase overview (20160427)
Apache hbase overview (20160427)Apache hbase overview (20160427)
Apache hbase overview (20160427)
Steve Min
 
Apache HBase Internals you hoped you Never Needed to Understand
Apache HBase Internals you hoped you Never Needed to UnderstandApache HBase Internals you hoped you Never Needed to Understand
Apache HBase Internals you hoped you Never Needed to Understand
Josh Elser
 
Future Of Data Paris - BI and Big Data
Future Of Data Paris - BI and Big DataFuture Of Data Paris - BI and Big Data
Future Of Data Paris - BI and Big Data
Mathias Kluba
 
ORC Files
ORC FilesORC Files
ORC Files
Owen O'Malley
 
Proof of Concept for Hadoop: storage and analytics of electrical time-series
Proof of Concept for Hadoop: storage and analytics of electrical time-seriesProof of Concept for Hadoop: storage and analytics of electrical time-series
Proof of Concept for Hadoop: storage and analytics of electrical time-series
DataWorks Summit
 

Viewers also liked (20)

Advance Hive, NoSQL Database (HBase) - Module 7
Advance Hive, NoSQL Database (HBase) - Module 7Advance Hive, NoSQL Database (HBase) - Module 7
Advance Hive, NoSQL Database (HBase) - Module 7
 
Pig and Pig Latin - Module 5
Pig and Pig Latin - Module 5Pig and Pig Latin - Module 5
Pig and Pig Latin - Module 5
 
Hadoop/HBase POC framework
Hadoop/HBase POC frameworkHadoop/HBase POC framework
Hadoop/HBase POC framework
 
Oozie in Practice - Big Data Workflow Scheduler - Oozie Case Study
Oozie in Practice - Big Data Workflow Scheduler - Oozie Case StudyOozie in Practice - Big Data Workflow Scheduler - Oozie Case Study
Oozie in Practice - Big Data Workflow Scheduler - Oozie Case Study
 
Oozie or Easy: Managing Hadoop Workloads the EASY Way
Oozie or Easy: Managing Hadoop Workloads the EASY WayOozie or Easy: Managing Hadoop Workloads the EASY Way
Oozie or Easy: Managing Hadoop Workloads the EASY Way
 
HadoopFileFormats_2016
HadoopFileFormats_2016HadoopFileFormats_2016
HadoopFileFormats_2016
 
Oozie towards zero downtime
Oozie towards zero downtimeOozie towards zero downtime
Oozie towards zero downtime
 
Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data Scientists
 
Big data hbase
Big data hbase Big data hbase
Big data hbase
 
Transactions Over Apache HBase
Transactions Over Apache HBaseTransactions Over Apache HBase
Transactions Over Apache HBase
 
Everything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about OozieEverything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about Oozie
 
Hive ppt (1)
Hive ppt (1)Hive ppt (1)
Hive ppt (1)
 
HBase Operations and Best Practices
HBase Operations and Best PracticesHBase Operations and Best Practices
HBase Operations and Best Practices
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
 
Big Data Testing Approach - Rohit Kharabe
Big Data Testing Approach - Rohit KharabeBig Data Testing Approach - Rohit Kharabe
Big Data Testing Approach - Rohit Kharabe
 
Apache hbase overview (20160427)
Apache hbase overview (20160427)Apache hbase overview (20160427)
Apache hbase overview (20160427)
 
Apache HBase Internals you hoped you Never Needed to Understand
Apache HBase Internals you hoped you Never Needed to UnderstandApache HBase Internals you hoped you Never Needed to Understand
Apache HBase Internals you hoped you Never Needed to Understand
 
Future Of Data Paris - BI and Big Data
Future Of Data Paris - BI and Big DataFuture Of Data Paris - BI and Big Data
Future Of Data Paris - BI and Big Data
 
ORC Files
ORC FilesORC Files
ORC Files
 
Proof of Concept for Hadoop: storage and analytics of electrical time-series
Proof of Concept for Hadoop: storage and analytics of electrical time-seriesProof of Concept for Hadoop: storage and analytics of electrical time-series
Proof of Concept for Hadoop: storage and analytics of electrical time-series
 

Similar to Veracity think bugdata #2 6.7.2015

From oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other toolsFrom oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other tools
Guy Harrison
 
Hadoop World Vertica
Hadoop World VerticaHadoop World Vertica
Hadoop World Vertica
Omer Trajman
 
Hw09 Hadoop + Vertica
Hw09   Hadoop + VerticaHw09   Hadoop + Vertica
Hw09 Hadoop + Vertica
Cloudera, Inc.
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Christian Perone
 
Replicate from Oracle to data warehouses and analytics
Replicate from Oracle to data warehouses and analyticsReplicate from Oracle to data warehouses and analytics
Replicate from Oracle to data warehouses and analytics
Continuent
 
Nike tech talk.2
Nike tech talk.2Nike tech talk.2
Nike tech talk.2
Jags Ramnarayan
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
DataWorks Summit
 
May 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETL
Adam Muise
 
Replication in real-time from Oracle and MySQL into data warehouses and analy...
Replication in real-time from Oracle and MySQL into data warehouses and analy...Replication in real-time from Oracle and MySQL into data warehouses and analy...
Replication in real-time from Oracle and MySQL into data warehouses and analy...
Continuent
 
Replication in real-time from Oracle and MySQL into data warehouses and analy...
Replication in real-time from Oracle and MySQL into data warehouses and analy...Replication in real-time from Oracle and MySQL into data warehouses and analy...
Replication in real-time from Oracle and MySQL into data warehouses and analy...
Continuent
 
Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics
Real-time Data Loading from Oracle and MySQL to Data Warehouses, AnalyticsReal-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics
Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics
Continuent
 
Splice Machine Overview
Splice Machine OverviewSplice Machine Overview
Splice Machine Overview
Kunal Gupta
 
NoSQL, Hadoop, Cascading June 2010
NoSQL, Hadoop, Cascading June 2010NoSQL, Hadoop, Cascading June 2010
NoSQL, Hadoop, Cascading June 2010
Christopher Curtin
 
SnappyData overview NikeTechTalk 11/19/15
SnappyData overview NikeTechTalk 11/19/15SnappyData overview NikeTechTalk 11/19/15
SnappyData overview NikeTechTalk 11/19/15
SnappyData
 
ORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, SmallerORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, Smaller
The Apache Software Foundation
 
Oracle Database 12c "New features"
Oracle Database 12c "New features" Oracle Database 12c "New features"
Oracle Database 12c "New features"
Anar Godjaev
 
SQL Server 2014 In-Memory Tables (XTP, Hekaton)
SQL Server 2014 In-Memory Tables (XTP, Hekaton)SQL Server 2014 In-Memory Tables (XTP, Hekaton)
SQL Server 2014 In-Memory Tables (XTP, Hekaton)
Tony Rogerson
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
Ofir Manor
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
nvvrajesh
 
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Chicago Hadoop Users Group
 

Similar to Veracity think bugdata #2 6.7.2015 (20)

From oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other toolsFrom oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other tools
 
Hadoop World Vertica
Hadoop World VerticaHadoop World Vertica
Hadoop World Vertica
 
Hw09 Hadoop + Vertica
Hw09   Hadoop + VerticaHw09   Hadoop + Vertica
Hw09 Hadoop + Vertica
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
 
Replicate from Oracle to data warehouses and analytics
Replicate from Oracle to data warehouses and analyticsReplicate from Oracle to data warehouses and analytics
Replicate from Oracle to data warehouses and analytics
 
Nike tech talk.2
Nike tech talk.2Nike tech talk.2
Nike tech talk.2
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
 
May 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETL
 
Replication in real-time from Oracle and MySQL into data warehouses and analy...
Replication in real-time from Oracle and MySQL into data warehouses and analy...Replication in real-time from Oracle and MySQL into data warehouses and analy...
Replication in real-time from Oracle and MySQL into data warehouses and analy...
 
Replication in real-time from Oracle and MySQL into data warehouses and analy...
Replication in real-time from Oracle and MySQL into data warehouses and analy...Replication in real-time from Oracle and MySQL into data warehouses and analy...
Replication in real-time from Oracle and MySQL into data warehouses and analy...
 
Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics
Real-time Data Loading from Oracle and MySQL to Data Warehouses, AnalyticsReal-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics
Real-time Data Loading from Oracle and MySQL to Data Warehouses, Analytics
 
Splice Machine Overview
Splice Machine OverviewSplice Machine Overview
Splice Machine Overview
 
NoSQL, Hadoop, Cascading June 2010
NoSQL, Hadoop, Cascading June 2010NoSQL, Hadoop, Cascading June 2010
NoSQL, Hadoop, Cascading June 2010
 
SnappyData overview NikeTechTalk 11/19/15
SnappyData overview NikeTechTalk 11/19/15SnappyData overview NikeTechTalk 11/19/15
SnappyData overview NikeTechTalk 11/19/15
 
ORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, SmallerORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, Smaller
 
Oracle Database 12c "New features"
Oracle Database 12c "New features" Oracle Database 12c "New features"
Oracle Database 12c "New features"
 
SQL Server 2014 In-Memory Tables (XTP, Hekaton)
SQL Server 2014 In-Memory Tables (XTP, Hekaton)SQL Server 2014 In-Memory Tables (XTP, Hekaton)
SQL Server 2014 In-Memory Tables (XTP, Hekaton)
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
 

Veracity think bugdata #2 6.7.2015

  • 1. DWH OVER HADOOPDWH OVER HADOOP
  • 3. COLUMNAR FORMATS (ORC/PARQUET)COLUMNAR FORMATS (ORC/PARQUET) Projection Push Down Predicate Push Down Excellent Compression Ratios Column Indices Max/Avg/Min values Rows must be batched to benefit from these optimizations
  • 4. PARQUETPARQUET Strongly endorsed by Cloudera One of the few formats Impala supports (and the most optimal for it) Also supported by Hive, Spark, Tajo, Drill & Presto. Speaking from my own personal experience a bit more expensive to generate. ORCORC Endorsed by Hortonworks Most optimal for Presto Spark support was recently introduced.
  • 6. HIVEHIVE Hive provides a SQL like interface of accessing the data (files) called HiveQL. The HQL is translated into M/R code and executed immediately. Batch Oriented Fault tolerant and thus reliable Not a DB! Does not support updates & delete and has no transaction (or does it ?)
  • 7. LOWLOW LATENCYLATENCY SQLSQL Map-Reduce can be compared to a Tractor: It's very strong and can plow a field better than any other vehicle, but it's also very slow. As prices of memory dropped, a demand emerged to better utilize it for faster response times.
  • 8. CLOUDERA IMPALACLOUDERA IMPALA Writen in C++ Utilizes Hive's metadata Very fast Not fault tolerante Doesn't support custom data formats Doesn't support complex data types (maps/arrays/structs) A bit complicated setup for non CDH distributions
  • 9. FACEBOOK PRESTOFACEBOOK PRESTO Can connect to: Cassandra Hive JMX Sources Postgres & Mysql Allows cross engine joins Used in Facebook to serve online dashboards Easy to setup
  • 10. SPARK SQLSPARK SQL Not affiliated with any Hadoop vendor Support all of the optimized file formats (ORC/Parquet/Avro) Can auto discover schema Aims to provide second/sub- second latnecy Still not very mature
  • 11. THE USUAL DATA FLOWTHE USUAL DATA FLOW Collect -> Store -> Convert -> Select The Data Latency conflict - lots of fragmented small files or big optimized files with big latency Processing efforts involved in the conversion process should be minimized Example..
  • 12. A BETTER DATA FLOWA BETTER DATA FLOW Collec-tor-vert -> Select Convert the data as it is being collected where possible Or convert the data as it is being stored (streaming) but without losing optimizations How can this be achieved?
  • 13. SQOOPSQOOP Import data from RDBMS into Hadoop Create java classes and hive tables on import Export data back to RDBMS Runs a "Map Only" job to perform the task Supports incremental imports Now supports import right away as Parquet
  • 14. HIVE & ACIDHIVE & ACID Recently a conceptual change has been introduced into Hive: CRUD with ACID Transactions. It is not meant to replace your OLTP but rather supply a better data modification mechanism to a subset of the data. Explanation on how it works Demo simple insert Still requires M/R :(
  • 15. HIVE & STREAMING INGESTHIVE & STREAMING INGEST With the new ACID capabilities it is now possible to continously insert data into hive Data apperas almost immediately Data is optimized in a columnar format Data is compacted by different triggers Code snippet
  • 16. FLUMEFLUME Distributed Durable Scalable Fault Tolerante Serves for ingestion and basic pre-processing of the data Composed of source -> channel -> Sink (Draw Architecture) Utilized Hive's ACID capabilities to instantly stream data into hive - demo