Interactive SQL-on-Hadoop
Ronen Ovadya, Ofir Manor,
JethroData
from Impala to Hive/Tez to
Spark SQL to JethroData
About JethroData

Founded 2012

Raised funding from Pitango in 2013

Engineering in Israel, branch in New York

Launched beta July 2014

We're hiring!
About Me
Ofir Manor

Worked in the database industry for 20 years

Started as developer and DBA

Worked for Oracle, Greenplum pre-sales roles

Blogging on Big Data and Hadoop

Currently Product Manager at JethroData
ofir@jethrodata.com
http://ofirm.wordpress.com/
@ofirm
Agenda

How did we get here?

How do parallel databases on Hadoop work?

Impala, Hive/Tez, Spark SQL

Use case perspective

JethroData – What? How? Demo
SQL-on-Hadoop?
SQL?
Big SQL
(too serious for a cute logo)
“Stinger”
Apache Tez?
What about NoSQL on Hadoop?
What about Lucene / SOLR?
Sounds Familiar?
“Let's bring all the data that our operational systems
create into one place, and keep it all forever.
When we'll analyze that repository, we will surely
uncover critical business insights...”
What is this concept called?

Today - “Big Data”

Last 20 years - “Enterprise Data Warehouse”
(EDW)
Big Data vs. Data Warehouse?
Everyone calls themselves “Big Data” now...
But “Big Data” is led by large web companies with new
requirements:

Web scale – orders of magnitude more data
 Page views and clicks, mobile apps, sensor data etc

Cost – dramatically lower price per node

Aversion to vendor lock-in - open-source preference

Methodology – correctness vs. agility
 Classical EDW / Schema-on-write – data must be cleaned and
integrated before business users can access it (EDW vs. ADW)
 Big Data / schema-on-read – let the data science team handle the
unfiltered crap (in addition to some vetted schema-on-write data sets)
Evolution – 90s

“Big Data” was “Enterprise Data Warehouse”

You could either build your own using a big Unix
machine and enterprise storage
 “Scale Up”

Or just buy a Teradata appliance
World's first production 1TB EDW, 1992
Evolution – 2000s

A new generation of parallel databases -
“10x-100x faster, 10x cheaper”
 Netezza, Vertica, Greenplum, Aster, Paraccel etc
 All used a cluster of cheap servers without
enterprise storage to run parallel analytic SQL
 Shared-Nothing Architecture /
MPP (Massive Parallel Processing) /
Scale-Out
Evolution – Our Days

It's all about Hadoop

Hadoop started as a batch platform - HDFS and MapReduce

Lately, it became a shared platform for any type of parallel
processing framework
Example – Hortonworks slide
SQL-on-Hadoop
Underlying Design

Hadoop uses the same parallel design pattern as
the parallel databases from last decade

Surprisingly, all SQL-on-Hadoop engines also use the same
design pattern as the previous generation of parallel databases!

HDFS

MapReduce

Tez

Spark

Hive

Impala

Presto

Tajo

Drill

Shark

Spark SQL

Pivotal HAWQ

IBM BigSQL

Teradata Aster

Hadapt

Rainstor

...
The Parallel Design
(Shared-Nothing MPP)
Client: SELECT day, sum(sales) FROM t1 WHERE prod=‘abc’ GROUP BY day
[Diagram: five data nodes, each running a query planner/manager and a query executor; the client's query is accepted by one planner/manager, which coordinates the query executors on all nodes]
The Parallel Design Principles
(Shared-Nothing MPP)

Full Scan
– Each node reads all of its local portion of the table
– Optimize with large sequential reads

Maximize Locality
– minimize inter-node work
– Work should be evenly distributed to avoid bottlenecks
– Avoid data and processing skew

That leads to a hard design-time trade-off
 Greatly depends on the physical table organization
 Choose partition keys, distribution keys and sort keys
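The shared-nothing pattern above can be sketched in a few lines. This is a toy model, not any engine's actual implementation: each "node" full-scans only its local rows, filters and partially aggregates them, and a coordinator step merges the per-node partials, mirroring the client query on the previous slide.

```python
from collections import Counter

def local_scan_and_aggregate(local_rows, prod):
    """Runs on one data node: filter + partial GROUP BY day."""
    partial = Counter()
    for row in local_rows:  # full scan of the node's local portion
        if row["prod"] == prod:
            partial[row["day"]] += row["sales"]
    return partial

def merge_partials(partials):
    """Coordinator step: combine per-node partial aggregates."""
    total = Counter()
    for p in partials:
        total.update(p)  # Counter.update adds counts
    return dict(total)

# Rows hash-distributed across three "nodes"
nodes = [
    [{"day": 1, "prod": "abc", "sales": 10},
     {"day": 2, "prod": "xyz", "sales": 5}],
    [{"day": 1, "prod": "abc", "sales": 7}],
    [{"day": 2, "prod": "abc", "sales": 3}],
]
partials = [local_scan_and_aggregate(n, "abc") for n in nodes]
print(merge_partials(partials))  # {1: 17, 2: 3}
```

Note how skew shows up directly in this model: if one node's list holds most of the matching rows, it dominates the runtime, which is exactly the distribution-key trade-off described above.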
How to Get Decent Performance
Hive vs. Impala
1. Minimize Query I/O

Skip reading unneeded data
2. Optimize Query Execution

Pick best plan

Optimize each step

Connect the steps efficiently
1. Minimize Query I/O

Using Partitioning (Hive, Impala etc)
 Full scan of less data
 Manual tuning – too few vs. too many

Using columnar, scan-optimized file formats
 “Write once, full scan a thousand times”
 Skip unneeded columns
 Full scan smaller files - encode and compress per-column
 Skip blocks (for sorted data)

Big effort in the Hadoop space, mostly done
 Built two comparable formats - ORC and Parquet
 Use the right one – Hive/ORC or Impala/Parquet
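Partition pruning, the first technique above, can be illustrated with a toy layout (hypothetical directory names, in the spirit of Hive/Impala `key=value` partitions): a filter on the partition key lets the engine skip whole partitions without reading them.

```python
# Toy partitioned table: one entry per partition directory
partitions = {
    "date=2014-07-01": [("abc", 10), ("xyz", 4)],
    "date=2014-07-02": [("abc", 7)],
    "date=2014-07-03": [("xyz", 2)],
}

def scan(predicate_date):
    """Full-scan only the partitions that can match the predicate."""
    scanned = []
    for part_key, rows in partitions.items():
        if part_key != "date=" + predicate_date:
            continue  # pruned: this partition is never read
        scanned.extend(rows)
    return scanned

print(scan("2014-07-02"))  # [('abc', 7)] - one partition read, two skipped
```

The manual-tuning trade-off above is visible here too: too few partitions and pruning saves little; too many and the per-partition overhead (small files, metadata) dominates.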
How the Parquet and ORC
columnar formats work
• Data is divided into blocks - chunks of rows
• In each block, data is physically stored column by
column
• Store additional metadata per column in a block,
like min / max values
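The three bullets above can be modeled in miniature (a sketch of the idea, not the actual Parquet/ORC on-disk layout): rows are grouped into blocks, each block stores its columns separately plus per-column min/max metadata, and a range predicate uses that metadata to skip entire blocks.

```python
def build_blocks(rows, block_size=3):
    """Group (id, price) rows into blocks stored column by column."""
    blocks = []
    for i in range(0, len(rows), block_size):
        chunk = rows[i:i + block_size]
        cols = {"id": [r[0] for r in chunk],
                "price": [r[1] for r in chunk]}
        meta = {c: (min(v), max(v)) for c, v in cols.items()}
        blocks.append({"columns": cols, "meta": meta})
    return blocks

def scan_price_gt(blocks, threshold):
    """Scan only the price column, skipping blocks via min/max metadata."""
    result, blocks_read = [], 0
    for b in blocks:
        lo, hi = b["meta"]["price"]
        if hi <= threshold:
            continue  # whole block skipped without reading its data
        blocks_read += 1
        result.extend(p for p in b["columns"]["price"] if p > threshold)
    return result, blocks_read

rows = [(1, 5), (2, 6), (3, 4), (4, 50), (5, 60), (6, 70)]
print(scan_price_gt(build_blocks(rows), 10))  # ([50, 60, 70], 1)
```

Block skipping pays off most on sorted data, as the slide notes: sorting clusters similar values into the same blocks, so their min/max ranges rarely straddle the predicate.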
2. Optimize Query Execution
1. Pick the best execution plan – Query Optimizer
 Cost-based Optimizations - currently generally weak
2. Optimize each step – Efficient Processing
 Vectorized operations, expression compilation etc
3. Combine the steps efficiently - Execution Engines
 Batch-oriented (MapReduce) – focus on recoverability

write intermediate results to disk after every step
 Streaming-oriented (Tez, Impala, HAWQ etc) – focus on performance

Move intermediate results directly between processes

Requires many more resources at once
 Hybrid (Spark) – enjoy both worlds

Stream and re-use intermediate results, optimize for in-memory

But can recover / recompute on node failure
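The batch vs. streaming distinction above can be caricatured with Python generators (an analogy only, not how these engines are built): the "batch" pipeline materializes every intermediate result, like MapReduce writing to disk after each step, while the "streaming" pipeline moves rows directly from step to step.

```python
def batch_pipeline(rows):
    filtered = [r for r in rows if r % 2 == 0]  # step 1 fully materialized
    doubled = [r * 2 for r in filtered]         # step 2 fully materialized
    return sum(doubled)

def streaming_pipeline(rows):
    filtered = (r for r in rows if r % 2 == 0)  # lazy: no intermediate list
    doubled = (r * 2 for r in filtered)         # rows flow between steps
    return sum(doubled)

data = range(10)
print(batch_pipeline(data), streaming_pipeline(data))  # 40 40
```

Same answer, different trade-off: the materialized lists could be re-read after a failure (recoverability), while the generator chain holds nothing to restart from but touches no intermediate storage (performance).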
Impala

Impala Highlights
 Basic partitioning (partition per key)
 Optimized I/O with Parquet
 Built their own streaming execution engine
 Efficient processing
 Basic cost-based query optimizations

Impala vs. Presto vs. Drill vs. Tajo
 Generally same high-level design
 Four independent teams implementing it in their own way
 Impala way ahead of the pack

Performance, functionality, market share, support
Impala

Sample benchmark results from Cloudera (May 2014)

20 nodes running TPCDS data, scale factor 15,000 (GB)

Impala 1.3.0 vs. Hive 0.13 on Tez vs. Shark 0.9.2 vs. Presto 0.6.0
Source: Cloudera blog - New SQL Choices in the Apache Hadoop Ecosystem: Why Impala Continues to Lead
Hive

Hive Highlights
 Rich SQL support
 Basic partitioning (partition per key)
 Optimized I/O with ORC
 Cost-based optimizer coming (Hive 0.14)
 Efficient processing – vectorized query execution
 Reliable execution (MapReduce) or
fast execution (Tez)
 Hive on Spark (Shark)

Has recently reached “end-of-life”, sort-of
Spark SQL

Spark SQL Highlights
 Very early days / alpha state

Announced Mar 2014
 A SQL-on-Spark solution written from scratch
 Should support reading / writing from Hadoop,
specifically from/to Hive and Parquet
 Could become an interesting player next year

Mostly vs. Impala
Query Use Cases
Querying Raw Data

Ad-hoc “Data Science” investigative work
 Typically in the minutes to hours range

Need to make sense of many, ever-changing data sources

Need to make sense of text / semi-structured / dynamic
schema sources

Need to mix SQL, text processing (UDFs) and machine
learning algorithms in a manual, multi-step fashion
[Use cases: Raw Data]
Query Use Cases
Reporting

Internal analyst teams repeatedly run reports
on common data sets

Likely cleaned and vetted, potentially
aggregated, shared across many users

Use latest Hive/ORC/Tez or Impala/Parquet
– Improves response time from hours to minutes
[Use cases: Raw Data → Reporting]
Query Use Cases
Pre-Compute Specific Queries (1)

Queries embedded in external websites / apps / dashboards
 Customers expect page loads of a second or two

Must pre-compute results to get response time

“Old Style” implementation
1. Daily massive batch computation (MapReduce)
2. Results are typically pushed to an external OLAP solution (cubes) or
to a key-value store (HBase / Redis etc)

Great performance but only for a few queries or counters

“Freshness”: data is typically a day or more behind production

Complexity: multiple steps and systems, massive daily spike in
computation, storage, data movement
[Use cases: Raw Data → Reporting → Immediate]
Query Use Cases
Pre-Compute Specific Queries (2)
“New Style” implementation – Stream Processing

Continuously update pre-computed results as new events arrive
 Events pile up in some queue (Apache Kafka)
 Process events in micro-batches (Apache Storm)
 The new computation results are constantly stored / updated in a key-value store

Solves the “freshness” problem

Not suitable for complex pre-computations - best for counters

Bleeding-edge technology
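The micro-batch counter pattern above can be sketched as follows (a hypothetical stand-in: a plain dict plays the role of the key-value store, and the event batches stand in for what a Kafka consumer would hand to Storm):

```python
from collections import defaultdict

kv_store = defaultdict(int)  # stand-in for HBase / Redis

def process_micro_batch(events):
    """Aggregate within the batch, then apply one update per key."""
    batch_counts = defaultdict(int)
    for event in events:
        batch_counts[event["page"]] += 1
    for page, count in batch_counts.items():
        kv_store[page] += count  # incremental, so results stay fresh

process_micro_batch([{"page": "/home"}, {"page": "/buy"}, {"page": "/home"}])
process_micro_batch([{"page": "/home"}])
print(dict(kv_store))  # {'/home': 3, '/buy': 1}
```

The limits noted above are visible even in the sketch: increments compose trivially for counters and sums, but a pre-computation that needs the full history (a median, a distinct count) cannot be updated this way.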
[Use cases: Raw Data → Reporting → Immediate]
Query Use Cases
Fast Ad-hoc Queries
 Interactive BI is the “Holy Grail” of SQL on Hadoop
 Analysts / business users want to interact with select data sets
from their BI tool
 drag columns in the BI tool, “slice n' dice”, drill-downs
 Response time of seconds to tens of seconds from their BI tool
 Existing solutions are too slow – users are stuck with reporting
 Very hard to achieve with existing Hadoop technologies
 Need a different solution – maybe something that has worked
for databases in the last 30 years?
 It's time to introduce JethroData...
[Use cases: Raw Data → Reporting → Immediate → Interactive BI]
JethroData

Index-based, columnar SQL-on-Hadoop
– Delivers interactive BI on Hadoop
– Focused on ad-hoc queries from BI tools
– The more you drill down, the faster you go

SQL-on-Hadoop
– Column data and indexes are stored on HDFS
– Compute is on dedicated edge nodes
• Non-intrusive, nothing installed on the Hadoop cluster
• Our own SQL engine, not using MapReduce / Tez
– Supports standard ANSI SQL, ODBC/JDBC
JethroData
Working Differently
Compared to Impala

Using Cloudera's benchmark (data and SQLs)

Scale Factor 1,000 (GB)

Running on AWS
Demo 1
Access JethroData from a remote JDBC client - SQL Workbench
Small CDH5 cluster on AWS
Jethro Indexes

Indexes map each column value
to a set of rows

Jethro stores indexes as
hierarchical compressed bitmaps
 Very fast query operations – AND / OR / NOT
 Can process the entire WHERE clause into a final list of rows
 Fully indexed – all columns are automatically indexed

INSERT Performance
– Jethro Indexes are append-only
 If needed, new repeating entries are allowed
 INSERT is very fast – files are appended, no random read/write
 Compatible with HDFS
 Periodic background merge (non-blocking)
Value Rows
FR rows 5,9,10,11,14
IL rows 1,3,7,12,13
US rows 2,4,6,8,15
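The value→rows table above maps naturally onto bitmaps. As a toy illustration (Python integers as bitmaps, not Jethro's actual hierarchical compressed format; the `status_index` is an invented second column for the example), AND / OR / NOT over the WHERE clause become cheap bitwise operations that yield the final row list:

```python
def rows_to_bitmap(rows):
    """Bit i set means row i matches this value."""
    bm = 0
    for r in rows:
        bm |= 1 << r
    return bm

def bitmap_to_rows(bm):
    return [i for i in range(bm.bit_length()) if bm >> i & 1]

# The country index from the table above
country_index = {
    "FR": rows_to_bitmap([5, 9, 10, 11, 14]),
    "IL": rows_to_bitmap([1, 3, 7, 12, 13]),
    "US": rows_to_bitmap([2, 4, 6, 8, 15]),
}
# Hypothetical second indexed column
status_index = {"active": rows_to_bitmap([1, 2, 3, 9, 10])}

# WHERE country IN ('FR', 'IL') AND status = 'active'
result = (country_index["FR"] | country_index["IL"]) & status_index["active"]
print(bitmap_to_rows(result))  # [1, 3, 9, 10]
```

Only the rows in the final bitmap are ever fetched from storage, which is why combining all relevant indexes first can cut fetch I/O so dramatically.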
Jethro Execution Logic
Index Phase
Use all relevant indexes to compute which rows are needed for query results
Divide the work into multiple parallel units.
Parallel Execution Phase

Parallelize both fetch and compute

Works with or without partitions

Currently multi-threaded in a single server, designed for multi-node
execution
[Flow: Parse Query → Identify Results from Indexes → Parallel Execution → Results]
Technical Architecture
Minimize Query I/O
part 1

Automatically combine all relevant indexes for query
execution – dramatically reduce fetch I/O
 Generate a bitmap of all relevant rows after evaluating the entire WHERE clause
 Skip indexes when their value is minimal (WHERE a>0)
or when indexes are not applicable (WHERE function(col)=5)

Use Jethro columnar file format to:
 Skip columns
 Encode and compress data (smaller files)
 Skip blocks (when a column must be scanned)

HDFS I/O optimizations
 Optimized fetch size for bulk fetches
 Using skip scans to convert random reads to single I/O, if possible
Minimize Query I/O
part 2

Automatic Local Cache
 Simple to set up
 Metadata is automatically cached locally by priority,
in the background
 Can ask to cache specific columns or tables
 Improves latency and reduces HDFS roundtrips

Use partitioning?
 Only small effect – we only read relevant rows, with or
without partitioning
 At the high-end, helps operating on smaller bitmaps
 Mostly for rolling window operations
Optimize Query Execution

Query Optimizer - pick the best plan (set of steps)
 Using up-to-date detailed statistics from the indexes
 For example, star transformation

Efficient Processing – optimize each step
 Efficient bitmap operations

Execution Engines – combine the steps efficiently
 Multi-threaded, streaming, parallel execution
 Parallelize everything:
Fetch, Filter, Join, Aggregate, Sort etc
Scalability
scale to 100s of billions of rows

Scale out HDFS (Hadoop Cluster), if needed
 Provide extra storage or extra I/O performance (rare)

Scale out Jethro nodes
 Jethro query nodes are stateless
 Add nodes to support additional concurrent queries

Leverage partitioning
 Support rolling window maintenance at scale
 Works best with a few billion rows per partition – usually
partition for maintenance, not performance
High-Availability

Leverage all HDFS HA goodies
 DataNode Replication
 NameNode HA
 HDFS Snapshots
 Any HDFS DR solution

All Jethro nodes are stateless
 Service auto-start on restart
 Can start and stop them on demand
Demo 2
Access JethroData from Tableau Server over ODBC
Small CDH5 cluster on AWS
Questions?
Talk to us:
ronen@jethrodata.com
ofir@jethrodata.com
Join our beta!
www.jethrodata.com
Hadoop and Big Data: Revealed
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers Conference
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big deal
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and Search
 
Survey of Real-time Processing Systems for Big Data
Survey of Real-time Processing Systems for Big DataSurvey of Real-time Processing Systems for Big Data
Survey of Real-time Processing Systems for Big Data
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)
 

Recently uploaded

一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
u86oixdj
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
2023240532
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 

Recently uploaded (20)

一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 

Interactive SQL-on-Hadoop and JethroData

  • 1. Interactive SQL-on-Hadoop – from Impala to Hive/Tez to Spark SQL to JethroData. Ronen Ovadya, Ofir Manor, JethroData
  • 2. About JethroData
    – Founded 2012
    – Raised funding from Pitango in 2013
    – Engineering in Israel, branch in New York
    – Launched beta July 2014
    – We're hiring!
  • 3. About Me – Ofir Manor
    – Worked in the database industry for 20 years
    – Started as developer and DBA
    – Worked for Oracle, Greenplum in pre-sales roles
    – Blogging on Big Data and Hadoop
    – Currently Product Manager at JethroData
    ofir@jethrodata.com | http://ofirm.wordpress.com/ | @ofirm
  • 4. Agenda
    – How did we get here?
    – How do parallel databases on Hadoop work?
    – Impala, Hive/Tez, Spark SQL
    – A use-case perspective
    – JethroData – What? How? Demo
  • 5. SQL-on-Hadoop? SQL? Big SQL (too serious for a cute logo) “Stinger” Apache Tez? What about NoSQL on Hadoop? What about Lucene / SOLR?
  • 6. Sounds Familiar? “Let's bring all the data that our operational systems create into one place, and keep it all forever. When we analyze that repository, we will surely uncover critical business insights...” What is this concept called?
    – Today – “Big Data”
    – For the last 20 years – “Enterprise Data Warehouse” (EDW)
  • 7. Big Data vs. Data Warehouse? Everyone calls themselves “Big Data” now... But “Big Data” is led by large web companies with new requirements:
    – Web scale – orders of magnitude more data: page views and clicks, mobile apps, sensor data, etc.
    – Cost – dramatically lower price per node
    – Aversion to vendor lock-in – open-source preference
    – Methodology – correctness vs. agility
      – Classical EDW / schema-on-write – data must be cleaned and integrated before business users can access it (EDW vs. ADW)
      – Big Data / schema-on-read – let the data science team handle the unfiltered crap (in addition to some vetted schema-on-write data sets)
  • 8. Evolution – 90s
    – “Big Data” was “Enterprise Data Warehouse”
    – You could either build your own using a big Unix machine and enterprise storage (“Scale Up”)
    – Or just buy a Teradata appliance
    World's first production 1TB EDW, 1992 at
  • 9. Evolution – 2000s
    – A new generation of parallel databases – “10x-100x faster, 10x cheaper”
    – Netezza, Vertica, Greenplum, Aster, ParAccel, etc.
    – All used a cluster of cheap servers without enterprise storage to run parallel analytic SQL
    – Shared-Nothing Architecture / MPP (Massively Parallel Processing) / Scale-Out
  • 10. Evolution – Our Days
    – It's all about Hadoop
    – Hadoop started as a batch platform – HDFS and MapReduce
    – Lately, it has become a shared platform for any type of parallel processing framework (example – Hortonworks slide)
  • 11. SQL-on-Hadoop Underlying Design
    – Hadoop uses the same parallel design pattern as the parallel databases of the last decade
    – Surprisingly, all the SQL-on-Hadoop engines also use the same design pattern as the previous parallel databases!
    – HDFS / MapReduce / Tez / Spark; Hive, Impala, Presto, Tajo, Drill, Shark, Spark SQL, Pivotal HAWQ, IBM BigSQL, Teradata Aster, Hadapt, RainStor...
  • 12. The Parallel Design (Shared-Nothing MPP) – diagram: a client issues SELECT day, sum(sales) FROM t1 WHERE prod=‘abc’ GROUP BY day; the query is handled by a Query Planner/Manager and fanned out to a Query Executor on each Data Node.
  • 13. The Parallel Design Principles (Shared-Nothing MPP)
    – Full Scan – each node reads all of its local portion of the table; optimize with large sequential reads
    – Maximize Locality – minimize inter-node work; work should be evenly distributed to avoid bottlenecks; avoid data and processing skew
    – That leads to a hard design-time trade-off
    – Performance greatly depends on the physical table organization – choose partition keys, distribution keys and sort keys
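The scatter-gather pattern of slides 12–13 can be sketched as follows. This is a purely illustrative toy (not any engine's actual code): each "node" scans only its local rows, filters and partially aggregates, and a coordinator merges the partial results — the same shape as the GROUP BY query in the diagram.

```python
# Toy sketch of the shared-nothing MPP pattern (illustrative only):
# per-node local scan + partial aggregation, then a coordinator merge.
from collections import Counter

def node_scan(local_rows, prod):
    # Full scan of this node's local portion; filter and pre-aggregate locally.
    partial = Counter()
    for day, p, sales in local_rows:
        if p == prod:
            partial[day] += sales
    return partial

def coordinator(nodes, prod):
    # Merge the per-node partial aggregates (the "gather" step).
    total = Counter()
    for rows in nodes:
        total.update(node_scan(rows, prod))
    return dict(total)

# Two "nodes", each holding a disjoint slice of table t1.
nodes = [
    [("mon", "abc", 10), ("mon", "xyz", 5)],
    [("mon", "abc", 7), ("tue", "abc", 3)],
]
print(coordinator(nodes, "abc"))  # {'mon': 17, 'tue': 3}
```

Note how skew would hurt here: if one node held most of the `'abc'` rows, the whole query would wait for it — hence the emphasis on even distribution.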
  • 14. How to Get Decent Performance (Hive vs. Impala)
    1. Minimize Query I/O – skip reading unneeded data
    2. Optimize Query Execution – pick the best plan; optimize each step; connect the steps efficiently
  • 15. 1. Minimize Query I/O
    – Using partitioning (Hive, Impala, etc.) – full scan of less data; manual tuning – too few vs. too many partitions
    – Using columnar, scan-optimized file formats – “write once, full scan a thousand times”; skip unneeded columns; full scan smaller files – encode and compress per column; skip blocks (for sorted data)
    – Big effort in the Hadoop space, mostly done – two comparable formats were built, ORC and Parquet; use the right one – Hive/ORC or Impala/Parquet
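Partition pruning, the first bullet above, can be illustrated with a toy sketch (the directory naming mimics Hive-style `day=...` partitions; the data is made up): a filter on the partition key lets the engine skip whole partitions without reading their files.

```python
# Illustrative sketch of partition pruning: a table partitioned by day,
# where a filter on `day` means only matching partitions are ever scanned.
partitions = {
    "day=2014-07-01": [("abc", 10), ("xyz", 4)],
    "day=2014-07-02": [("abc", 7)],
    "day=2014-07-03": [("abc", 2)],
}

def query(partitions, wanted_days):
    scanned, total = 0, 0
    for part, rows in partitions.items():
        day = part.split("=")[1]
        if day not in wanted_days:
            continue  # partition pruned: its files are never read
        scanned += 1
        total += sum(sales for prod, sales in rows if prod == "abc")
    return scanned, total

print(query(partitions, {"2014-07-02"}))  # (1, 7) – one partition scanned
```

The manual-tuning trade-off is visible even here: too few partitions and pruning saves little; too many and the metadata overhead per partition dominates.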
  • 16. How do the Parquet and ORC columnar formats work?
    – Data is divided into blocks – chunks of rows
    – In each block, data is physically stored column by column
    – Additional per-column metadata is stored in each block, like min / max values
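The three bullets above can be sketched in a few lines. This is a simplified model, not the actual Parquet or ORC layout: rows are grouped into blocks stored column by column, with per-column min/max metadata used to skip whole blocks during a range scan.

```python
# Simplified model of a columnar block with per-column min/max metadata
# (illustrative only; real Parquet/ORC layouts are more elaborate).
def make_block(rows):
    cols = {name: [r[name] for r in rows] for name in rows[0]}
    meta = {name: (min(vals), max(vals)) for name, vals in cols.items()}
    return {"columns": cols, "meta": meta}

def scan(blocks, col, lo, hi):
    # Read only the requested column, skipping blocks whose min/max
    # range cannot contain a matching value.
    out = []
    for b in blocks:
        bmin, bmax = b["meta"][col]
        if bmax < lo or bmin > hi:
            continue  # whole block skipped without touching its data
        out.extend(v for v in b["columns"][col] if lo <= v <= hi)
    return out

blocks = [make_block([{"x": 1}, {"x": 3}]), make_block([{"x": 50}, {"x": 60}])]
print(scan(blocks, "x", 0, 10))  # [1, 3] – second block never read
```

This is why the slide notes that block skipping works best "for sorted data": sorting keeps each block's min/max range narrow, so far more blocks can be eliminated.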
  • 17. 2. Optimize Query Execution
    1. Pick the best execution plan – Query Optimizer; cost-based optimizations – currently generally weak
    2. Optimize each step – efficient processing; vectorized operations, expression compilation, etc.
    3. Combine the steps efficiently – Execution Engines
      – Batch-oriented (MapReduce) – focus on recoverability; write intermediate results to disk after every step
      – Streaming-oriented (Tez, Impala, HAWQ, etc.) – focus on performance; move intermediate results directly between processes; requires far more resources at once
      – Hybrid (Spark) – enjoy both worlds; stream and re-use intermediate results, optimize for in-memory, but can recover / recompute on node failure
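The batch vs. streaming distinction in point 3 can be shown with a toy pipeline (purely pedagogical, standing in for multi-node engines): the batch version materializes every step's full output, as MapReduce does with disk, while the streaming version pipes rows between steps with no intermediate buffer.

```python
# Toy contrast of the two execution styles (illustrative only).
def batch_pipeline(rows):
    step1 = [r * 2 for r in rows]        # fully materialized, like an
    step2 = [r for r in step1 if r > 4]  # intermediate file on disk
    return sum(step2)

def streaming_pipeline(rows):
    # Generators chain the steps; each row flows straight through,
    # nothing is buffered between operators.
    step1 = (r * 2 for r in rows)
    step2 = (r for r in step1 if r > 4)
    return sum(step2)

print(batch_pipeline([1, 2, 3, 4]))      # 14
print(streaming_pipeline([1, 2, 3, 4]))  # 14 – same answer, no buffers
```

The trade-off mirrors the slide: if a step crashes mid-stream there is no materialized checkpoint to restart from, which is exactly what the batch style buys with its extra I/O.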
  • 18. Impala
    – Impala highlights: basic partitioning (partition per key); optimized I/O with Parquet; built their own streaming execution engine; efficient processing; basic cost-based query optimizations
    – Impala vs. Presto vs. Drill vs. Tajo: generally the same high-level design, with four independent teams implementing it in their own way; Impala is way ahead of the pack – performance, functionality, market share, support
  • 19. Impala
    – Sample benchmark results from Cloudera (May 2014)
    – 20 nodes running TPC-DS data, scale factor 15,000 (GB)
    – Impala 1.3.0 vs. Hive 0.13 on Tez vs. Shark 0.9.2 vs. Presto 0.6.0
    Source: Cloudera blog – New SQL Choices in the Apache Hadoop Ecosystem: Why Impala Continues to Lead
  • 20. Hive
    – Hive highlights: rich SQL support; basic partitioning (partition per key); optimized I/O with ORC; cost-based optimizer coming (Hive 0.14); efficient processing – vectorized query execution; reliable execution (MapReduce) or fast execution (Tez)
    – Hive on Spark (Shark) – has recently reached “end-of-life”, sort of
  • 21. Spark SQL
    – Spark SQL highlights: very early days / alpha state; announced March 2014
    – A SQL-on-Spark solution written from scratch
    – Should support reading / writing from Hadoop, specifically from/to Hive and Parquet
    – Could become an interesting player next year, mostly vs. Impala
  • 22. Query Use Cases – Querying Raw Data
    – Ad-hoc “data science” investigative work, typically in the minutes-to-hours range
    – Need to make sense of many, ever-changing data sources
    – Need to make sense of text / semi-structured / dynamic-schema sources
    – Need to mix SQL, text processing (UDFs) and machine learning algorithms in a manual, multi-step fashion
  • 23. Query Use Cases – Reporting
    – Internal analyst teams repeatedly run reports on common data sets
    – Likely cleaned and vetted, potentially aggregated, shared across many users
    – Use the latest Hive/ORC/Tez or Impala/Parquet – improves response time from hours to minutes
  • 24. Query Use Cases – Pre-Compute Specific Queries (1)
    – Queries embedded in external websites / apps / dashboards; customers expect page loads of a second or two; results must be pre-computed to get that response time
    – “Old style” implementation: 1. daily massive batch computation (MapReduce); 2. results are typically pushed to an external OLAP solution (cubes) or to a key-value store (HBase / Redis, etc.)
    – Great performance, but only for a few queries or counters
    – “Freshness”: data is typically a day or more behind production
    – Complexity: multiple steps and systems; massive daily spike in computation, storage and data movement
  • 25. Query Use Cases – Pre-Compute Specific Queries (2)
    – “New style” implementation – stream processing: continuously update pre-computed results as new events arrive
    – Events pile up in a queue (Apache Kafka); events are processed in micro-batches (Apache Storm); the new computation results are constantly stored / updated in a key-value store
    – Solves the “freshness” problem
    – Not suitable for complex pre-computations – best for counters
    – Bleeding-edge technology
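The micro-batch counter pattern above can be sketched in miniature. Everything here is a stand-in (a `deque` for Kafka, a `Counter` for the key-value store); the point is only the shape: events queue up, small batches are drained, and pre-computed counters stay fresh after every batch.

```python
# Toy micro-batch counter pipeline (stand-ins for Kafka / Storm / a KV store).
from collections import deque, Counter

queue = deque(["page_a", "page_b", "page_a", "page_a"])  # stand-in for Kafka
store = Counter()                                        # stand-in for a KV store

def process_micro_batch(queue, store, batch_size=2):
    # Drain up to batch_size events and fold them into the counters.
    batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
    store.update(batch)  # counters are fresh as soon as the batch lands

while queue:
    process_micro_batch(queue, store)
print(store["page_a"])  # 3
```

This also shows the slide's limitation: incrementing counters is trivial per batch, but a complex pre-computation (say, a multi-way join) cannot be folded in event by event this way.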
  • 26. Query Use Cases – Fast Ad-hoc Queries
    – Interactive BI is the “Holy Grail” of SQL-on-Hadoop
    – Analysts / business users want to interact with select data sets from their BI tool: drag columns, “slice n' dice”, drill-downs, with a response time of seconds to tens of seconds
    – Existing solutions are too slow – users are stuck with reporting; very hard to achieve with existing Hadoop technologies
    – Need a different solution – maybe something that has worked for databases over the last 30 years?
    – It's time to introduce JethroData...
  • 27. JethroData
    – Index-based, columnar SQL-on-Hadoop: delivers interactive BI on Hadoop; focused on ad-hoc queries from BI tools; the more you drill down, the faster you go
    – SQL-on-Hadoop: column data and indexes are stored on HDFS; compute runs on dedicated edge nodes – non-intrusive, nothing installed on the Hadoop cluster; our own SQL engine, not using MapReduce / Tez; supports standard ANSI SQL, ODBC/JDBC
  • 29. Comparing to Impala
    – Using Cloudera's benchmark (data and SQLs)
    – Scale factor 1,000 (GB)
    – Running on AWS
  • 30. Demo 1 – Access JethroData from a remote JDBC client (SQL Workbench), on a small CDH5 cluster on AWS
  • 31. Jethro Indexes
    – Indexes map each column value to a set of rows, e.g.: FR → rows 5,9,10,11,14; IL → rows 1,3,7,12,13; US → rows 2,4,6,8,15
    – Jethro stores indexes as hierarchical compressed bitmaps – very fast query operations (AND / OR / NOT); can process the entire WHERE clause down to a final list of rows
    – Fully indexed – all columns are automatically indexed
    – INSERT performance – Jethro indexes are append-only; if needed, new repeating entries are allowed; INSERT is very fast – files are appended, no random read/write; compatible with HDFS; periodic background merge (non-blocking)
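The value-to-rows mapping on this slide can be modeled with plain sets (a simplification — Jethro actually stores hierarchical compressed bitmaps, and set operations stand in for the bitmap AND/OR/NOT): the whole WHERE clause reduces to set algebra over the per-value row lists.

```python
# Simplified model of a value -> row-set index; sets stand in for the
# compressed bitmaps the slide describes.
index = {
    "FR": {5, 9, 10, 11, 14},
    "IL": {1, 3, 7, 12, 13},
    "US": {2, 4, 6, 8, 15},
}

def insert(index, value, row_id):
    # Append-only: new row ids are only ever added, never rewritten in place.
    index.setdefault(value, set()).add(row_id)

# WHERE country IN ('FR', 'IL')  ->  OR is a union
rows = index["FR"] | index["IL"]
# ... AND NOT country = 'FR'     ->  NOT removes a set
rows -= index["FR"]
print(sorted(rows))  # [1, 3, 7, 12, 13]
```

Because every step is a set (bitmap) operation, the engine knows the exact final row list before touching any column data — which is what makes the subsequent fetch I/O so small.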
  • 32. Jethro Execution Logic
    – Index Phase: use all relevant indexes to compute which rows are needed for query results; divide the work into multiple parallel units of work
    – Parallel Execution Phase: parallelize both fetch and compute; works with or without partitions; currently multi-threaded in a single server, designed for multi-node execution
    – Flow: Parse Query → Identify Results from Indexes → Parallel Execution → Results
  • 34. Minimize Query I/O, part 1
    – Automatically combine all relevant indexes for query execution – dramatically reduces fetch I/O; generate a bitmap of all relevant rows covering the entire WHERE clause; skip indexes when their value is minimal (WHERE a>0) or when indexes are not applicable (WHERE function(col)=5)
    – Use the Jethro columnar file format to: skip columns; encode and compress data (smaller files); skip blocks (when a column must be scanned)
    – HDFS I/O optimizations: optimized fetch size for bulk fetches; skip scans to convert random reads into a single I/O, if possible
  • 35. Minimize Query I/O, part 2
    – Automatic local cache: simple to set up; metadata is automatically cached locally by priority, in the background; can ask to cache specific columns or tables; improves latency and reduces HDFS roundtrips
    – Use partitioning? Only a small effect – we only read relevant rows, with or without partitioning; at the high end, it helps by operating on smaller bitmaps; mostly for rolling-window operations
  • 36. Optimize Query Execution
    – Query Optimizer – pick the best plan (set of steps), using up-to-date detailed statistics from the indexes; for example, star transformation
    – Efficient processing – optimize each step; efficient bitmap operations
    – Execution engine – combine the steps efficiently; multi-threaded, streaming, parallel execution; parallelize everything: fetch, filter, join, aggregate, sort, etc.
  • 37. Scalability – scale to 100s of billions of rows
    – Scale out HDFS (Hadoop cluster), if needed – provides extra storage or extra I/O performance (rare)
    – Scale out Jethro nodes – Jethro query nodes are stateless; add nodes to support additional concurrent queries
    – Leverage partitioning – supports rolling-window maintenance at scale; works best with a few billion rows per partition – usually partition for maintenance, not performance
  • 38. High Availability
    – Leverage all the HDFS HA goodies: DataNode replication, NameNode HA, HDFS snapshots, any HDFS DR solution
    – All Jethro nodes are stateless: service auto-start on restart; can start and stop them on demand
  • 39. Demo 2 – Access JethroData from Tableau Server over ODBC, on a small CDH5 cluster on AWS