SQL on Hadoop 
a Perspective of a Cloud-based, 
Managed Service Provider 
Masahiro Nakagawa 
Sep 13, 2014 
Hadoop Meetup in Taiwan
Today’s agenda 
> Self introduction 
> Why SQL? 
> Hive 
> Presto 
> Conclusion
Who are you? 
> Masahiro Nakagawa 
> github/twitter: @repeatedly 
> Treasure Data, Inc. 
> Senior Software Engineer 
> Fluentd / td-agent developer 
> I love OSS :) 
> D language - Phobos committer 
> Fluentd - Main maintainer 
> MessagePack / RPC - D and Python (only RPC) 
> The organizer of Presto Source Code Reading 
> etc…
Do you love SQL?
Why do we love SQL? 
> Easy to understand what we are doing 
> declarative language 
> common interface for data manipulation 
> There are many users 
> SQL is not the best, but it is 
better than uncommon interfaces
We want to use SQL 
in the Hadoop world
SQL Players on Hadoop 
(The original slide marks commercial products by color.) 
Batch (latency: minutes - hours) 
> Hive 
> Spark SQL 
Short Batch / Low latency (latency: seconds - minutes) 
> Presto 
> Impala 
> Drill 
> HAWQ 
> Actian 
> etc… 
Stream (latency: immediate) 
> Norikra 
> StreamSQL
SQL Players on Hadoop 
Red Ocean: Batch, Short Batch / Low latency 
> Hive 
> Spark SQL 
> Presto 
> Impala 
> Drill 
> HAWQ 
> Actian 
> etc… 
Blue Ocean?: Stream 
> Norikra 
> StreamSQL
3 query engines on Treasure Data 
> Hive (batch) 
> for ETL and scheduled reporting 
> Presto (short batch / low latency) 
> for Ad hoc queries 
> Pig 
> Not SQL 
> There aren’t as many users… ;( 
Today’s talk: Hive and Presto
Hive 
https://hive.apache.org/
What’s Hive 
> Needs no explanation ;) 
> Most popular project in the ecosystem 
> HiveQL and MapReduce 
> Writing MapReduce code is hard 
> Hive is growing rapidly thanks to the Stinger initiative 
> Vectorized Processing 
> Query optimization with statistics 
> Tez instead of MapReduce 
> etc…
Apache Tez 
> Low level framework for YARN applications 
> Next generation query engine 
> Provide good IR for Hive, Pig and more 
> Task and DAG based pipelining 
> Spark uses a similar DAG model 
Input Processor Output 
Task DAG 
http://tez.apache.org/
Hive on MR vs. Hive on Tez 
SELECT g1.x, g1.avg, g2.cnt 
FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1 
JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2 
ON (g1.x = g2.x) ORDER BY avg; 
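A quick way to sanity-check the shape of this query is to run it on toy data. The sketch below uses Python's built-in sqlite3 module rather than Hive, so it only illustrates the SQL semantics (two grouped subqueries, a join, a sort), not how Hive executes it. The sample rows, the comma after a.x, the AVG function and the cnt alias are assumptions made so that the statement parses and the selected columns exist.

```python
import sqlite3

# In-memory database with two tiny tables standing in for a and b.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER, y REAL);
    CREATE TABLE b (x INTEGER, y TEXT);
    INSERT INTO a VALUES (1, 10), (1, 20), (2, 5);
    INSERT INTO b VALUES (1, 'p'), (1, 'q'), (2, 'r');
""")

# Two GROUP BY subqueries, joined on x, then sorted by the average.
query = """
    SELECT g1.x, g1.avg, g2.cnt
    FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
    JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
    ON (g1.x = g2.x) ORDER BY avg
"""
rows = conn.execute(query).fetchall()
```

Each logical step here (GROUP BY on a, GROUP BY on b, JOIN, ORDER BY) corresponds to one box in the MapReduce vs. Tez comparison.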
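[Diagram: MapReduce vs. Tez execution of the query above, M = map, R = reduce] 
MapReduce: GROUP a BY a.x, GROUP b BY b.x, JOIN (a, b) and ORDER BY run as separate M/R jobs, with results written to HDFS between jobs 
Tez: GROUP BY a.x, GROUP BY x, JOIN (a, b) and ORDER BY run as one DAG of M and R steps 
> Can avoid unnecessary HDFS write 
http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9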
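The saving in HDFS writes can be made concrete with a toy count. This is a minimal sketch, assuming each operator of the example query becomes its own MapReduce job and that a DAG engine only materializes outputs that no later operator consumes; the PLAN dictionary and both function names are hypothetical, not part of Hive or Tez.

```python
# Operators of the example query, each mapped to the operators it reads from.
PLAN = {
    "group_a": [],
    "group_b": [],
    "join": ["group_a", "group_b"],
    "order_by": ["join"],
}

def hdfs_writes_mapreduce(plan):
    """One job per operator: every operator's output is written to HDFS."""
    return len(plan)

def hdfs_writes_dag(plan):
    """Single DAG: only outputs that no other operator consumes are written."""
    consumed = {dep for deps in plan.values() for dep in deps}
    return sum(1 for op in plan if op not in consumed)

mr_writes = hdfs_writes_mapreduce(PLAN)
dag_writes = hdfs_writes_dag(PLAN)
```

Under these assumptions the chain of jobs writes four times and the single DAG writes once, which is the gap the slide points at.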
Why still use MapReduce? 
> The emphasis is on stability / reliability 
> Speed is important but not most important 
> Can use an MPP query engine for short batch 
> Tez/Spark are immature 
> Hard to manage in a multi-tenant env 
> Different failure models 
> We are now testing Tez for Hive 
• No code change needed for Hive. Spark is hard… 
• Disabling Tez is easy. Just remove 
‘set hive.execution.engine=tez;’
Presto 
http://prestodb.io/
What’s Presto? 
A distributed SQL query engine 
for interactive data analysis 
against GBs to PBs of data.
Presto’s history 
> 2012 Fall: Project started at Facebook 
> Designed for interactive query 
with speed of commercial data 
warehouse 
> and scalability to the size of Facebook 
> 2013 Winter: Open sourced! 
> 30+ contributors in 6 months 
> including people outside of Facebook
What problems does it solve? 
> We couldn’t visualize data in HDFS directly 
using dashboards or BI tools 
> because Hive is too slow (not interactive) 
> or ODBC connectivity is unavailable/unstable 
> We needed to store daily-batch results to an 
interactive DB for quick response 
(PostgreSQL, Redshift, etc.) 
> Interactive DB costs more and is less scalable 
> Some data are not stored in HDFS 
> We need to copy the data into HDFS to analyze it
[Diagram: batch analysis platform feeding a separate visualization platform] 
Batch analysis platform: HDFS, Hive (Daily/Hourly Batch) 
Visualization platform: PostgreSQL, etc. (Interactive query), Dashboard, Commercial BI Tools 
Problems with this setup: 
✓ Less scalable 
✓ Extra cost 
✓ Can’t query against “live” data directly 
✓ More work to manage 2 platforms
[Diagram: before and after Presto] 
Before: HDFS, Hive (Daily/Hourly Batch), PostgreSQL, etc. (Interactive query), Dashboard 
After: HDFS, Hive (Daily/Hourly Batch), Presto (Interactive query), Dashboard
[Diagram: data analysis platform with Presto] 
> HDFS, Hive (Daily/Hourly Batch) 
> Presto (Interactive query): SQL on any data sets 
> Data sources: HDFS, Cassandra, MySQL, Commercial DBs 
> Front ends: Dashboard, Commercial BI Tools 
✓ IBM Cognos 
✓ Tableau 
✓ ... 
Data analysis platform
Dashboard on Chartio: https://chartio.com/
What can Presto do? 
> Query interactively (in milliseconds to minutes) 
> MapReduce and Hive are still necessary for ETL 
> Query using commercial BI tools or dashboards 
> Reliable ODBC/JDBC connectivity 
> Query across multiple data sources such as 
Hive, HBase, Cassandra, or even commercial DBs 
> Plugin mechanism 
> Integrate batch analysis + visualization 
into a single data analysis platform
Presto’s deployment 
> Facebook 
> Multiple geographical regions 
> scaled to 1,000 nodes 
> actively used by 1,000+ employees 
> processing 1PB/day 
> Netflix, Dropbox, Treasure Data, Airbnb, 
Qubole, LINE, GREE, Scaleout, etc 
> Presto as a Service 
> Treasure Data, Qubole
Distributed architecture
[Diagram: Client, Coordinator, Connector Plugin, Workers, Storage / Metadata, Discovery Service] 
1. Client sends a query using HTTP 
2. Coordinator builds a query plan 
Connector plugin provides metadata (table schema, etc.) 
3. Coordinator sends tasks to workers 
4. Workers read data through connector plugin 
5. Workers run tasks in memory and in parallel 
6. Client gets the result from a worker
What are connectors? 
> Access to storage and metadata 
> provide table schema to coordinators 
> provide table rows to workers 
> Connectors are pluggable to Presto 
> written in Java 
> Implementations: 
> Hive connector 
> Cassandra connector 
> MySQL through JDBC connector (prerelease) 
> Or your own connector
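The two responsibilities of a connector (schema for the coordinator, rows for the workers) can be sketched as a small interface. Real Presto connectors are written in Java against Presto's plugin API; the Python class below and its method names table_schema and scan are hypothetical and only illustrate the division of labor described above.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple

Row = Tuple[object, ...]

@dataclass
class InMemoryConnector:
    """Hypothetical connector; names are illustrative, not Presto's Java API."""
    schemas: Dict[str, List[Tuple[str, str]]]  # table -> [(column, type)]
    rows: Dict[str, List[Row]]                 # table -> stored rows

    def table_schema(self, table: str) -> List[Tuple[str, str]]:
        # Used by the coordinator while it builds the query plan.
        return self.schemas[table]

    def scan(self, table: str, columns: List[str]) -> Iterator[Row]:
        # Used by workers to read only the requested columns of a table.
        names = [name for name, _ in self.schemas[table]]
        positions = [names.index(column) for column in columns]
        for row in self.rows[table]:
            yield tuple(row[i] for i in positions)

connector = InMemoryConnector(
    schemas={"impressions": [("name", "varchar"), ("time", "bigint")]},
    rows={"impressions": [("a", 1), ("b", 2), ("a", 3)]},
)
schema = connector.table_schema("impressions")
names_only = list(connector.scan("impressions", ["name"]))
```

Because the engine only talks to these two entry points, swapping the storage behind them does not change the planner or the workers.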
Hive connector 
[Diagram: Client, Coordinator with Hive Connector, Workers; storage is HDFS and Hive Metastore; Discovery Service: find servers in a cluster] 
Cassandra connector 
[Diagram: Client, Coordinator with Cassandra Connector, Workers; storage is Cassandra; Discovery Service: find servers in a cluster] 
Multiple connectors in a query 
[Diagram: Client, Coordinator with Hive Connector (HDFS / Metastore), Cassandra Connector (Cassandra) and other connectors (other data sources...), Workers; Discovery Service: find servers in a cluster]
Distributed architecture 
> 3 types of servers: 
> Coordinator, worker, discovery service 
> Get data/metadata through connector 
plugins. 
> Presto is NOT a database 
> Presto provides SQL to existing data stores 
> Client protocol is HTTP + JSON 
> Language bindings: 
Ruby, Python, PHP, Java (JDBC), R, Node.JS...
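Since the client protocol is plain HTTP + JSON, a client is mostly a loop that follows links. The sketch below is a minimal illustration, not an official client: the /v1/statement path, the X-Presto-User header and the nextUri, data and error fields are assumptions based on Presto's REST protocol and should be checked against the version in use. The transport callable and fake_transport are hypothetical stand-ins for a real HTTP library, so the example runs without a cluster.

```python
import json
from typing import Callable, Dict, Iterator, List, Optional

# transport(method, url, headers, body) returns the response body as JSON text.
Transport = Callable[[str, str, Dict[str, str], Optional[str]], str]

def run_query(transport: Transport, server: str, sql: str, user: str) -> Iterator[List]:
    """Submit SQL, then follow nextUri links until the server returns none."""
    headers = {"X-Presto-User": user}
    page = json.loads(transport("POST", server + "/v1/statement", headers, sql))
    while True:
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        yield from page.get("data", [])
        next_uri = page.get("nextUri")
        if next_uri is None:
            return
        page = json.loads(transport("GET", next_uri, headers, None))

def fake_transport(method, url, headers, body):
    # Canned responses: the first page has no rows yet, the second has the result.
    if method == "POST":
        return json.dumps({"nextUri": "http://coordinator/v1/statement/q1/1"})
    return json.dumps({"data": [["a", 2], ["b", 1]]})

rows = list(run_query(fake_transport, "http://coordinator",
                      "SELECT name, count(*) FROM impressions GROUP BY name", "demo"))
```

Injecting the transport keeps the protocol logic separate from the HTTP library, which is also why bindings exist for so many languages.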
Query Execution
Presto’s execution model 
> Presto is NOT MapReduce 
> Use its own execution engine 
> Presto’s query plan is based on DAG 
> more like Apache Tez / Spark or 
traditional MPP databases 
> Impala and Drill use a similar model
How does a query run? 
> Coordinator 
> SQL Parser 
> Query Planner 
> Execution planner 
> Workers 
> Task execution scheduler
[Diagram: query planning pipeline on the coordinator] 
SQL → SQL Parser → AST → Logical Planner → Logical Query Plan → Optimizer → Distributed Planner → Distributed Query Plan → Execution Planner → Execution Plan 
✓ table schema: Metadata, from Connector 
✓ node list: NodeManager, from Discovery Service 
Query Planner (today’s talk)
Query Planner 
SQL 
SELECT 
name, 
count(*) AS c 
FROM impressions 
GROUP BY name 
+ 
Table schema 
impressions ( 
name varchar 
time bigint 
) 
Logical query plan 
Table scan (name:varchar) → GROUP BY (name, count(*)) → Output (name, c) 
Distributed query plan 
Table scan → Partial aggr → Sink → Exchange → Final aggr → Sink → Exchange → Output
Query Planner - Stages 
Stage-2: Table scan → Partial aggr → Sink 
Stage-1: Exchange → Final aggr → Sink 
Stage-0: Exchange → Output 
> inter-worker data transfer between stages (Sink → Exchange) 
> pipelined aggregation inside a stage
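The split between partial and final aggregation can be sketched for the count(*) GROUP BY name query. This is a toy model, assuming two workers and in-memory lists in place of table scans; the function names and sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable, List

def partial_aggr(names: Iterable[str]) -> Counter:
    """Stage-2 on one worker: count rows per name for the local table scan."""
    return Counter(names)

def final_aggr(partials: List[Counter]) -> Counter:
    """Stage-1: merge the partial counts that arrive through the exchange."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

worker1_scan = ["a", "b", "a"]
worker2_scan = ["b", "c"]
result = final_aggr([partial_aggr(worker1_scan), partial_aggr(worker2_scan)])
counts = dict(result)
```

Only the small per-worker counters cross the network, not the scanned rows, which is why the plan aggregates before the exchange.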
Execution Planner 
Distributed query plan + Node list (✓ 2 workers) 
[Diagram: execution plan on Worker 1 and Worker 2] 
> Table scan → Partial aggr → Sink runs on both workers 
> Exchange → Final aggr → Sink runs on both workers 
> Exchange → Output runs once
Execution Planner - Tasks 
[Diagram: Worker 1 and Worker 2 each run a (Table scan → Partial aggr → Sink) task and an (Exchange → Final aggr → Sink) task; a single (Exchange → Output) task collects the result] 
> 1 task / worker / stage 
✓ All tasks in parallel
Execution Planner - Split 
[Diagram: the tasks on Worker 1 and Worker 2, divided into splits] 
> Table scan → Partial aggr → Sink: many splits / task = many threads / worker (table scan) 
> Exchange → Final aggr → Sink: 1 split / task = 1 thread / worker 
> Exchange → Output: 1 split / worker = 1 thread / worker
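The "many splits / task = many threads / worker" rule for table scans can be sketched with a thread pool. This is a simplified illustration, assuming a split is just a list of rows and that processing a split means counting them; scan_split, run_table_scan_task and the thread count are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

def scan_split(split: List[str]) -> int:
    """Process one split: here, simply count its rows."""
    return len(split)

def run_table_scan_task(splits: List[List[str]], threads: int = 4) -> int:
    """One table-scan task on one worker: its splits are spread over threads."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(scan_split, splits))

splits = [["a", "b"], ["a"], ["c", "c", "b"]]
rows_scanned = run_table_scan_task(splits)
```

The final aggregation and output tasks have a single split each, so they would run on one thread in this model.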
MapReduce vs. Presto 
MapReduce: map → disk → reduce → disk → map → disk → reduce 
> Wait between stages 
> Write data to disk 
Presto: task → task → task 
> All stages are pipelined 
✓ No wait time 
✓ No fault-tolerance 
> memory-to-memory data transfer 
✓ No disk IO 
✓ Data chunk must fit in memory
Query Execution 
> SQL is converted into stages, tasks and splits 
> All tasks run in parallel 
> No wait time between stages (pipelined) 
> If one task fails, all tasks fail at once (query fails) 
> Memory-to-memory data transfer 
> No disk IO 
> If aggregated data doesn’t fit in memory, 
query fails 
• Note: the query dies but the worker doesn’t. 
Memory consumption of all queries is fully managed
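Both properties, pipelining and "the query fails but the worker survives", can be sketched with generators. This is a toy model: rows stream from the scan into the aggregation without being written anywhere, and a hypothetical max_groups limit stands in for the per-query memory budget. QueryFailed, table_scan and aggregate are illustrative names, not Presto classes.

```python
from typing import Dict, Iterable, Iterator

class QueryFailed(Exception):
    """Raised for one query; the surrounding worker process keeps running."""

def table_scan(rows: Iterable[str]) -> Iterator[str]:
    for row in rows:
        yield row  # rows flow downstream one at a time, never to disk

def aggregate(rows: Iterator[str], max_groups: int) -> Dict[str, int]:
    counts: Dict[str, int] = {}
    for name in rows:
        if name not in counts and len(counts) >= max_groups:
            # Hypothetical limit standing in for the managed memory budget.
            raise QueryFailed("aggregation state does not fit in memory")
        counts[name] = counts.get(name, 0) + 1
    return counts

# A query whose aggregation state fits within the limit.
ok = aggregate(table_scan(["a", "b", "a"]), max_groups=2)

# A query with too many groups fails, and the "worker" carries on.
try:
    aggregate(table_scan(["a", "b", "c"]), max_groups=2)
    failed = False
except QueryFailed:
    failed = True
```

There is no retry in this model either: once the exception is raised the whole query is gone, which mirrors the lack of fault tolerance noted above.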
Why select Presto? 
> The ease of operations 
> Easy to deploy. Just drop a jar 
> Easy to extend its functionalities 
• Pluggable and DI based loose coupling 
> Doesn’t crash when a query fails 
> Standard SQL syntax 
> Important for existing DB/DWH users 
> HiveQL is for MapReduce, not MPP DB
Our customer use cases 
Online Ad 
> Hive: Scheduled reporting for customers 
> once every hour 
> Presto: Check ad-network performance 
> delivery logic optimization in realtime 
Web/Social 
> Hive: Scheduled reporting for management 
> Compute KPIs 
> Presto: Aggregation for user support 
> Measuring the effect of user campaigns 
Retail 
> Hive: Scheduled reporting for website, PoS and touch panel data 
> Hard deadlines! 
> Presto: Ad-hoc query for Basket Analysis 
> Aggregate data for the product development
Conclusion
Batch summary 
> MapReduce-based Hive is still the default choice 
> Stable & Lots of shared experience and knowledge 
> Hive with Tez is for Hadoop users 
> No code change needed 
> HDP includes Tez by default 
> Spark and Spark SQL are good alternatives 
> Can’t reuse Hadoop knowledge 
> Mainly for in-memory processing for now
Short batch summary 
> Presto is a good default choice 
> Easy to manage and has useful features 
> Need faster queries? Try Impala 
> for HDFS and HBase 
> CDH includes Impala by default 
> If you are a challenger, check out Drill 
> The project’s goal is ambitious 
> The status is developer preview
Stream summary 
> Fluentd and Norikra 
> Fluentd is for robust log collection 
> Norikra is for SQL based CEP 
> StreamSQL 
> for Spark users 
> Current status is POC
Lastly… 
> Use different engines for different requirements 
> Hadoop/Spark for batch jobs 
> MapReduce won't die for the time being 
> MPP query engine for interactive queries 
> These engines will be integrated into 
one system in the future 
> Batch engines now use DAG pipelines 
> Short batch engines will support task recovery 
The differences will be minimal
Enjoy SQL!
Cloud service for the entire data pipeline, 
including Presto 
Check: treasuredata.com

More Related Content

What's hot

Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...StreamNative
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLArseny Chernov
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumTathastu.ai
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataOfir Manor
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentationhadooparchbook
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopCloudera, Inc.
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkTaras Matyashovsky
 
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Databricks
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesOReillyStrata
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Guy Harrison
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkCloudera, Inc.
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopPrasanna Rajaperumal
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemShivaji Dutta
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...Chester Chen
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014hadooparchbook
 

What's hot (20)

Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in Hadoop
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
 
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...
 
SQOOP - RDBMS to Hadoop
SQOOP - RDBMS to HadoopSQOOP - RDBMS to Hadoop
SQOOP - RDBMS to Hadoop
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoop
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014
 

Viewers also liked

HiveServer2 for Apache Hive
HiveServer2 for Apache HiveHiveServer2 for Apache Hive
HiveServer2 for Apache HiveCarl Steinbach
 
INTEROP2013 ORC チームvyavyavya
INTEROP2013 ORC チームvyavyavyaINTEROP2013 ORC チームvyavyavya
INTEROP2013 ORC チームvyavyavyaupaa
 
Apache ranger meetup
Apache ranger meetupApache ranger meetup
Apache ranger meetupnvvrajesh
 
Couchbase Chennai Meetup: Developing with Couchbase- made easy
Couchbase Chennai Meetup:  Developing with Couchbase- made easyCouchbase Chennai Meetup:  Developing with Couchbase- made easy
Couchbase Chennai Meetup: Developing with Couchbase- made easyKarthik Babu Sekar
 
Gartner Predictions for Hadoop
Gartner Predictions for HadoopGartner Predictions for Hadoop
Gartner Predictions for HadoopBruno Aziza
 
The Challenges of SQL on Hadoop
The Challenges of SQL on HadoopThe Challenges of SQL on Hadoop
The Challenges of SQL on HadoopDataWorks Summit
 
Couchbase Singapore Meetup #2: Why Developing with Couchbase is easy !!
Couchbase Singapore Meetup #2:  Why Developing with Couchbase is easy !! Couchbase Singapore Meetup #2:  Why Developing with Couchbase is easy !!
Couchbase Singapore Meetup #2: Why Developing with Couchbase is easy !! Karthik Babu Sekar
 
Migration and Coexistence between Relational and NoSQL Databases by Manuel H...
 Migration and Coexistence between Relational and NoSQL Databases by Manuel H... Migration and Coexistence between Relational and NoSQL Databases by Manuel H...
Migration and Coexistence between Relational and NoSQL Databases by Manuel H...Big Data Spain
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialDaniel Abadi
 
Knowledge Representation in Artificial intelligence
Knowledge Representation in Artificial intelligence Knowledge Representation in Artificial intelligence
Knowledge Representation in Artificial intelligence Yasir Khan
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Hortonworks
 
Knowledge representation in AI
Knowledge representation in AIKnowledge representation in AI
Knowledge representation in AIVishal Singh
 
Knowledge representation and Predicate logic
Knowledge representation and Predicate logicKnowledge representation and Predicate logic
Knowledge representation and Predicate logicAmey Kerkar
 
Using Hadoop as a platform for Master Data Management
Using Hadoop as a platform for Master Data ManagementUsing Hadoop as a platform for Master Data Management
Using Hadoop as a platform for Master Data ManagementDataWorks Summit
 

Viewers also liked (15)

HiveServer2 for Apache Hive
HiveServer2 for Apache HiveHiveServer2 for Apache Hive
HiveServer2 for Apache Hive
 
INTEROP2013 ORC チームvyavyavya
INTEROP2013 ORC チームvyavyavyaINTEROP2013 ORC チームvyavyavya
INTEROP2013 ORC チームvyavyavya
 
Apache ranger meetup
Apache ranger meetupApache ranger meetup
Apache ranger meetup
 
Couchbase Chennai Meetup: Developing with Couchbase- made easy
Couchbase Chennai Meetup:  Developing with Couchbase- made easyCouchbase Chennai Meetup:  Developing with Couchbase- made easy
Couchbase Chennai Meetup: Developing with Couchbase- made easy
 
Gartner Predictions for Hadoop
Gartner Predictions for HadoopGartner Predictions for Hadoop
Gartner Predictions for Hadoop
 
The Challenges of SQL on Hadoop
The Challenges of SQL on HadoopThe Challenges of SQL on Hadoop
The Challenges of SQL on Hadoop
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Couchbase Singapore Meetup #2: Why Developing with Couchbase is easy !!
Couchbase Singapore Meetup #2:  Why Developing with Couchbase is easy !! Couchbase Singapore Meetup #2:  Why Developing with Couchbase is easy !!
Couchbase Singapore Meetup #2: Why Developing with Couchbase is easy !!
 
Migration and Coexistence between Relational and NoSQL Databases by Manuel H...
 Migration and Coexistence between Relational and NoSQL Databases by Manuel H... Migration and Coexistence between Relational and NoSQL Databases by Manuel H...
Migration and Coexistence between Relational and NoSQL Databases by Manuel H...
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 
Knowledge Representation in Artificial intelligence
Knowledge Representation in Artificial intelligence Knowledge Representation in Artificial intelligence
Knowledge Representation in Artificial intelligence
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
 
Knowledge representation in AI
Knowledge representation in AIKnowledge representation in AI
Knowledge representation in AI
 
Knowledge representation and Predicate logic
Knowledge representation and Predicate logicKnowledge representation and Predicate logic
Knowledge representation and Predicate logic
 
Using Hadoop as a platform for Master Data Management
Using Hadoop as a platform for Master Data ManagementUsing Hadoop as a platform for Master Data Management
Using Hadoop as a platform for Master Data Management
 

Similar to SQL on Hadoop in Taiwan

SQL for Everything at CWT2014
SQL for Everything at CWT2014SQL for Everything at CWT2014
SQL for Everything at CWT2014N Masahiro
 
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Sadayuki Furuhashi
 
Treasure Data and OSS
Treasure Data and OSSTreasure Data and OSS
Treasure Data and OSSN Masahiro
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Sadayuki Furuhashi
 
Fluentd - RubyKansai 65
Fluentd - RubyKansai 65Fluentd - RubyKansai 65
Fluentd - RubyKansai 65N Masahiro
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoSadayuki Furuhashi
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big pictureJ S Jodha
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Alluxio, Inc.
 
Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Frameworks Panel__HadoopSummit2010Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Frameworks Panel__HadoopSummit2010Yahoo Developer Network
 
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop EcosystemUnveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystemmashoodsyed66
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsAndrew Brust
 
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Lace Lofranco
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop User Group
 
AnalyticsConf2016 - Zaawansowana analityka na platformie Azure HDInsight
AnalyticsConf2016 - Zaawansowana analityka na platformie Azure HDInsightAnalyticsConf2016 - Zaawansowana analityka na platformie Azure HDInsight
AnalyticsConf2016 - Zaawansowana analityka na platformie Azure HDInsightŁukasz Grala
 
Modernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSModernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSStéphane Fréchette
 

SQL on Hadoop in Taiwan

  • 1. SQL on Hadoop a Perspective of a Cloud-based, Managed Service Provider Masahiro Nakagawa Sep 13, 2014 Hadoop Meetup in Taiwan
  • 2. Today’s agenda > Self introduction > Why SQL? > Hive > Presto > Conclusion
  • 3. Who are you? > Masahiro Nakagawa > github/twitter: @repeatedly > Treasure Data, Inc. > Senior Software Engineer > Fluentd / td-agent developer > I love OSS :) > D language - Phobos committer > Fluentd - Main maintainer > MessagePack / RPC- D and Python (only RPC) > The organizer of Presto Source Code Reading > etc…
  • 4. Do you love SQL?
  • 5. Why we love SQL? > Easy to understand what we are doing > declarative language > common interface for data manipulation > There are many users > SQL is not the best but better than uncommon interfaces
  • 6. We want to use SQL in the Hadoop world
  • 7. SQL Players on Hadoop (the original slide color-codes commercial products) > Batch: Hive, Spark SQL > Short Batch / Low latency: Presto, Impala, Drill, HAWQ, Actian, etc… > Stream: Norikra, StreamSQL > Latency scale from batch to stream: minutes - hours, seconds - minutes, immediate
  • 8. SQL Players on Hadoop (the original slide color-codes commercial products) > Batch: Hive, Spark SQL > Short Batch / Low latency: Presto, Impala, Drill, HAWQ, Actian, etc… > Stream: Norikra, StreamSQL > Red Ocean: the engines outside the stream group > Blue Ocean?: the stream engines
  • 9. 3 query engines on Treasure Data > Hive (batch) > for ETL and scheduled reporting > Presto (short batch / low latency) > for Ad hoc queries > Pig > Not SQL > There aren’t as many users… ;( Today’s talk
  • 11. What’s Hive > Needs no explanation ;) > Most popular project in the ecosystem > HiveQL and MapReduce > Writing MapReduce code is hard > Hive is growing rapidly by Stinger initiative > Vectorized Processing > Query optimization with statistics > Tez instead of MapReduce > etc…
  • 12. Apache Tez > Low level framework for YARN applications > Next generation query engine > Provide good IR for Hive, Pig and more > Task and DAG based pipelining > Spark uses a similar DAG model Input Processor Output Task DAG http://tez.apache.org/
  • 13. Hive on MR vs. Hive on Tez SELECT g1.x, g1.avg, g2.cnt FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1 JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2 ON (g1.x = g2.x) ORDER BY avg; [Diagram: on MapReduce, GROUP a BY a.x, GROUP b BY b.x, JOIN (a, b) and ORDER BY run as separate map/reduce jobs with HDFS writes between them; on Tez, the same steps run as one DAG of map and reduce tasks.] Can avoid unnecessary HDFS write. Source: http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9
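The query on slide 13 can be tried locally as a rough check of what it computes. This is only a sketch: SQLite stands in for Hive, the tables `a` and `b` and their rows are made-up sample data, and the query is written with `AVG` and a distinct `cnt` alias so that it is valid SQL.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (x TEXT, y REAL);
CREATE TABLE b (x TEXT, y INTEGER);
INSERT INTO a VALUES ('k1', 1.0), ('k1', 3.0), ('k2', 10.0);
INSERT INTO b VALUES ('k1', 7), ('k2', 8), ('k2', 9);
""")

# Two independent GROUP BYs, a JOIN on the grouping key, then an ORDER BY.
rows = conn.execute("""
SELECT g1.x, g1.avg, g2.cnt
FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
  ON (g1.x = g2.x)
ORDER BY avg
""").fetchall()

for row in rows:
    print(row)
```

Each of the four logical steps (two aggregations, the join, the sort) is what becomes a separate MapReduce job in the left half of the slide's diagram.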
  • 14. Why still use MapReduce? > The emphasis is on stability / reliability > Speed is important but not most important > Can use an MPP query engine for short batch > Tez/Spark are immature > Hard to manage in a multi-tenant env > Different failure models > We are now testing Tez for Hive • No code change needed for Hive. Spark is hard… • Disabling Tez is easy. Just remove ‘set hive.execution.engine=tez;’
  • 16. What’s Presto? A distributed SQL query engine for interactive data analysis against GBs to PBs of data.
  • 17. Presto’s history > 2012 Fall: Project started at Facebook > Designed for interactive queries with the speed of a commercial data warehouse > and scalability to the size of Facebook > 2013 Winter: Open sourced! > 30+ contributors in 6 months > including people outside of Facebook
  • 18. What problems does it solve? > We couldn’t visualize data in HDFS directly using dashboards or BI tools > because Hive is too slow (not interactive) > or ODBC connectivity is unavailable/unstable > We needed to store daily-batch results in an interactive DB for quick response (PostgreSQL, Redshift, etc.) > An interactive DB costs more and is less scalable > Some data are not stored in HDFS > We need to copy the data into HDFS to analyze it
  • 22. HDFS Hive PostgreSQL, etc. Daily/Hourly Batch Interactive query Dashboard Commercial BI Tools Batch analysis platform Visualization platform
  • 23. HDFS Daily/Hourly Batch Hive Interactive query PostgreSQL, etc. ✓ Less scalable ✓ Extra cost Dashboard Commercial BI Tools ✓ Can’t query against “live” data directly Batch analysis platform Visualization platform ✓ More work to manage 2 platforms
  • 24. HDFS Hive Dashboard Presto PostgreSQL, etc. Daily/Hourly Batch HDFS Hive Dashboard Daily/Hourly Batch Interactive query Interactive query
  • 25. Presto HDFS Hive Dashboard Daily/Hourly Batch Interactive query SQL on any data sets Cassandra MySQL Commercial DBs
  • 26. Presto HDFS Hive Dashboard Daily/Hourly Batch Interactive query SQL on any data sets Cassandra MySQL Commercial DBs Commercial BI Tools ✓ IBM Cognos ✓ Tableau ✓ ... Data analysis platform
  • 27. dashboard on chart.io: https://chartio.com/
  • 28. What can Presto do? > Query interactively (in milliseconds to minutes) > MapReduce and Hive are still necessary for ETL > Query using commercial BI tools or dashboards > Reliable ODBC/JDBC connectivity > Query across multiple data sources such as Hive, HBase, Cassandra, or even commercial DBs > Plugin mechanism > Integrate batch analysis + visualization into a single data analysis platform
  • 29. Presto’s deployment > Facebook > Multiple geographical regions > scaled to 1,000 nodes > actively used by 1,000+ employees > processing 1PB/day > Netflix, Dropbox, Treasure Data, Airbnb, Qubole, LINE, GREE, Scaleout, etc > Presto as a Service > Treasure Data, Qubole
  • 31. Client Coordinator Connector Plugin Worker Worker Worker Storage / Metadata Discovery Service
  • 32. Client Coordinator Connector Plugin Worker Worker Worker Storage / Metadata Discovery Service 1. Client sends a query using HTTP
  • 33. Client Coordinator Connector Plugin Worker Worker Worker Storage / Metadata Discovery Service 2. Coordinator builds a query plan Connector plugin provides metadata (table schema, etc.)
  • 34. Client Coordinator Connector Plugin Worker Worker Worker Storage / Metadata Discovery Service 3. Coordinator sends tasks to workers
  • 35. Client Coordinator Connector Plugin Worker Worker Worker Storage / Metadata Discovery Service 4. Workers read data through connector plugin
  • 36. Client Coordinator Connector Plugin Worker Worker Worker Storage / Metadata Discovery Service 5. Workers run tasks in memory and in parallel
  • 37. Coordinator Connector Plugin Worker Worker Worker Storage / Metadata Discovery Service Client 6. Client gets the result from a worker
  • 38. Client Coordinator Connector Plugin Worker Worker Worker Storage / Metadata Discovery Service
  • 39. What are Connectors? > Access to storage and metadata > provide table schema to coordinators > provide table rows to workers > Connectors are pluggable to Presto > written in Java > Implementations: > Hive connector > Cassandra connector > MySQL through JDBC connector (prerelease) > Or your own connector
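The two responsibilities listed on slide 39 (schema for the coordinator, rows for the workers) can be sketched as a small interface. Real connectors are written in Java against Presto's plugin interfaces; the Python class below, its method names and its sample table are all hypothetical and only illustrate the division of labor.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class Column:
    name: str
    type: str


class InMemoryConnector:
    """Hypothetical connector; names are illustrative, not Presto's Java SPI."""

    def __init__(self, tables: Dict[str, Tuple[List[Column], List[tuple]]]):
        self._tables = tables

    def get_table_schema(self, table: str) -> List[Column]:
        # Metadata side: used by the coordinator while planning a query.
        return self._tables[table][0]

    def get_splits(self, table: str, split_size: int) -> List[range]:
        # Divide the table into row ranges that workers can scan independently.
        n = len(self._tables[table][1])
        return [range(i, min(i + split_size, n)) for i in range(0, n, split_size)]

    def read_split(self, table: str, split: range) -> Iterator[tuple]:
        # Storage side: used by a worker to read the rows of one split.
        rows = self._tables[table][1]
        for i in split:
            yield rows[i]


connector = InMemoryConnector({
    "impressions": (
        [Column("name", "varchar"), Column("time", "bigint")],
        [("a", 1), ("b", 2), ("a", 3)],
    )
})
```

Because the engine only talks to this narrow interface, swapping HDFS for Cassandra or MySQL changes the connector and nothing else.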
  • 40. Hive connector Client Coordinator Hive Connector Worker Worker Worker HDFS, Hive Metastore Discovery Service find servers in a cluster
  • 41. Cassandra connector Client Coordinator Cassandra Connector Worker Worker Worker Cassandra Discovery Service find servers in a cluster
  • 42. Client Coordinator other connectors ... Worker Worker Worker Cassandra Discovery Service find servers in a cluster Hive Connector HDFS / Metastore Multiple connectors in a query Cassandra Connector Other data sources...
  • 43. Distributed architecture > 3 types of servers: > Coordinator, worker, discovery service > Get data/metadata through connector plugins. > Presto is NOT a database > Presto provides SQL to existing data stores > Client protocol is HTTP + JSON > Language bindings: Ruby, Python, PHP, Java (JDBC), R, Node.JS...
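Since the client protocol is HTTP + JSON, a client can be sketched with the standard library alone. The sketch below assumes the commonly documented shape of Presto's REST protocol: the statement is POSTed to `/v1/statement` with an `X-Presto-User` header, and the client follows `nextUri` until it is absent. Field names and headers may differ between versions, so treat them as assumptions to verify against the server in use. The `transport` parameter is a hypothetical hook so the loop can be exercised without a cluster.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterator, Optional

# transport(method, url, headers, body) -> decoded JSON response
Transport = Callable[[str, str, Dict[str, str], Optional[bytes]], dict]


def http_transport(method, url, headers, body):
    # Default transport: a plain HTTP request that returns the parsed JSON body.
    req = urllib.request.Request(url, data=body, headers=headers, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_query(coordinator_url: str, sql: str, user: str,
              transport: Transport = http_transport) -> Iterator[list]:
    headers = {"X-Presto-User": user}
    # Submit the SQL text to the coordinator.
    result = transport("POST", coordinator_url + "/v1/statement",
                       headers, sql.encode("utf-8"))
    while True:
        if "error" in result:
            raise RuntimeError(result["error"].get("message", "query failed"))
        # Each response may carry one batch of result rows.
        for row in result.get("data", []):
            yield row
        next_uri = result.get("nextUri")
        if next_uri is None:
            return  # no more pages: the query has finished
        result = transport("GET", next_uri, headers, None)
```

Against a real cluster the call would look like `for row in run_query("http://coordinator:8080", "SELECT 1", "alice"): ...`, where the host, port and user are placeholders.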
  • 45. Presto’s execution model > Presto is NOT MapReduce > Use its own execution engine > Presto’s query plan is based on DAG > more like Apache Tez / Spark or traditional MPP databases > Impala and Drill use a similar model
  • 46. How does a query run? > Coordinator > SQL Parser > Query Planner > Execution planner > Workers > Task execution scheduler
  • 47. SQL SQL Parser AST Logical Planner Metadata Distributed Planner Logical Query Plan Optimizer Execution Planner Discovery Server Connector Distributed Query Plan Execution Plan NodeManager ✓ node list ✓ table schema
  • 48. SQL SQL Parser SQL Metadata Distributed Planner Logical Query Plan Optimizer Execution Planner Discovery Service Connector Query Plan Execution Plan NodeManager ✓ node list ✓ table schema (today’s talk) Query Planner
  • 49. Query Planner > SQL: SELECT name, count(*) AS c FROM impressions GROUP BY name > Table schema: impressions (name varchar, time bigint) > Logical query plan (top to bottom): Output (name, c) / GROUP BY (name, count(*)) / Table scan (name:varchar) > Distributed query plan (top to bottom): Output / Exchange / Sink / Final aggr / Exchange / Sink / Partial aggr / Table scan
  • 50. Query Planner - Stages Output Exchange Sink Final aggr Exchange Sink Partial aggr Table scan inter-worker data transfer Stage-0 pipelined aggregation inter-worker data transfer Stage-1 Stage-2
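The split of the GROUP BY into a partial and a final aggregation, with an exchange between them, can be sketched for the `impressions` query from slide 49. The function names, the two sample splits and the hash-based routing are assumptions made for illustration, not Presto internals.

```python
from collections import Counter
from typing import Dict, Iterable, List


def partial_aggregate(rows: Iterable[tuple]) -> Counter:
    # Table scan + partial aggr: count rows per name within one split.
    counts = Counter()
    for name, _time in rows:
        counts[name] += 1
    return counts


def partition_by_key(partials: List[Counter], n_workers: int) -> List[List[tuple]]:
    # Sink + exchange: route each (name, count) pair to the worker owning that key.
    buckets: List[List[tuple]] = [[] for _ in range(n_workers)]
    for partial in partials:
        for name, count in partial.items():
            buckets[hash(name) % n_workers].append((name, count))
    return buckets


def final_aggregate(pairs: Iterable[tuple]) -> Dict[str, int]:
    # Final aggr: sum the partial counts for the keys this worker owns.
    totals = Counter()
    for name, count in pairs:
        totals[name] += count
    return dict(totals)


# Two splits of the impressions table: (name, time) rows.
splits = [[("a", 1), ("b", 2)], [("a", 3), ("a", 4)]]
partials = [partial_aggregate(s) for s in splits]

# Output stage: collect the final results from every worker.
output: Dict[str, int] = {}
for bucket in partition_by_key(partials, n_workers=2):
    output.update(final_aggregate(bucket))
```

Only the small per-split counts cross the network in the exchange, which is why the planner inserts the partial aggregation before it.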
  • 51. Output Exchange Sink Partial aggr Table scan Sink Partial aggr Table scan Execution Planner +Node list ✓ 2 workers Sink Final aggr Exchange Sink Final aggr Exchange Output Exchange Sink Final aggr Exchange Sink Partial aggr Table scan Worker 1 Worker 2
  • 52. Execution Planner - Tasks Worker 1 Worker 2 Sink Final aggr Exchange Sink Partial aggr Table scan Sink Final aggr Exchange Sink Partial aggr Table scan Task 1 task / worker / stage ✓ All tasks in parallel Output Exchange
  • 53. Execution Planner - Split Sink Final aggr Exchange Sink Partial aggr Table scan Sink Final aggr Exchange Sink Partial aggr Table scan Output Exchange Split 1 split / task = 1 thread / worker many splits / task = many threads / worker (table scan) Worker 1 Worker 2 1 split / worker = 1 thread / worker
  • 54. MapReduce vs. Presto > MapReduce: map and reduce tasks write data to disk and wait between stages > Presto: all stages are pipelined ✓ No wait time ✓ No fault-tolerance > Presto: memory-to-memory data transfer between tasks ✓ No disk IO ✓ Data chunk must fit in memory
  • 55. Query Execution > SQL is converted into stages, tasks and splits > All tasks run in parallel > No wait time between stages (pipelined) > If one task fails, all tasks fail at once (the query fails) > Memory-to-memory data transfer > No disk IO > If aggregated data doesn’t fit in memory, the query fails • Note: the query dies but the worker doesn’t. Memory consumption of all queries is fully managed
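The two properties above, pipelining without intermediate storage and a query that fails when its state exceeds memory, can be mimicked with Python generators. This is a toy model only: `QueryFailed`, `max_groups` and the row values are invented for the example, and the memory limit is approximated by a cap on the number of groups.

```python
class QueryFailed(Exception):
    """Raised when a query exceeds its memory budget."""

def table_scan(rows):
    """Yield rows one at a time; nothing is written to disk."""
    for row in rows:
        yield row

def aggregate(rows, max_groups):
    """Count rows per name while they stream in from the scan."""
    counts = {}
    for name in rows:
        counts[name] = counts.get(name, 0) + 1
        if len(counts) > max_groups:
            raise QueryFailed("aggregation state does not fit in memory")
    yield from counts.items()

def run_query(rows, max_groups):
    """Run scan and aggregation as one pipeline; return (result, error)."""
    try:
        return dict(aggregate(table_scan(rows), max_groups)), None
    except QueryFailed as exc:
        # The query dies, but the process (the "worker") keeps running.
        return None, str(exc)

ok_result, ok_error = run_query(["a", "b", "a"], max_groups=10)
bad_result, bad_error = run_query(["a", "b", "c"], max_groups=2)
```

The second call fails as a whole with no partial result, while the process stays up and can serve the next query.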
  • 56. Why choose Presto? > Ease of operations > Easy to deploy. Just drop a jar > Easy to extend • Pluggable, with DI-based loose coupling > Doesn’t crash when a query fails > Standard SQL syntax > Important for existing DB/DWH users > HiveQL is for MapReduce, not for MPP DBs
  • 57. Our customer use cases (Hive vs. Presto) > Online Ad > Hive: scheduled reporting for customers, once every hour > Presto: check ad-network performance; delivery logic optimization in real time > Web/Social > Hive: scheduled reporting for management; compute KPIs > Presto: aggregation for user support; measuring the effect of user campaigns > Retail > Hive: scheduled reporting for website, PoS and touch panel data; hard deadlines! > Presto: ad-hoc queries for basket analysis; aggregate data for product development
  • 59. Batch summary > MapReduce-based Hive is still the default choice > Stable, with lots of shared experience and knowledge > Hive with Tez is for Hadoop users > No code change needed > HDP includes Tez by default > Spark and Spark SQL are a good alternative > Can’t reuse Hadoop knowledge > Mainly for in-memory processing for now
  • 60. Short batch summary > Presto is a good default choice > Easy to manage, with useful features > Need faster queries? Try Impala > for HDFS and HBase > CDH includes Impala by default > If you are a challenger, check out Drill > The project’s goal is ambitious > The status is developer preview
  • 61. Stream summary > Fluentd and Norikra > Fluentd is for robust log collection > Norikra is for SQL-based CEP > StreamSQL > for Spark users > Current status is PoC
  • 62. Lastly… > Use different engines for different requirements > Hadoop/Spark for batch jobs > MapReduce won't die for the time being > MPP query engine for interactive queries > These engines will be integrated into one system in the future > Batch now uses DAG pipelines > Short Batch will support task recovery > The differences will be minimal
  • 64. Cloud service for the entire data pipeline, including Presto > Check: treasuredata.com