SlideShare a Scribd company logo
1 of 55
Download to read offline
MARK RITTMAN, COO, RITTMAN MEAD BUDAPEST DATA FORUM, JUNE 2016
SQL-ON-HADOOP FOR ANALYTICS + BI: 

WHAT ARE MY OPTIONS, WHAT’S THE FUTURE?
info@rittmanmead.com www.rittmanmead.com @rittmanmead 2
•Many customers and organisations are now running initiatives around “big data”

•Some are IT-led and are looking for cost-savings around data warehouse storage + ETL

•Others are “skunkworks” projects in the marketing department that are now scaling-up

•Projects now emerging from pilot exercises

•And design patterns starting to emerge
Many Organisations are Running Big Data Initiatives
info@rittmanmead.com www.rittmanmead.com @rittmanmead
Highly Scalable (and Affordable) Cluster Computing
•Enterprise High-End RDBMSs such as Oracle can scale into the petabytes, using clustering

‣Sharded databases (e.g. Netezza) can scale further but with complexity / single workload trade-offs

•Hadoop was designed from outside for massive horizontal scalability - using cheap hardware

•Anticipates hardware failure and makes multiple copies of data as protection

•More nodes you add, more stable it becomes

•And at a fraction of the cost of traditional

RDBMS platforms
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•We can now affordably create a single, massive archive of all our corporate data

•Separate to our OLTP and operational BI tools

•Leave it stored at the individual transaction level

•And then run multiple compute frameworks on it

‣SQL queries through Hive, Impala etc

‣Data processing using Spark, MR 

‣Graph Analysis, etc
One Platform, Multiple Processing Frameworks
Hadoop Data Lake
Webserver

Log Files (txt)
Social Media

Logs (JSON)
DB Archives

(CSV)
Sensor Data

(XML)
`Spatial & Graph

(XML, txt)
IoT Logs

(JSON, txt)
Chat Transcripts

(Txt)
DB Transactions

(CSV, XML)
Blogs, Articles

(TXT, HTML)
Raw Data Processed Data
NoSQL Key-Value

Store DB Tabular Data

(Hive Tables)
Aggregates

(Impala Tables) NoSQL Document 

Store DB
info@rittmanmead.com www.rittmanmead.com @rittmanmead 5
•Typical implementation of Hadoop and big data in an analytic context is the “data lake”

•Additional data storage platform with cheap storage, flexible schema support + compute

•Data lands in the data lake or reservoir in raw form, then minimally processed

•Data then accessed directly by “data scientists”, or processed further into DW
In the Context of BI & Analytics : The Data Reservoir
Data	Transfer Data	Access
Data	Factory
Data	Reservoir
Business	
Intelligence	Tools
Hadoop	Platform
File	Based	
Integration
Stream	
Based	
Integration
Data	streams
Discovery	&	Development	Labs
Safe	&	secure	Discovery	and	Development	
environment
Data	sets	and	
samples
Models	and	
programs
Marketing	/
Sales	Applications
Models
Machine
Learning
Segments
Operational	Data
Transactions
Customer
Master	ata
Unstructured	Data
Voice	+	Chat	
Transcripts
ETL	Based
Integration
Raw	
Customer	Data
Data	stored	in	
the	original	
format	(usually	
files)		such	as	
SS7,	ASN.1,	
JSON	etc.
Mapped	
Customer	Data
Data	sets	
produced	by	
mapping	and	
transforming	
raw	data
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Programming model for processing large data sets in parallel on a cluster

•Not specific to a particular language, but usually written in Java

•Inspired by the map and reduce functions commonly used in functional programming

‣Map() performs filtering and sorting

‣Reduce() aggregates the output of mappers

‣and a Shuffle() step to redistribute output by keys

•Resolved several complications of distributed computing:

‣Allows unlimited computations on unlimited data

‣Map and reduce functions can be easily distributed

‣Originated at Google; Hadoop was Yahoo’s open-source

implementation of MapReduce, + two are synonymous
MapReduce - The Original Big Data Query Framework
Mapper
Filter, Project
Mapper
Filter, Project
Mapper
Filter, Project
Reducer
Aggregate
Reducer
Aggregate
Output

One HDFS file per reducer,

in a directory
info@rittmanmead.com www.rittmanmead.com @rittmanmead 7
•Original developed at Facebook, now foundational within the Hadoop project

•Allows users to query Hadoop data using SQL-like language

•Tabular metadata layer that overlays files, can interpret semi-structured data (e.g. JSON)

•Generates MapReduce code to return required data

•Extensible through SerDes and Storage Handlers

•JDBC and ODBC drivers for most platforms/tools

•Perfect for set-based access + batch ETL work
Apache Hive : SQL Metadata + Engine over Hadoop
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Data integration tools could now load and process Hadoop data

•BI tools could treat Hadoop as just another data source

•Generally use MapReduce and Hive to access data

‣ODBC and JDBC access to Hive tabular data

‣Allows Hadoop unstructured/semi-structured

data on HDFS to be accessed like RDBMS
Hive Provides a SQL Interface for BI + ETL Tools
Access direct Hive or extract using ODI12c
for structured OBIEE dashboard analysis
What pages are people visiting?
Who is referring to us on Twitter?
What content has the most reach?
Hive is slow
MapReduce is for losers
But the future is fast
info@rittmanmead.com www.rittmanmead.com @rittmanmead 16
Hadoop 2.0 Processing Frameworks + Tools
info@rittmanmead.com www.rittmanmead.com @rittmanmead 17
•MapReduce 2 (MR2) splits the functionality of the JobTracker

by separating resource management and job scheduling/monitoring

•Introduces YARN (Yet Another Resource Manager)

•Permits other processing frameworks to MR

‣For example, Apache Spark

•Maintains backwards compatibility with MR1

•Introduced with CDH5+
MapReduce 2 and YARN
Node

Manager
Node

Manager
Node

Manager
Resource

Manager
Client
Client
info@rittmanmead.com www.rittmanmead.com @rittmanmead 18
•Runs on top of YARN, provides a faster execution engine than MapReduce for Hive, Pig etc

•Models processing as an entire data flow graph (DAG), rather than separate job steps

‣DAG (Directed Acyclic Graph) is a new programming style for distributed systems

‣Dataflow steps pass data between them as streams, rather than writing/reading from disk

•Supports in-memory computation, enables Hive on Tez (Stinger) and Pig on Tez

•Favoured In-memory / Hive v2 

route by Hortonworks
Apache Tez
InputData
TEZ DAG
Map()
Map()
Map()
Reduce()
OutputData
Reduce()
Reduce()
Reduce()
InputData
Map()
Map()
Reduce()
Reduce()
info@rittmanmead.com www.rittmanmead.com @rittmanmead 19
Tez Advantage - Drop-In Replacement for MR with Hive, Pig
set hive.execution.engine=mr
set hive.execution.engine=tez
4m 17s
2m 25s
info@rittmanmead.com www.rittmanmead.com @rittmanmead 20
•Cloudera’s answer to Hive query response time issues

•MPP SQL query engine running on Hadoop, bypasses MapReduce for
direct data access

•Mostly in-memory, but spills to disk if required

•Uses Hive metastore to access Hive table metadata

•Similar SQL dialect to Hive - not as rich though and no support for Hive
SerDes, storage handlers etc
Cloudera Impala - Fast, MPP-style Access to Hadoop Data
info@rittmanmead.com www.rittmanmead.com @rittmanmead 21
•Log into Impala Shell, run INVALIDATE METADATA command to refresh Impala table list

•Run SHOW TABLES Impala SQL command to view tables available

•Run COUNT(*) on main ACCESS_PER_POST table to see typical response time
Enabling Hive Tables for Impala
[oracle@bigdatalite ~]$ impala-shell
Starting Impala Shell without Kerberos authentication
[bigdatalite.localdomain:21000] > invalidate metadata;
Query: invalidate metadata
Fetched 0 row(s) in 2.18s
[bigdatalite.localdomain:21000] > show tables;
Query: show tables
+-----------------------------------+
| name |
+-----------------------------------+
| access_per_post |
| access_per_post_cat_author |
| … |
| posts |
|——————————————————————————————————-+
Fetched 45 row(s) in 0.15s
[bigdatalite.localdomain:21000] > select count(*) 

from access_per_post;
Query: select count(*) from access_per_post
+----------+
| count(*) |
+----------+
| 343 |
+----------+
Fetched 1 row(s) in 2.76s
info@rittmanmead.com www.rittmanmead.com @rittmanmead 22
•Beginners usually store data in HDFS using text file formats (CSV) but these have limitations

•Apache AVRO often used for general-purpose processing

‣Splitability, schema evolution, in-built metadata, support for block compression

•Parquet now commonly used with Impala due to column-orientated storage

‣Mirrors work in RDBMS world around column-store

‣Only return (project) the columns you require across a wide table
Apache Parquet - Column-Orientated Storage for Analytics
Hive (and HDFS) is Append-Only
info@rittmanmead.com www.rittmanmead.com @rittmanmead 24
•Main DML operation in HiveQL is SELECT statement

HiveQL Limited to SELECT and Bulk Load
•CREATE TABLE test (

product_id int,

product_desc string);

•SHOW TABLES;

•CREATE TABLE test2

AS SELECT * FROM test;
•SELECT SUM(sales)

FROM sales_summary;

•LOAD DATA INPATH ‘/user/mrittman/logs’ INTO TABLE log_entries;
What if you need to UPDATE,

INSERT or DELETE?
For example … 

SCD2 Dimension Tables?
info@rittmanmead.com www.rittmanmead.com @rittmanmead 27
•Extract data from source into flat files

•Load files into Apache HBase tables, with Apache Hive table metadata over them

‣Loading into HBase permits full CRUD activity, rather than append-only loading

•Query using Apache Hive and BI/ETL tool; update existing rows by LOAD DATA into Hive
Standard Updatable-Hive Table Solution - Hive-on-Base
info@rittmanmead.com www.rittmanmead.com @rittmanmead 28
•Based on BigTable paper from Google, 2006, Dean et al.

‣“Bigtable is a sparse, distributed, persistent multi-dimensional sorted map.”Key Features:

‣Distributed storage across cluster of machines – Random, online read and write data
access

‣Schemaless data model (“NoSQL”)

‣Self-managed data partitions

•Alternative to HDFS as Hive storage layer

‣Allows us to do update and delete

activity rather than just Hive append-only

‣Very fast for incremental loading

‣Can define Hive tables over HBase ones,

allowing OBIEE to then access them
What is HBase?
info@rittmanmead.com www.rittmanmead.com @rittmanmead 29
•HBase Shell CLI allows you to create HBase tables

•GET and PUT commands can then be used to add/update cells, query cells etc
Creating HBase Tables using HBase Shell
hbase shell
create 'carriers','details'
create 'geog_origin','origin'
create 'geog_dest','dest'
create 'flight_delays','dims','measures'
put 'geog_dest','LAX','dest:airport_name','Los Angeles, CA: Los Angeles'
put 'geog_dest','LAX','dest:city','Los Angeles, CA'
put 'geog_dest','LAX','dest:state','California'
put 'geog_dest','LAX','dest:id','12892'
hbase(main):015:0> scan 'geog_dest'
ROW COLUMN+CELL

LAX column=dest:airport_name, timestamp=1432067861347, value=Los Angeles, CA: Los Angeles

LAX column=dest:city, timestamp=1432067861375, value=Los Angeles,CA

LAX column=dest:id, timestamp=1432067862018,value=12892

LAX column=dest:state, timestamp=1432067861404,value=California

1 row(s) in 0.0240 seconds
info@rittmanmead.com www.rittmanmead.com @rittmanmead 30
•Create Hive tables over the HBase ones to provide SQL load/query capabilities

‣Uses HBaseStorageHandler Storage Handler for HBAse

‣HBase columns mapped to Hive columns using SERDEPROPERTIES
Create Hive Table Metadata over HBase Tables
CREATE EXTERNAL TABLE hbase_flight_delays
(key string,
year string,
carrier string,
orig string,
dest string,
flights string,
late string,
cancelled string,
distance string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key,dims:year,dims:carrier,dims:orig,dims:dest,

measures:flights,measures:late,measures:cancelled,measures:distance")
TBLPROPERTIES ("hbase.table.name" = "test1_flight_delays");
info@rittmanmead.com www.rittmanmead.com @rittmanmead 31
•Hive-on-HBase tables have the opposite query characteristics to Hive-on-HDFS tables

‣For small tables, query response-time is fast (HBase cell-level random access)

‣But response time increases (beyond regular HDFS-stored Hive tables) when lots of

data needs to be retrieved and aggregated
But … (and there’s always a “But…”)
hive> select * from hbase_geog_dest where dest_city = 'San Francisco, CA';
OK
JCC San Francisco, CA: China Basin Heliport San Francisco, CA California 12457
SFO San Francisco, CA: San Francisco International San Francisco, CA California 14771
Time taken: 0.171 seconds, Fetched: 2 row(s)
hive> select sum(cast(f.flights as bigint)) as flight_count, o.origin_airport_name from flight_delays f
> join hbase_geog_origin o on f.orig = o.key
> and o.origin_state = 'California'
> group by o.origin_airport_name;
...
OK
17638Arcata/Eureka, CA: Arcata
9146Bakersfield, CA: Meadows Field
125433Burbank, CA: Bob Hope
...
1653Santa Maria, CA: Santa Maria Public/Capt. G. Allan Hancock Field
Time taken: 51.757 seconds, Fetched: 27 row(s)
info@rittmanmead.com www.rittmanmead.com @rittmanmead 32
•Either use Impala engine to query the Hive-on-HBase tables as they are

‣Makes overall SELECT operations faster

•Or CREATE TABLE … AS SELECT snapshot the Hive-on-HBase tables to parquet files

‣Takes advantage of Parquet column-storage optimization
Partial Solution - Snapshot to Impala + Parquet
[oracle@bigdatalite ~]$ impala-shell
Starting Impala Shell without Kerberos authentication
[bigdatalite.localdomain:21000] > invalidate metadata;
Query: invalidate metadata
Fetched 0 row(s) in 2.18s
[bigdatalite.localdomain:21000] > show tables;
Query: show tables
+-----------------------------------+
| name |
+-----------------------------------+
| access_per_post |
| access_per_post_cat_author |
| … |
| posts |
|——————————————————————————————————-+
Fetched 45 row(s) in 0.15s
[bigdatalite.localdomain:21000] > select count(*) 

from access_per_post;
Query: select count(*) from access_per_post
+----------+
| count(*) |
+----------+
| 343 |
+----------+
Fetched 1 row(s) in 2.76s
info@rittmanmead.com www.rittmanmead.com @rittmanmead 33
•But Parquet (and HDFS) have significant limitation for real-time analytics applications

‣Append-only orientation, focus on column-store 

makes streaming ingestion harder

•Cloudera Kudu aims to combine best of HDFS + HBase

‣Real-time analytics-optimised 

‣Supports updates to data

‣Fast ingestion of data

‣Accessed using SQL-style tables

and get/put/update/delete API
Cloudera Kudu - Combining Best of HBase and Column-Store
info@rittmanmead.com www.rittmanmead.com @rittmanmead 34
•Kudu storage used with Impala - create tables using Kudu storage handler

•Can now UPDATE, DELETE and INSERT into Hadoop tables, not just SELECT and LOAD DATA
Example Impala DDL + DML Commands with Kudu
CREATE TABLE `my_first_table` (
`id` BIGINT,
`name` STRING
)
TBLPROPERTIES(
'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
'kudu.table_name' = 'my_first_table',
'kudu.master_addresses' = 'kudu-master.example.com:7051',
'kudu.key_columns' = 'id'
);
INSERT INTO my_first_table VALUES (99, "sarah");
INSERT IGNORE INTO my_first_table VALUES (99, "sarah");
UPDATE my_first_table SET name="bob" where id = 3;
DELETE FROM my_first_table WHERE id < 3;
DELETE c FROM my_second_table c, stock_symbols s WHERE c.name = s.symbol;
and there’s more
info@rittmanmead.com www.rittmanmead.com @rittmanmead 36
•Many Hadoop tools now use in-memory
processing (R, Python, Spark etc)

•But they all work to their own standards

•Considerable overhead in serialising /
deserialising data between tools

•Apache Arrow standardised how in-memory
data is held

•Considerably reduced overhead and latency
between tools
Apache Arrow - Standardising In-Memory Structures
info@rittmanmead.com www.rittmanmead.com @rittmanmead 37
Apache Arrow : Common In-Memory Layer for Hadoop
info@rittmanmead.com www.rittmanmead.com @rittmanmead 38
•Similar concept to Oracle BI Server

‣Logical star scheme for business model

‣Maps to federated data sources

‣Integrates and models data for query

‣Can run alongside Kylin MOLAP server

•Part of wider next-gen Hadoop BI stack

‣Apache Zepplin web-based notebook

‣Spark/MR/Tez execution engines

‣Hive and Drill for ETL/ad-hoc SQL

•Slowly the BI stack gets built out...
Apache Lens : Logical Dimensional Model for Hadoop
For data processing,
In Oracle we have PL/SQL
In SQL Server we have TSQL
In Hadoop we have ... Spark
info@rittmanmead.com www.rittmanmead.com @rittmanmead 43
•Another DAG execution engine running on YARN

•More mature than TEZ, with richer API and more vendor support

•Uses concept of an RDD (Resilient Distributed Dataset)

‣RDDs like tables or Pig relations, but can be cached in-memory

‣Great for in-memory transformations, or iterative/cyclic processes

•Spark jobs comprise of a DAG of tasks operating on RDDs

•Access through Scala, Python or Java APIs

•Related projects include

‣Spark SQL

‣Spark Streaming
Apache Spark
info@rittmanmead.com www.rittmanmead.com @rittmanmead 44
•Native support for multiple languages 

with identical APIs

‣Python - prototyping, data wrangling

‣Scala - functional programming features

‣Java - lower-level, application integration

•Use of closures, iterations, and other 

common language constructs to minimize code

•Integrated support for distributed +

functional programming

•Unified API for batch and streaming
Rich Developer Support + Wide Developer Ecosystem
scala> val logfile = sc.textFile("logs/access_log")
14/05/12 21:18:59 INFO MemoryStore: ensureFreeSpace(77353) 

called with curMem=234759, maxMem=309225062
14/05/12 21:18:59 INFO MemoryStore: Block broadcast_2 

stored as values to memory (estimated size 75.5 KB, free 294.6 MB)
logfile: org.apache.spark.rdd.RDD[String] = 

MappedRDD[31] at textFile at <console>:15
scala> logfile.count()
14/05/12 21:19:06 INFO FileInputFormat: Total input paths to process : 1
14/05/12 21:19:06 INFO SparkContext: Starting job: count at <console>:1
...
14/05/12 21:19:06 INFO SparkContext: Job finished: 

count at <console>:18, took 0.192536694 s
res7: Long = 154563
scala> val logfile = sc.textFile("logs/access_log").cache
scala> val biapps11g = logfile.filter(line => line.contains("/biapps11g/"))
biapps11g: org.apache.spark.rdd.RDD[String] = FilteredRDD[34] at filter at <console>:17
scala> biapps11g.count()
...
14/05/12 21:28:28 INFO SparkContext: Job finished: count at <console>:20, took 0.387960876 s
res9: Long = 403
info@rittmanmead.com www.rittmanmead.com @rittmanmead 45
•Spark SQL, and Data Frames, allow RDDs in Spark to be processed using SQL queries

•Bring in and federate additional data from JDBC sources

•Load, read and save data in Hive, Parquet and other structured tabular formats
Spark SQL - SQL within Apache Spark
val accessLogsFilteredDF = accessLogs
.filter( r => ! r.agent.matches(".*(spider|robot|bot|slurp).*"))
.filter( r => ! r.endpoint.matches(".*(wp-content|wp-admin).*")).toDF()
.registerTempTable("accessLogsFiltered")
val topTenPostsLast24Hour = sqlContext.sql("SELECT p.POST_TITLE, p.POST_AUTHOR, COUNT(*) 

as total 

FROM accessLogsFiltered a 

JOIN posts p ON a.endpoint = p.POST_SLUG 

GROUP BY p.POST_TITLE, p.POST_AUTHOR 

ORDER BY total DESC LIMIT 10 ")
// Persist top ten table for this window to HDFS as parquet file
topTenPostsLast24Hour.save("/user/oracle/rm_logs_batch_output/topTenPostsLast24Hour.parquet"

, "parquet", SaveMode.Overwrite)
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Apache Drill is another SQL-on-Hadoop project that focus on schema-free data discovery

•Inspired by Google Dremel, innovation is querying raw data with schema optional

•Automatically infers and detects schema from semi-structured datasets and NoSQL DBs

•Join across different silos of data e.g. JSON records, Hive tables and HBase database

•Aimed at different use-cases than Hive - 

low-latency queries, discovery 

(think Endeca vs OBIEE)
Apache Drill - Leverage Embedded Metadata in Files
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Most modern datasource formats embed their schema in the data (“schema-on-read”)

•Apache Drill makes these as easy to join to traditional datasets as “point me at the data”

•Cuts out unnecessary work in defining Hive schemas for data that’s self-describing

•Supports joining across files,

databases, NoSQL etc
Self-Describing Data - Parquet, AVRO, JSON etc
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Files can exist either on the local filesystem, or on HDFS

•Connection to directory or file defined in storage configuration

•Can work with CSV, TXT, TSV etc

•First row of file can provide schema (column names)
Apache Drill and Text Files
SELECT * FROM dfs.`/tmp/csv_with_header.csv2`;
+-------+------+------+------+
| name | num1 | num2 | num3 |
+-------+------+------+------+
| hello | 1 | 2 | 3 |
| hello | 1 | 2 | 3 |
| hello | 1 | 2 | 3 |
| hello | 1 | 2 | 3 |
| hello | 1 | 2 | 3 |
| hello | 1 | 2 | 3 |
| hello | 1 | 2 | 3 |
+-------+------+------+------+
7 rows selected (0.12 seconds)
SELECT * FROM dfs.`/tmp/csv_no_header.csv`;
+------------------------+
| columns |
+------------------------+
| ["hello","1","2","3"] |
| ["hello","1","2","3"] |
| ["hello","1","2","3"] |
| ["hello","1","2","3"] |
| ["hello","1","2","3"] |
| ["hello","1","2","3"] |
| ["hello","1","2","3"] |
+------------------------+
7 rows selected (0.112 seconds)
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•JSON (Javascript Object Notation) documents are often
used for data interchange

•Exports from Twitter and other consumer services

•Web service responses and other B2B interfaces

•A more lightweight form of XML that is “self-describing”

•Handles evolving schemas, and optional attributes

•Drill treats each document as a row, and has features to

•Flatten nested data (extract elements from arrays)

•Generate key/value pairs for loosely structured data
Apache Drill and JSON Documents
use dfs.iot;
show files;
select in_reply_to_user_id, text from `all_tweets.json`
limit 5;
+---------------------+------+
| in_reply_to_user_id | text |
+---------------------+------+
| null | BI Forum 2013 in Brighton has now sold-out |
| null | "Football has become a numbers game |
| null | Just bought Lyndsay Wise’s Book |
| null | An Oracle BI "Blast from the Past" |
| 14716125 | Dilbert on Agile Programming |
+---------------------+------+
5 rows selected (0.229 seconds)
select name, flatten(fillings) as f 

from dfs.users.`/donuts.json` 

where f.cal < 300;
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Drill can connect to Hive to make use of metastore (incl. multiple Hive metastores)

•NoSQL databases (HBase etc)

•Parquet files (native storage format - columnar + self describing)
Apache Drill and Hive, HBase, Parquet Sources etc
USE hbase;
SELECT * FROM students;
+-------------+-----------------------+-----------------------------------------------------+
| row_key | account | address |
+-------------+-----------------------+------------------------------------------------------+
| [B@e6d9eb7 | {"name":"QWxpY2U="} | {"state":"Q0E=","street":"MTIzIEJhbGxtZXIgQXY="} |
| [B@2823a2b4 | {"name":"Qm9i"} | {"state":"Q0E=","street":"MSBJbmZpbml0ZSBMb29w"} |
| [B@3b8eec02 | {"name":"RnJhbms="} | {"state":"Q0E=","street":"NDM1IFdhbGtlciBDdA=="} |
| [B@242895da | {"name":"TWFyeQ=="} | {"state":"Q0E=","street":"NTYgU291dGhlcm4gUGt3eQ=="} |
+-------------+-----------------------+----------------------------------------------------------------------+
SELECT firstname,lastname FROM 

hiveremote.`customers` limit 10;`

+------------+------------+
| firstname | lastname |
+------------+------------+
| Essie | Vaill |
| Cruz | Roudabush |
| Billie | Tinnes |
| Zackary | Mockus |
| Rosemarie | Fifield |
| Bernard | Laboy |
| Marianne | Earman |
+------------+------------+
SELECT * FROM dfs.`iot_demo/geodata/region.parquet`;
+--------------+--------------+-----------------------+
| R_REGIONKEY | R_NAME | R_COMMENT |
+--------------+--------------+-----------------------+
| 0 | AFRICA | lar deposits. blithe |
| 1 | AMERICA | hs use ironic, even |
| 2 | ASIA | ges. thinly even pin |
| 3 | EUROPE | ly final courts cajo |
| 4 | MIDDLE EAST | uickly special accou |
+--------------+--------------+-----------------------+
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Drill developed for real-time, ad-hoc data exploration with schema discovery on-the-fly
•Individual analysts exploring new datasets, leveraging corporate metadata/data to help

•Hive is more about large-scale, centrally curated set-based big data access

•Drill models conceptually as JSON, vs. Hive’s tabular approach

•Drill introspects schema from whatever it connects to, vs. formal modeling in Hive
Apache Drill vs. Apache Hive
Interactive Queries

(Data Discovery, Tableau/VA)
Reporting Queries

(Canned Reports, OBIEE)
ETL

(ODI, Scripting, Informatica)
Apache Drill Apache Hive
Interactive Queries
100ms - 3mins
Reporting Queries
3mins - 20mins
ETL & Batch Queries
20mins - hours
www.rittmanmead.com
MARK RITTMAN, COO, RITTMAN MEAD BUDAPEST DATA FORUM, JUNE 2016
SQL-ON-HADOOP FOR ANALYTICS + BI: 

WHAT ARE MY OPTIONS, WHAT’S THE FUTURE?

More Related Content

What's hot

OTN EMEA Tour 2016 : Deploying Full BI Platforms to Oracle Cloud
OTN EMEA Tour 2016 : Deploying Full BI Platforms to Oracle CloudOTN EMEA Tour 2016 : Deploying Full BI Platforms to Oracle Cloud
OTN EMEA Tour 2016 : Deploying Full BI Platforms to Oracle CloudMark Rittman
 
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...Mark Rittman
 
Riga dev day 2016 adding a data reservoir and oracle bdd to extend your ora...
Riga dev day 2016   adding a data reservoir and oracle bdd to extend your ora...Riga dev day 2016   adding a data reservoir and oracle bdd to extend your ora...
Riga dev day 2016 adding a data reservoir and oracle bdd to extend your ora...Mark Rittman
 
The Future of Analytics, Data Integration and BI on Big Data Platforms
The Future of Analytics, Data Integration and BI on Big Data PlatformsThe Future of Analytics, Data Integration and BI on Big Data Platforms
The Future of Analytics, Data Integration and BI on Big Data PlatformsMark Rittman
 
Unlock the value in your big data reservoir using oracle big data discovery a...
Unlock the value in your big data reservoir using oracle big data discovery a...Unlock the value in your big data reservoir using oracle big data discovery a...
Unlock the value in your big data reservoir using oracle big data discovery a...Mark Rittman
 
How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017Rittman Analytics
 
Big Data Computing Architecture
Big Data Computing ArchitectureBig Data Computing Architecture
Big Data Computing ArchitectureGang Tao
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...Rittman Analytics
 
Spark - Migration Story
Spark - Migration Story Spark - Migration Story
Spark - Migration Story Roman Chukh
 
Deploying Full BI Platforms to Oracle Cloud
Deploying Full BI Platforms to Oracle CloudDeploying Full BI Platforms to Oracle Cloud
Deploying Full BI Platforms to Oracle CloudMark Rittman
 
Oracle Big Data Spatial & Graph 
Social Media Analysis - Case Study
Oracle Big Data Spatial & Graph 
Social Media Analysis - Case StudyOracle Big Data Spatial & Graph 
Social Media Analysis - Case Study
Oracle Big Data Spatial & Graph 
Social Media Analysis - Case StudyMark Rittman
 
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...Mark Rittman
 
Lambda architecture for real time big data
Lambda architecture for real time big dataLambda architecture for real time big data
Lambda architecture for real time big dataTrieu Nguyen
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWSGary Stafford
 
SQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalSQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalMichael Rainey
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data PipelineJesus Rodriguez
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseData Con LA
 
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...StampedeCon
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseDataWorks Summit
 

What's hot (20)

OTN EMEA Tour 2016 : Deploying Full BI Platforms to Oracle Cloud
OTN EMEA Tour 2016 : Deploying Full BI Platforms to Oracle CloudOTN EMEA Tour 2016 : Deploying Full BI Platforms to Oracle Cloud
OTN EMEA Tour 2016 : Deploying Full BI Platforms to Oracle Cloud
 
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...
 
Riga dev day 2016 adding a data reservoir and oracle bdd to extend your ora...
Riga dev day 2016   adding a data reservoir and oracle bdd to extend your ora...Riga dev day 2016   adding a data reservoir and oracle bdd to extend your ora...
Riga dev day 2016 adding a data reservoir and oracle bdd to extend your ora...
 
The Future of Analytics, Data Integration and BI on Big Data Platforms
The Future of Analytics, Data Integration and BI on Big Data PlatformsThe Future of Analytics, Data Integration and BI on Big Data Platforms
The Future of Analytics, Data Integration and BI on Big Data Platforms
 
Unlock the value in your big data reservoir using oracle big data discovery a...
Unlock the value in your big data reservoir using oracle big data discovery a...Unlock the value in your big data reservoir using oracle big data discovery a...
Unlock the value in your big data reservoir using oracle big data discovery a...
 
How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017
 
Big Data Computing Architecture
Big Data Computing ArchitectureBig Data Computing Architecture
Big Data Computing Architecture
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
 
Spark - Migration Story
Spark - Migration Story Spark - Migration Story
Spark - Migration Story
 
Deploying Full BI Platforms to Oracle Cloud
Deploying Full BI Platforms to Oracle CloudDeploying Full BI Platforms to Oracle Cloud
Deploying Full BI Platforms to Oracle Cloud
 
Oracle Big Data Spatial & Graph 
Social Media Analysis - Case Study
Oracle Big Data Spatial & Graph 
Social Media Analysis - Case StudyOracle Big Data Spatial & Graph 
Social Media Analysis - Case Study
Oracle Big Data Spatial & Graph 
Social Media Analysis - Case Study
 
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...
 
Lambda architecture for real time big data
Lambda architecture for real time big dataLambda architecture for real time big data
Lambda architecture for real time big data
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
 
SQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalSQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle Professional
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data Pipeline
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
 
ASPgems - kappa architecture
ASPgems - kappa architectureASPgems - kappa architecture
ASPgems - kappa architecture
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
 

Similar to SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?

IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...Mark Rittman
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloudgluent.
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointInside Analysis
 
Next Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon ThomasNext Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon ThomasThoughtworks
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonDremio Corporation
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 
Big Data in the Real World
Big Data in the Real WorldBig Data in the Real World
Big Data in the Real WorldMark Kromer
 
Trafodion overview
Trafodion overviewTrafodion overview
Trafodion overviewRohit Jain
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewAbhishek Roy
 
Hive big-data meetup
Hive big-data meetupHive big-data meetup
Hive big-data meetupRemus Rusanu
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Hitachi Data Systems Hadoop Solution
Hitachi Data Systems Hadoop SolutionHitachi Data Systems Hadoop Solution
Hitachi Data Systems Hadoop SolutionHitachi Vantara
 
Philly Code Camp 2013 Mark Kromer Big Data with SQL Server
Philly Code Camp 2013 Mark Kromer Big Data with SQL ServerPhilly Code Camp 2013 Mark Kromer Big Data with SQL Server
Philly Code Camp 2013 Mark Kromer Big Data with SQL ServerMark Kromer
 
Introduction To Big Data & Hadoop
Introduction To Big Data & HadoopIntroduction To Big Data & Hadoop
Introduction To Big Data & HadoopBlackvard
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Andrew Brust
 

Similar to SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future? (20)

IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter Point
 
Next Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon ThomasNext Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon Thomas
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Big Data in the Real World
Big Data in the Real WorldBig Data in the Real World
Big Data in the Real World
 
Trafodion overview
Trafodion overviewTrafodion overview
Trafodion overview
 
Apache drill
Apache drillApache drill
Apache drill
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
Hive big-data meetup
Hive big-data meetupHive big-data meetup
Hive big-data meetup
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Hitachi Data Systems Hadoop Solution
Hitachi Data Systems Hadoop SolutionHitachi Data Systems Hadoop Solution
Hitachi Data Systems Hadoop Solution
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Philly Code Camp 2013 Mark Kromer Big Data with SQL Server
Philly Code Camp 2013 Mark Kromer Big Data with SQL ServerPhilly Code Camp 2013 Mark Kromer Big Data with SQL Server
Philly Code Camp 2013 Mark Kromer Big Data with SQL Server
 
Introduction To Big Data & Hadoop
Introduction To Big Data & HadoopIntroduction To Big Data & Hadoop
Introduction To Big Data & Hadoop
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
 

More from Mark Rittman

IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...Mark Rittman
 
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...Mark Rittman
 
What is Big Data Discovery, and how it complements traditional business anal...
What is Big Data Discovery, and how it complements  traditional business anal...What is Big Data Discovery, and how it complements  traditional business anal...
What is Big Data Discovery, and how it complements traditional business anal...Mark Rittman
 
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015Mark Rittman
 
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...Mark Rittman
 
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...Mark Rittman
 
OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015
OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015
OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015Mark Rittman
 
BIWA2015 - Bringing Oracle Big Data SQL to OBIEE and ODI
BIWA2015 - Bringing Oracle Big Data SQL to OBIEE and ODIBIWA2015 - Bringing Oracle Big Data SQL to OBIEE and ODI
BIWA2015 - Bringing Oracle Big Data SQL to OBIEE and ODIMark Rittman
 
OGH 2015 - Hadoop (Oracle BDA) and Oracle Technologies on BI Projects
OGH 2015 - Hadoop (Oracle BDA) and Oracle Technologies on BI ProjectsOGH 2015 - Hadoop (Oracle BDA) and Oracle Technologies on BI Projects
OGH 2015 - Hadoop (Oracle BDA) and Oracle Technologies on BI ProjectsMark Rittman
 
UKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12c
UKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12cUKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12c
UKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12cMark Rittman
 
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...Mark Rittman
 
Part 4 - Hadoop Data Output and Reporting using OBIEE11g
Part 4 - Hadoop Data Output and Reporting using OBIEE11gPart 4 - Hadoop Data Output and Reporting using OBIEE11g
Part 4 - Hadoop Data Output and Reporting using OBIEE11gMark Rittman
 
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12c
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12cPart 2 - Hadoop Data Loading using Hadoop Tools and ODI12c
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12cMark Rittman
 

More from Mark Rittman (13)

IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
 
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...
 
What is Big Data Discovery, and how it complements traditional business anal...
What is Big Data Discovery, and how it complements  traditional business anal...What is Big Data Discovery, and how it complements  traditional business anal...
What is Big Data Discovery, and how it complements traditional business anal...
 
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015
 
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...
 
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
 
OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015
OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015
OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015
 
BIWA2015 - Bringing Oracle Big Data SQL to OBIEE and ODI
BIWA2015 - Bringing Oracle Big Data SQL to OBIEE and ODIBIWA2015 - Bringing Oracle Big Data SQL to OBIEE and ODI
BIWA2015 - Bringing Oracle Big Data SQL to OBIEE and ODI
 
OGH 2015 - Hadoop (Oracle BDA) and Oracle Technologies on BI Projects
OGH 2015 - Hadoop (Oracle BDA) and Oracle Technologies on BI ProjectsOGH 2015 - Hadoop (Oracle BDA) and Oracle Technologies on BI Projects
OGH 2015 - Hadoop (Oracle BDA) and Oracle Technologies on BI Projects
 
UKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12c
UKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12cUKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12c
UKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12c
 
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
 
Part 4 - Hadoop Data Output and Reporting using OBIEE11g
Part 4 - Hadoop Data Output and Reporting using OBIEE11gPart 4 - Hadoop Data Output and Reporting using OBIEE11g
Part 4 - Hadoop Data Output and Reporting using OBIEE11g
 
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12c
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12cPart 2 - Hadoop Data Loading using Hadoop Tools and ODI12c
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12c
 

Recently uploaded

Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...amitlee9823
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Pooja Nehwal
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...only4webmaster01
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsJoseMangaJr1
 

Recently uploaded (20)

Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 

SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?

  • 1. MARK RITTMAN, COO, RITTMAN MEAD BUDAPEST DATA FORUM, JUNE 2016 SQL-ON-HADOOP FOR ANALYTICS + BI: 
 WHAT ARE MY OPTIONS, WHAT’S THE FUTURE?
  • 2. info@rittmanmead.com www.rittmanmead.com @rittmanmead 2 •Many customers and organisations are now running initiatives around “big data” •Some are IT-led and are looking for cost-savings around data warehouse storage + ETL •Others are “skunkworks” projects in the marketing department that are now scaling-up •Projects now emerging from pilot exercises •And design patterns starting to emerge Many Organisations are Running Big Data Initiatives
  • 3. info@rittmanmead.com www.rittmanmead.com @rittmanmead Highly Scalable (and Affordable) Cluster Computing •Enterprise High-End RDBMSs such as Oracle can scale into the petabytes, using clustering ‣Sharded databases (e.g. Netezza) can scale further but with complexity / single workload trade-offs •Hadoop was designed from outside for massive horizontal scalability - using cheap hardware •Anticipates hardware failure and makes multiple copies of data as protection •More nodes you add, more stable it becomes •And at a fraction of the cost of traditional
 RDBMS platforms
  • 4. info@rittmanmead.com www.rittmanmead.com @rittmanmead •We can now affordably create a single, massive archive of all our corporate data •Separate to our OLTP and operational BI tools •Leave it stored at the individual transaction level •And then run multiple compute frameworks on it ‣SQL queries through Hive, Impala etc ‣Data processing using Spark, MR ‣Graph Analysis, etc One Platform, Multiple Processing Frameworks Hadoop Data Lake Webserver
 Log Files (txt) Social Media
 Logs (JSON) DB Archives
 (CSV) Sensor Data
 (XML) `Spatial & Graph
 (XML, txt) IoT Logs
 (JSON, txt) Chat Transcripts
 (Txt) DB Transactions
 (CSV, XML) Blogs, Articles
 (TXT, HTML) Raw Data Processed Data NoSQL Key-Value
 Store DB Tabular Data
 (Hive Tables) Aggregates
 (Impala Tables) NoSQL Document 
 Store DB
  • 5. info@rittmanmead.com www.rittmanmead.com @rittmanmead 5 •Typical implementation of Hadoop and big data in an analytic context is the “data lake” •Additional data storage platform with cheap storage, flexible schema support + compute •Data lands in the data lake or reservoir in raw form, then minimally processed •Data then accessed directly by “data scientists”, or processed further into DW In the Context of BI & Analytics : The Data Reservoir Data Transfer Data Access Data Factory Data Reservoir Business Intelligence Tools Hadoop Platform File Based Integration Stream Based Integration Data streams Discovery & Development Labs Safe & secure Discovery and Development environment Data sets and samples Models and programs Marketing / Sales Applications Models Machine Learning Segments Operational Data Transactions Customer Master ata Unstructured Data Voice + Chat Transcripts ETL Based Integration Raw Customer Data Data stored in the original format (usually files) such as SS7, ASN.1, JSON etc. Mapped Customer Data Data sets produced by mapping and transforming raw data
  • 6. info@rittmanmead.com www.rittmanmead.com @rittmanmead •Programming model for processing large data sets in parallel on a cluster •Not specific to a particular language, but usually written in Java •Inspired by the map and reduce functions commonly used in functional programming ‣Map() performs filtering and sorting ‣Reduce() aggregates the output of mappers ‣and a Shuffle() step to redistribute output by keys •Resolved several complications of distributed computing: ‣Allows unlimited computations on unlimited data ‣Map and reduce functions can be easily distributed ‣Originated at Google; Hadoop was Yahoo’s open-source
 implementation of MapReduce, + two are synonymous MapReduce - The Original Big Data Query Framework Mapper Filter, Project Mapper Filter, Project Mapper Filter, Project Reducer Aggregate Reducer Aggregate Output
 One HDFS file per reducer,
 in a directory
  • 7. info@rittmanmead.com www.rittmanmead.com @rittmanmead 7 •Original developed at Facebook, now foundational within the Hadoop project •Allows users to query Hadoop data using SQL-like language •Tabular metadata layer that overlays files, can interpret semi-structured data (e.g. JSON) •Generates MapReduce code to return required data •Extensible through SerDes and Storage Handlers •JDBC and ODBC drivers for most platforms/tools •Perfect for set-based access + batch ETL work Apache Hive : SQL Metadata + Engine over Hadoop
  • 8. info@rittmanmead.com www.rittmanmead.com @rittmanmead •Data integration tools could now load and process Hadoop data •BI tools could treat Hadoop as just another data source •Generally use MapReduce and Hive to access data ‣ODBC and JDBC access to Hive tabular data ‣Allows Hadoop unstructured/semi-structured
 data on HDFS to be accessed like RDBMS Hive Provides a SQL Interface for BI + ETL Tools Access direct Hive or extract using ODI12c for structured OBIEE dashboard analysis What pages are people visiting? Who is referring to us on Twitter? What content has the most reach?
  • 9.
  • 10.
  • 13.
  • 14. But the future is fast
  • 15.
  • 16. info@rittmanmead.com www.rittmanmead.com @rittmanmead 16 Hadoop 2.0 Processing Frameworks + Tools
  • 17. info@rittmanmead.com www.rittmanmead.com @rittmanmead 17 •MapReduce 2 (MR2) splits the functionality of the JobTracker
 by separating resource management and job scheduling/monitoring •Introduces YARN (Yet Another Resource Manager) •Permits other processing frameworks to MR ‣For example, Apache Spark •Maintains backwards compatibility with MR1 •Introduced with CDH5+ MapReduce 2 and YARN Node
 Manager Node
 Manager Node
 Manager Resource
 Manager Client Client
  • 18. info@rittmanmead.com www.rittmanmead.com @rittmanmead 18 •Runs on top of YARN, provides a faster execution engine than MapReduce for Hive, Pig etc •Models processing as an entire data flow graph (DAG), rather than separate job steps ‣DAG (Directed Acyclic Graph) is a new programming style for distributed systems ‣Dataflow steps pass data between them as streams, rather than writing/reading from disk •Supports in-memory computation, enables Hive on Tez (Stinger) and Pig on Tez •Favoured In-memory / Hive v2 
 route by Hortonworks Apache Tez InputData TEZ DAG Map() Map() Map() Reduce() OutputData Reduce() Reduce() Reduce() InputData Map() Map() Reduce() Reduce()
  • 19. info@rittmanmead.com www.rittmanmead.com @rittmanmead 19 Tez Advantage - Drop-In Replacement for MR with Hive, Pig set hive.execution.engine=mr set hive.execution.engine=tez 4m 17s 2m 25s
  • 20. info@rittmanmead.com www.rittmanmead.com @rittmanmead 20 •Cloudera’s answer to Hive query response time issues •MPP SQL query engine running on Hadoop, bypasses MapReduce for direct data access •Mostly in-memory, but spills to disk if required •Uses Hive metastore to access Hive table metadata •Similar SQL dialect to Hive - not as rich though and no support for Hive SerDes, storage handlers etc Cloudera Impala - Fast, MPP-style Access to Hadoop Data
  • 21. info@rittmanmead.com www.rittmanmead.com @rittmanmead 21 •Log into Impala Shell, run INVALIDATE METADATA command to refresh Impala table list •Run SHOW TABLES Impala SQL command to view tables available •Run COUNT(*) on main ACCESS_PER_POST table to see typical response time Enabling Hive Tables for Impala [oracle@bigdatalite ~]$ impala-shell Starting Impala Shell without Kerberos authentication [bigdatalite.localdomain:21000] > invalidate metadata; Query: invalidate metadata Fetched 0 row(s) in 2.18s [bigdatalite.localdomain:21000] > show tables; Query: show tables +-----------------------------------+ | name | +-----------------------------------+ | access_per_post | | access_per_post_cat_author | | … | | posts | |——————————————————————————————————-+ Fetched 45 row(s) in 0.15s [bigdatalite.localdomain:21000] > select count(*) 
 from access_per_post; Query: select count(*) from access_per_post +----------+ | count(*) | +----------+ | 343 | +----------+ Fetched 1 row(s) in 2.76s
  • 22. info@rittmanmead.com www.rittmanmead.com @rittmanmead 22 •Beginners usually store data in HDFS using text file formats (CSV) but these have limitations •Apache AVRO often used for general-purpose processing ‣Splitability, schema evolution, in-built metadata, support for block compression •Parquet now commonly used with Impala due to column-orientated storage ‣Mirrors work in RDBMS world around column-store ‣Only return (project) the columns you require across a wide table Apache Parquet - Column-Orientated Storage for Analytics
  • 23. Hive (and HDFS) is Append-Only
  • 24. info@rittmanmead.com www.rittmanmead.com @rittmanmead 24 •Main DML operation in HiveQL is SELECT statement HiveQL Limited to SELECT and Bulk Load •CREATE TABLE test (
 product_id int,
 product_desc string);
 •SHOW TABLES;
 •CREATE TABLE test2
 AS SELECT * FROM test; •SELECT SUM(sales)
 FROM sales_summary;
 •LOAD DATA INPATH ‘/user/mrittman/logs’ INTO TABLE log_entries;
  • 25. What if you need to UPDATE,
 INSERT or DELETE?
  • 26. For example … 
 SCD2 Dimension Tables?
  • 27. info@rittmanmead.com www.rittmanmead.com @rittmanmead 27 •Extract data from source into flat files •Load files into Apache HBase tables, with Apache Hive table metadata over them ‣Loading into HBase permits full CRUD activity, rather than append-only loading •Query using Apache Hive and BI/ETL tool; update existing rows by LOAD DATA into Hive Standard Updatable-Hive Table Solution - Hive-on-Base
  • 28. info@rittmanmead.com www.rittmanmead.com @rittmanmead 28 •Based on BigTable paper from Google, 2006, Dean et al. ‣“Bigtable is a sparse, distributed, persistent multi-dimensional sorted map.”Key Features: ‣Distributed storage across cluster of machines – Random, online read and write data access ‣Schemaless data model (“NoSQL”) ‣Self-managed data partitions •Alternative to HDFS as Hive storage layer ‣Allows us to do update and delete
 activity rather than just Hive append-only ‣Very fast for incremental loading ‣Can define Hive tables over HBase ones,
 allowing OBIEE to then access them What is HBase?
  • 29. info@rittmanmead.com www.rittmanmead.com @rittmanmead 29 •HBase Shell CLI allows you to create HBase tables •GET and PUT commands can then be used to add/update cells, query cells etc Creating HBase Tables using HBase Shell hbase shell create 'carriers','details' create 'geog_origin','origin' create 'geog_dest','dest' create 'flight_delays','dims','measures' put 'geog_dest','LAX','dest:airport_name','Los Angeles, CA: Los Angeles' put 'geog_dest','LAX','dest:city','Los Angeles, CA' put 'geog_dest','LAX','dest:state','California' put 'geog_dest','LAX','dest:id','12892' hbase(main):015:0> scan 'geog_dest' ROW COLUMN+CELL
 LAX column=dest:airport_name, timestamp=1432067861347, value=Los Angeles, CA: Los Angeles
 LAX column=dest:city, timestamp=1432067861375, value=Los Angeles,CA
 LAX column=dest:id, timestamp=1432067862018,value=12892
 LAX column=dest:state, timestamp=1432067861404,value=California
 1 row(s) in 0.0240 seconds
  • 30. info@rittmanmead.com www.rittmanmead.com @rittmanmead 30 •Create Hive tables over the HBase ones to provide SQL load/query capabilities ‣Uses HBaseStorageHandler Storage Handler for HBAse ‣HBase columns mapped to Hive columns using SERDEPROPERTIES Create Hive Table Metadata over HBase Tables CREATE EXTERNAL TABLE hbase_flight_delays (key string, year string, carrier string, orig string, dest string, flights string, late string, cancelled string, distance string ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,dims:year,dims:carrier,dims:orig,dims:dest,
 measures:flights,measures:late,measures:cancelled,measures:distance") TBLPROPERTIES ("hbase.table.name" = "test1_flight_delays");
  • 31. info@rittmanmead.com www.rittmanmead.com @rittmanmead 31 •Hive-on-HBase tables have the opposite query characteristics to Hive-on-HDFS tables ‣For small tables, query response-time is fast (HBase cell-level random access) ‣But response time increases (beyond regular HDFS-stored Hive tables) when lots of
 data needs to be retrieved and aggregated But … (and there’s always a “But…”) hive> select * from hbase_geog_dest where dest_city = 'San Francisco, CA'; OK JCC San Francisco, CA: China Basin Heliport San Francisco, CA California 12457 SFO San Francisco, CA: San Francisco International San Francisco, CA California 14771 Time taken: 0.171 seconds, Fetched: 2 row(s) hive> select sum(cast(f.flights as bigint)) as flight_count, o.origin_airport_name from flight_delays f > join hbase_geog_origin o on f.orig = o.key > and o.origin_state = 'California' > group by o.origin_airport_name; ... OK 17638Arcata/Eureka, CA: Arcata 9146Bakersfield, CA: Meadows Field 125433Burbank, CA: Bob Hope ... 1653Santa Maria, CA: Santa Maria Public/Capt. G. Allan Hancock Field Time taken: 51.757 seconds, Fetched: 27 row(s)
  • 32. info@rittmanmead.com www.rittmanmead.com @rittmanmead 32 •Either use Impala engine to query the Hive-on-HBase tables as they are ‣Makes overall SELECT operations faster •Or CREATE TABLE … AS SELECT snapshot the Hive-on-HBase tables to parquet files ‣Takes advantage of Parquet column-storage optimization Partial Solution - Snapshot to Impala + Parquet [oracle@bigdatalite ~]$ impala-shell Starting Impala Shell without Kerberos authentication [bigdatalite.localdomain:21000] > invalidate metadata; Query: invalidate metadata Fetched 0 row(s) in 2.18s [bigdatalite.localdomain:21000] > show tables; Query: show tables +-----------------------------------+ | name | +-----------------------------------+ | access_per_post | | access_per_post_cat_author | | … | | posts | |——————————————————————————————————-+ Fetched 45 row(s) in 0.15s [bigdatalite.localdomain:21000] > select count(*) 
 from access_per_post; Query: select count(*) from access_per_post +----------+ | count(*) | +----------+ | 343 | +----------+ Fetched 1 row(s) in 2.76s
  • 33. info@rittmanmead.com www.rittmanmead.com @rittmanmead 33 •But Parquet (and HDFS) have significant limitation for real-time analytics applications ‣Append-only orientation, focus on column-store 
 makes streaming ingestion harder •Cloudera Kudu aims to combine best of HDFS + HBase ‣Real-time analytics-optimised ‣Supports updates to data ‣Fast ingestion of data ‣Accessed using SQL-style tables
 and get/put/update/delete API Cloudera Kudu - Combining Best of HBase and Column-Store
  • 34. info@rittmanmead.com www.rittmanmead.com @rittmanmead 34 •Kudu storage used with Impala - create tables using Kudu storage handler •Can now UPDATE, DELETE and INSERT into Hadoop tables, not just SELECT and LOAD DATA Example Impala DDL + DML Commands with Kudu CREATE TABLE `my_first_table` ( `id` BIGINT, `name` STRING ) TBLPROPERTIES( 'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler', 'kudu.table_name' = 'my_first_table', 'kudu.master_addresses' = 'kudu-master.example.com:7051', 'kudu.key_columns' = 'id' ); INSERT INTO my_first_table VALUES (99, "sarah"); INSERT IGNORE INTO my_first_table VALUES (99, "sarah"); UPDATE my_first_table SET name="bob" where id = 3; DELETE FROM my_first_table WHERE id < 3; DELETE c FROM my_second_table c, stock_symbols s WHERE c.name = s.symbol;
  • 36. info@rittmanmead.com www.rittmanmead.com @rittmanmead 36 •Many Hadoop tools now use in-memory processing (R, Python, Spark etc) •But they all work to their own standards •Considerable overhead in serialising / deserialising data between tools •Apache Arrow standardised how in-memory data is held •Considerably reduced overhead and latency between tools Apache Arrow - Standardising In-Memory Structures
  • 37. info@rittmanmead.com www.rittmanmead.com @rittmanmead 37 Apache Arrow : Common In-Memory Layer for Hadoop
  • 38. info@rittmanmead.com www.rittmanmead.com @rittmanmead 38 •Similar concept to Oracle BI Server ‣Logical star scheme for business model ‣Maps to federated data sources ‣Integrates and models data for query ‣Can run alongside Kylin MOLAP server •Part of wider next-gen Hadoop BI stack ‣Apache Zepplin web-based notebook ‣Spark/MR/Tez execution engines ‣Hive and Drill for ETL/ad-hoc SQL •Slowly the BI stack gets built out... Apache Lens : Logical Dimensional Model for Hadoop
  • 39. For data processing, In Oracle we have PL/SQL
  • 40. In SQL Server we have TSQL
  • 41. In Hadoop we have ... Spark
  • 42.
  • 43. info@rittmanmead.com www.rittmanmead.com @rittmanmead 43 •Another DAG execution engine running on YARN •More mature than TEZ, with richer API and more vendor support •Uses concept of an RDD (Resilient Distributed Dataset) ‣RDDs like tables or Pig relations, but can be cached in-memory ‣Great for in-memory transformations, or iterative/cyclic processes •Spark jobs comprise of a DAG of tasks operating on RDDs •Access through Scala, Python or Java APIs •Related projects include ‣Spark SQL ‣Spark Streaming Apache Spark
  • 44. info@rittmanmead.com www.rittmanmead.com @rittmanmead 44 •Native support for multiple languages 
 with identical APIs ‣Python - prototyping, data wrangling ‣Scala - functional programming features ‣Java - lower-level, application integration •Use of closures, iterations, and other 
 common language constructs to minimize code •Integrated support for distributed +
 functional programming •Unified API for batch and streaming Rich Developer Support + Wide Developer Ecosystem scala> val logfile = sc.textFile("logs/access_log") 14/05/12 21:18:59 INFO MemoryStore: ensureFreeSpace(77353) 
 called with curMem=234759, maxMem=309225062 14/05/12 21:18:59 INFO MemoryStore: Block broadcast_2 
 stored as values to memory (estimated size 75.5 KB, free 294.6 MB) logfile: org.apache.spark.rdd.RDD[String] = 
 MappedRDD[31] at textFile at <console>:15 scala> logfile.count() 14/05/12 21:19:06 INFO FileInputFormat: Total input paths to process : 1 14/05/12 21:19:06 INFO SparkContext: Starting job: count at <console>:1 ... 14/05/12 21:19:06 INFO SparkContext: Job finished: 
 count at <console>:18, took 0.192536694 s res7: Long = 154563 scala> val logfile = sc.textFile("logs/access_log").cache scala> val biapps11g = logfile.filter(line => line.contains("/biapps11g/")) biapps11g: org.apache.spark.rdd.RDD[String] = FilteredRDD[34] at filter at <console>:17 scala> biapps11g.count() ... 14/05/12 21:28:28 INFO SparkContext: Job finished: count at <console>:20, took 0.387960876 s res9: Long = 403
  • 45. info@rittmanmead.com www.rittmanmead.com @rittmanmead 45 •Spark SQL, and Data Frames, allow RDDs in Spark to be processed using SQL queries •Bring in and federate additional data from JDBC sources •Load, read and save data in Hive, Parquet and other structured tabular formats Spark SQL - SQL within Apache Spark val accessLogsFilteredDF = accessLogs .filter( r => ! r.agent.matches(".*(spider|robot|bot|slurp).*")) .filter( r => ! r.endpoint.matches(".*(wp-content|wp-admin).*")).toDF() .registerTempTable("accessLogsFiltered") val topTenPostsLast24Hour = sqlContext.sql("SELECT p.POST_TITLE, p.POST_AUTHOR, COUNT(*) 
 as total 
 FROM accessLogsFiltered a 
 JOIN posts p ON a.endpoint = p.POST_SLUG 
 GROUP BY p.POST_TITLE, p.POST_AUTHOR 
 ORDER BY total DESC LIMIT 10 ") // Persist top ten table for this window to HDFS as parquet file topTenPostsLast24Hour.save("/user/oracle/rm_logs_batch_output/topTenPostsLast24Hour.parquet"
 , "parquet", SaveMode.Overwrite)
  • 46.
  • 47. info@rittmanmead.com www.rittmanmead.com @rittmanmead •Apache Drill is another SQL-on-Hadoop project that focus on schema-free data discovery •Inspired by Google Dremel, innovation is querying raw data with schema optional •Automatically infers and detects schema from semi-structured datasets and NoSQL DBs •Join across different silos of data e.g. JSON records, Hive tables and HBase database •Aimed at different use-cases than Hive - 
 low-latency queries, discovery 
 (think Endeca vs OBIEE) Apache Drill - Leverage Embedded Metadata in Files
  • 48. info@rittmanmead.com www.rittmanmead.com @rittmanmead •Most modern datasource formats embed their schema in the data (“schema-on-read”) •Apache Drill makes these as easy to join to traditional datasets as “point me at the data” •Cuts out unnecessary work in defining Hive schemas for data that’s self-describing •Supports joining across files,
 databases, NoSQL etc Self-Describing Data - Parquet, AVRO, JSON etc
  • 49. info@rittmanmead.com www.rittmanmead.com @rittmanmead •Files can exist either on the local filesystem, or on HDFS •Connection to directory or file defined in storage configuration •Can work with CSV, TXT, TSV etc •First row of file can provide schema (column names) Apache Drill and Text Files SELECT * FROM dfs.`/tmp/csv_with_header.csv2`; +-------+------+------+------+ | name | num1 | num2 | num3 | +-------+------+------+------+ | hello | 1 | 2 | 3 | | hello | 1 | 2 | 3 | | hello | 1 | 2 | 3 | | hello | 1 | 2 | 3 | | hello | 1 | 2 | 3 | | hello | 1 | 2 | 3 | | hello | 1 | 2 | 3 | +-------+------+------+------+ 7 rows selected (0.12 seconds) SELECT * FROM dfs.`/tmp/csv_no_header.csv`; +------------------------+ | columns | +------------------------+ | ["hello","1","2","3"] | | ["hello","1","2","3"] | | ["hello","1","2","3"] | | ["hello","1","2","3"] | | ["hello","1","2","3"] | | ["hello","1","2","3"] | | ["hello","1","2","3"] | +------------------------+ 7 rows selected (0.112 seconds)
  • 50. info@rittmanmead.com www.rittmanmead.com @rittmanmead •JSON (Javascript Object Notation) documents are often used for data interchange •Exports from Twitter and other consumer services •Web service responses and other B2B interfaces •A more lightweight form of XML that is “self-describing” •Handles evolving schemas, and optional attributes •Drill treats each document as a row, and has features to •Flatten nested data (extract elements from arrays) •Generate key/value pairs for loosely structured data Apache Drill and JSON Documents use dfs.iot; show files; select in_reply_to_user_id, text from `all_tweets.json` limit 5; +---------------------+------+ | in_reply_to_user_id | text | +---------------------+------+ | null | BI Forum 2013 in Brighton has now sold-out | | null | "Football has become a numbers game | | null | Just bought Lyndsay Wise’s Book | | null | An Oracle BI "Blast from the Past" | | 14716125 | Dilbert on Agile Programming | +---------------------+------+ 5 rows selected (0.229 seconds) select name, flatten(fillings) as f 
 from dfs.users.`/donuts.json` 
 where f.cal < 300;
  • 51. info@rittmanmead.com www.rittmanmead.com @rittmanmead •Drill can connect to Hive to make use of metastore (incl. multiple Hive metastores) •NoSQL databases (HBase etc) •Parquet files (native storage format - columnar + self describing) Apache Drill and Hive, HBase, Parquet Sources etc USE hbase; SELECT * FROM students; +-------------+-----------------------+-----------------------------------------------------+ | row_key | account | address | +-------------+-----------------------+------------------------------------------------------+ | [B@e6d9eb7 | {"name":"QWxpY2U="} | {"state":"Q0E=","street":"MTIzIEJhbGxtZXIgQXY="} | | [B@2823a2b4 | {"name":"Qm9i"} | {"state":"Q0E=","street":"MSBJbmZpbml0ZSBMb29w"} | | [B@3b8eec02 | {"name":"RnJhbms="} | {"state":"Q0E=","street":"NDM1IFdhbGtlciBDdA=="} | | [B@242895da | {"name":"TWFyeQ=="} | {"state":"Q0E=","street":"NTYgU291dGhlcm4gUGt3eQ=="} | +-------------+-----------------------+----------------------------------------------------------------------+ SELECT firstname,lastname FROM 
 hiveremote.`customers` limit 10;`
 +------------+------------+ | firstname | lastname | +------------+------------+ | Essie | Vaill | | Cruz | Roudabush | | Billie | Tinnes | | Zackary | Mockus | | Rosemarie | Fifield | | Bernard | Laboy | | Marianne | Earman | +------------+------------+ SELECT * FROM dfs.`iot_demo/geodata/region.parquet`; +--------------+--------------+-----------------------+ | R_REGIONKEY | R_NAME | R_COMMENT | +--------------+--------------+-----------------------+ | 0 | AFRICA | lar deposits. blithe | | 1 | AMERICA | hs use ironic, even | | 2 | ASIA | ges. thinly even pin | | 3 | EUROPE | ly final courts cajo | | 4 | MIDDLE EAST | uickly special accou | +--------------+--------------+-----------------------+
  • 52. info@rittmanmead.com www.rittmanmead.com @rittmanmead •Drill developed for real-time, ad-hoc data exploration with schema discovery on-the-fly •Individual analysts exploring new datasets, leveraging corporate metadata/data to help •Hive is more about large-scale, centrally curated set-based big data access •Drill models conceptually as JSON, vs. Hive’s tabular approach •Drill introspects schema from whatever it connects to, vs. formal modeling in Hive Apache Drill vs. Apache Hive Interactive Queries
 (Data Discovery, Tableau/VA) Reporting Queries
 (Canned Reports, OBIEE) ETL
 (ODI, Scripting, Informatica) Apache Drill Apache Hive Interactive Queries 100ms - 3mins Reporting Queries 3mins - 20mins ETL & Batch Queries 20mins - hours
  • 53.
  • 55. MARK RITTMAN, COO, RITTMAN MEAD BUDAPEST DATA FORUM, JUNE 2016 SQL-ON-HADOOP FOR ANALYTICS + BI: 
 WHAT ARE MY OPTIONS, WHAT’S THE FUTURE?