The Economics of SQL on Hadoop

© 2013 Datameer, Inc. All rights reserved.

Watch the Recording of this Webinar

View the entire recorded webinar at:

http://info.datameer.com/SlideshareEconomics-SQL-Hadoop.html

About our Speakers
John Myers

John Myers joined Enterprise Management Associates
in 2011 as senior analyst of the business intelligence
(BI) practice area. John has 10+ years of experience
working in areas related to business analytics in
professional services consulting and product
development roles, as well as helping organizations
solve their business analytics problems, whether they
relate to operational platforms, such as customer care
or billing, or applied analytical applications, such as
revenue assurance or fraud management.

Slide 3


About our Speakers
Stefan Groschupf

Stefan Groschupf is the co-founder and CEO of
Datameer. As one of the original contributors to
Nutch, the open source predecessor of Hadoop,
Stefan has been at the forefront of the Hadoop and
Big Data market.
Prior to Datameer, Stefan was the co-founder and
CEO of Scale Unlimited, which implemented
custom Hadoop analytic solutions for HP, Sun,
Deutsche Telekom, Nokia and others. Earlier,
Stefan was CEO of 101Tec, a supplier of Hadoop
and Nutch-based search and text classification
software to industry-leading companies such as
Apple, DHL and EMI Music. Stefan has also served
as CTO at multiple companies, including Sproose,
a social search engine company.

Slide 4


About our Speakers
Matt Schumpert

Matt has been working in enterprise software for
over 10 years in various capacities, including sales
engineering, strategic alliances and consulting.

Matt currently runs the pre-sales engineering team
at Datameer, supporting all technical aspects of
customer engagement through roll-out of customers
into production.

Matt holds a BS in Computer Science from the
University of Virginia.

Slide 5


Agenda
▪  EMA on Current State of the Big Data Industry
–  Online Archiving in Practice
–  SQL on NoSQL: Metadata
–  Exploratory Use Cases
–  Late Binding Schemas better for Discovery
–  Economics of Hadoop

▪  Datameer on how to solve these problems
–  Use Case #1: Semi-Structured Data
–  Use Case #2: Text Analytics
–  Use Case #3: Path Analysis

▪  Takeaways, and Question and Answer

Slide 6


State of Big Data Industry


Online Archiving is the majority use case for Big
Data projects

Slide 8

© 2013 Enterprise Management Associates, Inc.

Moving Beyond select * from tablename
SQL requires a managed set of metadata

Slide 9


Big Data Platforms have Multiple Uses:
Discovery is a significant portion

Slide 10


Late Binding Schemas are good for Discovery

Slide 11
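
A minimal sketch of what late binding can look like in practice, assuming raw events are stored as JSON lines. The sample records, field names, and the `read_with_schema` helper are all hypothetical and are not taken from the webinar. The idea is that nothing is cast or dropped on write, and each analysis binds its own schema when it reads.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (no schema on write).
RAW_EVENTS = [
    '{"ts": "2013-05-01T10:00:00", "user": "u1", "amount": "19.99", "country": "US"}',
    '{"ts": "2013-05-01T10:05:00", "user": "u2", "amount": "5.00"}',
]


def read_with_schema(raw_lines, schema):
    """Bind a schema at read time; schema maps field name -> cast function."""
    for line in raw_lines:
        record = json.loads(line)
        # Fields missing from a record become None instead of failing the read.
        yield {
            name: (cast(record[name]) if name in record else None)
            for name, cast in schema.items()
        }


# Two different questions bind two different schemas to the same stored data.
revenue_schema = {"user": str, "amount": float}
geo_schema = {"user": str, "country": str}

revenue_rows = list(read_with_schema(RAW_EVENTS, revenue_schema))
geo_rows = list(read_with_schema(RAW_EVENTS, geo_schema))
```

Because the schema lives in the query rather than in the storage layer, a new discovery question only needs a new schema dictionary, not a reload of the data.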


Free as in a free puppy…

Slide 12

Datameer Demos


Use Case #1: Semi-Structured Data

▪  Noisy, log-structured data → signal

Slide 14



▪  Extract, cast, & define fields on demand

Slide 15



▪  Painful/impossible without inspection

Slide 16



▪  “One-offs” are possible with SQL+UDFs
▪  But better to collaborate with shared “views”

Slide 17



▪  Examples:
–  “User-agent” string
–  URL parameters
–  JSON
Slide 18
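
The three examples above can be sketched in a few lines of Python. This is an illustration only, not the approach shown in the demo: the log line, its combined-log-like layout, and the helper names `extract_fields` and `extract_json_field` are assumptions. It shows fields being extracted and cast on demand, with noisy lines returning `None` instead of breaking the run.

```python
import json
import re
from urllib.parse import parse_qs, urlsplit

# Hypothetical log line; the combined-log-like layout is an assumption.
LOG_LINE = (
    '10.0.0.1 - - [01/May/2013:10:00:00 +0000] '
    '"GET /search?q=hadoop&page=2 HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (Windows NT 6.1) Chrome/26.0"'
)

LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"$'
)


def extract_fields(line):
    """Extract and cast fields from one log line; return None for noise."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Cast on demand: numeric fields become integers.
    fields["status"] = int(fields["status"])
    fields["size"] = int(fields["size"])
    # Split the URL into a path and a dictionary of query parameters.
    url = urlsplit(fields["url"])
    fields["path"] = url.path
    fields["params"] = {k: v[0] for k, v in parse_qs(url.query).items()}
    return fields


def extract_json_field(payload, key):
    """Pull one key out of a JSON payload; return None if it does not parse."""
    try:
        doc = json.loads(payload)
    except json.JSONDecodeError:
        return None
    return doc.get(key) if isinstance(doc, dict) else None


row = extract_fields(LOG_LINE)
```

Keeping the pattern and helpers in one shared module plays the role of a shared "view": everyone parses the user-agent string and URL parameters the same way instead of rewriting one-off expressions.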


Use Case #2: Text Analytics
▪  Few/no known fields

Slide 19


▪  Notion of a record is nebulous / fluid

Slide 20


▪  Wrangling and mining

Slide 21


▪  “Bag-of-Words” is a sensible start

Slide 22


▪  Again, frequent inspection is key

Slide 23
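
As a rough illustration of the bag-of-words starting point, the sketch below counts lowercase tokens per document and across a small corpus. The sample documents and the tokenizer pattern are assumptions chosen for brevity; real text would need more careful wrangling, which is why frequent inspection matters.

```python
import re
from collections import Counter

# Very simple tokenizer (an assumption): runs of letters, digits, apostrophes.
TOKEN = re.compile(r"[a-z0-9']+")


def bag_of_words(text):
    """Return token counts for one document, ignoring case and word order."""
    return Counter(TOKEN.findall(text.lower()))


# Hypothetical documents standing in for free-form text records.
DOCS = [
    "Hadoop is great. Hadoop scales!",
    "SQL on Hadoop is not always great",
]

bags = [bag_of_words(doc) for doc in DOCS]

# Corpus-level counts are the sum of the per-document bags.
corpus = sum(bags, Counter())
```

Inspecting `corpus.most_common(10)` after each wrangling step is a cheap way to spot tokens that should be merged, split, or dropped.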


Use Case #3: Path Analysis
▪  Key component of clickstream analysis

Slide 24


▪  Compares each record to the next/previous

Slide 25


▪  Defines/summarizes transitions, not events

Slide 26


▪  Supported by list/array types

Slide 27


▪  Requires multi-pass queries

Slide 28
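
The points above can be sketched as two passes over a small clickstream. The event tuples, session IDs, and page names are hypothetical. The first pass collects each session's pages into an ordered list (the list/array type), and the second compares each page with the next one to count transitions rather than events.

```python
from collections import Counter, defaultdict

# Hypothetical clickstream events: (session_id, timestamp, page).
EVENTS = [
    ("s1", 1, "home"), ("s1", 2, "search"), ("s1", 3, "product"),
    ("s2", 1, "home"), ("s2", 2, "product"),
    ("s1", 4, "checkout"),
    ("s2", 3, "checkout"),
]


def build_paths(events):
    """Pass 1: group events by session into a time-ordered list of pages."""
    paths = defaultdict(list)
    for session, _ts, page in sorted(events, key=lambda e: (e[0], e[1])):
        paths[session].append(page)
    return dict(paths)


def count_transitions(paths):
    """Pass 2: pair each page with the next one and count the transitions."""
    transitions = Counter()
    for pages in paths.values():
        transitions.update(zip(pages, pages[1:]))
    return transitions


paths = build_paths(EVENTS)
transitions = count_transitions(paths)
```

The second pass depends on the ordered lists produced by the first, which is one way to see why a single flat query is usually not enough for path analysis.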


Takeaways


When NOT to use SQL on Hadoop
▪  Structured schemas, or “Schema on Write”

Slide 30



▪  “Realtime” Query
SLAs for operational
or reporting tasks

Slide 31



▪  Highly detailed SQL
query requirements
(SQL-2003)

Slide 32


When to use SQL on Hadoop
▪  Unstructured datasets and “Schema on Read”

Slide 33


▪  Discovery tasks
designed to find new
connections and new
business value

Slide 34


▪  Lower level SQL
queries (SQL-99)

Slide 35


Summary
▪  EMA on Current State of the Big Data Industry
–  Online Archiving in Practice
–  SQL on NoSQL: Metadata
–  Exploratory Use Cases
–  Late Binding Schemas better for Discovery

▪  Datameer on how to solve these problems
–  Use Case #1: Semi-Structured Data
–  Use Case #2: Text Analytics
–  Use Case #3: Path Analysis

Slide 36


Call To Action
■  Visit our website
–  www.datameer.com

■  Download our Trial
–  http://www.datameer.com/Datameer-trial.html

Slide 37

