The Economics of SQL on Hadoop

© 2013 Datameer, Inc. All rights reserved.
Watch the Recording of this Webinar


View the entire recorded webinar at:

http://info.datameer.com/SlideshareEconomics-SQL-Hadoop.html
About our Speakers
John Myers

John Myers joined Enterprise Management Associates
in 2011 as senior analyst of the business intelligence
(BI) practice area. John has 10+ years of experience
working in areas related to business analytics in
professional services consulting and product
development roles, as well as helping organizations
solve their business analytics problems, whether they
relate to operational platforms, such as customer care
or billing, or applied analytical applications, such as
revenue assurance or fraud management.

Stefan Groschupf

Stefan Groschupf is the co-founder and CEO of
Datameer. One of the original contributors to
Nutch, the open source predecessor of Hadoop,
Stefan has been at the forefront of the Hadoop and
Big Data market.
Prior to Datameer, Stefan was the co-founder and
CEO of Scale Unlimited, which implemented
custom Hadoop analytic solutions for HP, Sun,
Deutsche Telekom, Nokia and others. Earlier,
Stefan was CEO of 101Tec, a supplier of Hadoop
and Nutch-based search and text classification
software to industry-leading companies such as
Apple, DHL and EMI Music. Stefan has also served
as CTO at multiple companies, including Sproose,
a social search engine company.

Matt Schumpert

Matt has been working in enterprise software for
over 10 years in various capacities, including sales
engineering, strategic alliances and consulting.

Matt currently runs the pre-sales engineering team
at Datameer, supporting all technical aspects of
customer engagement through roll-out of customers
into production.

Matt holds a BS in Computer Science from the
University of Virginia.

Agenda
▪  EMA on Current State of the Big Data Industry
–  Online Archiving in Practice
–  SQL on NoSQL: Metadata
–  Exploratory Use Cases
–  Late Binding Schemas better for Discovery
–  Economics of Hadoop

▪  Datameer on how to solve these problems
–  Use Case #1: Semi-Structured Data
–  Use Case #2: Text Analytics
–  Use Case #3: Path Analysis

▪  Takeaways, and Question and Answer

State of Big Data Industry

Online Archiving is the majority use case for Big Data projects

© 2013 Enterprise Management Associates, Inc.
Moving Beyond select * from tablename
SQL requires a managed set of metadata

Big Data Platforms have Multiple Uses:
Discovery is a significant portion

Late Binding Schemas are good for Discovery

Free as in a Free Puppy…

Datameer Demos

Use Case #1: Semi-Structured Data

▪  Noisy, log-structured data → signal
▪  Extract, cast, & define fields on demand
▪  Painful/impossible without inspection
▪  “One-offs” are possible with SQL+UDFs
▪  But better to collaborate with shared “views”

▪  Examples:
▪  “User-agent” string
▪  URL Parameters 
▪  JSON
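
The "extract, cast, and define fields on demand" idea can be sketched in a few lines of Python. This is only an illustration, not Datameer's implementation: the tab-separated log layout, the sample records, and the `extract_fields` helper are all assumptions made up for the example.

```python
import json
from urllib.parse import urlsplit, parse_qs

# Hypothetical raw log records: tab-separated URL, user-agent string, JSON payload.
RAW_LINES = [
    'http://shop.example.com/cart?item=42&qty=3\tMozilla/5.0 (Windows NT 6.1) Firefox/25.0\t{"user": "u1", "total": "19.90"}',
    'http://shop.example.com/search?q=hadoop\tMozilla/5.0 (Macintosh) Safari/537.36\t{"user": "u2"}',
]

def extract_fields(line):
    """Derive typed fields from one raw line at read time (late binding)."""
    url, user_agent, payload = line.split("\t")
    parts = urlsplit(url)
    params = parse_qs(parts.query)   # URL parameters -> dict of lists
    doc = json.loads(payload)        # JSON payload -> dict
    return {
        "path": parts.path,
        # Cast on demand; fields that are absent become None rather than errors.
        "qty": int(params["qty"][0]) if "qty" in params else None,
        "query": params.get("q", [None])[0],
        # Crude user-agent heuristic, for illustration only.
        "browser": "Firefox" if "Firefox/" in user_agent else "Other",
        "user": doc.get("user"),
        "total": float(doc["total"]) if "total" in doc else None,
    }

rows = [extract_fields(line) for line in RAW_LINES]
for row in rows:
    print(row)
```

Printing the rows after each change is the "inspection" step the slide stresses: without looking at the output, it is hard to tell which fields are missing or mistyped.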

Use Case #2: Text Analytics
▪  Few/no known fields
▪  Notion of a record is nebulous / fluid
▪  Wrangling and mining
▪  “Bag-of-Words” is a sensible start

Slide 22

© 2013 Datameer, Inc. All rights reserved.
Use Case #2: Text Analytics
▪  Few/no known fields
▪  Notion of a record is nebulous / fluid
▪  Wrangling and mining
▪  “Bag-of-Words” is a sensible start
▪  Again, frequent inspection is key
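
A bag-of-words pass is small enough to sketch directly. The example below is a minimal illustration under assumed inputs: the two sample documents and the simple "letters only" tokenizer are hypothetical, and a real pipeline would likely need better tokenization and stop-word handling.

```python
import re
from collections import Counter

# Hypothetical free-text documents; there are no predefined fields to query.
DOCS = [
    "Hadoop stores the raw text. The text has no schema.",
    "Inspect the text often while wrangling and mining.",
]

def bag_of_words(text):
    """Lowercase the text, keep runs of letters as tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

bags = [bag_of_words(doc) for doc in DOCS]   # one bag per document
corpus = sum(bags, Counter())                # term counts across all documents

print(corpus.most_common(3))
```

Each bag ignores word order and record boundaries inside the document, which suits data where the notion of a record is fluid.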

Use Case #3: Path Analysis 
▪  Key component of clickstream analysis
▪  Compares each record to the next/previous
▪  Defines/summarizes transitions, not events
▪  Supported by list/array types
▪  Requires multi-pass queries
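
The two passes behind path analysis can be sketched as follows. This is an illustrative example only: the `(session_id, timestamp, page)` event layout and the sample clicks are assumptions, not data from the webinar.

```python
from collections import Counter, defaultdict

# Hypothetical clickstream events: (session_id, timestamp, page).
EVENTS = [
    ("s1", 3, "checkout"), ("s1", 1, "home"), ("s1", 2, "cart"),
    ("s2", 1, "home"), ("s2", 2, "search"), ("s2", 3, "cart"),
]

# Pass 1: group events into one time-ordered list of pages per session.
paths = defaultdict(list)
for session, ts, page in sorted(EVENTS, key=lambda e: (e[0], e[1])):
    paths[session].append(page)

# Pass 2: pair each page with the next one and count the transitions.
transitions = Counter()
for pages in paths.values():
    transitions.update(zip(pages, pages[1:]))

for (src, dst), count in sorted(transitions.items()):
    print(f"{src} -> {dst}: {count}")
```

The per-session list in the first pass is where list/array types come in, and the second pass summarizes transitions between records rather than the individual events.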

Takeaways

When NOT to use SQL on Hadoop
▪  Structured Schemas or “Schema on Write”
▪  “Realtime” Query SLAs for operational or reporting tasks
▪  Highly detailed SQL query requirements (SQL-2003)

When to use SQL on Hadoop
▪  Unstructured Datasets and “Schema on Read”
▪  Discovery tasks designed to find new connections and new business value
▪  Lower level SQL queries (SQL-99)

Summary
▪  EMA on Current State of the Big Data Industry
–  Online Archiving in Practice
–  SQL on NoSQL: Metadata
–  Exploratory Use Cases
–  Late Binding Schemas better for Discovery

▪  Datameer on how to solve these problems
–  Use Case #1: Semi-Structured Data
–  Use Case #2: Text Analytics
–  Use Case #3: Path Analysis

Call To Action
■  Visit our website
–  www.datameer.com

■  Download our Trial
–  http://www.datameer.com/Datameer-trial.html

Slide 37

© 2013 Datameer, Inc. All rights reserved.
The Economics of SQL on Hadoop

More Related Content

Similar to The Economics of SQL on Hadoop

The New Database Frontier: Harnessing the Cloud
The New Database Frontier: Harnessing the CloudThe New Database Frontier: Harnessing the Cloud
The New Database Frontier: Harnessing the Cloud
Inside Analysis
 
How to do Data Science Without the Scientist
How to do Data Science Without the ScientistHow to do Data Science Without the Scientist
How to do Data Science Without the Scientist
Datameer
 
How to Avoid Pitfalls in Big Data Analytics Webinar
How to Avoid Pitfalls in Big Data Analytics WebinarHow to Avoid Pitfalls in Big Data Analytics Webinar
How to Avoid Pitfalls in Big Data Analytics Webinar
Datameer
 
Big data oracle_introduccion
Big data oracle_introduccionBig data oracle_introduccion
Big data oracle_introduccion
Fran Navarro
 
Looking Before You Leap into the Cloud: A proactive approach to machine learn...
Looking Before You Leap into the Cloud: A proactive approach to machine learn...Looking Before You Leap into the Cloud: A proactive approach to machine learn...
Looking Before You Leap into the Cloud: A proactive approach to machine learn...
Enterprise Management Associates
 
Customer Case Studies of Self-Service Big Data Analytics
Customer Case Studies of Self-Service Big Data AnalyticsCustomer Case Studies of Self-Service Big Data Analytics
Customer Case Studies of Self-Service Big Data Analytics
Datameer
 
InfoSphere BigInsights
InfoSphere BigInsightsInfoSphere BigInsights
InfoSphere BigInsights
Wilfried Hoge
 
Big Data LDN 2017: The New Dominant Companies Are Running on Data
Big Data LDN 2017: The New Dominant Companies Are Running on DataBig Data LDN 2017: The New Dominant Companies Are Running on Data
Big Data LDN 2017: The New Dominant Companies Are Running on Data
Matt Stubbs
 
Big Data LDN 2017: The New Dominant Companies Are Running on Data
Big Data LDN 2017: The New Dominant Companies Are Running on DataBig Data LDN 2017: The New Dominant Companies Are Running on Data
Big Data LDN 2017: The New Dominant Companies Are Running on Data
Matt Stubbs
 
The new dominant companies are running on data
The new dominant companies are running on data The new dominant companies are running on data
The new dominant companies are running on data
SnapLogic
 
When SAP alone is not enough
When SAP alone is not enoughWhen SAP alone is not enough
When SAP alone is not enough
Cloudera, Inc.
 
6 enriching your data warehouse with big data and hadoop
6 enriching your data warehouse with big data and hadoop6 enriching your data warehouse with big data and hadoop
6 enriching your data warehouse with big data and hadoop
Dr. Wilfred Lin (Ph.D.)
 
Modern data integration expert sessions
Modern data integration expert sessionsModern data integration expert sessions
Modern data integration expert sessions
JessicaMurrell3
 
Modern Data Integration Expert Session Webinar
Modern Data Integration Expert Session Webinar Modern Data Integration Expert Session Webinar
Modern Data Integration Expert Session Webinar
ibi
 
Building the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InBuilding the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump In
SnapLogic
 
Webinar - Big Data: Power to the User
Webinar - Big Data: Power to the User Webinar - Big Data: Power to the User
Webinar - Big Data: Power to the User
Datameer
 
Enterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaEnterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, Cloudera
Neo4j
 
Datameer6 for prospects - june 2016_v2
Datameer6 for prospects - june 2016_v2Datameer6 for prospects - june 2016_v2
Datameer6 for prospects - june 2016_v2
Datameer
 
Standing Up an Effective Enterprise Data Hub -- Technology and Beyond
Standing Up an Effective Enterprise Data Hub -- Technology and BeyondStanding Up an Effective Enterprise Data Hub -- Technology and Beyond
Standing Up an Effective Enterprise Data Hub -- Technology and Beyond
Cloudera, Inc.
 
Transform Banking with Big Data and Automated Machine Learning 9.12.17
Transform Banking with Big Data and Automated Machine Learning 9.12.17Transform Banking with Big Data and Automated Machine Learning 9.12.17
Transform Banking with Big Data and Automated Machine Learning 9.12.17
Cloudera, Inc.
 

Similar to The Economics of SQL on Hadoop (20)

The New Database Frontier: Harnessing the Cloud
The New Database Frontier: Harnessing the CloudThe New Database Frontier: Harnessing the Cloud
The New Database Frontier: Harnessing the Cloud
 
How to do Data Science Without the Scientist
How to do Data Science Without the ScientistHow to do Data Science Without the Scientist
How to do Data Science Without the Scientist
 
How to Avoid Pitfalls in Big Data Analytics Webinar
How to Avoid Pitfalls in Big Data Analytics WebinarHow to Avoid Pitfalls in Big Data Analytics Webinar
How to Avoid Pitfalls in Big Data Analytics Webinar
 
Big data oracle_introduccion
Big data oracle_introduccionBig data oracle_introduccion
Big data oracle_introduccion
 
Looking Before You Leap into the Cloud: A proactive approach to machine learn...
Looking Before You Leap into the Cloud: A proactive approach to machine learn...Looking Before You Leap into the Cloud: A proactive approach to machine learn...
Looking Before You Leap into the Cloud: A proactive approach to machine learn...
 
Customer Case Studies of Self-Service Big Data Analytics
Customer Case Studies of Self-Service Big Data AnalyticsCustomer Case Studies of Self-Service Big Data Analytics
Customer Case Studies of Self-Service Big Data Analytics
 
InfoSphere BigInsights
InfoSphere BigInsightsInfoSphere BigInsights
InfoSphere BigInsights
 
Big Data LDN 2017: The New Dominant Companies Are Running on Data
Big Data LDN 2017: The New Dominant Companies Are Running on DataBig Data LDN 2017: The New Dominant Companies Are Running on Data
Big Data LDN 2017: The New Dominant Companies Are Running on Data
 
Big Data LDN 2017: The New Dominant Companies Are Running on Data
Big Data LDN 2017: The New Dominant Companies Are Running on DataBig Data LDN 2017: The New Dominant Companies Are Running on Data
Big Data LDN 2017: The New Dominant Companies Are Running on Data
 
The new dominant companies are running on data
The new dominant companies are running on data The new dominant companies are running on data
The new dominant companies are running on data
 
When SAP alone is not enough
When SAP alone is not enoughWhen SAP alone is not enough
When SAP alone is not enough
 
6 enriching your data warehouse with big data and hadoop
6 enriching your data warehouse with big data and hadoop6 enriching your data warehouse with big data and hadoop
6 enriching your data warehouse with big data and hadoop
 
Modern data integration expert sessions
Modern data integration expert sessionsModern data integration expert sessions
Modern data integration expert sessions
 
Modern Data Integration Expert Session Webinar
Modern Data Integration Expert Session Webinar Modern Data Integration Expert Session Webinar
Modern Data Integration Expert Session Webinar
 
Building the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump InBuilding the Enterprise Data Lake - Important Considerations Before You Jump In
Building the Enterprise Data Lake - Important Considerations Before You Jump In
 
Webinar - Big Data: Power to the User
Webinar - Big Data: Power to the User Webinar - Big Data: Power to the User
Webinar - Big Data: Power to the User
 
Enterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaEnterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, Cloudera
 
Datameer6 for prospects - june 2016_v2
Datameer6 for prospects - june 2016_v2Datameer6 for prospects - june 2016_v2
Datameer6 for prospects - june 2016_v2
 
Standing Up an Effective Enterprise Data Hub -- Technology and Beyond
Standing Up an Effective Enterprise Data Hub -- Technology and BeyondStanding Up an Effective Enterprise Data Hub -- Technology and Beyond
Standing Up an Effective Enterprise Data Hub -- Technology and Beyond
 
Transform Banking with Big Data and Automated Machine Learning 9.12.17
Transform Banking with Big Data and Automated Machine Learning 9.12.17Transform Banking with Big Data and Automated Machine Learning 9.12.17
Transform Banking with Big Data and Automated Machine Learning 9.12.17
 

More from Datameer

Extending BI with Big Data Analytics
Extending BI with Big Data AnalyticsExtending BI with Big Data Analytics
Extending BI with Big Data Analytics
Datameer
 
Getting Started with Big Data for Business Managers
Getting Started with Big Data for Business ManagersGetting Started with Big Data for Business Managers
Getting Started with Big Data for Business Managers
Datameer
 
The State of Big Data Adoption: A Glance at Top Industries Adopting Big Data ...
The State of Big Data Adoption: A Glance at Top Industries Adopting Big Data ...The State of Big Data Adoption: A Glance at Top Industries Adopting Big Data ...
The State of Big Data Adoption: A Glance at Top Industries Adopting Big Data ...
Datameer
 
Understand Your Customer Buying Journey with Big Data
Understand Your Customer Buying Journey with Big Data Understand Your Customer Buying Journey with Big Data
Understand Your Customer Buying Journey with Big Data
Datameer
 
Analyzing Unstructured Data in Hadoop Webinar
Analyzing Unstructured Data in Hadoop WebinarAnalyzing Unstructured Data in Hadoop Webinar
Analyzing Unstructured Data in Hadoop Webinar
Datameer
 
Webinar - Introducing Datameer 4.0: Visual, End-to-End
Webinar - Introducing Datameer 4.0: Visual, End-to-EndWebinar - Introducing Datameer 4.0: Visual, End-to-End
Webinar - Introducing Datameer 4.0: Visual, End-to-End
Datameer
 
Why Use Hadoop for Big Data Analytics?
Why Use Hadoop for Big Data Analytics?Why Use Hadoop for Big Data Analytics?
Why Use Hadoop for Big Data Analytics?
Datameer
 
Why Use Hadoop?
Why Use Hadoop?Why Use Hadoop?
Why Use Hadoop?
Datameer
 
Online Fraud Detection Using Big Data Analytics Webinar
Online Fraud Detection Using Big Data Analytics WebinarOnline Fraud Detection Using Big Data Analytics Webinar
Online Fraud Detection Using Big Data Analytics Webinar
Datameer
 
BI, Hive or Big Data Analytics?
BI, Hive or Big Data Analytics? BI, Hive or Big Data Analytics?
BI, Hive or Big Data Analytics?
Datameer
 
Is Your Hadoop Environment Secure?
Is Your Hadoop Environment Secure?Is Your Hadoop Environment Secure?
Is Your Hadoop Environment Secure?
Datameer
 
Fight Fraud with Big Data Analytics
Fight Fraud with Big Data AnalyticsFight Fraud with Big Data Analytics
Fight Fraud with Big Data Analytics
Datameer
 
Complement Your Existing Data Warehouse with Big Data & Hadoop
Complement Your Existing Data Warehouse with Big Data & HadoopComplement Your Existing Data Warehouse with Big Data & Hadoop
Complement Your Existing Data Warehouse with Big Data & Hadoop
Datameer
 
Lean Production Meets Big Data: A Next Generation Use Case
Lean Production Meets Big Data: A Next Generation Use CaseLean Production Meets Big Data: A Next Generation Use Case
Lean Production Meets Big Data: A Next Generation Use Case
Datameer
 
Top 3 Considerations for Machine Learning on Big Data
Top 3 Considerations for Machine Learning on Big DataTop 3 Considerations for Machine Learning on Big Data
Top 3 Considerations for Machine Learning on Big Data
Datameer
 
Best Practices for Big Data Analytics with Machine Learning by Datameer
Best Practices for Big Data Analytics with Machine Learning by DatameerBest Practices for Big Data Analytics with Machine Learning by Datameer
Best Practices for Big Data Analytics with Machine Learning by Datameer
Datameer
 
How to do Predictive Analytics with Limited Data
How to do Predictive Analytics with Limited DataHow to do Predictive Analytics with Limited Data
How to do Predictive Analytics with Limited Data
Datameer
 

More from Datameer (17)

Extending BI with Big Data Analytics
Extending BI with Big Data AnalyticsExtending BI with Big Data Analytics
Extending BI with Big Data Analytics
 
Getting Started with Big Data for Business Managers
Getting Started with Big Data for Business ManagersGetting Started with Big Data for Business Managers
Getting Started with Big Data for Business Managers
 
The State of Big Data Adoption: A Glance at Top Industries Adopting Big Data ...
The State of Big Data Adoption: A Glance at Top Industries Adopting Big Data ...The State of Big Data Adoption: A Glance at Top Industries Adopting Big Data ...
The State of Big Data Adoption: A Glance at Top Industries Adopting Big Data ...
 
Understand Your Customer Buying Journey with Big Data
Understand Your Customer Buying Journey with Big Data Understand Your Customer Buying Journey with Big Data
Understand Your Customer Buying Journey with Big Data
 
Analyzing Unstructured Data in Hadoop Webinar
Analyzing Unstructured Data in Hadoop WebinarAnalyzing Unstructured Data in Hadoop Webinar
Analyzing Unstructured Data in Hadoop Webinar
 
Webinar - Introducing Datameer 4.0: Visual, End-to-End
Webinar - Introducing Datameer 4.0: Visual, End-to-EndWebinar - Introducing Datameer 4.0: Visual, End-to-End
Webinar - Introducing Datameer 4.0: Visual, End-to-End
 
Why Use Hadoop for Big Data Analytics?
Why Use Hadoop for Big Data Analytics?Why Use Hadoop for Big Data Analytics?
Why Use Hadoop for Big Data Analytics?
 
Why Use Hadoop?
Why Use Hadoop?Why Use Hadoop?
Why Use Hadoop?
 
Online Fraud Detection Using Big Data Analytics Webinar
Online Fraud Detection Using Big Data Analytics WebinarOnline Fraud Detection Using Big Data Analytics Webinar
Online Fraud Detection Using Big Data Analytics Webinar
 
BI, Hive or Big Data Analytics?
BI, Hive or Big Data Analytics? BI, Hive or Big Data Analytics?
BI, Hive or Big Data Analytics?
 
Is Your Hadoop Environment Secure?
Is Your Hadoop Environment Secure?Is Your Hadoop Environment Secure?
Is Your Hadoop Environment Secure?
 
Fight Fraud with Big Data Analytics
Fight Fraud with Big Data AnalyticsFight Fraud with Big Data Analytics
Fight Fraud with Big Data Analytics
 
Complement Your Existing Data Warehouse with Big Data & Hadoop
Complement Your Existing Data Warehouse with Big Data & HadoopComplement Your Existing Data Warehouse with Big Data & Hadoop
Complement Your Existing Data Warehouse with Big Data & Hadoop
 
Lean Production Meets Big Data: A Next Generation Use Case
Lean Production Meets Big Data: A Next Generation Use CaseLean Production Meets Big Data: A Next Generation Use Case
Lean Production Meets Big Data: A Next Generation Use Case
 
Top 3 Considerations for Machine Learning on Big Data
Top 3 Considerations for Machine Learning on Big DataTop 3 Considerations for Machine Learning on Big Data
Top 3 Considerations for Machine Learning on Big Data
 
Best Practices for Big Data Analytics with Machine Learning by Datameer
Best Practices for Big Data Analytics with Machine Learning by DatameerBest Practices for Big Data Analytics with Machine Learning by Datameer
Best Practices for Big Data Analytics with Machine Learning by Datameer
 
How to do Predictive Analytics with Limited Data
How to do Predictive Analytics with Limited DataHow to do Predictive Analytics with Limited Data
How to do Predictive Analytics with Limited Data
 

Recently uploaded

System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
Hiike
 
SAP S/4 HANA sourcing and procurement to Public cloud
SAP S/4 HANA sourcing and procurement to Public cloudSAP S/4 HANA sourcing and procurement to Public cloud
SAP S/4 HANA sourcing and procurement to Public cloud
maazsz111
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
Shinana2
 
Public CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptxPublic CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptx
marufrahmanstratejm
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Jeffrey Haguewood
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
Intelisync
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
LucaBarbaro3
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Tatiana Kojar
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 

Recently uploaded (20)

System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
 
SAP S/4 HANA sourcing and procurement to Public cloud
SAP S/4 HANA sourcing and procurement to Public cloudSAP S/4 HANA sourcing and procurement to Public cloud
SAP S/4 HANA sourcing and procurement to Public cloud
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
 
Public CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptxPublic CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptx
 
The Economics of SQL on Hadoop

  • 1. The Economics of SQL on Hadoop © 2013 Datameer, Inc. All rights reserved.
  • 2. Watch the Recording of this Webinar View the entire recorded webinar at: http://info.datameer.com/SlideshareEconomics-SQL-Hadoop.html
  • 3. About our Speakers: John Myers. John Myers joined Enterprise Management Associates in 2011 as senior analyst of the business intelligence (BI) practice area. John has 10+ years of experience working in areas related to business analytics in professional services consulting and product development roles, as well as helping organizations solve their business analytics problems, whether they relate to operational platforms, such as customer care or billing, or applied analytical applications, such as revenue assurance or fraud management.
  • 4. About our Speakers: Stefan Groschupf. Stefan Groschupf is the co-founder and CEO of Datameer. One of the original contributors to Nutch, the open source predecessor of Hadoop, Stefan has been at the forefront of the Hadoop and Big Data market. Prior to Datameer, Stefan was the co-founder and CEO of Scale Unlimited, which implemented custom Hadoop analytic solutions for HP, Sun, Deutsche Telekom, Nokia and others. Earlier, Stefan was CEO of 101Tec, a supplier of Hadoop and Nutch-based search and text classification software to industry-leading companies such as Apple, DHL and EMI Music. Stefan has also served as CTO at multiple companies, including Sproose, a social search engine company.
  • 5. About our Speakers: Matt Schumpert. Matt has been working in enterprise software for over 10 years in various capacities, including sales engineering, strategic alliances and consulting. Matt currently runs the pre-sales engineering team at Datameer, supporting all technical aspects of customer engagement through roll-out of customers into production. Matt holds a BS in Computer Science from the University of Virginia.
  • 6. Agenda ▪ EMA on Current State of the Big Data Industry – Online Archiving in Practice – SQL on NoSQL: Metadata – Exploratory Use Cases – Late Binding Schemas better for Discovery – Economics of Hadoop ▪ Datameer on how to solve these problems – Use Case #1: Semi-Structured Data – Use Case #2: Text Analytics – Use Case #3: Path Analysis ▪ Takeaways, and Question and Answer
  • 7. State of Big Data Industry
  • 8. Online Archiving is the majority use case for Big Data projects © 2013 Enterprise Management Associates, Inc.
  • 9. Moving Beyond select * from tablename: SQL requires a managed set of metadata
  • 10. Big Data Platforms have Multiple Uses: Discovery is a significant portion
  • 11. Late Binding Schemas are good for Discovery
  • 12. Free as a Free puppy…
  • 13. Datameer Demos
  • 14. Use Case #1: Semi-Structured Data ▪ Noisy, log-structured data → signal
  • 15. Use Case #1: Semi-Structured Data ▪ Noisy, log-structured data → signal ▪ Extract, cast, & define fields on demand
  • 16. Use Case #1: Semi-Structured Data ▪ Noisy, log-structured data → signal ▪ Extract, cast, & define fields on demand ▪ Painful/impossible without inspection
  • 17. Use Case #1: Semi-Structured Data ▪ Noisy, log-structured data → signal ▪ Extract, cast, & define fields on demand ▪ Painful/impossible without inspection ▪ “One-offs” are possible with SQL+UDFs ▪ But better to collaborate with shared “views”
  • 18. Use Case #1: Semi-Structured Data ▪ Noisy, log-structured data → signal ▪ Extract, cast, & define fields on demand ▪ Painful/impossible without inspection ▪ “One-offs” are possible with SQL+UDFs ▪ But better to collaborate with shared “views” ▪ Examples: ▪ “User-agent” string ▪ URL Parameters ▪ JSON
  • 19. Use Case #2: Text Analytics ▪ Few/no known fields
  • 20. Use Case #2: Text Analytics ▪ Few/no known fields ▪ Notion of a record is nebulous / fluid
  • 21. Use Case #2: Text Analytics ▪ Few/no known fields ▪ Notion of a record is nebulous / fluid ▪ Wrangling and mining
  • 22. Use Case #2: Text Analytics ▪ Few/no known fields ▪ Notion of a record is nebulous / fluid ▪ Wrangling and mining ▪ “Bag-of-Words” is a sensible start
  • 23. Use Case #2: Text Analytics ▪ Few/no known fields ▪ Notion of a record is nebulous / fluid ▪ Wrangling and mining ▪ “Bag-of-Words” is a sensible start ▪ Again, frequent inspection is key
  • 24. Use Case #3: Path Analysis ▪ Key component of clickstream analysis
  • 25. Use Case #3: Path Analysis ▪ Key component of clickstream analysis ▪ Compares each record to the next/previous
  • 26. Use Case #3: Path Analysis ▪ Key component of clickstream analysis ▪ Compares each record to the next/previous ▪ Defines/summarizes transitions, not events
  • 27. Use Case #3: Path Analysis ▪ Key component of clickstream analysis ▪ Compares each record to the next/previous ▪ Defines/summarizes transitions, not events ▪ Supported by list/array types
  • 28. Use Case #3: Path Analysis ▪ Key component of clickstream analysis ▪ Compares each record to the next/previous ▪ Defines/summarizes transitions, not events ▪ Supported by list/array types ▪ Requires multi-pass queries
  • 29. Takeaways
  • 30. When NOT to use SQL on Hadoop ▪ Structured Schemas or “Schema on Write”
  • 31. When NOT to use SQL on Hadoop ▪ Structured Schemas or “Schema on Write” ▪ “Realtime” Query SLAs for operational or reporting tasks
  • 32. When NOT to use SQL on Hadoop ▪ Structured Schemas or “Schema on Write” ▪ “Realtime” Query SLAs for operational or reporting tasks ▪ Highly detailed SQL query requirements (SQL-2003)
  • 33. When to use SQL on Hadoop ▪ Unstructured Datasets and “Schema on Read”
  • 34. When to use SQL on Hadoop ▪ Unstructured Datasets and “Schema on Read” ▪ Discovery tasks designed to find new connections and new business value
  • 35. When to use SQL on Hadoop ▪ Unstructured Datasets and “Schema on Read” ▪ Discovery tasks designed to find new connections and new business value ▪ Lower level SQL queries (SQL-99)
  • 36. Summary ▪ EMA on Current State of the Big Data Industry – Online Archiving in Practice – SQL on NoSQL: Metadata – Exploratory Use Cases – Late Binding Schemas better for Discovery ▪ Datameer on how to solve these problems – Use Case #1: Semi-Structured Data – Use Case #2: Text Analytics – Use Case #3: Path Analysis
  • 37. Call To Action ■ Visit our website – www.datameer.com ■ Download our Trial – http://www.datameer.com/Datameer-trial.html
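
Use Case #1 in the slides describes extracting, casting, and defining fields on demand from noisy, log-structured data, with URL parameters and JSON as examples. A minimal Python sketch of that idea follows. The tab-separated log layout, the field names, and the extract helper are assumptions made for illustration; they are not taken from the webinar or from any Datameer product.

```python
import json
from urllib.parse import urlsplit, parse_qs

# Hypothetical raw log lines: timestamp, request URL, JSON payload (tab-separated).
RAW_LINES = [
    '2013-10-01T12:00:00\t/search?q=hadoop&page=2\t{"user": "u1", "ms": 120}',
    '2013-10-01T12:00:05\t/cart?item=42\t{"user": "u2", "ms": "85"}',
    'corrupted line without tabs',
]


def extract(line):
    """Apply a schema at read time; return None for lines that do not fit it."""
    parts = line.split("\t")
    if len(parts) != 3:
        return None  # noise: skip instead of failing the whole load
    timestamp, url, payload = parts
    try:
        body = json.loads(payload)
    except json.JSONDecodeError:
        return None
    split = urlsplit(url)
    # parse_qs returns a list per key; keep the first value of each parameter.
    params = {key: values[0] for key, values in parse_qs(split.query).items()}
    return {
        "timestamp": timestamp,
        "path": split.path,
        "params": params,
        "user": body.get("user"),
        # Cast on demand: "ms" arrives as a number in one line, a string in another.
        "ms": int(body["ms"]) if "ms" in body else None,
    }


records = [record for record in map(extract, RAW_LINES) if record is not None]
for record in records:
    print(record)
```

Nothing is declared about the data before it is read, so a field that turns out to matter later can be added to extract without reloading anything. The unparseable third line is dropped quietly, which is also why the slides stress inspecting the data frequently.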

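Use Case #3 describes path analysis as comparing each record to the next one and summarizing transitions rather than events. The sketch below shows one way to do that in Python, under assumptions: the clickstream tuples, the session identifiers, and the transitions helper are hypothetical and only illustrate the technique.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

# Hypothetical clickstream events: (session_id, sequence_number, page).
EVENTS = [
    ("s1", 1, "home"), ("s1", 2, "search"), ("s1", 3, "product"),
    ("s2", 1, "home"), ("s2", 2, "product"),
    ("s3", 1, "home"), ("s3", 2, "search"),
]


def transitions(events):
    """Count page-to-page transitions within each session."""
    counts = Counter()
    # First pass: order events by session, then by position in the session.
    ordered = sorted(events, key=itemgetter(0, 1))
    for _, group in groupby(ordered, key=itemgetter(0)):
        # Collect each session's path as a list (the "list/array type").
        pages = [page for _, _, page in group]
        # Second pass: pair every page with the page that follows it.
        counts.update(zip(pages, pages[1:]))
    return counts


transition_counts = transitions(EVENTS)
for (source, target), count in transition_counts.most_common():
    print(f"{source} -> {target}: {count}")
```

The two steps (ordering events into per-session lists, then pairing neighbours) loosely mirror the slide's points that path analysis is supported by list/array types and requires more than one pass over the data.
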
Editor's Notes

  1. According to 2012 EMA research, Online Archiving, or “Hadumping”, is “Phase zero” of most Big Data initiatives. It teaches internal teams about data delivery and structure, how to interact with the data, and how to apply data to business cases rather than treating it as simply a technology project. It is where you start when “you don’t know what you don’t know…” 2013 EMA research shows that over half of Big Data projects have online archiving in an ‘In Operation’ status: in production, or as a pilot project with hands on keyboards and software installed. Over 4 in 10 respondents say “Economics” is a business reason for the Online Archiving use case. These organizations are attempting to lower their operational costs.
  2. Moving beyond select * requires a facility that manages and tracks metadata. “select * from tablename” is the rough equivalent of “cat filename”. SQL starts to become truly “special” when you use a query such as: select t.columnA, s.columnB, s.columnC from tablename t, tablename s where t.columnZ = s.columnX. NoSQL, and specifically Hadoop, have focused on flexibility in data storage, often at the expense of metadata management. SQL doesn’t deal well with an “or” data structure (image on right). SQL works best with a defined data structure (image on right). When you ask Hive a question it doesn’t understand, you get an error message. In 2013 EMA research, Big Data initiatives used the following datasets: machine generated (JSON, XML, etc.), almost 40%; process mediated (structured), just under 30%; human sourced (emails, texts), over 30%. Over 30% of respondents indicate that a lack of self-service data access (SQL) is a challenge in operating a Hadoop platform. Nearly 40% of respondents say a lack of SQL data access is a challenge in operating a NoSQL platform. Each of these findings indicates that while you “CAN” run certain applications on Hadoop, SQL-based data access is a major concern.
  3. Big Data environments aren’t just for EDW replacement, as some would say. There are multiple use cases: operational, analytical, and exploratory. Nearly 3 in 10 respondents in the 2013 research say that they are using exploratory or discovery use cases. Just under 50% of respondents say operational costs (staff head count is included) are a challenge in operating a discovery platform. 3 in 10 respondents want to use the features and functions of products to speed their skills acquisition. Often these are the features they feel most comfortable with: interfaces and processes that they use every day. MS Excel is an example. Nearly 4 in 10 respondents indicate that new skills development is a challenge in operating a discovery platform.
  4. When you are using exploratory or discovery use cases, you need flexibility. Applying a hard (structured) schema presupposes particular questions AND answers. Think of a square wooden peg and a round wooden hole: not a lot of give. Being able to apply a schema or structure at the time of query, a late binding schema, enables the best method of discovery: a flexible schema at the time of processing, like a sausage grinder. 2013 EMA research says over 30% of respondents use late binding schemas when processing data, nearly a third use multiple approaches, and over 10% don’t apply a schema at all. “Only” about one third of respondents are using external technical resources to bridge their skills gaps. This comes from the cost of outside consultants compared with existing staff.
  5. “Free as in Speech” or “Free as in Beer”… Big Data is “Free as a Free Puppy”. Over 40% of respondents say Economics is a business reason for the Online Archiving use case. Back to metadata: over one third of respondents indicate that a shortage of technical metadata is a challenge in operating a discovery platform. Applying that technical metadata layer takes manual effort and thus additional headcount. When you link this to ‘only’ a 1% increase in Big Data budget from 2013 to 2014 for Hadoop implementations, it is important to put Hadoop platforms to their best use. 36% say time to implement is a challenge in operating a Hadoop platform. 43% say operational costs are a challenge in operating a discovery platform (link to a 1% increase in Big Data operational budget from 2013 to 2014). Over one third of respondents cite a lack of skills to manage multi-structured data platforms as an obstacle to implementation (the top answer).
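
Note 2 contrasts select * with a join that only works when the engine knows the column names and types. The sketch below illustrates that contrast using Python's built-in sqlite3 module as a stand-in for Hive (an assumption made so the example runs anywhere). The customers and orders tables and their columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'East'), (2, 'West');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 15.0), (12, 2, 40.0);
""")

# Rough equivalent of `cat filename`: no column knowledge is needed.
all_rows = conn.execute("SELECT * FROM orders").fetchall()

# A join and aggregate only work because the engine tracks column metadata.
joined = conn.execute("""
    SELECT c.region, SUM(o.amount)
    FROM customers c JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

# Asking for a column the metadata does not define produces an error.
try:
    conn.execute("SELECT discount FROM orders")
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)

print(all_rows)
print(joined)
print(error_message)
```

The CREATE TABLE statements are the managed metadata the note refers to. On a platform that stores raw files without such definitions, someone still has to supply them before the second query can be expressed.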