The Economics of SQL on Hadoop

Watch the recorded event at: http://info.datameer.com/Slideshare-Economics-SQL-Hadoop.html

As organizations clamor to utilize their new investments in Hadoop ecosystems AND leverage their existing analytical infrastructures, many rush to integrate SQL as a data access layer to leverage existing skill sets and get started faster.

However, this approach relegates Hadoop to a data management and processing platform rather than the storage and compute engine optimized for analytical workloads it was purpose-built to be.

These slides by EMA and Datameer discuss the technical limitations of SQL on Hadoop and propose alternative ways to get the most out of Hadoop investments.

You will understand:

*how SQL negates the inherent benefits of Hadoop
*why technological paradigm changes can sometimes be good
*use cases when SQL on Hadoop makes sense

  • According to 2012 EMA research, Online Archiving, or “Hadumping”, is Phase “zero” of most Big Data initiatives
    It teaches internal teams about data delivery and structure
    How to interact with the data
    How to apply data to business cases rather than treating it as simply a technology project
    It is where you start when:
    “you don’t know what you don’t know…”
    2013 EMA research shows that over half of Big Data projects have online archiving in an ‘In Operation’ status
    That means in production or as a pilot project, with hands on keyboards and software installed.
    Over 4 in 10 respondents say “Economics” is a business reason for the Online Archiving use case.
    These organizations are attempting to lower their operational costs
  • Moving beyond select * requires a facility that manages and tracks metadata
    Select * from tablename is the rough equivalent of cat filename
    SQL starts to become truly “special” when you use a query such as
    Select t.columnA, s.columnB, s.columnC from tablename t, tablename s
    Where t.columnZ = s.columnX
    NoSQL, and specifically Hadoop, has focused on flexibility in data storage, often at the expense of metadata management
    SQL doesn’t do well with an “or” data structure (image on right)
    SQL works best with a defined data structure (image on right)
    When you ask Hive a question it doesn’t understand… you get the error message.
    In 2013 EMA research, Big Data initiatives used the following datasets
    Machine generated (JSON, XML, etc.): almost 40%
    Process mediated (structured): just under 30%
    Human sourced (emails, texts): over 30%
    Over 30% of respondents indicate that a lack of self-service data access (SQL) is a challenge in operating a Hadoop platform
    Nearly 40% of respondents say a lack of SQL data access is a challenge in operating a NoSQL platform
    In each of these instances, the research indicates that while you “CAN” run certain applications on Hadoop, SQL-based data access is a major concern.
  • Big Data environments aren’t just for EDW replacement, as some would say
    There are multiple use cases
    Operational
    Analytical
    Exploratory
    Nearly 3 of 10 respondents in the 2013 research say that they are using Exploratory or Discovery use cases
    Just under 50% of respondents say operational costs (including staff head count) are a challenge in operating a discovery platform.
    3 of 10 respondents want to use the features and functions of products to speed their skills acquisition. Often these are the features they feel most comfortable with: interfaces and processes that they use every day. MS Excel is an example.
    Nearly 4 out of 10 respondents indicate new skills development is a challenge in operating a discovery platform
  • When you are using exploratory or discovery use cases, you need flexibility… applying a hard (structured) schema presupposes particular questions AND answers.
    Square wooden peg and round wooden hole: not a lot of give.
    Being able to apply a schema or structure at the time of query, that is, a late binding schema, enables the best method of discovery
    Flexible schema at the time of processing… sausage grinder
    2013 EMA research says
    Over 30% of respondents use late binding schemas when processing data
    Nearly a third use multiple approaches
    Over 10% don’t apply a schema at all…
    “Only” about one third of respondents are using external technical resources to bridge their skills gaps. This comes from the cost of outside consultants versus existing staff
  • “Free as in Speech” or “Free as in Beer”… Big Data is “Free as a Free Puppy”
    Over 40% of respondents say Economics is a business reason for the Online Archiving use case
    Back to metadata…
    Over one third of respondents indicate that a shortage of technical metadata is a challenge in operating a discovery platform. Applying that technical metadata layer takes manual effort and thus additional headcount. When you link this to ‘only’ a 1% increase in big data budget from 2013 to 2014 for Hadoop implementations, it is important to put Hadoop platforms to their best use.
    36% say implementation time is a challenge in operating a Hadoop platform
    43% say operational costs are a challenge in operating a discovery platform (link to a 1% increase in big data operational budget from 2013 to 2014)
    Over one third of respondents cite a lack of skills to manage multi-structured data platforms as an obstacle to implementation (top answer)
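The notes above on late binding schemas, together with the "extract, cast, & define fields on demand" point in Use Case #1, can be illustrated with a small sketch. It is not Datameer's or Hive's implementation; the sample records, the `read_with_schema` and `url_param` helpers, and the field names are all hypothetical, chosen only to show a schema being applied at read time to data that was stored as-is.

```python
import json
from urllib.parse import urlparse, parse_qs

# Hypothetical raw records, stored exactly as they arrived (no schema on write).
RAW_LINES = [
    '{"ts": "2013-10-01T12:00:00", "url": "http://example.com/search?q=hadoop&page=2"}',
    '{"ts": "2013-10-01T12:00:05", "url": "http://example.com/cart?item=42", "user": "u1"}',
    'not json at all',
]


def url_param(name, cast=str):
    """Build an extractor that pulls one URL query parameter and casts it."""
    def extract(record):
        values = parse_qs(urlparse(record.get("url", "")).query).get(name)
        return cast(values[0]) if values else None
    return extract


def read_with_schema(lines, schema):
    """Apply `schema` (field name -> extractor function) while reading."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines this schema cannot interpret
        if not isinstance(record, dict):
            continue
        yield {name: extract(record) for name, extract in schema.items()}


# One possible "view" of the data; another question could use another schema
# over the same raw lines without reloading or restructuring them.
search_schema = {
    "ts": lambda r: r.get("ts"),
    "query": url_param("q"),
    "page": url_param("page", int),
}

rows = list(read_with_schema(RAW_LINES, search_schema))
for row in rows:
    print(row)
```

The design point is that the schema lives with the question rather than with the stored data: fields that a record lacks come back as `None` instead of causing a load failure, which is the flexibility the notes attribute to discovery work.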

Transcript

  • 1. The Economics of SQL on Hadoop © 2013 Datameer, Inc. All rights reserved.
  • 2. Watch the Recording of this Webinar View the entire recorded webinar at: http://info.datameer.com/SlideshareEconomics-SQL-Hadoop.html
  • 3. About our Speakers: John Myers
    John Myers joined Enterprise Management Associates in 2011 as senior analyst of the business intelligence (BI) practice area. John has 10+ years of experience working in areas related to business analytics in professional services consulting and product development roles, as well as helping organizations solve their business analytics problems, whether they relate to operational platforms, such as customer care or billing, or applied analytical applications, such as revenue assurance or fraud management.
  • 4. About our Speakers: Stefan Groschupf
    Stefan Groschupf is the co-founder and CEO of Datameer. As one of the original contributors to Nutch, the open source predecessor of Hadoop, Stefan has been at the forefront of the Hadoop and Big Data market. Prior to Datameer, Stefan was the co-founder and CEO of Scale Unlimited, which implemented custom Hadoop analytic solutions for HP, Sun, Deutsche Telekom, Nokia and others. Earlier, Stefan was CEO of 101Tec, a supplier of Hadoop and Nutch-based search and text classification software to industry-leading companies such as Apple, DHL and EMI Music. Stefan has also served as CTO at multiple companies, including Sproose, a social search engine company.
  • 5. About our Speakers: Matt Schumpert
    Matt has been working in enterprise software for over 10 years in various capacities, including sales engineering, strategic alliances and consulting.
    Matt currently runs the pre-sales engineering team at Datameer, supporting all technical aspects of customer engagement through roll-out of customers into production.
    Matt holds a BS in Computer Science from the University of Virginia.
  • 6. Agenda
    ▪ EMA on Current State of the Big Data Industry
      – Online Archiving in Practice
      – SQL on NoSQL: Metadata
      – Exploratory Use Cases
      – Late Binding Schemas better for Discovery
      – Economics of Hadoop
    ▪ Datameer on how to solve these problems
      – Use Case #1: Semi-Structured Data
      – Use Case #2: Text Analytics data
      – Use Case #3: Path Analysis
    ▪ Takeaways; Question and Answer
  • 7. State of Big Data Industry
  • 8. Online Archiving is the majority use case for Big Data projects © 2013 Enterprise Management Associates, Inc.
  • 9. Moving Beyond select * from tablename: SQL requires a managed set of metadata
  • 10. Big Data Platforms have Multiple Uses: Discovery is a significant portion
  • 11. Late Binding Schemas are good for Discovery
  • 12. Free as a Free puppy…
  • 13. Datameer Demos
  • 14-18. Use Case #1: Semi-Structured Data
    ▪ Noisy, log-structured data → signal
    ▪ Extract, cast, & define fields on demand
    ▪ Painful/impossible without inspection
    ▪ “One-offs” are possible with SQL+UDFs
    ▪ But better to collaborate with shared “views”
    ▪ Examples:
      – “User-agent” string
      – URL Parameters
      – JSON
  • 19-23. Use Case #2: Text Analytics
    ▪ Few/no known fields
    ▪ Notion of a record is nebulous / fluid
    ▪ Wrangling and mining
    ▪ “Bag-of-Words” is a sensible start
    ▪ Again, frequent inspection is key
  • 24-28. Use Case #3: Path Analysis
    ▪ Key component of clickstream analysis
    ▪ Compares each record to the next/previous
    ▪ Defines/summarizes transitions, not events
    ▪ Supported by list/array types
    ▪ Requires multi-pass queries
  • 29. Takeaways
  • 30-32. When NOT to use SQL on Hadoop
    ▪ Structured Schemas or “Schema on Write”
    ▪ “Realtime” Query SLAs for operational or reporting tasks
    ▪ Highly detailed SQL query requirements (SQL-2003)
  • 33-35. When to use SQL on Hadoop
    ▪ Unstructured Datasets and “Schema on Read”
    ▪ Discovery tasks designed to find new connections and new business value
    ▪ Lower level SQL queries (SQL-99)
  • 36. Summary
    ▪ EMA on Current State of the Big Data Industry
      – Online Archiving in Practice
      – SQL on NoSQL: Metadata
      – Exploratory Use Cases
      – Late Binding Schemas better for Discovery
    ▪ Datameer on how to solve these problems
      – Use Case #1: Semi-Structured Data
      – Use Case #2: Text Analytics
      – Use Case #3: Path Analysis
  • 37. Call To Action
    ■ Visit our website: www.datameer.com
    ■ Download our Trial: http://www.datameer.com/Datameer-trial.html
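Use Case #3 in the transcript describes path analysis as comparing each record to the next one, summarizing transitions rather than events, relying on list types, and needing more than one pass. A minimal sketch of that idea follows; it is not the demo shown in the webinar, and the `EVENTS` data and `transition_counts` function are hypothetical names introduced only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical clickstream events: (session_id, timestamp, page).
# They are deliberately out of order, as raw logs often are.
EVENTS = [
    ("s1", 3, "checkout"),
    ("s1", 1, "home"),
    ("s1", 2, "cart"),
    ("s2", 1, "home"),
    ("s2", 2, "cart"),
    ("s2", 3, "search"),
]


def transition_counts(events):
    """Count page-to-page transitions within each session."""
    # Pass 1: collect each session's visits into a list.
    sessions = defaultdict(list)
    for session_id, ts, page in events:
        sessions[session_id].append((ts, page))

    # Pass 2: order each list by time and pair every page with the next one.
    counts = Counter()
    for visits in sessions.values():
        pages = [page for _, page in sorted(visits)]
        counts.update(zip(pages, pages[1:]))
    return counts


counts = transition_counts(EVENTS)
for (src, dst), n in counts.most_common():
    print(f"{src} -> {dst}: {n}")
```

The two passes and the per-session list mirror the "multi-pass" and "list/array types" bullets: the unit of analysis is the pair of adjacent records, which is awkward to express when each row is treated independently.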