Using SQL-MapReduce for Advanced Analytics

Industry analyst Rick van der Lans explains how Aster Data's patent-pending SQL-MapReduce programming framework makes new types of analytic queries possible. The main benefits he outlines are: parallelization of complex operations; simplification of queries; predictable query performance; efficient data access; and linear scalability.

Check out Rick's complimentary research report at http://www.asterdata.com/ar_SQL-MapReduce_for_Advanced_Analytics/, in which he provides a very clear technical explanation of SQL-MapReduce and its analytic application use cases.

Published in: Technology

Transcript

  • 1. Using SQL-MapReduce for Advanced Analytical Queries
    by
    Rick F. van der Lans
    R20/Consultancy BV
  • 2. What Did the Users Want?
    BI reports
    Production databases
  • 3. But What Did We Create?
    ODS
    data warehouse
    datamart
    production database
    cube
  • 4. Problems with Current DW Platforms
    Poor query response: 45%
    Can’t support advanced analytics: 40%
    Inadequate data load speed: 39%
    Can’t scale to large data volumes: 37%
    Cost of scaling up is too expensive: 33%
    Poorly suited to real-time or on demand workloads: 29%
    Current platform is a legacy we must phase out: 23%
    Can’t support data modeling we need: 23%
    We need platform that supports mixed workloads: 21%
    Can’t support large concurrent user count: 20%
    Inadequate high availability: 19%
    Inadequate support for in-memory processing: 16%
    Inadequate support for web services and SOA: 16%
    Current platform is 32-bit, and we need 64-bit: 15%
    Current platform is SMP, and we need MPP: 14%
    We need platform better suited to cloud or virtualization: 13%
    Can’t secure the data properly: 11%
    Other: 4%
    No problems: 3%
    Source: P. Russom, ‘Next Generation Data Warehouse Platforms’, TDWI Best Practices Report, fourth quarter 2009.
  • 5. Need for More Powerful Data Warehouse Platforms
    [Chart: planned year for replacing the current DW platform. Categories: no plans to replace, 2009, 2010, 2011, 2012, 2013, 2014, 2015 or later. Values as extracted, in chart order: 49%, 8%, 20%, 12%, 8%, 1%, 1%, 3%.]
    Source: P. Russom, ‘Next Generation Data Warehouse Platforms’, TDWI Best Practices Report, fourth quarter 2009.
  • 6. New Forms of Analytics
    Advanced Analytics
    Operational Analytics
    Deep Analytics
    Self-Service Analytics
    Complex Analytics
    Automated Analytics
  • 7. Positioning of Advanced Analytics
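    [Diagram: a 2x2 grid. Vertical axis: complexity of analytical queries (low to high). Horizontal axis: database size (low to high).]
    High complexity, low size: complex queries on small to medium size databases
    High complexity, high size: advanced analytics
    Low complexity, low size: simple queries on small to medium size databases
    Low complexity, high size: simple queries on large to ultra large databases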
  • 8. Parallelization of SQL
    SELECT *
    FROM CUSTOMERS
    WHERE LOCATION = 'New York'
    [Diagram: a database server with one master and three workers.]
  • 9. How Easy Is Parallelizing SQL Queries? (1)
    Example 1:
    SELECT ID, SALES_DATE, PRICE
    FROM SALES_RECORDS
    WHERE PRICE > 100
    Example 2:
    SELECT REGION_ID, SUM(PRICE)
    FROM SALES_RECORDS
    WHERE PRICE > 100
    GROUP BY REGION_ID
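The two examples differ in what the master has to do with the workers' results. The following is a minimal Python sketch, not Aster Data's implementation: the sample rows, the three-way split, and all function names are assumptions made for illustration, and the workers run one after another here rather than in parallel. Example 1 only needs the filtered partitions to be concatenated, while Example 2 needs each worker's partial sums to be merged per REGION_ID.

```python
from collections import defaultdict

# Hypothetical rows (ID, SALES_DATE, REGION_ID, PRICE), pre-split over three workers.
PARTITIONS = [
    [(1, "2010-01-04", "EU", 120), (2, "2010-01-04", "US", 80)],
    [(3, "2010-01-05", "EU", 250), (4, "2010-01-05", "US", 300)],
    [(5, "2010-01-06", "EU", 90), (6, "2010-01-06", "US", 150)],
]


def worker_filter(rows):
    """Example 1 on one worker: SELECT ID, SALES_DATE, PRICE ... WHERE PRICE > 100."""
    return [(rid, date, price) for rid, date, _region, price in rows if price > 100]


def worker_partial_sums(rows):
    """Example 2 on one worker: filter, then a local SUM(PRICE) per REGION_ID."""
    sums = defaultdict(int)
    for _rid, _date, region, price in rows:
        if price > 100:
            sums[region] += price
    return dict(sums)


def master_example_1(partitions):
    # The master only concatenates what the workers return.
    return [row for part in partitions for row in worker_filter(part)]


def master_example_2(partitions):
    # The master merges partial sums, because one region can occur on several workers.
    totals = defaultdict(int)
    for part in partitions:
        for region, subtotal in worker_partial_sums(part).items():
            totals[region] += subtotal
    return dict(totals)


print(master_example_1(PARTITIONS))
print(master_example_2(PARTITIONS))
```

The extra merge step in `master_example_2` is the only coordination the GROUP BY query needs, which is why both examples count as easy to parallelize.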
  • 10. How Easy Is Parallelizing SQL Queries? (2)
    Example 3:
    Get all the flights to London for which another flight exists to London that leaves within an hour on the same day.
    SELECT *
    FROM DEPARTURES AS D1
    WHERE D1.DESTINATION = 'London'
    AND D1.DEPARTURE_TIME + 60 MINUTES >=
        (SELECT MIN(D2.DEPARTURE_TIME)
         FROM DEPARTURES AS D2
         WHERE D2.DESTINATION = 'London'
         AND D2.DEPARTURE_TIME > D1.DEPARTURE_TIME
         AND D2.DEPARTURE_DAY = D1.DEPARTURE_DAY)
    ORDER BY D1.DEPARTURE_TIME
  • 11. How Easy Is Parallelizing SQL Queries? (3)
    Example 4: Market basket analysis
    SELECT A.PROD_DESC AS ITEM1, B.PROD_DESC AS ITEM2, C.PROD_DESC AS ITEM3,
           COUNT(*) AS CNT
    FROM (SELECT SF.STORE_ID, SF.REG_ID, SF.TRAN_NO, SF.ITEM_ID, SF.DT,
                 PD.PROD_DESC, PD.PRICE
          FROM SALES_FACT SF
          INNER JOIN PRODUCT_DIM PD
          ON SF.ITEM_ID = PD.ITEM_ID) AS A,
         (SELECT SF.STORE_ID, SF.REG_ID, SF.TRAN_NO, SF.ITEM_ID, SF.DT,
                 PD.PROD_DESC, PD.PRICE
          FROM SALES_FACT SF
          INNER JOIN PRODUCT_DIM PD
          ON SF.ITEM_ID = PD.ITEM_ID) AS B,
         (SELECT SF.STORE_ID, SF.REG_ID, SF.TRAN_NO, SF.ITEM_ID, SF.DT,
                 PD.PROD_DESC, PD.PRICE
          FROM SALES_FACT SF
          INNER JOIN PRODUCT_DIM PD
          ON SF.ITEM_ID = PD.ITEM_ID) AS C
    WHERE A.STORE_ID = B.STORE_ID AND B.STORE_ID = C.STORE_ID AND A.STORE_ID = C.STORE_ID
    AND A.REG_ID = B.REG_ID AND B.REG_ID = C.REG_ID AND A.REG_ID = C.REG_ID
    AND A.TRAN_NO = B.TRAN_NO AND B.TRAN_NO = C.TRAN_NO AND A.TRAN_NO = C.TRAN_NO
    AND A.DT = B.DT AND B.DT = C.DT AND A.DT = C.DT
    AND A.ITEM_ID <> B.ITEM_ID AND A.ITEM_ID <> C.ITEM_ID AND B.ITEM_ID <> C.ITEM_ID
    GROUP BY A.PROD_DESC, B.PROD_DESC, C.PROD_DESC
    HAVING COUNT(*) > 1000
    ORDER BY COUNT(*) DESC;
  • 12. Declarativeness and Storage Independency
    Declarativeness:
    The developer has only to program what has to be done, and not how it should be done.
    Storage independency:
    The language should hide how data is physically stored and how it is accessed.
  • 13. Advantages of Two Properties
    Productivity increase:
    less code has to be written
    Maintainability:
    less code means having to maintain less code
    Flexibility:
    changes to the storage layer can be made without the need to change the SQL code in the reports
  • 14. Different Types of SQL Functions
    Built-in or User-defined
    SELECT FLIGHT, TRUNCATE(DEPARTURE_TIME, MINUTES)
    FROM DEPARTURES AS D1
    WHERE BANK_HOLIDAY(DEPARTURE_TIME) = 1
    Scalar or Table
    SELECT AVG(DURATION)
    FROM LAST_FIVE_ROWS(DEPARTURES)
    Pure SQL, Procedural, or External
    Simple or Complex
  • 15. MapReduce
    MapReduce is a programming model introduced by Google
    Aimed at processing requests on large data sets where the processing can be distributed over a large number of nodes and run in parallel
    Two steps: Map and Reduce
    Map is like Select
    Reduce is like Group-by
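The analogy on this slide can be made concrete with a small single-process sketch in Python. It is a simplified illustration of the programming model, not Google's or Aster Data's implementation; the sample data and all function names are assumptions. The mapper plays the role of the selection, and the reducer plays the role of GROUP BY with an aggregate.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record; each call may emit zero or more (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group the emitted values by key, as a framework does between the two steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's values into one result, comparable to GROUP BY with an aggregate."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical input: (REGION_ID, PRICE) sales records.
SALES = [("EU", 120), ("US", 80), ("EU", 250), ("US", 300)]


def price_over_100(record):
    region, price = record
    if price > 100:  # the selection part of Map
        yield region, price


def total(_region, prices):
    return sum(prices)


result = reduce_phase(shuffle(map_phase(SALES, price_over_100)), total)
print(result)
```

In a distributed setting the map calls and the per-key reduce calls are independent of each other, which is what allows them to be spread over many nodes.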
  • 16. Aster Data’s SQL-MapReduce (1)
    SQL-MR is a set of built-in and user-defined external table functions
    Example:
    SELECT *
    FROM GET_NEXT_FLIGHT_1HR
    (ON DEPARTURES PARTITION BY DESTINATION)
    WHERE DESTINATION = 'London'
    ORDER BY DEPARTURE_TIME
    All the SQL-MR function processing is parallelized
    Including complex group-by operations and time-series analytics
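The slides do not show the body of GET_NEXT_FLIGHT_1HR, so the following Python sketch is only a guess at the kind of logic such a function could contain. The sample rows, the representation of DEPARTURE_TIME as minutes after midnight, and the helper names are all assumptions. It mirrors the invocation above: rows are partitioned by DESTINATION, each partition is processed independently (and could therefore run on a different worker), and the outer query then filters on London and orders by departure time.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical DEPARTURES rows:
# (FLIGHT, DESTINATION, DEPARTURE_DAY, DEPARTURE_TIME in minutes after midnight).
DEPARTURES = [
    ("BA100", "London", "2010-03-01", 540),
    ("KL200", "London", "2010-03-01", 580),
    ("AF300", "London", "2010-03-01", 700),
    ("LH400", "Paris", "2010-03-01", 545),
    ("BA101", "London", "2010-03-02", 550),
]


def next_flight_within_hour(partition):
    """Process one partition: keep flights followed by a later same-day flight within 60 minutes."""
    ordered = sorted(partition, key=itemgetter(2, 3))
    for i, current in enumerate(ordered):
        later_times = [
            other[3]
            for other in ordered[i + 1:]
            if other[2] == current[2] and other[3] > current[3]
        ]
        if later_times and current[3] + 60 >= min(later_times):
            yield current


def get_next_flight_1hr(rows):
    """PARTITION BY DESTINATION, then run the function on each partition separately."""
    by_destination = sorted(rows, key=itemgetter(1))
    for _destination, partition in groupby(by_destination, key=itemgetter(1)):
        yield from next_flight_within_hour(list(partition))


# WHERE DESTINATION = 'London' ORDER BY DEPARTURE_TIME
london = sorted(
    (flight for flight in get_next_flight_1hr(DEPARTURES) if flight[1] == "London"),
    key=itemgetter(3),
)
print(london)
```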
  • 17. Aster Data’s SQL-MapReduce (2)
    An SQL-MR function can contain the most complex analytical logic
    SQL programmers don’t need to learn a new language; the functions themselves can be written in Java, C++, Python, and many more
    The SQL statements invoking SQL-MR functions are still declarative and storage-independent
    The functions themselves are not
    Usable by any BI tools supporting SQL
  • 18. Supported Built-in Functions
  • 19. SQL-MR
    Technical Advantages
    • Parallelization of complex operations
    • Simplification of queries
    • Efficiency of low-level programming language
    • Efficient data access
    • Predictable query performance
    • Linear scalability
    • Built-in functions
    • Polymorphism of the functions
    • Nesting of the functions
    Technical Disadvantages
    • Small group of developers have to learn a new language (possibly)
    • Low-level language is not declarative
    • Non-portable functions
  • Market Basket Analysis using SQL-MR
    SELECT PROD_DESC1, PROD_DESC2, PROD_DESC3, COUNT(*) AS CNT
    FROM BASKET_GENERATOR(
           ON (SELECT SF.STORE_ID, SF.REG_ID, SF.TRAN_NO, SF.ITEM_ID, SF.DT,
                      PD.PROD_DESC, PD.PRICE
               FROM SALES_FACT SF
               INNER JOIN PRODUCT_DIM PD
               ON SF.ITEM_ID = PD.ITEM_ID) AS TRANSACTIONS
           PARTITION BY STORE_ID, REG_ID, TRAN_NO, DT
           BASKET_ITEM('PROD_DESC')
           BASKET_SIZE('3')
         )
    GROUP BY PROD_DESC1, PROD_DESC2, PROD_DESC3
    HAVING COUNT(*) > 1000
    ORDER BY COUNT(*) DESC;
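To show why a single function call can replace the three-way self-join, here is a hedged Python sketch of what a basket generator might do. It is not Aster Data's BASKET_GENERATOR: the sample rows and the function body are assumptions, and the sketch emits unordered combinations of distinct products, which may differ from how the real function orders or repeats items. Rows are grouped by the PARTITION BY columns, and each transaction then yields every basket of the requested size.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical joined rows: (STORE_ID, REG_ID, TRAN_NO, DT, PROD_DESC).
TRANSACTIONS = [
    (1, 1, 100, "2010-03-01", "bread"),
    (1, 1, 100, "2010-03-01", "milk"),
    (1, 1, 100, "2010-03-01", "eggs"),
    (1, 2, 101, "2010-03-01", "milk"),
    (1, 2, 101, "2010-03-01", "bread"),
    (1, 2, 101, "2010-03-01", "eggs"),
    (1, 2, 101, "2010-03-01", "jam"),
    (2, 1, 102, "2010-03-01", "bread"),
    (2, 1, 102, "2010-03-01", "milk"),
]


def basket_generator(rows, basket_size=3):
    """Emit every combination of `basket_size` distinct products per transaction."""
    baskets = defaultdict(set)
    for store_id, reg_id, tran_no, dt, prod_desc in rows:
        # Same grouping as PARTITION BY STORE_ID, REG_ID, TRAN_NO, DT.
        baskets[(store_id, reg_id, tran_no, dt)].add(prod_desc)
    for products in baskets.values():
        # Sorting makes a basket identical regardless of the order rows were scanned in.
        yield from combinations(sorted(products), basket_size)


counts = Counter(basket_generator(TRANSACTIONS))

# The slide filters with HAVING COUNT(*) > 1000; this toy data set uses > 1 instead.
frequent = [(basket, n) for basket, n in counts.most_common() if n > 1]
print(frequent)
```

Because each transaction is handled on its own, the partitions can be processed on different workers, and changing the basket size is a parameter change rather than an extra self-join.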
  • 31. Business Advantages of SQL-MR
    Simplification of architecture
    Deep analytics
    Complex analytics
    Operational analytics
    Self-service analytics
    No forbidden queries
  • 32. Simplification of Architecture
    SQL-MR
    production database
    data warehouse
    ODS
    datamart
    cube
    analytics
  • 33. Conclusions
    The analytical and reporting demands are increasing
    Most environments already have problems with performance
    The marriage of SQL and MapReduce offers enormous potential
    Parallelizing the processing of analytical logic
  • 35. Questions & Answers
    Rick van der Lans
    R20 Consultancy
    e-mail: rick@r20.nl website: http://www.r20.nl
    Stephanie McReynolds
    Director of Product Marketing, Aster Data
    e-mail: smcreyno@asterdata.com
    For More Information on Aster Data:
    http://www.asterdata.com