Using SQL-MapReduce for Advanced Analytics
Industry analyst Rick van der Lans explains how Aster Data's patent-pending SQL-MapReduce programming framework makes new types of analytic queries possible. The main benefits he outlines are: parallelization of complex operations; simplification of queries; predictable query performance; efficient data access; and linear scalability.

Check out Rick's complimentary research report at http://www.asterdata.com/ar_SQL-MapReduce_for_Advanced_Analytics/, in which he provides a very clear technical explanation of SQL-MapReduce and its analytic application use cases.

Published in: Technology
Transcript

  • 1. Using SQL-MapReduce for Advanced Analytical Queries
    by Rick F. van der Lans, R20/Consultancy BV
  • 2. What Did the Users Want?
    [Diagram labels: BI reports, production databases]
  • 3. But What Did We Create?
    [Diagram labels: production database, ODS, data warehouse, datamart, cube]
  • 4. Problems with Current DW Platforms
      45%  Poor query response
      40%  Can’t support advanced analytics
      39%  Inadequate data load speed
      37%  Can’t scale to large data volumes
      33%  Cost of scaling up is too expensive
      29%  Poorly suited to real-time or on demand workloads
      23%  Current platform is a legacy we must phase out
      23%  Can’t support data modeling we need
      21%  We need platform that supports mixed workloads
      20%  Can’t support large concurrent user count
      19%  Inadequate high availability
      16%  Inadequate support for in-memory processing
      16%  Inadequate support for web services and SOA
      15%  Current platform is 32-bit, and we need 64-bit
      14%  Current platform is SMP, and we need MPP
      13%  We need platform better suited to cloud or virtualization
      11%  Can’t secure the data properly
       4%  Other
       3%  No problems
    Source: P. Russom, ‘Next Generation Data Warehouse Platforms’, TDWI Best Practices Report, fourth quarter 2009.
  • 5. Need for More Powerful Data Warehouse Platforms
    [Chart: planned replacement of the current DW platform. Categories: 2009, 2010, 2011, 2012, 2013, 2014, 2015 or later, no plans to replace. Values in extraction order: 49%, 8%, 20%, 12%, 8%, 1%, 1%, 3%.]
    Source: P. Russom, ‘Next Generation Data Warehouse Platforms’, TDWI Best Practices Report, fourth quarter 2009.
  • 6. New Forms of Analytics
      Advanced Analytics
      Operational Analytics
      Deep Analytics
      Self-Service Analytics
      Complex Analytics
      Automated Analytics
  • 7. Positioning of Advanced Analytics
    [Quadrant chart. Vertical axis: complexity of analytical queries (low to high). Horizontal axis: database size (low to high).]
      High complexity, low size: complex queries on small to medium size databases
      High complexity, high size: advanced analytics
      Low complexity, low size: simple queries on small to medium size databases
      Low complexity, high size: simple queries on large to ultra large databases
  • 8. Parallelization of SQL
    [Diagram labels: Database Server, Master, Worker, Worker, Worker]
      SELECT *
      FROM CUSTOMERS
      WHERE LOCATION = 'New York'
  • 9. How Easy Is Parallelizing SQL Queries? (1)
    Example 1:
      SELECT ID, SALES_DATE, PRICE
      FROM SALES_RECORDS
      WHERE PRICE > 100
    Example 2:
      SELECT REGION_ID, SUM(PRICE)
      FROM SALES_RECORDS
      WHERE PRICE > 100
      GROUP BY REGION_ID
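To make the contrast between the two examples concrete, here is a minimal Python sketch, not taken from the deck, of how a master could split both queries over workers. The sample rows, the partitioning, and all function names are assumptions for illustration. Example 1 only needs the worker results concatenated; Example 2 needs the master to merge partial sums per region.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample rows (id, region_id, price), already spread over three workers.
PARTITIONS = [
    [(1, "north", 150), (2, "south", 90)],
    [(3, "north", 250), (4, "east", 120)],
    [(5, "south", 300), (6, "east", 80)],
]


def worker_filter(rows):
    """Example 1 on one worker: keep the rows with price > 100."""
    return [row for row in rows if row[2] > 100]


def worker_partial_sums(rows):
    """Example 2 on one worker: local SUM(price) per region for price > 100."""
    sums = defaultdict(int)
    for _id, region, price in rows:
        if price > 100:
            sums[region] += price
    return dict(sums)


def master_filter(partitions):
    """The master only concatenates the worker results."""
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(worker_filter, partitions))
    return [row for part in parts for row in part]


def master_group_sum(partitions):
    """The master merges the partial sums, because one region can span workers."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(worker_partial_sums, partitions))
    totals = defaultdict(int)
    for partial in partials:
        for region, subtotal in partial.items():
            totals[region] += subtotal
    return dict(totals)
```

Threads stand in for worker nodes here; the point is only the shape of the merge step, not performance.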
  • 10. How Easy Is Parallelizing SQL Queries? (2)
    Example 3: Get all the flights to London for which another flight exists to London that leaves within an hour on the same day.
      SELECT *
      FROM DEPARTURES AS D1
      WHERE DESTINATION = 'London'
      AND DEPARTURE_TIME + 60 MINUTES >=
          (SELECT MIN(DEPARTURE_TIME)
           FROM DEPARTURES AS D2
           WHERE DESTINATION = 'London'
           AND D2.DEPARTURE_TIME > D1.DEPARTURE_TIME
           AND D2.DEPARTURE_DAY = D1.DEPARTURE_DAY)
      ORDER BY DEPARTURE_TIME
  • 11. How Easy Is Parallelizing SQL Queries? (3)
    Example 4: Market basket analysis:
      SELECT A.PROD_DESC AS ITEM1, B.PROD_DESC AS ITEM2,
             C.PROD_DESC AS ITEM3, COUNT(*) AS CNT
      FROM (SELECT SF.STORE_ID, SF.REG_ID, SF.TRAN_NO, SF.ITEM_ID,
                   SF.DT, PD.PROD_DESC, PD.PRICE
            FROM SALES_FACT SF
            INNER JOIN PRODUCT_DIM PD
            ON SF.ITEM_ID = PD.ITEM_ID) AS A,
           (SELECT SF.STORE_ID, SF.REG_ID, SF.TRAN_NO, SF.ITEM_ID,
                   SF.DT, PD.PROD_DESC, PD.PRICE
            FROM SALES_FACT SF
            INNER JOIN PRODUCT_DIM PD
            ON SF.ITEM_ID = PD.ITEM_ID) AS B,
           (SELECT SF.STORE_ID, SF.REG_ID, SF.TRAN_NO, SF.ITEM_ID,
                   SF.DT, PD.PROD_DESC, PD.PRICE
            FROM SALES_FACT SF
            INNER JOIN PRODUCT_DIM PD
            ON SF.ITEM_ID = PD.ITEM_ID) AS C
      WHERE A.STORE_ID = B.STORE_ID AND B.STORE_ID = C.STORE_ID AND A.STORE_ID = C.STORE_ID
      AND A.REG_ID = B.REG_ID AND B.REG_ID = C.REG_ID AND A.REG_ID = C.REG_ID
      AND A.TRAN_NO = B.TRAN_NO AND B.TRAN_NO = C.TRAN_NO AND A.TRAN_NO = C.TRAN_NO
      AND A.DT = B.DT AND B.DT = C.DT AND A.DT = C.DT
      AND A.ITEM_ID <> B.ITEM_ID AND A.ITEM_ID <> C.ITEM_ID AND B.ITEM_ID <> C.ITEM_ID
      GROUP BY A.PROD_DESC, B.PROD_DESC, C.PROD_DESC
      HAVING COUNT(*) > 1000
      ORDER BY COUNT(*) DESC;
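The counting logic behind Example 4 is easier to see procedurally. The following is a minimal Python sketch under stated assumptions, not the author's or any vendor's implementation: rows are hypothetical tuples of (store_id, reg_id, tran_no, dt, prod_desc), items are identified by their description, and the function name `count_item_triples` is invented for illustration. Note one difference: the self-join counts every ordering of a triple separately, while this sketch counts each unordered triple once.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_item_triples(rows, basket_size=3):
    """Count how often each combination of items occurs within one transaction.

    rows: iterable of (store_id, reg_id, tran_no, dt, prod_desc) tuples.
    A transaction is identified by (store_id, reg_id, tran_no, dt).
    """
    baskets = defaultdict(set)
    for store_id, reg_id, tran_no, dt, prod_desc in rows:
        baskets[(store_id, reg_id, tran_no, dt)].add(prod_desc)

    counts = Counter()
    for items in baskets.values():
        # Sorting gives every combination one canonical order.
        for combo in combinations(sorted(items), basket_size):
            counts[combo] += 1
    return counts


# Hypothetical sample data: two transactions at the same store and register.
SALES = [
    (1, 1, 100, "2010-05-01", "bread"),
    (1, 1, 100, "2010-05-01", "milk"),
    (1, 1, 100, "2010-05-01", "eggs"),
    (1, 1, 100, "2010-05-01", "butter"),
    (1, 1, 101, "2010-05-01", "bread"),
    (1, 1, 101, "2010-05-01", "milk"),
    (1, 1, 101, "2010-05-01", "eggs"),
]

triple_counts = count_item_triples(SALES)
```

Because each transaction is processed on its own, the work can be split by transaction key, which is what makes this kind of analysis a candidate for parallel execution.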
  • 12. Declarativeness and Storage Independency
    Declarativeness: The developer has only to program what has to be done, and not how it should be done.
    Storage independency: The language should hide how data is physically stored and how it is accessed.
  • 13. Advantages of Two Properties
    Productivity increase: less code has to be written
    Maintainability: less code means having to maintain less code
    Flexibility: changes to the storage layer can be made without the need to change the SQL code in the reports
  • 14. Different Types of SQL Functions
    Built-in or User-defined
      SELECT FLIGHT, TRUNCATE(DEPARTURE_TIME, MINUTES)
      FROM DEPARTURES AS D1
      WHERE BANK_HOLIDAY(DEPARTURE_TIME) = 1
    Scalar or Table
      SELECT AVG(DURATION)
      FROM LAST_FIVE_ROWS(DEPARTURES)
    Pure SQL, Procedural, or External
    Simple or Complex
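As a rough illustration of the scalar versus table distinction, here is a Python sketch using the standard `sqlite3` module. SQLite is only a stand-in chosen for this sketch; the deck does not mention it. `BANK_HOLIDAY` is registered as a user-defined scalar function, while `last_five_rows` is a plain Python helper that imitates a table function, because the standard `sqlite3` module has no way to register one. The holiday set and the sample rows are invented.

```python
import sqlite3

HOLIDAYS = {"2010-12-25"}  # hypothetical holiday calendar


def bank_holiday(departure_time):
    """Scalar function: one value in, one value out (1 if the date is a holiday)."""
    return 1 if departure_time[:10] in HOLIDAYS else 0


def last_five_rows(rows):
    """Table-function stand-in: a set of rows in, a set of rows out."""
    return rows[-5:]


conn = sqlite3.connect(":memory:")
conn.create_function("BANK_HOLIDAY", 1, bank_holiday)
conn.execute(
    "CREATE TABLE departures (flight TEXT, departure_time TEXT, duration INTEGER)"
)
conn.executemany(
    "INSERT INTO departures VALUES (?, ?, ?)",
    [
        ("F1", "2010-12-24 09:00", 60),
        ("F2", "2010-12-24 12:00", 70),
        ("F3", "2010-12-25 08:00", 80),
        ("F4", "2010-12-25 18:30", 90),
        ("F5", "2010-12-26 07:15", 100),
        ("F6", "2010-12-26 21:45", 110),
    ],
)

# The scalar user-defined function is called once per row inside the WHERE clause.
holiday_flights = [
    row[0]
    for row in conn.execute(
        "SELECT flight FROM departures "
        "WHERE BANK_HOLIDAY(departure_time) = 1 ORDER BY flight"
    )
]

# The table-function stand-in consumes a whole result set and returns rows.
ordered = conn.execute(
    "SELECT duration FROM departures ORDER BY departure_time"
).fetchall()
recent = last_five_rows(ordered)
average_duration = sum(row[0] for row in recent) / len(recent)
```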
  • 15. MapReduce
    MapReduce is a programming model introduced by Google
    Aimed at processing requests on large data sets where the processing can be distributed over a high number of nodes using parallel capabilities
    Two steps: Map and Reduce
      Map is like Select
      Reduce is like Group-by
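The Select and Group-by analogy can be sketched in a few lines of single-process Python. This illustrates the programming model only, not any particular MapReduce framework; the function names and sample data are assumptions. The map step emits key/value pairs per record, a shuffle groups the values by key, and the reduce step aggregates each group.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record; each call may emit several (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all emitted values by key, as a framework would do between the two steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key, like an aggregate in a GROUP BY."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_mapreduce(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# Hypothetical sample data: (flight, destination).
DEPARTURES = [("BA117", "London"), ("AF22", "Paris"), ("KL1001", "London")]


def destination_mapper(record):
    """Like SELECT DESTINATION, 1: pick the columns that matter."""
    _flight, destination = record
    yield destination, 1


def count_reducer(_key, values):
    """Like COUNT(*) ... GROUP BY DESTINATION."""
    return sum(values)


flights_per_destination = run_mapreduce(DEPARTURES, destination_mapper, count_reducer)
```

In a real cluster the map calls and the per-key reduce calls would run on different nodes; only the shuffle requires data to move between them.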
  • 16. Aster Data’s SQL-MapReduce (1)
    SQL-MR is a set of built-in and user-defined external table functions
    Example:
      SELECT *
      FROM GET_NEXT_FLIGHT_1HR
           (ON DEPARTURES PARTITION BY DESTINATION)
      WHERE DESTINATION = 'London'
      ORDER BY DEPARTURE_TIME
    All the SQL-MR function processing is parallelized
    Including complex group-by operations and time-series analytics
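The deck does not show the body of GET_NEXT_FLIGHT_1HR, so the following Python sketch is only a guess at the kind of logic such a partition function might hold. It does not use the Aster Data API. The row layout (flight, destination, day, minute of day), the helper `run_partitioned`, and the sample data are all assumptions. The function sees the rows of one destination at a time and keeps each flight that is followed, on the same day, by a later flight within 60 minutes.

```python
from bisect import bisect_right
from collections import defaultdict


def get_next_flight_1hr(partition_rows):
    """Process the rows of ONE partition (one destination).

    partition_rows: list of (flight, destination, day, minute_of_day) tuples.
    """
    times_by_day = defaultdict(list)
    for _flight, _destination, day, minute in partition_rows:
        times_by_day[day].append(minute)
    for times in times_by_day.values():
        times.sort()

    result = []
    for row in sorted(partition_rows, key=lambda r: (r[2], r[3])):
        times = times_by_day[row[2]]
        # Index of the first departure strictly later than this one on the same day.
        nxt = bisect_right(times, row[3])
        if nxt < len(times) and times[nxt] - row[3] <= 60:
            result.append(row)
    return result


def run_partitioned(rows, key_index, function):
    """Imitate PARTITION BY: split rows on one column and call the function per partition."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key_index]].append(row)
    output = []
    for key in sorted(partitions):
        output.extend(function(partitions[key]))
    return output


# Hypothetical sample data.
DEPARTURES = [
    ("BA1", "London", "2010-05-01", 600),
    ("BA2", "London", "2010-05-01", 645),
    ("BA3", "London", "2010-05-01", 800),
    ("BA4", "London", "2010-05-02", 610),
    ("AF1", "Paris", "2010-05-01", 600),
    ("AF2", "Paris", "2010-05-01", 630),
]

all_matches = run_partitioned(DEPARTURES, 1, get_next_flight_1hr)
london_matches = [row[0] for row in all_matches if row[1] == "London"]
```

Each partition is independent of the others, so the per-destination calls could run on different workers, while the calling SQL statement stays as short as the example above.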
  • 17. Aster Data’s SQL-MapReduce (2)
    An SQL-MR function can contain the most complex analytical logic
    SQL programmers don’t need to learn a new language; Java, C++, Python, and many more can be used to write the functions
    The SQL statements invoking SQL-MR functions are still declarative and storage-independent
      The functions themselves are not
    Usable by any BI tools supporting SQL
  • 18. Supported Built-in Functions
  • 19. SQL-MR
    Technical Advantages
      Parallelization of complex operations
      Simplification of queries
      Efficiency of low-level programming language
      Efficient data access
      Predictable query performance
      Linear scalability
      Built-in functions
      Polymorphism of the functions
      Nesting of the functions
    Technical Disadvantages
      Small group of developers have to learn a new language (possibly)
      Low-level language is not declarative
      Non-portable functions
  • 20. Market Basket Analysis using SQL-MR
      SELECT PROD_DESC1, PROD_DESC2, PROD_DESC3, COUNT(*) AS CNT
      FROM BASKET_GENERATOR(
             ON (SELECT SF.STORE_ID, SF.REG_ID, SF.TRAN_NO, SF.ITEM_ID,
                        SF.DT, PD.PROD_DESC, PD.PRICE
                 FROM SALES_FACT SF
                 INNER JOIN PRODUCT_DIM PD
                 ON SF.ITEM_ID = PD.ITEM_ID) AS TRANSACTIONS
             PARTITION BY STORE_ID, REG_ID, TRAN_NO, DT
             BASKET_ITEM('PROD_DESC')
             BASKET_SIZE('3')
           )
      GROUP BY PROD_DESC1, PROD_DESC2, PROD_DESC3
      HAVING COUNT(*) > 1000
      ORDER BY COUNT(*) DESC;
  • 21. Business Advantages of SQL-MR
      Simplification of architecture
      Deep analytics
      Complex analytics
      Operational analytics
      Self-service analytics
      No forbidden queries
  • 22. Simplification of Architecture
    [Diagram labels: SQL-MR, production database, data warehouse, ODS, datamart, cube, analytics]
  • 23. Conclusions
      The analytical and reporting demands are increasing
      Most environments already have problems with performance
      The marriage of SQL and MapReduce offers an enormous potential
      Parallelizing the processing of analytical logic
  • 24. Business Advantages of SQL-MR
      Simplification of architecture
      Deep analytics
      Complex analytics
      Operational analytics
      Self-service analytics
      No forbidden queries
  • 25. Questions & Answers
    Rick van der Lans, R20 Consultancy
    e-mail: rick@r20.nl, website: http://www.r20.nl
    Stephanie McReynolds, Director of Product Marketing, Aster Data
    e-mail: smcreyno@asterdata.com
    For more information on Aster Data: http://www.asterdata.com
