Mastering MapReduce: MapReduce for Big Data Management and Analysis



Whether or not you’ve heard of Google’s MapReduce, its impact on Big Data applications, data warehousing, ETL,
business intelligence, and data mining is reshaping the market for business analytics and data processing.
Attend this session to hear Curt Monash cover the basics of the MapReduce framework, how it is used, and what implementations like SQL-MapReduce enable.

In this session you will learn:

* The basics of MapReduce, key use cases, and what SQL-MapReduce adds
* Which industries and applications are heavily using MapReduce
* Recommendations for integrating MapReduce into your own BI and data warehousing environment

Published in: Technology


  1. Mastering MapReduce Series, Session I: MapReduce for Big Data Management and Analysis
     Curt Monash, Monash Research
     Steve Wooledge, Aster Data
     Peter Pawlowski, Aster Data
     Eric Friedman, Aster Data
     October 15th, 2009
  2. Topics
     * Aster Data overview
     * SQL-MapReduce
     * Example SQL-MapReduce applications
     * SQL-MapReduce syntax/example
     * Q&A
  3. Aster Data: Creating the Next-Generation Data Management System
     * Founded in 2005 to revolutionize data processing and management of very large data volumes
     * The founding team worked on the “big data” problem at Stanford University and was joined by big data experts from Google, Oracle, and Microsoft
     * Aster’s first commercial product, nCluster, has been on the market since 2007. Customers include MySpace, LinkedIn, Coremetrics, Akamai, and others.
     * Since 2008, Aster has built on Google’s well-known MapReduce framework to transform data processing, creating the patent-pending SQL-MapReduce (In-Database MapReduce)
  4. Example Data-Driven Applications: Large Data Volumes and Analytics-Intensive
     * Merchandising and packaging optimization
     * Service personalization (e.g., telco)
     * Graph analysis
     * Consumer segmentation
     * Consumer buying patterns and consumer behavior
     * Click-stream analysis
     * Compliance and regulatory reporting
     * Predictive and granular forecasting
     * Trend analysis and modeling
     * Credit and risk management
     * Fraud detection
     * Cross-platform ad and event attribution
     * Cross-platform media affinity analysis
     Real Results
  17. Improving Computation Push-Down
      Cycle time = seconds to minutes
      Diagram labels: BI reports server; data mining workload; common SQL queries (aggregation, subsets, and samples); MPP database
      Confidential and proprietary. Copyright © 2009 Aster Data Systems
  18. Aster’s Solution: A Massively Parallel Data Warehouse With the Unique Ability to Embed Applications
      Deeper, faster analytics on big data
      Architecture diagram of the Aster nCluster system, top to bottom:
      * Key classes of applications: leading BI tools, custom Java applications, custom .NET applications, packaged analytic apps, other applications (C, C++, Perl, Python…)
      * Unified interface: SQL and SQL-MapReduce (Aster’s SQL-MapReduce or standard interfaces)
      * Dynamic Workload Manager (WLM): high-volume, fast querying; industry-leading WLM with 300+ concurrent workloads
      * Embedded parallelized apps that execute within the database: .NET apps, Java apps, packaged apps, other apps
      * MPP data warehouse with incremental scaling (scale by function)
      * Massively parallel data store
      * Commodity hardware
  19. Aster SQL-MapReduce (SQL-MR)
      Bring your applications to the data
      * “Data-Applications” development platform
      * Rich portfolio of supported languages: Java, .NET, Python, Ruby, Perl, C++, R, and more
      * Use SQL to develop rich data apps
      * Expressive flexibility
      * Reusability across applications and reports
  20. Full Tilt Poker: Fraud Detection
      The second-largest online poker site in the world
      Objective: Improve fraud analytics and stop revenue leakage
      Before: Separate Java-based fraud detection applications ran once a week
      * Large volumes of data stored on SQL Server had to be decompressed and moved before they could be analyzed for fraud
      * A Java-based program ran the data mining on the extracted data
      * The algorithm had to be oversimplified due to performance limitations
      * Fraud was detected too late or not at all
      After: Store and analyze all data in one location, the Aster database with SQL-MapReduce
      * Reduced overall cycle time from once per week to 15 minutes
      * The enriched fraud algorithm now catches previously undetected fraud
      * Query performance improved by 60x (90 minutes down to 90 seconds)
  26. Aster’s Patent-Pending SQL-MapReduce
      Enables faster, easier, and more powerful analytics
      SQL-MapReduce framework (for developers to create and extend)
      * Flexible: MapReduce expressiveness, languages, polymorphism
      * Performance: Massive parallelization, computational push-down
      * Availability: Fault isolation, resource management
      Powerful SQL-MR functions (for analysts to consume)
      * Deep insights: Unlimited analytical power at your disposal
      * Ease of use: Simply plug in to the SQL you know and love
      The Power of Aster’s SQL-MapReduce Framework
      1. Write: Write a SQL-MR function in Java, C, etc.
      2. Install: Install it inside Aster nCluster
      3. Use and reuse: Invoke the SQL-MR function from SQL
  27. Options for Utilizing the Power of MapReduce
      File-only MapReduce
      * Pros: scalable; deep insights; low hardware cost
      * Cons: limited standards; limited SLAs; expensive maintenance
      Traditional database
      * Pros: standards (SQL); data integrity; mixed workloads
      * Cons: limited scaling; limited analytics; expensive hardware and maintenance
      SQL-MapReduce: best of both worlds!
  36. MapReduce Applications
      Behavioral analytics (CRM)
      * Sequential pattern analysis (e.g., up-sell/cross-sell)
      * Spam/bot analysis
      * Sessionization analysis
      Risk and fraud analysis
      * Consumer credit scoring/default risk, market risk/VaR, operational risk, etc.
      * Fraud detection
      Graph analysis
      * Social network “connectedness” (e.g., SSSP, APSP, etc.)
      Text analysis
      * Tokenization (e.g., word count classification)
      * Natural language processing
      Statistical analysis (machine learning)
      * Linear regression
      * K-means clustering
      * R Project algorithms
  37. Aster’s SQL-MapReduce Library
      Pre-packaged (SDK), SQL-MR APIs, and documentation
      Pre-packaged SQL-MR sample functions:
      * nPath: complex sequential analysis for time-series and behavioral pattern analysis
      * SSSP: single-source shortest path graph algorithm, useful for fraud and segmentation analysis
      * Sessionize: session categorization based on a sequence of clicks within a specified timeout
      * Approximate percentiles: ultra-fast percentile (or N-tile) statistical distribution analysis
      * Linear regression: statistical technique used to predict values based on a set of related variables
      * Tokenize: text analysis that splits strings into words, categorizes them, and does a word count
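The Tokenize entry above describes the classic MapReduce word count. As a rough illustration of the map, shuffle, and reduce phases, here is a minimal single-process Python sketch. It is not Aster's Tokenize function; the names `map_tokens`, `shuffle`, `reduce_counts`, and `word_count` are hypothetical, and the tokenization rule (lowercase alphanumeric runs) is an assumption.

```python
import re
from collections import defaultdict


def map_tokens(text):
    """Map step: emit one (word, 1) pair per token (assumed tokenization rule)."""
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce step: collapse the values for one key into a total."""
    return word, sum(counts)


def word_count(docs):
    """Run map, shuffle, and reduce over a dict of {doc_id: text}."""
    pairs = (pair for text in docs.values() for pair in map_tokens(text))
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())
```

In a real MapReduce system the map and reduce calls would run in parallel across many nodes; here they run in one process purely to show the data flow.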
  38. MySpace Weblogs: Sessionization
      Objective: Analyze data to quickly identify user “sessions”
      Before: Used regular SQL
      * ~1,000 lines of ANSI SQL code
      * Required dozens of SQL queries every N minutes (dozens of times per day)
      * Sub-optimal performance (multiple passes)
      After: Used the Sessionize SQL-MR function
      * Sessionize is a MapReduce function (written in Java)
      * Significantly simpler code: <100 lines vs. 1,000 lines
      * Single pass over the data for optimal performance
      Source: Avinash Kaushik, Occam’s Razor, Nov ‘08
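To make the idea of timeout-based sessionization concrete, here is a small Python sketch of the general technique: partition clicks by user, order them by time, and start a new session whenever the gap between consecutive clicks exceeds a timeout. This is not the Sessionize SQL-MR function itself; the function name, the tuple layout, and the 1,800-second default timeout are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def sessionize(clicks, timeout_seconds=1800):
    """Assign a per-user session number to each click.

    clicks: iterable of (user_id, timestamp_seconds) tuples (assumed layout).
    Returns a list of (user_id, timestamp_seconds, session_id) tuples.
    """
    labeled = []
    # Sort by user, then by time, so each user's clicks are contiguous and ordered.
    ordered = sorted(clicks, key=itemgetter(0, 1))
    for user_id, rows in groupby(ordered, key=itemgetter(0)):
        session_id = 0
        previous = None
        for _, timestamp in rows:
            # A gap longer than the timeout starts a new session.
            if previous is not None and timestamp - previous > timeout_seconds:
                session_id += 1
            labeled.append((user_id, timestamp, session_id))
            previous = timestamp
    return labeled
```

After the sort, each user's clicks are labeled in a single pass, which mirrors the "single pass over the data" point on the slide.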
  43. ShareThis: Sharing Behavior Analytics
      Objective: Analyze user behavior in a multi-terabyte system run in the cloud
      Before: Long query times for Amazon EC2’s largest customer
      * The traditional database approach required multiple complex iterations (parsing, temp tables, tedious sorts) that were time-intensive
      * Data mining and statistical analysis ran on a multi-TB system
      * Time-intensive to develop
      * Cycle time of many hours
      After: nPath and SQL-MR solution
      * SQL-MR reduces query times when analyzing user sharing behavior
      * Single pass over large-scale data
      * 100 lines of code down to 12
      * Significant SQL optimization: minimal SQL code, greater performance via parallel execution
      * Cycle time reduction: significant resource savings in both time and utilization
  51. SQL-MapReduce Syntax: nPath Example
  52. What is Aster nPath?
      nPath is a SQL-MR function included with nCluster.
      nPath enables analysis of ordered data:
      * Clickstream data
      * Financial transaction data
      * User interaction data
      * Anything of a time-series nature
      It leverages the power of the SQL-MR framework to transcend SQL’s limitations with respect to ordered data.
  53. Example: Analyzing a Clickstream
      Business question: How many distinct users:
      1. Start at the home page.
      2. Click on an auction.
      3. View the seller’s profile.
      4. Bid on the item.
      Available data: A database table clicks, populated with web log data, that has columns user_id, timestamp, and page_type.
  54. The nPath query
      SELECT
        count(distinct user_id)
      FROM nPath(
        ON clicks
        PARTITION BY user_id
        ORDER BY timestamp
        MODE(OVERLAPPING)
        PATTERN('H.A.P.B')
        SYMBOLS(
          page_type = 'home' AS H,
          page_type = 'auction' AS A,
          page_type = 'profile' AS P,
          page_type = 'bid' AS B)
        RESULT(first(user_id of H) as user_id)
      );
      (1) Partition: Form groups by user_id.
      (2) Order: Sort each group by timestamp.
      (3a) Match: Define a set of symbols.
      (3b) Match: Define the subsequences of interest via regex.
      (4) Compute aggregates over matched subsequences.
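The four steps of the nPath query (partition, order, match, aggregate) can be mimicked in plain Python to show what the query computes. This is only a sketch of the logic, not how nCluster executes nPath. It assumes that `.` in the pattern means "immediately followed by" and that a click matching none of the symbols breaks the sequence; the function name and tuple layout are hypothetical.

```python
import re
from collections import defaultdict

# Step 3a: one symbol per page type, mirroring the SYMBOLS clause.
SYMBOLS = {"home": "H", "auction": "A", "profile": "P", "bid": "B"}
# Step 3b: the pattern H.A.P.B read as four consecutive symbols (assumption).
PATTERN = re.compile("HAPB")


def users_matching_path(clicks):
    """Return the set of user_ids whose ordered clicks contain home, auction, profile, bid.

    clicks: iterable of (user_id, timestamp, page_type) tuples (assumed layout).
    """
    # Step 1: partition by user_id.
    by_user = defaultdict(list)
    for user_id, timestamp, page_type in clicks:
        by_user[user_id].append((timestamp, page_type))

    matched = set()
    for user_id, rows in by_user.items():
        # Step 2: order each partition by timestamp.
        rows.sort(key=lambda row: row[0])
        # Encode each click as a symbol; "?" marks clicks outside the symbol set.
        encoded = "".join(SYMBOLS.get(page_type, "?") for _, page_type in rows)
        # Step 3: match the pattern against the encoded sequence.
        if PATTERN.search(encoded):
            matched.add(user_id)
    # Step 4: the caller aggregates, e.g. len(result) for count(distinct user_id).
    return matched
```

Taking `len(users_matching_path(clicks))` corresponds to the `count(distinct user_id)` in the query.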
  57. Market Basket Analysis Example Question
      Detect customers
      * that purchase the same category of items
      * in three market baskets in a row
      * with total value > $150
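The question above can be answered with one ordered pass per customer. The Python sketch below shows one way to read it, under two assumptions not stated on the slide: each basket has a single category, and "total value" means the combined value of the three consecutive baskets. The function name, parameters, and tuple layout are hypothetical.

```python
from collections import defaultdict


def customers_with_repeat_category(baskets, run_length=3, min_total=150.0):
    """Find customers with `run_length` consecutive same-category baskets worth more than `min_total`.

    baskets: iterable of (customer_id, basket_time, category, value) tuples (assumed layout).
    """
    # Partition baskets by customer.
    by_customer = defaultdict(list)
    for customer_id, basket_time, category, value in baskets:
        by_customer[customer_id].append((basket_time, category, value))

    hits = set()
    for customer_id, rows in by_customer.items():
        # Order each customer's baskets by time.
        rows.sort(key=lambda row: row[0])
        # Slide a window of `run_length` consecutive baskets over the ordered rows.
        for start in range(len(rows) - run_length + 1):
            window = rows[start:start + run_length]
            same_category = len({category for _, category, _ in window}) == 1
            total = sum(value for _, _, value in window)
            if same_category and total > min_total:
                hits.add(customer_id)
                break
    return hits
```

The partition-then-order structure is the same one the nPath query expresses with PARTITION BY and ORDER BY.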
  58. Two Methods, Same Answer
      [Figure comparing multi-pass nested sub-selects with a single-pass SQL-MR nPath query]
  59. Demo: Market Basket Analysis (1M Rows)
  60. Summary: Bringing MapReduce to Big Data Management
      Aster’s MPP data warehouse + SQL-MapReduce
  61. Upcoming Webcast: Mastering MapReduce Part II
      Save the date: December 3rd
      MapReduce resources:
      * Recorded application use cases
      * Code samples and tutorials
      * DBMS2 on MapReduce
      * Aster’s SQL-MapReduce
      * TDWI technical whitepaper
      Contact us
      Thank you!