Mastering MapReduce: MapReduce for Big Data Management and Analysis

1. Mastering MapReduce Series, Session I: MapReduce for Big Data Management and Analysis. Curt Monash, Monash Research; Steve Wooledge, Aster Data; Peter Pawlowski, Aster Data; Eric Friedman, Aster Data. October 15th, 2009

2. Topics: Aster Data overview; SQL-MapReduce; example SQL-MapReduce applications; SQL-MapReduce syntax/example; Q&A

3. Aster Data: Creating the Next-Generation Data Management System. Founded in 2005 to revolutionize data processing & management of very large data volumes. Founding team innovated on the ‘big data’ problem at Stanford University and were joined by big data experts from Google, Oracle, and Microsoft. Aster’s first commercial product, nCluster, has been in market since 2007. Customers include MySpace, LinkedIn, Coremetrics, Akamai, and others. Since 2008, innovated on Google’s well-known MapReduce framework to transform data processing. Created patent-pending SQL-MapReduce (In-Database MapReduce).

5. Service Personalization (e.g. telco)

6. Graph analysis

7. Consumer segmentation

8. Consumer buying patterns and consumer behavior

9. Click-stream analysis

10. Compliance & Regulatory Reporting

11. Predictive and granular forecasting

12. Trend analysis and modeling

13. Credit and Risk management

14. Fraud detection

15. Cross-platform ad and event attribution

17. Improving Computation Push-Down. Cycle time = seconds to minutes. [Diagram: a BI reports server and a data-mining workload send common SQL queries (aggregation, sub-sets & samples) to an MPP database.] Confidential and proprietary. Copyright © 2009 Aster Data Systems

18. Aster’s Solution: A Massively Parallel Data Warehouse With the Unique Ability to Embed Applications. Deeper, faster analytics on big data. Key classes of applications: leading BI tools, custom Java applications, custom .NET applications, packaged analytic apps, and other applications (C, C++, Perl, Python…). Aster nCluster system, from top to bottom of the diagram: a unified interface (SQL and SQL-MapReduce), reached through Aster’s SQL-MapReduce or standard interfaces; high-volume, fast querying with industry-leading WLM (300+ concurrent workloads); a Dynamic Workload Manager (WLM); embedded parallelized apps (.NET, Java, packaged, other) that execute within the DB; an MPP data warehouse with incremental scaling (scale by function); a massively parallel data store; commodity hardware.

19. Aster SQL-MapReduce (SQL-MR): Bring your applications to the data. A “Data-Applications” development platform. Rich portfolio of supported languages: Java, .NET, Python, Ruby, Perl, C++, R, and more. Use SQL to develop rich data apps. Expressive flexibility. Reusability across applications and reports.

21. Java-based program ran the data mining on extracted data

22. Algorithm had to be oversimplified due to performance limitations

24. Enriched fraud algorithm is now catching previously undetected fraud

26. Aster’s Patent-Pending SQL-MapReduce: Enables faster, easier, and more powerful analytics. SQL-MapReduce framework (for developers to create and extend): Flexible (MapReduce expressiveness, languages, polymorphism); Performance (massive parallelization, computational push-down); Availability (fault isolation, resource management). Powerful SQL-MR functions (for analysts to consume): Deep insights (unlimited analytical power at your disposal); Ease of use (simply plug in to the SQL you know and love). The Power of Aster’s SQL-MapReduce Framework: (1) Write: write a SQL-MR function in Java, C, etc. (2) Install: install inside Aster nCluster. (3) Use and reuse: invoke the SQL-MR function from SQL.

28. Deep insights

30. Limited SLAs

32. Data integrity

34. Limited analytics

35. Expensive HW & maintenance. Best of both worlds! Traditional Database

36. MapReduce Applications. Behavioral analytics (CRM): sequential pattern analysis (e.g., up-sell/cross-sell), spam/bot analysis, sessionization analysis. Risk & fraud analysis: consumer credit scoring/default risk, market risk/VaR, operational risk, etc.; fraud detection. Graph analysis: social network “connectedness” (e.g., SSSP, APSP, etc.). Text analysis: tokenization (e.g., word count classification), natural language processing. Statistical analysis (machine learning): linear regression, k-means clustering, R Project algorithms.

37. Aster’s SQL-MapReduce Library: Pre-packaged (SDK), SQL-MR APIs, and documentation. Pre-packaged SQL-MR sample functions:
nPath: complex sequential analysis for time-series and behavioral pattern analysis.
SSSP: single-source shortest path, a graph algorithm useful for fraud and segmentation analysis.
Sessionize: session categorization based on a sequence of clicks within a specified timeout.
Approximate percentiles: ultra-fast percentile (or N-tile) statistical distribution analysis.
Linear regression: statistical technique used to predict values based on a set of related variables.
Tokenize: text analysis that splits strings into words, categorizes them, and does a word count.
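To make the Sessionize description concrete, here is a minimal Python sketch of timeout-based sessionization. It is an illustration only, not Aster's implementation: the function name `sessionize`, the `(user_id, timestamp)` tuple layout, and the 30-minute default timeout are all assumptions introduced for this example.

```python
from collections import defaultdict

def sessionize(clicks, timeout_seconds=1800):
    """Label each click with a per-user session number.

    clicks: iterable of (user_id, timestamp_in_seconds) tuples (hypothetical layout).
    A new session starts when the gap since the user's previous click
    exceeds timeout_seconds (an assumed default of 30 minutes).
    """
    # Group timestamps by user, mirroring a PARTITION BY user_id step.
    by_user = defaultdict(list)
    for user_id, ts in clicks:
        by_user[user_id].append(ts)

    labeled = []
    for user_id in sorted(by_user):
        session_id = 0
        previous = None
        # Walk each user's clicks in time order.
        for ts in sorted(by_user[user_id]):
            if previous is not None and ts - previous > timeout_seconds:
                session_id += 1  # gap too long: start a new session
            labeled.append((user_id, ts, session_id))
            previous = ts
    return labeled
```

Because each user's clicks are processed independently, work of this shape can be split across partitions, which is the property the slides attribute to SQL-MR functions.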

39. Requires dozens of SQL queries every N minutes (dozens of times per day)

41. Significantly simpler code: <100 lines vs. 1000 lines

44. Running data mining and statistical analysis on multi-TB system

45. Time intensive to develop

47. Single pass over large-scale data

48. 100 lines of code down to 12

49. Significant SQL optimization: Minimal SQL code, greater performance via parallel execution

51. SQL-MapReduce Syntax: nPath Example

52. What is Aster nPath? nPath is a SQL-MR function included with nCluster. nPath enables analysis of ordered data: clickstream data, financial transaction data, user interaction data, and anything of a time-series nature. It leverages the power of the SQL-MR framework to transcend SQL’s limitations with respect to ordered data.

53. Example: Analyzing a Clickstream. Business question: how many distinct users (1) start at the home page, (2) click on an auction, (3) view the seller’s profile, and (4) bid on the item? Available data: a database table clicks, populated with web log data, that has columns user_id, timestamp, and page_type.

54. The nPath query
    SELECT count(distinct user_id)
    FROM nPath(
        ON clicks
        PARTITION BY user_id
        ORDER BY timestamp
        MODE(OVERLAPPING)
        PATTERN('H.A.P.B')
        SYMBOLS(
            page_type = 'home' AS H,
            page_type = 'auction' AS A,
            page_type = 'profile' AS P,
            page_type = 'bid' AS B)
        RESULT(first(user_id of H) as user_id)
    );
(1) Partition: Form groups by user_id.
(2) Order: Sort each group by timestamp.

55. The nPath query
    SELECT count(distinct user_id)
    FROM nPath(
        ON clicks
        PARTITION BY user_id
        ORDER BY timestamp
        MODE(OVERLAPPING)
        PATTERN('H.A.P.B')
        SYMBOLS(
            page_type = 'home' AS H,
            page_type = 'auction' AS A,
            page_type = 'profile' AS P,
            page_type = 'bid' AS B)
        RESULT(first(user_id of H) as user_id)
    );
(3a) Match: Define a set of symbols.
(3b) Match: Define the subsequences of interest via regex.

56. The nPath query
    SELECT count(distinct user_id)
    FROM nPath(
        ON clicks
        PARTITION BY user_id
        ORDER BY timestamp
        MODE(OVERLAPPING)
        PATTERN('H.A.P.B')
        SYMBOLS(
            page_type = 'home' AS H,
            page_type = 'auction' AS A,
            page_type = 'profile' AS P,
            page_type = 'bid' AS B)
        RESULT(first(user_id of H) as user_id)
    );
(4) Compute aggregates over matched subsequences.
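The four annotated steps (partition, order, match, aggregate) can be mimicked in plain Python to show what the query computes. This is a sketch for intuition, not how nPath is implemented. It assumes the pattern 'H.A.P.B' means four immediately consecutive rows in a user's ordered clicks; the function name `count_matching_users` and the `(user_id, timestamp, page_type)` tuple layout are hypothetical.

```python
from collections import defaultdict

# Page types in the order required by the pattern H.A.P.B.
PATTERN = ["home", "auction", "profile", "bid"]

def count_matching_users(clicks, pattern=PATTERN):
    """Count distinct users whose ordered clicks contain the pattern.

    clicks: iterable of (user_id, timestamp, page_type) tuples (assumed layout).
    """
    # (1) Partition: form groups by user_id.
    by_user = defaultdict(list)
    for user_id, timestamp, page_type in clicks:
        by_user[user_id].append((timestamp, page_type))

    matched_users = set()
    width = len(pattern)
    for user_id, rows in by_user.items():
        # (2) Order: sort each group by timestamp.
        rows.sort(key=lambda row: row[0])
        pages = [page_type for _, page_type in rows]
        # (3) Match: slide a window over the ordered pages, looking for
        # the pattern as consecutive rows (an assumption about '.').
        for start in range(len(pages) - width + 1):
            if pages[start:start + width] == pattern:
                matched_users.add(user_id)
                break
    # (4) Aggregate: count(distinct user_id) over the matches.
    return len(matched_users)
```

For example, a user whose ordered clicks are home, auction, bid does not match, because the profile view is missing between the auction and the bid.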

57. Market Basket Analysis Example. Question: detect customers that purchase the same category of items, in three market baskets in a row, with total value > $150.
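One way to read this question is as a single ordered pass over each customer's baskets. The Python sketch below illustrates that reading under stated assumptions: each basket is one `(customer_id, timestamp, category, amount)` row, and "total value" means the combined amount of the three consecutive baskets. The function name `repeat_category_customers` and the data layout are hypothetical and are not taken from the demo.

```python
from collections import defaultdict

def repeat_category_customers(baskets, run_length=3, min_total=150.0):
    """Return customers with run_length consecutive same-category baskets
    whose combined amount exceeds min_total.

    baskets: iterable of (customer_id, timestamp, category, amount) tuples
    (an assumed layout, one category per basket).
    """
    # Partition baskets by customer.
    by_customer = defaultdict(list)
    for customer_id, timestamp, category, amount in baskets:
        by_customer[customer_id].append((timestamp, category, amount))

    result = []
    for customer_id in sorted(by_customer):
        # Order each customer's baskets by time.
        rows = sorted(by_customer[customer_id], key=lambda row: row[0])
        # Slide a window of run_length consecutive baskets.
        for start in range(len(rows) - run_length + 1):
            window = rows[start:start + run_length]
            same_category = len({category for _, category, _ in window}) == 1
            total = sum(amount for _, _, amount in window)
            if same_category and total > min_total:
                result.append(customer_id)
                break
    return result
```

Each customer is examined once, in order, which is the contrast the deck draws between a single-pass approach and multi-pass nested sub-selects.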

58. Two Methods – Same Answer: multi-pass nested sub-selects vs. single-pass SQL-MR nPath query. [Figure: numeric output shown for the two queries; the values are not recoverable from the transcript.]

59. Demo – Market Basket Analysis (1M Rows)

60. Summary: Bringing MapReduce to Big Data Management. Aster’s MPP data warehouse + SQL-MapReduce

61. Upcoming Webcast: Mastering MapReduce Part II. Save the date: December 3rd.
MapReduce resources (recorded application use-cases, code samples and tutorials): http://www.asterdata.com/mapreduce/index.php
DBMS2 on MapReduce: http://www.dbms2.com/category/parallelization/mapreduce/
Aster’s SQL-MapReduce:
http://www.asterdata.com/product/mapreduce.php
http://www.asterdata.com/blog/index.php/category/mapreduce/
TDWI technical whitepaper
Contact us: hello@asterdata.com, Steve.wooledge@asterdata.com
Thank You!
