Mastering MapReduce: MapReduce for Big Data Management and Analysis

Whether you’ve heard of Google’s MapReduce or not, its impact on Big Data applications, data warehousing, ETL,
business intelligence, and data mining is re-shaping the market for business analytics and data processing.
Attend this session to hear from Curt Monash on the basics of the MapReduce framework, how it is used, and what implementations like SQL-MapReduce enable.

In this session you will learn:

* The basics of MapReduce, key use cases, and what SQL-MapReduce adds
* Which industries and applications are heavily using MapReduce
* Recommendations for integrating MapReduce into your own BI and data warehousing environment

Presentation Transcript

    • Mastering MapReduce Series, Session I: MapReduce for Big Data Management and Analysis
      Curt Monash, Monash Research
      Steve Wooledge, Aster Data
      Peter Pawlowski, Aster Data
      Eric Friedman, Aster Data
      October 15th, 2009
    • Topics
      Aster Data Overview
      SQL-MapReduce
      Example SQL-MapReduce applications
      SQL-MapReduce Syntax/example
      Q&A
    • Aster Data
      Creating the Next-Generation Data Management System
      Founded in 2005 to revolutionize data processing & management of very large data volumes
      Founding team innovated on the ‘big data’ problem at Stanford University and were joined by big data experts from Google, Oracle, and Microsoft
      Aster’s first commercial product, nCluster, has been on the market since 2007. Customers include MySpace, LinkedIn, Coremetrics, Akamai, and others.
      Since 2008, innovated on Google’s well-known MapReduce framework to transform data processing. Created patent-pending SQL-MapReduce (In-Database MapReduce)
    • Example Data-Driven Applications
      Large Data Volumes and Analytics-Intensive
      • Merchandising and Packaging Optimization
      • Service Personalization (e.g. telco)
      • Graph analysis
      • Consumer segmentation
      • Consumer buying patterns and consumer behavior
      • Click-stream analysis
      • Compliance & Regulatory Reporting
      • Predictive and granular forecasting
      • Trend analysis and modeling
      • Credit and Risk management
      • Fraud detection
      • Cross-platform ad and event attribution
      • Cross-platform media affinity analysis
    • Real Results
    • Improving Computation Push-Down
      Cycle Time = Seconds to Minutes
      BI Reports Server
      Data Mining Workload
      Common SQL Queries: aggregation, sub-sets & samples
      MPP Database
      Confidential and proprietary. Copyright © 2009 Aster Data Systems
    • Aster’s Solution - A Massively Parallel Data Warehouse With the Unique Ability to Embed Applications
      Deeper, Faster Analytics on Big Data
      [Architecture diagram: Aster nCluster System, components as labeled on the slide]
      Key Classes of Applications: Leading BI Tools, Custom Java Applications, Custom .NET Applications, Packaged Analytic Apps, Other Applications (C, C++, Perl, Python…)
      Aster’s SQL-MapReduce or Standard Interfaces
      Unified Interface: SQL, SQL-MapReduce
      High Volume, Fast Querying: Industry-leading WLM, 300+ Concurrent Workloads
      Dynamic Workload Manager (WLM)
      Embedded Parallelized Apps – execute within the DB (.NET App, Java App, Packaged App, Other Apps)
      MPP Data Warehouse with Incremental Scaling (scale by function)
      Massively-Parallel Data Store
      Commodity Hardware
    • Aster SQL-MapReduce (SQL-MR)
      Bring your applications to the data
      “Data-Applications” Development Platform
      Rich portfolio of supported languages – Java, .NET, Python, Ruby, Perl, C++, R and More
      Use SQL to develop rich data apps
      Expressive flexibility
      Reusability across applications and reports
    • Full Tilt Poker: Fraud Detection
      The second largest online poker site in the world
      Objective:
      Improve fraud analytics and stop revenue leakage
      Before: Separate Java-based fraud detection applications ran once a week
      • Large volumes of data stored on SQL Server had to be decompressed and moved to analyze for fraud
      • Java-based program ran the data mining on extracted data
      • Algorithm had to be oversimplified due to performance limitations
      • Fraud was detected too late or not at all
      After: Store and analyze all data in one location…the Aster database with SQL-MapReduce
      • Reduced overall cycle time from once per week to 15 minutes
      • Enriched fraud algorithm is now catching previously undetected fraud
      • Query performance improved by 60x (90 mins down to 90 secs)
    • Aster’s Patent-Pending SQL-MapReduce
      Enables faster, easier, and more powerful analytics
      SQL-MapReduce framework (for developers to create and extend)
      Flexible: MapReduce expressiveness, languages, polymorphism
      Performance: Massive parallelization, computational push-down
      Availability: Fault isolation, resource management
      Powerful SQL-MR functions (for analysts to consume)
      Deep insights: Unlimited analytical power at your disposal
      Ease of use: Simply plug in to the SQL you know and love
      The Power of Aster’s SQL-MapReduce Framework
      1. Write: Write a SQL-MR function in Java, C, etc.
      2. Install: Install inside Aster nCluster
      3. Use and Reuse: Invoke SQL-MR function from SQL
    • Options for Utilizing Power of MapReduce
      File-Only MapReduce
      Pros
      • Scalable
      • Deep insights
      • Low HW Cost
      Cons
      • Limited standards
      • Limited SLAs
      • Expensive maintenance
      Traditional Database
      Pros
      • Standards (SQL)
      • Data integrity
      • Mixed workloads
      Cons
      • Limited scaling
      • Limited analytics
      • Expensive HW & maintenance
      SQL-MapReduce
      Best of both worlds!
    • MapReduce Applications
      Behavioral Analytics (CRM)
      • Sequential pattern analysis (e.g., up-sell/cross-sell)
      • Spam/BOT analysis
      • Sessionization analysis
      Risk & Fraud analysis
      • Consumer credit scoring/default risk, market risk/VaR, operational risk, etc.
      • Fraud detection
      Graph analysis
      • Social network “connectedness” (e.g., SSSP, APSP, etc.)
      Text analysis
      • Tokenization (e.g., word count classification)
      • Natural language processing
      Statistical analysis (machine learning)
      • Linear regression
      • K-means clustering
      • R Project algorithms
    • Aster’s SQL-MapReduce Library:
      Pre-packaged (SDK), SQL-MR APIs, and documentation
      Pre-packaged SQL-MR sample functions
      nPath – complex sequential analysis for time-series and behavioral pattern analysis
      SSSP – single source shortest path Graph algorithm useful for fraud and segmentation analysis
      Sessionize – session categorization based on a sequence of clicks within a specified timeout
      Approximate percentiles – ultra-fast percentile (or N-tile) statistical distribution analysis
      Linear regression – statistical technique used to predict values based on a set of related variables.
      Tokenize – text analysis that splits strings into words, categorizes them, and does a word count
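To make the map and reduce steps behind a function like Tokenize concrete, here is a minimal Python sketch of a word count. It is not Aster's implementation: the function names (`map_tokenize`, `shuffle`, `reduce_count`), the tokenizing regex, and the sample documents are all hypothetical, and the "shuffle" that a real framework performs across nodes is simulated in memory.

```python
import re
from collections import defaultdict


def map_tokenize(doc_id, text):
    """Map step: emit a (word, 1) pair for every token in one document."""
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_count(word, counts):
    """Reduce step: sum the partial counts for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce over a dict of {doc_id: text}."""
    pairs = (pair
             for doc_id, text in documents.items()
             for pair in map_tokenize(doc_id, text))
    return dict(reduce_count(w, c) for w, c in shuffle(pairs).items())


# Hypothetical sample input.
docs = {1: "Bid on the item", 2: "View the item"}
counts = word_count(docs)
```

Because each map call sees only one document and each reduce call sees only one word, both steps can in principle run in parallel across partitions, which is the property the slides rely on.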
    • MySpace Weblogs: Sessionization
      Objective:
      Analyze data to quickly identify user “sessions”
      Before: Used Regular SQL
      • ~1000 lines ANSI SQL code
      • Requires dozens of SQL queries every N minutes (dozens of times per day)
      • Sub-optimal performance (multiple passes)
      After: Used Sessionize SQL-MR Function
      • Sessionize is a MapReduce function (written in Java)
      • Significantly simpler code: <100 lines vs. 1000 lines
      • Single pass over data for optimal performance
      Source: Avinash Kaushik, Occam’s Razor, Nov ‘08
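The single-pass idea behind sessionization can be sketched in a few lines of Python. This is an illustration only, not the Java Sessionize function the slide refers to: the function name, the tuple layout, and the 1800-second default timeout are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def sessionize(clicks, timeout_seconds=1800):
    """Assign a session number to each click with one pass per user.

    clicks: iterable of (user_id, timestamp_seconds) tuples.
    Returns a list of (user_id, timestamp_seconds, session_id) tuples.
    """
    # Partition by user and order by time, as the database would before
    # handing rows to the function.
    ordered = sorted(clicks, key=itemgetter(0, 1))
    result = []
    for user_id, rows in groupby(ordered, key=itemgetter(0)):
        session_id = 0
        previous = None
        for _, ts in rows:
            # A gap longer than the timeout starts a new session.
            if previous is not None and ts - previous > timeout_seconds:
                session_id += 1
            result.append((user_id, ts, session_id))
            previous = ts
    return result


# Hypothetical sample input.
clicks = [("u1", 0), ("u1", 600), ("u1", 4000), ("u2", 100)]
sessions = sessionize(clicks)
```

Each row is compared only with the previous row for the same user, so no self-joins or repeated passes are needed once the data is partitioned and ordered.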
    • ShareThis: Sharing Behavior Analytics
      Objective:
      Analyze user behavior in multi-terabyte system run in the cloud
      Before: Long query times for Amazon EC2’s largest customer
      • Traditional database approach required multiple complex iterations (parsing, temp tables, tedious sorts) that were time intensive
      • Running data mining and statistical analysis on multi-TB system
      • Time intensive to develop
      • Cycle time of many hours
      After: nPath and SQL/MR solution
      • SQL-MR reduces query times for analyzing user sharing behavior
      • Single pass over large-scale data
      • 100 lines of code down to 12
      • Significant SQL optimization: Minimal SQL code, greater performance via parallel execution
      • Cycle time reduction: Significant resource savings in both time and utilization
    • SQL-MapReduce Syntax: nPath Example
    • What is Aster nPath?
      nPath is a SQL-MR function included with nCluster.
      nPath enables analysis of ordered data:
      • Clickstream data
      • Financial transaction data
      • User interaction data
      • Anything of a time series nature
      Leverages the power of the SQL-MR framework to transcend SQL’s limitations with respect to ordered data
    • Example: Analyzing a Clickstream
      Business question
      How many distinct users:
      Start at the home page.
      Click on an auction.
      View the seller’s profile.
      Bid on the item.
      Available Data
      A database table clicks, populated with web log data, that has columns user_id, timestamp, and page_type.
    • The nPath query
      SELECT
        count(distinct user_id)
      FROM nPath(
        ON clicks
        PARTITION BY user_id
        ORDER BY timestamp
        MODE(OVERLAPPING)
        PATTERN('H.A.P.B')
        SYMBOLS(
          page_type = 'home' AS H,
          page_type = 'auction' AS A,
          page_type = 'profile' AS P,
          page_type = 'bid' AS B)
        RESULT(first(user_id of H) as user_id)
      );
      (1) Partition: Form groups by user_id.
      (2) Order: Sort each group by timestamp.
    • The nPath query
      SELECT
        count(distinct user_id)
      FROM nPath(
        ON clicks
        PARTITION BY user_id
        ORDER BY timestamp
        MODE(OVERLAPPING)
        PATTERN('H.A.P.B')
        SYMBOLS(
          page_type = 'home' AS H,
          page_type = 'auction' AS A,
          page_type = 'profile' AS P,
          page_type = 'bid' AS B)
        RESULT(first(user_id of H) as user_id)
      );
      (3a) Match: Define a set of symbols.
      (3b) Match: Define the subsequences of interest via regex.
    • The nPath query
      SELECT
        count(distinct user_id)
      FROM nPath(
        ON clicks
        PARTITION BY user_id
        ORDER BY timestamp
        MODE(OVERLAPPING)
        PATTERN('H.A.P.B')
        SYMBOLS(
          page_type = 'home' AS H,
          page_type = 'auction' AS A,
          page_type = 'profile' AS P,
          page_type = 'bid' AS B)
        RESULT(first(user_id of H) as user_id)
      );
      (4) Compute Aggregates over matched subsequences.
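The four annotated steps (partition, order, match, aggregate) can be emulated outside the database with a short Python sketch. This is not how nPath is implemented; it only mirrors the logic of the query on the slides. The symbol table and sample rows are hypothetical, and two assumptions are made: nPath's `.` operator is read as "immediately followed by" (so the Python regex is the plain string `HAPB`), and rows that match no symbol are encoded as `-`, which breaks a sequence.

```python
import re
from collections import defaultdict

# (3a) Symbols: map each page_type to a one-letter symbol.
SYMBOLS = {"home": "H", "auction": "A", "profile": "P", "bid": "B"}

# (3b) Pattern: nPath's 'H.A.P.B' is assumed to mean H, then A, then P, then B
# on consecutive rows; in a Python regex that is plain concatenation.
PATTERN = re.compile("HAPB")


def count_matching_users(clicks):
    """Count distinct users whose ordered clicks contain the pattern.

    clicks: iterable of (user_id, timestamp, page_type) rows.
    """
    by_user = defaultdict(list)
    for user_id, timestamp, page_type in clicks:       # (1) partition by user_id
        by_user[user_id].append((timestamp, page_type))

    matched = set()
    for user_id, rows in by_user.items():
        rows.sort(key=lambda row: row[0])              # (2) order by timestamp
        encoded = "".join(SYMBOLS.get(page, "-") for _, page in rows)
        if PATTERN.search(encoded):                    # (3) match
            matched.add(user_id)
    return len(matched)                                # (4) count(distinct user_id)


# Hypothetical sample rows.
clicks = [
    ("u1", 1, "home"), ("u1", 2, "auction"), ("u1", 3, "profile"), ("u1", 4, "bid"),
    ("u2", 1, "home"), ("u2", 2, "auction"), ("u2", 3, "bid"),
    ("u3", 1, "search"), ("u3", 2, "home"), ("u3", 3, "auction"),
    ("u3", 4, "profile"), ("u3", 5, "bid"),
]
result = count_matching_users(clicks)
```

Encoding each row as a single character turns the ordered-data question into ordinary string matching, which is the same intuition behind the PATTERN and SYMBOLS clauses.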
    • Market Basket Analysis Example Question
      Detect customers
      - that purchase the same category of items
      - in three market baskets in a row
      - with total value > $150
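One way to read this question is as a sliding window over each customer's time-ordered baskets. The Python sketch below follows that reading. It is not the nPath query shown in the demo, and it rests on stated assumptions: each basket is one row with a single category, the $150 threshold applies to the combined value of the three consecutive baskets, and the function name and row layout are hypothetical.

```python
from collections import defaultdict


def flag_customers(baskets, run_length=3, min_total=150.0):
    """Return customers with `run_length` consecutive same-category baskets
    whose combined value exceeds `min_total`.

    baskets: iterable of (customer_id, basket_time, category, value) rows,
    one row per basket (a simplifying assumption).
    """
    by_customer = defaultdict(list)
    for customer_id, basket_time, category, value in baskets:
        by_customer[customer_id].append((basket_time, category, value))

    flagged = set()
    for customer_id, rows in by_customer.items():
        rows.sort(key=lambda row: row[0])  # order baskets by time
        for i in range(len(rows) - run_length + 1):
            window = rows[i:i + run_length]
            same_category = len({category for _, category, _ in window}) == 1
            total = sum(value for _, _, value in window)
            if same_category and total > min_total:
                flagged.add(customer_id)
                break
    return flagged


# Hypothetical sample rows.
baskets = [
    ("c1", 1, "shoes", 60.0), ("c1", 2, "shoes", 50.0), ("c1", 3, "shoes", 45.0),
    ("c2", 1, "shoes", 90.0), ("c2", 2, "shoes", 90.0), ("c2", 3, "books", 90.0),
    ("c3", 1, "shoes", 40.0), ("c3", 2, "shoes", 40.0), ("c3", 3, "shoes", 40.0),
]
flagged = flag_customers(baskets)
```

After partitioning and ordering, each customer's rows are scanned once, which is the contrast the slides draw with multi-pass nested sub-selects.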
    • Two Methods – Same Answer
      [Side-by-side query result screenshots: Multi-pass Nested Sub-selects vs. Single Pass SQL-MR nPath Query]
    • Demo – Market Basket Analysis (1M Rows)
    • Summary: Bringing MapReduce to Big Data Management
      Aster’s MPP data warehouse + SQL-MapReduce
    • Upcoming Webcast: Mastering MapReduce Part II
      Save the date: December 3rd
      MapReduce Resources - http://www.asterdata.com/mapreduce/index.php
      Recorded application use-cases
      Code samples and tutorials
      DBMS2 on MapReduce: http://www.dbms2.com/category/parallelization/mapreduce/
      Aster’s SQL-MapReduce
      http://www.asterdata.com/product/mapreduce.php
      http://www.asterdata.com/blog/index.php/category/mapreduce/
      TDWI Technical whitepaper
      Contact us
      hello@asterdata.com
      Steve.wooledge@asterdata.com
      Thank You!