Mastering MapReduce: MapReduce for Big Data Management and Analysis


Whether you’ve heard of Google’s MapReduce or not, its impact on Big Data applications, data warehousing, ETL,
business intelligence, and data mining is re-shaping the market for business analytics and data processing.
Attend this session to hear from Curt Monash on the basics of the MapReduce framework, how it is used, and what implementations like SQL-MapReduce enable.

In this session you will learn:

* The basics of MapReduce, key use cases, and what SQL-MapReduce adds
* Which industries and applications are heavily using MapReduce
* Recommendations for integrating MapReduce into your own BI and data warehousing environment

Presentation Transcript

  • Mastering MapReduce Series, Session I: MapReduce for Big Data Management and Analysis
    Curt Monash, Monash Research
    Steve Wooledge, Aster Data
    Peter Pawlowski, Aster Data
    Eric Friedman, Aster Data
    October 15th, 2009
  • Topics
    Aster Data Overview
    SQL-MapReduce
    Example SQL-MapReduce applications
    SQL-MapReduce Syntax/example
    Q&A
  • Aster Data
    Creating the Next-Generation Data Management System
    Founded in 2005 to revolutionize data processing & management of very large data volumes
    Founding team innovated on the ‘big data’ problem at Stanford University and were joined by big data experts from Google, Oracle, and Microsoft
    Aster’s first commercial product, nCluster, has been in market since 2007. Customers include MySpace, LinkedIn, Coremetrics, Akamai, others.
    Since 2008, innovated on Google’s well-known MapReduce framework to transform data processing. Created patent-pending SQL-MapReduce (In-Database MapReduce)
  • Example Data-Driven Applications
    Large Data Volumes and Analytics-Intensive
    • Merchandising and Packaging Optimization
    • Service Personalization (e.g. telco)
    • Graph analysis
    • Consumer segmentation
    • Consumer buying patterns and consumer behavior
    • Click-stream analysis
    • Compliance & Regulatory Reporting
    • Predictive and granular forecasting
    • Trend analysis and modeling
    • Credit and Risk management
    • Fraud detection
    • Cross-platform ad and event attribution
    • Cross-platform media affinity analysis
  • Real Results
  • Improving Computation Push-Down
    Cycle Time = Seconds to Minutes
    [Diagram labels: BI Reports Server; Data Mining Workload; Common SQL Queries: aggregation, sub-sets & samples; MPP Database]
    Confidential and proprietary. Copyright © 2009 Aster Data Systems
  • Aster’s Solution - A Massively Parallel Data Warehouse With the Unique Ability to Embed Applications
    Deeper, Faster Analytics on Big Data
    [Diagram: Aster nCluster System, numbered layers listed from top to bottom]
    6. Key classes of applications: leading BI tools, custom Java applications, custom .NET applications, packaged analytic apps, other applications (C, C++, Perl, Python…), using Aster’s SQL-MapReduce or standard interfaces
    5. Unified interface: SQL and SQL-MapReduce
    4. Dynamic Workload Manager (WLM): high volume, fast querying; industry-leading WLM with 300+ concurrent workloads
    3. Embedded parallelized apps, executing within the DB: .NET app, Java app, packaged app, other apps
    2. MPP data warehouse with incremental scaling (scale by function)
    1. Massively parallel data store; commodity hardware
  • Aster SQL-MapReduce (SQL-MR)
    Bring your applications to the data
    “Data-Applications” Development Platform
    Rich portfolio of supported languages – Java, .NET, Python, Ruby, Perl, C++, R and More
    Use SQL to develop rich data apps
    Expressive flexibility
    Reusability across applications and reports
  • Full Tilt Poker: Fraud Detection
    The second largest online poker site in the world
    Objective:
    Improve fraud analytics and stop revenue leakage
    Before: Separate Java-based fraud detection applications ran once a week
    • Large volumes of data stored on SQL Server had to be decompressed and moved to analyze for fraud
    • Java-based program ran the data mining on extracted data
    • Algorithm had to be oversimplified due to performance limitations
    • Fraud was detected too late or not at all
    After: Store and analyze all data in one location…the Aster database with SQL-MapReduce
    • Reduced overall cycle time from once per week to 15 minutes
    • Enriched fraud algorithm is now catching previously undetected fraud
    • Query performance improved by 60x (90 mins down to 90 secs)
  • Aster’s Patent-Pending SQL-MapReduce
    Enables faster, easier, and more powerful analytics
    SQL-MapReduce framework (for developers to create and extend)
    Flexible: MapReduce expressiveness, languages, polymorphism
    Performance: Massive parallelization, computational push-down
    Availability: Fault isolation, resource management
    Powerful SQL-MR functions (for analysts to consume)
    Deep insights: Unlimited analytical power at your disposal
    Ease of use: Simply plug in to the SQL you know and love
    The Power of Aster’s SQL-MapReduce Framework
    1. Write: Write a SQL-MR function in Java, C, etc.
    2. Install: Install inside Aster nCluster
    3. Use and Reuse: Invoke SQL-MR function from SQL
  • Options for Utilizing Power of MapReduce
    File-Only MapReduce
    Pros
    • Scalable
    • Deep insights
    • Low HW Cost
    Cons
    • Limited standards
    • Limited SLAs
    • Expensive maintenance
    Traditional Database
    Pros
    • Standards (SQL)
    • Data integrity
    • Mixed workloads
    Cons
    • Limited scaling
    • Limited analytics
    • Expensive HW & maintenance
    SQL-MapReduce
    Best of both worlds!
  • MapReduce Applications
    Behavioral Analytics (CRM)
    • Sequential pattern analysis (e.g., up-sell/cross-sell)
    • Spam/BOT analysis
    • Sessionization analysis
    Risk & Fraud analysis
    • Consumer credit scoring/default risk, market risk/VaR, operational risk, etc.
    • Fraud detection
    Graph analysis
    • Social network “connectedness” (e.g., SSSP, APSP, etc.)
    Text analysis
    • Tokenization (e.g., word count classification)
    • Natural language processing
    Statistical analysis (machine learning)
    • Linear regression
    • K-means clustering
    • R Project algorithms
  • Aster’s SQL-MapReduce Library:
    Pre-packaged (SDK), SQL-MR APIs, and documentation
    Pre-packaged SQL-MR sample functions
    nPath – complex sequential analysis for time-series and behavioral pattern analysis
    SSSP – single source shortest path graph algorithm useful for fraud and segmentation analysis
    Sessionize – session categorization based on a sequence of clicks within a specified timeout
    Approximate percentiles – ultra-fast percentile (or N-tile) statistical distribution analysis
    Linear regression – statistical technique used to predict values based on a set of related variables
    Tokenize – text analysis that splits strings into words, categorizes them, and does a word count
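
To make the Tokenize entry concrete, here is a minimal Python sketch of a word count written as explicit map, shuffle, and reduce steps. It is an illustration of the general MapReduce pattern only, not Aster's Tokenize function or the SQL-MR API; the function names (`map_tokenize`, `shuffle`, `reduce_count`, `word_count`) and the sample documents are hypothetical.

```python
import re
from collections import defaultdict

def map_tokenize(doc_id, text):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        yield word, 1

def shuffle(pairs):
    """Group emitted values by key, as a MapReduce runtime does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_count(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)

def word_count(documents):
    """Run map, shuffle, and reduce in-process over a dict of {doc_id: text}."""
    pairs = (pair
             for doc_id, text in documents.items()
             for pair in map_tokenize(doc_id, text))
    return dict(reduce_count(word, counts)
                for word, counts in shuffle(pairs).items())

# Hypothetical sample input.
docs = {1: "big data needs big tools", 2: "Data tools"}
counts = word_count(docs)
```

In a real cluster the map calls run in parallel on the nodes that hold the data and the shuffle moves pairs between nodes; here everything runs in one process to keep the structure visible.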
  • MySpace Weblogs: Sessionization
    Objective:
    Analyze data to quickly identify user “sessions”
    Before: Used Regular SQL
    • ~1000 lines ANSI SQL code
    • Requires dozens of SQL queries every N minutes (dozens of times per day)
    • Sub-optimal performance (multiple passes)
    After: Used Sessionize SQL-MR Function
    • Sessionize is a MapReduce function (written in Java)
    • Significantly simpler code: <100 lines vs. 1000 lines
    • Single pass over data for optimal performance
    Source: Avinash Kaushik, Occam’s Razor, Nov ‘08
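
The single-pass idea behind sessionization can be sketched in a few lines of Python. This is not the Sessionize SQL-MR function itself (which the slide says is written in Java); the `sessionize` name, the `(user_id, timestamp)` tuple layout, and the 30-minute default timeout are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

def sessionize(clicks, timeout=1800):
    """Assign a per-user session number to each click.

    clicks: iterable of (user_id, timestamp_in_seconds) tuples (assumed layout).
    A new session starts when the gap since the same user's previous click
    exceeds `timeout` seconds.
    """
    # Partition by user and order by time, then walk the rows once.
    ordered = sorted(clicks, key=itemgetter(0, 1))
    result = []
    for user_id, user_clicks in groupby(ordered, key=itemgetter(0)):
        session_id, previous = 0, None
        for _, timestamp in user_clicks:
            if previous is not None and timestamp - previous > timeout:
                session_id += 1
            result.append((user_id, timestamp, session_id))
            previous = timestamp
    return result

# Hypothetical sample data: u1 has a gap of 3400 seconds before the third click.
clicks = [("u1", 0), ("u1", 600), ("u1", 4000), ("u2", 100)]
sessions = sessionize(clicks)
```

Once the rows are grouped by user and sorted by time, each click is visited exactly once, which is the property the slide contrasts with the multi-pass SQL approach.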
  • ShareThis: Sharing Behavior Analytics
    Objective:
    Analyze user behavior in multi-terabyte system run in the cloud
    Before: Long query times for Amazon EC2’s largest customer
    • Traditional database approach required multiple complex iterations (parsing, temp tables, tedious sorts) that were time intensive
    • Running data mining and statistical analysis on multi-TB system
    • Time intensive to develop
    • Cycle time of many hours
    After: nPath and SQL-MR solution
    • SQL-MR reduces query times for analyzing user sharing behavior
    • Single pass over large-scale data
    • 100 lines of code down to 12
    • Significant SQL optimization: Minimal SQL code, greater performance via parallel execution
    • Cycle time reduction: Significant resource savings in both time and utilization
  • SQL-MapReduce Syntax: nPath Example
  • What is Aster nPath?
    nPath is a SQL-MR function included with nCluster.
    nPath enables analysis of ordered data:
    • Clickstream data
    • Financial transaction data
    • User interaction data
    • Anything of a time series nature
    Leverages the power of the SQL-MR framework to transcend SQL’s limitations with respect to ordered data
  • Example: Analyzing a Clickstream
    Business question
    How many distinct users:
    Start at the home page.
    Click on an auction.
    View the seller’s profile.
    Bid on the item.
    Available Data
    A database table clicks, populated with web log data, that has columns user_id, timestamp, and page_type.
  • The nPath query
    SELECT
      count(distinct user_id)
    FROM nPath(
      ON clicks
      PARTITION BY user_id    -- (1) Partition: form groups by user_id.
      ORDER BY timestamp      -- (2) Order: sort each group by timestamp.
      MODE(OVERLAPPING)
      PATTERN('H.A.P.B')
      SYMBOLS(
        page_type = 'home' AS H,
        page_type = 'auction' AS A,
        page_type = 'profile' AS P,
        page_type = 'bid' AS B)
      RESULT(first(user_id of H) as user_id)
    );
  • The nPath query (continued)
    (3a) Match: Define a set of symbols (the SYMBOLS clause).
    (3b) Match: Define the subsequences of interest via regex (the PATTERN clause).
    (4) Compute aggregates over matched subsequences (the RESULT clause).
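
The four steps of the query (partition, order, match, aggregate) can be mimicked in plain Python to show what the clickstream query computes. This is a rough sketch, not how nCluster executes nPath. It assumes that `.` in the pattern means "immediately followed by" and that a row matching no symbol breaks a match; the function name, the `x` placeholder symbol, and the sample rows are hypothetical.

```python
import re
from collections import defaultdict

# (3a) Symbols: map each page_type of interest to a one-letter symbol.
SYMBOLS = {"home": "H", "auction": "A", "profile": "P", "bid": "B"}

def count_matching_users(clicks, pattern="HAPB"):
    """Count distinct users whose ordered clicks contain the symbol sequence.

    clicks: iterable of (user_id, timestamp, page_type) tuples, mirroring the
    columns of the clicks table described above.
    """
    by_user = defaultdict(list)
    for user_id, timestamp, page_type in clicks:       # (1) partition by user_id
        by_user[user_id].append((timestamp, page_type))

    matched = set()
    for user_id, rows in by_user.items():
        rows.sort()                                     # (2) order by timestamp
        # Rows that match no symbol become "x" (an assumption of this sketch).
        path = "".join(SYMBOLS.get(page, "x") for _, page in rows)
        if re.search(pattern, path):                    # (3b) match the sequence
            matched.add(user_id)
    return len(matched)                                 # (4) aggregate

# Hypothetical sample data: users 1 and 3 follow home, auction, profile, bid.
clicks = [
    (1, 10, "home"), (1, 20, "auction"), (1, 30, "profile"), (1, 40, "bid"),
    (2, 10, "home"), (2, 20, "auction"), (2, 30, "bid"),
    (3, 40, "bid"), (3, 5, "search"), (3, 10, "home"),
    (3, 20, "auction"), (3, 30, "profile"),
]
distinct_users = count_matching_users(clicks)
```

Encoding each user's ordered clicks as a string lets an ordinary regular expression do the sequence matching, which is the same idea the PATTERN and SYMBOLS clauses express declaratively.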
  • Market Basket Analysis Example Question
    Detect customers
    - that purchase the same category of items
    - in three market baskets in a row
    - with total value > $150
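
One way to read this question is sketched below in Python. The slide does not define the data layout, so the sketch assumes each basket row is `(customer_id, timestamp, category, value)` with a single category per basket, and that "total value" means the combined value of the three consecutive baskets. The function name and sample rows are hypothetical, and this is not the nPath query used in the demo.

```python
from collections import defaultdict

def find_customers(baskets, run_length=3, min_total=150.0):
    """Return customers with `run_length` consecutive same-category baskets
    whose combined value exceeds `min_total`."""
    by_customer = defaultdict(list)
    for customer_id, timestamp, category, value in baskets:
        by_customer[customer_id].append((timestamp, category, value))

    hits = set()
    for customer_id, rows in by_customer.items():
        rows.sort()  # order each customer's baskets by timestamp
        for start in range(len(rows) - run_length + 1):
            window = rows[start:start + run_length]
            same_category = len({category for _, category, _ in window}) == 1
            total = sum(value for _, _, value in window)
            if same_category and total > min_total:
                hits.add(customer_id)
                break
    return hits

# Hypothetical sample data.
baskets = [
    ("a", 1, "toys", 60.0), ("a", 2, "toys", 50.0), ("a", 3, "toys", 45.0),
    ("b", 1, "toys", 60.0), ("b", 2, "food", 90.0), ("b", 3, "toys", 80.0),
    ("c", 1, "food", 40.0), ("c", 2, "food", 50.0), ("c", 3, "food", 60.0),
]
customers = find_customers(baskets)
```

Customer "a" qualifies (three toy baskets totalling 155), "b" mixes categories, and "c" totals exactly 150, which does not exceed the threshold.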
  • Two Methods – Same Answer
    Multi-pass Nested Sub-selects
    Single Pass SQL-MR nPath Query
    [Figure: query results for the two methods; the numeric output is not recoverable as a table]
  • Demo – Market Basket Analysis (1M Rows)
  • Summary: Bringing MapReduce to Big Data Management
    Aster’s MPP data warehouse + SQL-MapReduce
  • Upcoming Webcast: Mastering MapReduce Part II
    Save the date: December 3rd
    MapReduce Resources - http://www.asterdata.com/mapreduce/index.php
    Recorded application use-cases
    Code samples and tutorials
    DBMS2 on MapReduce: http://www.dbms2.com/category/parallelization/mapreduce/
    Aster’s SQL-MapReduce
    http://www.asterdata.com/product/mapreduce.php
    http://www.asterdata.com/blog/index.php/category/mapreduce/
    TDWI Technical whitepaper
    Contact us
    hello@asterdata.com
    Steve.wooledge@asterdata.com
    Thank You!