Introduction to MapReduce Data Transformations


MapReduce is a framework for scalable parallel data processing popularized by Google. Although initially used for simple large-scale text processing, map/reduce has recently been expanded to serve some application tasks normally performed by traditional relational databases.
You Will Learn

* The basics of Map/Reduce programming in Java
* The application domains where the framework is most appropriate
* How to build analytic database systems that handle large datasets and multiple data sources robustly
* How to evaluate data warehousing vendors in a realistic and unbiased way
* Emerging trends to combine Map/Reduce with standard SQL for improved power and efficiency

Geared To

* Programmers
* Developers
* Database Administrators
* Data warehouse managers
* CIOs
* CTOs


Transcript

  • 1. Introduction to Map/Reduce Data Transformations. Tasso Argyros, CTO and Co-Founder, Aster Data Systems [email_address]
  • 2. A Brief History of MapReduce Confidential and proprietary. Copyright © 2008 Aster Data Systems
  • 3. What is MapReduce?
      • It’s the simplest API you have ever seen
      • It has just two functions
      • 1. Map()
      • and
      • 2. Reduce()
      • Plus: it’s language independent (Java, Perl, Python, …)
  • 4. Why is MapReduce Useful?
      • It simplifies distributed applications…
      • … by abstracting the details of data distribution (where is the data I need?) and process distribution (where should I run this process?)…
      • … behind two simple functions.
      • But let’s see an example
  • 5. [Diagram: Server A, Server B, Server C and Server D, connected by a switch, hold the following documents]
      • The quick brown fox jumps over the lazy dog.
      • To be or not to be: that is the question.
      • The world only needs five computers.
      • Hello world.
      • In-Database MapReduce is the future.
      • MapReduce is a very powerful programming paradigm.
  • 6. Goal: We Want to Count the # of Times Each Word Occurs
  • 7. 1st Approach: No MapReduce
  • 8. [Diagram: Server A, Server B, Server C and Server D, connected by a switch, parse their documents into word lists]
      • Documents: The quick brown fox jumps over the lazy dog / To be or not to be: that is the question. / The world only needs five computers. / Hello world. / In-Database MapReduce is the future. / MapReduce is a very powerful concept.
      • Word lists: the quick brown fox jumps over the lazy dog / in database mapreduce is the future / the world only needs five computers / hello world / mapreduce is a very powerful concept / to be or not to be that is the question
  • 9. Server 4: Final Result File
      • the 5
      • is 3
      • mapreduce 2
      • …
  • 10. What Did We Do?
    • 1. Write a script to parse the documents and output word lists
    • 2. FTP all the word lists to Server 4
    • 3. Write another script to count each word on Server 4
    • Problem: (2) and (3) do not scale!
  • 11. 2nd Approach: No MapReduce, Fully Distributed
  • 12. [Diagram: Server A, Server B, Server C and Server D, connected by a switch, parse their documents into word lists and then regroup the words across servers]
      • Documents: The quick brown fox jumps over the lazy dog / To be or not to be: that is the question. / The world only needs five computers. / Hello world. / In-Database MapReduce is the future. / MapReduce is a very powerful concept.
      • Word lists: the quick brown fox jumps over the lazy dog / in database mapreduce is the future / the world only needs five computers / hello world / mapreduce is a very powerful concept / to be or not to be that is the question
      • Regrouped words: the the the the the database database future world world powerful lazy brown mapreduce mapreduce be be to jumps computers hello is is is question over a that
  • 13. Final Result File on each server
      • Server 1: the 5, …
      • Server 2: world 2, …
      • Server 3: mapreduce 2, …
      • Server 4: is 3, …
  • 14. 2nd Approach: No MapReduce, Distributed
  • 15. Does it work? Yes. Is it a pain? Yes!! Does it take lots of time? Yes! Would you do it? No!!!
  • 16. Moreover…
      • Who will manage your files?
      • What if nodes fail?
      • What if you want to add more nodes?
      • What if…
      • What if…
      • What if…
  • 17. [Diagram: Map() → Data Redistribution and Grouping → Reduce()]
      • Map()
        • Input: Any file (e.g. documents)
        • Output: Stream of <key, value> pairs (e.g. <word, count> pairs)
      • Reduce()
        • Input: All <key, value> pairs with the same key grouped (e.g. all <word, count> pairs where word = “the”)
        • Output: Anything (e.g. sum of counts for a specific word)
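The two functions described on this slide can be sketched in a few lines of Python. This is only a single-process illustration, not the deck's or any framework's actual API: the names `map_words`, `reduce_counts` and `run_word_count` are made up for this example, and the tokenization rule (lowercase, letters only) is an assumption chosen to match the word lists shown earlier.

```python
import re
from collections import defaultdict

def map_words(document):
    """Map(): emit a <word, 1> pair for every word in one document."""
    for word in re.findall(r"[a-z]+", document.lower()):
        yield word, 1

def reduce_counts(word, counts):
    """Reduce(): receive every count emitted for one word and sum them."""
    return word, sum(counts)

def run_word_count(documents):
    # Redistribution and grouping: collect all values that share a key.
    # A real framework does this across servers; here it is one dict.
    grouped = defaultdict(list)
    for document in documents:
        for key, value in map_words(document):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())

# The example documents from the slides.
DOCUMENTS = [
    "The quick brown fox jumps over the lazy dog",
    "To be or not to be: that is the question.",
    "The world only needs five computers.",
    "Hello world.",
    "In-Database MapReduce is the future.",
    "MapReduce is a very powerful concept.",
]
WORD_COUNTS = run_word_count(DOCUMENTS)
```

Only `map_words` and `reduce_counts` contain word-count logic; everything in `run_word_count` stands in for what the framework would do on the user's behalf.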
  • 18. Map() and Redistribution Phase [Diagram: Map() runs on Server A, Server B, Server C and Server D; the emitted pairs pass through the switch]
      • The quick brown fox jumps over the lazy dog: <the, 1> <quick, 1> <brown,1> <fox,1> <jumps,1> <over,1> <the,1> <lazy,1> <dog,1>
      • In-Database MapReduce is the future.: <in, 1> <database, 1> <mapreduce,1> <is,1> <the,1> <future,1>
      • Redistributed pairs: <the, 1> <the, 1> <the, 1> <the, 1> <the, 1> <database,1> <database,1> <future,1> <world,1> <world,1> <powerful,1> <lazy,1> <brown,1> <mapreduce,1> <mapreduce,1> <be,1> <be,1> <to,1> <jumps,1> <computers,1> <hello,1> <is,1> <is,1> <is,1> <question,1> <over,1> <a,1> <that,1>
  • 19. Grouping and Reduce() Phase (on Server 1)
      • Input pairs: <the, 1> <the, 1> <the, 1> <the, 1> <the, 1> <database,1> <database,1> <future,1>
      • Reduce() output, Server 1 Final Result File: the 5, database 2, future 1
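The slides do not say how the framework decides which server receives each key. A common choice, used here purely as an assumption, is a stable hash of the key modulo the number of servers, so every copy of a word lands on the same server. The function names below are hypothetical and the whole thing runs in one process.

```python
import zlib
from collections import Counter

def server_for(key, num_servers):
    """Pick a server index for a key using a stable hash (assumed policy)."""
    return zlib.crc32(key.encode("utf-8")) % num_servers

def redistribute(pairs, num_servers):
    """Send each <key, value> pair to the server chosen for its key."""
    servers = [[] for _ in range(num_servers)]
    for key, value in pairs:
        servers[server_for(key, num_servers)].append((key, value))
    return servers

def reduce_on_server(pairs):
    """Group the pairs held by one server by key and sum their values."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# A few of the <word, 1> pairs emitted by Map() in the example.
MAPPED = ([("the", 1)] * 5 + [("is", 1)] * 3
          + [("mapreduce", 1)] * 2 + [("world", 1)] * 2)

# One result file per server, as in the four-server example.
RESULT_FILES = [reduce_on_server(held) for held in redistribute(MAPPED, 4)]
```

Because the server is derived from the key alone, no word can appear in two result files, which is what lets each Reduce() work independently.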
  • 20. What Just Happened?
      • By writing two small scripts with a few lines of code…
      • … we achieved exactly the same result!
      • Plus, our code did not have to care about:
        • the # of servers on the system (4 or 400?)
        • which server to send each word to
        • any network communication aspects
        • any fault tolerance aspects
  • 21. Word Count was Only an Example!
    • Google does all web indexing on MapReduce
    “The indexing code is simpler, smaller, and easier to understand, because the code that deals with fault tolerance, distribution and parallelization is hidden within the MapReduce library. For example, the size of one phase of the computation dropped from approximately 3,800 lines of C++ code to approximately 700 lines when expressed using MapReduce.” (Google 2004 MapReduce paper)
  • 22. Word Count was Only an Example!
    • Published work from Stanford University showed that even extremely complex Data Mining algorithms can fit in this very simple model
    “We adapt Google’s MapReduce paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN).” (Stanford 2006 AI Lab paper)
  • 23. Result?
    • MapReduce makes writing parallel programs extremely easy…
      • … and can accommodate anything from trivial to very complex algorithms…
      • … thus enabling the processing of petabytes of data with a few lines of code!
  • 24. But…
      • Today MapReduce is used only by hardcore coders/programmers/hackers
      • Changes in MapReduce queries require changes in the MapReduce code itself
        • Constantly keep coding
      • Using MapReduce with database data is hard and cumbersome…
      • … when most of the structured data in the enterprise are stored in databases!
  • 25. Beyond SQL and MapReduce
  • 26. SQL vs MapReduce: Two different worlds?
    • SQL
      • Declarative
        • Specifies what needs to happen
      • Execution plans optimized dynamically
      • Input/output is structured
      • Data redistribution inferred from SQL statement (in MPP Databases)
    • MapReduce
      • Procedural
        • Specifies how it needs to happen
      • Code compiled once; MapReduce plans are static
      • Input/output is unstructured
      • Data redistribution based on <keys> in Reduce() phase
  • 27. Implementing MR in the Database
      • Uses polymorphic SQL operators to embed MapReduce functions into SQL
      • Introduces a “PARTITION BY” clause to specify data redistribution
      • Introduces a “SEQUENCE BY” clause to specify ordering of data flows to the MR functions
      • Best of both worlds
        • Planning is still dynamic
        • MapReduce functions can be used like custom SQL operators
        • MapReduce functions can implement any algorithm or transformation
        • Code Once – Use Many (through SQL) model
  • 28. The SQL/MR Process
  • 29. SQL/MR Function: Syntax
    • SELECT …
      FROM MR_Function (
          ON source_data
          [ PARTITION BY column ]
          [ ORDER BY column ]
          [ Function Arguments ]
      )
      WHERE …
      GROUP BY …
      HAVING …
      ORDER BY …
      LIMIT …;
    • Callouts on the syntax:
      • (1) Source table or sub-select
      • (2) <key> for data redistribution
      • (3) Sort before the MR function
      • (4) Java/Python/… MR function
      • (5) Select output (e.g. count)
      • Optional conditions & filters
      • Optional MR_Function arguments
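One way to read the syntax above is as a small pipeline: take the source rows, redistribute them by the PARTITION BY key, sort each partition, run the function once per partition, and let the outer SELECT pick and filter the output. The Python sketch below imitates that flow in a single process. It is an interpretation for illustration only: `run_mr_function` and `count_rows` are hypothetical names, not part of Aster's API, and rows are modeled as plain dicts.

```python
from collections import defaultdict

def run_mr_function(rows, partition_by, order_by, mr_function, **arguments):
    """Apply mr_function to each partition of rows, sorted within it."""
    # Redistribute: rows with the same partition key end up together.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_by]].append(row)
    output = []
    for key, partition in partitions.items():
        # Sort each partition before the function sees it.
        partition.sort(key=lambda row: row[order_by])
        # Run the function; extra keyword arguments play the role
        # of the optional function arguments in the syntax.
        output.extend(mr_function(key, partition, **arguments))
    return output

def count_rows(key, partition, label="n"):
    """Hypothetical MR function: emit one output row per partition."""
    yield {"userid": key, label: len(partition)}

# Source table: a few click rows (illustrative data).
CLICKS = [
    {"userid": "Shawn1", "ts": "10:00:24"},
    {"userid": "PrezBush", "ts": "00:58:24"},
    {"userid": "Shawn1", "ts": "10:00:00"},
]
ROWS = run_mr_function(CLICKS, "userid", "ts", count_rows, label="clicks")

# Outer query: choose columns, filter, and order the function's output.
SELECTED = sorted((row["userid"], row["clicks"])
                  for row in ROWS if row["clicks"] > 0)
```

The function itself never mentions partitioning or sorting; those are requested by the caller, which mirrors how the SQL clauses sit outside the MR function.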
  • 30. Example 1: Tokenization
    • Demo #1: Only Map (Tokenization) in SQL/MR
    • SELECT word, count(*) AS wordcount
      FROM Tokenize ( ON blogs )
      GROUP BY word
      ORDER BY wordcount DESC
      LIMIT 20;
    • Demo #2: Map (Tokenization) and Reduce (WordCount) in SQL/MR
    • SELECT key AS word, value AS wordcount
      FROM WordCountReduce (
          ON Tokenize ( ON blogs )
          PARTITION BY key
      )
      ORDER BY wordcount DESC
      LIMIT 20;
    • Demo #3: Why do Reduce when you have SQL?
    • SELECT word, count(*) AS wordcount
      FROM Tokenize ( ON blogs )
      GROUP BY word
      ORDER BY wordcount DESC
      LIMIT 20;
  • 31. Example 2: Sessionization
    • What Is Sessionize?
      • An example Aster SQL/MR function
      • Leverages Aster’s Java library API
    • What Does It Do?
      • The user specifies a column (e.g. timestamp) and a session timeout value (in seconds)
      • Outputs unique session identifiers (sessionid column)
    • Usage
    • CREATE TABLE sessionized_clicks AS
      SELECT ts, userid, sessionid, ...
      FROM Sessionize (
          ON clicks
          PARTITION BY userid
          ORDER BY ts
          TIMEOUT 60
      );
  • 32. Example 2: Sessionization Slide (Session Timeout = 60 seconds)
      INPUT (Clickstream)
        timestamp   userid
        10:00:00    Shawn1
        00:58:24    PrezBush
        10:00:24    Shawn1
        02:30:33    PrezBush
        10:01:23    Shawn1
        10:02:40    Shawn1
      OUTPUT
        timestamp   userid     sessionid
        10:00:00    Shawn1     0
        10:00:24    Shawn1     0
        10:01:23    Shawn1     0
        10:02:40    Shawn1     1

        timestamp   userid     sessionid
        00:58:24    PrezBush   0
        02:30:33    PrezBush   1
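The behavior shown in this example can be approximated with a short Python sketch: partition the clicks by user, order them by timestamp, and start a new session whenever the gap to the previous click exceeds the timeout. This is not Aster's Sessionize implementation. The function names are hypothetical, and using a strict "greater than" comparison for the gap is an assumption (the example data does not distinguish it from "greater than or equal").

```python
from collections import defaultdict

def to_seconds(hhmmss):
    """Convert an HH:MM:SS string to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def sessionize(clicks, timeout):
    """Return (timestamp, userid, sessionid) rows, one per input click."""
    # PARTITION BY userid: gather each user's clicks together.
    by_user = defaultdict(list)
    for timestamp, userid in clicks:
        by_user[userid].append(timestamp)
    rows = []
    for userid, stamps in by_user.items():
        # ORDER BY ts: process each user's clicks in time order.
        stamps.sort(key=to_seconds)
        session_id, previous = 0, None
        for timestamp in stamps:
            current = to_seconds(timestamp)
            # Assumed rule: a gap longer than the timeout opens a new session.
            if previous is not None and current - previous > timeout:
                session_id += 1
            previous = current
            rows.append((timestamp, userid, session_id))
    return rows

# The clickstream from the slide.
CLICKS = [
    ("10:00:00", "Shawn1"),
    ("00:58:24", "PrezBush"),
    ("10:00:24", "Shawn1"),
    ("02:30:33", "PrezBush"),
    ("10:01:23", "Shawn1"),
    ("10:02:40", "Shawn1"),
]
SESSIONS = sessionize(CLICKS, timeout=60)
```

For Shawn1 the gaps are 24, 59 and 77 seconds, so only the last click opens a new session, which matches the OUTPUT table.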
  • 33. MR Applications in the Database
    • ELT
      • Text and data transformations, in-parallel, in-database
    • Queries that become too complex for SQL
      • E.g. Sessionize(), customer segmentation, predictive analytics, …
    • Queries that SQL inherently cannot handle well
      • Time series analytics
      • Aster has a set of pre-defined SQL/MR functions for this
    • Data structures that do not fit the relational model well
      • Time series (again)
      • Graphs, spatial data
    • Any analytical or reporting application that requires more performance and data proximity!
  • 34. Summary
      • Growing challenges in scaling analytical applications and reporting
      • MapReduce is driving a data revolution (see: Google)
      • In-Database MapReduce will open up databases to a host of new applications
      • [email_address] (Questions, Comments)
      • asterdata.com/blog (Lots of technical details)
      • 1.888.Aster.Data (Any other information)