MapReduce@DirectI

    MapReduce@DirectI - Presentation Transcript

    1. MapReduce@DirectI amkiray: ramki.g@directi.com uvdhray: dhruv.m@directi.com
    2. Let's start with an example… access.log: timestamp, url, response_code, response_time. products.dat: date, product_id, price.
    3. Requirement: Number of requests in the last 30 days. $> ls -rt *.log | tail -30 | xargs wc -l
    4. Requirement: Busiest 30 minutes in the last 30 days. $> ls -rt *.log | tail -30 | xargs ./count_30min.sh
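The slides do not show what count_30min.sh does, so the following is only a rough sketch of one way to find the busiest half hour. It assumes (these are assumptions, not details from the deck) that the first comma-separated field of each access.log line is a Unix timestamp in seconds and that "30 minutes" means fixed, aligned 30-minute buckets rather than a sliding window. The name `busiest_window` is hypothetical.

```python
from collections import Counter

WINDOW_SECONDS = 30 * 60  # assumed: fixed, aligned 30-minute buckets


def busiest_window(lines):
    """Return (bucket_start, request_count) for the busiest bucket, or None.

    Assumes each line looks like 'timestamp,url,response_code,response_time'
    with the timestamp given in Unix epoch seconds.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ts = int(line.split(",", 1)[0])
        counts[ts - ts % WINDOW_SECONDS] += 1
    if not counts:
        return None
    # Highest count wins; ties go to the earliest bucket.
    return max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
```

To use it on real files, pass an iterable of lines, for example an open file handle or several log files chained with `itertools.chain`.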
    5. Requirement: Number of failed buy requests for products worth more than $30 in the last 30 days. Import data to an RDBMS; SELECT COUNT(*) FROM logs, products WHERE GET_REQUEST_TYPE(logs.url)='BUY' AND GET_PRODUCT_ID(logs.url)=products.product_id AND products.price>30 AND DATE(logs.timestamp) = products.date;
    6. Now gimme the number of failed buy requests for products worth more than $30 in the last 1 year. It might take a while!
    7. 2 days later…. On its way! Inserting data into database…
    8. 5 days later… $> mysqladmin processlist +-----------------------------------+ | Query | Copy to Temp Table | +-----------------------------------+ Maybe it's joining!! Or maybe it's dead… Or maybe my replacement will see the result...
    9. god@internal.directi.com: Go use bloody Hadoop!
    10. But why?! Distributed data processing cluster: a distributed file system; a data-location-sensitive task scheduler; the MapReduce paradigm; handles parallelizable and distributable tasks; failover capability; web-based monitoring capabilities. More value for your time!
    11. Where to start?? MapReduce. Q: Sum of all squares of a=[1,2,3,4,3,2,7]. Simple… fold(map(a, square()), sum()). You can do that in any functional programming language…
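As a small illustration of the fold/map idea on this slide, here is how the same computation might look in Python, using the built-in `map` and `functools.reduce` (Python's fold). The function names `square` and `sum_of_squares` are chosen for this sketch only.

```python
from functools import reduce


def square(x):
    return x * x


def sum_of_squares(values):
    # map: square every element; reduce (fold): add the squares together.
    return reduce(lambda acc, v: acc + v, map(square, values), 0)


a = [1, 2, 3, 4, 3, 2, 7]
print(sum_of_squares(a))  # prints 92
```

The map step has no dependencies between elements and the fold uses an associative operation, which is what makes this kind of job easy to split across machines.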
    12. Now do it for an array of 100 million elements…….
    13. This is where Hadoop comes in. Distributed File System: HDFS (Name Node + Data Node). Distributed Computation/Task Processing: Hadoop (Task Tracker + Job Tracker).
    14. Task Tracker
    15. How MapReduce Works…
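The original slide here was a diagram. As a stand-in, the sketch below simulates the three phases of a MapReduce job (map, shuffle, reduce) in a single Python process. It is a toy model for intuition only and not how Hadoop is implemented; `run_mapreduce`, `word_mapper` and `count_reducer` are hypothetical names.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run a toy, single-process MapReduce job and return {key: result}."""
    # Map phase: each record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))

    # Shuffle phase: group all values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: collapse each key's values into one result.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def word_mapper(line):
    return [(word, 1) for word in line.split()]


def count_reducer(key, values):
    return sum(values)
```

For example, `run_mapreduce(["a b a", "b c"], word_mapper, count_reducer)` counts how often each word occurs across both lines. In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network.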
    16. An example: Word Count. hadoop jar contrib/streaming/hadoop-0.*-streaming.jar -jobconf mapred.data.field.separator="," -input 'wc.eg.in' -output 'wc.eg.out' -mapper 'wc -w' -reducer "awk '{ sum+=\$1 } END{ print sum}'" Job Tracker: http://cae5.internal.directi.com:50030/jobtracker.jsp Name Node: http://cae2.internal.directi.com:50070/dfshealth.jsp
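To make the data flow of this streaming job easier to follow, here is a rough Python rendering of what the two commands do: the mapper (`wc -w`) emits one number per input split, and the reducer (the awk script) adds those numbers up. The functions `mapper` and `reducer` are illustrative stand-ins written for this sketch, not part of Hadoop.

```python
def mapper(lines):
    """Like `wc -w`: emit one line holding the word count of this input split."""
    return [str(sum(len(line.split()) for line in lines))]


def reducer(lines):
    """Like the awk script: add up the first field of every input line."""
    total = 0
    for line in lines:
        fields = line.split()
        if fields:
            total += int(fields[0])
    return total


# Two input splits, as if the file had been divided between two map tasks.
splits = [["a b c", "d"], ["e f"]]
mapped = [out for split in splits for out in mapper(split)]
print(reducer(mapped))  # prints 6
```

Note that this job produces a single total word count, not a per-word frequency table.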
    17. Pig and Pig Latin. A procedural language for MapReduce operations.
        logs = LOAD 'access.logs' USING PigStorage(',') AS (ts:int, URL:chararray, resp:chararray, resp_time:int);
        products = LOAD 'products.dat' USING PigStorage(',') AS (date:int, pid:int, price:int);
        l1 = FOREACH logs GENERATE GetDate(ts) as req_date, GetProductID(URL) as prod_id, GetRequestType(URL) as rtype, resp, resp_time;
        j1 = JOIN l1 BY prod_id, products by pid;
        j2 = FILTER j1 BY req_date==date AND price>30.0F;
        j3 = GROUP j2 ALL;
        j4 = FOREACH j3 GENERATE COUNT(j2);
        DUMP j4;
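For readers who do not know Pig Latin, the sketch below walks through the same steps (project, join, filter, count) on in-memory Python lists. The slides do not define GetDate, GetProductID or GetRequestType, so the helpers here are hypothetical: they assume timestamps are Unix epoch seconds, dates are integers of the form YYYYMMDD, and URLs look like `/buy?pid=42`. Like the script on the slide, the sketch computes the request type and response but does not filter on them.

```python
from collections import defaultdict
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse


def get_date(ts):
    # Assumption: ts is epoch seconds; dates are YYYYMMDD integers (UTC).
    return int(datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y%m%d"))


def get_product_id(url):
    # Assumption: the product id is carried in a 'pid' query parameter.
    pid = parse_qs(urlparse(url).query).get("pid", [None])[0]
    return int(pid) if pid is not None else None


def get_request_type(url):
    # Assumption: the URL path names the request type, e.g. '/buy' -> 'BUY'.
    return urlparse(url).path.strip("/").upper()


def count_matching(logs, products):
    """logs: (ts, url, resp, resp_time) tuples; products: (date, pid, price)."""
    # l1: project the fields derived from each log line.
    l1 = [(get_date(ts), get_product_id(url), get_request_type(url), resp, rt)
          for ts, url, resp, rt in logs]
    # j1: join on product id, using an index over products.
    by_pid = defaultdict(list)
    for date, pid, price in products:
        by_pid[pid].append((date, price))
    j1 = [(row, prod) for row in l1 for prod in by_pid.get(row[1], [])]
    # j2: keep pairs where the dates match and the price exceeds 30.
    j2 = [(row, prod) for row, prod in j1 if row[0] == prod[0] and prod[1] > 30]
    # j3/j4: group everything together and count.
    return len(j2)
```

Pig compiles the equivalent operations into MapReduce jobs, so the join and count run across the cluster instead of in one process.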
    18. Q&A

    Directi, 7 months ago


    The initial simple MapReduce cluster setup at Direc more


    © All Rights Reserved
