MapReduce@DirectI

The initial simple MapReduce cluster setup at DirectI. An introduction to MapReduce and Hadoop. A brief intro Pig is also included.


Transcript

  • 1. MapReduce@DirectI — ramki.g@directi.com, dhruv.m@directi.com
  • 2. Let's start with an example… access.log: timestamp, url, response_code, response_time. products.dat: date, product_id, price.
  • 3. Requirement: number of requests in the last 30 days. $> ls -rt *.log | tail -30 | xargs wc -l
  • 4. Requirement: busiest 30 minutes in the last 30 days. $> ls -rt *.log | tail -30 | xargs ./count_30min.sh
  • 5. Requirement: number of failed buy requests for products worth more than $30 in the last 30 days. Import the data into an RDBMS, then: SELECT COUNT(*) FROM logs, products WHERE GET_REQUEST_TYPE(logs.url)='BUY' AND GET_PRODUCT_ID(logs.url)=products.product_id AND products.price>30 AND DATE(logs.timestamp)=products.date;
  • 6. Now gimme the number of failed buy requests for products worth more than $30 in the last 1 year. It might take a while!
  • 7. 2 days later… On its way! Inserting data into database…
  • 8. 5 days later… $> mysqladmin processlist → | Query | Copy to Temp Table | … Maybe it's joining!! Or maybe it's dead… Or maybe my replacement will see the result…
  • 9. god@internal.directi.com: Go use bloody Hadoop!
  • 10. But why?! A distributed data processing cluster: a distributed file system; a data-location-sensitive task scheduler; the MapReduce paradigm; handles parallelizable and distributable tasks; failover capability; web-based monitoring capabilities; more value for your time!
  • 11. Where to start? MapReduce. Q: sum of all squares of a=[1,2,3,4,3,2,7]? Simple… fold(map(a, square()), sum()). You can do that in any functional programming language…
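The map/fold expression on this slide can be sketched in plain Python (a minimal illustration of the idea, not part of the original deck):

```python
from functools import reduce

a = [1, 2, 3, 4, 3, 2, 7]
squares = map(lambda x: x * x, a)                    # map step: square each element
total = reduce(lambda acc, x: acc + x, squares, 0)   # fold step: sum the results
print(total)  # 92
```

On one machine this is a one-liner; the point of the next slides is that the same map and fold shape also distributes across a cluster.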
  • 12. Now do it for an array of 100 million elements…….
  • 13. This is where Hadoop comes in… Distributed file system: HDFS (Name Node + Data Node). Distributed computation/task processing: MapReduce (Job Tracker + Task Tracker).
  • 14. Task Tracker
  • 15. How MapReduce Works…
  • 16. An example: Word Count. hadoop jar contrib/streaming/hadoop-0.*-streaming.jar -jobconf mapred.data.field.separator="," -input 'wc.eg.in' -output 'wc.eg.out' -mapper 'wc -w' -reducer "awk '{ sum+=$1 } END { print sum }'" Job Tracker: http://cae5.internal.directi.com:50030/jobtracker.jsp Name Node: http://cae2.internal.directi.com:50070/dfshealth.jsp
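What the streaming job does can be simulated locally (a toy sketch with made-up input splits, not the Hadoop API): each mapper acts like `wc -w` on its split, and the reducer sums the mapper outputs like the awk one-liner.

```python
def mapper(lines):
    # like `wc -w`: emit the total word count of this input split
    yield sum(len(line.split()) for line in lines)

def reducer(counts):
    # like the awk reducer: sum all the mapper outputs
    return sum(counts)

# two hypothetical input splits, as Hadoop would hand them to two mappers
splits = [["hello world", "foo bar baz"], ["one two"]]
total_words = reducer(c for split in splits for c in mapper(split))
print(total_words)  # 7
```

Hadoop Streaming runs exactly this pattern, except the mapper and reducer are arbitrary executables reading stdin and writing stdout on cluster nodes.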
  • 17. Pig and Pig Latin: a procedural language for MapReduce operations. logs = LOAD 'access.logs' USING PigStorage(',') AS (ts:int, URL:chararray, resp:chararray, resp_time:int); products = LOAD 'products.dat' USING PigStorage(',') AS (date:int, pid:int, price:int); l1 = FOREACH logs GENERATE GetDate(ts) AS req_date, GetProductID(URL) AS prod_id, GetRequestType(URL) AS rtype, resp, resp_time; j1 = JOIN l1 BY prod_id, products BY pid; j2 = FILTER j1 BY req_date==date AND price>30.0F; j3 = GROUP j2 ALL; j4 = FOREACH j3 GENERATE COUNT(j2); DUMP j4;
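The join-filter-count that the Pig script expresses can be sketched in plain Python over tiny in-memory stand-ins for access.logs and products.dat (the data and the get_date/get_product_id helpers are hypothetical stand-ins for the Pig UDFs):

```python
logs = [
    (1000, "/buy?pid=1", "500", 120),   # (ts, URL, resp, resp_time)
    (1000, "/buy?pid=2", "200", 80),
]
products = [
    (10, 1, 40),                        # (date, pid, price)
    (10, 2, 25),
]

def get_date(ts):
    return 10                           # toy UDF: timestamp -> date

def get_product_id(url):
    return int(url.split("pid=")[1])    # toy UDF: URL -> product id

count = sum(
    1
    for ts, url, resp, _ in logs
    for date, pid, price in products
    if get_product_id(url) == pid       # JOIN l1 BY prod_id, products BY pid
    and get_date(ts) == date            # FILTER ... req_date == date
    and price > 30                      # AND price > 30.0F
)
print(count)  # 1
```

Pig compiles the same logical plan into MapReduce jobs, so the join and filter run across the cluster instead of in nested loops.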
  • 18. Q&A