2. Let's start with an example…
access.log
timestamp,url,response_code,response_time
products.dat
date,product_id,price
3. Requirement:
Number of requests in the last 30 days.
$> ls -rt *.log | tail -30 | xargs wc -l
4. Requirement:
Busiest 30 minutes in last 30 days.
$> ls -rt *.log | tail -30 | xargs ./count_30min.sh
5. Requirement:
Number of failed buy requests for products
worth more than $30 in the last 30 days.
Import data to an RDBMS:
SELECT COUNT(*) FROM logs, products
WHERE GET_REQUEST_TYPE(logs.url)='BUY'
AND GET_PRODUCT_ID(logs.url)=products.product_id
AND products.price>30
AND DATE(logs.timestamp)=products.date;
6. Now gimme the number of failed buy requests
for products worth more than $30 in the last 1 year.
It might take a while!
8. 5 days later…
$> mysqladmin processlist
+-------+--------------------+
| Query | Copy to Temp Table |
+-------+--------------------+
Maybe it's joining!!
Or maybe it's dead…
Or maybe my replacement will see the result…
10. But why?!
A distributed data processing cluster gives you:
A distributed file system
A data-location-aware task scheduler
The MapReduce paradigm
Handling of parallelizable and distributable tasks
Failover capability
Web-based monitoring capabilities
More value for your time!
11. Where to start??
MapReduce:
Q: Sum of all squares of a=[1,2,3,4,3,2,7]
Simple….
fold( map(a, square()), sum())
You can do that in any functional programming language….
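The fold/map pattern above can be sketched in a few lines of Python, where `functools.reduce` plays the role of fold:

```python
from functools import reduce

a = [1, 2, 3, 4, 3, 2, 7]

# map: square each element independently (the parallelizable part)
squares = map(lambda x: x * x, a)

# fold: combine the squared values with addition
total = reduce(lambda acc, x: acc + x, squares, 0)

print(total)  # 1 + 4 + 9 + 16 + 9 + 4 + 49 = 92
```

The point is that the map step has no cross-element dependencies, which is exactly what lets a framework split it across machines.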
12. Now do it for an array of 100 million elements…
13. This is where Hadoop comes in..
Distributed File System (HDFS): Name Node + Data Node
Distributed Computation/Task Processing (MapReduce): Job Tracker + Task Tracker
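The split/map/shuffle/reduce flow those daemons coordinate can be sketched as a toy in-memory simulation (this is the shape of the paradigm, not the Hadoop API; the partitioned inputs stand in for HDFS blocks):

```python
from collections import defaultdict
from itertools import chain

def map_phase(partition):
    # mapper: emit a (word, 1) pair for every word in this node's partition
    for line in partition:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # group intermediate values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reducer: sum the counts for each word
    return {key: sum(values) for key, values in groups.items()}

# two "data nodes", each holding a partition of the input
partitions = [["hello world", "hello hadoop"], ["world of hadoop"]]
mapped = chain.from_iterable(map_phase(p) for p in partitions)
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'hello': 2, 'world': 2, 'hadoop': 2, 'of': 1}
```

Each partition is mapped independently, so the map phase scales out; only the shuffle requires moving data between nodes.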
17. An example
Word Count
hadoop jar contrib/streaming/hadoop-0.*-streaming.jar \
  -jobconf mapred.data.field.separator="," \
  -input 'wc.eg.in' \
  -output 'wc.eg.out' \
  -mapper 'wc -w' \
  -reducer "awk '{ sum += \$1 } END { print sum }'"
Job Tracker: http://cae5.internal.directi.com:50030/jobtracker.jsp
Name Node: http://cae2.internal.directi.com:50070/dfshealth.jsp
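What the streaming job does per split can be mimicked in plain Python (the hypothetical in-memory splits stand in for HDFS blocks): each mapper counts the words in its split, like `wc -w`, and the reducer sums those counts, like the awk script:

```python
# two input splits, as the streaming framework would hand them to mappers
splits = ["the quick brown fox", "jumps over the lazy dog"]

# mapper ('wc -w'): emit the word count of each split
mapper_output = [len(split.split()) for split in splits]

# reducer (awk '{ sum += $1 } END { print sum }'): sum the per-split counts
total_words = sum(mapper_output)
print(total_words)  # 4 + 5 = 9
```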
18. Pig and Pig Latin
A procedural data-flow language for expressing MapReduce operations.
logs = LOAD 'access.logs' USING PigStorage(',') AS
(ts:int, URL:chararray, resp:chararray, resp_time:int);
products = LOAD 'products.dat' USING PigStorage(',') AS
(date:int, pid:int, price:int);
l1 = FOREACH logs GENERATE GetDate(ts) as req_date,
GetProductID(URL) as prod_id,
GetRequestType(URL) as rtype, resp, resp_time;
j1 = JOIN l1 BY prod_id, products BY pid;
j2 = FILTER j1 BY req_date==date AND price>30.0F;
j3 = GROUP j2 ALL;
j4 = FOREACH j3 GENERATE COUNT(j2);
DUMP j4;
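The same pipeline can be traced in plain Python over a toy in-memory sample. The stand-in parsers below play the role of the GetDate/GetProductID/GetRequestType UDFs, and the URL format and field values are made up for illustration, not taken from the talk:

```python
# toy stand-ins for the Pig UDFs; the "/<type>/<pid>" URL shape is assumed
def get_date(ts):          return ts // 86400           # day number from epoch seconds
def get_product_id(url):   return int(url.split("/")[-1])
def get_request_type(url): return url.split("/")[1].upper()

# sample rows mirroring the LOAD schemas: (ts, url, resp, resp_time)
logs = [(86400 * 10 + 5, "/buy/42",  "500", 120),
        (86400 * 10 + 9, "/buy/7",   "200",  80),
        (86400 * 11 + 1, "/view/42", "200",  30)]
# (date, pid, price)
products = [(10, 42, 35), (10, 7, 12)]

# FOREACH logs GENERATE: project the derived fields
l1 = [(get_date(ts), get_product_id(url), get_request_type(url), resp)
      for ts, url, resp, _ in logs]

# JOIN l1 BY prod_id, products BY pid
j1 = [(req_date, prod_id, rtype, resp, date, price)
      for req_date, prod_id, rtype, resp in l1
      for date, pid, price in products if prod_id == pid]

# FILTER j1 BY req_date == date AND price > 30
j2 = [row for row in j1 if row[0] == row[4] and row[5] > 30]

# GROUP ALL + COUNT
print(len(j2))  # 1 matching request
```

On a cluster, Pig compiles each of these relational steps into MapReduce jobs, so the same script scales from this toy sample to the full year of logs.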