3x  Fr@ iso fzk va       nV         ol            lenh                 ov                   en
Millions of these, each day86.88.37.142 - - [26/Jul/2011:00:01:46 +0200] "GET /nl/index.html?Referrer=ADVNLGOO22901030000b...
Egypt @ Jan 27, 2011
Hundreds of millions of these, each dayBGP4MP|980099497|A|193.148.15.68|3333|192.37.0.0/16|3333 5378 286 1836|IGP|193.148....
Name Node                                                          /some/file               /foo/barHDFS client           ...
Why                                ?  scalable      storage and   good for analytics:open source      processing     schem...
Not for me...I don’t have a lot of data.I surely don’t have a cluster of machines to spare.I just read the paper.It’d be c...
Free data...
Getting it...curl -u fzk:secret https://stream.twitter.com/1/statuses/sample.json > tweets.json                 8 weeks ==...
Tens of millions of these
Good, now the cluster...     http://whirr.apache.org/
Step 1: ConfigureStep 2: LaunchStep 3: ?Step 4: Pay
Step 1: Configurewhirr.service-name=hadoopwhirr.cluster-name=my-clusterwhirr.instance-templates=1 hadoop-jobtracker+hadoop-...
Step 2: Launchwhirr launch-cluster --config cluster.properties               wait about 20 minutes...bash .whirr/my-cluste...
Step 3:                             What’s up with                             Microsoft?Twitter mentions
“Hello,         MA                   POracle”                input: text“Google vs.                  Oracle, 1Microsoft vs...
MAGIC!
map(input record) => (key, value)ORDER BY key GROUP BY keyreduce(key, values) => (key, value)
Apache: [1]     RE                   DUC                       EApple: [1,1]    input: text,                count         ...
https://github.com/xebia/BigData-University
mvn clean installexport HADOOP_CONF_DIR=$HOME/.whirr/my-clusterhadoop jar bigdata-twitter-0.1-SNAPSHOT-job.jar -Dxebia.twi...
hadoop fs -get /job-output/part-r-00000 .whirr destroy-cluster --config cluster.properties
20110807   apache      220110807   google      42220110807   microsoft   4420110807   oracle      1120110808   apache     ...
Step 4: PayFrom: no-reply-aws@amazon.comSubject: AWS Billing Statement AvailableGreetings from Amazon Web Services,This e-...
Q&A                       @fzk  fvanvollenhoven@xebia.com
GOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x Hadoop
Upcoming SlideShare
Loading in …5
×

GOTO 2011 preso: 3x Hadoop

1,080 views

Published on

Talk presentation at the 2011 GOTO Amsterdam conference.

Published in: Technology, News & Politics
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
1,080
On SlideShare
0
From Embeds
0
Number of Embeds
15
Actions
Shares
0
Downloads
6
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

GOTO 2011 preso: 3x Hadoop

  1. 1. 3x Fr@ iso fzk va nV ol lenh ov en
  2. 2. Millions of these, each day86.88.37.142 - - [26/Jul/2011:00:01:46 +0200] "GET /nl/index.html?Referrer=ADVNLGOO22901030000bsl HTTP/1.1" 200 15551 "http://www.google.nl/search?sourceid=navclient&aq=0h&oq=b&hl=nl&ie=UTF-8&q=bol.com.nl" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)""DYN_USER_ID=12660142780; DYN_USER_CONFIRM=8bc25ea623423bae5c4ce970faf1b13f4;BOL_RFID=ADVNLGOO1322090000bsl; BUI=86.55.31.109.1278181451852406" 0"Ti3nysCoEI4AAGMfqZAAAAPD" "-" "325886" "ps316"
  3. 3. Egypt @ Jan 27, 2011
  4. 4. Hundreds of millions of these, each dayBGP4MP|980099497|A|193.148.15.68|3333|192.37.0.0/16|3333 5378 286 1836|IGP|193.148.15.140|0|0||NAG|| the internet works because of these (and cables and routers and money and people and stuff)
  5. 5. Name Node /some/file /foo/barHDFS client create file read data Date Node Date Node Date Node write data DISK DISK DISK Node local HDFS client DISK DISK DISK replicate DISK DISK DISK read data
  6. 6. Why ? scalable storage and good for analytics:open source processing schema-less,cost-efficient in one unstructured
  7. 7. Not for me...I don’t have a lot of data.I surely don’t have a cluster of machines to spare.I just read the paper.It’d be cool if I could try this stuff sometime, though...
  8. 8. Free data...
  9. 9. Getting it...curl -u fzk:secret https://stream.twitter.com/1/statuses/sample.json > tweets.json 8 weeks == ~1/4 TB
  10. 10. Tens of millions of these
  11. 11. Good, now the cluster... http://whirr.apache.org/
  12. 12. Step 1: ConfigureStep 2: LaunchStep 3: ?Step 4: Pay
  13. 13. Step 1: Configurewhirr.service-name=hadoopwhirr.cluster-name=my-clusterwhirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode, 19 hadoop-datanode+hadoop-tasktrackerwhirr.provider=aws-ec2whirr.identity=SECRETwhirr.credential=EVEN-MORE-SECRETwhirr.private-key-file=${sys:user.home}/.ssh/id_rsawhirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pubwhirr.hadoop-install-function=install_cdh_hadoopwhirr.hadoop-configure-function=configure_cdh_hadoopwhirr.hardware-id=c1.xlarge
  14. 14. Step 2: Launchwhirr launch-cluster --config cluster.properties wait about 20 minutes...bash .whirr/my-cluster/hadoop-proxy.sh
  15. 15. Step 3: What’s up with Microsoft?Twitter mentions
  16. 16. “Hello, MA POracle” input: text“Google vs. Oracle, 1Microsoft vs. split words Google, 1Apple” Microsoft, 1 emit: Apple, 1“Apache rocks! $WORD, 1 Apache, 1Oracle not so for Oracle, 1much...” ‘interesting’ words Apple, 1“Apple ==iAwesome”
  17. 17. MAGIC!
  18. 18. map(input record) => (key, value)ORDER BY key GROUP BY keyreduce(key, values) => (key, value)
  19. 19. Apache: [1] RE DUC EApple: [1,1] input: text, count Apache: 1 Apple: 2 sum valuesGoogle: [1] Google: 1 emit: Microsoft: 1 $KEY, $SUM Oracle: 2Microsoft: [1] for all keysOracle: [1,1]
  20. 20. https://github.com/xebia/BigData-University
  21. 21. mvn clean installexport HADOOP_CONF_DIR=$HOME/.whirr/my-clusterhadoop jar bigdata-twitter-0.1-SNAPSHOT-job.jar -Dxebia.twitter.terms=oracle,google,microsoft,apache s3://training-hdfs/twitter-sample/* /job-output wait another 20 minutes...
  22. 22. hadoop fs -get /job-output/part-r-00000 .whirr destroy-cluster --config cluster.properties
  23. 23. 20110807 apache 220110807 google 42220110807 microsoft 4420110807 oracle 1120110808 apache 2520110808 google 134120110808 microsoft 16020110808 oracle 3720110809 apache 1720110809 google 143120110809 microsoft 18420110809 oracle 4020110810 apache 1220110810 google 168820110810 microsoft 17920110810 oracle 51
  24. 24. Step 4: PayFrom: no-reply-aws@amazon.comSubject: AWS Billing Statement AvailableGreetings from Amazon Web Services,This e-mail confirms that your latest billingstatement is available on the AWS web site. Youraccount will be charged the following:Total: $218.02Thank you for using Amazon Web Services.Sincerely,Amazon Web Services
  25. 25. Q&A @fzk fvanvollenhoven@xebia.com

×