Quick Wikipedia Mining using Elastic Map Reduce
Published in: Technology

  1. Elastic MapReduce Wikipedia
  2. http://ohkura.com
     • 2008 1
     •
     • blog
     • 2007
  3. Python Wikipedia ( 120 ) MapReduce
  4. • Hadoop
       o
     • Hadoop Streaming
       o Mapper Reducer
       o OK Python
       o IO
     • Amazon AWS (S3, EC2)
  5. Elastic MapReduce
     • Amazon Cloud Computing
     • MapReduce Hadoop
     • Master Worker EC2
     • S3
     • http://aws.amazon.com/elasticmapreduce/
  6. Step0:
     • AWS
     • Elastic MapReduce 1
     • S3
       o Ruby s3sync http://s3sync.net/wiki
     • elastic-mapreduce
       o Amazon Ruby
       o http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264
  7. Step1:
     • Wikipedia
       o wget "http://download.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2"
       o bunzip2 jawiki-latest-pages-articles.xml.bz2
     •
       o <page> 20000
       o Hadoop Streaming worker
     • S3
       o ohkura-wikipedia:jawiki/articles/part-00000, 00001, ...
       o EC2
  8. Step2:
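  9. Step2:
     Mapper
       import re
       import sys

       # Capture the target of [[target]], [[target|label]] or [[target#section]].
       link_pat = re.compile(r"\[\[([^\]|#]*?)[\]|#]")

       for line in sys.stdin:
           for link in link_pat.findall(line):
               if ":" not in link:  # skip namespaced links
                   print("LongValueSum:%s\t1" % link)
     Reducer
       aggregate (Hadoop Reducer)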
  10. Link target   Count
      2007          92008
      2006          88376
      2008          82821
      2005          77964
                    76111
      2000          68078
      2004          64921
                    63660
                    58081
      2001          57419
      2003          57130
  11. Step3:
  12. Step3:
      Mapper
        import datetime
        import re
        import sys

        timestamp_pat = re.compile("<timestamp>(.+?)</timestamp>")
        # ArticleExtractor is the author's helper (not shown on the slide).
        articles = ArticleExtractor(sys.stdin)
        for article in articles:
            for line in article:
                m = timestamp_pat.search(line)
                if m:
                    dt = m.group(1)  # e.g. 2009-10-08T05:55:49Z
                    t = datetime.datetime.strptime(dt, "%Y-%m-%dT%H:%M:%SZ")
                    print("LongValueSum:%s\t1" % t.year)
      Reducer
        aggregate (Hadoop Reducer)
  13. JSON Wizard
      $ elastic-mapreduce --create --num-instances 4 \
            --instance-type m1.small --json count-year-jobflow.json
  14. 2002: 1
      2003: 4107
      2004: 19630
      2005: 44766
      2006: 103018
      2007: 151382
      2008: 217252
      2009: 683079
  15. Step4: PageRank
  16. Step4: PageRank
      •
        o 1 1/
        o /
        o
        o 2 10
  21. PageRank MapReduce
      • Step1
        o Wikipedia
        o M:
        o R: Identity
      • Step2
        o M: /
        o R:
      • Step2 10
        o HDFS
  22. 1803.63759701
      1568.19638967
      1029.67219551 2006
      991.646816399 2007
      930.652982148 2005
      885.892964893
      866.358526418 2008
      798.668799871 2004
      779.443042817
      . .
  23. 1803.63759701
      1568.19638967
      885.892964893
      779.443042817
      755.488775376
      728.882441149
      682.257070166
      623.000478660
      580.347125978
      569.411885196
      ...
  24. 779.443042817
      728.882441149
      682.257070166
      580.347125978
      522.618667481
      495.986145911
      452.646283200
      444.036370473
      443.043952427
      441.486349135
      392.427995635
  25. =100
      0.00682557409174 785 ...
      0.00682555111099 JR 700
      0.00682544488688
      0.00682540998664
      0.00682540375114
      0.00682528989653 ( )
      0.00682524117061
      0.00682521978481 ( )
      0.00682521236658
      0.00682517459662
      0.00682512260620
  26. • Wikipedia (JA)
        o 1,900,000 articles
        o 4.2GB
        o 20
        o ~30
      • Blog from blogeye.jp
        o 200,000,000 articles
        o 800GB
        o 80
        o 70
  27. •
        o
        o Master
      •
        o
        o
        o
        o
        o 1 1 0.1 1 100 1000
  28. Q&A
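
The Step2 mapper on slide 9 can be tried locally before a job is submitted. The sketch below is an assumed reconstruction, not the author's exact script: the regular expression restores backslashes that appear to have been lost from the slide, and `map_links`, `local_aggregate`, and the sample lines are hypothetical names and data introduced here. `local_aggregate` only imitates what the `aggregate` reducer does with `LongValueSum:` records; it is not Hadoop's implementation.

```python
import re
from collections import Counter

# Assumed pattern: captures the target of [[target]], [[target|label]]
# or [[target#section]].
LINK_PAT = re.compile(r"\[\[([^\]|#]*?)[\]|#]")


def map_links(lines):
    """Yield one 'LongValueSum:<target>\\t1' record per non-namespaced link."""
    for line in lines:
        for link in LINK_PAT.findall(line):
            if ":" not in link:  # skip namespaced links such as Category:...
                yield "LongValueSum:%s\t1" % link


def local_aggregate(records):
    """Local stand-in for the aggregate reducer: sum the values per key."""
    totals = Counter()
    for record in records:
        key, value = record.split("\t")
        # Drop the 'LongValueSum:' prefix to recover the link target.
        totals[key.split(":", 1)[1]] += int(value)
    return totals


# Hypothetical input lines, for illustration only.
sample = [
    "In [[2007]] and [[2008|that year]] see [[Tokyo#History]].",
    "[[Category:Years]] again [[2007]]",
]
counts = local_aggregate(map_links(sample))
for name, total in counts.most_common():
    print("%s\t%d" % (name, total))
```

Running the mapper over a handful of lines like this is a cheap way to check the pattern before paying for EC2 instances.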

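
Slide 21 describes PageRank as a link-extraction job followed by a map/reduce step that is repeated 10 times. The following is a small in-memory sketch of that loop, assuming the standard formulation in which each page splits its score evenly over its out-links and the reducer sums what arrives. The damping factor of 0.85, the function names, and the three-page graph are assumptions that do not come from the slides, and a dictionary stands in for Hadoop's shuffle and for HDFS between rounds.

```python
from collections import defaultdict

DAMPING = 0.85  # assumption: not stated in the slides


def map_step(page, score, links):
    """Emit the page's link list plus an equal share of its score per out-link."""
    yield page, ("links", links)
    for target in links:
        yield target, ("score", score / len(links))


def reduce_step(page, values):
    """Sum incoming shares and re-attach the page's out-links."""
    links, incoming = [], 0.0
    for kind, payload in values:
        if kind == "links":
            links = payload
        else:
            incoming += payload
    return page, (1 - DAMPING) + DAMPING * incoming, links


def iterate(graph, rounds=10):
    """Run `rounds` map/reduce passes over {page: out-links}, starting at score 1."""
    state = {page: (1.0, links) for page, links in graph.items()}
    for _ in range(rounds):
        shuffled = defaultdict(list)  # stand-in for Hadoop's shuffle/sort
        for page, (score, links) in state.items():
            for key, value in map_step(page, score, links):
                shuffled[key].append(value)
        state = {}
        for key, values in shuffled.items():
            page, score, links = reduce_step(key, values)
            state[page] = (score, links)
    return {page: score for page, (score, _) in state.items()}


# Hypothetical three-page graph, for illustration only.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = iterate(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Passing the link list through the mapper alongside the scores keeps each round self-contained, which is what lets the same job be rerun on its own output.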