Quick Wikipedia Mining using Elastic Map Reduce
Published in: Technology
  • Transcript

    • 1. Elastic MapReduce Wikipedia
    • 2. http://ohkura.com • 2008 1 • • blog • 2007
    • 3. Python Wikipedia ( 120 ) MapReduce
    • 4. • Hadoop o • Hadoop Streaming o Mapper Reducer o OK Python o IO • Amazon AWS (S3, EC2)
    • 5. Elastic MapReduce • Amazon Cloud Computing • MapReduce Hadoop • Master Worker EC2 • S3 • http://aws.amazon.com/elasticmapreduce/
    • 6. Step0: • AWS • Elastic MapReduce 1 • S3 o Ruby s3sync http://s3sync.net/wiki • elastic-mapreduce o Amazon Ruby o http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264
    • 7. Step1: • Wikipedia o wget "http://download.wikimedia.org/jawiki/latest/jawiki- latest-pages-articles.xml.bz2" o bunzip2 jawiki-latest-pages-articles.xml.bz2 • o <page> 20000 o Hadoop Streaming worker • S3 o ohkura-wikipedia:jawiki/articles/part-00000, 00001, ... o EC2
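
A minimal sketch of the splitting step in Step1, assuming the dump is cut at `<page>` boundaries into parts of 20000 pages each so that Hadoop Streaming workers each receive whole articles. The function names `split_pages` and `part_name` are hypothetical and do not come from the slides; uploading the parts to S3 is left out.

```python
def split_pages(lines, pages_per_part=20000):
    """Yield lists of lines, each holding up to pages_per_part whole <page> elements.

    Lines outside any <page> element (the dump header and footer) are dropped.
    """
    part, pages, in_page = [], 0, False
    for line in lines:
        if "<page>" in line:
            in_page = True
        if in_page:
            part.append(line)
        if "</page>" in line:
            in_page = False
            pages += 1
            if pages == pages_per_part:
                yield part
                part, pages = [], 0
    if part:
        # Last part may hold fewer pages than pages_per_part.
        yield part


def part_name(index):
    """Name parts like the S3 keys on the slide: part-00000, part-00001, ..."""
    return "part-%05d" % index
```

Each yielded part could be written to a file named by `part_name(i)` before being synced to the S3 bucket.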
    • 8. Step2:
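    • 9. Step2: Mapper
          import re, sys
          # Match [[Target]], [[Target|label]] and [[Target#section]]; capture Target.
          link_pat = re.compile(r"\[\[([^\]|#]*?)[\]|#]")
          for line in sys.stdin:
              for link in link_pat.findall(line):
                  if ":" not in link:  # skip links that carry a namespace prefix
                      print("LongValueSum:%s\t1" % link)
      Reducer: aggregate (Hadoop Reducer)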
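
The link-counting mapper can be tried locally before launching a job flow. The sketch below assumes the pattern is meant to match `[[...]]` wiki links with escaped brackets, and uses a hypothetical `sum_long_values` function as a local stand-in for the `LongValueSum` behavior of Hadoop's aggregate reducer (summing the values per key). It is not the reducer's actual implementation.

```python
import re
from collections import Counter

# Capture the link target up to the first "]", "|" or "#".
LINK_PAT = re.compile(r"\[\[([^\]|#]*?)[\]|#]")


def map_links(lines):
    """Emit one 'LongValueSum:<target>\\t1' record per non-namespaced link."""
    for line in lines:
        for link in LINK_PAT.findall(line):
            if ":" not in link:
                yield "LongValueSum:%s\t1" % link


def sum_long_values(records):
    """Local stand-in for the aggregate reducer: sum the values for each key."""
    totals = Counter()
    for record in records:
        key, value = record.rsplit("\t", 1)
        totals[key.split(":", 1)[1]] += int(value)
    return totals
```

Feeding a few lines of wiki markup through `sum_long_values(map_links(lines))` gives a per-target link count comparable to the job output.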
    • 10. 2007: 92008, 2006: 88376, 2008: 82821, 2005: 77964, […]: 76111, 2000: 68078, 2004: 64921, […]: 63660, […]: 58081, 2001: 57419, 2003: 57130
    • 11. Step3:
    • 12. Step3: Mapper
          import datetime, re, sys
          timestamp_pat = re.compile("<timestamp>(.+?)</timestamp>")
          articles = ArticleExtractor(sys.stdin)  # ArticleExtractor is not defined in the slides
          for article in articles:
              for line in article:
                  m = timestamp_pat.search(line)
                  if m:
                      dt = m.group(1)  # e.g. 2009-10-08T05:55:49Z
                      t = datetime.datetime.strptime(dt, "%Y-%m-%dT%H:%M:%SZ")
                      print("LongValueSum:%s\t1" % t.year)
      Reducer: aggregate (Hadoop Reducer)
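
The slides do not show `ArticleExtractor`. The sketch below substitutes a hypothetical generator, `iter_articles`, which assumes that each article is the run of lines from `<page>` to `</page>`; the real helper may work differently. `map_years` then mirrors the Step3 mapper and emits one record per timestamp.

```python
import datetime
import re

TIMESTAMP_PAT = re.compile(r"<timestamp>(.+?)</timestamp>")


def iter_articles(lines):
    """Hypothetical stand-in for ArticleExtractor: yield each <page> as a list of lines."""
    article = None
    for line in lines:
        if "<page>" in line:
            article = []
        if article is not None:
            article.append(line)
            if "</page>" in line:
                yield article
                article = None


def map_years(lines):
    """Emit 'LongValueSum:<year>\\t1' for every <timestamp> found inside an article."""
    for article in iter_articles(lines):
        for line in article:
            m = TIMESTAMP_PAT.search(line)
            if m:
                t = datetime.datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%SZ")
                yield "LongValueSum:%d\t1" % t.year
```

Because the dump used here holds only the latest revision of each page, this counts articles by the year of their last edit.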
    • 13. JSON Wizard $ elastic-mapreduce --create --num-instances 4 --instance-type m1.small --json count-year-jobflow.json
    • 14. 2002: 1 2003: 4107 2004: 19630 2005: 44766 2006: 103018 2007: 151382 2008: 217252 2009: 683079
    • 15. Step4: PageRank
    • 16. Step4: PageRank • o 1 1/ o / o o 2 10
    • 21. PageRank MapReduce • Step1 o Wikipedia o M: o R: Identity • Step2 o M: / o R: • Step2 10 o HDFS
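
The second PageRank step (the mapper divides a page's score among its outgoing links, the reducer sums what each page receives, and the step is repeated 10 times) can be sketched in plain Python as below. The exact formula on the slides was lost in extraction, so the damping factor of 0.85, the initial score of 1.0 per page, and all function names are assumptions for illustration only.

```python
from collections import defaultdict


def pagerank_map(page, score, links):
    """Pass the link list along and send score / out-degree to each link target."""
    yield page, ("links", links)
    if links:
        share = score / len(links)
        for target in links:
            yield target, ("score", share)


def pagerank_reduce(page, values, damping=0.85):
    """Sum the incoming shares; damping is an assumed parameter, not from the slides."""
    links, total = [], 0.0
    for kind, payload in values:
        if kind == "links":
            links = payload
        else:
            total += payload
    return page, (1.0 - damping) + damping * total, links


def iterate(graph, scores, damping=0.85):
    """Run one map/shuffle/reduce round in memory and return the new scores."""
    grouped = defaultdict(list)
    for page, links in graph.items():
        for key, value in pagerank_map(page, scores[page], links):
            grouped[key].append(value)
    new_scores = {}
    for page, values in grouped.items():
        _, score, _ = pagerank_reduce(page, values, damping)
        new_scores[page] = score
    return new_scores


def pagerank(graph, iterations=10, damping=0.85):
    """Start every page at 1.0 (an assumption) and repeat the round."""
    scores = {page: 1.0 for page in graph}
    for _ in range(iterations):
        scores = iterate(graph, scores, damping)
    return scores
```

In a real job flow each round would be a separate MapReduce step that reads the previous round's output, with the link lists written back out by the reducer so that the next mapper can use them.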
    • 22. 1803.63759701 1568.19638967 1029.67219551 2006 991.646816399 2007 930.652982148 2005 885.892964893 866.358526418 2008 798.668799871 2004 779.443042817 . .
    • 23. 1803.63759701 1568.19638967 885.892964893 779.443042817 755.488775376 728.882441149 682.257070166 623.000478660 580.347125978 569.411885196 ...
    • 24. 779.443042817 728.882441149 682.257070166 580.347125978 522.618667481 495.986145911 452.646283200 444.036370473 443.043952427 441.486349135 392.427995635
    • 25. =100 0.00682557409174 785 ... 0.00682555111099 JR 700 0.00682544488688 0.00682540998664 0.00682540375114 0.00682528989653 ( ) 0.00682524117061 0.00682521978481 ( ) 0.00682521236658 0.00682517459662 0.00682512260620
    • 26. • Wikipedia (JA) o 1,900,000 articles o 4.2GB o 20 o ~30 • Blog from blogeye.jp o 200,000,000 articles o 800GB o 80 o 70
    • 27. • o o Master • o o o o o 1 1 0.1 1 100 1000
    • 28. Q&A
