Quick Wikipedia Mining using Elastic Map Reduce
 

Presentation Transcript

  • Mining Wikipedia with Elastic MapReduce
  • Self-introduction • http://ohkura.com • blog
  • Goal: mine Wikipedia ( 120 ) with Python and MapReduce
  • Background
      o Hadoop
          o Hadoop Streaming: the Mapper and Reducer can be any executables (Python is OK) that communicate over standard IO
      o Amazon AWS (S3, EC2)
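The Hadoop Streaming contract above (read records on stdin, write tab-separated key/value pairs on stdout, reducer sees records sorted by key) can be exercised entirely locally. A minimal word-count-style sketch in Python 3 (the slides' own code is Python 2); `mapper` and `reducer` are illustrative names, not part of Hadoop:

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit one "word<TAB>1" record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(records):
    """Sum counts per key; Hadoop Streaming delivers records sorted by key."""
    parsed = (r.split("\t") for r in records)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(v) for _, v in group))

if __name__ == "__main__":
    # Simulate the shuffle locally: map, sort by key, reduce.
    mapped = sorted(mapper(["to be or not to be"]))
    for out in reducer(mapped):
        print(out)
```

On a real cluster the `sorted(...)` step is Hadoop's shuffle, and the two functions would run as separate `-mapper` and `-reducer` scripts over stdin/stdout.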
  • Elastic MapReduce
      o Amazon's cloud-computing service for running MapReduce (Hadoop) jobs
      o Master and Worker nodes run on EC2; input and output go through S3
      o http://aws.amazon.com/elasticmapreduce/
  • Step 0: Preparation
      o Sign up for AWS and enable Elastic MapReduce
      o S3 uploads: the Ruby tool s3sync http://s3sync.net/wiki
      o elastic-mapreduce: Amazon's Ruby command-line client http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264
  • Step 1: Prepare the input data
      o Download the Wikipedia dump: wget "http://download.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2" then bunzip2 jawiki-latest-pages-articles.xml.bz2
      o Split the dump into chunks of 20,000 <page> elements so Hadoop Streaming can distribute them across workers
      o Upload to S3 as ohkura-wikipedia:jawiki/articles/part-00000, 00001, ...
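The splitting step can be sketched as a small script that buffers whole `<page>…</page>` blocks and starts a new part file every 20,000 pages. A Python 3 sketch; `split_pages` and the part-file naming are illustrative, not from the slides:

```python
PAGES_PER_PART = 20000  # one input part per 20,000 <page> elements

def split_pages(lines, pages_per_part=PAGES_PER_PART):
    """Group a stream of dump lines into chunks of complete <page> elements.

    Yields (part_index, list_of_lines); each part holds at most
    pages_per_part whole pages, never a page split across parts.
    """
    part, buf, pages, index = [], [], 0, 0
    for line in lines:
        buf.append(line)
        if "</page>" in line:           # a page just ended; commit its lines
            part.extend(buf)
            buf = []
            pages += 1
            if pages == pages_per_part:
                yield index, part
                part, pages, index = [], 0, index + 1
    if part:                            # flush the final, possibly short part
        yield index, part

# Tiny demo: 3 synthetic pages, 2 pages per part.
demo = ["<page>", "x", "</page>"] * 3
print([(i, len(chunk)) for i, chunk in split_pages(demo, pages_per_part=2)])
# → [(0, 6), (1, 3)]
```

Each chunk would then be written to `part-00000`, `part-00001`, ... and uploaded to S3 with s3sync.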
  • Step 2: Count links to each page
  • Step 2: Mapper
        link_pat = re.compile(r"\[\[([^\]|#]*?)[\]|#]")
        for line in sys.stdin:
            for link in link_pat.findall(line):
                if ":" not in link:
                    print "LongValueSum:%s\t1" % link
    Reducer: aggregate (Hadoop's built-in aggregate Reducer)
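The mapper's link regex (with the backslashes restored that the slide rendering dropped) can be checked locally on a line of made-up wikitext. A Python 3 sketch; `extract_links` and the sample text are illustrative:

```python
import re

# [[target]] or [[target|label]]; capture stops at "]", "|", or "#",
# and links containing ":" (e.g. Category:, namespace links) are skipped.
link_pat = re.compile(r"\[\[([^\]|#]*?)[\]|#]")

def extract_links(line):
    """Return the link targets found in one line, as the mapper would."""
    return [link for link in link_pat.findall(line) if ":" not in link]

sample = "See [[2007]] and [[Tokyo|the capital]] but not [[Category:Years]]."
print(extract_links(sample))
# → ['2007', 'Tokyo']
```

Each returned target would be emitted as `LongValueSum:<target>\t1`, which Hadoop's aggregate reducer sums per key.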
  • Results (link counts):
        2007 92008
        2006 88376
        2008 82821
        2005 77964
        76111
        2000 68078
        2004 64921
        63660
        58081
        2001 57419
        2003 57130
  • Step 3: Count articles by timestamp year
  • Step 3: Mapper
        timestamp_pat = re.compile("<timestamp>(.+?)</timestamp>")
        articles = ArticleExtractor(sys.stdin)
        for article in articles:
            for line in article:
                m = timestamp_pat.search(line)
                if m:
                    dt = m.group(1)  # e.g. 2009-10-08T05:55:49Z
                    t = datetime.datetime.strptime(dt, "%Y-%m-%dT%H:%M:%SZ")
                    print "LongValueSum:%s\t1" % t.year
    Reducer: aggregate (Hadoop's built-in aggregate Reducer)
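The timestamp mapper can likewise be exercised locally on fabricated `<timestamp>` lines (the slide's `ArticleExtractor` is not reproduced here). A Python 3 sketch; `year_of` is an illustrative name:

```python
import datetime
import re
from collections import Counter

timestamp_pat = re.compile(r"<timestamp>(.+?)</timestamp>")

def year_of(line):
    """Parse the year out of a <timestamp> line, or None if absent."""
    m = timestamp_pat.search(line)
    if not m:
        return None
    t = datetime.datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%SZ")
    return t.year

lines = [
    "  <timestamp>2009-10-08T05:55:49Z</timestamp>",
    "  <timestamp>2008-01-02T12:00:00Z</timestamp>",
    "  <text>no timestamp here</text>",
]
# Counter stands in for the aggregate reducer's per-key sums.
counts = Counter(y for y in map(year_of, lines) if y is not None)
print(sorted(counts.items()))
# → [(2008, 1), (2009, 1)]
```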
  • Instead of the web wizard, a job flow can be submitted from a JSON file: $ elastic-mapreduce --create --num-instances 4 --instance-type m1.small --json count-year-jobflow.json
  • Results by year:
        2002: 1
        2003: 4107
        2004: 19630
        2005: 44766
        2006: 103018
        2007: 151382
        2008: 217252
        2009: 683079
  • Step4: PageRank
  • Step 4: PageRank
      o Every page starts with rank 1
      o Each page divides its rank evenly among its outgoing links (1/outdegree per link)
      o A page's new rank is the sum of the rank flowing in over its incoming links
      o Iterating (about 10 rounds here) converges to the PageRank scores
  • PageRank as MapReduce
      o Step 1: build the link graph from Wikipedia
            M: parse each article into its outgoing links; R: Identity
      o Step 2: propagate rank
            M: emit rank/outdegree to each outlink; R: sum the incoming contributions
      o Run Step 2 for 10 iterations, keeping intermediate results in HDFS
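The Step 2 iteration above can be simulated in-process on a tiny made-up graph: the "map" side emits rank/outdegree to each outlink, and the "reduce" side sums contributions per target. A simplified Python 3 sketch without a damping factor; `pagerank_step` and the graph are illustrative:

```python
from collections import defaultdict

def pagerank_step(ranks, outlinks):
    """One iteration: distribute each page's rank evenly over its outlinks."""
    contrib = defaultdict(float)           # reducer-side accumulator
    for page, rank in ranks.items():       # "map": emit rank/outdegree per link
        links = outlinks.get(page, [])
        for target in links:
            contrib[target] += rank / len(links)
    return dict(contrib)                   # "reduce": summed contributions

# Tiny illustrative graph: A -> B, A -> C, B -> C, C -> A
outlinks = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {page: 1.0 for page in outlinks}   # Step 1: every page starts at 1
for _ in range(10):                        # the slides run ~10 iterations
    ranks = pagerank_step(ranks, outlinks)
print({p: round(r, 3) for p, r in sorted(ranks.items())})
```

In the real job each iteration is a full MapReduce pass, with `ranks` and `outlinks` read from and written back to HDFS between passes; since every page here has outlinks, total rank is conserved across iterations.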
  • Top PageRank scores:
        1803.63759701
        1568.19638967
        1029.67219551
        2006 991.646816399
        2007 930.652982148
        2005 885.892964893
        866.358526418
        2008 798.668799871
        2004 779.443042817
        ...
  • 1803.63759701
        1568.19638967
        885.892964893
        779.443042817
        755.488775376
        728.882441149
        682.257070166
        623.000478660
        580.347125978
        569.411885196
        ...
  • 779.443042817
        728.882441149
        682.257070166
        580.347125978
        522.618667481
        495.986145911
        452.646283200
        444.036370473
        443.043952427
        441.486349135
        392.427995635
  • =100
        0.00682557409174 785
        ...
        0.00682555111099 JR 700
        0.00682544488688
        0.00682540998664
        0.00682540375114
        0.00682528989653 ( )
        0.00682524117061
        0.00682521978481 ( )
        0.00682521236658
        0.00682517459662
        0.00682512260620
  • Wikipedia (JA)
        o 1,900,000 articles
        o 4.2 GB
        o 20
        o ~30
    Blog data from blogeye.jp
        o 200,000,000 articles
        o 800 GB
        o 80
        o 70
  • Master • 1 1 0.1 1 100 1000
  • Q&A