Caveats

Slides from http://wikimania2010.wikimedia.org/wiki/Submissions/Mining_Wikipedia_public_data

Common failures to be avoided when we analyze Wikipedia public data

Felipe Ortega
GsyC/Libresoft
Wikimania 2010, Gdańsk, Poland
#1 Beware of special types of pages
●   Official page count includes articles with just one link.
    ●   You must consider if you need to filter out disambiguation pages.
    ●   Pay attention to redirects (see the sketch below).
    ●   Sometimes people wonder why the number of pages in the main namespace of the dump is so high.
●   Break down evolution trends by namespace.
    ●   Articles are very different from other pages.
    ●   Explore the % of already existing talk pages.
    ●   Connections from user pages.
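Below is a minimal sketch of that kind of filtering, assuming a locally downloaded pages-articles dump (the file name is a placeholder). Redirects are detected via the dump's <redirect> tag or the #REDIRECT prefix, the main namespace is approximated by titles without a colon, and disambiguation pages by an English-style {{disambig...}} template; all of these are heuristics you would adapt to your wiki.

# Sketch: classify pages in a pages-articles XML dump as redirect,
# disambiguation or "real" article. Heuristics only; adapt to your wiki.
try:
    import xml.etree.cElementTree as etree   # C-accelerated parser (older Pythons)
except ImportError:
    import xml.etree.ElementTree as etree    # modern Pythons ship the C code here
import bz2

DUMP = "eswiki-pages-articles.xml.bz2"       # hypothetical dump file

def localname(tag):
    # '{http://www.mediawiki.org/xml/export-0.4/}page' -> 'page'
    return tag.rsplit('}', 1)[-1]

articles = redirects = disambigs = other_ns = 0
for _event, elem in etree.iterparse(bz2.BZ2File(DUMP)):
    if localname(elem.tag) != 'page':
        continue
    title, text, is_redirect = '', '', False
    for child in elem:
        name = localname(child.tag)
        if name == 'title':
            title = child.text or ''
        elif name == 'redirect':             # explicit tag in newer dump schemas
            is_redirect = True
        elif name == 'revision':
            for sub in child:
                if localname(sub.tag) == 'text':
                    text = sub.text or ''
    if ':' in title:                         # crude main-namespace filter
        other_ns += 1
    elif is_redirect or text.lstrip().lower().startswith('#redirect'):
        redirects += 1
    elif '{{disambig' in text.lower():       # template names vary by language!
        disambigs += 1
    else:
        articles += 1
    elem.clear()                             # keep memory usage bounded

print("articles=%d redirects=%d disambig=%d other-ns=%d"
      % (articles, redirects, disambigs, other_ns))

Counting the categories separately makes it obvious how far the official page count is from the number of real articles.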
#2 Plan your hardware carefully
●   There are some general rules.
    ●   Parallelize as much as possible.
    ●   Buy more memory before buying more disk...
    ●   But take a look at your disk requirements.
        –   It makes a big difference when you can work on decompressed data on the fly (see the sketch below).
    ●   Hardware RAID is not always the best solution.
        –   RAID 10 in Linux can perform decently in many average studies.
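The parallelization and disk points can be combined: if the dump comes (or is split) in several compressed chunks, give each core one chunk and decompress it in a stream, so the uncompressed data never has to fit on disk. A sketch, with hypothetical chunk names and a trivial per-chunk job standing in for the real analysis:

# Sketch: spread per-file work over all cores and stream the decompression,
# so the uncompressed dump never has to fit on disk. File names are hypothetical.
import bz2
from multiprocessing import Pool, cpu_count

CHUNKS = ["pages-articles-part%02d.xml.bz2" % i for i in range(1, 9)]

def count_lines(path):
    """Stand-in for real per-chunk work: decompress on the fly and scan once."""
    n = 0
    f = bz2.BZ2File(path)
    try:
        for _line in f:
            n += 1
    finally:
        f.close()
    return path, n

if __name__ == "__main__":                   # required for multiprocessing on some OSes
    pool = Pool(processes=cpu_count())       # parallelize as much as possible
    for path, n in pool.map(count_lines, CHUNKS):
        print("%s: %d lines" % (path, n))
    pool.close()
    pool.join()

pool.map keeps all cores busy and returns the per-chunk results in order, so the aggregation step stays trivial.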
#3 Know your engines (DBs)
●   Correct configuration of the DB engine is crucial.
    ●   You'll always fall short with standard configs.
    ●   Fine-tune parameters according to your hardware.
●   Exploit memory as much as possible.
    –   E.g. the MEMORY engine in MySQL (sketch below).
●   Avoid unnecessary backups...
    ●   But be sure that you have copies of relevant info elsewhere!
●   Think about your process:
    –   Read-only vs. read-write.
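As an illustration of the MEMORY engine point, here is a sketch that materializes a small intermediate table in RAM for a read-only analysis pass. It assumes the MySQLdb driver, a local MySQL server and a revision table already loaded from the dump (column names follow the 2010-era MediaWiki schema); credentials and names are placeholders.

# Sketch: build a small in-memory working table for a read-only analysis pass.
# Assumes the MySQLdb driver, a local MySQL server and an existing `revision`
# table loaded from the dump; names and credentials are placeholders.
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="research",
                       passwd="secret", db="wikidb")
cur = conn.cursor()

# Intermediate results that fit in RAM can live in the MEMORY engine,
# avoiding disk I/O on every lookup during the analysis.
cur.execute("DROP TABLE IF EXISTS rev_per_user")
cur.execute("""
    CREATE TABLE rev_per_user (
        user_id INT UNSIGNED NOT NULL,
        n_revs  INT UNSIGNED NOT NULL,
        PRIMARY KEY (user_id)
    ) ENGINE=MEMORY
""")
cur.execute("""
    INSERT INTO rev_per_user
    SELECT rev_user, COUNT(*) FROM revision GROUP BY rev_user
""")
conn.commit()

cur.execute("SELECT COUNT(*) FROM rev_per_user")
print("distinct editors: %d" % cur.fetchone()[0])
cur.close()
conn.close()

Remember that MEMORY tables are bounded by max_heap_table_size, so this only pays off for intermediate results that genuinely fit in RAM.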
#4 Organize your code
●   Using an SCM is a must.
    ●   SVN, Git.
●   Upload your code to a public repository.
    ●   BerliOS, SourceForge, GitHub...
●   Document your code...
    ●   ...if you ever aspire to get interest from other developers.
●   Use consistent version numbers.
●   Test, test, test...
    ●   Include sanity checks and “toy tests” (example below).
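A “toy test” can be as small as this: a hypothetical helper that will later run over the whole dump, exercised with unittest on a handful of hand-written cases.

# Sketch: a "toy test" for a small helper that will later run over the full dump.
# The helper below is a hypothetical example of the kind of function worth testing.
import re
import unittest

def is_redirect(wikitext):
    """Return True if a page's wikitext marks it as a redirect."""
    return bool(re.match(r'\s*#redirect', wikitext, re.IGNORECASE))

class ToyTests(unittest.TestCase):
    def test_plain_redirect(self):
        self.assertTrue(is_redirect("#REDIRECT [[Main Page]]"))

    def test_case_and_whitespace(self):
        self.assertTrue(is_redirect("  #Redirect [[Somewhere]]"))

    def test_regular_article(self):
        self.assertFalse(is_redirect("'''Gdansk''' is a city in Poland."))

if __name__ == "__main__":
    unittest.main()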
#5 Use the right “spell”
●   Target data is well defined:
    ●   XML
    ●   Big portions of plain text
    ●   Inter-wiki links and outlinks (extraction sketch below).
●   Some alternatives:
    ●   cElementTree (high-speed XML parsing).
    ●   Python (modules/short scripts) or Java (big projects).
    ●   Perl (regexps).
    ●   sed & awk
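For the inter-wiki links and outlinks, a single regular expression already gets you surprisingly far. A sketch over a made-up snippet of wikitext (real link syntax has more corner cases than this):

# Sketch: pull internal links and inter-language links out of raw wikitext
# with a regular expression; the sample text is made up.
import re

WIKILINK = re.compile(r'\[\[([^\]|]+)(?:\|[^\]]*)?\]\]')   # [[Target|label]]

sample = ("'''Gdansk''' is a city on the [[Baltic Sea]]; see also "
          "[[History of Gdansk|its history]]. [[pl:Gdansk]] [[de:Danzig]]")

outlinks, interwiki = [], []
for target in WIKILINK.findall(sample):
    # Inter-language links look like 'pl:Gdansk': 2-3 letter prefix plus colon.
    if re.match(r'^[a-z]{2,3}:', target):
        interwiki.append(target)
    else:
        outlinks.append(target)

print("outlinks:  %s" % ", ".join(outlinks))
print("interwiki: %s" % ", ".join(interwiki))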
#6 Avoid reinventing the wheel
●   Consider developing your own tool only if:
    ●   No available solution fits your needs.
        –   Or you can only find proprietary/evaluation software.
    ●   Performance of other solutions is really bad.
●   Example: pywikipediabot
    ●   Simple library to query the Wikipedia API (see the sketch below).
    ●   Solves many simple needs of researchers/programmers.
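If pywikipediabot is more than you need, the MediaWiki web API it wraps can be queried with the standard library alone. A sketch that fetches basic page info; the endpoint, title and User-Agent string are just examples.

# Sketch: ask the MediaWiki web API for basic page info, which is roughly
# what libraries like pywikipediabot wrap for you. Standard library only.
import json
try:                                       # Python 3
    from urllib.request import urlopen, Request
    from urllib.parse import urlencode
except ImportError:                        # Python 2
    from urllib2 import urlopen, Request
    from urllib import urlencode

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "prop": "info",
    "titles": "Wikimania",
    "format": "json",
}
req = Request(API + "?" + urlencode(params),
              headers={"User-Agent": "research-sketch/0.1 (example script)"})
data = json.loads(urlopen(req).read().decode("utf-8"))

for page in data["query"]["pages"].values():
    print("%s: length=%s bytes, last touched %s"
          % (page["title"], page.get("length"), page.get("touched")))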
#7 Automate everything
●   Huge data repositories.
    ●   Even small samples are excessively time-consuming if processed by hand.
●   You will start to chain individual processes (driver sketch below).
    ●   You will save time on later executions.
    ●   Your study will be reproducible.
    ●   Updating results after several months becomes a no-brainer.
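Chaining the individual steps can start as a poor man's Makefile: a driver that runs each step and skips anything whose output already exists, so re-running the whole study months later costs almost nothing. Script and file names below are placeholders.

# Sketch: a tiny driver that chains the individual steps of a study and skips
# any step whose output file already exists, so re-runs months later are cheap.
# Commands and file names are placeholders for your own scripts.
import os
import subprocess
import sys

STEPS = [
    # (output file,            command that produces it)
    ("pages.db",               ["python", "load_dump.py", "dump.xml.bz2"]),
    ("revisions_per_user.csv", ["python", "aggregate.py", "pages.db"]),
    ("plots/activity.png",     ["python", "plot.py", "revisions_per_user.csv"]),
]

for output, cmd in STEPS:
    if os.path.exists(output):
        print("skipping %s (already built)" % output)
        continue
    print("running: %s" % " ".join(cmd))
    if subprocess.call(cmd) != 0:
        sys.exit("step failed while building %s" % output)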
#8 Extreme case of Murphy's Law
●   Always expect the worst possible case (see the sketch below).
    ●   Many caveats in each implementation.
    ●   Countless particular cases.
    ●   The “average solution” is not good enough.
        –   Standard algorithms may take much longer than expected to finish the job.
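In practice this means defensive processing: catch and log per-item failures instead of dying, and checkpoint progress so a crash after hours of work is not a total loss. A sketch with a placeholder process() and a toy input list:

# Sketch: expect the worst while looping over millions of items; log failures
# instead of dying, and checkpoint progress so a crash does not waste days.
# `process(page)` and the input are placeholders.
import logging

logging.basicConfig(filename="run.log", level=logging.INFO)

def process(page):
    """Placeholder for real per-page work; may blow up on odd corner cases."""
    return len(page)

pages = ["Main Page", "", "Gdansk", None, "Wikimania"]   # toy input with bad items

done, failed = 0, 0
for i, page in enumerate(pages):
    try:
        process(page)
        done += 1
    except Exception:
        failed += 1
        logging.exception("item %d failed (%r), continuing", i, page)
    if i % 2 == 0:                        # checkpoint every N items in a real run
        logging.info("checkpoint: %d done, %d failed", done, failed)

print("done=%d failed=%d (see run.log)" % (done, failed))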
#9 Not many graphical interfaces
●   Some good reasons for that:
    ●   Difficult to automate.
    ●   Hard to display dynamic results in real time.
    ●   Almost impossible to compute all results in a reasonable time frame for huge data collections (e.g. the English Wikipedia).
●   To the best of my knowledge, there are very few tools with graphical interfaces out there.
    ●   Is there a real need for them?
#10 Communication channels
●   Wikimedia-research-l
    ●   Mailing list about research on Wikimedia projects.
●   http://meta.wikimedia.org/wiki/Research
●   http://meta.wikimedia.org/wiki/Wikimedia_Research_Network
●   http://acawiki.org/Home
●   Final comments
    ●   Need for a consolidated info point, once and for all.
