Caveats

Common failures to be avoided when we analyze Wikipedia public data

Felipe Ortega
GsyC/Libresoft
Wikimania 2010, Gdańsk, Poland.
Slides from http://wikimania2010.wikimedia.org/wiki/Submissions/Mining_Wikipedia_public_data

#1 Beware of special types of pages
●   The official page count includes articles with just one link.
    ●   You must consider whether you need to filter out disambiguation pages.
    ●   Pay attention to redirects.
    ●   This is why people sometimes wonder how the number of pages in the main namespace of the dump can be so high.
●   Break down evolution trends by namespace (see the filtering sketch below).
    ●   Articles are very different from other pages.
    ●   Explore the % of already existing talk pages.
    ●   Connections from user pages.

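A minimal sketch of this kind of filtering, assuming page records have already been loaded into dictionaries with namespace, is_redirect and text keys (hypothetical field names; the real dump schema differs):

    # Hypothetical page records; in practice these would come from the dump
    # (the page table or the XML export), not from hard-coded dicts.
    MAIN_NAMESPACE = 0

    def is_countable_article(page):
        """Keep only 'real' articles: main namespace, no redirects,
        no disambiguation pages."""
        if page['namespace'] != MAIN_NAMESPACE:
            return False
        if page['is_redirect']:
            return False
        # Crude disambiguation check; a real study should use the
        # disambiguation templates/categories of each language edition.
        if '{{disambig' in page['text'].lower():
            return False
        return True

    def count_by_namespace(pages):
        """Break down the page count by namespace."""
        counts = {}
        for page in pages:
            counts[page['namespace']] = counts.get(page['namespace'], 0) + 1
        return counts

Comparing count_by_namespace() against the number of pages that pass is_countable_article() usually explains most of the gap between the official page count and the article count.
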
#2 Plan your hardware carefully
●   There are some general rules.
    ●   Parallelize as much as possible.
    ●   Buy more memory before buying more disk...
    ●   But take a look at your disk requirements.
        – They change dramatically depending on whether you store the dumps uncompressed or decompress them on the fly (see the sketch below).
    ●   Hardware RAID is not always the best solution.
        – Software RAID 10 in Linux can perform decently in many average studies.

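A minimal sketch of both ideas (on-the-fly decompression plus parallel workers), assuming a local bz2-compressed pages dump (the file name is a placeholder) and a toy line-oriented counting task that is easy to split into chunks:

    import bz2
    from multiprocessing import Pool

    DUMP = 'enwiki-pages-articles.xml.bz2'  # placeholder path to a local dump

    def count_page_starts(lines):
        """Toy per-chunk task: count how many <page> elements start in this chunk."""
        return sum(1 for line in lines if '<page>' in line)

    def chunks(stream, size=100000):
        """Group the decompressed stream into lists of lines small enough to ship to workers."""
        block = []
        for line in stream:
            block.append(line)
            if len(block) == size:
                yield block
                block = []
        if block:
            yield block

    if __name__ == '__main__':
        # Decompress on the fly: no uncompressed copy ever touches the disk.
        with bz2.open(DUMP, 'rt', encoding='utf-8', errors='replace') as stream:
            with Pool() as pool:
                total = sum(pool.imap(count_page_starts, chunks(stream)))
        print('pages found:', total)
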
#3 Know your engines (DBs)
●   Correct configuration of the DB engine is crucial.
    ●   You'll always fall short with standard configs.
    ●   Fine-tune parameters according to your hardware.
●   Exploit memory as much as possible.
    – E.g. the MEMORY engine in MySQL (see the sketch below).
●   Avoid unnecessary backups...
    ●   But be sure that you have copies of the relevant info elsewhere!
●   Think about your process:
    – Read-only vs. read-write.

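A minimal sketch of the MEMORY-engine idea, assuming a local MySQL server that already holds the page table from the dump and the MySQLdb module (MySQL-python / mysqlclient); the connection parameters are placeholders:

    import MySQLdb  # provided by the MySQL-python / mysqlclient package

    # Placeholder connection parameters.
    conn = MySQLdb.connect(host='localhost', user='research',
                           passwd='secret', db='enwiki')
    cur = conn.cursor()

    # Copy only the columns we actually need into an in-memory staging table,
    # so repeated read-only scans never hit the disk.
    # Note: the copy must fit within max_heap_table_size.
    cur.execute("DROP TABLE IF EXISTS page_mem")
    cur.execute("""
        CREATE TABLE page_mem ENGINE=MEMORY
        AS SELECT page_id, page_namespace, page_title, page_is_redirect
           FROM page
    """)

    # Example read-only query against the in-memory copy.
    cur.execute("SELECT page_namespace, COUNT(*) FROM page_mem GROUP BY page_namespace")
    for namespace, total in cur.fetchall():
        print(namespace, total)

    cur.close()
    conn.close()
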
#4 Organize your code
●   Using an SCM is a must.
    ●   SVN, Git.
●   Upload your code to a public repository.
    ●   BerliOS, SourceForge, GitHub...
●   Document your code...
    ●   ...if you ever aspire to get other developers interested.
●   Use consistent version numbers.
●   Test, test, test...
    ●   Include sanity checks and “toy tests” (see the example below).

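A minimal example of such a “toy test”, assuming a small hypothetical helper (split_title) that separates the namespace prefix from a page title; the point is simply that each helper ships with a cheap sanity check:

    import unittest

    def split_title(full_title):
        """Split 'Talk:Foo' into ('Talk', 'Foo'); plain articles get an empty prefix.
        Hypothetical helper, used here only to illustrate the idea of a toy test."""
        if ':' in full_title:
            prefix, rest = full_title.split(':', 1)
            return prefix, rest
        return '', full_title

    class ToyTests(unittest.TestCase):
        def test_article(self):
            self.assertEqual(split_title('Gdansk'), ('', 'Gdansk'))

        def test_talk_page(self):
            self.assertEqual(split_title('Talk:Gdansk'), ('Talk', 'Gdansk'))

        def test_colon_inside_title(self):
            # A known caveat: colons may also appear inside the title itself.
            self.assertEqual(split_title('Talk:Star Wars: Episode IV'),
                             ('Talk', 'Star Wars: Episode IV'))

    if __name__ == '__main__':
        unittest.main()
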
#5 Use the right “spell”
●   Target data is well defined:
    ●   XML.
    ●   Big portions of plain text.
    ●   Inter-wiki links and outlinks.
●   Some alternatives:
    ●   cElementTree (high-speed XML parsing; see the sketch below).
    ●   Python (modules/short scripts) or Java (big projects).
    ●   Perl (regexps).
    ●   sed & awk.

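A minimal cElementTree sketch that streams page titles out of an uncompressed pages dump (the file name is a placeholder); under Python 3 the plain xml.etree.ElementTree already uses the C accelerator, so the import falls back to it:

    try:
        import xml.etree.cElementTree as ET   # Python 2: the module named on the slide
    except ImportError:
        import xml.etree.ElementTree as ET    # Python 3: C accelerator is automatic

    DUMP = 'enwiki-pages-articles.xml'  # placeholder path to a local dump

    def local(tag):
        """Strip the '{namespace}' prefix that the MediaWiki export adds to tags."""
        return tag.rsplit('}', 1)[-1]

    def iter_titles(path):
        """Stream <page> elements and yield titles without loading the whole dump."""
        for event, elem in ET.iterparse(path, events=('end',)):
            if local(elem.tag) == 'page':
                title = None
                for child in elem:
                    if local(child.tag) == 'title':
                        title = child.text
                        break
                yield title
                elem.clear()   # crucial: free the element, or memory grows unbounded

    if __name__ == '__main__':
        for i, title in enumerate(iter_titles(DUMP)):
            if i >= 5:
                break
            print(title)
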
#6 Avoid reinventing the wheel
●   Consider developing your own code only if:
    ●   No available solution fits your needs.
        – Or you can only find proprietary/evaluation software.
    ●   The performance of the existing solutions is really bad.
●   Example: pywikipediabot.
    ●   Simple library to query the Wikipedia API (see the sketch below).
    ●   Solves many simple needs of researchers/programmers.

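A minimal sketch of the kind of API call that pywikipediabot wraps, using only the Python 3 standard library against the public api.php endpoint; the JSON layout shown corresponds to formatversion=2 and may need adjusting for other format versions:

    import json
    import urllib.parse
    import urllib.request

    API = 'https://en.wikipedia.org/w/api.php'

    def get_wikitext(title):
        """Fetch the current wikitext of one page through the MediaWiki API."""
        params = urllib.parse.urlencode({
            'action': 'query',
            'titles': title,
            'prop': 'revisions',
            'rvprop': 'content',
            'rvslots': 'main',
            'format': 'json',
            'formatversion': '2',
        })
        request = urllib.request.Request(
            API + '?' + params,
            # Wikimedia asks API clients to identify themselves.
            headers={'User-Agent': 'wiki-research-example/0.1'})
        with urllib.request.urlopen(request) as response:
            data = json.loads(response.read().decode('utf-8'))
        page = data['query']['pages'][0]
        return page['revisions'][0]['slots']['main']['content']

    if __name__ == '__main__':
        print(get_wikitext('Wikimania')[:300])
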
#7 Automate everything
●   Huge data repositories.
●   Even small samples are excessively time-consuming if processed by hand.
●   You will start to chain individual processes together (see the sketch below).
    ●   You will save time in later executions.
    ●   Your study will be reproducible.
    ●   Updating results after several months becomes a no-brainer.

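A minimal sketch of such chaining, assuming three hypothetical stage scripts (download.py, parse.py, analyze.py) that each write one output file; a stage is skipped when its output already exists, so the whole study can be re-run months later with a single command:

    import os
    import subprocess
    import sys

    # Hypothetical pipeline stages: (command, file the stage produces).
    STAGES = [
        (['python', 'download.py'], 'dump.xml.bz2'),
        (['python', 'parse.py'],    'revisions.csv'),
        (['python', 'analyze.py'],  'results.csv'),
    ]

    def run_pipeline():
        for command, output in STAGES:
            if os.path.exists(output):
                print('skipping (already done):', ' '.join(command))
                continue
            print('running:', ' '.join(command))
            # Stop the whole pipeline if any stage fails.
            subprocess.check_call(command)

    if __name__ == '__main__':
        try:
            run_pipeline()
        except subprocess.CalledProcessError as error:
            sys.exit('pipeline failed: %s' % error)
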
#8 Extreme case of Murphy's Law
●   Always expect the worst possible case.
    ●   Many caveats in each implementation.
    ●   Countless particular cases.
●   Settling for the “average solution” is not enough (see the sketch below).
    – Standard algorithms may take much longer than expected to finish the job.

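A minimal defensive-processing sketch, assuming a hypothetical per-record function process_revision() applied to a long stream of revision dicts; the point is that one pathological record gets logged and skipped instead of killing a run that has been going for days:

    import logging

    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s %(levelname)s %(message)s')

    def process_revision(revision):
        """Hypothetical per-record analysis; replace with the real computation."""
        return len(revision.get('text') or '')

    def robust_run(revisions):
        processed, skipped, total_chars = 0, 0, 0
        for i, revision in enumerate(revisions, 1):
            try:
                total_chars += process_revision(revision)
                processed += 1
            except Exception:
                # The worst case *will* happen somewhere in a huge dump:
                # log it, skip it, keep going.
                logging.exception('skipping malformed revision #%d', i)
                skipped += 1
            if i % 1000000 == 0:
                logging.info('%d revisions seen (%d skipped)', i, skipped)
        return processed, skipped, total_chars

    if __name__ == '__main__':
        sample = [{'text': 'some wikitext'}, {'text': None}, {}]
        print(robust_run(sample))
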
#9 Not many graphical interfaces
●   Some good reasons for that:
    ●   Difficult to automate.
    ●   Hard to display dynamic results in real time.
    ●   Almost impossible to compute all results in a reasonable time frame for huge data collections (e.g. the English Wikipedia).
●   To the best of my knowledge, there are very few tools with graphical interfaces out there.
    ●   Is there a real need for them?

#10 Communication channels
●   Wikimedia-research-l
    ●   Mailing list about research on Wikimedia projects.
●   http://meta.wikimedia.org/wiki/Research
●   http://meta.wikimedia.org/wiki/Wikimedia_Research_Network
●   http://acawiki.org/Home
●   Final comments
    ●   Need for a consolidated info point, once and for all.