Measuring the New Wikipedia Community (PyData SV 2013)


Talk given by Ryan Faulkner at PyData Silicon Valley 2013

Published in: Technology

  1. Measuring the New Wikipedia Community
     PyData 2013
     Ryan Faulkner, Wikimedia Foundation
  2. Overview
     ○ Introduction
     ○ Problem & Motivation
     ○ Proposed Solution
     ○ User Metrics
     ○ A Short Example
     ○ Extending the Solution
     ○ Using the Tool
     ○ Live Demo!!
  3. Introduction
     Me: Data Analyst at Wikimedia
     ○ Machine Learning @ McGill
     ○ Fundraising - A/B testing
     ○ Editor Experiments - increasing the number of active editors
     Editor Engagement Experiments (E3) team @ the Wikimedia Foundation
     ○ Micro-feature experimentation
  4. Problem
     What's wrong with Wikipedia?
  5. Problem - Editor Decline
  6. Problem - Approach
     Can we stimulate the community of users to become more numerous and productive?
     ○ Focus on new users
       ■ Encourage contribution, make it easier
     ○ Lower the threshold for account creation
       ■ Bring more people in
     ○ Rapid experimentation on features that retain more users and stimulate increased participation
       ■ This will help us determine what works at lower cost
  7. Problem - Evaluation
     ○ Data Consistency
       ■ Anomaly Detection
       ■ Auto-correlation (seasonality)
     ○ "A/B" testing
       ■ Hypothesis testing - Student's t, chi-square
       ■ Linear / Logistic regression
     ○ Multivariate testing
       ■ Analysis of variance
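
To illustrate the chi-square test named on this slide, here is a minimal sketch for a 2x2 A/B table. The function name `chi_square_2x2` is hypothetical and the counts are made up for the example; they are not results from the talk.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table.

    `table` is [[a, b], [c, d]]: rows are experiment groups, columns are
    outcomes (e.g. "edited" / "did not edit"). Assumed input format.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Made-up counts: 30 of 100 control users edited, 45 of 100 test users edited.
stat = chi_square_2x2([[30, 70], [45, 55]])
significant = stat > 3.841  # 5% critical value for 1 degree of freedom
```

In practice a statistics library would also supply the p-value; the hand-rolled version only shows what the test computes.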
  8. Problem - What we need
     Currently a lot of the work around analysis is done manually and is a large drain on resources:
     ○ Faster data gathering
     ○ Knowing what we're logging and measuring & faster ETL
     ○ Faster analysis
     ○ Broadening service and iterating on results
  9. Problem - What we need
     Build better infrastructure around how we interpret and analyze our data.
     ○ Determine what to measure
       ■ Rigorously define relevant metrics
     ○ Expose the metrics from our data store
       ■ Python is great for writing code quickly to handle tasks with data
       ■ Library support for data analysis (pandas, numpy)
  10. Solution
      The tools to build.
  11. Solution - Proposed
      We need to measure user behaviour: "User Metrics" & "UMAPI"
      User Metrics & UMAPI: a Python implementation for gathering data from MediaWiki data stores, producing well-defined metrics, and facilitating subsequent modelling and analysis. This includes an interface for making different types of requests and returning standard responses.
  12. Solution - Why Bother
      What exactly do we gain by building these classes? Why not just query the database?
      1. Reproducibility & Standardization
      2. Extensibility
      3. Concise definition
      4. Faster turnaround
         a. Multiprocessing to optimize metrics generation (e.g. revert rate on 100K users: via MySQL = 24 hrs, via User Metrics < 10 mins)
  13. Solution - Why Python?
      Why not C++, Java, or PHP?
      1. Speed of development
      2. Simplify the code base & easy extensibility
         a. More "scientist friendly"
      3. Good support for data processing
      4. Better integration for downstream data analysis
      5. The way that metrics work lends them to "Pythonic" artifacts: list comprehensions, decorator patterns, duck typing, RESTful API
  14. User Metrics
      How do we form a picture about what happens on Wikipedia?
  15. User Metrics - User activity
      Events (not exhaustive):
      ■ Registration
      ■ Making an edit
      ■ Contributions of Namespaces
      ■ Reverting edits
      ■ Blocking
  16. User Metrics - What do we want to know about users?
      ○ How much do they contribute?
      ○ How often do they contribute?
      ○ Potential vandals. Do they go on to be reverted, blocked, banned?
  17. User Metrics - Metrics Definitions
      Metrics
        Survival(t): Boolean measure of an editor surviving beyond t
        Threshold(t, n): Boolean measure of an editor reaching activity threshold n by time t
        Live Account(t): Boolean measure of whether the new user clicks the edit button
      Volume Metrics
        Edit Rate: Float result of a user's rate of contribution
        Content: Integer bytes added by revision and edit count
        Sessions: Average session length (future)
        Time to Threshold: Time to reach a threshold (e.g. first edit)
  18. User Metrics - Metrics Definitions
      Content Quality
        Revert Rate: Float representing the proportion of revisions reverted
        Block: Boolean indicating a block event on the user
        Content Persistence: Integer indicating how long this user's edits survive (future)
      Contribution Type
        Namespace of Edits: Integer edit counts in all namespaces
        Scale of Change: Float representation of the fraction of total page content modified (future)
  19. User Metrics - Bytes Added
      [Diagram: user revision history (over a predefined period); Revision k: byte increase]
      Result: (user ID, bytes_added, bytes_removed, edit count)
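
The bytes-added tuple above can be sketched roughly as follows. This is an illustration only: the function name and the input format (pairs of user ID and signed byte change per revision) are assumptions, not the User Metrics library's actual interface.

```python
from collections import defaultdict

def bytes_added(revisions):
    """Aggregate per-user byte changes over a revision history.

    `revisions` is an iterable of (user_id, byte_delta) pairs, where
    byte_delta is the page length after the revision minus the length
    before it (assumed input format).
    Returns (user_id, bytes_added, bytes_removed, edit_count) tuples.
    """
    totals = defaultdict(lambda: [0, 0, 0])
    for user_id, delta in revisions:
        row = totals[user_id]
        if delta >= 0:
            row[0] += delta      # bytes added
        else:
            row[1] += -delta     # bytes removed, stored as a positive number
        row[2] += 1              # edit count
    return [(user_id, *row) for user_id, row in sorted(totals.items())]

# Hypothetical history: user 7 makes three edits, user 9 makes one.
history = [(7, 120), (7, -30), (9, 15), (7, 0)]
result = bytes_added(history)
```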
  20. User Metrics - Threshold
      [Diagram: user revision history (over a predefined period); registration; events since registration up to time "t"]
      if len(event_list) >= n:
          threshold_reached = True
      else:
          threshold_reached = False
      Result: (user ID, threshold_reached={0,1})
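
A runnable version of the threshold check on this slide might look like the sketch below. The function signature and the use of `datetime` timestamps are assumptions made for the example.

```python
from datetime import datetime, timedelta

def threshold_reached(registration, event_times, t_hours, n):
    """Return True if at least n events fall within t_hours of registration."""
    cutoff = registration + timedelta(hours=t_hours)
    # Keep only the events between registration and time "t".
    event_list = [ts for ts in event_times if registration <= ts <= cutoff]
    return len(event_list) >= n

# Hypothetical user: registered at noon, edits 1h, 5h and 30h later.
reg = datetime(2013, 3, 1, 12, 0)
edits = [reg + timedelta(hours=h) for h in (1, 5, 30)]

two_in_a_day = threshold_reached(reg, edits, t_hours=24, n=2)
three_in_a_day = threshold_reached(reg, edits, t_hours=24, n=3)
```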
  21. User Metrics - Revert Rate
      [Diagram: user revision history (over a predefined period); for each revision, look at the page history: past revisions (checksum i) and future revisions (checksum k)]
      if checksum i == checksum k:
          # reverted!
      Result: (user ID, revert_rate, total_revisions)
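
The checksum comparison can be sketched as below, assuming each page history is a chronological list of (user ID, checksum) pairs. The function names and the input format are hypothetical, and the extra check that the restoring revision differs from the revision under test is an assumption added to avoid counting identical re-saves.

```python
def reverted_indices(page_history):
    """Indices of revisions whose change was undone by a later revision.

    A revision j counts as reverted when some later revision k has the
    same checksum as some earlier revision i (i < j < k).
    """
    reverted = set()
    for j, (_, checksum_j) in enumerate(page_history):
        past = {checksum for _, checksum in page_history[:j]}
        for _, checksum_k in page_history[j + 1:]:
            if checksum_k in past and checksum_k != checksum_j:
                reverted.add(j)
                break
    return reverted

def revert_rate(user_id, page_histories):
    """Return (user_id, revert_rate, total_revisions) across page histories."""
    total = reverted = 0
    for history in page_histories:
        hits = reverted_indices(history)
        for idx, (uid, _) in enumerate(history):
            if uid == user_id:
                total += 1
                reverted += idx in hits
    rate = reverted / total if total else 0.0
    return (user_id, rate, total)

# Hypothetical histories: on the first page user 2's edit is undone by user 1.
pages = [
    [(1, "a"), (2, "b"), (1, "a")],
    [(2, "x"), (3, "y")],
]
```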
  22. User Metrics - Implementation
      1. MySQL & Redis (future) data store
         a. All of the backend dependency is abstracted out of the metrics classes
      2. Python implementation - MySQLdb (SQLAlchemy)
      3. Strategy Pattern of parent user metrics class
      4. Metrics built mainly from four core MediaWiki tables:
         a. revision, user, page, logging
      5. Python decorator methods for handling metric aggregation
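
As a rough idea of what a decorator for metric aggregation could look like, here is a minimal sketch. The names `aggregate`, `edit_count` and `EDIT_COUNTS` are hypothetical, and the in-memory dictionary stands in for the data store; the library's real decorators may work quite differently.

```python
from functools import wraps
from statistics import mean

def aggregate(func):
    """Decorator: lift a per-user metric so it can also return an aggregate."""
    @wraps(func)
    def wrapper(user_ids, aggregator=None):
        rows = [func(uid) for uid in user_ids]   # "raw" per-user results
        if aggregator is None:
            return rows
        return aggregator(value for _, value in rows)
    return wrapper

# Stand-in for the data store (assumption for the example).
EDIT_COUNTS = {1: 4, 2: 10, 3: 1}

@aggregate
def edit_count(user_id):
    """Per-user metric: (user_id, number of edits)."""
    return (user_id, EDIT_COUNTS.get(user_id, 0))

raw = edit_count([1, 2, 3])
total = edit_count([1, 2, 3], aggregator=sum)
average = edit_count([1, 2, 3], aggregator=mean)
```

Keeping the aggregation in a decorator means each metric only has to define its per-user computation.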
  23. User Metrics
  24. A Concrete Example
      How can we use this framework?
  25. Example - Post Edit Feedback
      What effect does editing feedback (confirmation/gratitude) have on new editors?
  26. Example - Results
  27. An Extended Solution
      Turn the data machine into a service.
  28. Editor Metrics go beyond feature experimentation ...
      It became clear that...
      ● We needed a service to let clients generate their own user metrics data sets
      ● We wanted to add a way for this methodology to extend beyond E3 and potentially WMF
      ● A force multiplier was necessary to iterate on editor data in more interesting ways (Machine Learning & more sophisticated analyses)
  29. User Metrics API [UMAPI]
      ○ Open source, (almost) RESTful API (Flask)
      ○ Computes metrics per user (User Metrics)
      ○ Combines metrics in different ways depending on request types
      ○ HTTP response in JSON with resulting data
      ○ Stores data internally for reuse
  30. UMAPI
  31. UMAPI - Overview
      ○ Services GET requests based on a combination of URL paths + query params, e.g. /cohort/metric?date_start=..&date_end=...&...
      ○ Define user "cohorts" on which to operate
      ○ API engine maps to a metrics request object (Mediator Pattern), which is handed off to a request manager that builds and runs the request
      ○ JSON response
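
To make the URL scheme concrete, the sketch below splits a request of the form /cohort/metric?... into a simple request dictionary using only the standard library. The function name, the cohort and metric names, and the date format in the example URL are all assumptions; the real service uses Flask routing and its own request object.

```python
from urllib.parse import urlsplit, parse_qs

def parse_request(url):
    """Split '/<cohort>/<metric>?<params>' into its components."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) != 2:
        raise ValueError("expected a path of the form /<cohort>/<metric>")
    cohort, metric = segments
    # parse_qs returns lists of values; keep the first value of each param.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {"cohort": cohort, "metric": metric, "params": params}

# Hypothetical request; names and date format are illustrative only.
req = parse_request("/test_group/bytes_added?date_start=20130101&date_end=20130201")
```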
  32. UMAPI - Overview
      ○ Basic cPickle file cache for responses
        ■ Can substitute another caching system (e.g. memcached)
        ■ Reuses request data where it overlaps
      ○ Request types:
        ■ "Raw" - metrics per user
        ■ Aggregation over cohorts: mean, sum, median, etc.
        ■ Time series requests
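
A file cache of the kind mentioned here can be sketched in a few lines. This example uses Python 3's `pickle` (the slide's `cPickle` is its Python 2 counterpart); the helper name `cached`, the cache key and the temporary directory are assumptions for illustration.

```python
import os
import pickle
import tempfile

CACHE_DIR = tempfile.mkdtemp(prefix="umapi_cache_")  # throwaway location

def cached(key, compute):
    """Return the pickled response for `key`, computing and storing it on a miss."""
    path = os.path.join(CACHE_DIR, key + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    value = compute()
    with open(path, "wb") as fh:
        pickle.dump(value, fh)
    return value

calls = []

def expensive_metric():
    """Stand-in for a slow metrics computation; records each invocation."""
    calls.append(1)
    return {"mean_edits": 5.0}

first = cached("cohortA_edit_rate", expensive_metric)   # computed
second = cached("cohortA_edit_rate", expensive_metric)  # read from disk
```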
  33. UMAPI Architecture
      [Diagram: HTTP GET request / JSON response; Apache, mod_wsgi, Flask / App Server; Request Notifications Listener, Request Control, Response Control; Cache; MediaWiki Slaves; User Metrics API; Messaging Queues; Metrics objects - Separate Processes; Asynchronous Callbacks]
  34. UMAPI Architecture - Listeners
      Request Notifications Callback
      ○ Handles managing and notifications on job status
      Request Controller
      ○ Queues requests
      ○ Spawns jobs from metrics objects
      ○ Coordinates parameters
      Response Controller
      ○ Reconstructs response data
      ○ Writes to cache
  35. UMAPI - User Cohorts
      We will want to consider large groups of users, for instance, a test or control group in some experiment:
      ○ Aggregate groups of users
        ■ Lists of user IDs
      ○ Cohort registration (under construction)
        ■ Adding new cohorts to the model
      ○ Single user endpoint
      ○ Boolean expressions over cohorts supported
  36. User Metric Periods
      How do we define the periods over which metrics are measured?
      ○ Registration
        ■ Look "t" hours since user registration
      ○ User Defined
        ■ User supplied start and end dates
      ○ Conditional Registration
        ■ Registration as above with condition that registration falls within input
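
The three period types can be sketched as small functions that return a (start, end) window. The function names are hypothetical, and treating the "input" for conditional registration as a user-supplied start and end date is an assumption.

```python
from datetime import datetime, timedelta

def registration_period(registered_at, t_hours):
    """Registration: the window covering t hours after the user registered."""
    return registered_at, registered_at + timedelta(hours=t_hours)

def user_defined_period(date_start, date_end):
    """User Defined: the window is exactly the supplied dates."""
    return date_start, date_end

def conditional_registration_period(registered_at, t_hours, date_start, date_end):
    """Conditional Registration: as registration_period, but only for users
    who registered inside [date_start, date_end]; otherwise None."""
    if not (date_start <= registered_at <= date_end):
        return None
    return registration_period(registered_at, t_hours)

# Hypothetical user and input range.
reg = datetime(2013, 3, 1, 12, 0)
window = registration_period(reg, 24)
inside = conditional_registration_period(
    reg, 24, datetime(2013, 3, 1), datetime(2013, 3, 31))
outside = conditional_registration_period(
    reg, 24, datetime(2013, 4, 1), datetime(2013, 4, 30))
```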
  37. UMAPI - RequestMeta Module
      ○ Mediator Pattern to handle passing request data among different portions of the architecture
      ○ Abstraction allows for easy filtering and default behaviour of request parameters
      ○ Requests can easily be turned into reproducible and unique hashes for caching
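
One way to get a reproducible hash from request parameters is to filter them, apply defaults, serialize with sorted keys, and hash the result. The sketch below shows that idea; the parameter names, defaults and the choice of SHA-1 are assumptions, not the RequestMeta module's actual behaviour.

```python
import hashlib
import json

# Hypothetical parameter whitelist and defaults.
ALLOWED = {"cohort", "metric", "date_start", "date_end", "aggregator", "time_series"}
DEFAULTS = {"aggregator": None, "time_series": False}

def request_hash(raw_params):
    """Filter params, apply defaults, and return a stable hex digest."""
    params = dict(DEFAULTS)
    params.update({k: v for k, v in raw_params.items() if k in ALLOWED})
    # sort_keys makes the serialization independent of insertion order.
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

a = request_hash({"cohort": "test_group", "metric": "revert_rate"})
b = request_hash({"metric": "revert_rate", "cohort": "test_group", "debug": "1"})
c = request_hash({"cohort": "test_group", "metric": "bytes_added"})
```

Here `a` and `b` describe the same request (different key order, one unknown parameter dropped) and so share a cache key, while `c` does not.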
  38. How the Service Works
      The user experience with user metrics.
  39. UMAPI - Pipeline
      [Diagram labels: Cohort or combo; Raw; Time Series; Aggregator; Params; Aggregator Params; JSON]
  40. UMAPI - Frontend Flow
  41. Job Queue
      As you fire off requests the queue tracks what's running:
  42. Response - Bytes Added
  43. Response - Threshold
  44. Response - Edit Rate
  45. Response - Threshold w/ params
  46. Response - Aggregation
  47. Response - Aggregation
  48. Response - Time series
  49. Response - Combining Cohorts
      "usertags_meta" - cohort definitions
  50. Response - Combining Cohorts
      Two intersecting cohorts:
  51. Response - Combining Cohorts
      AND (&)
  52. Response - Combining Cohorts
      OR (~)
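
The AND (&) and OR (~) combinations can be read as set intersection and union over lists of user IDs. The sketch below evaluates such an expression left to right; the function name, the cohort names and the lack of operator precedence or parentheses are assumptions for illustration.

```python
import re

def combine_cohorts(expression, cohorts):
    """Evaluate e.g. 'a&b~c' left to right over sets of user IDs.

    '&' is AND (intersection) and '~' is OR (union), as on the slides.
    """
    tokens = re.findall(r"[&~]|[^&~\s]+", expression)
    result = set(cohorts[tokens[0]])
    for op, name in zip(tokens[1::2], tokens[2::2]):
        members = set(cohorts[name])
        result = result & members if op == "&" else result | members
    return result

# Hypothetical cohorts that share user 3.
cohorts = {"test": [1, 2, 3], "control": [3, 4]}
both = combine_cohorts("test&control", cohorts)
either = combine_cohorts("test~control", cohorts)
```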
  53. Response - Single user endpoint
      e.g.
  54. Looking ahead ...
      Connectivity metrics (additional metrics)
      ○ Graph database? (Neo4j, Gremlin w/ PostgreSQL)
      ○ User talk and common article edits
      Better in-memory modelling
      ○ python-memcached
      ○ Better reuse of generated data based on request data
      Beyond English Wikipedia
      ○ Implemented!
  55. Looking ahead ...
      More sophisticated and robust data modelling
      ○ Modelling richer data: contribution histories, articles edited, aggregate metrics
      ○ Classification: Logistic classifiers, Support Vector Machine, Deep Belief Networks, Dimensionality Reduction
      ○ Modelling revision text - Neural Networks, Hidden Markov Models
  56. DEMO!!
  57. The End