Measuring the New Wikipedia Community (PyData SV 2013)
Talk given by Ryan Faulkner at PyData Silicon Valley 2013

Measuring the New Wikipedia Community (PyData SV 2013) Presentation Transcript

  • 1. Measuring the New Wikipedia Community
    PyData 2013
    Ryan Faulkner (rfaulkner@wikimedia.org)
    Wikimedia Foundation
  • 2. Overview
    Introduction
    Problem & Motivation
    Proposed Solution
    User Metrics
    A Short Example
    Extending the Solution
    Using the Tool
    Live Demo!!
  • 3. Introduction
    Me: Data Analyst at Wikimedia
    Machine Learning @ McGill
    Fundraising - A/B testing
    Editor Experiments - increasing the number of active editors
    Editor Engagement Experiments (E3) team @ the Wikimedia Foundation
    Micro-feature experimentation
  • 4. Problem
    What's wrong with Wikipedia?
  • 5. Problem - Editor Decline
    http://strategy.wikimedia.org/wiki/Editor_Trends_Study
  • 6. Problem - Approach
    Can we stimulate the community of users to become more numerous and productive?
    ○ Focus on new users
      ■ Encourage contribution, make it easier
    ○ Lower the threshold for account creation
      ■ Bring more people in.
    ○ Rapid experimentation on features that retain more users and stimulate increased participation.
      ■ This will help us determine what works with less cost
  • 7. Problem - Evaluation
    ○ Data consistency
      ■ Anomaly detection
      ■ Auto-correlation (seasonality)
    ○ "A/B" testing
      ■ Hypothesis testing - Student's t, chi-square
      ■ Linear / logistic regression
    ○ Multivariate testing
      ■ Analysis of variance
  • 8. Problem - What we need
    Currently a lot of the analysis work is done manually and is a large drain on resources:
    ○ Faster data gathering
    ○ Knowing what we're logging and measuring & faster ETL
    ○ Faster analysis
    ○ Broadening service and iterating on results
  • 9. Problem - What we need
    Build better infrastructure around how we interpret and analyze our data.
    ○ Determine what to measure.
      ■ Rigorously define relevant metrics
    ○ Expose the metrics from our data store
      ■ Python is great for writing code quickly to handle tasks with data
      ■ Library support for data analysis (pandas, numpy)
  • 10. Solution
    The tools to build.
  • 11. Solution - Proposed
    We need to measure user behaviour: "User Metrics" & "UMAPI"
    A Python implementation for gathering data from MediaWiki data stores, producing well-defined metrics, and facilitating subsequent modelling and analysis. This includes an interface for making different types of requests and returning standard responses.
  • 12. Solution - Why Bother
    What exactly do we gain by building these classes? Why not just query the database?
    1. Reproducibility & standardization
    2. Extensibility
    3. Concise definition
    4. Faster turnaround
       a. Multiprocessing to optimize metrics generation (e.g. revert rate on 100K users: via MySQL = 24 hrs, via User Metrics < 10 mins)
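The multiprocessing speed-up the slide mentions can be sketched roughly as follows. All names here are illustrative, and the per-user function is a stand-in: the real user_metrics code queries MediaWiki's MySQL tables rather than computing a fake value.

```python
from multiprocessing import Pool

def compute_revert_rate(user_id):
    """Hypothetical per-user metric. In the real system this would
    query the revision tables; here we return a deterministic fake."""
    return user_id, (user_id % 10) / 10.0

def metric_for_users(user_ids, processes=4):
    """Fan the user list out across worker processes and collect the
    per-user results into a {user_id: metric} dict."""
    with Pool(processes) as pool:
        return dict(pool.map(compute_revert_rate, user_ids))

if __name__ == "__main__":
    results = metric_for_users(range(100), processes=4)
    print(len(results))  # 100
```

Because each user's metric is independent, the work partitions cleanly across processes, which is why the slide's 100K-user job drops from hours to minutes.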
  • 13. Solution - Why Python?
    Why not C++, Java, or PHP?
    1. Speed of development
    2. Simpler code base & easy extensibility
       a. more "scientist friendly"
    3. Good support for data processing
    4. Better integration for downstream data analysis
    5. The way that metrics work lends itself to "Pythonic" artifacts: list comprehensions, decorator patterns, duck typing, a RESTful API.
  • 14. User Metrics
    How do we form a picture about what happens on Wikipedia?
  • 15. User Metrics - User activity
    Events (not exhaustive):
    ■ Registration
    ■ Making an edit
    ■ Contributions by namespace
    ■ Reverting edits
    ■ Blocking
  • 16. User Metrics - What do we want to know about users?
    ○ How much do they contribute?
    ○ How often do they contribute?
    ○ Are they potential vandals? Do they go on to be reverted, blocked, banned?
  • 17. User Metrics - Metrics Definitions
    https://meta.wikimedia.org/wiki/Research:Metrics
    Retention metrics:
    Survival(t) - Boolean: did the editor survive beyond time t?
    Threshold(t, n) - Boolean: did the editor reach activity threshold n by time t?
    Live Account(t) - Boolean: did the new user click the edit button?
    Volume metrics:
    Edit Rate - float: the user's rate of contribution
    Content - integer: bytes added by revision, and edit count
    Sessions - average session length (future)
    Time to Threshold - time to reach a threshold (e.g. first edit)
  • 18. User Metrics - Metrics Definitions
    Content quality:
    Revert Rate - float: the proportion of revisions reverted
    Block - Boolean: a block event on the user
    Content Persistence - integer: how long this user's edits survive (future)
    Contribution type:
    Namespace of Edits - integer: edit counts in all namespaces
    Scale of Change - float: fraction of total page content modified (future)
  • 19. User Metrics - Bytes Added
    [Diagram: the user's revision history (over a predefined period) is scanned; each revision k contributes its byte increase, yielding (user ID, bytes_added, bytes_removed, edit count).]
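The diagram above reduces to a simple fold over the revision history. Here is a minimal sketch, assuming each revision is represented as a (revision length, parent length) pair; the function name and representation are illustrative, not the real user_metrics API.

```python
def bytes_added(revisions):
    """Given a user's revisions over a predefined period, as
    (rev_len, parent_len) byte-length pairs, return the slide's
    output tuple: (bytes_added, bytes_removed, edit_count)."""
    added = removed = 0
    for rev_len, parent_len in revisions:
        delta = rev_len - parent_len
        if delta >= 0:
            added += delta
        else:
            removed += -delta
    return added, removed, len(revisions)

# Three edits: +120 bytes, -30 bytes, +5 bytes.
print(bytes_added([(220, 100), (190, 220), (195, 190)]))  # (125, 30, 3)
```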
  • 20. User Metrics - Threshold
    [Diagram: events since registration up to time "t" are collected from the user's revision history (over a predefined period), yielding (user ID, threshold_reached={0,1}).]
    threshold_reached = len(event_list) >= n
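The slide's pseudocode can be made runnable as below. This is a sketch of Threshold(t, n) as defined on slide 17, not the actual user_metrics implementation; the argument names are assumptions.

```python
from datetime import datetime, timedelta

def threshold(registration, events, t_hours, n):
    """Threshold(t, n): did the user log at least n events within
    t hours of registration?"""
    cutoff = registration + timedelta(hours=t_hours)
    event_list = [e for e in events if registration <= e <= cutoff]
    return len(event_list) >= n

reg = datetime(2013, 1, 1)
edits = [reg + timedelta(hours=h) for h in (1, 5, 30)]
print(threshold(reg, edits, t_hours=24, n=2))  # True: 2 edits within 24h
print(threshold(reg, edits, t_hours=24, n=3))  # False: the third edit is too late
```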
  • 21. User Metrics - Revert Rate
    [Diagram: for each revision in the user's revision history (over a predefined period), look at the page history. If the checksum of a future revision equals the checksum of a past revision, the revision was reverted. Yields (user ID, revert_rate, total_revisions).]
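The checksum comparison in the diagram is an identity-revert check: an edit is reverted when a later revision restores the page to an earlier state. A set-based sketch under that definition (helper names and the lookahead window are assumptions, not the real user_metrics code):

```python
def is_reverted(page_history, k, lookahead=5):
    """Revision k of a page is reverted if some revision within the
    lookahead window restores the checksum of a revision before k."""
    past = {rev["sha1"] for rev in page_history[:k]}
    future = page_history[k + 1 : k + 1 + lookahead]
    return any(rev["sha1"] in past for rev in future)

def revert_rate(user_revisions):
    """(revert_rate, total_revisions) for a user's revisions, each
    given as (page_history, index_of_users_revision)."""
    total = len(user_revisions)
    reverted = sum(is_reverted(h, k) for h, k in user_revisions)
    return (reverted / total if total else 0.0), total

# Page states A -> B -> A: the middle edit (index 1) was reverted.
history = [{"sha1": "A"}, {"sha1": "B"}, {"sha1": "A"}]
print(revert_rate([(history, 1)]))  # (1.0, 1)
```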
  • 22. User Metrics - Implementation
    https://github.com/wikimedia/user_metrics
    1. MySQL & Redis (future) data store
       a. All of the backend dependency is abstracted out of the metrics classes
    2. Python implementation - MySQLdb (SQLAlchemy)
    3. Strategy pattern for the parent user metrics class
    4. Metrics built mainly from four core MediaWiki tables:
       a. revision, user, page, logging
    5. Python decorator methods for handling metric aggregation
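Point 5 can be illustrated with a toy decorator: wrap a per-user metric so it can be applied to a whole cohort. This is only a sketch of the pattern, not the actual decorators in the user_metrics repository, and the fake lookup table stands in for a query against the MediaWiki tables.

```python
def aggregator(func):
    """Lift a per-user metric into a cohort-level mean."""
    def over_cohort(user_ids):
        values = [func(uid) for uid in user_ids]
        return sum(values) / len(values) if values else 0.0
    return over_cohort

@aggregator
def edit_count(user_id):
    # Stand-in for a query against the MediaWiki revision table.
    fake_db = {1: 10, 2: 30, 3: 20}
    return fake_db.get(user_id, 0)

print(edit_count([1, 2, 3]))  # 20.0
```

The appeal of the decorator approach is that each metric class only has to define its per-user computation; the aggregation behaviour (mean, sum, median, ...) is layered on separately.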
  • 23. User Metrics
  • 24. A Concrete Example
    How can we use this framework?
  • 25. Example - Post Edit Feedback
    What effect does editing feedback (confirmation/gratitude) have on new editors?
  • 26. Example - Results
  • 27. An Extended Solution
    Turn the data machine into a service.
  • 28. Editor metrics go beyond feature experimentation ...
    It became clear that...
    ● We needed a service to let clients generate their own user metrics data sets
    ● We wanted a way for this methodology to extend beyond E3 and potentially the WMF
    ● A force multiplier was necessary to iterate on editor data in more interesting ways (machine learning & more sophisticated analyses)
  • 29. User Metrics API [UMAPI]
    Open-source, (almost) RESTful API (Flask)
    Computes metrics per user (User Metrics)
    Combines metrics in different ways depending on request types
    HTTP response in JSON with the resulting data
    Stores data internally for reuse
  • 30. UMAPI
    http://metrics.wikimedia.org/
    https://github.com/wikimedia/user_metrics
    https://github.com/rfaulkner/E3_analysis
    https://pypi.python.org/pypi/wmf_user_metrics/0.1.3-dev
  • 31. UMAPI - Overview
    Serves GET requests based on a combination of URL paths + query params,
    e.g. /cohort/metric?date_start=..&date_end=...&...
    Define user "cohorts" on which to operate
    The API engine maps the request to a metrics request object (Mediator pattern), which is handed off to a request manager that builds and runs the request
    JSON response
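The path-plus-params mapping can be sketched without the full Flask stack. Assuming the /cohort/metric URL shape from the slide, this hypothetical helper shows how a GET path might be turned into the request object that gets handed to the request manager (the dict stands in for UMAPI's actual request classes):

```python
from urllib.parse import urlparse, parse_qs

def route(url):
    """Map a UMAPI-style path like
    /cohort/metric?date_start=...&date_end=... onto a request object."""
    parsed = urlparse(url)
    _, cohort, metric = parsed.path.split("/")
    params = {k: v[0] for k, v in parse_qs(parsed.query).items()}
    return {"cohort": cohort, "metric": metric, "params": params}

req = route("/e3_pef1_confirmation/threshold?date_start=20130101&date_end=20130201")
print(req["cohort"], req["metric"])  # e3_pef1_confirmation threshold
```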
  • 32. UMAPI - Overview
    Basic cPickle file cache for responses
    The caching system can be substituted (e.g. memcached)
    Request data is reused where it overlaps
    Request types:
    "Raw" - metrics per user
    Aggregation over cohorts: mean, sum, median, etc.
    Time series requests
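A file cache of pickled responses is only a few lines. The class below is a minimal sketch of the idea, not UMAPI's actual cache (which used cPickle, Python 2's C implementation of today's pickle module); swapping in memcached would mean replacing the get/set bodies while keeping the same interface.

```python
import os
import pickle
import tempfile

class FileCache:
    """Pickle each response to disk under its request key so a
    repeated request can be served without recomputation."""
    def __init__(self, directory):
        self.directory = directory

    def _path(self, key):
        return os.path.join(self.directory, key + ".pkl")

    def get(self, key):
        try:
            with open(self._path(key), "rb") as f:
                return pickle.load(f)
        except FileNotFoundError:
            return None

    def set(self, key, value):
        with open(self._path(key), "wb") as f:
            pickle.dump(value, f)

cache = FileCache(tempfile.mkdtemp())
cache.set("cohortX_threshold", {"threshold": [1, 0, 1]})
print(cache.get("cohortX_threshold"))  # {'threshold': [1, 0, 1]}
```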
  • 33. UMAPI Architecture
    [Diagram: an HTTP GET request passes through Apache (mod_wsgi) to the Flask app server, which talks to the request controller, response controller, and cache; metrics objects run as separate processes against the MediaWiki slaves, coordinated over messaging queues with asynchronous callbacks to a request notifications listener. A JSON response is returned to the client.]
  • 34. UMAPI Architecture - Listeners
    Request notifications callback
      Manages and issues notifications on job status
    Request controller
      Queues requests
      Spawns jobs from metrics objects
      Coordinates parameters
    Response controller
      Reconstructs response data
      Writes to the cache
  • 35. UMAPI - User Cohorts
    We will want to consider large groups of users, for instance a test or control group in some experiment:
    Aggregate groups of users - lists of user IDs
    Cohort registration (under construction) - adding new cohorts to the model
    Single-user endpoint
    Boolean expressions over cohorts supported
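Boolean expressions over cohorts reduce to set operations on user-ID lists. A sketch using the operator symbols shown on the later "Combining Cohorts" slides ('&' for AND, '~' for OR); the function itself is illustrative, not the API's expression parser:

```python
def combine(cohort_a, cohort_b, op):
    """Combine two cohorts (lists of user IDs) with a boolean
    operator: '&' intersects, '~' unions."""
    a, b = set(cohort_a), set(cohort_b)
    if op == "&":
        return sorted(a & b)
    if op == "~":
        return sorted(a | b)
    raise ValueError("unknown operator: " + op)

control = [1, 2, 3, 4]
test = [3, 4, 5]
print(combine(control, test, "&"))  # [3, 4]
print(combine(control, test, "~"))  # [1, 2, 3, 4, 5]
```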
  • 36. User Metric Periods
    How do we define the periods over which metrics are measured?
    Registration - look "t" hours since user registration
    User Defined - user-supplied start and end dates
    Conditional Registration - registration as above, with the condition that registration falls within the input window
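The three period types can be resolved with a small helper. This is a sketch of the behaviour described on the slide; the mode strings and keyword names are assumptions, not the user_metrics API.

```python
from datetime import datetime, timedelta

def metric_period(mode, registration=None, start=None, end=None, t=24):
    """Resolve the (start, end) measurement window for a user."""
    if mode == "registration":
        return registration, registration + timedelta(hours=t)
    if mode == "user_defined":
        return start, end
    if mode == "conditional_registration":
        if start <= registration <= end:
            return registration, registration + timedelta(hours=t)
        return None  # user excluded: registered outside the input window
    raise ValueError(mode)

reg = datetime(2013, 3, 18)
print(metric_period("registration", registration=reg, t=72))
```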
  • 37. UMAPI - RequestMeta Module
    Mediator pattern to handle passing request data among different portions of the architecture
    The abstraction allows easy filtering and default behaviour for request parameters
    Requests can easily be turned into reproducible, unique hashes for caching
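The reproducible-hash idea amounts to canonicalizing the request before hashing, so that the same cohort, metric, and parameters always yield the same cache key regardless of parameter order. A sketch (not RequestMeta's actual serialization format):

```python
import hashlib

def request_hash(cohort, metric, params):
    """Deterministic cache key for a (cohort, metric, params) request."""
    canonical = "|".join([
        cohort,
        metric,
        "&".join("%s=%s" % kv for kv in sorted(params.items())),
    ])
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

h1 = request_hash("e3_test", "threshold", {"t": "24", "n": "1"})
h2 = request_hash("e3_test", "threshold", {"n": "1", "t": "24"})
print(h1 == h2)  # True: parameter order does not change the key
```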
  • 38. How the Service Works
    The user experience with user metrics.
  • 39. UMAPI - Pipeline
    [Diagram: a cohort (or combination of cohorts) plus params flows through one of three pipelines - raw, time series, or aggregator (with aggregator params) - each producing a JSON response.]
  • 40. UMAPI - Frontend Flow
  • 41. Job Queue
    As you fire off requests, the queue tracks what's running:
  • 42. Response - Bytes Added
  • 43. Response - Threshold
  • 44. Response - Edit Rate
  • 45. Response - Threshold w/ params
  • 46. Response - Aggregation
  • 47. Response - Aggregation
  • 48. Response - Time series
  • 49. Response - Combining Cohorts
    "usertags_meta" - cohort definitions
  • 50. Response - Combining Cohorts
    Two intersecting cohorts:
  • 51. Response - Combining Cohorts
    AND (&)
  • 52. Response - Combining Cohorts
    OR (~)
  • 53. Response - Single user endpoint
    e.g. http://metrics-api.wikimedia.org/user/Renklauf/threshold?t=10000
  • 54. Looking ahead ...
    Connectivity metrics (additional metrics)
    ○ Graph database? (Neo4j, Gremlin w/ PostgreSQL)
    ○ User talk and common article edits
    Better in-memory modelling
    ○ python-memcached
    ○ better reuse of generated data based on request data
    Beyond English Wikipedia - implemented!
  • 55. Looking ahead ...
    More sophisticated and robust data modelling
    ○ Modelling richer data: contribution histories, articles edited, aggregate metrics
    ○ Classification: logistic classifiers, support vector machines, deep belief networks, dimensionality reduction
    ○ Modelling revision text - neural networks, hidden Markov models
  • 56. DEMO!!
    http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/threshold
    http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/threshold?aggregator=average
    http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/edit_rate
    http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/edit_rate?aggregator=dist
    http://metrics.wikimedia.org/cohorts/ryan_test_2/bytes_added?time_series&start=20120101&end=20130101&aggregator=sum&group=input&interval=720
  • 57. The End
    http://metrics.wikimedia.org/
    stat1.wikimedia.org:4000
    https://github.com/wikimedia/user_metrics
    https://github.com/rfaulkner/E3_analysis
    https://pypi.python.org/pypi/wmf_user_metrics/0.1.3-dev
    Questions?