Discusses the challenges that a strong privacy stance poses for the Wikimedia Foundation (WMF), including how it affects data collection, aggregation, and preservation practices. The presentation details some creative workarounds that allow WMF to calculate metrics in a privacy-conscious way.
12. By Lane Hartwell - Photographed by Lane Hartwell (http://fetching.net/) on behalf of the Wikimedia Foundation, CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=8927361
Free as in “free” access
18. You should not have to provide any
information to participate in the free
knowledge movement.
There cannot be access to free
knowledge without a strong
guarantee of privacy.
25. No data is sent to third parties or sold for
marketing purposes.
We only retain personal data for 90 days.
IPs and user agents (in raw format) are
deleted after 90 days.
Articles browsed by readers are retained at
most 90 days, if retained at all, then only in
aggregate form
26. Photo by Mikito Tateisi on Unsplash
The Technical Challenge of
Privacy
28. No data is sent to third parties or sold
for marketing purposes.
Translation:
Commercial solutions to gather
user data are not an option.
30. We only retain personal data and session
data for 90 days. IPs and user agents (in raw
format) are deleted after 90 days.
Translation:
We need to build an ETL system in which it is
easy to selectively sanitize and aggregate
data at scale.
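A minimal sketch of what such a selective retention step might look like, in Python. Everything in it is an assumption made for illustration: the Request record, the apply_retention helper, and the reading of the policy as "keep raw records for 90 days, then keep only per-day page counts". It is not WMF's actual pipeline.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, timedelta

RETENTION = timedelta(days=90)  # retention window stated on the slides


@dataclass(frozen=True)
class Request:  # hypothetical record layout
    day: date
    ip: str          # raw IP: personal data
    user_agent: str  # raw user agent: personal data
    page: str


def apply_retention(log, today):
    """Split a log into raw records still inside the retention window
    and aggregate (day, page) counts for everything older."""
    recent = [r for r in log if today - r.day <= RETENTION]
    # Older records lose IP and user agent entirely; only counts survive.
    aggregated = Counter(
        (r.day, r.page) for r in log if today - r.day > RETENTION
    )
    return recent, aggregated


# Made-up sample data (documentation-range IPs, invented user agents).
log = [
    Request(date(2018, 3, 1), "198.51.100.7", "ExampleBrowser/1.0", "Bike"),
    Request(date(2017, 11, 1), "203.0.113.9", "ExampleBrowser/1.0", "Airplane"),
    Request(date(2017, 11, 1), "203.0.113.10", "OtherBrowser/2.0", "Airplane"),
]
recent, aggregated = apply_retention(log, today=date(2018, 3, 15))
```

With this sample, the March request is still held in raw form, while the two November requests survive only as a count of 2 for the page "Airplane".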
49. Timestamp   IP       Page      Cookies
    2018-03-01  343.4.*  Bike      -
    2018-03-01  343.4.*  Airplane  Session=1011
    2018-03-01  747.4.*  Milan     Session=1011
    2018-03-01  345.2.*  Rome      Session=1011
50. Timestamp   IP       Page      Cookies
    2018-03-01  USA      Bike      -
    2018-03-01  USA      Airplane  Session=1011
    2018-03-01  Italy    Milan     Session=1011
    2018-03-01  Italy    Rome      Session=1011
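The step from the previous table to this one, replacing each IP prefix with a country, could be sketched roughly as follows. The Row type, the coarsen_ip helper, and the COUNTRY_BY_PREFIX table are hypothetical names introduced here; the table simply hard-codes the made-up prefixes from the slide, whereas a real pipeline would presumably consult a geolocation database.

```python
from typing import NamedTuple


class Row(NamedTuple):  # hypothetical row layout mirroring the slide
    timestamp: str
    ip: str
    page: str
    cookies: str


# Illustrative mapping only: these prefixes are the slide's invented values.
COUNTRY_BY_PREFIX = {"343.4.": "USA", "747.4.": "Italy", "345.2.": "Italy"}


def coarsen_ip(row: Row) -> Row:
    """Return a copy of the row whose IP field holds only a country."""
    prefix = row.ip.rstrip("*")  # "343.4.*" -> "343.4."
    country = COUNTRY_BY_PREFIX.get(prefix, "Unknown")
    return row._replace(ip=country)


raw = [
    Row("2018-03-01", "343.4.*", "Bike", "-"),
    Row("2018-03-01", "343.4.*", "Airplane", "Session=1011"),
    Row("2018-03-01", "747.4.*", "Milan", "Session=1011"),
    Row("2018-03-01", "345.2.*", "Rome", "Session=1011"),
]
sanitized = [coarsen_ip(r) for r in raw]
```

Note that the session cookie is left untouched, so requests from different networks can still be linked through Session=1011 even after the IP has been coarsened.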
61. Timestamp   IP       Page      Cookies
    2018-03-01  123.4.*  Bike      -
    2018-03-15  123.4.*  Airplane  Last-Access=2018-03-01
    2018-03-15  776.9.*  Milan     Last-Access=2018-03-15
62. March 15th, more visits...
    Timestamp   IP       Page      Cookies
    2018-03-15  113.4.*  Tacos     Last-Access=2018-03-15
    2018-03-15  113.4.*  Sushi     Last-Access=2018-01-01
63. Timestamp   IP       Page      Cookies
    2018-03-01  123.4.*  Bike      -
    2018-03-15  123.4.*  Airplane  Last-Access=2018-03-01
    2018-03-15  776.9.*  Milan     Last-Access=2018-03-01
64. Timestamp   IP       Page      Cookies
    2018-03-01  123.4.*  Bike      -
    2018-03-15  123.4.*  Airplane  Last-Access=2018-03-01
    2018-03-15  223.2.*  Tacos     Last-Access=2018-03-15
    2018-03-15  776.9.*  Milan     Last-Access=2018-03-01
    2018-03-15  223.2.*  Sushi     Last-Access=2018-01-01
65. Timestamp   IP       Page      Cookies
    2018-03-15  133.1.*  Ronaldo   Last-Access=2018-02-25
    2018-03-01  123.4.*  Bike      -
    2018-03-15  123.4.*  Airplane  Last-Access=2018-03-01
    2018-03-15  223.2.*  Tacos     Last-Access=2018-03-15
    2018-03-15  776.9.*  Milan     Last-Access=2018-03-15
    2018-03-15  223.2.*  Sushi     Last-Access=2018-01-01
66. Timestamp   IP       Page      Cookies
    2018-03-15  133.1.*  Ronaldo   Last-Access=2018-02-25
    2018-03-01  123.4.*  Bike      -
    2018-03-15  123.4.*  Airplane  Last-Access=2018-03-01
    2018-03-15  223.2.*  Tacos     Last-Access=2018-03-15
    2018-03-15  776.9.*  Milan     Last-Access=2018-03-15
    2018-03-15  223.2.*  Sushi     Last-Access=2018-01-01
How many uniques?
69. Timestamp   IP       Page      Cookies
    2018-03-15  133.1.*  Ronaldo   Last-Access=2018-02-25
    2018-03-01  123.4.*  Bike      -
    2018-03-15  123.4.*  Airplane  Last-Access=2018-03-01
    2018-03-15  223.2.*  Tacos     Last-Access=2018-03-15
    2018-03-15  776.9.*  Milan     Last-Access=2018-03-15
    2018-03-15  223.2.*  Sushi     Last-Access=2018-01-01
Lessens identifiability
70. Timestamp   IP       Page      Cookies
    2018-03-15  133.1.*  Ronaldo   Last-Access=2018-02-25
    2018-03-01  123.4.*  Bike      -
    2018-03-15  123.4.*  Airplane  Last-Access=2018-03-01
    2018-03-15  223.2.*  Tacos     Last-Access=2018-03-15
    2018-03-15  776.9.*  Milan     Last-Access=2018-03-15
    2018-03-15  223.2.*  Sushi     Last-Access=2018-01-01
Not able to reconstruct all sessions
71. Timestamp   UA + IP?  Page      Cookies
    2018-03-15  133.1.*   Ronaldo   Last-Access=2018-02-25
    2018-03-01  123.4.*   Bike      -
    2018-03-15  123.4.*   Airplane  Last-Access=2018-03-01
    2018-03-15  223.2.*   Tacos     Last-Access=2018-03-15
    2018-03-15  776.9.*   Milan     Last-Access=2018-03-15
    2018-03-15  223.2.*   Sushi     Last-Access=2018-01-01
Not able to reconstruct all sessions
72. The caveats of the last-access solution are
much the same as those of the token
solution, but it lessens identifiability
significantly.
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Unique_Devices/Last_access_solution
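One way to read the Last-Access tables is that a request counts as a new device for a period when its cookie shows a last visit before that period began. The Python sketch below applies that rule to the six rows from slide 66. The Request type and the helper functions are hypothetical, and treating every cookie-less request as a new device is a simplifying assumption of this sketch; the page linked above describes the production approach.

```python
from datetime import date
from typing import NamedTuple, Optional


class Request(NamedTuple):  # hypothetical record; IP is not needed
    day: date
    page: str
    last_access: Optional[date]  # Last-Access cookie value, None if absent


def is_new_device(req: Request, period_start: date) -> bool:
    # Assumption: no cookie means the device has not been seen before.
    return req.last_access is None or req.last_access < period_start


def daily_uniques(requests, day: date) -> int:
    return sum(1 for r in requests if r.day == day and is_new_device(r, day))


def monthly_uniques(requests, year: int, month: int) -> int:
    start = date(year, month, 1)
    return sum(
        1
        for r in requests
        if (r.day.year, r.day.month) == (year, month)
        and is_new_device(r, start)
    )


# Rows from slide 66 (IP column dropped, since the rule never uses it).
requests = [
    Request(date(2018, 3, 15), "Ronaldo", date(2018, 2, 25)),
    Request(date(2018, 3, 1), "Bike", None),
    Request(date(2018, 3, 15), "Airplane", date(2018, 3, 1)),
    Request(date(2018, 3, 15), "Tacos", date(2018, 3, 15)),
    Request(date(2018, 3, 15), "Milan", date(2018, 3, 15)),
    Request(date(2018, 3, 15), "Sushi", date(2018, 1, 1)),
]

march_15 = daily_uniques(requests, date(2018, 3, 15))
march = monthly_uniques(requests, 2018, 3)
```

Under these assumed rules the sample yields 3 unique devices for March 15th (Ronaldo, Airplane, Sushi) and 3 for March as a whole (Ronaldo, Bike, Sushi). No per-device identifier is needed for either count, which is why individual sessions cannot be fully reconstructed from the log.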