A brief overview of the history of data mining in Google’s companies and services, focusing on Gmail.
Drawn from court case documents.
Based on an article by Jeff Gould; you can find the original article here:
https://medium.com/@jeffgould/the-natural-history-of-gmail-data-mining-be115d196b10
5. A court case reveals a trove of documents about Gmail’s inner workings
✖In late 2010 .. ads depend on emails.
✖The most serious legal challenge.
✖Illegal data mining.
10. Gmail’s early history
✖Launched in 2004.
✖Yahoo and Microsoft’s Hotmail had offered
webmail since the 90s.
✖Vast amount of storage space per
user.
✖It would be free to users and earn
revenue through advertising.
12. Gmail data mining
✖The first version of ad serving in
Gmail exploited only concepts directly
extracted from message texts and did
little or no user profiling.
13. Gmail’s original patented data mining scheme (2013)
✖“Internal” and “external” message
attributes that can be used in any combination
to extract the meaning of an email and
select the best ads to match it.
14. Gmail’s original patented data mining scheme (2013)
Internal Email Information:
✖Info. from a subject line.
✖Info. from body text.
✖A sender name and/or email address.
✖One or more recipient names and/or
email addresses.
✖Recipient type (e.g., direct recipient,
cc, bcc).
✖Text extracted from an email address.
✖Embedded information (e.g., business
card file, an image).
15. Gmail’s original patented data mining scheme (2013) (continued)
Internal Email Information:
✖Linked Info. (e.g., info. from a web
page linked to from the email).
✖Attached info. (e.g., word processor
files, images, spreadsheets, etc.).
16. Gmail’s original patented data mining scheme (2013) (continued)
External Email Information:
✖Info extracted or derived from search
results returned in response to a search
query composed of extracted email info.
✖Info about the sender, for example
derived from previous interactions with
the recipient.
✖Info from other emails sent by sender
and/or received by the recipient.
✖Info from common directory to
embedded info (e.g., a Word file).
17. Gmail’s original patented data mining scheme (2013) (continued)
External Email Information:
✖A geographic location of the sender
and the recipient.
✖The time the email was sent (e.g., lunchtime).
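
The internal and external attributes listed on slides 14 to 17 can be pictured as one record per message. The sketch below is only an illustration of that grouping: the class and field names (`InternalInfo`, `ExternalInfo`, `EmailSignals`, and so on) are hypothetical and do not come from the patent or from Google.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InternalInfo:
    """Attributes read from the message itself (hypothetical field names)."""
    subject: str = ""
    body: str = ""
    sender: str = ""
    recipients: dict[str, str] = field(default_factory=dict)  # address -> "to", "cc" or "bcc"
    linked_urls: list[str] = field(default_factory=list)
    attachment_names: list[str] = field(default_factory=list)


@dataclass
class ExternalInfo:
    """Attributes derived from outside the message (hypothetical field names)."""
    search_result_snippets: list[str] = field(default_factory=list)
    sender_history: list[str] = field(default_factory=list)
    sender_location: str = ""
    recipient_location: str = ""
    sent_hour: Optional[int] = None  # hour of day, 0-23


@dataclass
class EmailSignals:
    """Everything an ad selector could draw on for one message."""
    internal: InternalInfo
    external: ExternalInfo = field(default_factory=ExternalInfo)


# Made-up example message sent around lunchtime.
signals = EmailSignals(
    internal=InternalInfo(
        subject="Lunch on Friday?",
        sender="alice@example.com",
        recipients={"bob@example.com": "to"},
    ),
    external=ExternalInfo(sent_hour=12),
)
```

Keeping the two groups in separate classes mirrors the slides’ split between what is in the email and what is inferred around it.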
19. PHIL algorithm
✖When Gmail was finally released to
the public in April 2004, its ad serving
system used a sophisticated data
mining algorithm known as PHIL.
20. PHIL algorithm
✖PHIL had already been implemented the
previous year in Google’s AdSense
program, which serves ads to web sites.
21. PHIL algorithm
✖PHIL stands for Probabilistic
Hierarchical Inferential Learner.
22. PHIL algorithm
✖PHIL identifies clusters based on
concepts.
✖Each concept is more or less likely to occur
in a given email or web page.
23. PHIL algorithm
✖E.g., PHIL can learn to distinguish the
entirely different meanings of two
concepts such as “ski resort” and “lender of
last resort”.
24. PHIL algorithm in AdSense
✖In AdSense, PHIL matched concepts
derived from sets of keywords provided
by advertisers with concepts extracted
from the web pages where publishers
wanted Google to place ads.
✖The idea was that the better the
match, the more likely a visitor to the
publisher’s site would be to click on the
ad, which was the revenue generating
event for Google.
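
The matching idea on this slide can be illustrated with a toy sketch. This is not PHIL, which is a probabilistic model whose internals are not described here. The sketch swaps in plain Jaccard overlap between hand-written concept sets. The function names `jaccard` and `rank_ads`, and the sample ads and page concepts, are all hypothetical.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two concept sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_ads(page_concepts: set[str], ads: dict[str, set[str]]) -> list[tuple[str, float]]:
    """Return (ad name, score) pairs, best match first."""
    scored = [(name, jaccard(page_concepts, concepts)) for name, concepts in ads.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical concepts, standing in for what a real system would extract.
ads = {
    "ski_holidays": {"ski", "resort", "snow", "lift pass"},
    "bank_loans": {"lender", "loan", "interest rate", "central bank"},
}
page = {"ski", "snow", "resort", "mountain"}

ranking = rank_ads(page, ads)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Here the page about skiing shares three of five distinct concepts with the ski ad and none with the loan ad, so the ski ad ranks first. A better match is assumed to mean a more likely click.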
25. PHIL algorithm in AdSense
✖AdSense quickly grew to become
Google’s second largest business after
search itself, reaching more than $1
million a day by 2004 and $13 billion a
year by 2013.
26. PHIL algorithm In Gmail
✖Using PHIL for monetization in Gmail must
have seemed like a no-brainer to
Google’s managers.
✖BUT ..
27. PHIL algorithm In Gmail
✖BUT things did not work out as hoped.
✖Gmail revenues were not good!!
28. PHIL algorithm In Gmail
✖Gmail revenues for 2014 were barely
$400 million, or less than 1% of
Google’s total revenue.
✖Gmail was estimated to have over
500 million users.
✖THEN ..
29. PHIL algorithm In Gmail
✖THEN each Gmail user produces less than
$1 in revenue per year.
30. PHIL algorithm In Gmail
✖The cost of storage alone is 31 cents
per year per gigabyte.
✖If the average Gmail user consumes
only 20% of their nominally allotted 15
gigabytes (3 gigabytes), Google’s retail
price for this amount of storage would
be 93 cents per year,
more than the revenue it gets from
one Gmail user.
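
The arithmetic behind slides 28 to 30 can be checked in a few lines. The figures are the ones quoted on the slides, and the variable names are illustrative. Because the slides say “over 500 million” users, 500 million is a lower bound, so the revenue per user computed here is an upper bound.

```python
gmail_revenue_2014 = 400_000_000   # dollars, "barely $400 million"
gmail_users = 500_000_000          # "over 500 million", so a lower bound
price_per_gb_year = 0.31           # dollars per gigabyte per year
allotted_gb = 15
share_used = 0.20

revenue_per_user = gmail_revenue_2014 / gmail_users        # upper bound, dollars per year
storage_used_gb = allotted_gb * share_used                 # gigabytes actually consumed
storage_price_per_user = storage_used_gb * price_per_gb_year

print(f"revenue per user:       ${revenue_per_user:.2f}")
print(f"storage price per user: ${storage_price_per_user:.2f}")
```

This gives about 80 cents of revenue against 93 cents of storage per user per year.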
31. Why is revenue generation
in Gmail so much weaker
than for search or AdSense?
34. Google online profiling
✖The most comprehensive kind of profile
consists of the concept or category
clusters extracted by the PHIL
algorithm from documents the user has
viewed (web pages, inbound emails) or
created (outbound emails, social media
posts).
35. Google online profiling
✖Assuming conservatively that the
average Gmail user receives just 10
non-spam emails per day, the annual
flux of inbound Gmail probably
approaches and may well surpass two
trillion messages per year.
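
A quick check of this estimate, assuming the user base is the “over 500 million” figure quoted earlier in the deck (the slide itself does not state which user count it uses):

```python
gmail_users = 500_000_000       # assumption: the "over 500 million" estimate
emails_per_user_per_day = 10    # the slide's conservative figure
days_per_year = 365

inbound_per_year = gmail_users * emails_per_user_per_day * days_per_year
print(f"{inbound_per_year:,} messages per year")
print(f"{inbound_per_year / 1e12:.3f} trillion")
```

At exactly 500 million users this comes to 1.825 trillion messages. Any user count above about 548 million pushes the total past two trillion.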
36. Google online profiling
✖By building and continually updating
a vast database of individual user
profiles.
✖One particular user who enters the
word “blackberry” into her browser ..
38. One Box to rule them all
✖Purely ad-based business model.
✖Ads and user profiling.
39. COB (Content OneBox)
✖the PHIL-based extraction of
message concepts
✖updating the “user model” that
Google maintains of each user
✖attaching “smart labels” to
messages that indicate their type
51. Google’s (partial) clarification of
data mining in Gmail
In April 2014, Google promised on its web site that “Google
Apps for Education services do not collect or use student
data for advertising purposes or create advertising
profiles”.
The carefully worded promise to stop using student data
to create “advertising profiles” does not rule out the
possibility that it will continue creating profiles that help
it to optimize search results or identify valuable clusters of
users.
Google was forced to admit that, contrary to its
promises to educators, it had in fact been mining student
emails in Google Apps for Education (GAFE) for years.
52. We cannot know for certain what Google is doing with the
output of its vast and highly sophisticated email data mining
machinery.
53. Gmail users have “no legitimate
expectation of privacy”
It is not the profiling itself that is objectionable; it becomes objectionable
when the “voluntary” part drops out of the formula.
Google argued that implicit user consent to data mining was
sufficient: users “…impliedly consent to Google’s practices by virtue of the fact
that all users of email must necessarily expect that their emails will be
subject to automated processing.”
54. Google’s lawyers make the preposterous claim that once
users turn their email over to a third party service
provider they no longer have any “legitimate expectation
of privacy”.
56. Google is calling on governments around the world to
disclose and limit their surveillance practices.
It is time for Google to embrace the same transparency
about data mining it wishes to see in others.