The Natural
History of Gmail
Data Mining
Gmail isn’t really about
email!
...it’s a gigantic profiling
machine
1.
Court case!
A court case reveals a trove of documents
about Gmail’s inner workings
✖In late 2010 ..
✖Ads depend on emails.
✖The most serious legal challenge.
✖Illegal data mining.
2.
How does Google make money?
Google is the world’s largest advertising
company.
3.
Gmail’s early history
Gmail’s early history
✖Launched in 2004.
✖Yahoo Mail and Microsoft’s Hotmail had
been around since the 1990s.
✖Vast amount of storage space per
user.
✖It would be free to users and earn
revenue through advertising.
4.
Gmail’s limitless data
mining ambitions
Gmail data mining
✖The first version of ad serving in
Gmail exploited only concepts directly
extracted from message texts and did
little or no user profiling.
Gmail’s original patented data mining scheme (2013)
✖Uses “internal” and “external” message
attributes, which can be combined in any
way to extract the meaning of an email
and select the best ads to match it.
Internal Email Information:
✖Info. from the subject line.
✖Info. from the body text.
✖The sender’s name and/or email address.
✖One or more recipient names and/or
email addresses.
✖Recipient type (e.g., direct recipient,
cc, bcc).
✖Text extracted from an email address.
✖Embedded information (e.g., a business
card file, an image).
✖Linked info. (e.g., info. from a web
page linked to from the email).
✖Attached info. (e.g., word processor
files, images, spreadsheets, etc.).
External Email Information:
✖Info. extracted or derived from search
results returned in response to a search
query composed of extracted email info.
✖Info. about the sender, for example
derived from previous interactions with
the recipient.
✖Info. from other emails sent by the
sender and/or received by the recipient.
✖Info. from a directory in common with
the embedded info. (e.g., a Word file).
✖The geographic location of the sender
and the recipient.
✖The time the email was sent (e.g.,
lunchtime).
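The two attribute groups above can be pictured as a simple data structure. The following is only a minimal sketch, not Google's implementation: the class names, field names, and the `select_attributes` helper are all hypothetical, chosen to mirror the slide's lists.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class InternalInfo:
    """Attributes found in the message itself (hypothetical field names)."""
    subject: str = ""
    body: str = ""
    sender: str = ""
    recipients: dict = field(default_factory=dict)  # address -> "to", "cc" or "bcc"
    linked_urls: list = field(default_factory=list)
    attachment_names: list = field(default_factory=list)


@dataclass
class ExternalInfo:
    """Attributes derived from outside the message (hypothetical field names)."""
    sender_location: Optional[str] = None
    recipient_location: Optional[str] = None
    sent_hour: Optional[int] = None  # 0-23, local time
    related_subjects: list = field(default_factory=list)


def select_attributes(internal, external, names):
    """Pick any combination of internal and external attributes by name."""
    merged = {**asdict(internal), **asdict(external)}
    return {name: merged[name] for name in names}


message = InternalInfo(subject="Lunch on Friday?", sender="alice@example.com",
                       recipients={"bob@example.com": "to"})
context = ExternalInfo(sender_location="Boston", sent_hour=12)
signals = select_attributes(message, context, ["subject", "sent_hour"])
print(signals)
```

Here one internal attribute (the subject line) and one external attribute (the sending time) are combined, which is the "any combination" idea in its simplest form.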
5.
Gmail doesn’t make much
money from ads
PHIL algorithm
✖When Gmail was finally released to
the public in April 2004, its ad serving
system used a sophisticated data
mining algorithm known as PHIL.
✖PHIL had already been implemented the
previous year in Google’s AdSense
program, which serves ads to web sites.
✖PHIL stands for Probabilistic
Hierarchical Inferential Learner.
✖PHIL identifies clusters of concepts
that are more or less likely to occur
together in email content or a web page.
✖E.g., PHIL can learn to distinguish the
entirely different meanings of two
concepts such as “ski resort” and
“lender of last resort”.
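The “ski resort” versus “lender of last resort” distinction can be illustrated with a toy example. This is not PHIL, whose internals are not described here; it is a deliberately simple word-overlap sketch, and the cluster names and vocabularies are made up for illustration.

```python
# Hypothetical concept clusters: each maps to words that tend to co-occur with it.
CLUSTERS = {
    "winter travel": {"ski", "snow", "slopes", "lodge", "lift"},
    "central banking": {"lender", "bank", "liquidity", "crisis", "loans"},
}


def best_cluster(text):
    """Return the cluster whose vocabulary overlaps most with the text."""
    words = set(text.lower().split())
    scores = {name: len(words & vocab) for name, vocab in CLUSTERS.items()}
    return max(scores, key=scores.get)


first = best_cluster("We booked a ski resort with great snow")
second = best_cluster("The central bank acted as lender of last resort")
print(first)   # winter travel
print(second)  # central banking
```

Both sentences contain the word “resort”, but the surrounding words pull each one toward a different cluster.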
PHIL algorithm in AdSense
✖In AdSense, PHIL matched concepts
derived from sets of keywords provided
by advertisers with concepts extracted
from the web pages where publishers
wanted Google to place ads.
✖The idea was that the better the
match, the more likely a visitor to the
publisher’s site would be to click on the
ad, which was the revenue-generating
event for Google.
✖AdSense quickly grew to become
Google’s second largest business after
search itself, reaching more than $1
million a day by 2004 and $13 billion a
year by 2013.
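The matching idea described for AdSense (advertiser concepts against page concepts, with better matches preferred) can be sketched with a standard set-similarity score. Jaccard similarity is swapped in here purely for illustration; the slides do not say how PHIL scores a match, and the ad names and concept sets are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two concept sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical advertiser keyword concepts and page concepts.
ads = {
    "ski_pass_ad": {"ski", "snow", "resort"},
    "mortgage_ad": {"loan", "bank", "rates"},
}
page_concepts = {"ski", "resort", "travel"}

# Rank ads from best to worst match for this page.
ranked = sorted(ads, key=lambda ad: jaccard(ads[ad], page_concepts), reverse=True)
print(ranked)  # ['ski_pass_ad', 'mortgage_ad']
```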
PHIL algorithm in Gmail
✖Using PHIL for monetization in Gmail
must have seemed like a no-brainer to
Google’s managers.
✖But things did not work out as hoped:
Gmail revenues were poor.
✖Gmail revenues for 2014 were estimated
at barely $400 million, or less than 1%
of Google’s total revenue.
✖Gmail was estimated to have over
500 million users.
✖So each Gmail user produces less than
$1 in revenue per year.
✖Storage alone is priced at 31 cents
per year per gigabyte.
✖If the average Gmail user consumes
only 20% of the nominally allotted 15
gigabytes, Google’s retail price for this
amount of storage would be 93 cents per
year, more than the revenue it gets from
one Gmail user.
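The arithmetic behind these figures is easy to check. The sketch below uses only the numbers on the slides; since the slides say "barely $400 million" and "over 500 million users", the per-user revenue is best read as an approximate upper bound.

```python
# Figures taken from the slides.
gmail_revenue_usd = 400e6    # estimated 2014 Gmail revenue
gmail_users = 500e6          # estimated number of Gmail users
price_per_gb_year = 0.31     # retail storage price, USD per gigabyte per year
allotted_gb = 15             # nominal allotment per user
share_used = 0.20            # assumed share of the allotment actually used

revenue_per_user = gmail_revenue_usd / gmail_users
storage_price_per_user = price_per_gb_year * allotted_gb * share_used

print(f"revenue per user: ${revenue_per_user:.2f}")              # $0.80
print(f"storage price per user: ${storage_price_per_user:.2f}")  # $0.93
```

At 80 cents of revenue against 93 cents of storage at retail price, the per-user numbers do not add up on ads alone.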
Why is revenue generation
in Gmail so much weaker
than for search or AdSense?
6.
From ads to user profiles
Google online profiling
✖Of the user profiles Google builds
using PHIL, the most comprehensive kind
consists of the concept or category
clusters extracted by the PHIL
algorithm from documents the user has
viewed (web pages, inbound emails) or
created (outbound emails, social media
posts).
✖Assuming conservatively that the
average Gmail user receives just 10
non-spam emails per day, the flux of
inbound Gmail probably approaches, and
may well surpass, two trillion messages
per year.
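The two-trillion estimate follows from simple multiplication. This check assumes the 500 million user figure quoted earlier in the deck; since that figure is given as "over 500 million", the true total would be somewhat higher.

```python
emails_per_user_per_day = 10   # conservative assumption from the slide
days_per_year = 365
gmail_users = 500_000_000      # "over 500 million" per the earlier slide

inbound_per_year = emails_per_user_per_day * days_per_year * gmail_users
print(f"{inbound_per_year:,} messages per year")  # 1,825,000,000,000
```

With exactly 500 million users the result is about 1.8 trillion, which matches "approaches two trillion".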
Google online profiling
✖By building and continually updating
a vast database of individual user
profiles, Google can interpret the
intent of one particular user who enters
the word “blackberry” into her browser ..
Google page ranking
✖computes an aggregate statistical view
of each web page’s.
✖Bad way ..
One Box to rule them all
✖From a purely ad-based business model
to ads and user profiling.
COB (Content OneBox)
✖The PHIL-based extraction of
message concepts.
✖Updating the “user model” that
Google maintains of each user.
✖Attaching “smart labels” to
messages that indicate their type.
COB and CAT2 Mixer
How does CAT2 Mixer operate?
✖When CAT2 Mixer did not trigger,
neither did COB.
✖This was the case for Government and
Business customers, who pay real money
for the service.
The solution for COB
Sequence of events in the life of a Gmail message
In 2014
✖$60 billion: that’s a lot of money.
✖70%-80% of users: and a lot of users.
ad1 ad2 ad3
user1 $1
user2 $1.50
user3 $2
hundreds of thousands of advertisers
hundreds of millions of users
Sparsity is a problem for Google
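Why sparsity is a problem can be seen by counting how few cells of a user-by-ad table are actually filled in. The sketch below is illustrative only: the cell positions of the three dollar values, the 300 million users, the 300 thousand advertisers, and the 10 observations per user are all assumptions, not figures from the slides.

```python
# Hypothetical observations: (user, ad) -> value in dollars.
# The slide does not say which ad each value belongs to; positions are assumed.
observed = {
    ("user1", "ad1"): 1.0,
    ("user2", "ad2"): 1.5,
    ("user3", "ad3"): 2.0,
}


def density(n_observed, n_users, n_ads):
    """Fraction of the user-by-ad table that holds an observed value."""
    return n_observed / (n_users * n_ads)


toy_density = density(len(observed), 3, 3)

# Assumed scale: 300 million users, 300 thousand advertisers,
# and 10 observed values per user.
n_users, n_ads, per_user = 300_000_000, 300_000, 10
large_density = density(n_users * per_user, n_users, n_ads)

print(f"toy table density: {toy_density:.3f}")
print(f"large table density: {large_density:.6f}")
```

Even under these generous assumptions almost every cell is empty, which is why grouping users into clusters is attractive.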
Clustering using data brokers
✖Acxiom, Datalogix, Epsilon.
✖Too expensive for 0.5 billion users.
Clustering using the query stream (and IRS and Zillow data)
✖Example clusters: technology, health.
Google’s (partial) clarification of
data mining in Gmail
✖In April 2014, Google promised on its web site that “Google
Apps for Education services do not collect or use student
data for advertising purposes or create advertising
profiles”.
✖The carefully worded promise to stop using student data
to create “advertising profiles” does not rule out the
possibility that it will continue creating profiles that help
it to optimize search results or identify valuable clusters of
users.
✖Google was forced to admit that, contrary to its
promises to educators, it had in fact been mining student
emails in GAFE for years.
We cannot know for certain what Google is doing with the
output of its vast and highly sophisticated email data mining
machinery
Gmail users have “no
legitimate expectation
of privacy”
✖It is not the profiling itself that is objectionable; it becomes objectionable
when the “voluntary” part drops out of the formula.
✖Google argued that implicit user consent to data mining was
sufficient: users “…impliedly consent to Google’s practices by virtue of the fact
that all users of email must necessarily expect that their emails will be
subject to automated processing.”
Google’s lawyers make the preposterous claim that once
users turn their email over to a third party service
provider they no longer have any “legitimate expectation
of privacy”.
The future of Gmail data mining and
the need for transparency
✖Google is calling on governments around the world to
disclose and limit their surveillance practices.
✖It is time for Google to embrace the same transparency
about data mining that it wishes to see in others.
Resources
✖https://medium.com/@jeffgould/the-natural-history-of-gmail-data-mining-be115d196b10
Thanks!
Any questions?
