Short URLs, Big Fun:
Understanding the World in Realtime


            Hilary Mason
         Chief Scientist, bitly

              @hmason
              h@bit.ly
http://www.pcworld.com/article/223409/move_over_dr_soong_girls_can_build_android_apps_too.html




                        http://bit.ly/hOnbWg
[fireplace]
How do we change the world?
Can we understand the world, first?
Big Data
Data
10s of millions of URLs per day
100s of millions of clicks per day



10s of billions of URLs
encodes
{"g": "zalAU0",
"i": "173.213.X.X",
"h": "zalAU0",
"l": "bitly",
"u": "http://www.amazon.com/Country-Life-Cal-Mag-Potassium-Target-Tablets/dp/B0001VUZ3A?SubscriptionId=AKIAJGA7AAB6QE7WENSQ&tag=mycellrevi-20&linkCode=sp1&camp=2025&creative=165953&creativeASIN=B0001VUZ3A",
"t": 1328266799,
"_id": "4f2bbe2f-0035d-063a1-3d1cf10a"}
decode
{"a": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
"c": "US",
"nk": 1,
"tz": "America/New_York",
"gr": "NY",
"g": "xNaZ9h",
"i": "98.118.X.X",
"h": "wXxuKW",
"k": "4eefe4be-003e4-X-X",
"l": "moma",
"al": "en-US",
"hh": "bit.ly",
"r": "http://www.facebook.com/l.php?u=http%3A%2F%2Fbit.ly%2FwXxuKW&h=rAQG_ZZ2GAQGQth1IOej-9_KHmbpEvh6FlllZjsAqg6A7Rw",
"u": "http://www.brainpickings.org/index.php/2012/02/02/jackson-pollock-father-letter/",
"t": 1328272481,
"hc": 1328232072,
"cy": "East Amherst",
"ll": [43.044101715087891, -78.694900512695312]}
a link
•   URL
•   Content
•   Ref distribution
•   Geo distribution
•   Language
•   Key phrases
•   Topic
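Several of these attributes are simple aggregates over click records. The sketch below is a minimal illustration rather than bitly's pipeline: it folds a few records shaped like the decode event into referrer and country distributions for one link. The reading of `r` as the referrer and `c` as the country code is inferred from the sample values, the second and third records are made up, and `link_profile` is a hypothetical helper name.

```python
import json
from collections import Counter
from urllib.parse import urlparse

# Hypothetical click records, shaped like the decode event shown earlier
# (only the fields used here; field meanings are assumptions).
RAW_CLICKS = [
    '{"h": "wXxuKW", "c": "US", "al": "en-US", "r": "http://www.facebook.com/l.php"}',
    '{"h": "wXxuKW", "c": "US", "al": "en-US", "r": "http://twitter.com/"}',
    '{"h": "wXxuKW", "c": "BR", "al": "pt-BR", "r": "http://www.facebook.com/l.php"}',
]

def link_profile(raw_clicks):
    """Fold click records for one link into referrer and country counts."""
    referrers, countries = Counter(), Counter()
    for raw in raw_clicks:
        event = json.loads(raw)
        # A record without a referrer is counted as a direct visit.
        referrers[urlparse(event.get("r", "")).netloc or "direct"] += 1
        countries[event.get("c", "unknown")] += 1
    return {"ref": referrers, "geo": countries}

profile = link_profile(RAW_CLICKS)
print(profile["ref"].most_common(1))  # [('www.facebook.com', 2)]
print(profile["geo"].most_common())   # [('US', 2), ('BR', 1)]
```

Content, key phrases and topic would need the page itself, so they are left out of this sketch.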
Data Science?




Analytics   Science
Data Science?




Things you can just count.        Things you can’t.
Data scientists?

[Venn diagram: engineering, math, comp sci, hacking; the overlaps are labeled "nerds" and the center "awesome nerds"]
bitly science team!
What can we learn from a lot of people talking to each other?
A few things that we can count...
How do people use different devices?
What happens on the internet when society isn’t stable?
Revolution.
(Silly Things on the Internet)
the cutest kitten
A few things that we can count... cleverly.
What spoken languages are in a page?
raw data
"es"
"en-us,en;q=0.5"
"pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4"
"en-gb,en;q=0.5"
"en-US,en;q=0.5"
"es-es,es;q=0.8,en-us;q=0.5,en;q=0.3"
"de, en-gb;q=0.9, en;q=0.8"
entropy calculation
import numpy as np

def ghash2lang(g, R, min_count=3, max_entropy=0.2):
  """
  returns the majority vote of a language for a given hash

  R is a Redis client; key g holds a sorted set of languages scored by count
  """
  # most frequent language (highest score in the sorted set)
  lang = R.zrevrange(g, 0, 0)[0]
  # let's calculate the entropy!
  # possible languages
  x = R.zrange(g, 0, -1)
  # distribution over those languages
  p = np.array([R.zscore(g, langi) for langi in x], dtype=float)
  p /= p.sum()
  # info content
  I = [pi * np.log(pi) for pi in p]
  # entropy: smaller the more certain we are! - i.e. the lower our surprise
  H = -sum(I) / len(I)  # in nats, averaged over the candidate languages
  # note that this will give a perfect zero for a single count in one language
  # or for 5K counts in one language. So we also need the count..
  count = R.zscore(g, lang)
  # accept the vote only with enough counts and a low enough entropy
  if count >= min_count and H <= max_entropy:
    return lang, count
  else:
    return None, 1
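The same idea can be tried without Redis. The sketch below is an illustration under stated assumptions, not the code from the slide: it takes a plain `{language: count}` dictionary, uses the plain Shannon entropy (not divided by the number of languages, so the threshold is not directly comparable with the one above), and returns `(None, 0)` when the vote is rejected. The names `language_entropy` and `majority_language` are hypothetical.

```python
import math

def language_entropy(counts):
    """Shannon entropy, in nats, of a {language: click_count} mapping."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    return sum(-p * math.log(p) for p in probs)

def majority_language(counts, min_count=3, max_entropy=0.2):
    """Return (language, count) if the vote is confident, else (None, 0)."""
    if not counts:
        return None, 0
    lang, count = max(counts.items(), key=lambda item: item[1])
    if count >= min_count and language_entropy(counts) <= max_entropy:
        return lang, count
    return None, 0

# Entropy is zero for one click or for 5K clicks in a single language,
# which is why the count threshold is needed as well.
print(language_entropy({"en": 1}), language_entropy({"en": 5000}))
print(majority_language({"en": 1}))             # rejected: too few clicks
print(majority_language({"en": 97, "es": 3}))   # accepted
print(majority_language({"en": 50, "es": 50}))  # rejected: too uncertain
```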
http://4sq.com/96kc1O
What’s the context around a link?
[demo]
Things we have to think about. (science)
What’s a human?
normal click distributions
abnormal click distributions
Organic vs Inorganic?
AT SCALE
1. Research offline

2. Do fancy math – find the shortcuts

3. Design infrastructure

4. Re-design to run at scale and speed
Realtime Search
Realtime Search
Attributes calculated either at index time or query time.

Rankings can vary by second.
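A toy sketch of that split, under loud assumptions: nothing here comes from bitly's search system. The index entries, the `topic` and `indexed_at` fields, and the scoring rule (recent clicks discounted by link age) are all invented to show how the same query can rank differently from one moment to the next.

```python
import time

# Index-time attributes: computed once when a link is indexed (assumed fields).
INDEX = {
    "abc123": {"language": "en", "topic": "art", "indexed_at": 1328272000},
    "def456": {"language": "en", "topic": "art", "indexed_at": 1328100000},
}

def score(doc_id, recent_clicks, now):
    """Query-time attribute: recent clicks discounted by the age of the link."""
    age_hours = max(now - INDEX[doc_id]["indexed_at"], 0) / 3600.0
    return recent_clicks.get(doc_id, 0) / (1.0 + age_hours)

def search(topic, recent_clicks, now=None):
    """Filter on an index-time attribute, rank on a query-time one."""
    now = time.time() if now is None else now
    hits = [d for d, attrs in INDEX.items() if attrs["topic"] == topic]
    return sorted(hits, key=lambda d: score(d, recent_clicks, now), reverse=True)

# Same query, one second apart, with different recent click counts.
print(search("art", {"abc123": 10, "def456": 100}, now=1328272481))
print(search("art", {"abc123": 1, "def456": 100}, now=1328272482))
```

The filter touches only stored attributes, while the ordering depends on the click counts and the clock at the moment of the query.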
[demo]
What are people paying attention to right now?
actual rate of clicks on phrases
vs
expected rate of clicks on phrases
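One simple way to compare the two rates is a smoothed ratio. This is only a sketch of the comparison, not the scoring bitly uses: `burst_score`, the `prior` constant, and the clicks-per-minute figures are all hypothetical.

```python
def burst_score(actual_rate, expected_rate, prior=10.0):
    """Ratio of observed to expected click rate, damped by a prior.

    The prior (an assumed smoothing constant) keeps phrases with almost no
    history from producing huge ratios off a handful of clicks.
    """
    return (actual_rate + prior) / (expected_rate + prior)

# Hypothetical clicks per minute: (actual now, expected from history)
phrases = {
    "superbowl": (900.0, 300.0),
    "weather": (120.0, 118.0),
    "rare phrase": (3.0, 0.0),
}
ranked = sorted(phrases, key=lambda p: burst_score(*phrases[p]), reverse=True)
print(ranked)  # ['superbowl', 'rare phrase', 'weather']
```

A phrase that is always popular scores close to 1, so it does not crowd out phrases that are getting unusual attention right now.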
Dragoneye
We calculate clickrate with a sort of moving average:

[equation]

where [equation]
Dragoneye
We represent the clicks as a sum of delta spikes.

This simplifies to:

[equation]
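The formulas on the Dragoneye slides are not reproduced in this transcript. One reconstruction consistent with the surrounding text (a moving average over a sum of delta spikes, with the kernel called φ in the speaker notes) is sketched below. The symbols r, c and t_i are assumed notation, not the author's, and the exact form on the slides may differ.

```latex
% Assumed notation: r(t) is the click rate, c(t) the raw click stream,
% t_i the click times, and \phi the smoothing kernel.
\begin{align}
  r(t) &= \int_{-\infty}^{\infty} \phi(t - t')\, c(t')\, dt',
  \qquad \text{where } c(t') = \sum_i \delta(t' - t_i) \\
  \Rightarrow\quad r(t) &= \sum_i \phi(t - t_i)
\end{align}
```

Under this reading the integral disappears: each click contributes one copy of the kernel, so the rate is a plain sum that is cheap to update as clicks arrive.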
Dragoneye
Choosing φ is important.

It must be interpretable, and smooth (but not too smooth).

We use a distribution for φ: a function that sums to 1. The function is 0 at the origin.
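A minimal sketch of a kernel with those properties, assuming the click rate is a sum of one kernel per click. The specific kernel (a gamma-shaped bump, `(s / tau**2) * exp(-s / tau)`) and the time scale `tau` are illustrative choices, not the ones used in Dragoneye.

```python
import math

def phi(s, tau=60.0):
    """Assumed kernel: zero at the origin, smooth, integrates to 1 over s >= 0."""
    if s <= 0:
        return 0.0
    return (s / tau ** 2) * math.exp(-s / tau)

def click_rate(t, click_times, tau=60.0):
    """Smoothed click rate at time t: one kernel bump per past click."""
    return sum(phi(t - ti, tau) for ti in click_times)

# Linearity: the rate for a phrase is the sum of the rates of its pages.
page_a = [0.0, 5.0, 12.0]  # hypothetical click times, in seconds
page_b = [3.0, 40.0]
t = 90.0
combined = click_rate(t, page_a + page_b)
separate = click_rate(t, page_a) + click_rate(t, page_b)
print(abs(combined - separate) < 1e-12)
```

Because the kernel is 0 at the origin, a click contributes nothing at the instant it happens, so a single fresh click cannot push a phrase up on its own.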
[demo]
philosophy
simple math > fancy math
How do we know when we’ve won?
How do we communicate what we’ve learned effectively?
Ask the crazy questions.
Thank you!




h@bit.ly
@hmason

Editor's Notes

  1. This is going to be a talk for people who love the internet.
  2. The true story of bitly, engineering, data science, love. How to do data science at scale. Building teams and keeping people happy. Clever tricks.
  3. Thomas Karaganis at MS Research 1% of new URLs per day
  4. Shortened links get shared on different platforms and methods
  5. Messages move across platforms in complicated ways
  6. We have a lot of growing up to do
  7. http://www.flickr.com/photos/wanderingnome/73328967/sizes/l/in/photostream/ Philosophy, science, engineering, cool tricks
  8. Pow! Surprise! Here we are!
  9. …first, we understand it
  10. …first, we understand it
  11. Asking questions.
  12. http://www.flickr.com/photos/32443746@N07/4753829490/
  13. Egypt.
  14. Tunisia.
  15. …and we can do the same thing for geo data
  16. Studied offline, using Hadoop. Build a supervised classifier over timeseries. Build a random forest ensemble decision tree classifier.
  17. The data system fortune cookie game. Creative Commons: http://www.flickr.com/photos/mzn37/308048794/sizes/o/in/photostream/
  18. The simplification is important for three reasons: 1) a continuous function of time that simplifies to φ; 2) it’s linear, so the sum of the click rates on each page with a phrase is the click rate per phrase; 3) IT’S FAST.
  19. The 0 at the origin ensures that we have seen sustained click rates on a phrase before we think it’s anything useful.