Short URLs, Big Fun:
Understanding the World in Realtime


            Hilary Mason
         Chief Scientist, bitly

              @hmason
              h@bit.ly
http://www.pcworld.com/article/223409/move_over_dr_soong_girls_can_build_android_apps_too.html




                        http://bit.ly/hOnbWg
[fireplace]
How do we change the world?
Can we understand the world, first?
Big Data
Data
10s of millions of URLs per day
100s of millions of clicks per day



10s of billions of URLs
encodes
{"g": "zalAU0",
 "i": "173.213.X.X",
 "h": "zalAU0",
 "l": "bitly",
 "u": "http://www.amazon.com/Country-Life-Cal-Mag-Potassium-Target-Tablets/dp/B0001VUZ3A?SubscriptionId=AKIAJGA7AAB6QE7WENSQ&tag=mycellrevi-20&linkCode=sp1&camp=2025&creative=165953&creativeASIN=B0001VUZ3A",
 "t": 1328266799,
 "_id": "4f2bbe2f-0035d-063a1-3d1cf10a"}
decode
{"a": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
 "c": "US",
 "nk": 1,
 "tz": "America/New_York",
 "gr": "NY",
 "g": "xNaZ9h",
 "i": "98.118.X.X",
 "h": "wXxuKW",
 "k": "4eefe4be-003e4-X-X",
 "l": "moma",
 "al": "en-US",
 "hh": "bit.ly",
 "r": "http://www.facebook.com/l.php?u=http%3A%2F%2Fbit.ly%2FwXxuKW&h=rAQG_ZZ2GAQGQth1IOej-9_KHmbpEvh6FlllZjsAqg6A7Rw",
 "u": "http://www.brainpickings.org/index.php/2012/02/02/jackson-pollock-father-letter/",
 "t": 1328272481,
 "hc": 1328232072,
 "cy": "East Amherst",
 "ll": [43.044101715087891, -78.694900512695312]}
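A decode event like the one above can be unpacked with a few lines of Python. This is a minimal sketch, not bitly's code: the readings of the short keys ("u" as the destination URL, "r" as the referrer, "t" as a Unix timestamp, "c", "cy" and "ll" as country, city and latitude/longitude, "al" as the Accept-Language header) are inferred from the sample values, and `summarize_decode` is a hypothetical helper.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlparse

# Trimmed copy of the sample decode event (some keys and the referrer's
# query string are shortened for readability).
SAMPLE = '''{"c": "US", "tz": "America/New_York", "g": "xNaZ9h", "h": "wXxuKW",
"al": "en-US",
"r": "http://www.facebook.com/l.php?u=http%3A%2F%2Fbit.ly%2FwXxuKW",
"u": "http://www.brainpickings.org/index.php/2012/02/02/jackson-pollock-father-letter/",
"t": 1328272481, "cy": "East Amherst", "ll": [43.0441, -78.6949]}'''


def summarize_decode(raw):
    """Pull the human-readable fields out of one decode (click) event.

    The key meanings are assumptions inferred from the sample values.
    """
    event = json.loads(raw)
    return {
        "url_host": urlparse(event["u"]).netloc,
        # An absent referrer yields None rather than an empty string.
        "referrer_host": urlparse(event.get("r", "")).netloc or None,
        "clicked_at": datetime.fromtimestamp(event["t"], tz=timezone.utc),
        "country": event.get("c"),
        "city": event.get("cy"),
        "lat_lon": tuple(event["ll"]) if "ll" in event else None,
        "accept_language": event.get("al"),
    }


summary = summarize_decode(SAMPLE)
```

Grouping summaries like this by `referrer_host` or `country` is one way to arrive at the referrer and geo distributions listed on the next slide.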
a link
•   URL
•   Content
•   Ref distribution
•   Geo distribution
•   Language
•   Key phrases
•   Topic
Data Science?




Analytics   Science
Data Science?




Things you can                   Things you
just count.                      can’t.
Data scientists?

[Venn diagram of engineering, math, comp sci, and hacking; the overlaps are labeled "nerds" and "awesome nerds"]
bitly science team!
What can we learn from a lot of
 people talking to each other?
A few things that we can count...
How do people use different
        devices?
What happens on the internet when
       society isn’t stable?
Revolution.
(Silly Things on the Internet)
the cutest kitten
A few things that we can count...

            cleverly.
What spoken languages are in a
           page?
raw data
"es"
"en-us,en;q=0.5"
"pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4"
"en-gb,en;q=0.5"
"en-US,en;q=0.5"
"es-es,es;q=0.8,en-us;q=0.5,en;q=0.3"
"de, en-gb;q=0.9, en;q=0.8"
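The raw values above are HTTP Accept-Language headers: comma-separated language tags, each with an optional `q` weight that defaults to 1. A minimal sketch of turning one header into a single language vote follows. The function names are hypothetical, and collapsing a tag such as `pt-BR` to `pt` is an assumption about how the votes might be counted, not something the slides state.

```python
def parse_accept_language(header):
    """Return (language_tag, q) pairs sorted by descending preference."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        q = 1.0  # a tag without an explicit weight counts as q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed weight as "not wanted"
        prefs.append((tag.strip().lower(), q))
    # sorted() is stable, so ties keep their order from the header.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def primary_language(header):
    """Return the base language of the most preferred tag, or None."""
    prefs = parse_accept_language(header)
    return prefs[0][0].split("-")[0] if prefs else None
```

For example, `primary_language("pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4")` yields `"pt"`, and one such vote per click could feed the per-hash language counts.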
entropy calculation
import numpy as np

def ghash2lang(g, R, min_count=3, max_entropy=0.2):
  """
  Returns the majority-vote language for a given hash, or None if the
  vote is too small or too uncertain. R is assumed to be a Redis client
  and g the key of a sorted set mapping language -> count.
  """
  lang = R.zrevrange(g, 0, 0)[0]
  # let's calculate the entropy!
  # possible languages
  x = R.zrange(g, 0, -1)
  # distribution over those languages
  p = np.array([R.zscore(g, langi) for langi in x], dtype=float)
  p /= p.sum()
  # info content
  I = [pi * np.log(pi) for pi in p]
  # entropy: smaller the more certain we are! - i.e. the lower our surprise
  H = -sum(I) / len(I)  # in nats, divided by the number of candidate languages
  # note that this will give a perfect zero for a single count in one language
  # or for 5K counts in one language. So we also need the count..
  count = R.zscore(g, lang)
  # accept the vote only when there are enough counts and the entropy is low
  if count >= min_count and H <= max_entropy:
      return lang, count
  else:
      return None, 1
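The caveat in the comments (zero entropy for one count as well as for 5K counts) is easy to check with in-memory counts. This sketch swaps the Redis sorted set for a plain dict and uses plain Shannon entropy, without the division by the number of languages used in the function; `language_entropy` is a hypothetical name.

```python
import math


def language_entropy(counts):
    """Shannon entropy (in nats) of a {language: count} vote tally."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    return -sum(p * math.log(p) for p in probs)


one_click = language_entropy({"en": 1})          # a single vote
many_clicks = language_entropy({"en": 5000})     # 5K unanimous votes
mixed = language_entropy({"en": 50, "es": 50})   # an even split
```

Both unanimous tallies have zero entropy, so entropy alone cannot tell one vote from five thousand. That is why the function also checks the count against `min_count`.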
http://4sq.com/96kc1O
What’s the context
 around a link?
[demo]
Things we have to think about.

           (science)
What’s a human?
normal click distributions
abnormal click distributions
Organic vs Inorganic?
AT SCALE
1. Research offline

2. Do fancy math – find the shortcuts

3. Design infrastructure

4. Re-design to run at scale and speed
Realtime Search
Realtime Search
Attributes calculated either at index time or
query time.

Rankings can vary by second.
[demo]
What are people paying
attention to right now?
actual rate of clicks on phrases
vs
expected rate of clicks on phrases
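One simple way to compare the two rates is a smoothed ratio. The slides do not say how the comparison is made, so the ratio, the `smoothing` constant, and the function names are all assumptions for illustration.

```python
def attention_score(actual_rate, expected_rate, smoothing=1.0):
    """Ratio of actual to expected click rate.

    The smoothing term (an assumed constant) keeps phrases with a
    near-zero expected rate from producing huge scores on a few clicks.
    """
    return (actual_rate + smoothing) / (expected_rate + smoothing)


def rank_phrases(rates, smoothing=1.0):
    """rates maps phrase -> (actual_rate, expected_rate)."""
    scored = {
        phrase: attention_score(actual, expected, smoothing)
        for phrase, (actual, expected) in rates.items()
    }
    return sorted(scored, key=scored.get, reverse=True)
```

With `{"kitten": (120.0, 100.0), "revolution": (90.0, 5.0)}`, "revolution" ranks first even though "kitten" has more clicks, because it is far above its usual rate.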
Dragoneye
We calculate clickrate with a sort of moving
average:

[equation]
Dragoneye
We represent the clicks as a sum of delta spikes.

This simplifies to:

[equation]
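The formulas on these slides did not survive in the transcript. One reading consistent with the surrounding text (a moving average, a sum of delta spikes, and a result expressed through \phi, as the editor's notes put it) is a kernel-smoothed rate. The symbols c, r, and t_i below are assumptions, not the speaker's notation.

```latex
% Assumed notation: t_i are the click times, \phi is the smoothing kernel,
% c(t) is the click stream and r(t) the resulting click rate.
\[
  r(t) = \int_{-\infty}^{\infty} \phi(t - s)\, c(s)\, ds ,
  \qquad \text{where} \qquad
  c(s) = \sum_i \delta(s - t_i) .
\]
% Integrating each delta spike against the kernel collapses the integral:
\[
  r(t) = \sum_i \phi(t - t_i) .
\]
```

Under this reading the rate is one kernel evaluation per click, which would make it both linear in the clicks and cheap to compute.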
Dragoneye
Choosing φ is important.

It must be interpretable, and smooth (but not
too smooth).

For φ we use a distribution, that is, a function
that sums to 1. The function is 0 at the
origin.
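A kernel with both stated properties (sums to 1, is 0 at the origin) can be sketched as follows. The slides do not name the function actually used; the gamma-shaped kernel, the `tau` time scale, and the function names here are assumptions chosen only because they satisfy those two properties.

```python
import math


def phi(dt, tau=60.0):
    """Assumed kernel: (dt / tau^2) * exp(-dt / tau) for dt > 0, else 0.

    It is 0 at the origin, peaks at dt == tau, and integrates to 1.
    """
    if dt <= 0:
        return 0.0
    return (dt / tau ** 2) * math.exp(-dt / tau)


def click_rate(click_times, now, tau=60.0):
    """Sum the kernel over clicks; each click adds one phi term."""
    return sum(phi(now - t, tau) for t in click_times)
```

Because the kernel is 0 at the origin, a click that has just arrived contributes nothing yet, so a single burst does not register until clicks have been sustained for a while. Because the rate is a plain sum, the rates of several pages can be added to get the rate for a phrase they share.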
[demo]
philosophy
simple math > fancy math
How do we know
when we’ve won?
How do we communicate what
  we’ve learned effectively?
Ask the crazy questions.
Thank you!




h@bit.ly
@hmason

Editor's Notes

  1. This is going to be a talk for people who love the internet.
  2. The true story of bitly, engineering, data science, love. How to do data science at scale. Building teams and keeping people happy. Clever tricks.
  3. Thomas Karaganis at MS Research 1% of new URLs per day
  4. Shortened links get shared on different platforms and methods
  5. Messages move across platforms in complicated ways
  6. We have a lot of growing up to do
  7. http://www.flickr.com/photos/wanderingnome/73328967/sizes/l/in/photostream/ Philosophy, science, engineering, cool tricks
  8. Pow! Surprise! Here we are!
  9. …first, we understand it
  10. …first, we understand it
  11. Asking questions.
  12. http://www.flickr.com/photos/32443746@N07/4753829490/
  13. Egypt.
  14. Tunisia.
  15. …and we can do the same thing for geo data
  16. Studied offline, using Hadoop. Build a supervised classifier over timeseries. Build a random forest ensemble decision tree classifier.
  17. The data system fortune cookie game. Creative commons: http://www.flickr.com/photos/mzn37/308048794/sizes/o/in/photostream/
  18. The simplification is important for three reasons: 1) a continuous function of time that simplifies to \phi; 2) it’s linear, so the sum of the click rates on each page with a phrase is the click rate per phrase; 3) IT’S FAST.
  19. The 0 at the origin ensures that we have seen sustained click rates on a phrase before we think it’s anything useful.