These are the slides from a talk I gave at Dropbox this month (Feb 2012). It was a narrative about the evolution of bitly and a technical presentation about algorithms and infrastructure. The live demo portion is not represented in the slides (and each of the visuals has an accompanying story).
45. entropy calculation
import numpy as np

def ghash2lang(g, R, min_count=3, max_entropy=0.2):
    """
    returns the majority vote of a language for a given hash
    """
    lang = R.zrevrange(g, 0, 0)[0]
    # let's calculate the entropy!
    # possible languages
    x = R.zrange(g, 0, -1)
    # distribution over those languages
    p = np.array([R.zscore(g, langi) for langi in x])
    p /= p.sum()
    # info content
    I = [pi * np.log(pi) for pi in p]
    # entropy: the smaller, the more certain we are - i.e. the lower our surprise
    H = -sum(I) / len(I)  # in nats!
    # note that this will give a perfect zero for a single count in one language
    # or for 5K counts in one language. So we also need the count..
    count = R.zscore(g, lang)
    if count < min_count and H > max_entropy:
        # too few observations and too much uncertainty: abstain
        return None, 1
    else:
        return lang, count
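As a standalone illustration of the entropy check (without Redis), the helper below computes the entropy of a distribution derived from raw language counts; `counts_to_entropy` is a hypothetical name for this sketch, not a function from the bitly codebase, and unlike the slide's version it does not divide by the number of languages:

```python
import numpy as np

def counts_to_entropy(counts):
    """Entropy in nats of the distribution implied by raw counts."""
    p = np.array(counts, dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log(p))

# A hash seen almost entirely in one language: low entropy (confident).
confident = counts_to_entropy([50, 1])

# A hash split evenly across four languages: high entropy (uncertain).
uncertain = counts_to_entropy([10, 10, 10, 10])

print(confident < uncertain)  # True
```

The second case evaluates to log(4) nats, the maximum possible for four languages, which is exactly the situation where we want to abstain rather than guess.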
75. Dragoneye
Choosing \phi is important.
It must be interpretable, and smooth (but not too smooth).
We use a distribution for \phi: a function that sums to 1 and is 0 at the origin.
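One simple kernel with both of these properties is \phi(t) = t e^{-t} (a Gamma(2, 1) density): it is zero at the origin and integrates to 1 over t >= 0. This is an illustrative choice for the sketch below, not necessarily the function bitly used:

```python
import numpy as np

def phi(t):
    """Illustrative decay kernel: zero at the origin, integrates to 1 on [0, inf)."""
    return t * np.exp(-t)

# Zero at the origin: a phrase gets no weight the instant it first appears.
print(phi(0.0))

# Integrates to ~1: numerical check with a Riemann sum over [0, 50).
dt = 0.001
grid = np.arange(0.0, 50.0, dt)
print(np.sum(phi(grid)) * dt)
```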
Studied offline, using Hadoop.
Build a supervised classifier over timeseries.
Build a random forest ensemble decision tree classifier.
The data system fortune cookie game.
Creative Commons: http://www.flickr.com/photos/mzn37/308048794/sizes/o/in/photostream/
The simplification is important for three reasons:
1) A continuous function of time that simplifies to \phi
2) It's linear, so the sum of the click rates on each page with a phrase is the click rate per phrase
3) IT'S FAST
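The linearity point (reason 2) can be sketched with toy data; the page names and rates below are made up for illustration:

```python
# Hypothetical per-page click rates for pages containing a single phrase.
page_rates = {"page_a": 0.5, "page_b": 1.0, "page_c": 0.5}

# Because the model is linear, the phrase-level click rate is just the
# sum of the per-page click rates for pages carrying that phrase.
phrase_rate = sum(page_rates.values())
print(phrase_rate)  # 2.0
```

No per-phrase recomputation is needed: aggregating pages is a sum, which is part of why it's fast.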
The 0 at the origin ensures that we have seen sustained click rates on a phrase before we think it's anything useful.