Short URLs, Big Fun: Understanding the World in Realtime
Hilary Mason, Chief Scientist, bitly
These are the slides from a talk I gave at Dropbox this month (Feb 2012). It was a narrative about the evolution of bitly and a technical presentation about algorithms and infrastructure. The live demo portion is not represented in the slides (and each of the visuals has an accompanying story).

Published in: Technology, Education
Notes
  • This is going to be a talk for people who love the internet.
  • The true story of bitly, engineering, data science, love. How to do data science at scale. Building teams and keeping people happy. Clever tricks.
  • Thomas Karaganis at MS Research 1% of new URLs per day
  • Shortened links get shared on different platforms and methods
  • Messages move across platforms in complicated ways
  • We have a lot of growing up to do
  • http://www.flickr.com/photos/wanderingnome/73328967/sizes/l/in/photostream/ Philosophy, science, engineering, cool tricks
  • Pow! Surprise! Here we are!
  • …first, we understand it
  • …first, we understand it
  • Asking questions.
  • http://www.flickr.com/photos/32443746@N07/4753829490/
  • Egypt.
  • Tunisia.
  • …and we can do the same thing for geo data
  • Studied offline, using Hadoop. Build a supervised classifier over time series. Build a random forest ensemble decision tree classifier.
  • The data system fortune cookie game. Creative Commons: http://www.flickr.com/photos/mzn37/308048794/sizes/o/in/photostream/
  • The simplification is important for three reasons: 1) a continuous function of time that simplifies to \phi; 2) it’s linear, so the sum of the click rates on each page with a phrase is the click rate per phrase; 3) IT’S FAST.
  • The 0 at the origin ensures that we have seen sustained click rates on a phrase before we think it’s anything useful.
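The linearity and speed described in the last two notes can be illustrated with a small sketch. It assumes, purely for illustration, a kernel phi(dt) = (dt / tau^2) * exp(-dt / tau), which integrates to 1 and is 0 at the origin as the Dragoneye slides require. The talk does not say which distribution bitly actually uses, and the names `phi`, `click_rate`, and `tau` are hypothetical.

```python
import math

def phi(dt, tau=60.0):
    """Hypothetical smoothing kernel: 0 at dt = 0, integrates to 1 over dt >= 0."""
    if dt < 0:
        return 0.0  # clicks in the future contribute nothing
    return (dt / tau ** 2) * math.exp(-dt / tau)

def click_rate(click_times, now, tau=60.0):
    """Treat each click as a delta spike; the smoothed rate is a sum of kernel values."""
    return sum(phi(now - t, tau) for t in click_times)

# Two pages that contain the same phrase (click timestamps in seconds, made up).
page_a = [0.0, 10.0, 50.0]
page_b = [5.0, 55.0]
now = 120.0

rate_a = click_rate(page_a, now)
rate_b = click_rate(page_b, now)
# Linearity: the phrase rate equals the sum of the per-page rates.
rate_phrase = click_rate(page_a + page_b, now)
```

Because each click contributes one kernel evaluation, the rate for a phrase is just the sum of the rates of the pages that contain it, and a click that arrived this instant contributes nothing until time has passed.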
Transcript of "Short URLs, Big Fun"

    1. 1. Short URLs, Big Fun: Understanding the World in Realtime. Hilary Mason, Chief Scientist, bitly. @hmason h@bit.ly
    2. 2. http://www.pcworld.com/article/223409/move_over_dr_soong_girls_can_build_android_apps_too.html http://bit.ly/hOnbWg
    3. 3. [fireplace]
    4. 4. How do we change the world?
    5. 5. Can we understand the world, first?
    6. 6. Big Data
    7. 7. Data
    8. 8. 10s of millions of URLs per day; 100s of millions of clicks per day; 10s of billions of URLs
    9. 9. encodes: {"g": "zalAU0", "i": "173.213.X.X", "h": "zalAU0", "l": "bitly", "u": "http://www.amazon.com/Country-Life-Cal-Mag-Potassium-Target-Tablets/dp/B0001VUZ3A?SubscriptionId=AKIAJGA7AAB6QE7WENSQ&tag=mycellrevi-20&linkCode=sp1&camp=2025&creative=165953&creativeASIN=B0001VUZ3A", "t": 1328266799, "_id": "4f2bbe2f-0035d-063a1-3d1cf10a"}
    10. 10. decode: {"a": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)", "c": "US", "nk": 1, "tz": "America/New_York", "gr": "NY", "g": "xNaZ9h", "i": "98.118.X.X", "h": "wXxuKW", "k": "4eefe4be-003e4-X-X", "l": "moma", "al": "en-US", "hh": "bit.ly", "r": "http://www.facebook.com/l.php?u=http%3A%2F%2Fbit.ly%2FwXxuKW&h=rAQG_ZZ2GAQGQth1IOej-9_KHmbpEvh6FlllZjsAqg6A7Rw", "u": "http://www.brainpickings.org/index.php/2012/02/02/jackson-pollock-father-letter/", "t": 1328272481, "hc": 1328232072, "cy": "East Amherst", "ll": [43.044101715087891, -78.694900512695312]}
    11. 11. a link: URL; Content; Ref distribution; Geo distribution; Language; Key phrases; Topic
    12. 12. Data Science? Analytics. Science.
    13. 13. Data Science? Things you can just count. Things you can’t.
    14. 14. Data scientists? [diagram labels: engineering, math, comp sci, hacking, nerds, awesome nerds]
    15. 15. bitly science team!
    16. 16. What can we learn from a lot of people talking to each other?
    17. 17. A few things that we can count...
    18. 18. How do people use different devices?
    19. 19. What happens on the internet when society isn’t stable?
    20. 20. Revolution.
    21. 21. (Silly Things on the Internet)
    22. 22. the cutest kitten
    23. 23. A few things that we can count... cleverly.
    24. 24. What spoken languages are in a page?
    25. 25. raw data: "es", "en-us,en;q=0.5", "pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4", "en-gb,en;q=0.5", "en-US,en;q=0.5", "es-es,es;q=0.8,en-us;q=0.5,en;q=0.3", "de, en-gb;q=0.9, en;q=0.8"
    26. 26. entropy calculation

        import numpy as np

        def ghash2lang(g, R, min_count=3, max_entropy=0.2):
            """
            Returns the majority vote of a language for a given hash.
            R is a Redis-style client; g is the key of a sorted set of language counts.
            """
            lang = R.zrevrange(g, 0, 0)[0]
            # let's calculate the entropy!
            # possible languages
            x = R.zrange(g, 0, -1)
            # distribution over those languages
            p = np.array([R.zscore(g, langi) for langi in x])
            p /= p.sum()
            # info content
            I = [pi * np.log(pi) for pi in p]
            # entropy: smaller the more certain we are! - i.e. the lower our surprise
            H = -sum(I) / len(I)  # in nats!
            # note that this will give a perfect zero for a single count in one language
            # or for 5K counts in one language. So we also need the count..
            count = R.zscore(g, lang)
            # accept the vote only with enough counts and a low-entropy distribution
            if count >= min_count and H <= max_entropy:
                return lang, count
            else:
                return None, 1
    27. 27. http://4sq.com/96kc1O
    28. 28. What’s the context around a link?
    29. 29. [demo]
    30. 30. Things we have to think about. (science)
    31. 31. What’s a human?
    32. 32. normal click distributions
    33. 33. abnormal click distributions
    34. 34. Organic vs Inorganic?
    35. 35. AT SCALE
    36. 36. 1) Research offline; 2) Do fancy math – find the shortcuts; 3) Design infrastructure; 4) Re-design to run at scale and speed
    37. 37. Realtime Search
    38. 38. Realtime Search: Attributes calculated either at index time or query time. Rankings can vary by second.
    39. 39. [demo]
    40. 40. What are people paying attention to right now?
    41. 41. actual rate of clicks on phrases vs expected rate of clicks on phrases
    42. 42. Dragoneye: We calculate clickrate with a sort of moving average: [equation] where [equation]
    43. 43. Dragoneye: We represent [symbol] as a sum of delta spikes. This simplifies to: [equation]
    44. 44. Dragoneye: Choosing [symbol] is important. It must be interpretable, and smooth (but not too smooth). We use a distribution for [symbol] that is a function that sums to 1. The function is 0 at the origin.
    45. 45. [demo]
    46. 46. philosophy
    47. 47. simple math > fancy math
    48. 48. How do we know when we’ve won?
    49. 49. How do we communicate what we’ve learned effectively?
    50. 50. Ask the crazy questions.
    51. 51. Thank you! h@bit.ly @hmason
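The language question on slides 24 to 26 can be tried out without a Redis instance. The sketch below is an assumption-laden stand-in, not the bitly implementation: it counts the first language tag of each Accept-Language value (like the raw data on slide 25) in a `collections.Counter`, then applies the same kind of gate as slide 26, using entropy in nats divided by the number of languages seen. The function names `primary_language` and `vote_language` are hypothetical, and ignoring q-values is a deliberate simplification.

```python
import math
from collections import Counter

def primary_language(header):
    """Return the first listed language tag (a simplification: q-values are ignored)."""
    first = header.split(",")[0]
    return first.split(";")[0].strip().lower()

def vote_language(headers, min_count=3, max_entropy=0.2):
    """Majority vote over headers, accepted only if it is frequent and unambiguous."""
    counts = Counter(primary_language(h) for h in headers if h.strip())
    if not counts:
        return None, 0
    lang, count = counts.most_common(1)[0]
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    # Entropy in nats, normalized by the number of languages as on slide 26.
    entropy = -sum(p * math.log(p) for p in probs) / len(probs)
    if count >= min_count and entropy <= max_entropy:
        return lang, count
    return None, count

# Nine clicks from en-us browsers and one from a Spanish browser: a clear vote.
mostly_english = ["en-us,en;q=0.5"] * 9 + ["es"]
# A handful of mixed clicks: too few and too spread out to call.
mixed = ["en-US,en;q=0.5", "es", "de, en-gb;q=0.9, en;q=0.8", "es"]

clear_vote = vote_language(mostly_english)
unclear_vote = vote_language(mixed)
```

The count threshold matters for the reason given in the slide's comments: a single click in one language has zero entropy, so entropy alone cannot tell one click from five thousand.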
