Short URLs, Big Fun

These are the slides from a talk I gave at Dropbox this month (Feb 2012). It was a narrative about the evolution of bitly and a technical presentation about algorithms and infrastructure. The live demo portion is not represented in the slides (and each of the visuals has an accompanying story).

Published in: Technology, Education

  1. Short URLs, Big Fun: Understanding the World in Realtime. Hilary Mason, Chief Scientist, bitly. @hmason h@bit.ly
  2. http://www.pcworld.com/article/223409/move_over_dr_soong_girls_can_build_android_apps_too.html http://bit.ly/hOnbWg
  3. [fireplace]
  4. How do we change the world?
  5. Can we understand the world, first?
  6. Big Data
  7. Data
  8. 10s of millions of URLs per day • 100s of millions of clicks per day • 10s of billions of URLs
  9. encodes: {"g": "zalAU0", "i": "173.213.X.X","h": "zalAU0","l": "bitly","u": "http://www.amazon.com/Country-Life-Cal-Mag-Potassium-Target-Tablets/dp/B0001VUZ3A?SubscriptionId=AKIAJGA7AAB6QE7WENSQ&tag=mycellrevi-20&linkCode=sp1&camp=2025&creative=165953&creativeASIN=B0001VUZ3A", "t": 1328266799,"_id": "4f2bbe2f-0035d-063a1-3d1cf10a"}
  10. decode: {"a": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)","c": "US","nk": 1,"tz": "America/New_York","gr": "NY","g": "xNaZ9h","i": "98.118.X.X","h": "wXxuKW","k": "4eefe4be-003e4-X-X","l": "moma","al": "en-US", "hh":"bit.ly","r":"http://www.facebook.com/l.php?u=http%3A%2F%2Fbit.ly%2FwXxuKW&h=rAQG_ZZ2GAQGQth1IOej-9_KHmbpEvh6FlllZjsAqg6A7Rw","u": "http://www.brainpickings.org/index.php/2012/02/02/jackson-pollock-father-letter/","t": 1328272481,"hc": 1328232072,"cy": "East Amherst","ll": [43.044101715087891, -78.694900512695312]}
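The decode record on slide 10 is plain JSON, so a few lines of Python are enough to explore one. The sketch below is only an illustration: the field meanings (`t` as click time, `hc` as hash creation time, `c` as country, `cy` as city, `ll` as latitude and longitude) are assumptions read off the sample values, and `summarize_click` is a hypothetical helper, not bitly code.

```python
import json

# Trimmed copy of the decode record from slide 10.
RAW = '''{"c": "US", "tz": "America/New_York", "g": "xNaZ9h", "h": "wXxuKW",
"al": "en-US", "t": 1328272481, "hc": 1328232072, "cy": "East Amherst",
"ll": [43.044101715087891, -78.694900512695312]}'''


def summarize_click(raw):
    """Pull a few fields out of one decode record (field meanings are assumed)."""
    event = json.loads(raw)
    lat, lon = event["ll"]
    return {
        "hash": event["h"],
        "country": event["c"],
        "city": event["cy"],
        "lat": lat,
        "lon": lon,
        # Seconds between link creation (hc) and this click (t): assumed semantics.
        "age_seconds": event["t"] - event["hc"],
    }


summary = summarize_click(RAW)
print(summary["city"], summary["age_seconds"])
```

Under those assumptions, this click arrived a little over 11 hours after the short link was created.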
  11. a link: • URL • Content • Ref distribution • Geo distribution • Language • Key phrases • Topic
  12. Data Science? Analytics / Science
  13. Data Science? Things you can just count. / Things you can’t.
  14. Data scientists? [Venn diagram: engineering nerds, math nerds, comp sci nerds, hacking nerds; the overlap is "awesome nerds"]
  15. bitly science team!
  16. What can we learn from a lot of people talking to each other?
  17. A few things that we can count...
  18. How do people use different devices?
  19. What happens on the internet when society isn’t stable?
  20. Revolution.
  21. (Silly Things on the Internet)
  22. the cutest kitten
  23. A few things that we can count... cleverly.
  24. What spoken languages are in a page?
  25. raw data: "es" • "en-us,en;q=0.5" • "pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4" • "en-gb,en;q=0.5" • "en-US,en;q=0.5" • "es-es,es;q=0.8,en-us;q=0.5,en;q=0.3" • "de, en-gb;q=0.9, en;q=0.8"
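The strings on slide 25 are HTTP Accept-Language headers: comma-separated language tags, each with an optional `q` weight that defaults to 1. As a minimal sketch of how such a header could be turned into weighted language preferences (the function name `parse_accept_language` and the choice to lowercase tags are assumptions for illustration, not something the slides specify):

```python
def parse_accept_language(header):
    """Return (language, weight) pairs sorted by descending weight."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        lang, _, params = part.partition(";")
        weight = 1.0  # a missing q parameter means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                weight = float(params[2:])
            except ValueError:
                weight = 0.0  # treat malformed weights as "not wanted"
        prefs.append((lang.strip().lower(), weight))
    # sorted() is stable, so equal weights keep their header order.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


print(parse_accept_language("pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4"))
```

Counting the top-weighted tag per click is one simple way to build the per-link language tallies that the next slide works from.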
  26. entropy calculation

      import numpy as np

      def ghash2lang(g, R, min_count=3, max_entropy=0.2):
          """Return the majority-vote language for a given hash."""
          # R is a Redis-style client; g keys a sorted set of language -> count
          lang = R.zrevrange(g, 0, 0)[0]
          # let's calculate the entropy!
          # possible languages
          x = R.zrange(g, 0, -1)
          # distribution over those languages
          p = np.array([R.zscore(g, langi) for langi in x])
          p /= p.sum()
          # info content
          I = [pi * np.log(pi) for pi in p]
          # entropy: smaller the more certain we are! - i.e. the lower our surprise
          H = -sum(I) / len(I)  # in nats!
          # note that this will give a perfect zero for a single count in one language
          # or for 5K counts in one language. So we also need the count..
          count = R.zscore(g, lang)
          # accept the vote only with enough counts and low enough entropy
          if count >= min_count and H <= max_entropy:
              return lang, count
          else:
              return None, 1
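To try the idea on slide 26 without a Redis server, the same vote can be sketched over a plain dictionary of language counts. This is an illustrative variant under stated assumptions, not the code from the talk: `majority_language` is a hypothetical name, it uses plain Shannon entropy in nats (it does not divide by the number of languages as the slide does), and it returns `(None, 0)` when no language wins.

```python
import math


def majority_language(counts, min_count=3, max_entropy=0.2):
    """Return (language, count) when one language clearly dominates, else (None, 0)."""
    total = sum(counts.values())
    if total == 0:
        return None, 0
    # Most frequent language and its raw count.
    lang, count = max(counts.items(), key=lambda item: item[1])
    # Shannon entropy in nats; languages with zero counts contribute nothing.
    entropy = -sum(
        (c / total) * math.log(c / total) for c in counts.values() if c > 0
    )
    # Require both enough evidence and a peaked distribution.
    if count >= min_count and entropy <= max_entropy:
        return lang, count
    return None, 0


print(majority_language({"en": 97, "es": 3}))
print(majority_language({"en": 5, "es": 5}))
print(majority_language({"en": 1}))
```

The third call shows why the count check matters: a single click has zero entropy, yet it is too little evidence to label the page.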
  27. http://4sq.com/96kc1O
  28. What’s the context around a link?
  29. [demo]
  30. Things we have to think about. (science)
  31. What’s a human?
  32. normal click distributions
  33. abnormal click distributions
  34. Organic vs Inorganic?
  35. AT SCALE
  36. 1. Research offline
      2. Do fancy math – find the shortcuts
      3. Design infrastructure
      4. Re-design to run at scale and speed
  37. Realtime Search
  38. Realtime Search: Attributes calculated either at index time or query time. Rankings can vary by second.
  39. [demo]
  40. What are people paying attention to right now?
  41. actual rate of clicks on phrases vs expected rate of clicks on phrases
  42. Dragoneye: We calculate clickrate with a sort of moving average: [equation] where [equation]
  43. Dragoneye: We represent [symbol] as a sum of delta spikes. This simplifies to: [equation]
  44. Dragoneye: Choosing [symbol] is important. It must be interpretable, and smooth (but not too smooth). We use a distribution for [symbol] that is a function that sums to 1. The function is 0 at the origin.
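The Dragoneye slides describe the rate estimate mostly in words here, so the following is only a sketch of the general idea: treat each click as a spike and smooth the spikes with a kernel that is 0 at the origin and integrates to 1. The kernel used below, k(t) = (t/τ²)·exp(−t/τ), is an assumption chosen because it has those two properties; `tau`, `kernel`, and `click_rate` are hypothetical names, and bitly's actual choice may well differ.

```python
import math


def kernel(dt, tau=60.0):
    """Smoothing kernel: 0 at dt=0, peaks at dt=tau, integrates to 1 over dt>=0."""
    if dt <= 0:
        return 0.0
    return (dt / tau**2) * math.exp(-dt / tau)


def click_rate(now, click_times, tau=60.0):
    """Estimated clicks per second at `now`: one kernel bump per past click."""
    return sum(kernel(now - t, tau) for t in click_times)


clicks = [0.0, 10.0, 20.0, 30.0]  # click timestamps in seconds (made-up data)
print(click_rate(90.0, clicks))
```

Because each click is a spike, the smoothing integral collapses to a sum of kernel values, one per click. A larger `tau` gives a smoother but slower-reacting rate, which is the "smooth, but not too smooth" trade-off the slide mentions.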
  45. [demo]
  46. philosophy
  47. simple math > fancy math
  48. How do we know when we’ve won?
  49. How do we communicate what we’ve learned effectively?
  50. Ask the crazy questions.
  51. Thank you! h@bit.ly @hmason
