Short URLs, Big Fun



These are the slides from a talk I gave at Dropbox this month (Feb 2012). It was a narrative about the evolution of bitly and a technical presentation about algorithms and infrastructure. The live demo portion is not represented in the slides, and each of the visuals has an accompanying story.

Published in: Technology, Education

  • This is going to be a talk for people who love the internet.
  • The true story of bitly, engineering, data science, love. How to do data science at scale. Building teams and keeping people happy. Clever tricks.
  • Thomas Karaganis at MS Research 1% of new URLs per day
  • Shortened links get shared across different platforms and by different methods
  • Messages move across platforms in complicated ways
  • We have a lot of growing up to do
  •, science, engineering, cool tricks
  • Pow! Surprise! Here we are!
  • …first, we understand it
  • Asking questions.
  • Egypt.
  • Tunisia.
  • …and we can do the same thing for geo data
  • Studied offline, using Hadoop. Build a supervised classifier over time series. Build a random forest ensemble decision tree classifier.
  • The data system fortune cookie game. Creative Commons:
  • The simplification is important for three reasons: 1) a continuous function of time that simplifies to $\phi$; 2) it’s linear, so the sum of the click rates on each page with a phrase is the click rate per phrase; 3) IT’S FAST.
  • The 0 at the origin ensures that we have seen sustained click rates on a phrase before we think it’s anything useful.
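The notes above on the click-rate simplification can be illustrated with a small sketch. It is not bitly's implementation: the kernel `phi`, its time constant `tau`, and the page click times are all assumptions chosen only because they match the stated properties (zero at the origin, integrates to 1, and linear in the clicks).

```python
import math

def phi(dt, tau=60.0):
    """Hypothetical kernel: 0 at the origin, integrates to 1 over dt >= 0."""
    if dt <= 0:
        return 0.0
    return (dt / tau**2) * math.exp(-dt / tau)

def click_rate(click_times, now, tau=60.0):
    """Smoothed click rate at `now`: one kernel evaluation per past click."""
    return sum(phi(now - t, tau) for t in click_times)

# Made-up click times (seconds) on two pages that mention the same phrase.
page_a = [0.0, 30.0, 45.0]
page_b = [10.0, 50.0]

now = 120.0
rate_a = click_rate(page_a, now)
rate_b = click_rate(page_b, now)

# Linearity: pooling the clicks gives the same rate as adding the page rates.
rate_phrase = click_rate(page_a + page_b, now)
print(math.isclose(rate_phrase, rate_a + rate_b))
```

Because the rate is a plain sum over clicks, per-page rates can be added to get a per-phrase rate without revisiting the raw clicks, which is the property the notes call out as fast.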

    1. Short URLs, Big Fun: Understanding the World in Realtime. Hilary Mason, Chief Scientist, bitly, @hmason
    3. [fireplace]
    4. How do we change the world?
    5. Can we understand the world, first?
    6. Big Data
    7. Data
    8. 10s of millions of URLs per day; 100s of millions of clicks per day; 10s of billions of URLs
    9. encodes: {"g": "zalAU0", "i": "173.213.X.X", "h": "zalAU0", "l": "bitly", "u": "", "t": 1328266799, "_id": "4f2bbe2f-0035d-063a1-3d1cf10a"}
    10. decode: {"a": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)", "c": "US", "nk": 1, "tz": "America/New_York", "gr": "NY", "g": "xNaZ9h", "i": "98.118.X.X", "h": "wXxuKW", "k": "4eefe4be-003e4-X-X", "l": "moma", "al": "en-US", "hh": "", "r": "", "u": "", "t": 1328272481, "hc": 1328232072, "cy": "East Amherst", "ll": [43.044101715087891, -78.694900512695312]}
    11. a link: • URL • Content • Ref distribution • Geo distribution • Language • Key phrases • Topic
    12. Data Science? Analytics, Science
    13. Data Science? Things you can just count. Things you can’t.
    14. Data scientists? [diagram labels: engineering, math, comp sci, hacking; "nerds" (four times); "awesome nerds"]
    15. bitly science team!
    16. What can we learn from a lot of people talking to each other?
    17. A few things that we can count...
    18. How do people use different devices?
    19. What happens on the internet when society isn’t stable?
    20. Revolution.
    21. (Silly Things on the Internet)
    22. the cutest kitten
    23. A few things that we can count... cleverly.
    24. What spoken languages are in a page?
    25. raw data: "es", "en-us,en;q=0.5", "pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4", "en-gb,en;q=0.5", "en-US,en;q=0.5", "es-es,es;q=0.8,en-us;q=0.5,en;q=0.3", "de, en-gb;q=0.9, en;q=0.8"
    26. entropy calculation

        import numpy as np

        def ghash2lang(g, R, min_count=3, max_entropy=0.2):
            """Return the majority-vote language for a given hash."""
            # R: Redis-style client (assumed); g: sorted-set key of language counts
            lang = R.zrevrange(g, 0, 0)[0]
            # let's calculate the entropy!
            # possible languages
            x = R.zrange(g, 0, -1)
            # distribution over those languages
            p = np.array([R.zscore(g, langi) for langi in x])
            p /= p.sum()
            # info content
            I = [pi * np.log(pi) for pi in p]
            # entropy: smaller the more certain we are! - i.e. the lower our surprise
            H = -sum(I) / len(I)  # in nats!
            # note that this will give a perfect zero for a single count in one language
            # or for 5K counts in one language. So we also need the count.
            count = R.zscore(g, lang)
            # accept the vote only with enough counts and low enough entropy
            if count >= min_count and H <= max_entropy:
                return lang, count
            else:
                return None, 1
    28. What’s the context around a link?
    29. [demo]
    30. Things we have to think about. (science)
    31. What’s a human?
    32. normal click distributions
    33. abnormal click distributions
    34. Organic vs Inorganic?
    35. AT SCALE
    36. (1) Research offline; (2) Do fancy math: find the shortcuts; (3) Design infrastructure; (4) Re-design to run at scale and speed
    37. Realtime Search
    38. Realtime Search: Attributes calculated either at index time or query time. Rankings can vary by second.
    39. [demo]
    40. What are people paying attention to right now?
    41. actual rate of clicks on phrases vs. expected rate of clicks on phrases
    42. Dragoneye: We calculate click rate with a sort of moving average: [equation] where [definition]
    43. Dragoneye: We represent [symbol] as a sum of delta spikes. This simplifies to: [equation]
    44. Dragoneye: Choosing [symbol] is important. It must be interpretable, and smooth (but not too smooth). We use a distribution for [symbol] that is a function that sums to 1. The function is 0 at the origin.
    45. [demo]
    46. philosophy
    47. simple math > fancy math
    48. How do we know when we’ve won?
    49. How do we communicate what we’ve learned effectively?
    50. Ask the crazy questions.
    51. Thank you!
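The language vote from slides 25 and 26 can be tried without a Redis server. The sketch below is a stand-in under stated assumptions, not the code from the talk: it tallies headers in memory with a `Counter`, takes the first-listed tag of each Accept-Language header as that visitor's language, and reduces it to its primary subtag (`pt-BR` becomes `pt`). The helper names `primary_language` and `majority_language` are hypothetical; the thresholds and the entropy normalisation follow the slide.

```python
import math
from collections import Counter

def primary_language(accept_language):
    """First-listed tag of an Accept-Language header, reduced to its primary subtag."""
    first = accept_language.split(",")[0]
    tag = first.split(";")[0].strip().lower()
    return tag.split("-")[0]

def majority_language(headers, min_count=3, max_entropy=0.2):
    """Majority-vote language, or (None, 1) when the vote is not trustworthy."""
    counts = Counter(primary_language(h) for h in headers)
    if not counts:
        return None, 1
    lang, count = counts.most_common(1)[0]
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    # Entropy in nats, divided by the number of languages as on slide 26.
    entropy = -sum(p * math.log(p) for p in probs) / len(probs)
    if count >= min_count and entropy <= max_entropy:
        return lang, count
    return None, 1

# The raw headers shown on slide 25.
slide_headers = [
    "es",
    "en-us,en;q=0.5",
    "pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4",
    "en-gb,en;q=0.5",
    "en-US,en;q=0.5",
    "es-es,es;q=0.8,en-us;q=0.5,en;q=0.3",
    "de, en-gb;q=0.9, en;q=0.8",
]

print(majority_language(slide_headers))
print(majority_language(["en-US,en;q=0.5"] * 5))
```

The slide's mixed sample spreads across four languages, so its entropy exceeds the threshold and no language is returned, while five identical English headers pass both the count and the entropy checks.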