Short URLs, Big Fun
 


These are the slides from a talk I gave at Dropbox this month (Feb 2012). It was a narrative about the evolution of bitly and a technical presentation about algorithms and infrastructure. The live demo portion is not represented in the slides (and each of the visuals has an accompanying story).

  • This is going to be a talk for people who love the internet.
  • The true story of bitly, engineering, data science, love
    How to do data science at scale
    Building teams and keeping people happy
    Clever tricks
  • Thomas Karaganis at MS Research: 1% of new URLs per day
  • Shortened links get shared on different platforms and methods
  • Messages move across platforms in complicated ways
  • We have a lot of growing up to do
  • http://www.flickr.com/photos/wanderingnome/73328967/sizes/l/in/photostream/
    Philosophy, science, engineering, cool tricks
  • Pow! Surprise! Here we are!
  • …first, we understand it
  • …first, we understand it
  • Asking questions.
  • http://www.flickr.com/photos/32443746@N07/4753829490/
  • Egypt.
  • Tunisia.
  • …and we can do the same thing for geo data
  • Studied offline, using Hadoop
    Build a supervised classifier over timeseries
    Build a random forest ensemble decision tree classifier
  • The data system fortune cookie game
    Creative commons: http://www.flickr.com/photos/mzn37/308048794/sizes/o/in/photostream/
  • The simplification is important for three reasons:
    1) A continuous function of time that simplifies to \phi
    2) It’s linear, so the sum of the click rates on each page with a phrase is the click rate per phrase
    3) IT’S FAST
  • The 0 at the origin ensures that we have seen sustained click rates on a phrase before we think it’s anything useful.

Short URLs, Big Fun: Presentation Transcript

  • Short URLs, Big Fun: Understanding the World in Realtime
    Hilary Mason, Chief Scientist, bitly
    @hmason
    h@bit.ly
  • http://www.pcworld.com/article/223409/move_over_dr_soong_girls_can_build_android_apps_too.html http://bit.ly/hOnbWg
  • [fireplace]
  • How do we change the world?
  • Can we understand the world, first?
  • Big Data
  • Data
  • 10s of millions of URLs per day
    100s of millions of clicks per day
    10s of billions of URLs
  • encodes: {"g": "zalAU0", "i": "173.213.X.X", "h": "zalAU0", "l": "bitly", "u": "http://www.amazon.com/Country-Life-Cal-Mag-Potassium-Target-Tablets/dp/B0001VUZ3A?SubscriptionId=AKIAJGA7AAB6QE7WENSQ&tag=mycellrevi-20&linkCode=sp1&camp=2025&creative=165953&creativeASIN=B0001VUZ3A", "t": 1328266799, "_id": "4f2bbe2f-0035d-063a1-3d1cf10a"}
  • decode: {"a": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)", "c": "US", "nk": 1, "tz": "America/New_York", "gr": "NY", "g": "xNaZ9h", "i": "98.118.X.X", "h": "wXxuKW", "k": "4eefe4be-003e4-X-X", "l": "moma", "al": "en-US", "hh": "bit.ly", "r": "http://www.facebook.com/l.php?u=http%3A%2F%2Fbit.ly%2FwXxuKW&h=rAQG_ZZ2GAQGQth1IOej-9_KHmbpEvh6FlllZjsAqg6A7Rw", "u": "http://www.brainpickings.org/index.php/2012/02/02/jackson-pollock-father-letter/", "t": 1328272481, "hc": 1328232072, "cy": "East Amherst", "ll": [43.044101715087891, -78.694900512695312]}
  • a link: URL, Content, Ref distribution, Geo distribution, Language, Key phrases, Topic
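As a rough illustration of how a referrer and geo distribution could be built per link from decode events, here is a minimal sketch. The events are made up, and the key meanings ("g" as the link hash, "c" as the country code, "r" as the referrer URL) are assumptions read off the sample decode record, not a documented schema. The function name `link_distributions` is hypothetical.

```python
import json
from collections import Counter
from urllib.parse import urlparse

# Made-up decode events reusing the keys from the sample record.
# Assumed meanings: "g" = link hash, "c" = country code, "r" = referrer URL.
EVENTS = [
    '{"g": "xNaZ9h", "c": "US", "r": "http://www.facebook.com/l.php", "t": 1328272481}',
    '{"g": "xNaZ9h", "c": "US", "r": "http://twitter.com/", "t": 1328272490}',
    '{"g": "xNaZ9h", "c": "GB", "r": "http://www.facebook.com/l.php", "t": 1328272495}',
    '{"g": "zalAU0", "c": "US", "r": "direct", "t": 1328272499}',
]

def link_distributions(lines, link_hash):
    """Count referrer hosts and countries for one link hash."""
    refs, geo = Counter(), Counter()
    for line in lines:
        event = json.loads(line)
        if event.get("g") != link_hash:
            continue
        # Referrers without a host part (e.g. "direct") are kept as-is.
        host = urlparse(event.get("r", "")).netloc or event.get("r", "unknown")
        refs[host] += 1
        geo[event.get("c", "unknown")] += 1
    return refs, geo

refs, geo = link_distributions(EVENTS, "xNaZ9h")
```

Counting per event like this is one of the "things you can just count"; at the volumes quoted earlier it would presumably run as a streaming or batch job rather than over an in-memory list.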
  • Data Science? Analytics, Science
  • Data Science? Things you can just count. Things you can’t.
  • Data scientists? [Venn diagram: engineering nerds, math nerds, comp sci nerds, hacking nerds; overlap: awesome nerds]
  • bitly science team!
  • What can we learn from a lot of people talking to each other?
  • A few things that we can count...
  • How do people use different devices?
  • What happens on the internet when society isn’t stable?
  • Revolution.
  • (Silly Things on the Internet)
  • the cutest kitten
  • A few things that we can count... cleverly.
  • What spoken languages are in a page?
  • raw data:
    "es"
    "en-us,en;q=0.5"
    "pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4"
    "en-gb,en;q=0.5"
    "en-US,en;q=0.5"
    "es-es,es;q=0.8,en-us;q=0.5,en;q=0.3"
    "de, en-gb;q=0.9, en;q=0.8"
  • entropy calculation

    import numpy as np

    def ghash2lang(g, R, min_count=3, max_entropy=0.2):
        """Return the majority vote of a language for a given hash."""
        # R is a Redis client; g keys a sorted set of language -> count
        lang = R.zrevrange(g, 0, 0)[0]
        # let's calculate the entropy!
        # possible languages
        x = R.zrange(g, 0, -1)
        # distribution over those languages
        p = np.array([R.zscore(g, langi) for langi in x], dtype=float)
        p /= p.sum()
        # info content
        I = [pi * np.log(pi) for pi in p]
        # entropy: smaller the more certain we are! - i.e. the lower our surprise
        H = -sum(I) / len(I)  # in nats!
        # note that this will give a perfect zero for a single count in one language
        # or for 5K counts in one language. So we also need the count..
        count = R.zscore(g, lang)
        # accept only with enough votes and low enough entropy
        if count >= min_count and H <= max_entropy:
            return lang, count
        else:
            return None, 1
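The same vote can be tried without a Redis instance. The sketch below is a standalone approximation, not bitly's code: it assumes the first tag of each Accept-Language header is the reader's primary language (ignoring the q weights), and the helper names `primary_language` and `majority_language` are hypothetical. The entropy is divided by the number of candidate languages to mirror the slide's calculation.

```python
import math
from collections import Counter

def primary_language(header):
    """Take the first tag of an Accept-Language header, keep its base language."""
    first = header.split(",")[0].split(";")[0].strip().lower()
    return first.split("-")[0]

def majority_language(headers, min_count=3, max_entropy=0.2):
    """Return (language, count) if the vote is confident, else (None, 1)."""
    counts = Counter(primary_language(h) for h in headers if h.strip())
    if not counts:
        return None, 1
    total = sum(counts.values())
    probs = [n / total for n in counts.values()]
    # Entropy in nats, divided by the number of candidate languages
    # (mirroring the calculation on the slide).
    entropy = -sum(p * math.log(p) for p in probs) / len(probs)
    lang, count = counts.most_common(1)[0]
    # Confident only with enough votes AND a concentrated distribution.
    if count >= min_count and entropy <= max_entropy:
        return lang, count
    return None, 1

english_page = ["en-us,en;q=0.5", "en-gb,en;q=0.5", "en-US,en;q=0.5", "en"]
mixed_page = ["es", "en-us,en;q=0.5", "pt-BR,pt;q=0.8", "de, en-gb;q=0.9, en;q=0.8"]
```

With these inputs, `english_page` yields a confident vote for "en", while `mixed_page` is rejected because no language reaches `min_count`.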
  • http://4sq.com/96kc1O
  • What’s the context around a link?
  • [demo]
  • Things we have to think about. (science)
  • What’s a human?
  • normal click distributions
  • abnormal click distributions
  • Organic vs Inorganic?
  • AT SCALE
  • 1. Research offline
    2. Do fancy math – find the shortcuts
    3. Design infrastructure
    4. Re-design to run at scale and speed
  • Realtime Search
  • Realtime Search
    Attributes calculated either at index time or query time.
    Rankings can vary by second.
  • [demo]
  • What are people paying attention to right now?
  • actual rate of clicks on phrases vs expected rate of clicks on phrases
  • Dragoneye
    We calculate clickrate with a sort of moving average: [equation] where [symbol definitions]
  • Dragoneye
    We represent [symbol] as a sum of delta spikes.
    This simplifies to: [equation]
  • Dragoneye
    Choosing [symbol] is important.
    It must be interpretable, and smooth (but not too smooth).
    We use a distribution for [symbol] that is a function that sums to 1. The function is 0 at the origin.
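The equations on the Dragoneye slides did not survive in this transcript, so the following is only a sketch of the general idea under stated assumptions. If clicks are treated as delta spikes and smoothed with a kernel, the moving average collapses to one kernel evaluation per click. The kernel used here, (t / tau^2) * exp(-t / tau), is an assumption chosen because it is 0 at the origin, smooth, and integrates to 1; the talk's actual function is not known. The names `phi`, `click_rate` and `tau` are hypothetical.

```python
import math

def phi(t, tau=60.0):
    """Assumed kernel: 0 at the origin, peaks at t = tau, integrates to 1."""
    if t <= 0:
        return 0.0
    return (t / tau**2) * math.exp(-t / tau)

def click_rate(now, click_times, tau=60.0):
    """Smoothed click rate at `now`: one kernel evaluation per past click."""
    return sum(phi(now - t, tau) for t in click_times)

# Linearity: the rate for a phrase is the sum of the rates of its pages.
page_a = [0.0, 10.0, 45.0]   # click timestamps in seconds (made up)
page_b = [5.0, 50.0]
now = 60.0
phrase_rate = click_rate(now, page_a + page_b)
```

Because the kernel is 0 at the origin, a click that arrived this instant contributes nothing yet, so a single fresh click cannot make a phrase look active; the rate only rises once clicks have been sustained for a while.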
  • [demo]
  • philosophy
  • simple math > fancy math
  • How do we know when we’ve won?
  • How do we communicate what we’ve learned effectively?
  • Ask the crazy questions.
  • Thank you!
    h@bit.ly
    @hmason