These are the slides from a talk I gave at Dropbox this month (Feb 2012). It was a narrative about the evolution of bitly and a technical presentation about algorithms and infrastructure. The live demo portion is not represented in the slides (and each of the visuals has an accompanying story).
45. entropy calculation
import numpy as np

def ghash2lang(g, R, min_count=3, max_entropy=0.2):
    """
    returns the majority vote of a language for a given hash
    """
    lang = R.zrevrange(g, 0, 0)[0]
    # let's calculate the entropy!
    # possible languages
    x = R.zrange(g, 0, -1)
    # distribution over those languages
    p = np.array([R.zscore(g, langi) for langi in x])
    p /= p.sum()
    # info content
    I = [pi * np.log(pi) for pi in p]
    # entropy: the smaller, the more certain we are - i.e. the lower our surprise
    H = -sum(I) / len(I)  # in nats!
    # note that this will give a perfect zero for a single count in one language
    # or for 5K counts in one language. So we also need the count..
    count = R.zscore(g, lang)
    if count < min_count and H > max_entropy:
        # too few observations and too much uncertainty: abstain
        return None, 1
    else:
        return lang, count
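As a standalone illustration of the entropy check (without Redis), the helper below computes the entropy of a distribution derived from raw language counts; `counts_to_entropy` is a hypothetical name for this sketch, not a function from the bitly codebase, and unlike the slide's version it does not divide by the number of languages:

```python
import numpy as np

def counts_to_entropy(counts):
    """Entropy in nats of the distribution implied by raw counts."""
    p = np.array(counts, dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log(p))

# A hash seen almost entirely in one language: low entropy (confident).
confident = counts_to_entropy([50, 1])

# A hash split evenly across four languages: high entropy (uncertain).
uncertain = counts_to_entropy([10, 10, 10, 10])

print(confident < uncertain)  # True
```

The second case evaluates to log(4) nats, the maximum possible for four languages, which is exactly the situation where we want to abstain rather than guess.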
75. Dragoneye
Choosing \phi is important.
It must be interpretable, and smooth (but not too smooth).
We use a distribution for \phi: a function that sums to 1 and is 0 at the origin.
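One simple kernel with both of these properties is \phi(t) = t e^{-t} (a Gamma(2, 1) density): it is zero at the origin and integrates to 1 over t >= 0. This is an illustrative choice for the sketch below, not necessarily the function bitly used:

```python
import numpy as np

def phi(t):
    """Illustrative decay kernel: zero at the origin, integrates to 1 on [0, inf)."""
    return t * np.exp(-t)

# Zero at the origin: a phrase gets no weight the instant it first appears.
print(phi(0.0))

# Integrates to ~1: numerical check with a Riemann sum over [0, 50).
dt = 0.001
grid = np.arange(0.0, 50.0, dt)
print(np.sum(phi(grid)) * dt)
```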
Studied offline, using Hadoop.
Build a supervised classifier over timeseries.
Build a random forest ensemble decision tree classifier.
The data system fortune cookie game.
Creative Commons: http://www.flickr.com/photos/mzn37/308048794/sizes/o/in/photostream/
The simplification is important for three reasons:
1) A continuous function of time that simplifies to \phi
2) It's linear, so the sum of the click rates on each page with a phrase is the click rate per phrase
3) IT'S FAST
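The linearity point (reason 2) can be sketched with toy data; the page names and rates below are made up for illustration:

```python
# Hypothetical per-page click rates for pages containing a single phrase.
page_rates = {"page_a": 0.5, "page_b": 1.0, "page_c": 0.5}

# Because the model is linear, the phrase-level click rate is just the
# sum of the per-page click rates for pages carrying that phrase.
phrase_rate = sum(page_rates.values())
print(phrase_rate)  # 2.0
```

No per-phrase recomputation is needed: aggregating pages is a sum, which is part of why it's fast.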
The 0 at the origin ensures that we have seen sustained click rates on a phrase before we think it's anything useful.