9. > 15 M active users*
* Users active within the previous 30 days
Tuesday, October 23, 12
14. > 500 M playlists
> 1 century of listening
> 20 k new tracks added per day
> 18 M tracks
> Available in 15 Countries
> 15 M active users*
* Users active within the previous 30 days
24. Service overview
Storage
User
AP
Search
Metadata
...
26. Content pipeline
[Diagram: sources Label A, Label B, Label C, Label D; stages shown so far: Ingestion]
Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
27. Ingestion
[Slide art: the letters X, M, L scattered across the slide]
Background image: lord enfield (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/
32. Ingestion: Delivery formats
~ 10 different incoming XML formats
- Proprietary formats (majors)
- Spotify delivery format (mostly indies)
Thousands of lines of source-specific code
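One way to keep that much source-specific code manageable is to hide each delivery format behind a small parser that emits a common record. A minimal sketch, assuming two made-up formats (`major_x`, `spotify`) and made-up element names; none of these names come from the talk, and the real pipeline uses lxml and XSLT rather than the standard library:

```python
import xml.etree.ElementTree as ET

# Hypothetical registry: delivery format name -> parser function.
PARSERS = {}

def parser(format_name):
    """Register a source-specific parser under a format name."""
    def register(func):
        PARSERS[format_name] = func
        return func
    return register

@parser("major_x")  # made-up proprietary format
def parse_major_x(root):
    return [{"title": t.findtext("Name"), "artist": t.findtext("Performer")}
            for t in root.iter("Recording")]

@parser("spotify")  # made-up stand-in for an in-house delivery format
def parse_spotify(root):
    return [{"title": t.get("title"), "artist": t.get("artist")}
            for t in root.iter("track")]

def ingest(format_name, xml_text):
    """Parse one delivery and return records in the common shape."""
    return PARSERS[format_name](ET.fromstring(xml_text))

records = ingest(
    "major_x",
    "<Delivery><Recording><Name>Kiss</Name>"
    "<Performer>Prince</Performer></Recording></Delivery>",
)
```

Everything downstream of `ingest` then only needs to understand the common record shape, whichever label delivered the file.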
33. Data model [simplified]
[Entity diagram: Artist, Album, Disc, Track, Rights, Audio, Transcoding, linked by 1 and * multiplicities]
34. Ingestion
lxml and XSLT with extensions for parsing/transforming XML
35. Ingestion: XPath extensions
>>> from lxml import etree
>>> def formerlify(_, name):
...     return 'The artist formerly known as %s' % name
...
>>> # Register the function in a custom XPath namespace
>>> ns = etree.FunctionNamespace('http://my.org/myfunctions')
>>> ns['formerlify'] = formerlify
>>> ns.prefix = 'f'
>>> root = etree.XML('<a><b>Prince</b></a>')
>>> print(root.xpath('f:formerlify(string(b))'))
The artist formerly known as Prince
http://lxml.de/extensions.html#xpath-extension-functions
39. Ingestion
Fun (?!) fact: the largest XML file seen so far had 3.3 million rows and took up
350 MB of disk space
The Bible apparently fits in 3 MB of XML
>>> import timeit
>>> # Average seconds per parse over 5 runs
>>> timeit.timeit('e.parse("huge.xml")',
...               setup='import lxml.etree as e',
...               number=5) / 5
4.19...
>>> # xml.etree.cElementTree is the Python 2 era C accelerator (removed in Python 3.9)
>>> timeit.timeit('e.parse("huge.xml")',
...               setup='import xml.etree.cElementTree as e',
...               number=5) / 5
4.78...
>>> timeit.timeit('e.parse("huge.xml")',
...               setup='import xml.etree.ElementTree as e',
...               number=5) / 5
55.39...
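For files of this size it may not be necessary to build the whole tree before working on it. A minimal sketch of streaming parsing with the standard library's `iterparse`, assuming a hypothetical `track` element; the slides do not say whether the pipeline works this way:

```python
import io
import xml.etree.ElementTree as ET

def count_elements(source, tag):
    """Count `tag` elements while streaming, clearing each one after use."""
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            elem.clear()  # drop the element's children and text once handled
    return count

# Small in-memory stand-in for a large delivery file.
sample = io.BytesIO(b"<album><track/><track/><track/></album>")
n_tracks = count_elements(sample, "track")
```

`source` can also be a file name, so the same function could be pointed at a file on disk.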
42. Content pipeline
[Diagram: sources Label A, Label B, Label C, Label D; stages shown so far: Ingestion, Merge]
Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
44. Metadata - challenges
Image: Nicolas Genin (CC BY 2.0) http://www.flickr.com/photos/22785954@N08
48. Content pipeline
[Diagram: sources Label A, Label B, Label C, Label D; stages shown so far: Ingestion, Merge, Curation/enrichment]
Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
58. Content matching
(16 * 10 ** 6) ** 2 = 2.56 * 10 ** 14 (a large number)
Reduce search space:
>>> from unicodedata import normalize
>>> key = ''.join(normalize('NFD', char)[0].lower() for char in title)[:5]
Side note: Levenshtein (edit) distance is a heavy operation
-> sped up about 4x with PyPy (or use a C extension)
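The idea above can be sketched end to end: bucket titles by a short normalised key, then run the expensive edit distance only inside each bucket. This is an illustration, not the talk's implementation; it assumes the key is the first five normalised characters, and the function names and the `max_distance` threshold are made up here:

```python
from collections import defaultdict
from unicodedata import normalize

def blocking_key(title, length=5):
    """Strip accents, lowercase, and keep the first `length` characters."""
    return ''.join(normalize('NFD', ch)[0].lower() for ch in title)[:length]

def levenshtein(a, b):
    """Edit distance using the classic row-by-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

def candidate_pairs(titles, max_distance=2):
    """Compare titles only within the same bucket."""
    buckets = defaultdict(list)
    for title in titles:
        buckets[blocking_key(title)].append(title)
    pairs = []
    for bucket in buckets.values():
        for i, first in enumerate(bucket):
            for second in bucket[i + 1:]:
                if levenshtein(first.lower(), second.lower()) <= max_distance:
                    pairs.append((first, second))
    return pairs

pairs = candidate_pairs(["Purple Rain", "Purple Rian", "Kiss", "Purple Haze"])
```

The trade-off is that two titles differing within the key itself land in different buckets and are never compared, so the key length controls how much recall is given up for speed.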
60. Patch it!
Automatic data processing will never be perfect
65. Content pipeline
[Diagram: sources Label A, Label B, Label C, Label D; stages shown so far: Ingestion, Merge, Curation/enrichment, Transcoding]
Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
72. Content pipeline
[Diagram: sources Label A, Label B, Label C, Label D; stages shown so far: Ingestion, Merge, Indexing, Curation/enrichment, Transcoding]
Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
76. Index build
• Nightly batch job on db-dumps
• Previously mostly Python, but now moved to Java for performance reasons
• But still lots of Python helper scripts :)
83. Content pipeline
[Diagram: sources Label A, Label B, Label C, Label D; stages: Ingestion, Merge, Indexing, Publishing, Curation/enrichment, Transcoding]
On site live services, e.g. search, browse
Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
106. Choice of database
Depends on the use case - duh!
• PostgreSQL (e.g. user service)
• Cassandra (e.g. playlist service)
• Tokyo Cabinet (e.g. browse service)
• Lucene (search service)
• HDFS
107. PostgreSQL
[Pic. of elephant]
Image: http2007 (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/
108. PostgreSQL
Redundancy + scaling:
master/slave
109. PostgreSQL
Joins and subqueries -
let the query planner roll!
116. Thank you
henok@spotify.com
117. Distribution/publish
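Popen + gevent (although IO-bound)
import subprocess
import gevent
import gevent.monkey
gevent.monkey.patch_all()

def _wait(self):
    # Poll the child process, yielding to other greenlets between checks
    while True:
        res = self.poll()
        if res is not None:
            return res
        gevent.sleep(0.1)

# Replace the blocking wait() with the cooperative version
subprocess.Popen.wait = _wait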