Python <3 Content systems


Transcript

  • 1. Python <3 Content systems - managing millions of tracks for the masses. 22nd October 2012. (Slide footer: Tuesday, October 23, 12)
  • 9-14. > 500 M playlists > 1 century of listening > 20 k new tracks added per day > 18 M tracks > Available in 15 countries > 15 M active users* (* Users active within the previous 30 days)
  • 15-24. Service overview: Storage, User, AP, Search, Metadata, . . .
  • 25-26. Content pipeline: Label A, Label B, Label C, Label D; stages: Ingestion. Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
  • 27. Ingestion: XML. Background image: lord enfield (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/
  • 28-32. Ingestion: Delivery formats. ~ 10 different incoming XML formats: - Proprietary formats (majors) - Spotify delivery format (mostly indies). Thousands of lines of source-specific code
  • 33. Data model [simplified]: diagram relating Artist, Album, Disc, Track, Audio, Transcoding and Rights, with 1 and * multiplicities
  • 34. Ingestion: lxml and XSLT with extensions for parsing/transforming XML
  • 35. Ingestion: XPath extensions (http://lxml.de/extensions.html#xpath-extension-functions)
      >>> def formerlify(_, name):
      ...     return 'The artist formerly known as %s' % name
      ...
      >>> # Namespace stuff
      >>> from lxml import etree
      >>> ns = etree.FunctionNamespace('http://my.org/myfunctions')
      >>> ns['hello'] = formerlify  # expose formerlify to XPath as f:hello
      >>> ns.prefix = 'f'
      >>> root = etree.XML('<a><b>Prince</b></a>')
      >>> print(root.xpath('f:hello(string(b))'))
      The artist formerly known as Prince
  • 36-39. Ingestion. Fun (?!) fact: the largest XML file seen so far had 3.3 million rows, taking up 350 MB of disk space. The Bible apparently fits in 3 MB of XML.
      >>> import timeit
      >>> timeit.timeit('e.parse("huge.xml")', setup='import lxml.etree as e', number=5) / 5
      4.19...
      >>> timeit.timeit('e.parse("huge.xml")', setup='import xml.etree.cElementTree as e', number=5) / 5
      4.78...
      >>> timeit.timeit('e.parse("huge.xml")', setup='import xml.etree.ElementTree as e', number=5) / 5
      55.39...
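For a delivery of that size, building the whole tree is not the only option. The following is a minimal sketch, not the speaker's code, of a streaming parse with the standard library's `xml.etree.ElementTree.iterparse`. It is used here as a stand-in for lxml so the example runs without extra packages. The element names `delivery`, `track` and `title` and the helper `iter_track_titles` are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_track_titles(source):
    """Yield the <title> text of each <track>, one element at a time.

    `source` is a filename or a binary file object. The element names
    "track" and "title" are hypothetical, chosen for illustration.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "track":
            yield elem.findtext("title")
            # Drop the element's children and text once handled, so the
            # tree held in memory stays much smaller than the full file.
            elem.clear()


sample = io.BytesIO(
    b"<delivery>"
    b"<track><title>Purple Rain</title></track>"
    b"<track><title>Kiss</title></track>"
    b"</delivery>"
)
titles = list(iter_track_titles(sample))
```

Because the function is a generator, the caller can process each track as it arrives instead of waiting for the full parse.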
  • 40-42. Content pipeline: Label A, Label B, Label C, Label D; stages: Ingestion, Merge. Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
  • 43. Centralized vs. aggregated cataloging: Requires humans! / Requires merging!
  • 44. Metadata - challenges. Image: Nicolas Genin (CC BY 2.0) http://www.flickr.com/photos/22785954@N08
  • 45-48. Content pipeline: Label A, Label B, Label C, Label D; stages: Ingestion, Merge, Curation/enrichment. Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
  • 49-54. Ambiguous artists - thesis work • User input • Machine learning • Matching against external sources • Feature selection (#matches per external source, len(name), country-count, multilingual) • Matchings + preprocessing in Python
  • 55-58. Content matching: (16 * 10 ** 6) ** 2 = A large number. Reduce search space:
      >>> from unicodedata import normalize
      >>> # Key: first five characters of the title, accents stripped, lowercased
      >>> key = ''.join(normalize('NFD', char)[0].lower() for char in title)[:5]
    Side note: Levenshtein (edit) distance is a heavy operation -> sped up about 4x with PyPy (or use a C extension)
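The idea of reducing the search space can be sketched as follows, under the assumption that the key is a short prefix of the accent-stripped, lowercased title. Titles are grouped by key, and the expensive edit distance is computed only within a group. The helper names `blocking_key`, `levenshtein` and `candidate_pairs`, the prefix length and the distance threshold are all illustrative choices, not taken from the talk.

```python
from collections import defaultdict
from unicodedata import normalize


def blocking_key(title, length=5):
    """First `length` characters of the title, lowercased, accents stripped."""
    return "".join(normalize("NFD", ch)[0].lower() for ch in title)[:length]


def levenshtein(a, b):
    """Edit distance, computed with two rows of the dynamic-programming table."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def candidate_pairs(titles, max_distance=2):
    """Compare only titles that share a blocking key."""
    buckets = defaultdict(list)
    for title in titles:
        buckets[blocking_key(title)].append(title)
    pairs = []
    for bucket in buckets.values():
        for i, left in enumerate(bucket):
            for right in bucket[i + 1:]:
                if levenshtein(left.lower(), right.lower()) <= max_distance:
                    pairs.append((left, right))
    return pairs


titles = ["Déjà Vu", "Deja Vu", "Deja Vu!", "Purple Rain"]
pairs = candidate_pairs(titles)
```

The trade-off is that two titles differing within the first few characters land in different buckets and are never compared, so the key length balances recall against the number of comparisons.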
  • 59-60. Automatic data processing will never be perfect. Patch it!
  • 61-65. Content pipeline: Label A, Label B, Label C, Label D; stages: Ingestion, Merge, Curation/enrichment, Transcoding. Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
  • 66. Transcoding: Asynchronous. RabbitMQ + amqplib. Master / workers
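The master/workers pattern can be illustrated without a broker. The sketch below swaps RabbitMQ and amqplib for the standard library's `queue.Queue` and threads, so it shows only the shape of the pattern, not the actual transcoding service. The function `transcode`, the track ids and the format names are hypothetical.

```python
import queue
import threading


def transcode(track_id, fmt):
    """Placeholder for the real transcoding step (hypothetical)."""
    return "%s.%s" % (track_id, fmt)


def worker(jobs, results):
    """Take jobs from the queue until the sentinel None arrives."""
    while True:
        job = jobs.get()
        if job is None:
            break
        track_id, fmt = job
        results.put(transcode(track_id, fmt))


def run_master(track_ids, formats, n_workers=3):
    """Publish one job per (track, format) pair and collect the results."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    n_jobs = 0
    for track_id in track_ids:
        for fmt in formats:
            jobs.put((track_id, fmt))
            n_jobs += 1
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return sorted(results.get() for _ in range(n_jobs))


outputs = run_master(["track1", "track2"], ["ogg", "mp3"])
```

With a message broker in place of the in-process queue, the master and the workers can run on separate machines and the number of workers can change independently of the master.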
  • 67-72. Content pipeline: Label A, Label B, Label C, Label D; stages: Ingestion, Merge, Curation/enrichment, Transcoding, Indexing. Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
  • 73-76. Index build • Nightly batch job on db-dumps • Previously mostly Python, but now moved to Java for performance reasons • But still lots of Python helper scripts :)
  • 77-83. Content pipeline: Label A, Label B, Label C, Label D; stages: Ingestion, Merge, Curation/enrichment, Transcoding, Indexing, Publishing; output: On site live services, e.g. search, browse. Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
  • 84-89. Distribution/publish: Index A, Index B, Index C; Service A, Service B, Service C
  • 90. Scheduling being migrated to ZooKeeper. Image: http://www.flickr.com/photos/seattlemunicipalarchives/with/3797940791/
  • 91. Distribution/publish: Staged rollout
  • 92-97. Distribution/publish: Exponential back-off: waiting 5s ... waiting 10s ... waiting 30s ... waiting 60s ...
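A retry loop with growing waits might look like the sketch below. The delay schedule (5, 10, 30, 60 seconds) is the one shown on the slide. The function `call_with_backoff`, the `flaky_publish` action and the injectable `sleep` parameter are illustrative assumptions, the last one added so the example can run without actually waiting.

```python
import time


def call_with_backoff(action, delays=(5, 10, 30, 60), sleep=time.sleep):
    """Call `action` until it succeeds, waiting longer after each failure.

    Returns the action's result. Once the schedule is exhausted, one
    last attempt is made and any error propagates to the caller.
    """
    for delay in delays:
        try:
            return action()
        except Exception:
            print("waiting %ds ..." % delay)
            sleep(delay)
    return action()


# Demonstration with a hypothetical action that fails twice, then succeeds.
attempts = []
waits = []


def flaky_publish():
    attempts.append(1)
    if len(attempts) < 3:
        raise IOError("service not ready")
    return "published"


result = call_with_backoff(flaky_publish, sleep=waits.append)
```

In the demonstration, `waits.append` stands in for `time.sleep`, so the waits are recorded rather than slept through.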
  • 98. Content pipeline: Label A, Label B, Label C, Label D; stages: Ingestion, Merge, Curation/enrichment, Transcoding, Indexing, Publishing; output: On site live services, e.g. search, browse. Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
  • 99. Store ’da data
  • 100-106. Choice of database: Depends on the use case - duh! • PostgreSQL (e.g. user service) • Cassandra (e.g. playlist service) • Tokyo Cabinet (e.g. browse service) • Lucene (search service) • HDFS
  • 107. PostgreSQL [Pic. of elephant] Image: http2007 (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/
  • 108. PostgreSQL: Redundancy + scaling: master/slave
  • 109. PostgreSQL: Joins and subqueries - let the query planner roll!
  • 110-112. PostgreSQL: Python? - psycopg2 + SQL queries - SQLAlchemy migrator for versioning of db-schemas. Tip! Server side, aka named, cursors:
      import psycopg2
      conn = psycopg2.connect(database='huge_db', user='postgres', password='secret')
      sscursor = conn.cursor('my_cursor')  # passing a name makes the cursor server side
      sscursor.execute('SELECT * FROM big_table')
      rows = sscursor.fetchmany(1000)  # fetch the next 1000 rows only
      ...
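The `fetchmany` call on the slide returns one batch; reading a whole table means repeating it until no rows are left. The sketch below wraps that loop in a generator. It uses the standard library's `sqlite3` purely as a stand-in so it runs without a PostgreSQL server. `sqlite3` has no server-side cursors, so only the batching loop carries over. The helper `iter_rows` and the table contents are hypothetical.

```python
import sqlite3


def iter_rows(cursor, batch_size=1000):
    """Yield rows from an executed DB-API cursor, `batch_size` at a time."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for row in rows:
            yield row


# Stand-in data set: an in-memory table with 2500 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO big_table (title) VALUES (?)",
                 [("track %d" % i,) for i in range(2500)])
cursor = conn.cursor()
cursor.execute("SELECT id, title FROM big_table ORDER BY id")
total = sum(1 for _ in iter_rows(cursor))
```

With psycopg2, the same helper could be handed a named cursor such as the `sscursor` from the slide, so that each batch is pulled from the server only when needed.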
  • 113. Scaling the content pipeline: What to scale for?
  • 114. Scaling the content pipeline: Size of catalog
  • 115. Scaling the content pipeline: # Users
  • 116. Thank you. henok@spotify.com
  • 117. Distribution/publish: Popen + gevent (although IO-bound)
      import subprocess
      import gevent
      import gevent.monkey
      gevent.monkey.patch_all()

      def _wait(self):
          # Poll the child and yield to other greenlets instead of blocking.
          while True:
              res = self.poll()
              if res is not None:
                  return res
              gevent.sleep(0.1)

      subprocess.Popen.wait = _wait