Python <3 Content systems

Published in: Technology

Transcript of "Python <3 Content systems"

  1. Python <3 Content systems - managing millions of tracks for the masses. 22nd October 2012 (slide footer on every page: Tuesday, October 23, 12)
  14. > 15 M active users*; > Available in 15 countries; > 18 M tracks; > 20 k new tracks added per day; > 1 century of listening; > 500 M playlists (* Users active within the previous 30 days)
  24. Service overview: Storage, User, AP, Search, Metadata, ...
  26. Content pipeline (diagram): Label A, Label B, Label C, Label D; stage shown: Ingestion. Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
  27. Ingestion: XML. Background image: lord enfield (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/
  32. Ingestion: Delivery formats. ~ 10 different incoming XML formats: proprietary formats (majors) and the Spotify delivery format (mostly indies). Thousands of lines of source-specific code.
  33. Data model [simplified] (diagram): Artist, Album, Disc, Track, Audio, Transcoding, Rights, linked by relationships marked 1 and *.
  34. Ingestion: lxml and XSLT with extensions for parsing/transforming XML
  35. Ingestion: XPath extensions
        >>> def formerlify(_, name):
        ...     return 'The artist formerly known as %s' % name
        >>> # Namespace stuff
        >>> from lxml import etree
        >>> ns = etree.FunctionNamespace('http://my.org/myfunctions')
        >>> ns['hello'] = formerlify
        >>> ns.prefix = 'f'
        >>> root = etree.XML('<a><b>Prince</b></a>')
        >>> print(root.xpath('f:hello(string(b))'))
        The artist formerly known as Prince
      http://lxml.de/extensions.html#xpath-extension-functions
  39. Ingestion. Fun (?!) fact: the largest XML file seen so far had 3.3 million rows, taking up 350 MB of disk space. The Bible apparently fits in 3 MB of XML.
        >>> import timeit
        >>> timeit.timeit('e.parse("huge.xml")', setup='import lxml.etree as e', number=5) / 5
        4.19...
        >>> timeit.timeit('e.parse("huge.xml")', setup='import xml.etree.cElementTree as e', number=5) / 5
        4.78...
        >>> timeit.timeit('e.parse("huge.xml")', setup='import xml.etree.ElementTree as e', number=5) / 5
        55.39...
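The timings above parse the whole file into memory in one go. For files of this size, a streaming parse is a common alternative; the slides do not say whether the pipeline uses one, so the following is only a minimal sketch of the idea with the standard library's `iterparse`. The `track` tag name and the tiny sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def count_tracks(source, tag="track"):
    """Count elements named `tag` while streaming through the document."""
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            elem.clear()  # release the element's text and children once handled
    return count


# Tiny hypothetical delivery standing in for a multi-hundred-megabyte file.
sample = io.BytesIO(
    b"<delivery>"
    b"<track><title>One</title></track>"
    b"<track><title>Two</title></track>"
    b"</delivery>"
)
print(count_tracks(sample))
```

`lxml.etree.iterparse` offers a similar interface, so the same pattern should carry over to lxml.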
  42. Content pipeline (diagram): Label A, Label B, Label C, Label D; stages shown: Ingestion, Merge. Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
  43. Centralized (requires humans!) vs. aggregated (requires merging!) cataloging
  44. Metadata - challenges. Image: Nicolas Genin (CC BY 2.0) http://www.flickr.com/photos/22785954@N08
  48. Content pipeline (diagram): Label A, Label B, Label C, Label D; stages shown: Ingestion, Merge, Curation/enrichment. Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
  54. Ambiguous artists - thesis work: user input; machine learning; matching against external sources; feature selection (#matches per external source, len(name), country-count, multilingual); matchings + preprocessing in Python
  58. Content matching: (16 * 10 ** 6) ** 2 = a large number. Reduce search space:
        >>> from unicodedata import normalize
        >>> key = ''.join(normalize('NFD', char)[0].lower() for char in title)[:5]
      Side note: Levenshtein (edit) distance is a heavy operation -> sped up about 4x with PyPy (or use a C extension)
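Comparing every track with every other would mean (16 * 10 ** 6) ** 2 = 2.56 * 10 ** 14 comparisons. The sketch below shows one way the normalized key could be used to cut that down: titles are bucketed by key, and edit distance is computed only within a bucket. The bucketing step, the `max_distance` threshold, and the sample titles are assumptions for illustration, not details from the talk.

```python
from collections import defaultdict
from itertools import combinations
from unicodedata import normalize


def blocking_key(title, length=5):
    """Accent-stripped, lowercased prefix used to bucket similar titles."""
    return "".join(normalize("NFD", ch)[0].lower() for ch in title)[:length]


def levenshtein(a, b):
    """Dynamic-programming edit distance, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def candidate_pairs(titles, max_distance=2):
    """Yield close title pairs, comparing only titles that share a key."""
    buckets = defaultdict(list)
    for title in titles:
        buckets[blocking_key(title)].append(title)
    for bucket in buckets.values():
        for a, b in combinations(bucket, 2):
            if levenshtein(a.lower(), b.lower()) <= max_distance:
                yield a, b


# "B\u00e9atrice" is "Béatrice" written with a precomposed e-acute.
titles = ["B\u00e9atrice", "Beatrice", "Purple Rain", "Purple Rains", "Kiss"]
for pair in candidate_pairs(titles):
    print(pair)
```

A prefix key trades recall for speed: two titles that differ within their first five characters land in different buckets and are never compared.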
  60. Automatic data processing will never be perfect. Patch it!
  65. Content pipeline (diagram): Label A, Label B, Label C, Label D; stages shown: Ingestion, Merge, Curation/enrichment, Transcoding. Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
  66. Transcoding: asynchronous; RabbitMQ + amqplib; master / workers
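The master/worker pattern named on this slide can be sketched with the standard library alone. This is not the RabbitMQ + amqplib setup from the talk: `queue.Queue` and threads stand in for the message broker and the worker processes so that the sketch runs anywhere, and `transcode` is a hypothetical placeholder for the real work.

```python
import queue
import threading


def transcode(track_id):
    """Placeholder for the real transcoding work (hypothetical)."""
    return "transcoded:%s" % track_id


def worker(jobs, results):
    """Consume jobs until the master sends the None stop signal."""
    while True:
        track_id = jobs.get()
        if track_id is None:
            break
        results.put(transcode(track_id))


def run_master(track_ids, worker_count=3):
    """Start workers, hand out jobs, and collect the results."""
    track_ids = list(track_ids)
    jobs, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for track_id in track_ids:
        jobs.put(track_id)
    for _ in threads:
        jobs.put(None)  # one stop signal per worker
    for thread in threads:
        thread.join()
    return sorted(results.get() for _ in track_ids)


print(run_master(["track-1", "track-2", "track-3", "track-4"]))
```

With a real broker the master only publishes messages and never waits on the workers, which is what makes the step asynchronous.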
  72. Content pipeline (diagram): Label A, Label B, Label C, Label D; stages shown: Ingestion, Merge, Curation/enrichment, Transcoding, Indexing. Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
  76. Index build: nightly batch job on db-dumps; previously mostly Python, but now moved to Java for performance reasons; but still lots of Python helper scripts :)
  83. Content pipeline (diagram): Label A, Label B, Label C, Label D; stages shown: Ingestion, Merge, Curation/enrichment, Transcoding, Indexing, Publish; On site live services, e.g. search, browse. Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
  89. Distribution/publish (diagram): Index A, Index B, Index C; Service A, Service B, Service C
  90. Scheduling being migrated to ZooKeeper. Image: http://www.flickr.com/photos/seattlemunicipalarchives/with/3797940791/
  91. Distribution/publish: staged rollout
  97. Distribution/publish: exponential back-off: waiting 5s ... waiting 10s ... waiting 30s ... waiting 60s ...
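A retry loop with the waits shown on the slide (5s, 10s, 30s, 60s) might look like the sketch below. The function names, the broad `except Exception`, and the injectable `sleep` parameter are assumptions made for illustration; production code would normally catch only the errors it expects.

```python
import time


def call_with_backoff(operation, delays=(5, 10, 30, 60), sleep=time.sleep):
    """Call operation(); after each failure wait for the next delay and retry."""
    for delay in delays:
        try:
            return operation()
        except Exception as error:
            print("failed (%s), waiting %ds ..." % (error, delay))
            sleep(delay)
    return operation()  # final attempt: let any exception propagate


# Hypothetical publish step that succeeds on its third attempt.
attempts = []


def flaky_publish():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("index not accepted yet")
    return "published"


waits = []
# waits.append replaces time.sleep, so the example finishes instantly.
print(call_with_backoff(flaky_publish, sleep=waits.append))
print(waits)
```

Growing the wait between attempts gives a struggling service time to recover instead of hitting it with immediate retries.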
  98. Content pipeline (diagram): Label A, Label B, Label C, Label D; stages shown: Ingestion, Merge, Curation/enrichment, Transcoding, Indexing, Publish; On site live services, e.g. search, browse. Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
  99. Store ’da data
  106. Choice of database: depends on the use case - duh! PostgreSQL (e.g. user service); Cassandra (e.g. playlist service); Tokyo Cabinet (e.g. browse service); Lucene (search service); HDFS
  107. PostgreSQL [Pic. of elephant] Image: http2007 (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/
  108. PostgreSQL. Redundancy + scaling: master/slave
  109. PostgreSQL. Joins and subqueries - let the query planner roll!
  112. PostgreSQL. Python? psycopg2 + SQL queries; SQLAlchemy migrator for versioning of db-schemas. Tip! Server side, aka named, cursors:
        import psycopg2
        conn = psycopg2.connect(database='huge_db', user='postgres', password='secret')
        sscursor = conn.cursor('my_cursor')  # passing a name makes the cursor server side
        sscursor.execute('SELECT * FROM big_table')
        rows = sscursor.fetchmany(1000)
        ...
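The `fetchmany` call on the slide returns one batch; consuming a whole result set means looping until a batch comes back empty. The sketch below wraps that loop in a generator. It uses `sqlite3` purely as a stand-in so that it runs without a PostgreSQL server; the `iter_rows` helper and the sample table are hypothetical. With psycopg2, the same helper would be given a named cursor such as `conn.cursor('my_cursor')`.

```python
import sqlite3


def iter_rows(cursor, batch_size=1000):
    """Yield rows one at a time while fetching them from the cursor in batches."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for row in rows:
            yield row


# sqlite3 stands in for PostgreSQL here so the sketch is runnable anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER)")
conn.executemany("INSERT INTO big_table VALUES (?)", [(i,) for i in range(2500)])

cursor = conn.cursor()
cursor.execute("SELECT id FROM big_table ORDER BY id")
total = sum(1 for _row in iter_rows(cursor))
print(total)
```

Only one batch of rows is held by the client at a time, which is the point of combining a server-side cursor with `fetchmany`.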
  113. Scaling the content pipeline: what to scale for?
  114. Scaling the content pipeline: size of catalog
  115. Scaling the content pipeline: # users
  116. Thank you. henok@spotify.com
  117. Distribution/publish: Popen + gevent (although IO-bound)
        import subprocess
        import gevent
        import gevent.monkey
        gevent.monkey.patch_all()

        def _wait(self):
            # Poll the child process, yielding to other greenlets between polls
            while True:
                res = self.poll()
                if res is not None:
                    return res
                gevent.sleep(0.1)

        subprocess.Popen.wait = _wait