Python <3 Content systems
                          - managing millions of tracks for the masses




                                                                         22nd October 2012

Tuesday, October 23, 12
> 15 M active users*
> Available in 15 countries
> 18 M tracks
> 20 k new tracks added per day
> 1 century of listening
> 500 M playlists
                          * Users active within the previous 30 days
Service overview

                          Storage
                          User
                          Search          AP
                          Metadata
                          ...

Content pipeline

    Label A, Label B, Label C, Label D -> Ingestion

                          Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/

Ingestion

                          XML  XML  XML

                          Background image: lord enfield (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/

Ingestion: Delivery formats


             ~ 10 different incoming XML formats
                     - Proprietary formats (majors)
                     - Spotify delivery format (mostly indies)
             Thousands of lines of source-specific code

Data model [simplified]

    [Entity-relationship diagram. Entities: Artist, Album, Disc, Track, Rights, Audio, Transcoding.
     Edges are labelled with 1 and * cardinalities.]

Ingestion

                          lxml and XSLT with extensions for
                          parsing/transforming XML

Ingestion: XPath extensions
     >>> def formerlify(_, name):
     ...     return 'The artist formerly known as %s' % name

     >>> # Register the function in a custom XPath function namespace
     >>> from lxml import etree
     >>> ns = etree.FunctionNamespace('http://my.org/myfunctions')
     >>> ns['formerlify'] = formerlify
     >>> ns.prefix = 'f'

     >>> root = etree.XML('<a><b>Prince</b></a>')
     >>> print(root.xpath('f:formerlify(string(b))'))
     The artist formerly known as Prince

                          http://lxml.de/extensions.html#xpath-extension-functions

Ingestion
          Fun (?!) fact: the largest XML file seen so far had 3.3 million rows, taking up
          350 MB of disk space

          The Bible apparently fits in 3 MB of XML

                   >>> import timeit
                   >>> # Average seconds per parse over 5 runs, lxml
                   >>> timeit.timeit('e.parse("huge.xml")',
                   ...               setup='import lxml.etree as e',
                   ...               number=5) / 5
                   4.19...

                   >>> # Same measurement, cElementTree
                   >>> timeit.timeit('e.parse("huge.xml")',
                   ...               setup='import xml.etree.cElementTree as e',
                   ...               number=5) / 5
                   4.78...

                   >>> # Same measurement, pure-Python ElementTree
                   >>> timeit.timeit('e.parse("huge.xml")',
                   ...               setup='import xml.etree.ElementTree as e',
                   ...               number=5) / 5
                   55.39...

Content pipeline

    Label A, Label B, Label C, Label D -> Ingestion -> Merge

Centralized vs. aggregated cataloging

          Requires humans!                   Requires merging!

Metadata - challenges




                          Image: Nicolas Genin (CC BY 2.0) http://www.flickr.com/photos/22785954@N08
Content pipeline

    Label A, Label B, Label C, Label D -> Ingestion -> Merge
                                                        Curation/enrichment

Ambiguous artists - thesis work

    • User input
    • Machine learning
    • Matching against external sources
    • Feature selection (#matches per external
      source, len(name), country-count,
      multilingual)
    • Matchings + preprocessing in Python
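
The feature list above could be turned into a feature vector roughly as follows. This is a minimal sketch, not the thesis code: the `ArtistCandidate` record, its field names, the source names, and the script-counting heuristic behind the "multilingual" flag are all assumptions made for illustration.

```python
import unicodedata
from dataclasses import dataclass, field


@dataclass
class ArtistCandidate:
    """Hypothetical record for one artist name in the catalog."""
    name: str
    # External source name -> number of matches found there (assumed input).
    external_matches: dict = field(default_factory=dict)
    # Countries the artist's releases come from (assumed input).
    countries: set = field(default_factory=set)
    # Track/album titles, used only to guess at a multilingual catalog.
    titles: list = field(default_factory=list)


def scripts_used(text):
    """Return the set of Unicode scripts (roughly) used by letters in text."""
    scripts = set()
    for char in text:
        if char.isalpha():
            # Character names start with the script, e.g. 'LATIN', 'CYRILLIC'.
            scripts.add(unicodedata.name(char, 'UNKNOWN').split()[0])
    return scripts


def features(candidate, sources):
    """Build a flat feature vector for an ambiguity classifier."""
    all_scripts = set()
    for title in candidate.titles:
        all_scripts |= scripts_used(title)
    vector = [candidate.external_matches.get(source, 0) for source in sources]
    vector.append(len(candidate.name))
    vector.append(len(candidate.countries))
    vector.append(1 if len(all_scripts) > 1 else 0)  # crude "multilingual" flag
    return vector


# Made-up example values, not data from the talk.
candidate = ArtistCandidate(
    name='Nirvana',
    external_matches={'source_a': 3, 'source_b': 2},
    countries={'US', 'GB', 'SE'},
    titles=['Nevermind', 'Нирвана'],
)
vector = features(candidate, sources=['source_a', 'source_b', 'source_c'])
print(vector)
```

A vector like this would then be fed to whichever classifier the machine-learning step uses; the slides do not say which one.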


Content matching

                          (16 * 10 ** 6) ** 2 = A large number (about 2.56 * 10 ** 14 pairs)

 Reduce search space:
 >>> from unicodedata import normalize
 >>> key = ''.join(normalize('NFD', char)[0].lower() for char in title)[:5]

                                   Side note: Levenshtein (edit) distance is a heavy operation

                                   -> sped up about 4x with PyPy (or use a C extension)
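
The search-space reduction above amounts to what is often called blocking: titles are only compared when they share a key. The sketch below assumes the slide's key is meant as a short, accent-stripped, lowercased prefix of the title; the function names, the prefix length, and the distance threshold are illustrative assumptions, and the edit distance is a plain pure-Python version rather than the C extension mentioned above.

```python
from collections import defaultdict
from itertools import combinations
from unicodedata import normalize


def blocking_key(title, length=5):
    """Accent-stripped, lowercased prefix used to group similar titles."""
    return ''.join(normalize('NFD', char)[0].lower() for char in title)[:length]


def levenshtein(a, b):
    """Dynamic-programming edit distance, O(len(a) * len(b)) time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def candidate_pairs(titles, max_distance=2):
    """Yield near-duplicate pairs, comparing only titles in the same block."""
    blocks = defaultdict(list)
    for title in titles:
        blocks[blocking_key(title)].append(title)
    for block in blocks.values():
        for left, right in combinations(block, 2):
            if levenshtein(left.lower(), right.lower()) <= max_distance:
                yield left, right


titles = ['Purple Rain', 'Purple Rainn', 'Purple Haze',
          'Déjà Vu', 'Deja Vu', 'Hey Jude']
pairs = list(candidate_pairs(titles))
print(pairs)
```

With six titles the all-pairs approach needs 15 comparisons, while the blocks here need only 4. The trade-off is that two titles differing inside the prefix land in different blocks and are never compared.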



Automatic data processing will never be perfect

                      Patch it!

Content pipeline

    Label A, Label B, Label C, Label D -> Ingestion -> Merge
                                                        Curation/enrichment
                                                        Transcoding

Transcoding



                          Asynchronous

                            RabbitMQ + amqplib

                Master / workers
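
The master/workers pattern named above can be sketched in-process. The slides use RabbitMQ through amqplib; this sketch swaps in the standard library's `queue.Queue` and threads as a stand-in for the broker so that it runs without one. The `transcode` function, the worker count, and the format names are hypothetical placeholders.

```python
import queue
import threading

NUM_WORKERS = 4
STOP = object()  # sentinel telling a worker to shut down


def transcode(track_id, fmt):
    """Hypothetical placeholder for the real (slow) transcoding step."""
    return '%s.%s' % (track_id, fmt)


def worker(jobs, results):
    """Pull jobs until the sentinel arrives and push finished file names."""
    while True:
        job = jobs.get()
        if job is STOP:
            break
        track_id, fmt = job
        results.put(transcode(track_id, fmt))


def master(track_ids, formats):
    """Publish one job per (track, format) pair and collect the results."""
    jobs, results = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(NUM_WORKERS)]
    for thread in workers:
        thread.start()
    count = 0
    for track_id in track_ids:
        for fmt in formats:
            jobs.put((track_id, fmt))
            count += 1
    for _ in workers:
        jobs.put(STOP)  # one sentinel per worker
    for thread in workers:
        thread.join()
    return sorted(results.get() for _ in range(count))


outputs = master(['track1', 'track2'], ['ogg', 'mp3'])
print(outputs)
```

With a real broker the master would not wait for completion in the same process: jobs would be published to a queue and workers on other machines would acknowledge them as they finish, which is what makes the step asynchronous.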


Content pipeline

    Label A, Label B, Label C, Label D -> Ingestion -> Merge -> Indexing
                                                        Curation/enrichment
                                                        Transcoding

Index build

     • Nightly batch job on db-dumps
     • Previously mostly Python, but now moved to Java for
       performance reasons
     • But still lots of Python helper scripts :)

Content pipeline

    Label A, Label B, Label C, Label D -> Ingestion -> Merge -> Indexing -> Publishing -> On site live services, e.g. search, browse
                                                        Curation/enrichment
                                                        Transcoding

Distribution/publish

    Indexes:   Index A, Index B, Index C
    Services:  Service A, Service B, Service C

Scheduling being migrated to ZooKeeper




                          Image: http://www.flickr.com/photos/seattlemunicipalarchives/with/3797940791/

Distribution/publish




                             Staged rollout



Distribution/publish

                             Exponential back-off
                             waiting 5s ...
                             waiting 10s ...
                             waiting 30s ...
                             waiting 60s ...
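
A retry loop with exponential back-off could look like the sketch below. The waits shown above (5s, 10s, 30s, 60s) do not follow a strict doubling, so the schedule used here (doubling from 5s, capped at 60s) is an assumption, as are the function names and the `flaky_publish` stand-in. The `sleep` argument is injectable so the example runs instantly.

```python
import time


def backoff_delays(base=5, factor=2, cap=60):
    """Yield 5, 10, 20, 40, 60, 60, ... seconds (assumed schedule)."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


def retry(operation, max_attempts=5, sleep=time.sleep):
    """Call operation until it succeeds, waiting longer after each failure."""
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            delay = next(delays)
            print('waiting %ds ...' % delay)
            sleep(delay)


# Usage: a hypothetical publish step that fails twice before succeeding.
attempts = []
waits = []


def flaky_publish():
    attempts.append(1)
    if len(attempts) < 3:
        raise IOError('service not ready')
    return 'published'


result = retry(flaky_publish, sleep=waits.append)
print(result, waits)
```

Capping the delay keeps a long outage from pushing the next retry arbitrarily far into the future.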




Store 'da data

Choice of database

                    Depends on the use case - duh!
                    •     PostgreSQL (e.g. user service)
                    •     Cassandra (e.g. playlist service)
                    •     Tokyo Cabinet (e.g. browse service)
                    •     Lucene (search service)
                    •     HDFS




PostgreSQL




                                                          [Pic. of elephant]




                          Image: http2007 (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/
PostgreSQL




                          Redundancy + scaling:
                          master/slave



PostgreSQL




                          Joins and subqueries -
                          let the query planner roll!


PostgreSQL


          Python?
                          - psycopg2 + SQL queries
                          - SQLAlchemy migrator for
                          versioning of db-schemas

          Tip!
             Server-side, aka named, cursors:
             import psycopg2

             conn = psycopg2.connect(database='huge_db', user='postgres',
                                     password='secret')
             # Passing a name makes this a server-side cursor: the result set
             # stays on the server and is fetched in batches on demand.
             sscursor = conn.cursor('my_cursor')
             sscursor.execute('SELECT * FROM big_table')
             rows = sscursor.fetchmany(1000)
             ...
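
The `fetchmany(1000)` call above returns only the first batch, so in practice it would sit inside a loop. A minimal sketch of such a loop follows. It assumes nothing beyond the standard DB-API cursor interface; the helper name `iter_rows` is hypothetical, and an in-memory sqlite3 database stands in for PostgreSQL purely so the example runs without a server.

```python
import sqlite3


def iter_rows(cursor, batch_size=1000):
    """Yield rows one at a time, fetching them from the cursor in batches."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            # An empty batch means the result set is exhausted.
            return
        for row in rows:
            yield row


# Stand-in database (assumption): sqlite3 in memory instead of PostgreSQL.
# With psycopg2, `cursor` would be the named cursor from the slide.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE big_table (id INTEGER)')
conn.executemany('INSERT INTO big_table VALUES (?)',
                 [(i,) for i in range(2500)])

cursor = conn.cursor()
cursor.execute('SELECT id FROM big_table ORDER BY id')
total = sum(1 for _ in iter_rows(cursor, batch_size=1000))
print(total)
```

Only one batch is held in client memory at a time, which is the point of combining a named cursor with `fetchmany` on a huge table.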


Scaling the content pipeline




                           What to scale for?
                           •     Size of catalog
                           •     # Users

Thank you
                          henok@spotify.com




Distribution/publish




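                          Popen + gevent (although IO-bound)
                          import subprocess

                          import gevent
                          import gevent.monkey

                          gevent.monkey.patch_all()

                          def _wait(self):
                              # Poll instead of blocking, so other greenlets
                              # can run while the child process is alive.
                              while True:
                                  res = self.poll()
                                  if res is not None:
                                      return res
                                  gevent.sleep(0.1)

                          subprocess.Popen.wait = _wait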
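
The patch above swaps a blocking wait for a poll loop. The same idea can be sketched without gevent by injecting the sleep function. The helper name `wait_by_polling` and its `timeout` parameter are hypothetical additions for illustration, not part of the talk; under gevent one would pass `gevent.sleep` as `sleep`.

```python
import subprocess
import sys
import time


def wait_by_polling(proc, sleep=time.sleep, interval=0.1, timeout=None):
    """Wait for `proc` to exit by polling, calling `sleep` between polls.

    Returns the exit code, or raises TimeoutError after `timeout` seconds.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        res = proc.poll()
        if res is not None:
            return res
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError('process still running after %ss' % timeout)
        # Yield control; with gevent.sleep this lets other greenlets run.
        sleep(interval)


# Usage: start a short-lived child process and wait for it cooperatively.
proc = subprocess.Popen([sys.executable, '-c', 'import sys; sys.exit(3)'])
exit_code = wait_by_polling(proc, interval=0.05, timeout=30)
print(exit_code)
```

Passing the sleep function in, rather than monkey-patching `Popen.wait`, keeps the loop usable with or without gevent and makes it easy to exercise in isolation.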


Python &lt;3 Content systems

  • 1. Python <3 Content systems - managing millions of tracks for the masses 22nd October 2012 Tuesday, October 23, 12
  • 9. > 15 M active users* * Users active within the previous 30 days Tuesday, October 23, 12
  • 10. > Available in 15 Countries > 15 M active users* * Users active within the previous 30 days Tuesday, October 23, 12
  • 11. > 18 M tracks > Available in 15 Countries > 15 M active users* * Users active within the previous 30 days Tuesday, October 23, 12
  • 12. > 20 k new tracks added per day > 18 M tracks > Available in 15 Countries > 15 M active users* * Users active within the previous 30 days Tuesday, October 23, 12
  • 13. > 1 century of listening > 20 k new tracks added per day > 18 M tracks > Available in 15 Countries > 15 M active users* * Users active within the previous 30 days Tuesday, October 23, 12
  • 14. > 500 M playlists > 1 century of listening > 20 k new tracks added per day > 18 M tracks > Available in 15 Countries > 15 M active users* * Users active within the previous 30 days Tuesday, October 23, 12
  • 16. Service overview Storage Tuesday, October 23, 12
  • 17. Service overview Storage User Tuesday, October 23, 12
  • 18. Service overview Storage User Search Tuesday, October 23, 12
  • 19. Service overview Storage User Search Metadata Tuesday, October 23, 12
  • 20. Service overview Storage User Search Metadata . . . Tuesday, October 23, 12
  • 21. Service overview Storage User AP Search Metadata . . . Tuesday, October 23, 12
  • 22. Service overview Storage User AP Search Metadata . . . Tuesday, October 23, 12
  • 23. Service overview Storage User AP Search Metadata . . . Tuesday, October 23, 12
  • 24. Service overview Storage User AP Search Metadata . . . Tuesday, October 23, 12
  • 25. Content pipeline Label A Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.ļ¬‚ickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 26. Content pipeline ti on e s Label A n g I Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.ļ¬‚ickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 27. Ingestion XM L L M M LX MX X L Background image: lord enfield (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/ Tuesday, October 23, 12
  • 29. Ingestion: Delivery formats ~ 10 different incoming XML formats Tuesday, October 23, 12
  • 30. Ingestion: Delivery formats ~ 10 different incoming XML formats - Proprietary formats (majors) Tuesday, October 23, 12
  • 31. Ingestion: Delivery formats ~ 10 different incoming XML formats - Proprietary formats (majors) - Spotify delivery format (mostly indies) Tuesday, October 23, 12
  • 32. Ingestion: Delivery formats ~ 10 different incoming XML formats - Proprietary formats (majors) - Spotify delivery format (mostly indies) Thousands of lines of source speciļ¬c code Tuesday, October 23, 12
  • 33. Data model [simpliļ¬ed] 1 Artist Transcoding * * * Album 1 1 * Disc 1 1 Audio * 1 * Track * Rights * Tuesday, October 23, 12
  • 34. Ingestion LXML and XSLT with extensions for parsing/transforming XML Tuesday, October 23, 12
  • 35. Ingestion: XPath extensions >>> def formerlify(_, name): ... return 'The artist formerly known as %s' %name >>> #Namespace stuff >>> from lxml import etree >>> ns = etree.FunctionNamespace('http://my.org/myfunctions') >>> ns['hello'] = hello >>> ns.prefix = 'f' >>> root = etree.XML('<a><b>Prince</b></a>') >>> print(root.xpath('f:hello(string(b))')) ... The artist formerly known as Prince http://lxml.de/extensions.html#xpath-extension-functions Tuesday, October 23, 12
  • 37. Ingestion Fun (?!) fact: largest XML ļ¬le seen so far had 3.3 million rows taking up 350 MB of disk space Tuesday, October 23, 12
  • 38. Ingestion Fun (?!) fact: largest XML ļ¬le seen so far had 3.3 million rows taking up 350 MB of disk space Bible apparently ļ¬ts in 3MB XML Tuesday, October 23, 12
  • 39. Ingestion Fun (?!) fact: largest XML ļ¬le seen so far had 3.3 million rows taking up 350 MB of disk space Bible apparently ļ¬ts in 3MB XML >>> timeit.timeit('e.parse("huge.xml")', setup='import lxml.etree as e', number=5) / 5 4.19... >>> timeit.timeit('e.parse("huge.xml")', setup='import xml.etree.cElementTree as e', number=5) / 5 4.78... >>> timeit.timeit('e.parse("huge.xml")', setup='import xml.etree.ElementTree as e', number=5) / 5 55.39... Tuesday, October 23, 12
  • 40. Content pipeline Label A Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.ļ¬‚ickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 41. Content pipeline ti on e s Label A n g I Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.ļ¬‚ickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 42. Content pipeline ti on g e e s e r Label A n g M I Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.ļ¬‚ickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 43. Centralized vs. aggregated cataloging Requ Requ ires h ires m uman ergin s! g! Tuesday, October 23, 12
  • 44. Metadata - challenges Image: Nicolas Genin (CC BY 2.0) http://www.flickr.com/photos/22785954@N08 Tuesday, October 23, 12
  • 45. Content pipeline Label A Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.ļ¬‚ickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 46. Content pipeline ti on e s Label A n g I Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.ļ¬‚ickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 47. Content pipeline ti on g e e s e r Label A n g M I Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.ļ¬‚ickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 48. Content pipeline ti on g e e s e r Label A n g M I Label B Label C Label D Curation/enrichment Image: Steve Juvertson (CC BY 2.0) http://www.ļ¬‚ickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 49. Ambiguous artists - thesis work Tuesday, October 23, 12
  • 50. Ambiguous artists - thesis work ā€¢ User input Tuesday, October 23, 12
  • 51. Ambiguous artists - thesis work ā€¢ User input ā€¢ Machine learning Tuesday, October 23, 12
  • 52. Ambiguous artists - thesis work ā€¢ User input ā€¢ Machine learning ā€¢ Matching against external sources Tuesday, October 23, 12
  • 53. Ambiguous artists - thesis work ā€¢ User input ā€¢ Machine learning ā€¢ Matching against external sources ā€¢ Feature selection (#matches per external source, len(name), country-count, multilingual) Tuesday, October 23, 12
  • 54. Ambiguous artists - thesis work ā€¢ User input ā€¢ Machine learning ā€¢ Matching against external sources ā€¢ Feature selection (#matches per external source, len(name), country-count, multilingual) ā€¢ Matchings + preprocessing in Python Tuesday, October 23, 12
  • 55. Content matching (16 * 10 ** 6) ** 2 Tuesday, October 23, 12
  • 56. Content matching (16 * 10 ** 6) ** 2 = A large number Tuesday, October 23, 12
  • 57. Content matching (16 * 10 ** 6) ** 2 = A large number Reduce search space: >>> from unicodedata import normalize >>> key = ''.join(normalize('NFD', char)[0].lower() for char in title)[5] Tuesday, October 23, 12
  • 58. Content matching (16 * 10 ** 6) ** 2 = A large number Reduce search space: >>> from unicodedata import normalize >>> key = ''.join(normalize('NFD', char)[0].lower() for char in title)[5] Side note: Levenshtein (edit) distance is a heavy operation -> speeded up about 4x with pypy (or use c-extension) Tuesday, October 23, 12
  • 59. Automatic data processing will never be perfect Tuesday, October 23, 12
  • 60. it! h Automatic data processing will never be perfect c a t P Tuesday, October 23, 12
  • 61. Content pipeline Label A Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.ļ¬‚ickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 62. Content pipeline ti on e s Label A n g I Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.ļ¬‚ickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 63. Content pipeline ti on g e e s e r Label A n g M I Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.ļ¬‚ickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 64. Content pipeline ti on g e e s e r Label A n g M I Label B Label C Label D Curation/enrichment Image: Steve Juvertson (CC BY 2.0) http://www.ļ¬‚ickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 65. Content pipeline ti on g e e s e r Label A n g M I Label B Label C Label D Curation/enrichment g in od n sc Tra Image: Steve Juvertson (CC BY 2.0) http://www.ļ¬‚ickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 66. Transcoding Asynchronous RabbitMQ + amqplib Master / workers Tuesday, October 23, 12
  • 67. Content pipeline Label A Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.ļ¬‚ickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 68. Content pipeline ti on e s Label A n g I Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.ļ¬‚ickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 69. Content pipeline ti on g e e s e r Label A n g M I Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.ļ¬‚ickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 70. Content pipeline ti on g e e s e r Label A n g M I Label B Label C Label D Curation/enrichment Image: Steve Juvertson (CC BY 2.0) http://www.ļ¬‚ickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 71. Content pipeline ti on g e e s e r Label A n g M I Label B Label C Label D Curation/enrichment g in od n sc Tra Image: Steve Juvertson (CC BY 2.0) http://www.ļ¬‚ickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 72. Content pipeline ti on e n g s e r g e xi g e d Label A In M In Label B Label C Label D Curation/enrichment g in od n sc Tra Image: Steve Juvertson (CC BY 2.0) http://www.ļ¬‚ickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 74. Index build ā€¢ Nightly batch job on db-dumps Tuesday, October 23, 12
  • 75. Index build ā€¢ Nightly batch job on db-dumps ā€¢ Previously mostly python but now moved to Java for performance reason Tuesday, October 23, 12
  • 76. Index build ā€¢ Nightly batch job on db-dumps ā€¢ Previously mostly python but now moved to Java for performance reason ā€¢ But still lots of python helper scripts :) Tuesday, October 23, 12
  • 77. Content pipeline [Diagram: Label A, Label B, Label C, Label D] Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
  • 78. Content pipeline [Diagram: Label A, Label B, Label C, Label D; Ingestion] Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
  • 79. Content pipeline [Diagram: Label A, Label B, Label C, Label D; Ingestion, Merge] Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
  • 80. Content pipeline [Diagram: Label A, Label B, Label C, Label D; Ingestion, Merge; Curation/enrichment] Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
  • 81. Content pipeline [Diagram: Label A, Label B, Label C, Label D; Ingestion, Merge; Curation/enrichment; Transcoding] Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
  • 82. Content pipeline [Diagram: Label A, Label B, Label C, Label D; Ingestion, Merge, Indexing; Curation/enrichment; Transcoding] Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
  • 83. Content pipeline [Diagram: Label A, Label B, Label C, Label D; Ingestion, Merge, Indexing, Publishing; Curation/enrichment; Transcoding; On site live services, e.g. search, browse] Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
  • 84. Distribution/publish Service A Service B Service C
  • 85. Distribution/publish Service A Index A Service B Index B Index C Service C
  • 86. Distribution/publish Service A Index A Service B Index B Index C Service C
  • 87. Distribution/publish Service A Index A Service B Index B Index C Service C
  • 88. Distribution/publish Service A Index A Service B Index B Index C Service C
  • 89. Distribution/publish Service A Index A Service B Index B Index C Service C
  • 90. Scheduling being migrated to ZooKeeper Image: http://www.flickr.com/photos/seattlemunicipalarchives/with/3797940791/
  • 91. Distribution/publish Staged rollout
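
The slide names staged rollout without giving details, so the following is only a minimal sketch of the general idea under assumed names: push a new index to a small group of hosts first, check them, and only then continue with larger groups. `staged_rollout`, `deploy`, `healthy` and the stage sizes are all hypothetical and are not taken from the talk.

```python
def staged_rollout(hosts, stage_sizes, deploy, healthy):
    """Deploy to `hosts` stage by stage and stop at the first unhealthy stage.

    `deploy(host)` and `healthy(host)` are caller-supplied callables
    (hypothetical hooks, not part of any real deployment tool).
    Returns a tuple (deployed_hosts, completed).
    """
    deployed = []
    remaining = list(hosts)
    sizes = list(stage_sizes)
    while remaining:
        # Use the configured stage sizes first, then take everything that is left.
        size = max(1, sizes.pop(0)) if sizes else len(remaining)
        stage, remaining = remaining[:size], remaining[size:]
        for host in stage:
            deploy(host)
        deployed.extend(stage)
        # Halt before touching further hosts if any host in this stage looks bad.
        if not all(healthy(host) for host in stage):
            return deployed, False
    return deployed, True
```

With `stage_sizes=[1, 2]` and six hosts, the first stage touches one host, the second two, and the last stage the remaining three, so a bad index is caught before it reaches most of the fleet.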
  • 93. Distribution/publish Exponential back-off
  • 94. Distribution/publish Exponential back-off waiting 5s ...
  • 95. Distribution/publish Exponential back-off waiting 5s ... waiting 10s ...
  • 96. Distribution/publish Exponential back-off waiting 5s ... waiting 10s ... waiting 30s ...
  • 97. Distribution/publish Exponential back-off waiting 5s ... waiting 10s ... waiting 30s ... waiting 60s ...
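
The back-off slides can be illustrated with a small retry helper. This is a sketch under assumptions, not the code used in the talk: `backoff_delays` and `retry_with_backoff` are hypothetical names, and the default schedule doubles from 5s up to a 60s cap, whereas the waits shown on the slides (5s, 10s, 30s, 60s) are not a strict doubling. The schedule is therefore a parameter.

```python
import time


def backoff_delays(base=5.0, factor=2.0, cap=60.0):
    """Yield base, base*factor, base*factor**2, ... capped at `cap` seconds."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


def retry_with_backoff(operation, max_attempts=5, delays=None, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` calls have failed.

    `delays` is any iterable of waits in seconds; `sleep` is injectable so the
    helper can be exercised without actually waiting.
    """
    delays = iter(backoff_delays() if delays is None else delays)
    wait = 0.0
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Reuse the last delay if the schedule runs out of values.
            wait = next(delays, wait)
            print("waiting %gs ..." % wait)
            sleep(wait)
```

Passing `delays=[5, 10, 30, 60]` reproduces the waits listed on the slides.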
  • 98. Content pipeline [Diagram: Label A, Label B, Label C, Label D; Ingestion, Merge, Indexing, Publishing; Curation/enrichment; Transcoding; On site live services, e.g. search, browse] Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
  • 100. Choice of database
  • 101. Choice of database Depends on the use case - duh!
  • 102. Choice of database Depends on the use case - duh! • PostgreSQL (e.g. user service)
  • 103. Choice of database Depends on the use case - duh! • PostgreSQL (e.g. user service) • Cassandra (e.g. playlist service)
  • 104. Choice of database Depends on the use case - duh! • PostgreSQL (e.g. user service) • Cassandra (e.g. playlist service) • Tokyo Cabinet (e.g. browse service)
  • 105. Choice of database Depends on the use case - duh! • PostgreSQL (e.g. user service) • Cassandra (e.g. playlist service) • Tokyo Cabinet (e.g. browse service) • Lucene (search service)
  • 106. Choice of database Depends on the use case - duh! • PostgreSQL (e.g. user service) • Cassandra (e.g. playlist service) • Tokyo Cabinet (e.g. browse service) • Lucene (search service) • HDFS
  • 107. PostgreSQL [Pic. of elephant] Image: http2007 (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/
  • 108. PostgreSQL Redundancy + scaling: master/slave
  • 109. PostgreSQL Joins and subqueries - let the query planner roll!
  • 110. PostgreSQL Python?
  • 111. PostgreSQL Python? - psycopg2 + SQL-queries - SQLAlchemy migrator for versioning of db-schemas
  • 112. PostgreSQL Python? - psycopg2 + SQL-queries - SQLAlchemy migrator for versioning of db-schemas Tip! Server side, aka named, cursors:
        import psycopg2
        conn = psycopg2.connect(database='huge_db', user='postgres', password='secret')
        # Passing a name creates a server-side cursor, so the result set stays on the server
        sscursor = conn.cursor('my_cursor')
        sscursor.execute('SELECT * FROM big_table')
        # Pull the rows over in batches instead of loading the whole table at once
        rows = sscursor.fetchmany(1000)
        ...
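
The tip above stops after the first `fetchmany(1000)` call. One way to continue, sketched here under assumptions, is to keep calling `fetchmany` until it returns an empty list. `iter_rows` is a hypothetical helper, and the demo uses the standard-library `sqlite3` module purely as a stand-in so the loop runs without a PostgreSQL server; with psycopg2 the same loop would be applied to the named cursor from the slide.

```python
import sqlite3


def iter_rows(cursor, batch_size=1000):
    """Yield rows one at a time while fetching `batch_size` rows per call."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break  # an empty batch means the result set is exhausted
        for row in rows:
            yield row


# Stand-in database (assumption: table and column names are made up for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO big_table (title) VALUES (?)",
    [("track %d" % i,) for i in range(2500)],
)

cursor = conn.cursor()
cursor.execute("SELECT id, title FROM big_table ORDER BY id")
total = sum(1 for _ in iter_rows(cursor, batch_size=1000))
print(total)
conn.close()
```

Because the helper only relies on `fetchmany`, it works with any DB-API cursor; the memory benefit comes from the server-side cursor, not from the loop itself.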
  • 113. Scaling the content pipeline What to scale for?
  • 114. Scaling the content pipeline Size of catalog
  • 115. Scaling the content pipeline # Users
  • 116. Thank you henok@spotify.com
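  • 117. Distribution/publish Popen + gevent (although IO-bound)
        import subprocess
        import gevent
        from gevent import monkey
        monkey.patch_all()

        def _wait(self):
            # Poll instead of blocking so other greenlets can run while the child process works
            while True:
                res = self.poll()
                if res is not None:
                    return res
                gevent.sleep(0.1)

        # Replace the blocking wait() with the cooperative version
        subprocess.Popen.wait = _wait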