MongoDB for storing humongous music database

MusicBrainz is an encyclopedia of music tracks, artists and albums. Its data is available as a PostgreSQL database under a CC license. Two approaches to loading the database into MongoDB are examined: in the first, 4 tables are denormalized in Postgres and then loaded into MongoDB; in the second, the tables are loaded into MongoDB and denormalized into a single collection there. We also show MongoDB's full-text index.

Published in: Technology

  1. Working with Humongous Music Database, MongoDB
     Prasoon Kumar, #HyderabadDataScienceGroup
  2. Agenda
     •  MongoDB Features
     •  Bulk Import
     •  Full Text Index creation
     •  Full Text Search
     •  MusicBrainz Database
  3. MusicBrainz
  4. What is MusicBrainz?
     •  MusicBrainz is a community-maintained open source encyclopedia of music information.
     •  This means that anyone, including you, can help contribute to the project by adding information about your favorite artists and their related works.
     •  Robert Kaye founded MusicBrainz. The project has grown rapidly from a one-man operation to an international community of enthusiasts who appreciate both music and music metadata.
  5. MusicBrainz
     •  Along the way, the scope of the project has expanded from its origins as a mere CDDB replacement to today, where MusicBrainz has become a true encyclopedia of music.
     •  As an encyclopedia and as a community, MusicBrainz exists solely to collect as much information about music as we can without discriminating or preferring one "type" of music over another.
  6. MusicBrainz Database
     The MusicBrainz Database is where all of the various pieces of information we collect about music are stored, from artists and their releases to works and their composers, and of course much more. The majority of the data in the MusicBrainz Database is placed in the Public Domain, which means that anyone can download the data and use it in any way they see fit. The remaining data is released under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license.
  7. MongoDB
     •  Document Database
     •  Open-Source
     •  General Purpose
  8. Scalability: Auto-Sharding
     •  Increase capacity as you go
     •  Commodity and cloud architectures
     •  Improved operational simplicity and cost visibility
  9. Drivers & Ecosystem
     Support for the most popular languages and frameworks: Morphia, MEAN Stack, Java, Python, Perl, Ruby
  10. Music Mongo
      •  Load (import)
      •  Run
         –  Exact match
         –  Full text search
      •  Todo
         –  Application interface
  11. AWS Setup
      s0                54.225.100.65
      s1                54.235.157.214
      s2                54.225.100.42
      Client & mongos   54.225.100.39
      config            184.73.195.120
  12. Relevant schema of MusicBrainz:
  13. Import strategies
      •  Denormalized from source DB
         –  Import TSV in PostgreSQL
         –  Export joined tables from PostgreSQL
         –  mongoimport TSV
      •  Separate collections from TSV
         –  mongoimport TSVs into temporary collections
         –  "Join" temporary collections in client (PyMongo) and insert to destination collection
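Both strategies end with a `mongoimport` of TSV files. As a minimal sketch, the helper below only builds the command line for one header-less TSV file; the file path, temporary collection name, and field list are hypothetical placeholders, not values taken from the slides.

```python
import shlex


def mongoimport_tsv_cmd(db, collection, tsv_path, fields, host="localhost:27017"):
    """Build the mongoimport argument list for one header-less TSV file."""
    return [
        "mongoimport",
        "--host", host,
        "--db", db,
        "--collection", collection,
        "--type", "tsv",
        # Column names are passed explicitly because the file has no header row.
        "--fields", ",".join(fields),
        "--file", tsv_path,
    ]


# Hypothetical path, collection name, and columns, for illustration only.
cmd = mongoimport_tsv_cmd(
    "musicbrainz2", "recording_tmp", "mbdump/recording", ["id", "gid", "name"]
)
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` on a machine where `mongoimport` is installed and a `mongod` or `mongos` is reachable at `host`.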
  14. Steps for creating denormalized table:
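The steps themselves did not survive in the transcript. The sketch below only illustrates the general idea of "join the tables, then export a TSV", using Python's built-in `sqlite3` as a stand-in for PostgreSQL. The two toy tables, their columns, and the name `recording_denorm` are assumptions for illustration, not the talk's actual schema or SQL.

```python
import csv
import io
import sqlite3


def denormalize(conn):
    """Materialize a joined (denormalized) table from two normalized ones."""
    conn.execute(
        """
        CREATE TABLE recording_denorm AS
        SELECT r.id AS recording_id, r.name AS name, ac.name AS artist
        FROM recording AS r
        JOIN artist_credit AS ac ON ac.id = r.artist_credit
        """
    )


def export_tsv(conn):
    """Dump the denormalized table as TSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    cur = conn.execute(
        "SELECT recording_id, name, artist FROM recording_denorm "
        "ORDER BY recording_id"
    )
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur)
    return out.getvalue()


# Toy data; the real dump has millions of rows per table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE artist_credit (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE recording (id INTEGER PRIMARY KEY, name TEXT, artist_credit INTEGER);
    INSERT INTO artist_credit VALUES (10, 'Artist X'), (11, 'Artist Y');
    INSERT INTO recording VALUES (1, 'Song A', 10), (2, 'Song B', 11), (3, 'Song C', 10);
    """
)
denormalize(conn)
tsv = export_tsv(conn)
print(tsv)
```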
  15. Client join
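The slide's code is not in the transcript. A client-side join in the spirit of the second import strategy could look roughly like the sketch below: one temporary collection is read into an in-memory lookup table, the other is streamed, and the merged documents are inserted in batches. The field names (`id`, `name`, `artist_credit`) and the temporary collection names are assumptions; `find()` and `insert_many()` are standard PyMongo collection methods.

```python
def join_recordings(recordings, artist_credits):
    """Attach the artist-credit name to each recording (client-side join)."""
    # Small side of the join: keep it in memory, keyed by id.
    credits_by_id = {c["id"]: c["name"] for c in artist_credits}
    for rec in recordings:
        yield {
            "_id": rec["id"],
            "name": rec["name"],
            "artist": credits_by_id.get(rec["artist_credit"]),
        }


def batches(docs, size=1000):
    """Group an iterable of documents into lists of at most `size` items."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def load(db, size=1000):
    """Join temporary collections and insert into the destination collection.

    `db` is expected to be a pymongo Database; collection names are assumed.
    """
    joined = join_recordings(db.recording_tmp.find(), db.artist_credit_tmp.find())
    for batch in batches(joined, size):
        db.records3.insert_many(batch)


# The join logic works on plain dicts, so it can be tried without a server.
sample = list(
    join_recordings(
        [{"id": 1, "name": "Song A", "artist_credit": 10}],
        [{"id": 10, "name": "Artist X"}],
    )
)
print(sample)
```

Batching keeps the number of round trips low without holding the whole joined result in memory.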
  16. Import statistics
      recording:      2013-11-11T22:02:51.213+0000 imported 12817015 objects   real 69m49.949s
      artist_credit:  2013-11-11T22:04:41.469+0000 imported 756247 objects     real 1m50.256s
      track:          2013-11-11T22:48:59.423+0000 imported 15427255 objects   real 44m17.973s
      release:        2013-11-11T22:53:06.627+0000 imported 1208854 objects    real 4m7.183s
      medium:         2013-11-11T22:57:45.030+0000 imported 1343234 objects    real 4m38.414s
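To compare the five imports, it helps to turn each line into a rate. The sketch below parses the `real` durations and divides the object counts by them; the figures are copied from the slide, while the helper name `parse_duration` is illustrative.

```python
import re


def parse_duration(text):
    """Convert a `time`-style duration such as '69m49.949s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# (objects imported, wall-clock time) per collection, as reported on the slide.
IMPORTS = {
    "recording": (12817015, "69m49.949s"),
    "artist_credit": (756247, "1m50.256s"),
    "track": (15427255, "44m17.973s"),
    "release": (1208854, "4m7.183s"),
    "medium": (1343234, "4m38.414s"),
}

rates = {
    name: count / parse_duration(elapsed)
    for name, (count, elapsed) in IMPORTS.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:,.0f} objects/s")
```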
  17. Import via Postgres

      Operation          Time
      Postgres Import    08m11s
      Denormalize        14m57s
      Export             00m29s

      Operation          Unsharded   Sharded
      MongoDB Import     14m59s      12m15s
      Index              07m45s      02m35s
      Overall            45m23s      40m13s
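As a quick arithmetic check on these timings, the sketch below adds up the listed steps for each configuration. The plain sums come out to 46m21s (unsharded) and 38m27s (sharded), which do not match the reported Overall values of 45m23s and 40m13s; the slide does not say how Overall was measured, so the sums are shown only for comparison.

```python
def to_seconds(text):
    """Convert a duration such as '08m11s' to whole seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


def to_text(total):
    """Format whole seconds back into the slide's 'MMmSSs' style."""
    return f"{total // 60:02d}m{total % 60:02d}s"


# Steps common to both runs: Postgres import, denormalize, export.
SHARED = ["08m11s", "14m57s", "00m29s"]
# MongoDB import and index build, per configuration.
UNSHARDED = ["14m59s", "07m45s"]
SHARDED = ["12m15s", "02m35s"]

unsharded_sum = sum(to_seconds(t) for t in SHARED + UNSHARDED)
sharded_sum = sum(to_seconds(t) for t in SHARED + SHARDED)
index_ratio = to_seconds("07m45s") / to_seconds("02m35s")

print("unsharded steps:", to_text(unsharded_sum))
print("sharded steps:  ", to_text(sharded_sum))
print("index build, unsharded / sharded:", index_ratio)
```

The index build is the step that differs most between the two runs, while the MongoDB import itself changes comparatively little.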
  18. Indexes & Sharding
  19. Indexes & Sharding - Text Index
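The slide's content is not in the transcript. As a hedged sketch of the two query styles listed under "Run" (exact match and full text search), the helpers below build the index specification and query documents as plain dicts. Indexing the `name` field is an assumption, and the `$text` operator shown is the current query syntax, which may differ from what the talk used in 2013. `create_index`, `find`, `sort`, and `limit` are standard PyMongo calls.

```python
def text_index_spec(field="name"):
    """Index specification for a text index on one field."""
    # "text" is the value of the pymongo.TEXT constant.
    return [(field, "text")]


def exact_match_query(name):
    """Query matching documents whose name equals `name` exactly."""
    return {"name": name}


def text_query(terms):
    """Full text query: matches documents containing any of the terms."""
    return {"$text": {"$search": terms}}


def search(collection, terms, limit=10):
    """Run a relevance-ranked text search on a pymongo Collection."""
    collection.create_index(text_index_spec())
    score = {"$meta": "textScore"}
    cursor = collection.find(text_query(terms), {"score": score})
    return list(cursor.sort([("score", score)]).limit(limit))


print(exact_match_query("Yesterday"))
print(text_query("abbey road"))
```

With a running deployment, `search(client["musicbrainz2"]["records3"], "abbey road")` would return the best-scoring documents first; the exact-match query can be passed to `find` directly and is served by an ordinary index on `name`.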
  20. Indexes & Sharding - Shard key
      musicbrainz2.records3
        shard key: { "name" : 1, "_id" : 1 }
        chunks:
          shard0002   18
          shard0000   18
          shard0001   18
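The output above shows the collection sharded on `{ "name" : 1, "_id" : 1 }` with chunks spread evenly over three shards. A minimal sketch of the admin commands that would set this up is given below; the namespace and key come from the slide, while the helper names are illustrative. `enableSharding` and `shardCollection` are MongoDB admin commands, sent here through PyMongo's `Database.command`.

```python
def sharding_commands(db_name, collection, key):
    """Build the admin commands that shard `db_name.collection` on `key`."""
    return [
        {"enableSharding": db_name},
        {"shardCollection": f"{db_name}.{collection}", "key": key},
    ]


def apply_sharding(client, commands):
    """Send each command to the admin database via a mongos connection."""
    for command in commands:
        client.admin.command(command)


def is_balanced(chunks_per_shard):
    """True when every shard holds the same number of chunks."""
    return len(set(chunks_per_shard.values())) == 1


commands = sharding_commands("musicbrainz2", "records3", {"name": 1, "_id": 1})
# Chunk counts as reported on the slide.
chunks = {"shard0000": 18, "shard0001": 18, "shard0002": 18}

print(commands)
print("balanced:", is_balanced(chunks))
```

`apply_sharding` needs a `pymongo.MongoClient` connected to the `mongos` router, not to an individual shard.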
  21. Thank You
      team = {
        members: ["Jonathan", "Prasoon"],
        company: "MongoDB"
      }
      @prasoonk
