Schema Design by Gary Murakami
 

  • A long, long time ago in a state not too far from here, I was in high school. There I discovered the wonder of computer programming. I was on the chess team, and on the wrestling team. I ran laps as conditioning for wrestling, and to keep running, I dreamed up algorithms and data structures to play chess. The importance of data structures was confirmed to me at Northwestern University when I took a course that used Pascal and Niklaus Wirth’s book “Algorithms + Data Structures = Programs.” And such data structures could be used to program computers to play chess. (Next slide; skip the following.) At the Illinois High School chess finals, I was astounded by my opponent. Fortunately, it was not by his play on the chess board, but by an extremely thick printout of his Tic-Tac-Toe program. It was one huge nested if statement that exhaustively enumerated all of the possibilities. The complexity of this is illustrated in the diagram that shows the map of optimal moves for O, playing second. I knew that the “program” was an abuse of a programming language and of a tree, and worse than a chess blunder, a travesty. An application without good Schema Design is a similar travesty.
  • Chess 4.5 was a pioneering chess program in the 1970s. It was the first program to win a human chess tournament. I enjoyed playing against it at Northwestern, and I even played a rated chess game against the programmer Dave Slate. Chess 4.5 added a database of “book” openings that greatly improved the capability of the program. So the chess program melded algorithms, data structures, and a database to take on human chess masters. Could you do similar great things with good schema design?
  • Perhaps you will have moments of insight where you say “Aha!” For those of you who say “Of course, I knew that,” may the truth resonate and grow. Some might disagree strongly with my general recommendations. May you all find the presentation interesting and thought provoking. And may it inspire enthusiasm in your schema design work for your applications.
  • Schema Design is very important; its impact on your application is pervasive.
  • The wrong data structure will hurt you. The proper data structure can make all the pieces fall into place.
  • One-dimensional storage can be very fast but is limited with respect to querying. Speed is why key-value stores are popular for modern web applications.
  • A record in a traditional relational DB is a tuple or row in a table. This table representation forces normalization of your data. Normalization is good for querying anything that the data can answer, and it is good for new queries. Relational DBs won out over other DBs that came before. To me, the winning technology is that every field or value is first class. In essence, every field can be addressed in queries and can be indexed for faster responses. But normalization requires many tables, joins to rehydrate relations, and indexes to make joins faster, and it results in poor data locality. For example, in order to represent an array, another table must be used just for that array. Slow performance is why NoSQL alternatives are becoming popular. * In-place updates: SQL storage may use “padding” space for dynamic strings instead of fixed allocation.
  • “Document” is somewhat of a misnomer: not the Constitution or XML, but object data (without methods), often visualized as JSON. * Inline updates: a padding factor can reduce the need to move a document. The essential capability (querying and indexing) persists and gets even better. The document structure can match your data structures, that is, your schema.
  • Answers come from the data; questions come from the application. Does your schema take advantage of your application-specific knowledge of known queries, use cases, and client-program data structures? Traditional DBs make it hard to take advantage of them. Document DBs make it easy to take advantage of them. MongoDB documents can match your application, given good schema design.
  • Not “schema-less” but rather “flexible schema.” Common structure can be enforced by the application. While MongoDB does not enforce common structure, neither does it restrict your application. Documents may have a common structure that is optionally extended at the document level. Use this flexibility for a class hierarchy with subclasses: the traditional relational representation requires separate tables, or a workaround with multiple mostly-empty columns. Example: three days for a schema migration. Keywords: flexible, choice, evolve, change, modify.
  • The lack of multivalued fields is usually the first complaint of programmers who don’t wish to pay the cost of normalization. The concept of arrays incorporates multiple values and also associations involving many entities. Keywords: array, multiple, many.
  • Documents may have a common structure that is optionally extended at the document level. The application mapping can enforce the required and optional fields. What could you do with these building blocks? Perhaps play chess, and beat human chess masters?
  • Belle (picture on the left) was the first computer built for the sole purpose of chess playing. It was developed by former coworkers of mine, Joe Condon and Ken Thompson, at Bell Labs in the 1970s and 1980s. Ken is renowned for developing the Unix operating system in the C programming language. Belle officially became the first master-level machine in 1983 and dominated play throughout the 1980s. Ken used Belle extensively for pioneering research with chess endgame tablebases. Starting from all possible checkmates with 3 pieces, retrograde analysis was used to exhaustively calculate all possible positions with forced mates. Ken completed the endgame tablebase for up to five pieces and published it on CD-ROM. It represents years of compute time and is still available online under the caption “Play chess with God.” Good Schema Design matched the endgame tablebase to live chess playing so that Belle could beat human chess masters. Let’s investigate good schema design for an application.
  • “Vintage” business card
  • Contact and Address entities are associated one to one. Traditional relational association is via referencing. In this example, the contact record for Steve Jobs has a reference to his address via the address_id field.
  • We’ve discussed Entities, Associations, Referencing, Embedding, and business cards, and we’ll build on that knowledge. Chess programmers have built on the endgame database with interesting results.
  • Entity-Relational diagram
  • Entity-Relational diagram for embedding documents
  • Left (relational): requires two fetches/queries, or a join in a relational DB. Right (document): requires only one fetch/query and has data locality.
  • We have discussed Entities, Associations, Referencing, Embedding, and business cards as sample data. We’ll build on that knowledge. Chess programmers have built on the endgame database with interesting results.
  • Likewise for your application, use Schema Design and the flexible schema of MongoDB to empower your database analytics
  • A common example will help us understand the joy of flexible document structure.
  • Left: One to one. We're going to assume users only have one Twitter account. A thumbnail is a small profile image, while a portrait is a very large profile image. Right: One to many. Middle: Many to many.
  • Arrays of references are more direct than a join table and save a fetch.
  • Many to many is fundamentally not a “contains” relationship. Concerns (exceptional cases): exceeding the maximum document size due to large data or scaling; transferring very large documents is probably a performance concern; scaling may affect working set size. The schema can be adjusted to improve performance: fetch only the data that you need.
  • Embedding entities in the contact document reduces six fetches to one
  • We’ve completed our address book example, but what about chess?
  • Chess is not just an interesting challenge that raises philosophical questions about the intelligence of humans and computers. It is also a prime example of the effectiveness of algorithms plus data structures, plus good schema design for databases. And the endgame database has the challenges of big data and working set size that we face in our growing big data applications.
  • To increase resources with MongoDB: use a replica set and read from secondaries; use sharding.
  • Embedding is a bit like pre-joined data. BSON (Binary JSON) document ops are easy for the server. Choose embedding by default as opposed to referencing. Embed (a 90/10 rule of thumb) when the “one” or “many” objects are viewed in the context of their parent. Reference for easy consistency with “many to many” associations without duplicated data. Referencing is not just the default for relational DBs; there is no other choice.
  • You no longer have to coerce your data into a form acceptable to a SQL database. You can now architect or tailor your data to your application in your programming language and persist it to MongoDB.
  • May you build Great Big Data Applications. Perhaps you can say inspiring quotes like Ken Thompson, “Play chess with God.”
  • Good news: giving power and control back to the programmer and the programming language. Ken and I worked on Perceptual Audio Coding, better known as Advanced Audio Coding or AAC, as found in the iPod and iPhone. So I hope that this will inspire you to “Play music with God” to build your killer app. How is this made possible? Here’s the technology in MongoDB that makes this all possible.
  • BSON (Binary JSON) is the “magic” or core technology in MongoDB for data structures and performance. BSON does not have to be parsed like JSON, but is rather a format that can be traversed easily.
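The point about traversing rather than parsing can be made concrete with a small sketch. It hand-encodes a tiny subset of the BSON layout (string and 32-bit integer elements only) and then looks up a field by skipping over values using their length prefixes. The helper names `encode` and `find_field` are made up for this illustration; a real application would rely on a driver's BSON library.

```python
import struct


def encode(doc):
    """Encode a flat dict of str/int values in a BSON-like layout."""
    body = b""
    for name, value in doc.items():
        key = name.encode("utf-8") + b"\x00"
        if isinstance(value, str):
            data = value.encode("utf-8") + b"\x00"
            # Type 0x02: int32 byte length, then NUL-terminated bytes.
            body += b"\x02" + key + struct.pack("<i", len(data)) + data
        elif isinstance(value, int):
            # Type 0x10: 32-bit little-endian integer.
            body += b"\x10" + key + struct.pack("<i", value)
        else:
            raise TypeError("this sketch supports only str and int values")
    # Total size = 4 size bytes + elements + trailing NUL.
    return struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"


def find_field(buf, wanted):
    """Walk the elements, skipping unwanted values via their lengths."""
    pos = 4  # skip the document size prefix
    while buf[pos] != 0:
        kind = buf[pos]
        end = buf.index(b"\x00", pos + 1)  # end of the field name
        name = buf[pos + 1:end].decode("utf-8")
        pos = end + 1
        if kind == 0x02:
            (length,) = struct.unpack_from("<i", buf, pos)
            if name == wanted:
                return buf[pos + 4:pos + 4 + length - 1].decode("utf-8")
            pos += 4 + length  # jump over the string without decoding it
        elif kind == 0x10:
            if name == wanted:
                return struct.unpack_from("<i", buf, pos)[0]
            pos += 4
        else:
            raise ValueError("element type not handled in this sketch")
    return None


buf = encode({"name": "Steven Jobs", "address_id": 1})
print(find_field(buf, "address_id"))
```

Because every string carries its length, the lookup for `address_id` never decodes the `name` value; it only reads four bytes and jumps ahead.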

Schema Design by Gary Murakami: Presentation Transcript

  • Lead Engineer / Evangelist Gary J. Murakami, Ph.D. #MongoDB Schema Design
  • Schema Design – Gary Murakami
  • Chess 4.5 (Northwestern University): Larry Atkin & Dave Slate
  • Agenda
      • What is a Record?
      • Core Concepts
      • What is an Entity?
      • Associating Entities
      • General Recommendations
      • Questions
  • All application development is Schema Design
  • Success comes from Proper Data Structure
  • What is a Record?
  • Key → Value
      • One-dimensional
      • Single value is a blob
      • Query on key only
      • No schema
      • Value cannot be updated, only replaced
      [Diagram labels: Key, Blob]
  • Relational
      • Two-dimensional (tuples)
      • Each field is a single value
      • Query on any field
      • Very structured schema (table)
      • In-place updates *
      • Normalization requires many tables, joins, and indexes, and results in poor data locality and performance
      [Diagram label: Primary Key]
  • Document
      • N-dimensional
      • Each field can contain 0, 1, many, or embedded values
      • Query on any field & level
      • Flexible schema
      • Inline updates *
      • Embedding related data has optimal data locality, requires fewer indexes, and has better performance
      [Diagram label: _id]
  • Core Concepts
  • Traditional Schema Design: Focus on data storage
  • Document Schema Design: Focus on data use
  • Another way to think about it
      • Traditional: What answers do I have?
      • Document: What questions do I have?
  • Three Building Blocks of Document Schema Design
  • 1 – Flexibility
      • Choices for schema design
      • Each record can have different fields
      • Field names consistent for programming
      • Common structure can be enforced by application
      • Easy to evolve as needed
  • 2 – Arrays: Multiple Values per Field
      • Each field can be:
          – Absent
          – Set to null
          – Set to a single value
          – Set to an array of many values
      • Query for any matching value
          – Can be indexed, and each value in the array is in the index
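The idea that each value in an array gets its own index entry can be sketched with a plain inverted index. This is only an in-memory illustration, not how the server implements its indexes; the sample contacts, the `tags` field, and `build_index` are hypothetical, and treating an absent field the same as null is a simplification made here.

```python
from collections import defaultdict

contacts = [
    {"_id": 1, "name": "Ann", "tags": ["chess", "music"]},  # array of many values
    {"_id": 2, "name": "Bob", "tags": "chess"},             # single value
    {"_id": 3, "name": "Cy", "tags": None},                 # set to null
    {"_id": 4, "name": "Di"},                               # field absent
]


def build_index(docs, field):
    """Map every indexed value to the set of _ids that contain it."""
    index = defaultdict(set)
    for doc in docs:
        value = doc.get(field)  # an absent field is treated like null
        values = value if isinstance(value, list) else [value]
        for item in values:
            index[item].add(doc["_id"])  # one entry per array element
    return index


tags_index = build_index(contacts, "tags")
print(sorted(tags_index["chess"]))
```

A lookup for `"chess"` then finds both the document that stores a single value and the one that stores an array containing it.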
  • 3 – Embedded Documents
      • Any value can be a document
      • Nested documents provide structure
      • Query any field at any level
          – Can be indexed
  • Belle and Endgame tablebases: “Play chess with God” (Ken Thompson)
  • What is an Entity?
  • An Entity
      • Object in your model
      • Associations with other entities

      Referencing (Relational)    | Embedding (Document)
      has_one                     | embeds_one
      belongs_to                  | embedded_in
      has_many                    | embeds_many
      has_and_belongs_to_many     |

      MongoDB has both referencing and embedding for universal coverage
  • Let's model something together: How about a business card?
  • Business Card
  • Referencing
      Contacts
        {
          "_id": 2,
          "name": "Steven Jobs",
          "title": "VP, New Product Development",
          "company": "Apple Computer",
          "phone": "408-996-1010",
          "address_id": 1
        }
      Addresses
        {
          "_id": 1,
          "street": "10260 Bandley Dr",
          "city": "Cupertino",
          "state": "CA",
          "zip_code": "95014",
          "country": "USA"
        }
  • Embedding
      Contacts
        {
          "_id": 2,
          "name": "Steven Jobs",
          "title": "VP, New Product Development",
          "company": "Apple Computer",
          "address": {
            "street": "10260 Bandley Dr",
            "city": "Cupertino",
            "state": "CA",
            "zip_code": "95014",
            "country": "USA"
          },
          "phone": "408-996-1010"
        }
  • Relational Schema
      • Contact: name, company, title, phone
      • Address: street, city, state, zip_code
  • Document Schema
      • Contact: name, company, title, phone, address (street, city, state, zip_code)
  • How are they different? Why?
      • Relational: Contact (name, company, title, phone) and Address (street, city, state, zip_code)
      • Document: Contact (name, company, title, phone, address (street, city, state, zip_code))
  • Schema Flexibility
        {
          "name": "Steven Jobs",
          "title": "VP, New Product Development",
          "company": "Apple Computer",
          "address": {
            "street": "10260 Bandley Dr",
            "city": "Cupertino",
            "state": "CA",
            "zip_code": "95014"
          },
          "phone": "408-996-1010"
        }

        {
          "name": "Larry Page",
          "url": "http://google.com/",
          "title": "CEO",
          "company": "Google!",
          "email": "larry@google.com",
          "address": {
            "street": "555 Bryant, #106",
            "city": "Palo Alto",
            "state": "CA",
            "zip_code": "94301"
          },
          "phone": "650-618-1499",
          "fax": "650-330-0100"
        }
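Since the database does not enforce a common structure, the application can. A minimal sketch of that idea follows, assuming the application decides that `name`, `title`, `company`, and `phone` are required; that rule and the helper names are assumptions for illustration, not something the slides prescribe.

```python
# Assumed application rule: these fields must be present on every contact.
REQUIRED_FIELDS = {"name", "title", "company", "phone"}


def missing_fields(doc):
    """Required fields that the document lacks."""
    return sorted(REQUIRED_FIELDS - set(doc))


def extra_fields(doc):
    """Optional fields that extend the common structure."""
    return sorted(set(doc) - REQUIRED_FIELDS)


steve = {
    "name": "Steven Jobs",
    "title": "VP, New Product Development",
    "company": "Apple Computer",
    "address": {"street": "10260 Bandley Dr", "city": "Cupertino",
                "state": "CA", "zip_code": "95014"},
    "phone": "408-996-1010",
}
larry = {
    "name": "Larry Page",
    "url": "http://google.com/",
    "title": "CEO",
    "company": "Google!",
    "email": "larry@google.com",
    "address": {"street": "555 Bryant, #106", "city": "Palo Alto",
                "state": "CA", "zip_code": "94301"},
    "phone": "650-618-1499",
    "fax": "650-330-0100",
}

for contact in (steve, larry):
    print(contact["name"], missing_fields(contact), extra_fields(contact))
```

Both documents pass the required-field check, and the second simply carries more optional fields, with no migration needed for the first.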
  • Longest “Database Endgame” Mate
      • Augment schema with metadata
          – Distance to mate (DTM)
          – Distance to conversion (DTC)
      • Retrograde analysis of DB
      • Longest checkmate
          – 6 piece: 262 moves, KRNKNN
          – 7 piece: 517 moves, so far
      • Completion by 2015
  • Example
  • Let’s Look at an Address Book
  • Address Book
      • What questions do I have?
      • What are my entities?
      • What are my associations?
  • Address Book Entity-Relationship
      • Contacts: name, company, title
      • Addresses: type, street, city, state, zip_code
      • Phones: type, number
      • Emails: type, address
      • Thumbnails: mime_type, data
      • Portraits: mime_type, data
      • Groups: name
      • Twitters: name, location, web, bio
      [Diagram: association lines between Contacts and the other entities, labeled with 1 and N cardinalities]
  • Associating Entities
  • One to One
      [Figure: address book entity-relationship diagram]
  • One to One: Schema Design Choices
      • contact (twitter_id) 1:1 twitter
      • contact 1:1 twitter (contact_id)
      • Contact embeds twitter (1)
      • Redundant to track relationship on both sides
          – Both references must be updated for consistency
      • Saves a fetch if no twitter
  • One to One: General Recommendation
      • Full contact info all at once
          – Contact embeds twitter
      • Parent-child relationship
          – “contains”
      • No additional data duplication
      • Can query or index on embedded field
          – e.g., “twitter.name”
  • One to Many
      [Figure: address book entity-relationship diagram]
  • One to Many: Schema Design Choices
      • contact (phone_ids: [ ]) 1:N phone
      • contact 1:N phone (contact_id)
      • Contact embeds phones (N)
      • Redundant to track relationship on both sides
          – Both references must be updated for consistency
      • Not possible in relational DBs
      • Saves a fetch if no phones
  • One to Many: General Recommendation
      • Full contact info all at once
          – Contact embeds multiple phones
      • Parent-children relationship
          – “contains”
      • No additional data duplication
      • Can query or index on any field
          – e.g., { “phones.type”: “mobile” }
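A query such as `{ "phones.type": "mobile" }` reaches through an array of embedded documents. The following sketch shows how such a dotted path can be resolved against a plain Python dict. It approximates simple equality matching only and is not the server's query engine; `values_at` and `matches` are hypothetical helpers, and the mobile number is made up.

```python
def values_at(doc, path):
    """Collect all values reachable via a dotted path, fanning out over arrays."""
    current = [doc]
    for key in path.split("."):
        next_level = []
        for node in current:
            candidates = node if isinstance(node, list) else [node]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_level.append(candidate[key])
        current = next_level
    # Flatten a final array so that any single element can match.
    result = []
    for value in current:
        result.extend(value if isinstance(value, list) else [value])
    return result


def matches(doc, query):
    """True if every dotted path in the query has a matching value."""
    return all(expected in values_at(doc, path)
               for path, expected in query.items())


contact = {
    "name": "Gary J. Murakami, Ph.D.",
    "twitter": {"name": "GaryMurakami"},
    "phones": [
        {"type": "work", "number": "1-866-237-8815 x8015"},
        {"type": "mobile", "number": "555-0100"},  # hypothetical number
    ],
}

print(matches(contact, {"phones.type": "mobile"}))
print(matches(contact, {"twitter.name": "GaryMurakami"}))
```

The same helper handles the one-to-one case (`twitter.name`) and the one-to-many case (`phones.type`), which is the practical payoff of embedding both under the contact.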
  • Many to Many
      [Figure: address book entity-relationship diagram]
  • Many to Many: Traditional Relational Association
      • Contacts: name, company, title, phone
      • Groups: name
      • Join table GroupContacts: group_id, contact_id
      • Join table is crossed out: use arrays instead
  • Many to Many: Schema Design Choices
      • group (contact_ids: [ ]) N:N contact
      • group N:N contact (group_ids: [ ])
      • Redundant to track relationship on both sides
          – Both references must be updated for consistency
      • group embeds contacts (N)
      • contact embeds groups (N)
      • Redundant to track relationship on both sides
          – Duplicated data must be updated for consistency
  • Many to Many: General Recommendation
      • Depends on use case
          1. Simple address book
              • Contact references groups
          2. Corporate email groups
              • Group embeds contacts for performance
      • group N:N contact (group_ids: [ ])
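For the simple address book case, the contact holds an array of group references and no join table is needed. A small in-memory sketch of that layout is below; the sample contacts, groups, and helper names are hypothetical.

```python
# Groups are stored once; contacts reference them by _id in an array.
groups = {
    10: {"_id": 10, "name": "Chess"},
    11: {"_id": 11, "name": "Music"},
}
contacts = [
    {"_id": 1, "name": "Ann", "group_ids": [10, 11]},
    {"_id": 2, "name": "Bob", "group_ids": [10]},
    {"_id": 3, "name": "Cy", "group_ids": []},
]


def group_names(contact):
    """Follow the contact's references to its groups."""
    return [groups[group_id]["name"] for group_id in contact["group_ids"]]


def members(group_id):
    """Find contacts whose reference array contains the group."""
    return [c["name"] for c in contacts if group_id in c["group_ids"]]


print(group_names(contacts[0]))
print(members(10))
```

The relationship is tracked on one side only, so renaming a group touches a single document and there is no second reference list to keep consistent.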
  • Document model: holistic and efficient representation
      • Contacts: name, company, title
          – addresses (embedded, many): type, street, city, state, zip_code
          – phones (embedded, many): type, number
          – emails (embedded, many): type, address
          – thumbnail (embedded): mime_type, data
          – twitter (embedded): name, location, web, bio
      • Portraits: mime_type, data
      • Groups: name
  • Contact document example
        {
          "name": "Gary J. Murakami, Ph.D.",
          "company": "10gen (the MongoDB company)",
          "title": "Lead Engineer and Ruby Evangelist",
          "twitter": {
            "name": "GaryMurakami",
            "location": "New Providence, NJ",
            "web": "http://www.nobell.org"
          },
          "portrait_id": 1,
          "addresses": [
            { "type": "work", "street": "229 W 43rd St.", "city": "New York", "zip_code": "10036" }
          ],
          "phones": [
            { "type": "work", "number": "1-866-237-8815 x8015" }
          ],
          "emails": [
            { "type": "work", "address": "gary.murakami@10gen.com" },
            { "type": "home", "address": "gjm@nobell.org" }
          ]
        }
  • Can We Solve Chess One Day?
      • Chess tablebase problem
          – Chess programs often play worse
          – Search is not localized, poor cache performance, seeks
          – Working set too large for memory
      • Endgame database size: big data
          – 5 piece: 7 GB, compressed 75%
              • 157 MB Shredderbase: 1000x
              • 441 MB Shredderbase: 10,000x
          – 6 piece: 1.2 TB compressed
          – 7 piece: 70 TB, estimated by 2015
  • Working Set
      1. To reduce the working set
          – reference less-used data instead of embedding
              • extract into referenced child document
          – reference bulk data, e.g., portrait
      2. To increase resources
          – read from secondaries in a replica set
          – use sharding
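Extracting bulk data into a referenced child document can be sketched as a small transformation. The function below moves an embedded `portrait` into a separate store and leaves a `portrait_id` behind, mirroring the contact document example; the function name, the dict used as a stand-in collection, and the id scheme are assumptions for illustration.

```python
def extract_portrait(contact, portraits):
    """Move embedded bulk data into its own store, leaving a reference."""
    slim = dict(contact)  # shallow copy; the original is left untouched
    portrait = slim.pop("portrait", None)
    if portrait is not None:
        portrait_id = len(portraits) + 1  # simplistic id scheme for the sketch
        portraits[portrait_id] = {"_id": portrait_id, **portrait}
        slim["portrait_id"] = portrait_id
    return slim


portraits = {}
contact = {
    "name": "Gary J. Murakami, Ph.D.",
    "portrait": {"mime_type": "image/jpeg", "data": b"\x00" * 1_000_000},
}

slim = extract_portrait(contact, portraits)
print(sorted(slim))
```

The slimmed contact stays small enough to keep in memory for everyday queries, and the large portrait is fetched only when someone actually asks for it.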
  • General Recommendations
  • Embedding over Referencing
      • Embed
          – When “one” or “many” objects are viewed with their parent
          – For performance
          – For atomicity
      • Reference
          – When you need more scaling: max document size is 16 MB
          – For easy “many to many” associations
          – For smaller parent documents and working set
  • Legacy Migration
      1. Copy existing schema & some data to MongoDB
      2. Iterate schema design
          1. Measure performance and find bottlenecks
          2. Denormalize by embedding
              1. one to one associations first
              2. one to many associations next
              3. many to many associations last
          3. Examine, measure and analyze, review concerns, scaling
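The first denormalization step, embedding one-to-one associations, can be sketched as a pure function over rows copied from the legacy schema. It uses the business card data from earlier in the deck; the function name and the list-of-dicts representation are assumptions, and a real migration would read from and write to the database in batches.

```python
def embed_one_to_one(contacts, addresses):
    """Replace each contact's address_id with the embedded address."""
    by_id = {address["_id"]: address for address in addresses}
    result = []
    for contact in contacts:
        doc = {k: v for k, v in contact.items() if k != "address_id"}
        address = by_id.get(contact.get("address_id"))
        if address is not None:
            # Drop the child's own _id; it is no longer a separate document.
            doc["address"] = {k: v for k, v in address.items() if k != "_id"}
        result.append(doc)
    return result


legacy_contacts = [
    {"_id": 2, "name": "Steven Jobs", "company": "Apple Computer", "address_id": 1},
]
legacy_addresses = [
    {"_id": 1, "street": "10260 Bandley Dr", "city": "Cupertino",
     "state": "CA", "zip_code": "95014", "country": "USA"},
]

migrated = embed_one_to_one(legacy_contacts, legacy_addresses)
print(migrated[0]["address"]["city"])
```

One-to-many associations follow the same pattern with a list of embedded children per parent, which is why the slide suggests tackling them after the one-to-one cases.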
  • New Application
      1. Focus on your application
          1. Requests
          2. Responses
          3. Business-domain model objects / data structures
      2. Then persist language object data to MongoDB
          1. Collections
          2. Associations
      3. Refactor for optimization and add indices
  • It’s All About Your Application
      • Your schema is the impedance matcher
          – Design choices: normalize/denormalize, reference/embed
          – Melds programming with MongoDB for best of both
          – Flexible for development and change
      • Programs + Databases = (Big) Data Applications
  • It’s All About Your Application (continued)
      • Programs MongoDB = Great Big Data Applications
      • Play chess with God
  • It’s All About Your Application (continued)
      • Programs MongoDB = Great Big Data Applications
      • Play music with God: AAC
  • Questions? Gary J. Murakami, Ph.D., Lead Engineer / Evangelist, #MongoDB. “His pattern indicates two-dimensional thinking.” (Spock, Star Trek II: The Wrath of Khan) www.3dchessfederation.com
  • Thank you so much to our community who made An Evening with MongoDB Minneapolis possible:
      • David Hussman
      • Josh Kennedy
      • Matthew Chimento
      • Jeffrey Lemmerman
      • Dan Chamberlain
      • Christopher Rueber
      • Erin Newkirk
    Thank you DevJam for hosting our event!