SlideShare a Scribd company logo
1 of 112
Download to read offline
Simplifying
Database Design



         Josh Berkus
         PostgreSQL Experts, Inc.
         O'Reilly OSCON 2009
How Not To Do It
four popular methods
1. One Big Spreadsheet
2. Hashes, EAV & E-Blob
ID    Property   Setting   ID            Properties

407    Eyes      Brown

407    Height     73in     407   <eyes=”brown”><height=”73”>
                                  <married=”1”><smoker=”1”>
407   Married?   TRUE
                           408    <hair=”brown”><age=”49”>
408   Married?   FALSE           <married=”0”><smoker=”0”>
408   Smoker     FALSE
                           409    <age=”37”><height=”66”>
408     Age        37             <hat=”old”><teeth=”gold”>

409    Height     66in
3. Incremental Development
4. Leave It to the ORM
Data Modeling
Entity Resource Diagram
Unified Modeling Language
Wait, which standard?
Simple Bulletin Board
  Database Design
Database As Model
1. Your database is a model of your application
2. Your application is a model of your problem
domain
conclusion: you can simply model the database as
a derivative of your problem domain


corollary: if you don't understand your database,
you don't understand the problem you're solving
Get Together
Your Whole Dev Team
Why the whole team?
●   You need to know the entire problem you're
    modeling through the database.
●   Some developers may be working on specific
    features which need database support which
    the managers forget about.
●   All developers need to understand that the
    database is part of the software development
    and release cycle.
Start with a List
“things” we need to store
      ●   Forums
      ●   Threads
      ●   Posts
      ●   Users
      ●   Administrators
      ●   Messages
Simple Relationships
Categories

                    Messages
 Threads


                      Users
  Posts
                       Admins
Figure out the Attributes
                 of each “thing”
●   name
●   email
●   login
●   password
●   status
                                 Users

                                   Admins
Figure out what kind of data
●   name       text
●   email      text – special
●   login      text
●   password   text
●   status     char
                                Users

                                 Admins
Repeat for all “Things”
●   forums                       ●   messages
       –    name, description,          –   sender, recipients,
             owner, created                   subject, content
●   threads
       –    name, description,
             owner, created
●   posts
       –    created, owner,
              content, flag
OK, Now Get Out!
Interlude



          All
the Relational Theory
  You Need to Know
    in 20 Minutes
E.F. Codd
Database Engineer, IBM 1970
IBM Databases Run Amok
         1.losing data
        2.duplicate data
         3.wrong data
     4.crappy performance

5.downtime for database redesign
    whenever anyone made an
        application change
The Relational Model
Set (Bag) Theory
Relations

    Relation
(table, view, rowset)
Tuples

                    Relation
               (table, view, rowset)
Tuple (row)         Tuple (row)             Tuple (row)




          Tuple (row)         Tuple (row)
Attributes

                                 Relation
                           (table, view, rowset)
    Tuple (row)                  Tuple (row)                 Tuple (row)
Attribute     Attribute     Attribute    Attribute       Attribute    Attribute
       Attribute                  Attribute                    Attribute


                   Tuple (row)                Tuple (row)
               Attribute    Attribute    Attribute      Attribute

                     Attribute                   Attribute
Domains (types)

                                 Relation
                        (table, view, rowset)
 Tuple (row)                     Tuple (row)                     Tuple (row)
INT          TEXT               INT          TEXT               INT          Attribute
                                                                              TEXT
      DATE                            DATE                            Attribute
                                                                       DATE


               Tuple (row)                     Tuple (row)
              INT          Attribute
                            TEXT              INT          Attribute
                                                            TEXT

                    Attribute
                     DATE                           Attribute
                                                     DATE
Keys

                                      Relation
                             (table, view, rowset)
    Tuple (row)                       Tuple (row)                         Tuple (row)
Attribute
  Key         Attribute              Key          Attribute              Key          Attribute
       Attribute                           Attribute                           Attribute


                    Tuple (row)                         Tuple (row)
                   Key          Attribute              Key          Attribute

                         Attribute                           Attribute
Constraints

                                   Relation
                             (table, view, rowset)
      Tuple (row)                  Tuple (row)                 Tuple (row)
  Attribute     Attribute     Attribute    Attribute       Attribute    Attribute
         Attribute                  Attribute                    Attribute


                     Tuple (row)                Tuple (row)
                 Attribute    Attribute    Attribute      Attribute

                       Attribute                   Attribute

Attribute > 5
Foreign Key Constraint
Derived Relation (query)
Atomic Data
Non-Atomic Attributes
●   name (text)
●   email (text)
●   login (text)
●   password (text)
●   status (char)
                             Users

                               Admins
Atomic, Shmomic. Who Cares?
●   Atomic Values:
       –   make joins easier
       –   make constraints easier
●   Non-atomic Values:
       –   increase CPU usage
       –   make you more likely to forget something
What's Atomic?
The simplest form of a datum, which is not
   divisible without loss of information.

                    name
                 Josh Berkus
SELECT SUBSTR(name,STRPOS(name, ' ')) ...

                    Status
                      a

    … WHERE status = 'a' or status = 'u' ...
What's Atomic?
The simplest form of a datum, which is not
   divisible without loss of information.

        first_name     last_name
            Josh         Berkus



          active         access
          TRUE             a
Table Atomized!
●   first_name (text)
●   last_name (text)
●   email (text)
●   login (text)
●   password (text)
●   active (boolean)             Users
●   access (char)
                                     Admins
Where Are My Keys?
●   first_name (text)
●   last_name (text)
●   email (text)
●   login (text)
●   password (text)
●   active (boolean)        Users
●   access (char)
                              Admins
Candidate (Natural) Keys
●   first_name (text)
●   last_name (text)
●   email (text)
●   login (text)        Key
●   password (text)
●   active (boolean)          Users
●   access (char)
                               Admins
A Good Key
●   Should have to be unique because the
    application requires it to be.
●   Expresses a unique predicate which describes
    the tuple (row):
        –   user with login “jberkus”
        –   post from “jberkus” on “2009-05-02 13:41:22” in
             thread “Making your own wine”
●   If you can't find a good key, your table design is
    missing data.
Surrogate Key
●   first_name (text)         ●   user (serial)
●   last_name (text)
●   email (text)
●   login (text)        Key
●   password (text)
●   active (boolean)                      Users
●   access (char)
                                            Admins
When shouldn't I
                use surrogate keys?
●   As a substitute for real keys
        –   not ever
●   If the real key works for the application
        –   it's a single column
        –   it's small
●   For Join Tables (more later)
●   If they are not going to be used
        –   leaf tables
When should I
                use surrogate keys?
●   If the real key is complex or really large
        –   4 columns
        –   large text field
        –   time range
●   If your application framework requires them
        –   but probably better to get a better framework
●   If you're doing data warehousing
        –   where the bytes count
But wait, aren't ID fields “faster”?

               No.

    While INTs are smaller,
     joins are expensive.

   Test twice, design once.
users: no surrogate key
create table users (
   first_name text not null check
      ( length(first_name) between 1 and 40 ),
   last_name    text not null check
      ( length(last_name) between 2 and 30 ),
   login     text not null unique check
      ( length(login) between 4 and 30 ),
   password     text not null check
      ( length(login) between 6 and 30 ),
   email        email not null unique,
   description text,
   icon      text,
   level     integer not null default 1
            references access_levels (level)
            on update cascade on delete set default,
   active       boolean not null default TRUE
);
posts: surrogate keys
create table posts (
   post SERIAL not null unique,
   thread integer not null references threads(thread)
      on delete cascade on update cascade,
   created     timestamp with time zone
               not null default current_timestamp,
   owner    text not null
         references users (login) on update cascade
         on delete cascade,
   content     text not null,
   flag     char(1) references flags(flag)
         on update cascade on delete set null
   constraint posts_key unique (thread, created, owner)
);
Constraints
                          for clean data
●   Are there to prevent “bad data”.
        –   allow you to rely on specific assertions being
              true
        –   prevent garbage rows
        –   deter application errors
                ●   and stupid display problems
Is VARCHAR(#) a Constraint?
●   No, not really
        –   if you need an upper limit, you probably need a
               lower limit
●   but … data types are primitive constraints
        –   just not constraining enough to prevent bad
              input
Defaults
                     for convenience
●   Allow you to forget about some columns
        –   help support “NOT NULL” constraints
●   Let you set values for “invisible” columns
        –   like auditing information
●   Let you set things “automatically”
        –   like created on current_timestamp
But my Application Code
       Takes Care of Data Format!
●   Maybe
       –   you probably don't want to make column
             constraints too restrictive
       –   allow some room for cosmetic changes
               ●   and non-essential data
●   Maybe Not
       –   applications have bugs
       –   everything has a RESTful interface now
       –   NULLs can behave very oddly in queries
No Constraints

first_name   last_name        email           login     password active      level
   Josh        Berkus    josh@pgexperts.com   jberkus   jehosaphat   TRUE      a
   NULL        NULL           kelley@ucb         k         NULL      FALSE     u
   Mike         Hunt       www.pornking.com   c34521      c34521     TRUE      I
    S            F           gavin@sf.gov      gavin       twitter   NULL      x
Constraints and Defaults
●   first_name text
        –   not null check (length between 1 and 40)
●   last_name text
        –   not null check (length between 2 and 40)
●   email text not null ???
●   login text
        –   not null unique check (length between 4 and 40)
●   password text
        –   not null unique check (length between 6 and 30)
Constraints and Defaults
●   active boolean
        –   not null default TRUE
●   access char(1)
        –   not null check in('a','u') default 'u'
●   user_id serial
        –   not null unique
Gee, that was easy!
        is that all there is?




Well, no. It gets more complicated.
      See you after the break.
We All Just Want to Be Normal
Abby Normal
 login              level        last_name
jberkus              u               Berkus
selena               a          Deckelman



           login             title            posted   level
          jberkus           Dinner?           09:28     u
          selena            Dinner?           09:37     u
          jberkus           Dinner?           09:44     a
Abby Normal
 login              level        last_name
jberkus              u               Berkus
selena               a          Deckelman



           login             title            posted   level
          jberkus           Dinner?           09:28     u
          selena            Dinner?           09:37     u
          jberkus           Dinner?           09:44     a
How can I be “Normal”?
1. Each piece of data only appears in one relation
        –     except as a “foreign key” attribute
●   No “repeated” attributes
      login           level      last_name
     jberkus           u          Berkus
     selena            a         Deckelman

                        login         title    posted
                       jberkus      Dinner?     09:28
                       selena       Dinner?     09:37
                       jberkus      Dinner?     09:44
But What's Really “Non-Repeated”?


  login      level              privileges
 jberkus      u             read,post,search
  selena      a      read,post,search,edit,delete,ban
webcrawler     r                  read
  mike        u             read,post,search
  carol       u             read,post,search


             obviously repeated
But What's Really “Non-Repeated”?
  login             level
 jberkus             u
  selena             a
webcrawler            r
  mike               u
  carol
  level      read    u post     search   edit   delete   ban
    u         t             t      t      f       f       f
    a         t             t      t      t       t       t
    r         t             f      f      f       f       f

                            non-repeated
But What's Really “Non-Repeated”?
  login      level   level       name
 jberkus      u       u           user
  selena      a       a       administrator
webcrawler     r       r       read-only
  mike        u       level      privilege
  carol       u         u          read
                        u          post
                        u         search
                        r          read
                        a         delete
non-repeated            a           ban
                        a          post
How do you decide between
          one/several tables?
●   Simple Rule: “one thought, one table”
       –   like “one thought, one paragraph”
●   You probably need more tables if:
       –   there's no unique key
       –   there's more than one unique key
●   You may need less tables if:
       –   you're doing lots of one-to-one joins
How do you decide between
          one/several tables?
●   Otherwise, it's based on the Application
●   how does the application use the data?
        –   does it want an array?
        –   use a flat series of columns
●   does it want a single fact or check?
        –   do you expect to add new types a lot?
        –   use a vertical child table
But wait, doesn't Normalization
have something to do with ID fields
 and everything in a lookup table?



               No.
Special Case #1:
        many-to-many relationships
      access_levels                 privileges
level         name                  privilege
 u             user                    read
 a         administrator             search
  r         read-only                  post
          access_level_privileges    delete
             level    privilege         ban
               r        read
               u        read
               u       search
               a       search
               a       delete
Join Tables
●   Contain only the keys of two or more other
    tables
●   Should have a single unique index across all
    keys
●   Should have Foreign Keys to all the other
    tables with CASCADE
Special Case #2:
             Lookup Tables
                for constraints
                                  privileges
                                  privilege
access_level_privileges             read
   level    privilege              search
     r        read                  post
     u        read                 delete
     u       search                  ban
     a       search
     a       delete
Special Case #2:
               Lookup Tables
                dimension tables

first_name   last_name   city   city      name      state
    Josh       Berkus     2      2         San       CA
   David        Fetter    3             Francisco
   Selena    Deckelman   17      3      Oakland      CA
    Miho      Ishakura    2     17      Portland     OR
   David        Gould     3     42     Washington DC
   Robert       Treat    42     91     Philadelphia PA
   Bruce      Momjian    91
When do I use Dimension Tables?
●   When there's multiple facts/levels to the
    dimension
        –   locations
        –   demography
●   When you need to save space
        –   really, really big tables (millions of rows)
●   Do not use them “just because”.
        –   dimension tables are not normalization
Special Case #3:
                     Tree Structures
●   Developers want posts to “nest”
        –   posts should form a tree, one under the other



●
    “Palio Restaurant” July 19th
        –   “Re: Palio Restaurant”   July 21st
                ● “Re: Re: Palio Restaurant” July 23rd
                ● “Re: Re: Palio Restaurant” July 24th


        –   “Re: Palio Restaurant” July 23rd
Tree Structures:
                    Proximity Tree
●   Each item has a link to its parent item
        –   post 34 | parent_post 21
●   Advantages
        –   most common
        –   fast to update
●   Disadvantages
        –   slow to query
        –   requires WITH RECURSIVE or
              CONNECT_BY()
Tree Structures:
                         Path Fields
●   Each item has a full “path” of its parentage
        –   post 34 | path 7,21,26
●   Advantages
        –   fast to sort
        –   fast to query & search
●   Disadvantages
        –   slow to update
        –   requires non-standard SQL extensions
                 ●   or text parsing
posts Table
create table posts (
   post SERIAL not null unique,
   thread integer not null references threads(thread)
               on delete cascade on update cascade,
   parent_post    integer references posts(post)
         on delete cascade on update cascade,
   created     timestamp with time zone
               not null default current_timestamp,
   owner    text not null references users (login)
            on update cascade on delete cascade,
   content     text not null,
   flag     char(1) references flags(flag)
            on update cascade on delete set null
   constraint posts_key unique (thread, created, owner)
);
Special Case #4:
                  Extensible Data
●   Developers want admins to be able to create
    “flexible profiles”
       –   series of items
       –   undefined at installation time


●   Josh Berkus
       –   male
       –   bearded
       –   wears glasses
Extensible Data:
      Entity-Attribute-Value
ID     Property   Setting

407     Eyes      Brown

407     Height     73in

407    Married?   TRUE      property   format
                            Eyes       text
408    Married?   FALSE
                            Height     number
408    Smoker     FALSE     Married?   boolean
                            Age        number
408      Age        37
                            Smoker     boolean
409     Height     66in
EAVil
●   Space-consumptive
        –   many many rows, lots of row overhead
●   Enforcing constraints by procedural code
        –   very CPU-intensive
●   Can't make anything “required”
●   Can't index effectively
●   Many-Way Joins
        –   selecting combinations performs horribly
●   however, you can cascade-drop
EAVil
●   All unmarried men with red hair under 30
SELECT first_name, last_name
FROM users
   JOIN user_profiles married USING (login)
   JOIN user_profiles men USING (login)
   JOIN user_profiles hair USING (login)
   JOIN user_profiles age USING (login)
WHERE married.property = 'Married?'
        and married.value::BOOLEAN = FALSE
   AND men.property = 'Gender' and men.value = 'm'
   AND hair.property = 'Hair' and hair.value = 'Red'
   AND age.property = 'Age' and age.value::INT < 30
E-Blob
ID            Properties



407   <eyes=”brown”><height=”73”>
       <married=”1”><smoker=”1”>

408    <hair=”brown”><age=”49”>
      <married=”0”><smoker=”0”>

409    <age=”37”><height=”66”>
       <hat=”old”><teeth=”gold”>
E-Blobby
●   Slow to update, slow to search
       –   need to use application code or lots of parsing
●   Requires special database extensions
       –   XML, hstore, etc.
●   Advantages over EAV
       –   smaller storage space (with compression)
       –   no horrible joins
       –   combinations easier
       –   feeds directly into application code
How to Decide:
                EAVil vs. ThE-Blob


●   Will you be searching for specific items?
        –   EAVil
●   Will you be just spitting out all data to the
    application?
        –   E-Blob
●   Do you have special DB extenstions?
        –   E-Blob
When Not to use EAV & E-Blob
●   As the foundation for all of your data
        –   non-relational databases do this better
●   For data which has important checks and
    constraints
        –   or is required
●   For data which needs to be searched fast
●   As a way of modifying your application
        –   alter the database!
E-blob: The users Table
create table users (
   first_name text not null
      check ( length(first_name) between 1 and 40 ),
   last_name    text not null
      check ( length(last_name) between 2 and 30 ),
   login     text not null unique
      check ( length(login) between 4 and 30 ),
   password     text not null
      check ( length(login) between 6 and 30 ),
   email        email not null unique,
   description text,
   icon      text,
   level     integer not null default 1
         references access_levels (level)
         on update cascade on delete set default,
   active       boolean not null default TRUE,
   profile      xml
);
Managing Change
Making a DB Schema is a Process
                    not an end result
●   Waterfall is Dead
       –   don't make the schema static and the
            application dynamic
       –   if you use Aglie/TDD/etc. for app, use it for DB
       –   Plan to Iterate
Software Development Cycle (TDD)
 create                        write tests
 specification


                                     develop
get user                             software
feedback


                 deploy
                 new version
Database Development Cycle (TDD)
 write data              write tests
 requirements


                               develop
get                            new
developer                      schema
feedback

            deploy
            new schema
But wait, how do I manage change
 without breaking the application?

●   The same as for software development
       1)Testing
       2)Migrations
       3)Backwards-compatible APIs
Testing
●   Unit tests for database objects
        –   especially stored procedures
●   Application tests for application queries
        –   need to be able to run all application queries
             and test for breakage
●   Performance Regression tests
        –   make sure you're not breaking performance
Migrations
●   For each schema change, write a SQL
    migration
        –   use transactional DDL (if available)
●   Sequence these updates
        –   tie them to application updates
●   Watch out for irreversability
        –   unlike application migrations, database
             reversions may destroy data
Backwards-Compatible API




          Views
Views: Messages Table
●   messages are sent from one user to one user
    create table messages (
       message SERIAL not null unique,
       sender text not null references users(login)
          on delete cascade on update cascade,
       recipient text not null references users(login)
          on delete cascade on update cascade,
       sent timestamp with time zone
          not null default current_timestamp,
       subject text not null
          check (length(subject) between 3 and 200 ),
       content text not null
    );
●   developers want multiple recipients
        –   but, they don't want to refactor all code
1. Create message_recipients


create table message_recipients (
   message int not null references
      messages(message)
      on delete cascade on update cascade,
   recipient text not null references users(login)
      on delete cascade on update cascade,
   constraint message_recipients_key
      unique ( message, recipient )
);
2. Rename and modify messages

INSERT INTO message_recipients
SELECT message, recipient FROM messages;
ALTER TABLE messages
NAME to message_contents;
ALTER TABLE message_contents
DROP COLUMN recipient;
3. Create VIEW for backwards
           compatibility
CREATE VIEW messages AS
SELECT message, sender,
  array_agg(recipient),
  sent, subject, content
FROM message_contents JOIN
  message_recipients
  USING ( message )
GROUP BY message, sender,
  sent, subject, content;
Some Good Practices
“practice doesn't make perfect,
perfect practice makes perfect.”
Consistent, Clear Naming
●   Pick a Style, and Stick To It
        –   plural tables or singular?
        –   camel case or underscore?
        –   have a “stylebook” for all developers
●   Name objects what they are
        –   don't abbreviate
        –   don't use “cute” or “temporary” names
●   If the object changes, change its name
Comment Your DB
●   Use COMMENT ON … IS
       –   describe each object
       –   if you have time, each column
       –   keep comments up to date
       –   just like you would with application code


comment on table privileges is 'a list of application
privileges which can be assigned to various privilege
levels.';
Use Source Code Management
●   DDL (data definition language) is Text
        –   check it into Git/SVN/Mercurial/Bazaar
        –   version it
Some Bad Practices
Premature Optimization
●   Don't do anything “for performance” which
    compromises the logical model of your data
        –   unless you've tested it thoroughly first
●   Poor optimization limits your throughput
        –   but you can always buy more hardware
●   Poor design can result in days of downtime
        –   besides, database engines are designed to
             optimize for good design
“Downtime is more costly
  than slow throughput”
Premature Optimization:
          Five Warning Signs
1)Are you choosing data types according to
  which is “faster”?
2)Do you find yourself counting bytes?
3)Did you disable foreign keys and constraints
  because they were “too slow”?
4)Have you “denormalized” or “flattened” tables
  to make them “faster”?
5)Do you find yourself trying to override the query
  planner?
Polymorphic Fields
●   Fields which mean different things depending
    on the value of another field
       result      link_type               link
        309           URL       http://www.postgresql.org
       4718           port                 443
       5223           OS                    88
       9001       application             1915
Magic Numbers

      ID = 0
   2009-02-30
   2000-01-01
-1, 1,2,3,4,5 100
Summary
1)The database is a simplified model of the
  problem you're solving
2)It can be designed simply by working with the
  development team on creating lists
3)Relational Theory is simple and has only a few
  rules.
4)Normalization simply means removing
  duplication
Summary
5)Designing a Table in 5 simple steps:
     1)list your attributes
     2)make them atomic
     3)choose data types
     4)choose keys
     5)add constraints and defaults
Summary
6) For any given set of data, there are several
  possible structures: pick the one the application
  likes.
7) Dimension tables aren't for everyone.
8) Four Special Cases require Special SQL:
      1) Many-to-Many Join Tables
      2) Lookup tables and Dimension Tables
      3) Tree Structured Data
      4) Extensible Data
Summary
9) Managing Changes
    1) Testing
    2) Migrations
    3) Views & Procedures as Compatible API
10) Follow Good Practices
11) Avoid Bad Practices
More Information
●   me
         –    josh@pgexperts.com
         –    www.pgexperts.com
         –    it.toolbox.com/blogs/database-soup
●   postgresql
         –    www.postgresql.org
●   at OSCON
         –    PostgreSQL booth
         –    State of Lightning Talks (Thursday 1:45)
             This presentation copyright 2009 Josh Berkus, licensed for distribution under the
              Creative Commons Attribution License, except for photos, most of which were
              stolen from other people's websites via images.google.com. Thanks, Google!

More Related Content

Viewers also liked

Viewers also liked (6)

Main Task
Main TaskMain Task
Main Task
 
Writeable CTEs: The Next Big Thing
Writeable CTEs: The Next Big ThingWriteable CTEs: The Next Big Thing
Writeable CTEs: The Next Big Thing
 
New pitch
New pitchNew pitch
New pitch
 
Affective Interaction Design
Affective Interaction DesignAffective Interaction Design
Affective Interaction Design
 
Safety LAMP: data security & agile languages
Safety LAMP: data security & agile languagesSafety LAMP: data security & agile languages
Safety LAMP: data security & agile languages
 
The Accidental DBA
The Accidental DBAThe Accidental DBA
The Accidental DBA
 

Similar to Simplifying Database Development (OSCON 2009)

Erlang/OTP for Rubyists
Erlang/OTP for RubyistsErlang/OTP for Rubyists
Erlang/OTP for RubyistsSean Cribbs
 
Xsl Tand X Path Quick Reference
Xsl Tand X Path Quick ReferenceXsl Tand X Path Quick Reference
Xsl Tand X Path Quick ReferenceLiquidHub
 
XSLT 1 and XPath Quick Reference (from mulberrytech.com)
XSLT 1 and XPath Quick Reference (from mulberrytech.com)XSLT 1 and XPath Quick Reference (from mulberrytech.com)
XSLT 1 and XPath Quick Reference (from mulberrytech.com)FrescatiStory
 
楽々Scalaプログラミング
楽々Scalaプログラミング楽々Scalaプログラミング
楽々ScalaプログラミングTomoharu ASAMI
 
Generics Past, Present and Future (Latest)
Generics Past, Present and Future (Latest)Generics Past, Present and Future (Latest)
Generics Past, Present and Future (Latest)RichardWarburton
 
Xbase - Implementing Domain-Specific Languages for Java
Xbase - Implementing Domain-Specific Languages for JavaXbase - Implementing Domain-Specific Languages for Java
Xbase - Implementing Domain-Specific Languages for Javameysholdt
 
Learning notes of r for python programmer (Temp1)
Learning notes of r for python programmer (Temp1)Learning notes of r for python programmer (Temp1)
Learning notes of r for python programmer (Temp1)Chia-Chi Chang
 
stream processing engine
stream processing enginestream processing engine
stream processing enginetiana528
 
Scaling search to a million pages with Solr, Python, and Django
Scaling search to a million pages with Solr, Python, and DjangoScaling search to a million pages with Solr, Python, and Django
Scaling search to a million pages with Solr, Python, and Djangotow21
 
R for Pirates. ESCCONF October 27, 2011
R for Pirates. ESCCONF October 27, 2011R for Pirates. ESCCONF October 27, 2011
R for Pirates. ESCCONF October 27, 2011Mandi Walls
 
Model-Driven Software Development - Language Workbenches & Syntax Definition
Model-Driven Software Development - Language Workbenches & Syntax DefinitionModel-Driven Software Development - Language Workbenches & Syntax Definition
Model-Driven Software Development - Language Workbenches & Syntax DefinitionEelco Visser
 

Similar to Simplifying Database Development (OSCON 2009) (20)

Os Raysmith
Os RaysmithOs Raysmith
Os Raysmith
 
Erlang/OTP for Rubyists
Erlang/OTP for RubyistsErlang/OTP for Rubyists
Erlang/OTP for Rubyists
 
Xsl Tand X Path Quick Reference
Xsl Tand X Path Quick ReferenceXsl Tand X Path Quick Reference
Xsl Tand X Path Quick Reference
 
XSLT 1 and XPath Quick Reference (from mulberrytech.com)
XSLT 1 and XPath Quick Reference (from mulberrytech.com)XSLT 1 and XPath Quick Reference (from mulberrytech.com)
XSLT 1 and XPath Quick Reference (from mulberrytech.com)
 
楽々Scalaプログラミング
楽々Scalaプログラミング楽々Scalaプログラミング
楽々Scalaプログラミング
 
Generics Past, Present and Future (Latest)
Generics Past, Present and Future (Latest)Generics Past, Present and Future (Latest)
Generics Past, Present and Future (Latest)
 
Dynamic Python
Dynamic PythonDynamic Python
Dynamic Python
 
Xbase - Implementing Domain-Specific Languages for Java
Xbase - Implementing Domain-Specific Languages for JavaXbase - Implementing Domain-Specific Languages for Java
Xbase - Implementing Domain-Specific Languages for Java
 
Learning notes of r for python programmer (Temp1)
Learning notes of r for python programmer (Temp1)Learning notes of r for python programmer (Temp1)
Learning notes of r for python programmer (Temp1)
 
On metadata for Open Data
On metadata for Open DataOn metadata for Open Data
On metadata for Open Data
 
stream processing engine
stream processing enginestream processing engine
stream processing engine
 
Database
DatabaseDatabase
Database
 
RaleighFS v5
RaleighFS v5RaleighFS v5
RaleighFS v5
 
tutorial5
tutorial5tutorial5
tutorial5
 
tutorial5
tutorial5tutorial5
tutorial5
 
Sql alchemy bpstyle_4
Sql alchemy bpstyle_4Sql alchemy bpstyle_4
Sql alchemy bpstyle_4
 
Scaling search to a million pages with Solr, Python, and Django
Scaling search to a million pages with Solr, Python, and DjangoScaling search to a million pages with Solr, Python, and Django
Scaling search to a million pages with Solr, Python, and Django
 
R for Pirates. ESCCONF October 27, 2011
R for Pirates. ESCCONF October 27, 2011R for Pirates. ESCCONF October 27, 2011
R for Pirates. ESCCONF October 27, 2011
 
Model-Driven Software Development - Language Workbenches & Syntax Definition
Model-Driven Software Development - Language Workbenches & Syntax DefinitionModel-Driven Software Development - Language Workbenches & Syntax Definition
Model-Driven Software Development - Language Workbenches & Syntax Definition
 
Pune Clojure Course Outline
Pune Clojure Course OutlinePune Clojure Course Outline
Pune Clojure Course Outline
 

More from PostgreSQL Experts, Inc.

PostgreSQL Replication in 10 Minutes - SCALE
PostgreSQL Replication in 10  Minutes - SCALEPostgreSQL Replication in 10  Minutes - SCALE
PostgreSQL Replication in 10 Minutes - SCALEPostgreSQL Experts, Inc.
 
Elephant Roads: PostgreSQL Patches and Variants
Elephant Roads: PostgreSQL Patches and VariantsElephant Roads: PostgreSQL Patches and Variants
Elephant Roads: PostgreSQL Patches and VariantsPostgreSQL Experts, Inc.
 

More from PostgreSQL Experts, Inc. (20)

Shootout at the PAAS Corral
Shootout at the PAAS CorralShootout at the PAAS Corral
Shootout at the PAAS Corral
 
Shootout at the AWS Corral
Shootout at the AWS CorralShootout at the AWS Corral
Shootout at the AWS Corral
 
Fail over fail_back
Fail over fail_backFail over fail_back
Fail over fail_back
 
PostgreSQL Replication in 10 Minutes - SCALE
PostgreSQL Replication in 10  Minutes - SCALEPostgreSQL Replication in 10  Minutes - SCALE
PostgreSQL Replication in 10 Minutes - SCALE
 
HowTo DR
HowTo DRHowTo DR
HowTo DR
 
Give A Great Tech Talk 2013
Give A Great Tech Talk 2013Give A Great Tech Talk 2013
Give A Great Tech Talk 2013
 
Pg py-and-squid-pypgday
Pg py-and-squid-pypgdayPg py-and-squid-pypgday
Pg py-and-squid-pypgday
 
92 grand prix_2013
92 grand prix_201392 grand prix_2013
92 grand prix_2013
 
Five steps perform_2013
Five steps perform_2013Five steps perform_2013
Five steps perform_2013
 
7 Ways To Crash Postgres
7 Ways To Crash Postgres7 Ways To Crash Postgres
7 Ways To Crash Postgres
 
PWNage: Producing a newsletter with Perl
PWNage: Producing a newsletter with PerlPWNage: Producing a newsletter with Perl
PWNage: Producing a newsletter with Perl
 
10 Ways to Destroy Your Community
10 Ways to Destroy Your Community10 Ways to Destroy Your Community
10 Ways to Destroy Your Community
 
Open Source Press Relations
Open Source Press RelationsOpen Source Press Relations
Open Source Press Relations
 
5 (more) Ways To Destroy Your Community
5 (more) Ways To Destroy Your Community5 (more) Ways To Destroy Your Community
5 (more) Ways To Destroy Your Community
 
Preventing Community (from Linux Collab)
Preventing Community (from Linux Collab)Preventing Community (from Linux Collab)
Preventing Community (from Linux Collab)
 
Development of 8.3 In India
Development of 8.3 In IndiaDevelopment of 8.3 In India
Development of 8.3 In India
 
PostgreSQL and MySQL
PostgreSQL and MySQLPostgreSQL and MySQL
PostgreSQL and MySQL
 
50 Ways To Love Your Project
50 Ways To Love Your Project50 Ways To Love Your Project
50 Ways To Love Your Project
 
8.4 Upcoming Features
8.4 Upcoming Features 8.4 Upcoming Features
8.4 Upcoming Features
 
Elephant Roads: PostgreSQL Patches and Variants
Elephant Roads: PostgreSQL Patches and VariantsElephant Roads: PostgreSQL Patches and Variants
Elephant Roads: PostgreSQL Patches and Variants
 

Recently uploaded

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 

Recently uploaded (20)

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 

Simplifying Database Development (OSCON 2009)

  • 1. Simplifying Database Design Josh Berkus PostgreSQL Experts, Inc. O'Reilly OSCON 2009
  • 2. How Not To Do It four popular methods
  • 3. 1. One Big Spreadsheet
  • 4. 2. Hashes, EAV & E-Blob ID Property Setting ID Properties 407 Eyes Brown 407 Height 73in 407 <eyes=”brown”><height=”73”> <married=”1”><smoker=”1”> 407 Married? TRUE 408 <hair=”brown”><age=”49”> 408 Married? FALSE <married=”0”><smoker=”0”> 408 Smoker FALSE 409 <age=”37”><height=”66”> 408 Age 37 <hat=”old”><teeth=”gold”> 409 Height 66in
  • 6. 4. Leave It to the ORM
  • 11.
  • 12. Simple Bulletin Board Database Design
  • 13. Database As Model 1. Your database is a model of your application 2. Your application is a model of your problem domain conclusion: you can simply model the database as a derivative of your problem domain corollary: if you don't understand your database, you don't understand the problem you're solving
  • 15. Why the whole team? ● You need to know the entire problem you're modeling through the database. ● Some developers may be working on specific features which need database support which the managers forget about. ● All developers need to understand that the database is part of the software development and release cycle.
  • 16. Start with a List “things” we need to store ● Forums ● Threads ● Posts ● Users ● Administrators ● Messages
  • 17. Simple Relationships Categories Messages Threads Users Posts Admins
  • 18. Figure out the Attributes of each “thing” ● name ● email ● login ● password ● status Users Admins
  • 19. Figure out what kind of data ● name text ● email text – special ● login text ● password text ● status char Users Admins
  • 20. Repeat for all “Things” ● forums ● messages – name, description, – sender, recipients, owner, created subject, content ● threads – name, description, owner, created ● posts – created, owner, content, flag
  • 21. OK, Now Get Out!
  • 22. Interlude All the Relational Theory You Need to Know in 20 Minutes
  • 24. IBM Databases Run Amok 1.losing data 2.duplicate data 3.wrong data 4.crappy performance 5.downtime for database redesign whenever anyone made an application change
  • 27. Relations Relation (table, view, rowset)
  • 28. Tuples Relation (table, view, rowset) Tuple (row) Tuple (row) Tuple (row) Tuple (row) Tuple (row)
  • 29. Attributes Relation (table, view, rowset) Tuple (row) Tuple (row) Tuple (row) Attribute Attribute Attribute Attribute Attribute Attribute Attribute Attribute Attribute Tuple (row) Tuple (row) Attribute Attribute Attribute Attribute Attribute Attribute
  • 30. Domains (types) Relation (table, view, rowset) Tuple (row) Tuple (row) Tuple (row) INT TEXT INT TEXT INT Attribute TEXT DATE DATE Attribute DATE Tuple (row) Tuple (row) INT Attribute TEXT INT Attribute TEXT Attribute DATE Attribute DATE
  • 31. Keys Relation (table, view, rowset) Tuple (row) Tuple (row) Tuple (row) Attribute Key Attribute Key Attribute Key Attribute Attribute Attribute Attribute Tuple (row) Tuple (row) Key Attribute Key Attribute Attribute Attribute
  • 32. Constraints Relation (table, view, rowset) Tuple (row) Tuple (row) Tuple (row) Attribute Attribute Attribute Attribute Attribute Attribute Attribute Attribute Attribute Tuple (row) Tuple (row) Attribute Attribute Attribute Attribute Attribute Attribute Attribute > 5
  • 36. Non-Atomic Attributes ● name (text) ● email (text) ● login (text) ● password (text) ● status (char) Users Admins
  • 37. Atomic, Shmomic. Who Cares? ● Atomic Values: – make joins easier – make constraints easier ● Non-atomic Values: – increase CPU usage – make you more likely to forget something
  • 38. What's Atomic? The simplest form of a datum, which is not divisible without loss of information. name Josh Berkus SELECT SUBSTR(name,STRPOS(name, ' ')) ... Status a … WHERE status = 'a' or status = 'u' ...
  • 39. What's Atomic? The simplest form of a datum, which is not divisible without loss of information. first_name last_name Josh Berkus active access TRUE a
  • 40. Table Atomized! ● first_name (text) ● last_name (text) ● email (text) ● login (text) ● password (text) ● active (boolean) Users ● access (char) Admins
  • 41. Where Are My Keys? ● first_name (text) ● last_name (text) ● email (text) ● login (text) ● password (text) ● active (boolean) Users ● access (char) Admins
  • 42. Candidate (Natural) Keys ● first_name (text) ● last_name (text) ● email (text) ● login (text) Key ● password (text) ● active (boolean) Users ● access (char) Admins
  • 43. A Good Key ● Should have to be unique because the application requires it to be. ● Expresses a unique predicate which describes the tuple (row): – user with login “jberkus” – post from “jberkus” on “2009-05-02 13:41:22” in thread “Making your own wine” ● If you can't find a good key, your table design is missing data.
  • 44. Surrogate Key ● first_name (text) ● user (serial) ● last_name (text) ● email (text) ● login (text) Key ● password (text) ● active (boolean) Users ● access (char) Admins
  • 45. When shouldn't I use surrogate keys? ● As a substitute for real keys – not ever ● If the real key works for the application – it's a single column – it's small ● For Join Tables (more later) ● If they are not going to be used – leaf tables
  • 46. When should I use surrogate keys? ● If the real key is complex or really large – 4 columns – large text field – time range ● If your application framework requires them – but probably better to get a better framework ● If you're doing data warehousing – where the bytes count
  • 47. But wait, aren't ID fields “faster”? No. While INTs are smaller, joins are expensive. Test twice, design once.
  • 48. users: no surrogate key create table users ( first_name text not null check ( length(first_name) between 1 and 40 ), last_name text not null check ( length(last_name) between 2 and 30 ), login text not null unique check ( length(login) between 4 and 30 ), password text not null check ( length(login) between 6 and 30 ), email email not null unique, description text, icon text, level integer not null default 1 references access_levels (level) on update cascade on delete set default, active boolean not null default TRUE );
  • 49. posts: surrogate keys create table posts ( post SERIAL not null unique, thread integer not null references threads(thread) on delete cascade on update cascade, created timestamp with time zone not null default current_timestamp, owner text not null references users (login) on update cascade on delete cascade, content text not null, flag char(1) references flags(flag) on update cascade on delete set null constraint posts_key unique (thread, created, owner) );
  • 50. Constraints for clean data ● Are there to prevent “bad data”. – allow you to rely on specific assertions being true – prevent garbage rows – deter application errors ● and stupid display problems
  • 51. Is VARCHAR(#) a Constraint? ● No, not really – if you need an upper limit, you probably need a lower limit ● but … data types are primitive constraints – just not constraining enough to prevent bad input
  • 52. Defaults for convenience ● Allow you to forget about some columns – help support “NOT NULL” constraints ● Let you set values for “invisible” columns – like auditing information ● Let you set things “automatically” – like created on current_timestamp
  • 53. But my Application Code Takes Care of Data Format! ● Maybe – you probably don't want to make column constraints too restrictive – allow some room for cosmetic changes ● and non-essential data ● Maybe Not – applications have bugs – everything has a RESTful interface now – NULLs can behave very oddly in queries
  • 54. No Constraints first_name last_name email login password active level Josh Berkus josh@pgexperts.com jberkus jehosaphat TRUE a NULL NULL kelley@ucb k NULL FALSE u Mike Hunt www.pornking.com c34521 c34521 TRUE I S F gavin@sf.gov gavin twitter NULL x
  • 55. Constraints and Defaults ● first_name text – not null check (length between 1 and 40) ● last_name text – not null check (length between 2 and 40) ● email text not null ??? ● login text – not null unique check (length between 4 and 40) ● password text – not null unique check (length between 6 and 30)
  • 56. Constraints and Defaults ● active boolean – not null default TRUE ● access char(1) – not null check in('a','u') default 'u' ● user_id serial – not null unique
  • 57. Gee, that was easy! is that all there is? Well, no. It gets more complicated. See you after the break.
  • 58. We All Just Want to Be Normal
  • 59. Abby Normal login level last_name jberkus u Berkus selena a Deckelman login title posted level jberkus Dinner? 09:28 u selena Dinner? 09:37 u jberkus Dinner? 09:44 a
  • 60. Abby Normal login level last_name jberkus u Berkus selena a Deckelman login title posted level jberkus Dinner? 09:28 u selena Dinner? 09:37 u jberkus Dinner? 09:44 a
  • 61. How can I be “Normal”? 1. Each piece of data only appears in one relation – except as a “foreign key” attribute ● No “repeated” attributes login level last_name jberkus u Berkus selena a Deckelman login title posted jberkus Dinner? 09:28 selena Dinner? 09:37 jberkus Dinner? 09:44
  • 62. But What's Really “Non-Repeated”? login level privileges jberkus u read,post,search selena a read,post,search,edit,delete,ban webcrawler r read mike u read,post,search carol u read,post,search obviously repeated
  • 63. But What's Really “Non-Repeated”? login level jberkus u selena a webcrawler r mike u carol level read u post search edit delete ban u t t t f f f a t t t t t t r t f f f f f non-repeated
  • 64. But What's Really “Non-Repeated”? login level level name jberkus u u user selena a a administrator webcrawler r r read-only mike u level privilege carol u u read u post u search r read a delete non-repeated a ban a post
  • 65. How do you decide between one/several tables? ● Simple Rule: “one thought, one table” – like “one thought, one paragraph” ● You probably need more tables if: – there's no unique key – there's more than one unique key ● You may need less tables if: – you're doing lots of one-to-one joins
  • 66. How do you decide between one/several tables? ● Otherwise, it's based on the Application ● how does the application use the data? – does it want an array? – use a flat series of columns ● does it want a single fact or check? – do you expect to add new types a lot? – use a vertical child table
  • 67. But wait, doesn't Normalization have something to do with ID fields and everything in a lookup table? No.
  • 68. Special Case #1: many-to-many relationships access_levels privileges level name privilege u user read a administrator search r read-only post access_level_privileges delete level privilege ban r read u read u search a search a delete
  • 69. Join Tables ● Contain only the keys of two or more other tables ● Should have a single unique index across all keys ● Should have Foreign Keys to all the other tables with CASCADE
  • 70. Special Case #2: Lookup Tables for constraints privileges privilege access_level_privileges read level privilege search r read post u read delete u search ban a search a delete
  • 71. Special Case #2: Lookup Tables dimension tables first_name last_name city city name state Josh Berkus 2 2 San CA David Fetter 3 Francisco Selena Deckelman 17 3 Oakland CA Miho Ishakura 2 17 Portland OR David Gould 3 42 Washington DC Robert Treat 42 91 Philadelphia PA Bruce Momjian 91
  • 72. When do I use Dimension Tables? ● When there's multiple facts/levels to the dimension – locations – demography ● When you need to save space – really, really big tables (millions of rows) ● Do not use them “just because”. – dimension tables are not normalization
  • 73. Special Case #3: Tree Structures ● Developers want posts to “nest” – posts should form a tree, one under the other ● “Palio Restaurant” July 19th – “Re: Palio Restaurant” July 21st ● “Re: Re: Palio Restaurant” July 23rd ● “Re: Re: Palio Restaurant” July 24th – “Re: Palio Restaurant” July 23rd
  • 74. Tree Structures: Proximity Tree ● Each item has a link to its parent item – post 34 | parent_post 21 ● Advantages – most common – fast to update ● Disadvantages – slow to query – requires WITH RECURSIVE or CONNECT_BY()
  • 75. Tree Structures: Path Fields ● Each item has a full “path” of its parentage – post 34 | path 7,21,26 ● Advantages – fast to sort – fast to query & search ● Disadvantages – slow to update – requires non-standard SQL extensions ● or text parsing
  • 76. posts Table create table posts ( post SERIAL not null unique, thread integer not null references threads(thread) on delete cascade on update cascade, parent_post integer references posts(post) on delete cascade on update cascade, created timestamp with time zone not null default current_timestamp, owner text not null references users (login) on update cascade on delete cascade, content text not null, flag char(1) references flags(flag) on update cascade on delete set null constraint posts_key unique (thread, created, owner) );
  • 77. Special Case #4: Extensible Data ● Developers want admins to be able to create “flexible profiles” – series of items – undefined at installation time ● Josh Berkus – male – bearded – wears glasses
  • 78. Extensible Data: Entity-Attribute-Value ID Property Setting 407 Eyes Brown 407 Height 73in 407 Married? TRUE property format Eyes text 408 Married? FALSE Height number 408 Smoker FALSE Married? boolean Age number 408 Age 37 Smoker boolean 409 Height 66in
  • 79. EAVil ● Space-consumptive – many many rows, lots of row overhead ● Enforcing constraints by procedural code – very CPU-intensive ● Can't make anything “required” ● Can't index effectively ● Many-Way Joins – selecting combinations performs horribly ● however, you can cascade-drop
  • 80. EAVil ● All unmarried men with red hair under 30 SELECT first_name, last_name FROM users JOIN user_profiles married USING (login) JOIN user_profiles men USING (login) JOIN user_profiles hair USING (login) JOIN user_profiles age USING (login) WHERE married.property = 'Married?' and married.value::BOOLEAN = FALSE AND men.property = 'Gender' and men.value = 'm' AND hair.property = 'Hair' and hair.value = 'Red' AND age.property = 'Age' and age.value::INT < 30
  • 81. E-Blob ID Properties 407 <eyes=”brown”><height=”73”> <married=”1”><smoker=”1”> 408 <hair=”brown”><age=”49”> <married=”0”><smoker=”0”> 409 <age=”37”><height=”66”> <hat=”old”><teeth=”gold”>
  • 82. E-Blobby ● Slow to update, slow to search – need to use application code or lots of parsing ● Requires special database extensions – XML, hstore, etc. ● Advantages over EAV – smaller storage space (with compression) – no horrible joins – combinations easier – feeds directly into application code
  • 83. How to Decide: EAVil vs. ThE-Blob ● Will you be searching for specific items? – EAVil ● Will you be just spitting out all data to the application? – E-Blob ● Do you have special DB extenstions? – E-Blob
  • 84. When Not to use EAV & E-Blob ● As the foundation for all of your data – non-relational databases do this better ● For data which has important checks and constraints – or is required ● For data which needs to be searched fast ● As a way of modifying your application – alter the database!
  • 85. E-blob: The users Table create table users ( first_name text not null check ( length(first_name) between 1 and 40 ), last_name text not null check ( length(last_name) between 2 and 30 ), login text not null unique check ( length(login) between 4 and 30 ), password text not null check ( length(login) between 6 and 30 ), email email not null unique, description text, icon text, level integer not null default 1 references access_levels (level) on update cascade on delete set default, active boolean not null default TRUE, profile xml );
  • 87. Making a DB Schema is a Process not an end result ● Waterfall is Dead – don't make the schema static and the application dynamic – if you use Aglie/TDD/etc. for app, use it for DB – Plan to Iterate
  • 88. Software Development Cycle (TDD) create write tests specification develop get user software feedback deploy new version
  • 89. Database Development Cycle (TDD) write data write tests requirements develop get new developer schema feedback deploy new schema
  • 90. But wait, how do I manage change without breaking the application? ● The same as for software development 1)Testing 2)Migrations 3)Backwards-compatible APIs
  • 91. Testing ● Unit tests for database objects – especially stored procedures ● Application tests for application queries – need to be able to run all application queries and test for breakage ● Performance Regression tests – make sure you're not breaking performance
  • 92. Migrations ● For each schema change, write a SQL migration – use transactional DDL (if available) ● Sequence these updates – tie them to application updates ● Watch out for irreversability – unlike application migrations, database reversions may destroy data
  • 94. Views: Messages Table ● messages are sent from one user to one user create table messages ( message SERIAL not null unique, sender text not null references users(login) on delete cascade on update cascade, recipient text not null references users(login) on delete cascade on update cascade, sent timestamp with time zone not null default current_timestamp, subject text not null check (length(subject) between 3 and 200 ), content text not null ); ● developers want multiple recipients – but, they don't want to refactor all code
  • 95. 1. Create message_recipients create table message_recipients ( message int not null references messages(message) on delete cascade on update cascade, recipient text not null references users(login) on delete cascade on update cascade, constraint message_recipients_key unique ( message, recipient ) );
  • 96. 2. Rename and modify messages INSERT INTO message_recipients SELECT message, recipient FROM messages; ALTER TABLE messages NAME to message_contents; ALTER TABLE message_contents DROP COLUMN recipient;
  • 97. 3. Create VIEW for backwards compatibility CREATE VIEW messages AS SELECT message, sender, array_agg(recipient), sent, subject, content FROM message_contents JOIN message_recipients USING ( message ) GROUP BY message, sender, sent, subject, content;
  • 98. Some Good Practices “practice doesn't make perfect, perfect practice makes perfect.”
  • 99. Consistent, Clear Naming ● Pick a Style, and Stick To It – plural tables or singular? – camel case or underscore? – have a “stylebook” for all developers ● Name objects what they are – don't abbreviate – don't use “cute” or “temporary” names ● If the object changes, change its name
  • 100. Comment Your DB ● Use COMMENT ON … IS – describe each object – if you have time, each column – keep comments up to date – just like you would with application code comment on table privileges is 'a list of application privileges which can be assigned to various privilege levels.';
  • 101. Use Source Code Management ● DDL (data definition language) is Text – check it into Git/SVN/Mercurial/Bazaar – version it
  • 103. Premature Optimization ● Don't do anything “for performance” which compromises the logical model of your data – unless you've tested it thoroughly first ● Poor optimization limits your throughput – but you can always buy more hardware ● Poor design can result in days of downtime – besides, database engines are designed to optimize for good design
  • 104. “Downtime is more costly than slow throughput”
  • 105. Premature Optimization: Five Warning Signs 1)Are you choosing data types according to which is “faster”? 2)Do you find yourself counting bytes? 3)Did you disable foreign keys and constraints because they were “too slow”? 4)Have you “denormalized” or “flattened” tables to make them “faster”? 5)Do you find yourself trying to override the query planner?
  • 106. Polymorphic Fields ● Fields which mean different things depending on the value of another field result link_type link 309 URL http://www.postgresql.org 4718 port 443 5223 OS 88 9001 application 1915
  • 107. Magic Numbers ID = 0 2009-02-30 2000-01-01 -1, 1,2,3,4,5 100
  • 108. Summary 1)The database is a simplified model of the problem you're solving 2)It can be designed simply by working with the development team on creating lists 3)Relational Theory is simple and has only a few rules. 4)Normalization simply means removing duplication
  • 109. Summary 5)Designing a Table in 5 simple steps: 1)list your attributes 2)make them atomic 3)choose data types 4)choose keys 5)add constraints and defaults
  • 110. Summary 6) For any given set of data, there are several possible structures: pick the one the application likes. 7) Dimension tables aren't for everyone. 8) Four Special Cases require Special SQL: 1) Many-to-Many Join Tables 2) Lookup tables and Dimension Tables 3) Tree Structured Data 4) Extensible Data
  • 111. Summary 9) Managing Changes 1) Testing 2) Migrations 3) Views & Procedures as Compatible API 10) Follow Good Practices 11) Avoid Bad Practices
  • 112. More Information ● me – josh@pgexperts.com – www.pgexperts.com – it.toolbox.com/blogs/database-soup ● postgresql – www.postgresql.org ● at OSCON – PostgreSQL booth – State of Lightning Talks (Thursday 1:45) This presentation copyright 2009 Josh Berkus, licensed for distribution under the Creative Commons Attribution License, except for photos, most of which were stolen from other people's websites via images.google.com. Thanks, Google!