Scaling PostgreSQL with Stado
Who Am I?
• Jim Mlodgenski
  – Founder of Cirrus Technologies
  – Former Chief Architect of EnterpriseDB
  – Co-organizer of NYCPUG
Agenda
•   What is Stado?
•   Architecture
•   Query Flow
•   Scaling
•   Limitations
What is Stado?
• Continuation of GridSQL
• “Shared-Nothing”, distributed data architecture.
   – Leverage the power of multiple commodity
     servers while appearing as a single database
     to the application
• Essentially an open source alternative to
  Greenplum, Netezza, or Teradata
Stado Details
• Designed for Parallel Querying
• Not just “Read-Only”, can execute
  UPDATE, DELETE
• Data Loader for parallel loading
• Standard connectivity via PostgreSQL
  compatible connectors: JDBC, ODBC,
  ADO.NET, libpq (psql)
What Is Stado Not?
• A replication solution like Slony or Bucardo
• A high availability solution like Synchronous
  Replication in PostgreSQL 9.1
• A scalable transactional solution like Postgres-XC
• An elastic, eventually consistent NoSQL database
Architecture
• Loosely coupled, shared-
  nothing architecture
• Data repositories
   – Metadata database
   – Stado database
• Stado processes
   – Central coordinator
   – Agents
Configuration
• Can be configured for multiple logical “nodes” per
  physical server
  – Take advantage of multi-core processors
• Tables may be either replicated or partitioned
  – Replicated tables for static lookup data or
    dimensions
  – Partitioned tables for large fact tables
Partitioning
• Tables may simultaneously use Stado
  Partitioning with Constraint Exclusion
  Partitioning
  – Large queries scan a much smaller subset of
    data by using subtables
  – Since each subtable is also partitioned
    across nodes, they are scanned in parallel
  – Queries execute much faster
Creating Tables
• Tables can be partitioned or
  replicated
CREATE TABLE STATE_CODES (
     STATE_CD varchar(2) PRIMARY KEY,
     USPS_CD varchar(2),
     NAME varchar(100),
     GNISIS varchar(8)) REPLICATED;
Creating Tables

CREATE TABLE roads (
  gid integer NOT NULL,
  statefp character varying(2),
  countyfp character varying(3),
  linearid character varying(22),
  fullname character varying(100),
  rttyp character varying(1),
  mtfcc character varying(5),
  the_geom geometry)
PARTITIONING KEY gid ON ALL;
Query Optimization
• Cost Based Optimizer
   – Takes into account Row Shipping
     (expensive)
• Looks for joins with replicated tables
   – Can be done locally
• Looks for joins between tables on
  partitioned columns
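The join rules above can be illustrated with a minimal sketch. This is not Stado's optimizer; the Table class, the join_is_local function, and the census_tract partitioning key (tract_id) are hypothetical names chosen for illustration. It also assumes both partitioned tables are hashed the same way across the same set of nodes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Table:
    name: str
    # None means the table is replicated to every node (assumption of this sketch).
    partition_key: Optional[str] = None

    @property
    def replicated(self) -> bool:
        return self.partition_key is None


def join_is_local(left: Table, left_col: str, right: Table, right_col: str) -> bool:
    """Return True if an equi-join can run on each node without row shipping."""
    # A replicated table is present in full on every node.
    if left.replicated or right.replicated:
        return True
    # Two partitioned tables join locally only when both join columns are
    # the partitioning keys, so matching rows hash to the same node.
    return left.partition_key == left_col and right.partition_key == right_col


roads = Table("roads", partition_key="gid")
state_codes = Table("state_codes")                              # replicated
census_tract = Table("census_tract", partition_key="tract_id")  # hypothetical key

if __name__ == "__main__":
    print(join_is_local(roads, "statefp", state_codes, "state_cd"))   # replicated side
    print(join_is_local(roads, "statefp", census_tract, "state_cd"))  # needs row shipping
```

When the function returns False, rows from at least one side have to be shipped between nodes, which is the expensive case the cost-based optimizer tries to avoid.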
Two Phase Aggregation
• SUM
  – SUM(stat1)
  – SUM2(SUM(stat1))
• AVG
  – SUM(stat1) / COUNT(stat1)
  – SUM2 (SUM(stat1)) / SUM2 (COUNT(stat1))
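A minimal sketch of the two phases, assuming each node's rows are represented as a plain Python list. The function names node_partial, combine_sum, and combine_avg are illustrative, not Stado internals; combine_sum and combine_avg play the role of SUM2 on the slide.

```python
def node_partial(values):
    """Phase 1, run on each node: local SUM(stat1) and COUNT(stat1)."""
    return sum(values), len(values)


def combine_sum(partials):
    """Phase 2 for SUM: add up the per-node sums."""
    return sum(s for s, _ in partials)


def combine_avg(partials):
    """Phase 2 for AVG: total of sums divided by total of counts."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Three nodes holding different numbers of rows.
node_rows = [[1, 2, 3], [4, 5], [6]]
partials = [node_partial(rows) for rows in node_rows]

if __name__ == "__main__":
    print(combine_sum(partials))
    print(combine_avg(partials))
```

AVG is decomposed into SUM and COUNT because averaging the per-node averages gives the wrong answer whenever nodes hold different numbers of rows.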
Query 1
SELECT sum(st_length_spheroid(the_geom,
         'SPHEROID["GRS_1980",6378137,298.257222101]'))/1609.344
        as interstate_miles
 FROM roads
 WHERE rttyp = 'I';

 interstate_miles
------------------
 84588.5425986619
(1 row)
Query 1: Results
Nodes   Actual (sec)
    1   101.2080566
    4    25.6410708
    8    14.3321144
   12     5.4738612
   16     4.8214672
[Chart: Time (seconds) vs. Nodes, Linear vs. Actual]
Query 2
SELECT s.name as state, c.name as county, a.population, b.road_length,
       a.population/b.road_length as person_per_km
  FROM (SELECT state_cd, county_cd, sum(population) as population
          FROM census_tract
         GROUP BY 1, 2) a,
       (SELECT statefp, countyfp,
               sum(st_length_spheroid(the_geom,
'SPHEROID["GRS_1980",6378137,298.257222101]'))/1000 as road_length
          FROM roads
         GROUP BY 1, 2) b,
       state_codes s, county_codes c
 WHERE a.state_cd = b.statefp
   AND a.county_cd = b.countyfp
   AND a.state_cd = c.state_cd
   AND a.county_cd = c.county_cd
   AND c.state_cd = s.state_cd
 ORDER BY 5 DESC
 LIMIT 20;
        state         |     county      | population |   road_length    |  person_per_km
----------------------+-----------------+------------+------------------+------------------
 New York             | New York        |    1537195 | 1465.35561969273 | 1049.02521909483
 New York             | Kings           |    2465326 | 2785.37685011507 | 885.096032839562
 New York             | Bronx           |    1332650 | 1638.47925579201 | 813.345665066614
 New York             | Queens          |    2229379 | 4343.78066667893 | 513.234707521383
 New Jersey           | Hudson          |     608975 | 1474.86512729116 | 412.902162191933
 California           | San Francisco   |     776733 | 2125.05706617179 |  365.51159607175
 Pennsylvania         | Philadelphia    |    1517550 | 5067.19918355051 | 299.484970894054
 District of Columbia | Washington      |     572059 | 2191.33029860109 | 261.055579054054
 New York             | Richmond        |     443728 | 1758.77468237864 | 252.293829588156
 Massachusetts        | Suffolk         |     689807 | 2805.37242915611 | 245.887851762877
 New Jersey           | Essex           |     793633 | 3359.22581976629 | 236.254733257324
 Virginia             | Alexandria City |     128283 |  577.98117468444 | 221.950135434841
 Puerto Rico          | San Juan        |     434374 | 1994.26820504899 | 217.811224638829
 Virginia             | Arlington       |     189453 | 967.505165121908 | 195.816008874876
 New Jersey           | Union           |     522541 | 2827.74655887522 | 184.790605919029
 Maryland             | Baltimore City  |     651154 | 3707.01218958787 | 175.654669231717
 Puerto Rico          | Catano          |      30071 | 174.765650431886 | 172.064704509654
 Hawaii               | Honolulu        |     876156 |  5098.8482067881 | 171.834101441493
 Puerto Rico          | Toa Baja        |      94085 | 558.532996996738 | 168.450208861249
 Puerto Rico          | Carolina        |     186076 | 1122.20560229076 | 165.812752690026
(20 rows)
Query 2: Results
Nodes   Actual (sec)
    1   3983.1002548
    4   1007.1235182
    8    563.6259202
   12    365.152858
   16    282.7345952
[Chart: Time (seconds) vs. Nodes, Linear vs. Actual]
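To compare the measured timings for Query 1 and Query 2 against linear scaling, speedup can be computed as the 1-node time divided by the N-node time, and efficiency as speedup divided by N. The sketch below uses only the timings shown in the two results tables; the function name speedups is illustrative.

```python
def speedups(timings):
    """Return {nodes: (speedup, efficiency)} relative to the 1-node run."""
    base = timings[1]
    return {n: (base / t, base / t / n) for n, t in timings.items()}


# Timings in seconds, keyed by node count, copied from the results tables.
QUERY_1 = {1: 101.2080566, 4: 25.6410708, 8: 14.3321144, 12: 5.4738612, 16: 4.8214672}
QUERY_2 = {1: 3983.1002548, 4: 1007.1235182, 8: 563.6259202, 12: 365.152858, 16: 282.7345952}

if __name__ == "__main__":
    for name, timings in (("Query 1", QUERY_1), ("Query 2", QUERY_2)):
        for nodes, (speedup, efficiency) in speedups(timings).items():
            print(f"{name} {nodes:>2} nodes: {speedup:6.2f}x speedup, "
                  f"{efficiency:4.2f} efficiency")
```

An efficiency of 1.0 corresponds to the "Linear" line in the charts. On these figures, Query 1 at 16 nodes comes out at roughly 21x and Query 2 at roughly 14x.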
Scalability
Limitations
• SQL Support
  – Uses its own parser and optimizer
    so:
     • No Window Functions
     • No Stored Procedures
     • No Full Text Search
Transaction Performance
• Single-row INSERT, UPDATE, or DELETE statements are slow
  compared to a single PostgreSQL instance
   – The data must make an additional network trip to be
     committed
   – All partitioned rows must be hashed to be mapped to
     the proper node
   – All replicated rows must be committed to all nodes
• Use “gs-loader” for bulk loading for better performance
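The per-row overhead can be pictured with a minimal routing sketch. The slides do not describe Stado's hash function, so zlib.crc32 is a stand-in, and node_for_key and targets_for_write are hypothetical names.

```python
import zlib


def node_for_key(key, node_count):
    """Map a partitioning-key value to a node index.

    crc32 is only a stand-in for whatever hash the coordinator uses.
    """
    return zlib.crc32(str(key).encode("utf-8")) % node_count


def targets_for_write(replicated, key, node_count):
    """Return the node indexes a single-row write must reach."""
    if replicated:
        # Replicated rows are committed on every node.
        return list(range(node_count))
    # Partitioned rows go to exactly one node, chosen by hashing the key.
    return [node_for_key(key, node_count)]


if __name__ == "__main__":
    print(targets_for_write(True, "NY", 4))    # every node
    print(targets_for_write(False, 12345, 4))  # a single node
```

Every single-row write pays for the hash and the extra network hop, and a replicated write pays once per node, which is why bulk loading amortizes the cost better.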
High Availability
• No heartbeat or fail-over control in the coordinator
  – High Availability for each PostgreSQL node must be
    configured separately
  – Streaming replication can be ideal for this
• Getting a consistent backup of the entire Stado
  database is difficult
  – Must ensure no transactions are occurring
  – Back up each node separately
Adding Nodes
• Requires Downtime
  – Data must be manually reloaded to partition
    the data to the new node
• With planning, the process can be fast and require no
  remapping of data
  – Run multiple PostgreSQL instances on each
    physical server and move the PostgreSQL
    instances to new hardware as needed
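A minimal sketch of that planning approach, assuming 8 logical nodes and hypothetical host names. The function assign_hosts is illustrative and not part of Stado; the point is that the number of logical nodes stays fixed while their placement changes.

```python
def assign_hosts(logical_nodes, hosts):
    """Spread logical nodes evenly across hosts, in order."""
    per_host = -(-logical_nodes // len(hosts))  # ceiling division
    return {node: hosts[node // per_host] for node in range(logical_nodes)}


LOGICAL_NODES = 8  # fixed up front, so the key-to-node mapping never changes

# Day one: two physical servers, four PostgreSQL instances each.
before = assign_hosts(LOGICAL_NODES, ["hostA", "hostB"])
# Later: the same eight instances spread over four servers.
after = assign_hosts(LOGICAL_NODES, ["hostA", "hostB", "hostC", "hostD"])

if __name__ == "__main__":
    for node in range(LOGICAL_NODES):
        print(f"node {node}: {before[node]} -> {after[node]}")
```

Because the set of logical nodes is identical before and after, rows already hashed to a given node stay on that node; only the instance moves to new hardware.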
Summary
• Stado can tremendously improve
  query performance
• Stado can scale linearly as more nodes
  are added
• Stado is open source so if the
  limitations are an issue,
  submit a patch
Download Stado at:
http://stado.us


Jim Mlodgenski
 Email:     jim@cirrusql.com
 Twitter:   @jim_mlodgenski


 NYC PostgreSQL User Group
 http://nycpug.org
