You Can Do That
    Without Perl?
 Lists and Recursion and Trees, Oh My!
 YAPC10, Pittsburgh 2009

Subject: The YAPC::NA 2009 Survey
Windowing Functions
What are Windowing Functions?

        PGCon2009, 21-22 May 2009 Ottawa
What are the Windowing Functions?

• Similar to classical aggregates, but they do more!
• Provide access to a set of rows related to the current row
• Introduced in SQL:2003, with more detail in SQL:2008
   – SQL:2008, Part 2: SQL/Foundation
      • 4.14.9 Window tables
      • 4.15.3 Window functions
      • 6.10 <window function>
      • 7.8 <window clause>
• Supported by Oracle, SQL Server, Sybase and DB2
   – No open source RDBMS so far except PostgreSQL (Firebird is trying)
• Used mainly in OLAP, but also useful in OLTP
   – Analysis and reporting by rankings, cumulative aggregates

How they work and what you get


• Windowed table
  – Operates on a windowed table
  – Returns a value for each row
  – Returned value is calculated from the rows in the window


     • You can use…
        – New window functions
        – Existing aggregate functions
        – User-defined window functions
        – User-defined aggregate functions


• Completely new concept!
  – With Windowing Functions, you can reach outside the current row

[Aggregates]   SELECT key, SUM(val) FROM tbl GROUP BY key;

[Windowing Functions]   SELECT key, SUM(val) OVER (PARTITION BY key) FROM tbl;
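The contrast between the two queries can be sketched on a tiny table. The table name tbl_demo and its three rows are invented for illustration; the slides do not show the contents of tbl.

```sql
-- Hypothetical sample data (not from the slides).
CREATE TEMP TABLE tbl_demo (key text, val int);
INSERT INTO tbl_demo VALUES ('a', 1), ('a', 2), ('b', 10);

-- Aggregate: one output row per group (2 rows here).
SELECT key, SUM(val) FROM tbl_demo GROUP BY key ORDER BY key;

-- Window function: every input row is kept (3 rows here),
-- each one carrying the sum of its own partition.
SELECT key, val, SUM(val) OVER (PARTITION BY key)
FROM tbl_demo
ORDER BY key, val;
```

The aggregate collapses rows; the window function decorates them.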

Who is paid the most relative to their department average?

        depname             empno                    salary
        develop             10                       5200
        sales               1                        5000
        personnel           5                        3500
        sales               4                        4800
        sales               6                        5500
        personnel           2                        3900
        develop             7                        4200
        develop             9                        4500
        sales               3                        4800
        develop             8                        6000
        develop             11                       5200

SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname)::int,
        salary - avg(salary) OVER (PARTITION BY depname)::int AS diff
FROM empsalary ORDER BY diff DESC;

          depname      empno         salary         avg        diff
          develop      8             6000           5020       980
          sales        6             5500           5025       475
          personnel    2             3900           3700       200
          develop      11            5200           5020       180
          develop      10            5200           5020       180
          sales        1             5000           5025       -25
          personnel    5             3500           3700       -200
          sales        4             4800           5025       -225
          sales        3             4800           5025       -225
          develop      9             4500           5020       -520
          develop      7             4200           5020       -820
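For comparison, here is a sketch of the same report written both ways. The rows are copied from the slide; the table name empsalary_demo and the join-based rewrite are illustrative assumptions, not part of the talk.

```sql
-- Sample rows copied from the slide.
CREATE TEMP TABLE empsalary_demo (depname text, empno int, salary int);
INSERT INTO empsalary_demo VALUES
    ('develop', 10, 5200), ('sales', 1, 5000), ('personnel', 5, 3500),
    ('sales', 4, 4800), ('sales', 6, 5500), ('personnel', 2, 3900),
    ('develop', 7, 4200), ('develop', 9, 4500), ('sales', 3, 4800),
    ('develop', 8, 6000), ('develop', 11, 5200);

-- Window-function form; empno breaks ties so the order is stable.
SELECT depname, empno, salary,
       (avg(salary) OVER (PARTITION BY depname))::int AS avg,
       salary - (avg(salary) OVER (PARTITION BY depname))::int AS diff
FROM empsalary_demo
ORDER BY diff DESC, empno;

-- The same result without window functions: join every row to a
-- per-department aggregate computed in a subquery.
SELECT e.depname, e.empno, e.salary, d.avg, e.salary - d.avg AS diff
FROM empsalary_demo e
JOIN (SELECT depname, avg(salary)::int AS avg
      FROM empsalary_demo
      GROUP BY depname) d USING (depname)
ORDER BY diff DESC, e.empno;
```

Both queries put empno 8 on top with a diff of 980; the window form just avoids the extra subquery and join.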

Anatomy of a Window
• Represents a set of rows abstractly, as:
• A partition
  – Specified by the PARTITION BY clause
  – Never moves
  – Can contain a frame
• A frame
  – Specified by ORDER BY clause and frame clause
  – Defined in a partition
  – Moves within a partition
  – Never goes across two partitions
• Some functions take values from a partition.
  Others take them from a frame.
A partition
• Specified by the PARTITION BY clause in OVER()
• Lets you subdivide the table, much like the
  GROUP BY clause
• Without a PARTITION BY clause, the whole
  table is a single partition

             OVER ( partition-clause order-clause frame-clause )

         partition-clause: PARTITION BY expr, …
A frame
• Specified by the ORDER BY clause and the frame
  clause in OVER()
• Tells how far the set of rows extends
• Also defines the ordering of the set
• Without order and frame clauses, the whole
  partition is a single frame

            OVER ( partition-clause order-clause frame-clause )
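A small sketch of how the frame changes with the clauses given. It uses count(*) so that the frame size is directly visible; the table name tbl_frame is an assumption, and its four values are the ones used in the function slides.

```sql
CREATE TEMP TABLE tbl_frame (val int);
INSERT INTO tbl_frame VALUES (5), (5), (3), (1);

SELECT val,
       -- No ORDER BY and no frame clause: the frame is the whole partition.
       count(*) OVER () AS whole_partition,            -- 4, 4, 4, 4
       -- ORDER BY only: the frame runs from the first row through the
       -- current row and its peers (rows with an equal val).
       count(*) OVER (ORDER BY val DESC) AS up_to_current,  -- 2, 2, 3, 4
       -- Explicit frame clause: ordered, but the frame spans everything.
       count(*) OVER (ORDER BY val DESC
                      ROWS BETWEEN UNBOUNDED PRECEDING
                               AND UNBOUNDED FOLLOWING) AS explicit_frame  -- 4, 4, 4, 4
FROM tbl_frame
ORDER BY val DESC;
```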

The WINDOW clause
• You can define a common window in one place
  and name it, for convenience
• You can use the named window anywhere in the same query
SELECT target_list, … wfunc() OVER w
FROM table_list, …
WHERE qual_list, …
GROUP BY groupkey_list, …
HAVING groupqual_list, …
WINDOW w AS ( partition-clause order-clause frame-clause )
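As a concrete instance of the template above, here is a sketch in which three functions share one named window. The table tbl_window and its rows are hypothetical.

```sql
-- Hypothetical sample data (not from the slides).
CREATE TEMP TABLE tbl_window (key text, val int);
INSERT INTO tbl_window VALUES ('a', 3), ('a', 1), ('b', 7), ('b', 2), ('b', 5);

-- The window w is defined once and used by all three functions.
SELECT key, val,
       row_number()     OVER w AS pos,
       sum(val)         OVER w AS running_sum,
       first_value(val) OVER w AS largest
FROM tbl_window
WINDOW w AS (PARTITION BY key ORDER BY val DESC)
ORDER BY key, val DESC;
```

Changing the partitioning or ordering then means editing one line instead of three.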

Built-in Windowing Functions

List of built-in Windowing Functions

•   row_number()                 •   lag()
•   rank()                       •   lead()
•   dense_rank()                 •   first_value()
•   percent_rank()               •   last_value()
•   cume_dist()                  •   nth_value()
•   ntile()

row_number()
     • Returns the number of the current row

   SELECT val, row_number() OVER (ORDER BY val DESC) FROM tbl;

     val                             row_number()
     5                               1
     5                               2
     3                               3
     1                               4
Note: row_number() always returns incrementing values, independent of the frame

rank()
• Returns the rank of the current row, with gaps

    SELECT val, rank() OVER (ORDER BY val DESC) FROM tbl;

    val                             rank()
    5                               1
    5                               1
    3                               3
    1                               4
    Note: rank() OVER(*empty*) returns 1 for all rows, since all rows
    are peers of each other

dense_rank()
• Returns the rank of the current row, without gaps

       SELECT val, dense_rank() OVER (ORDER BY val DESC) FROM tbl;

       val                             dense_rank()
       5                               1
       5                               1
       3                               2          (no gap)
       1                               3
Note: dense_rank() OVER(*empty*) returns 1 for all rows, since all rows
are peers of each other
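The three numbering functions are easiest to compare side by side. A minimal sketch, assuming a table tbl_rank (a hypothetical name) holding the four values from the slides:

```sql
CREATE TEMP TABLE tbl_rank (val int);
INSERT INTO tbl_rank VALUES (5), (5), (3), (1);

SELECT val,
       row_number() OVER w,   -- 1, 2, 3, 4: ties are numbered arbitrarily
       rank()       OVER w,   -- 1, 1, 3, 4: gap after the tie
       dense_rank() OVER w    -- 1, 1, 2, 3: no gap
FROM tbl_rank
WINDOW w AS (ORDER BY val DESC)
ORDER BY val DESC;
```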

percent_rank()
• Returns relative rank: (rank() - 1) / (total rows - 1)

   SELECT val, percent_rank() OVER (ORDER BY val DESC) FROM tbl;

       val                             percent_rank()
       5                               0
       5                               0
       3                               0.666666666666667
       1                               1

      Values are always between 0 and 1 inclusive.

Note: percent_rank() OVER(*empty*) returns 0 for all rows, since all rows
are peers of each other

cume_dist()
• Returns relative rank: (number of rows preceding or peer with the current row) / (total rows)

 SELECT val, cume_dist() OVER (ORDER BY val DESC) FROM tbl;

   val                             cume_dist()
   5                               0.5                          =2/4
   5                               0.5                          =2/4
   3                               0.75                         =3/4
   1                               1                            =4/4
       The result can be emulated by
       "count(*) OVER (ORDER BY val DESC)::float / count(*) OVER ()"
       (the cast avoids integer division)
       Note: cume_dist() OVER(*empty*) returns 1 for all rows, since all rows
       are peers of each other
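The emulation can be checked directly. This sketch computes cume_dist() and percent_rank() next to hand-written equivalents; the table name tbl_dist and the column aliases are assumptions made for the example.

```sql
CREATE TEMP TABLE tbl_dist (val int);
INSERT INTO tbl_dist VALUES (5), (5), (3), (1);

SELECT val,
       cume_dist() OVER w AS cume_dist,
       -- count(*) returns bigint, so cast before dividing;
       -- otherwise integer division yields 0 or 1 only.
       (count(*) OVER w)::float8 / count(*) OVER () AS emulated_cume_dist,
       percent_rank() OVER w AS percent_rank,
       -- Needs at least two rows; a one-row table would divide by zero.
       (rank() OVER w - 1)::float8 / (count(*) OVER () - 1) AS emulated_percent_rank
FROM tbl_dist
WINDOW w AS (ORDER BY val DESC)
ORDER BY val DESC;
```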
ntile()
• Returns the number of the bucket the current row falls into

   SELECT val, ntile(3) OVER (ORDER BY val DESC) FROM tbl;

   val                               ntile(3)
   5                                 1                                    4%3=1
   5                                 1
   3                                 2
   1                                 3
    Rows are divided as evenly as possible among the buckets; if there is
    a remainder, the extra rows go to the buckets at the head
    Note: ntile() OVER (*empty*) returns the same values as above, since
    ntile() ignores the frame and works against the partition
lag()
      • Returns the value of the row above

SELECT val, lag(val) OVER (ORDER BY val DESC) FROM tbl;

val                            lag(val)
5                              NULL
5                              5
3                              5
1                              3
            Note: lag() acts on the partition, not the frame.

lead()
• Returns the value of the row below

SELECT val, lead(val) OVER (ORDER BY val DESC) FROM tbl;

val                            lead(val)
5                              5
5                              3
3                              1
1                              NULL

         Note: lead() acts on the partition, not the frame.
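A typical use of lag() and lead() is comparing a row with its neighbours. The sketch below also uses the three-argument form lag(value, offset, default), which PostgreSQL provides but the slides do not show; the table name tbl_lag is an assumption.

```sql
CREATE TEMP TABLE tbl_lag (val int);
INSERT INTO tbl_lag VALUES (5), (5), (3), (1);

SELECT val,
       lag(val)  OVER w       AS prev_val,   -- NULL, 5, 5, 3
       lead(val) OVER w       AS next_val,   -- 5, 3, 1, NULL
       val - lag(val) OVER w  AS step,       -- NULL, 0, -2, -2
       lag(val, 2, 0) OVER w  AS two_back    -- 0, 0, 5, 5 (0 is the default)
FROM tbl_lag
WINDOW w AS (ORDER BY val DESC)
ORDER BY val DESC;
```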

first_value()
• Returns the first value of the frame

    SELECT val, first_value(val) OVER (ORDER BY val DESC) FROM tbl;

    val                            first_value(val)
    5                              5
    5                              5
    3                              5
    1                              5

last_value()
• Returns the last value of the frame
    SELECT val, last_value(val) OVER
      (ORDER BY val DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM tbl;

    val                            last_value(val)
    5                              1
    5                              1
    3                              1
    1                              1
    Note: a frame clause is necessary because, with only an ORDER BY clause,
    the frame runs from the first row through the current row and its peers

nth_value()
• Returns the n-th value of the frame
    SELECT val, nth_value(val, val) OVER
      (ORDER BY val DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM tbl;

    val                               nth_value(val, val)
    5                                 NULL
    5                                 NULL
    3                                 3
    1                                 5
   Note: a frame clause is necessary because, with only an ORDER BY clause,
   the frame runs from the first row through the current row and its peers
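The effect of the frame clause on these functions can be sketched by running each one against both the default frame and a whole-partition frame. The table name tbl_value, the window name whole, and the column aliases are assumptions for this example.

```sql
CREATE TEMP TABLE tbl_value (val int);
INSERT INTO tbl_value VALUES (5), (5), (3), (1);

SELECT val,
       -- Default frame ends at the current row's last peer.
       last_value(val)   OVER (ORDER BY val DESC) AS last_default,   -- 5, 5, 3, 1
       -- Whole-partition frame: the true last value.
       last_value(val)   OVER whole               AS last_whole,     -- 1, 1, 1, 1
       nth_value(val, 3) OVER (ORDER BY val DESC) AS third_default,  -- NULL, NULL, 3, 3
       nth_value(val, 3) OVER whole               AS third_whole     -- 3, 3, 3, 3
FROM tbl_value
WINDOW whole AS (ORDER BY val DESC
                 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER BY val DESC;
```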
Aggregates (all peers)
• Returns the same values along the frame

    SELECT val, sum(val) OVER () FROM tbl;

    val                            sum(val)
    5                              14
    5                              14
    3                              14
    1                              14
   Note: all rows are peers of each other

Cumulative aggregates
• Returns different values along the frame

    SELECT val, sum(val) OVER (ORDER BY val DESC) FROM tbl;

    val                            sum(val)
    5                              10
    5                              10
    3                              13
    1                              14
    Note: row #1 and row #2 return the same value since they are peers.
    The result for row #3 is sum(val of rows #1…#3)
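If a strict row-by-row running total is wanted instead, one option is a ROWS frame with a tie-breaking sort key. This is a sketch under assumptions: the table tbl_cumul and its id column are hypothetical additions used only to make the order deterministic.

```sql
CREATE TEMP TABLE tbl_cumul (id int, val int);
INSERT INTO tbl_cumul VALUES (1, 5), (2, 5), (3, 3), (4, 1);

SELECT id, val,
       -- Default frame: peers are summed together.
       sum(val) OVER (ORDER BY val DESC) AS by_peers,       -- 10, 10, 13, 14
       -- ROWS frame: strictly one row at a time; id breaks the tie.
       sum(val) OVER (ORDER BY val DESC, id
                      ROWS BETWEEN UNBOUNDED PRECEDING
                               AND CURRENT ROW) AS by_rows  -- 5, 10, 13, 14
FROM tbl_cumul
ORDER BY val DESC, id;
```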

Recursion and Trees, Oh My:
Common Table Expressions
Thanks to:
•   The ISO SQL Standards Committee

•   Yoshiyuki Asaba

•   Ishii Tatsuo

•   Jeff Davis

•   Gregory Stark

•   Tom Lane

•   etc., etc., etc.
Recursion in General

Initial Condition
Recursion step
E1 List Table
CREATE TABLE employee(
    id INTEGER NOT NULL,
    boss_id INTEGER,
    UNIQUE(id, boss_id)/*, etc., etc. */
);

INSERT INTO employee(id, boss_id)
VALUES(1,NULL), /* El capo di tutti capi */
      … /* remaining rows not shown */
Tree Query: Initiation, Recursion, Termination, Display
WITH RECURSIVE t(node, path) AS (
    SELECT id, ARRAY[id] FROM employee WHERE boss_id IS NULL
    /* Initiation Step */
UNION ALL
    SELECT e1.id, t.path || ARRAY[e1.id] /* Recursion */
    FROM employee e1 JOIN t ON (e1.boss_id = t.node)
    WHERE e1.id <> ALL (t.path) /* Termination Condition */
)
SELECT
    CASE WHEN array_upper(path,1)>1 THEN '+-' ELSE '' END ||
    REPEAT('--', array_upper(path,1)-2) ||
    node AS "Branch" /* Display */
FROM t
ORDER BY path;
        (16 rows)
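To try the tree query without the original data, here is a self-contained sketch on a five-person org chart. The table name employee_demo and its rows are invented; the 16 rows used in the talk are not shown in the slides.

```sql
-- Hypothetical org chart: 1 is the boss of 2 and 3, 2 of 4, 4 of 5.
CREATE TEMP TABLE employee_demo (
    id INTEGER NOT NULL,
    boss_id INTEGER,
    UNIQUE (id, boss_id)
);
INSERT INTO employee_demo (id, boss_id)
VALUES (1, NULL), (2, 1), (3, 1), (4, 2), (5, 4);

WITH RECURSIVE t(node, path) AS (
    -- Initiation: start at the row with no boss.
    SELECT id, ARRAY[id] FROM employee_demo WHERE boss_id IS NULL
UNION ALL
    -- Recursion: attach each employee below a boss already in t.
    SELECT e1.id, t.path || e1.id
    FROM employee_demo e1 JOIN t ON (e1.boss_id = t.node)
    -- Termination: never revisit an id already on the path.
    WHERE e1.id <> ALL (t.path)
)
SELECT
    CASE WHEN array_upper(path, 1) > 1 THEN '+-' ELSE '' END ||
    REPEAT('--', array_upper(path, 1) - 2) ||
    node::text AS "Branch"
FROM t
ORDER BY path;
```

Sorting by the path array lists every subtree directly under its parent, which is what makes the indented display work.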
Travelling Salesman Problem
Given a number of cities and the costs of travelling
from any city to any other city, what is the least-
cost round-trip route that visits each city exactly
once and then returns to the starting city?
TSP Schema
CREATE TABLE pairs (
    from_city TEXT NOT NULL,
    to_city TEXT NOT NULL,
    distance INTEGER NOT NULL,
    PRIMARY KEY(from_city, to_city),
    CHECK (from_city < to_city)
);
TSP Data
INSERT INTO pairs (from_city, to_city, distance)
VALUES
    ('Bari','Reggio Calabria',464),
    … /* remaining rows not shown */
TSP Program:
     Symmetric Setup

 WITH RECURSIVE both_ways(
     from_city,
     to_city,
     distance
 )           /* Working Table */
 AS (/* Distances One Way */
     SELECT
         from_city,
         to_city,
         distance
     FROM
         pairs
 UNION ALL /* Distances Other Way */
     SELECT
         to_city AS "from_city",
         from_city AS "to_city",
         distance
     FROM
         pairs
 ),
TSP Program:
      Path Initialization Step
paths (
    from_city,
    to_city,
    distance,
    path
)
AS (
    SELECT
        from_city,
        to_city,
        distance,
        ARRAY[from_city] AS "path"
    FROM
        both_ways b1
    WHERE
        b1.from_city = 'Roma'
TSP Program:
             Path Recursion and Timely Termination Steps

UNION ALL
    SELECT
        b2.from_city,
        b2.to_city,
        p.distance + b2.distance,
        p.path || b2.from_city
    FROM
        both_ways b2
    JOIN
        paths p
        ON (
            p.to_city = b2.from_city
        AND
            b2.from_city <> ALL (p.path[
                2:array_upper(p.path,1)
            ]) /* Prevent re-tracing */
        AND
            array_upper(p.path,1) < 6 /* Timely Termination */
        )
)
TSP Program:
               Filter and Display

SELECT
    path || to_city AS "path",
    distance
FROM
    paths
WHERE
    to_city = 'Roma'
AND
    ARRAY['Milan','Firenze','Napoli'] <@ path
ORDER BY distance, path
LIMIT 1;
TSP Program:
                 Filter and Display

davidfetter@tsp=# \i travelling_salesman.sql
               path               | distance
----------------------------------+----------
 {Roma,Firenze,Milan,Napoli,Roma} |     1553
(1 row)

Time: 11679.503 ms
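The same technique can be tried end to end on a toy instance. This is a simplified sketch, not the query from the talk: the table pairs_demo, the four cities A to D, and their distances are all invented, and the re-tracing check here compares against the whole path rather than a slice of it.

```sql
-- Hypothetical distances between four cities, for illustration only.
CREATE TEMP TABLE pairs_demo (
    from_city TEXT NOT NULL,
    to_city TEXT NOT NULL,
    distance INTEGER NOT NULL,
    PRIMARY KEY (from_city, to_city),
    CHECK (from_city < to_city)
);
INSERT INTO pairs_demo VALUES
    ('A', 'B', 10), ('A', 'C', 15), ('A', 'D', 20),
    ('B', 'C', 35), ('B', 'D', 25), ('C', 'D', 30);

WITH RECURSIVE both_ways(from_city, to_city, distance) AS (
    SELECT from_city, to_city, distance FROM pairs_demo
UNION ALL
    SELECT to_city, from_city, distance FROM pairs_demo
),
paths(from_city, to_city, distance, path) AS (
    -- Initialization: every leg that leaves the starting city.
    SELECT from_city, to_city, distance, ARRAY[from_city]
    FROM both_ways
    WHERE from_city = 'A'
UNION ALL
    -- Recursion: extend each path by one leg.
    SELECT b2.from_city, b2.to_city,
           p.distance + b2.distance,
           p.path || b2.from_city
    FROM both_ways b2
    JOIN paths p
      ON p.to_city = b2.from_city
     AND b2.from_city <> ALL (p.path)   -- never revisit a city
     AND array_upper(p.path, 1) < 4     -- termination: four cities at most
)
SELECT path || to_city AS path, distance
FROM paths
WHERE to_city = 'A'
  AND array_upper(path, 1) = 4          -- every city was visited
ORDER BY distance, path
LIMIT 1;
```

With these made-up distances the query returns the tour {A,B,D,C,A} at a total distance of 80. As the plan in the talk shows, the number of candidate paths grows very quickly with the number of cities, so this brute-force approach only suits small inputs.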
TSP Program:
                                                   Gritty Details

                                                                 QUERY PLAN
 Limit (cost=44.14..44.15 rows=1 width=68) (actual time=16131.326..16131.327 rows=1 loops=1)
   CTE both_ways
     -> Append (cost=0.00..4.64 rows=132 width=19) (actual time=0.011..0.090 rows=132 loops=1)
            -> Seq Scan on pairs (cost=0.00..1.66 rows=66 width=19) (actual time=0.010..0.026 rows=66 loops=1)
            -> Seq Scan on pairs (cost=0.00..1.66 rows=66 width=19) (actual time=0.004..0.029 rows=66 loops=1)
   CTE paths
     -> Recursive Union (cost=0.00..38.72 rows=31 width=104) (actual time=0.102..10631.385 rows=1092883 loops=1)
            -> CTE Scan on both_ways b1 (cost=0.00..2.97 rows=1 width=68) (actual time=0.100..0.251 rows=11 loops=1)
                  Filter: (from_city = 'Roma'::text)
            -> Hash Join (cost=0.29..3.51 rows=3 width=104) (actual time=180.125..958.459 rows=182145 loops=6)
                  Hash Cond: (b2.from_city = p.to_city)
                  Join Filter: (b2.from_city <> ALL (p.path[2:array_upper(p.path, 1)]))
                  -> CTE Scan on both_ways b2 (cost=0.00..2.64 rows=132 width=68) (actual time=0.001..0.056 rows=132 loops=5)
                  -> Hash (cost=0.25..0.25 rows=3 width=68) (actual time=172.292..172.292 rows=22427 loops=6)
                        -> WorkTable Scan on paths p (cost=0.00..0.25 rows=3 width=68) (actual time=123.167..146.659 rows=22427 loops=6)
                              Filter: (array_upper(path, 1) < 6)
   -> Sort (cost=0.79..0.79 rows=1 width=68) (actual time=16131.324..16131.324 rows=1 loops=1)
          Sort Key: paths.distance, ((paths.path || paths.to_city))
          Sort Method: top-N heapsort Memory: 17kB
          -> CTE Scan on paths (cost=0.00..0.78 rows=1 width=68) (actual time=36.621..16127.841 rows=4146 loops=1)
                Filter: (('{Milan,Firenze,Napoli}'::text[] <@ path) AND (to_city = 'Roma'::text))
 Total runtime: 16180.335 ms
(22 rows)
          WTF Dimension
WITH RECURSIVE
Z(Ix, Iy, Cx, Cy, X, Y, I)
AS (
     SELECT Ix, Iy, X::float, Y::float, X::float, Y::float, 0
     FROM
          (SELECT -2.2 + 0.031 * i, i FROM generate_series(0,101) AS i) AS xgen(x,ix)
     CROSS JOIN
          (SELECT -1.5 + 0.031 * i, i FROM generate_series(0,101) AS i) AS ygen(y,iy)
     UNION ALL
     SELECT Ix, Iy, Cx, Cy, X * X - Y * Y + Cx AS X, Y * X * 2 + Cy, I + 1
     FROM Z
     WHERE X * X + Y * Y < 16::float
     AND I < 100
),
Zt (Ix, Iy, I) AS (
    SELECT Ix, Iy, MAX(I) AS I
    FROM Z
    GROUP BY Iy, Ix
    ORDER BY Iy, Ix
)

SELECT array_to_string(
             ' .,,,-----++++%%%%@@@@#### ',
Thank You!
