Window Functions

Ordinary aggregate functions can only compute results within a row or aggregate all rows into a single result. In contrast, window functions can perform calculations across rows and populate the result in each row.

By applying a window function to certain portions of the rows filtered through queries, functionality akin to the Aggregate Functions is achieved, allowing aggregate functions to also serve as window functions. Window functions can scan all rows, simultaneously displaying both raw data and aggregated analytical results.

Precautions

Column-store tables currently only support the window functions rank(expression) and row_number(expression), along with the aggregate functions sum, count, avg, min, and max. There are no such restrictions for row-store tables.
A single query may include one or multiple window function expressions.
Window functions can only appear in output columns. To use a window function's value for conditional filtering, it must be nested within a subquery, employing the alias of the window function expression at the outer level for filtering conditions. Example:
```
SELECT classid, id, score FROM(SELECT *, avg(score) OVER(PARTITION BY classid) as avg_score FROM score) WHERE score >= avg_score;
```
Query blocks containing window functions support de-duplication via the GROUP BY expression. It is required that the PARTITION BY clause in the window function be a subset of the GROUP BY expression to ensure the window function operates on the de-duplicated outcome of the GROUP BY column. Additionally, the expression in the ORDER BY clause should either be a subset of the GROUP BY expression or an aggregate function from an aggregation operation. Example:
```
SELECT classid,rank() OVER(PARTITION BY classid ORDER BY sum(score)) as avg_score FROM score GROUP BY classid, id;
```

Syntax

Window functions use the specialized OVER clause to define the triggering scope of the window function. The OVER clause groups data and orders elements within those groups. Window functions assign serial numbers to values within a group.

    
         function_name ([expression [, expression ... ]]) OVER ( window_definition ) 
function_name ([expression [, expression ... ]]) OVER window_name
function_name ( * ) OVER ( window_definition ) 
function_name ( * ) OVER window_name

Options for the window_definition clause include:

    
         [ existing_window_name ] 
[ PARTITION BY expression [, ...] ]
[ ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ]
[ frame_clause ]

PARTITION BY: Groups rows sharing identical PARTITION BY expression values.

ORDER BY: Dictates the processing sequence of rows by the window function. Following ORDER BY, a column name is mandatory. If a numeral follows, it is treated as a constant, rendering sorting ineffective on the target column.

For the frame_clause options:

    
         [ RANGE | ROWS ] frame_start 
[ RANGE | ROWS ] BETWEEN frame_start AND frame_end

Specifying a window to compute results for all rows within a group entails defining the starting and ending points of the window range. Supported modes are RANGE and ROWS, with ROWS delineating windows by physical units (rows) and RANGE designating them as logical offsets.

Within RANGE and ROWS, boundaries can be set using BETWEEN frame_start AND frame_end. If only frame_start is specified, frame_end is CURRENT ROW by default.

Options for frame_start and frame_end:

CURRENT ROW: Refers to the present row.
N PRECEDING: Indicates the n-th row prior to the current one.
UNBOUNDED PRECEDING: Indicates the initial row of the current partition.
N FOLLOWING: Indicates the n-th row subsequent to the current one.
UNBOUNDED FOLLOWING: Indicates the final row of the current partition.

frame_start cannot be UNBOUNDED FOLLOWING, nor can frame_end be UNBOUNDED PRECEDING. Moreover, frame_end cannot be earlier than frame_start. For example, RANGE BETWEEN CURRENT ROW AND N PRECEDING is invalid.

The LAST_VALUE function accommodates the IGNORE NULLS syntax, returning the last non-null value within its window. If all values are null, null is returned. Its specific format is:

    
         LAST_VALUE (expression [IGNORE NULLS]) OVER (window_definition)

Currently, IGNORE NULLS is only compatible with the following window ranges: ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING and ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

RANK()

Description: The RANK function generates skip sequence numbers for values in each group. The same values have the same sequence number, but the same values occupy multiple sequence numbers.

Return type: bigint.

Example:

Given the table score(id, classid, score), each row represents the student ID, class ID, and exam score.

Use the RANK function to sort students' scores.

    
         CREATE EXTERNAL TABLE score(id int,classid int,score int);
INSERT INTO score VALUES(1,1,95),(2,2,95),(3,2,85),(4,1,70),(5,2,88),(6,1,70);

SELECT id, classid, score,RANK() OVER(ORDER BY score DESC) FROM score;
 id | classid | score | rank
----+---------+-------+------
  1 |       1 |    95 |    1
  2 |       2 |    95 |    1
  6 |       1 |    70 |    5
  4 |       1 |    70 |    5
  5 |       2 |    88 |    3
  3 |       2 |    85 |    4
(6 rows)

ROW_NUMBER()

Description: The ROW_NUMBER function generates consecutive sequence numbers for values in each group. The same values have different sequence numbers.

Return type: bigint.

Example:

    
         SELECT id, classid, score,ROW_NUMBER() OVER(ORDER BY score DESC) FROM score ORDER BY score DESC;
 id | classid | score | row_number
----+---------+-------+------------
  1 |       1 |    95 |          1
  2 |       2 |    95 |          2
  5 |       2 |    88 |          3
  3 |       2 |    85 |          4
  6 |       1 |    70 |          5
  4 |       1 |    70 |          6
(6 rows)

DENSE_RANK()

Description: The DENSE_RANK function generates consecutive sequence numbers for values in each group. The same values have the same sequence number, and the same values occupy only one sequence number.

Return type: bigint.

Example:

    
         SELECT id, classid, score,DENSE_RANK() OVER(ORDER BY score DESC) FROM score;
 id | classid | score | dense_rank
----+---------+-------+------------
  1 |       1 |    95 |          1
  2 |       2 |    95 |          1
  5 |       2 |    88 |          2
  3 |       2 |    85 |          3
  6 |       1 |    70 |          4
  4 |       1 |    70 |          4
(6 rows)

PERCENT_RANK()

Description: The PERCENT_RANK function generates a relative sequence number for the corresponding value in each group, which is calculated using the formula (Rank – 1)/(Total rows – 1). Here, Rank is the corresponding sequence number generated based on the RANK function, and Total rows is the total number of elements in the group.

Return type: double precision.

Example:

    
         SELECT id, classid, score,PERCENT_RANK() OVER(ORDER BY score DESC) FROM score;
 id | classid | score | percent_rank
----+---------+-------+--------------
  1 |       1 |    95 |            0
  2 |       2 |    95 |            0
  3 |       2 |    85 |           .6
  4 |       1 |    70 |           .8
  5 |       2 |    88 |           .4
  6 |       1 |    70 |           .8
(6 rows)

CUME_DIST()

Description: The CUME_DIST function generates cumulative distribution sequence numbers for the corresponding values in each group. This is calculated using the formula (Number of data rows less than or equal to the current value)/(Total number of rows in the group).

Return type: double precision.

Example:

    
         SELECT id,classid,score,CUME_DIST() OVER(ORDER BY score DESC) FROM score;
 id | classid | score |    cume_dist
----+---------+-------+------------------
  1 |       1 |    95 | .333333333333333
  2 |       2 |    95 | .333333333333333
  5 |       2 |    88 |               .5
  3 |       2 |    85 | .666666666666667
  4 |       1 |    70 |                1
  6 |       1 |    70 |                1
(6 rows)

NTILE(num_buckets integer)

Description: The NTILE function evenly distributes an ordered dataset into the number of buckets specified by num_buckets and assigns a bucket number to each row. The distribution should be as balanced as possible.

Return type: integer.

Example:

    
         SELECT id,classid,score,NTILE(3) OVER(ORDER BY score DESC) FROM score;
 id | classid | score | ntile
----+---------+-------+-------
  1 |       1 |    95 |     1
  2 |       2 |    95 |     1
  5 |       2 |    88 |     2
  3 |       2 |    85 |     2
  4 |       1 |    70 |     3
  6 |       1 |    70 |     3
(6 rows)

LAG(value any [, offset integer [, default any ]])

Description: The LAG function generates lagged values for the corresponding values in each group. It uses the value from the row obtained by shifting the current value's row back by offset positions as the sequence number. If the shifted row does not exist, the result defaults to the default value. If unspecified, offset defaults to 1 and default defaults to NULL.