The GROUP BY clause of a SELECT statement defines subsets of rows based on matching values in one or more columns. Aggregate functions (such as COUNT, AVG, and SUM) can be applied, yielding a separate value for each subset. In other words, the GROUP BY clause returns a single summary row of information for each group of rows that share identical values in the GROUP BY columns.
This topic includes the following sections:
GROUP BY will be familiar to anyone who has written SQL queries for RDBMS tables. For streaming data, the behavior is similar. But there is an important difference: At least one column name or expression in the GROUP BY list must be monotonically increasing. Most often, this will be based on ROWTIME.
Without a monotonic column or expression in the GROUP BY clause, s-Server will not allow the query. The following query, for example, will be rejected:
SELECT STREAM prodId, COUNT(*) FROM Orders GROUP BY prodId
That query aims to compute the number of orders for each product, yielding running totals. However, Orders is a stream, which may continue indefinitely. Consequently, s-Server can never know that it has seen all orders for a given product and therefore can never complete a particular row’s total. Rather than allow a query that can never emit a row, s-Server’s validator rejects the query.
To rehabilitate the failed query above, we can include a time-based expression in the GROUP BY clause. That way, closure for the grouping is guaranteed; and results can be emitted. For example, the following query computes the product counts per hour:
SELECT STREAM FLOOR(Orders.ROWTIME TO HOUR) AS theHour, prodId, COUNT(*) FROM Orders GROUP BY FLOOR(Orders.ROWTIME TO HOUR), prodId;
This yields 1 output row per 1-hour period for each product ID. The expression FLOOR(Orders.ROWTIME TO HOUR) creates a tumbling window. Essentially, timestamps are rounded down to the nearest hour, allowing dissimilar values that belong to a common 1-hour period to be grouped together. Note: This expression is guaranteed to be monotonic because ROWTIME itself is always monotonic.
Note: There are 2 important exceptions to this rule involving monotonicity:
In a streaming context, one of the GROUP BY columns or expressions must be monotonically increasing. Often, though not always, this column will be ROWTIME with a FLOOR, CEIL / CEILING, or STEP function applied. As a result, rows are first grouped by a time period (such as ten seconds, five minutes, or twelve hours). Then, within the time-based grouping, rows can be grouped by 1 or more additional columns.
Note: For some real-world applications, you might need to use a timestamp that is not perfectly monotonic. In that case, please see the workaround solution below.
In s-Server, data is implicitly ordered by source ROWTIME, and physical order is maintained within ROWTIME. By definition, ROWTIME is monotonically increasing and can always be used in a streaming GROUP BY. Duplicate ROWTIME values are permitted in a stream. The GROUP BY operation will keep accumulating rows as long as the ROWTIME value remains the same and emit a row only once the ROWTIME value changes.
If you know that an expression is monotonic, you can declare it so by using the MONOTONIC function. However, if the actual data are not monotonic, the resulting system behavior may not be as expected or desired. See the topic MONOTONIC function in this guide for more details.
Note: For greater flexibility in defining time intervals, we recommend using STEP instead of FLOOR or CEILING. (See below.)
In layman’s terms, FLOOR rounds down whereas CEIL / CEILING rounds up. When called with a date, time, or timestamp expression, these functions specify the precision for rounding by means of TO followed by a time unit such as MINUTE or SECOND. FLOOR returns the largest value less than or equal to the input, subject to the indicated precision. Conversely, CEILING or CEIL returns the smallest value greater than or equal to the input, subject to the indicated precision.
SELECT STREAM AVG("speed") AS "avg_speed","bearing" FROM "StreamLab_Output_buses_demo"."guide_1_out_step_1" AS s GROUP BY FLOOR(s.ROWTIME TO MINUTE), "bearing";
The example above collects data in 1-minute intervals established by FLOOR, then groups rows within each 1-minute interval according to their value in the “bearing” column, calculating the average “speed” for each unique combination of “bearing” and time interval. Note: In this case, data is emitted once per minute, only including data for that most recent 1-minute interval.
The STEP function was created specifically for SQLstream. It performs arithmetic similar to FLOOR, but it allows more flexibility when defining time intervals to GROUP BY. For instance, FLOOR can round down to MINUTE (producing 1-minute intervals) or HOUR (producing 1-hour intervals). But defining intervals of 5 minutes or 30 seconds or 2 hours would not be simple using FLOOR. With STEP, you can customize the interval division using “steps” of any width.
The syntax for STEP differs slightly from FLOOR:
SELECT STREAM partition_id, SUM(measure) FROM s GROUP BY STEP(s.ROWTIME BY INTERVAL '5' MINUTE), partition_id;
The example above uses the STEP function to create a 5-minute tumbling window. Within each consecutive 5-minute interval, rows are grouped by “partition_id” to output a total “measure” for each grouping.
Note: The aggregator correctly generates punctuations when the STEP function is used in a GROUP BY clause. In the previous example, rows with ROWTIME between ‘2016-01-01 07:30:00’ and ‘2016-01-01 07:34.59.999’ would be aggregated as 1 or more output rows with a ROWTIME of
STEP(s.ROWTIME BY INTERVAL '5' MINUTE) + INTERVAL '5' MINUTE = '2016-01-01 07:35:00'
In other words, when aggregated results are emitted, the aggregator also emits a punctuation for the cutoff point of the interval (‘2016-01-01 07:35:00’), not the beginning (‘2016-01-01 07:30:00’). This reflects the closure of the interval, which is the moment when aggregation calculations can be performed.
Another example may be helpful. The following code subdivides ROWTIME according to 10-second intervals. Within each interval, GROUP BY organizes rows based on values in the “bearing” column, reporting the maximum “speed” for each combination of time interval and “bearing”.
SELECT STREAM MAX("speed") AS "max_speed","bearing" FROM "StreamLab_Output_buses_demo"."guide_1_out_step_1" AS s GROUP BY STEP(s.ROWTIME BY INTERVAL '10' SECOND), "bearing";
Result rows are emitted every 10 seconds, including 1 aggregate row for each distinct value of “bearing” that appeared during the last 10-second interval. In this case, there are 2 distinct values: “north” and “northeast”. Consequently, 2 aggregate rows are produced after each tumbling window closes. The output after several consecutive tumbling windows would appear as follows:
+-----------+-----------+ | max_speed | bearing | +-----------+-----------+ | 99 | north | | 117 | northeast | | 93 | north | | 81 | northeast | | 78 | north | | 94 | northeast | | 107 | northeast | | 55 | north | +-----------------------+
Although it is possible to GROUP BY both a larger and a smaller unit simultaneously, such “nesting” does not yield any benefit and should be avoided. Nested grouping will return results based on the more fine-grained GROUP BY expression. For example, if you use *FLOOR( ROWTIME TO MINUTE) and also FLOOR( ROWTIME TO HOUR), the results will be the same as if only the former were used. (When there is more than one grouping key, the effect is to group by the intersection of the groups.)
You should apply GROUP BY to the smallest unit of time you want grouped, and to that unit only.
When you need to GROUP BY a timestamp column that isn’t perfectly monotonically increasing, you can use a WITHIN clause to await late-arriving rows and accept data out of sequence. Essentially, this allows a bounded grace period before emitting the aggregation results. A similar method is used when T-Sorting stream input.
SELECT STREAM ... FROM ... GROUP BY FLOOR( (EVENT_TIME WITHIN INTERVAL '5' MINUTE) TO HOUR);
In the example above, rows can arrive out of order, up to 5 minutes late. s-Server will not emit the result for 01:00:00 until at least 5 minutes after the end of the period, once it has seen a row (or rowtime bound) timed at or after 01:05:00. Rows delayed by more than 5 minutes will be treated as late rows and discarded. They will not be included in the aggregates.
Effectively, this GROUP BY using WITHIN will emit the same results as if you had used a T-Sort on the detail records using the ORDER BY … WITHIN clause, then promoted the event timestamp to ROWTIME, and finally performed a GROUP BY based on ROWTIME.
For more information on the features and benefits of this functionality see Combining T-Sort with GROUP BY.
When you aggregate rows using GROUP BY, the emitted columns will contain only the GROUP BY key expressions, plus all the aggregate function expressions you define using functions such as COUNT, AVG, SUM, MIN, and MAX.
It may be useful to complement that output with additional column values that provide meaningful context, even if these are not part of the GROUP BY clause. You can use FIRST_VALUE or LAST_VALUE as aggregate functions with GROUP BY to return accompanying information.
SELECT STREAM AVG("speed") AS "avg_speed", "bus_id", LAST_VALUE("highway") AS "highway" FROM "StreamLab_Output_buses-data"."dashboard_pipe1_step_3" AS "input" GROUP BY STEP(s.ROWTIME BY INTERVAL '15' MINUTE), "bus_id";
The example above emits each bus_id together with its average speed during the previous 15-minute tumbling window. Additionally, a value for highway is included even though this column is not part of the GROUP BY key. It is possible that several different values of highway were recorded for a given bus_id during that 15-minute period. This query simply returns the most recent, since LAST_VALUE is used.
You can use IGNORE NULLS to ensure that a value is returned for FIRST_VALUE or LAST_VALUE. For example, suppose a weather station is broadcasting weather data in a spotty way. A given row can include a pressure, a temperature, neither, or both. The following query retrieves the most recent report of temperature and pressure from each weather station, doing so every 15 minutes:
SELECT STREAM station_id, LAST_VALUE(temp) IGNORE NULLS AS temp, LAST_VALUE(pressure) IGNORE NULLS AS pressure FROM weather_data AS s GROUP BY STEP(s.ROWTIME BY INTERVAL '15' MINUTE), station_id;
Using IGNORE NULLS ensures that a non-null value for both pressure and temperature will be reported for the previous 15-minute window, as long as at least 1 non-null value was received during that time for each of the two, even if the reported pressure and temperature values were not received simultaneously in a single row.
Sessions are partitioned windows whose beginning and end are determined by the data itself. Whether cited in a WINDOW clause or in a GROUP BY clause, a session is defined through some combination of START WHEN, END WHEN, and TIMEOUT AFTER expressions. In the case of GROUP BY, however, a session should be defined using a SESSION() function call rather than SESSION ON. The syntax is as follows:
SESSION( <comma separated expression list> [ START WHEN <boolean expression> ] [ END WHEN <boolean expression> ] [ TIMEOUT AFTER <interval expression> ] )
In the following example, drawn from telecommunciations, a session key is created for a given device_imsi (representing a unique user of a cellular network) as soon as status = “CALL_STARTED”; and the session ends upon receipt of status = “CALL_ENDED” or else after 10 minutes of inactivity.
SELECT STREAM device_imsi, SUM(bandwidth_used) FROM rowtime_promotion_view AS s GROUP BY device_imsi, SESSION(device_imsi START WHEN status = 'CALL_STARTED' END WHEN status = 'CALL_ENDED' TIMEOUT AFTER INTERVAL '10' MINUTE)
For the query above, as soon as the session ends, s-Server will output the total bandwidth_used during the session by the device_imsi in question. In general, whenever a session ends, s-Server calculates the GROUP BY aggregations that include that session key and emits the session expiry time as ROWTIME.
Notice that GROUP BY with SESSION does not require any expression based on ROWTIME. Normally, streaming GROUP BY requires a monotonically increasing column as part of the GROUP BY key. In this case, the session id serves that purpose. Although ROWTIME is not required within the GROUP BY clause, expressions based on ROWTIME can be combined with SESSION, as in the following example:
SELECT STREAM device_imsi, SUM(bandwidth_used) FROM rowtime_promotion_view AS s GROUP BY FLOOR(s.ROWTIME TO HOUR), device_imsi, SESSION(device_imsi END WHEN status = 'CALL_ENDED' TIMEOUT AFTER INTERVAL '10' MINUTE);
For the query above, s-Server emits an aggregation row whenever the session ends due to “CALL_ENDED” or timeout after 10 minutes of inactivity. That is similar to the previous example. Additionally, as a result of using FLOOR to define a 1-hour interval based on ROWTIME, this query will output an aggregation row for each device_imsi on an hourly basis, even during an ongoing session.
Note: Only a single SESSION call in a GROUP BY or PARTITION BY clause is allowed. Also, s-Server does not currently support combining SESSION with a WITHIN clause, which can be used to GROUP BY a Non-Monotonic Timestamp.