Aggregate Functions

An aggregate function returns a result calculated not from a single row but rather aggregated from data contained in a set of rows, or from information about a set of rows. An aggregate function may appear in

  • the <selection list> portion of a SELECT clause,
  • an ORDER BY clause, or
  • a HAVING clause.

s-Server features the following aggregate functions:

Aggregate functions differ from analytic functions, which are always evaluated relative to a window that you need to specify, and so cannot appear in a HAVING clause. Other differences are described in the below.

Note: You can use Naive Bayes Classifiers with aggregate functions. See Using the Naive Bayes Classifier with s-Server Aggregate Functions below.

Aggregate functions operate slightly differently in aggregate queries on tables than when used in aggregate queries on streams:

If an aggregate query on tables contains a GROUP BY clause, then the aggregate function returns one result per group in the set of input rows. Lacking an explicit GROUP BY clause is equivalent to GROUP BY (), and returns only one result for the entire set of input rows.

On streams, an aggregate query must contain an explicit GROUP BY clause on a monotonically increasing expression based on rowtime. Without one, the sole group is the whole stream, which never ends, preventing any result from being reported. Adding a GROUP BY clause based on a monotonically increasing expression breaks the stream into finite sets of rows, contiguous in time, and each such set can then be aggregated and reported.

Whenever a row arrives that changes the value of the monotonically increasing grouping expression, a new group is started and the previous group is considered complete. The s-Server then outputs the value of the aggregate functions. Note that the GROUP BY clause may also include other non-monotonically-increasing expressions, in which case more than one result per set of rows may be produced.

Performing an aggregate query on streams is often referred to as streaming aggregation, as distinct from the ‘windowed aggregation’ discussed here. Both are discussed in the topic Application Design Considerations in the s-Server Concepts Guide. Stream-to-stream joins are discussed at Stream to Stream Joins; Time-based and row-based windows are discussed in the Glossary.

If an input row contains a null in a column used as an input to a data analysis function, the data analysis function ignores the row (except for COUNT).

Differences Between Aggregate and Analytic Functions

Function Type Outputs Rows or Windows Used Notes
Aggregate Functions One output row per group of input rows All output columns are calculated over the same window or same group of rows COUNT DISTINCT is not allowed in streaming aggregation. That is, statements of the following type are not allowed: SELECT COUNT(DISTINCT x)… FROM… GROUP BY…
Analytic Functions One output row for each input row. Each output column may be calculated using a different window or partition. COUNT DISTINCT can be used as an analytic function and in windowed aggregation.

Streaming Aggregation and Rowtime bounds

Normally, an aggregate query generates a result when a row arrives that changes the value of the monotonically increasing expression in the GROUP BY. For example, if the query is grouped by FLOOR(rowtime TO MINUTE ), and the rowtime of the current row is 9:59.30, then a new row with a rowtime of 10:00.00 will trigger the result.

Alternately, you can use a rowtime bound to advance the monotonically increasing expression and enable the query to return a result. For example, if the query is grouped by FLOOR(rowtime TO MINUTE ), and the rowtime of the current row is 9:59.30, then an incoming rowtime bound of 10:00.00 will also trigger the result.

See below for examples of rowtime bounds in streaming aggregation, and the overview of rowtime bounds in the Concepts Guide for more information on rowtime bounds in general.

Examples: Aggregate queries on tables

Assume the following data in table SALES.EMPS:

NAME DEPTNO AGE GENDER
Amy 10 30 F
Beth 20 25 F
Charles 30 40 M
David 10 20 M
Eric 10 40 M
Fiona 30 25 F
Gregory 40 60 M
Harriet 40 55 F

The following SQL uses the AVG aggregate function as part of a query to find the average age of all employees:

SELECT
AVG(AGE) AS AVERAGE_AGE
FROM SALES.EMPS;

Result:

| AVERAGE_AGE |
| 38 |

To find the average age of employees in each department, we can add an explicit GROUP BY clause to the query:

SELECT DEPTNO, AVG(AGE) AS AVERAGE_AGE
FROM SALES.EMPS
GROUP BY DEPTNO;

Returns:

DEPTNO AVERAGE_AGE
10 30
20 25
30 40
40 57

Examples: Aggregate Queries on Streams (Streaming Aggregation)

Assume the following information flowing through the stream WEATHERSTREAM:

ROWTIME CITY TEMP
2018-11-01 01:00:00.0 Denver 29
2018-11-01 01:00:00.0 Anchorage 2
2018-11-01 06:00:00.0 Miami 65
2018-11-01 07:00:00.0 Denver 32
2018-11-01 09:00:00.0 Anchorage 9
2018-11-01 13:00:00.0 Denver 50
2018-11-01 17:00:00.0 Anchorage 10
2018-11-01 18:00:00.0 Miami 71
2018-11-01 19:00:00.0 Denver 43
2018-11-02 01:00:00.0 Anchorage 4
2018-11-02 01:00:00.0 Denver 39
2018-11-02 07:00:00.0 Denver 46
2018-11-02 09:00:00.0 Anchorage 3
2018-11-02 13:00:00.0 Denver 56
2018-11-02 17:00:00.0 Anchorage 2
2018-11-02 19:00:00.0 Denver 50
2018-11-03 01:00:00.0 Denver 36
2018-11-03 01:00:00.0 Anchorage 1

If you want to find the minimum and maximum temperature recorded anywhere each day (globally regardless of city), the minimum and maximum temperature can be calculated using the aggregate functions MIN and MAX respectively. To indicate that we want this information on a per-day basis (and to provide a monotonically increasing expression as the argument of the GROUP BY clause), we use the FLOOR function to round each row’s rowtime down to the nearest day:

SELECT STREAM
FLOOR(WEATHERSTREAM.ROWTIME to DAY) AS FLOOR_DAY,
MIN(TEMP) AS MIN_TEMP,
MAX(TEMP) AS MAX_TEMP
FROM WEATHERSTREAM
GROUP BY FLOOR(WEATHERSTREAM.ROWTIME TO DAY);

The result of the aggregate query is:

FLOOR_DAY MIN_TEMP MAX_TEMP
2018-11-01 00:00:00.0 2 71
2018-11-02 00:00:00.0 2 56

Note that there is no row for 2018-11-03, even though the example data does include temperature measurements on that day. This is because the rows for 2018-11-03 cannot be aggregated until all rows for that day are known to have arrived, and that will only happen when either a row with a rowtime of 2018-11-04 00:00:00.0 (or later) or a rowtime bound of 2018-11-04 00:00:00.0 (or later) arrives.If and when either did arrive, the next result would be:

FLOOR_DAY MIN_TEMP MAX_TEMP
2018-11-03 00:00:00.0 1 36

Let’s say that instead of finding the global minimum and maximum temperatures each day, we want to find the minimum, maximum, and average temperature for each city each day. To do this, we use the SUM and COUNT aggregate functions to compute the average, and add CITY to the GROUP BY clause:

SELECT STREAM
FLOOR(WEATHERSTREAM.ROWTIME TO DAY) AS FLOOR_DAY,
CITY,
MIN(TEMP) AS MIN_TEMP,
MAX(TEMP) AS MAX_TEMP,
SUM(TEMP)/COUNT(TEMP) AS AVG_TEMP
FROM WEATHERSTREAM
GROUP BY FLOOR(WEATHERSTREAM.ROWTIME TO DAY), CITY;

The result of the aggregate query is:

FLOOR_DAY CITY MIN_TEMP MAX_TEMP AVG_TEMP
2018-11-01 00:00:00.0 Anchorage 2 10 7
2018-11-01 00:00:00.0 Denver 29 50 38
2018-11-01 00:00:00.0 Miami 65 71 68
2018-11-02 00:00:00.0 Anchorage 2 4 3
2018-11-02 00:00:00.0 Denver 39 56 47

In this case, the arrival of rows for a new day’s temperature measurements triggers the aggregation of the previous day’s data, grouped by CITY, which then results in one row being produced per city included in the day’s measurements.

Here again, a rowtime bound 2018-11-04 00:00:00.0 could be used to prompt a result for 2018-11-03 prior to any actual measurements for 2018-11-04 coming in:

FLOOR_DAY CITY MIN_TEMP MAX_TEMP AVG_TEMP
2018-11-03 00:00:00.0 Anchorage 1 1 1
2018-11-03 00:00:00.0 Denver 36 36 36

Using the Naive Bayes Classifier with s-Server Aggregate Functions

Naive Bayes classifiers are used as part of machine learning. Using a collection of features assumed to have strong (naive) independence, these classifiers use Bayes’s theorem to assign class labels to problem instances.

For more information, see https://en.wikipedia.org/wiki/Naive_Bayes_classifier

This topic describes how you can implement the Naive Bayes Classifier using SQL aggregate functions that are computed on sliding time windows to perform Naive Bayes classification on streaming data with no latency.

Example: Sex Classification

This topic uses the example described at [ https://en.wikipedia.org/wiki/Naive_Bayes_classifier# Sex_classification](https://en.wikipedia.org/wiki/Naive_Bayes_classifier# Sex_classification)

This example performs sex classification based on height, weight and the foot size of each person. These factors are assumed to have strong independence.

Training

To use a Naive Bayes classifier in s-Server, you both train and test the classifier on streaming data on a continuous basis.

Training the classifier involves computing the average and standard deviation for each of height, weight & foot size for males as well as females.

Shown below are examples SQL for average and standard deviations.

SELECT STREAM *,
AVG(CASE WHEN gender = 'M' THEN height ELSE NULL END) OVER w AS avg_male_height,
STDDEV_SAMP(CASE WHEN gender = 'M' THEN height ELSE NULL END) OVER w stddev_male_height,
AVG(CASE WHEN gender = 'F' THEN height ELSE NULL END) OVER w AS avg_female_height,
STDDEV_SAMP(CASE WHEN gender = 'F' THEN height ELSE NULL END) OVER w stddev_female_height,
AVG(CASE WHEN gender = 'M' THEN weight ELSE NULL END) OVER w AS avg_male_weight,
STDDEV_SAMP(CASE WHEN gender = 'M' THEN weight ELSE NULL END) OVER w stddev_male_weight,
AVG(CASE WHEN gender = 'F' THEN weight ELSE NULL END) OVER w AS avg_female_weight,
STDDEV_SAMP(CASE WHEN gender = 'F' THEN weight ELSE NULL END) OVER w stddev_female_weight,
AVG(CASE WHEN gender = 'M' THEN foot_size ELSE NULL END) OVER w AS avg_male_foot_size,
STDDEV_SAMP(CASE WHEN gender = 'M' THEN foot_size ELSE NULL END) OVER w stddev_male_foot_size,
AVG(CASE WHEN gender = 'F' THEN foot_size ELSE NULL END) OVER w AS avg_female_foot_size,
STDDEV_SAMP(CASE WHEN gender = 'F' THEN foot_size ELSE NULL END) OVER w stddev_female_foot_size

FROM data_stream
-- WINDOW w AS(RANGE INTERVAL '1' DAY PRECEDING) -- limited training data size
WINDOW w AS (RANGE UNBOUNDED PRECEDING); -- unbounded window means entire history of the data stream is the training data size.

Each streaming row generates up-to-date values for AVG and STDDEV_SAMP that are based on the complete history of the data stream.

Testing

To use Naive Bayes, you take the data above and plug it into the formulae below, where:

Where:

𝛔 is STDDEV_SAMP() calculated above for each sex. and 𝞵 is AVG() calculated above for each sex.

Similarly for females

Evidence is calculated as follows

And