(new in SQLstream s-Server version 7.2.1)
File-based T-Sort is a feature of the File-VFS source plugin that mimics the T-Sort mechanism in SQLstream for files instead of rows. It relies on a priority queue to sort files by implied timestamp as described below.
Term | Description |
---|---|
comparison_tuple | A (time, filename) pair whose time component depends on the value of SORT_FIELD. For MODIFIED_FILE_TIME it is (last modified time of the file, filename); for TIME_IN_FILENAME it is (timestamp extracted from the filename, filename) |
Thead | The comparison tuple of the file currently at the head of the queue |
Ttail | The comparison tuple of the file currently at the tail of the queue |
Tread | The comparison tuple of the last file popped from the queue, i.e. the file that was most recently read or is currently being read |
Tnew | The comparison tuple of the new file that is a candidate to be added to the queue |
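
To make the comparison_tuple definition concrete, here is a minimal Python sketch of how such a tuple could be built. It is an illustration only, not the plugin's implementation: the function name `comparison_tuple`, the regular expression, and the assumption that filenames carry a 4-digit HHMM stamp (as in file_0004.csv) are all hypothetical. Note that in this sketch the time component is in epoch seconds for MODIFIED_FILE_TIME and in minutes since midnight for TIME_IN_FILENAME.

```python
import os
import re

# Assumption for illustration: the filename carries a 4-digit HHMM stamp,
# e.g. "file_0004.csv" means 00:04. Real filename patterns may differ.
_HHMM = re.compile(r"_(\d{2})(\d{2})\.")


def comparison_tuple(path, sort_field):
    """Return (time, filename); ties on time are broken by filename."""
    name = os.path.basename(path)
    if sort_field == "MODIFIED_FILE_TIME":
        # Last modified time of the file, in seconds since the epoch.
        return (os.path.getmtime(path), name)
    if sort_field == "TIME_IN_FILENAME":
        match = _HHMM.search(name)
        if match is None:
            raise ValueError(f"no timestamp found in {name!r}")
        # Minutes since midnight, parsed from the hypothetical HHMM stamp.
        minutes = int(match.group(1)) * 60 + int(match.group(2))
        return (minutes, name)
    raise ValueError(f"unknown SORT_FIELD: {sort_field!r}")
```

Because Python compares tuples element by element, two files with the same time component are ordered by filename, which matches the (time, filename) ordering described in the table above.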
The T-Sort mechanism when INGRESS_DELAY_INTERVAL >= 0 is as follows:
Before a file is added to the queue, it is checked against the file currently being read: a new file that precedes that file (has an earlier timestamp/filename, so Tnew < Tread) is not added.
The file at the head of the queue is processed only when a file exists whose comparison_tuple time component is greater than or equal to the time component of the head file plus the INGRESS_DELAY_INTERVAL.
This system ensures that files are processed in the correct order and that any late-arriving files are either sorted into the right order (if they arrive within the delay interval) or dropped (if they are too late).
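
The two rules above can be sketched with a small priority queue. This is a simplified model under stated assumptions, not the File-VFS plugin's code: the class `FileTSort` and its methods `offer` and `poll` are hypothetical names, comparison tuples are plain `(time, filename)` pairs, and the "a file exists" test is taken to mean a file currently in the queue.

```python
import heapq


class FileTSort:
    """Simplified model of file-level T-sort over (time, filename) tuples."""

    def __init__(self, ingress_delay):
        self.ingress_delay = ingress_delay  # same time unit as the tuples
        self.heap = []       # min-heap, so heap[0] is Thead
        self.t_read = None   # Tread: tuple of the last file popped

    def offer(self, t_new):
        """Try to enqueue Tnew; return False if it is too late and dropped."""
        if self.t_read is not None and t_new < self.t_read:
            return False  # precedes the file being read: reject
        heapq.heappush(self.heap, t_new)
        return True

    def poll(self):
        """Pop Thead if a queued file is at least ingress_delay newer."""
        if not self.heap:
            return None
        t_head = self.heap[0]
        t_tail = max(self.heap)  # Ttail; O(n) scan is fine for a sketch
        if t_tail[0] >= t_head[0] + self.ingress_delay:
            self.t_read = heapq.heappop(self.heap)
            return self.t_read
        return None


# Usage: with a 10-minute delay, the head is held until a file arrives
# that is at least 10 minutes newer.
q = FileTSort(ingress_delay=10)
q.offer((0, "file_0000.csv"))
held = q.poll()            # nothing is released yet
q.offer((12, "file_0012.csv"))
released = q.poll()        # file_0000.csv is released
```

A binary heap keeps late files that arrive within the delay interval in time order without re-sorting the whole queue. A production version would track the tail incrementally rather than scanning the heap on every poll.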
Below is a simple example of how file-based T-Sort works. Assume that the INGRESS_DELAY_INTERVAL for this use case is 10 minutes and that, for a file file_0000.csv, the value of the tuple T is (00:00, file_0000.csv). In the table, the prior delay interval is the time difference between the files at the tail and the head of the queue.
Prior queue state (head, ... , tail) | Prior delay interval (minutes) | New file | New file added to queue? | Is a file read (popped) from queue? |
---|---|---|---|---|
<empty> | 0 | file_0000.csv | yes | |
file_0000.csv | 0 | file_0004.csv | yes | |
file_0000.csv, file_0004.csv | 4 | file_0008.csv | yes | |
file_0000.csv, file_0004.csv, file_0008.csv | 8 | file_0006.csv (late file) | yes, in time order | |
file_0000.csv, file_0004.csv, file_0006.csv, file_0008.csv | 8 | file_0012.csv | yes | |
file_0000.csv, file_0004.csv, file_0006.csv, file_0008.csv, file_0012.csv | 12 | | | yes - file_0000.csv can be popped as soon as file_0012.csv has arrived |
file_0004.csv, file_0006.csv, file_0008.csv, file_0012.csv | 8 | file_0002.csv (late file) | yes, in time order, at head of queue | |
file_0002.csv, file_0004.csv, file_0006.csv, file_0008.csv, file_0012.csv | 10 | | | yes - as soon as file_0002.csv is added to the queue, the delay interval becomes 10 minutes and the head of the queue, file_0002.csv, can be popped |
file_0004.csv, file_0006.csv, file_0008.csv, file_0012.csv | 8 | file_0001.csv (late file) | no - file_0001.csv is rejected because its timestamp is earlier than that of the most recently read file, file_0002.csv | |