High Availability through Parallel Redundancy and Checkpointing

Guavus SQLstream enables zero-latency recovery from scenarios where one or more processing nodes fail, such as when a server crashes. To achieve this, every pipeline is configured to run on at least three separate physical servers.

With at least three distinct nodes processing any stream operation, Guavus SQLstream accepts data from multiple producers, fires on the first available input, and discards straggling copies through deduplication. This redundancy and deduplication are possible because Guavus SQLstream is a declarative, deterministic system that synchronizes via timestamps. Computation proceeds by asynchronously “racing” each redundant pipeline against the others, accepting the first arrival at each node in the dataflow execution graph.
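
Conceptually, each node merges the redundant copies of its input and keeps only the first arrival. The deduplication itself happens inside the server, but the following illustrative sketch (the stream names "PipelineCopyA" and "PipelineCopyB" are hypothetical) shows the merge step in user-level streaming SQL:

-- Merge two redundant copies of the same logical stream. Because both
-- copies are deterministic and timestamp-aligned, the server can discard
-- the later-arriving duplicate of each row; the surviving rows are
-- identical regardless of which copy "wins" the race.
CREATE OR REPLACE VIEW "MergedInput" AS
    SELECT STREAM * FROM "PipelineCopyA"
    UNION ALL
    SELECT STREAM * FROM "PipelineCopyB";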

See the topic Adding and Removing Processing Nodes for Kafka in the Integrating with Other Systems Guide for more details.

Automatic Recovery Through Checkpointing

Guavus SQLstream provides checkpointing capabilities for streaming data processing, leveraging transacted source positions from sources such as Apache Kafka, AWS Kinesis, and ActiveMQ, as well as the transactional capabilities of a variety of external databases. Apache Kafka, AWS Kinesis, and AMQP each support subscription to a message queue with “starting_time” set to a given “message timestamp”. SQLstream promotes this “message timestamp” for each message as the ROWTIME of the corresponding row. The ROWTIME column acts as a “natural watermark” in the Guavus SQLstream server.
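
Because ROWTIME is a system column on every stream, the promoted message timestamp can be observed directly. A minimal sketch, querying the foreign stream "USAGE_DETAIL_LOG_STREAM" defined in the example below:

-- Each row's ROWTIME carries the Kafka message timestamp promoted by the
-- adapter, and therefore advances as the natural watermark of the stream.
SELECT STREAM ROWTIME, transaction_id, amt
FROM "USAGE_DETAIL_LOG_STREAM";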

Commits to the sinks (for example, data lakes) of streaming pipelines are made at ROWTIME boundaries. During recovery, “starting_time” for Apache Kafka is set to one millisecond past the last ROWTIME committed to the sink of the streaming pipeline. This ensures “exactly once” semantics: no committed row is re-emitted, and no row is skipped.

The following example shows streaming access to data from Kafka messages, with watermarks checkpointed to Microsoft SQL Server.

CREATE OR REPLACE FOREIGN TABLE "WATERMARKTABLE"(
SINK_NAME VARCHAR(1024) NOT NULL,
POSITION_KEY VARCHAR(2000) NOT NULL
)
SERVER "SQLServer_JDBC"
OPTIONS (
TRANSACTION_ROWTIME_LIMIT '0',
TRANSACTION_ROW_LIMIT '1',
SCHEMA_NAME 'dbo',
TABLE_NAME 'SQLSTREAM_WATERMARKS'
);

-- The data sink in Microsoft SQL Server.
CREATE OR REPLACE FOREIGN TABLE "ForeignDailyTransactions" (
    transaction_time TIMESTAMP,
    transaction_id BIGINT,
    rate DOUBLE,
    amt DECIMAL(14,2)
)
SERVER "SQLServer_JDBC"
OPTIONS (
    TRANSACTION_ROWTIME_LIMIT '1000',
    TRANSACTION_ROW_LIMIT '0',
    SCHEMA_NAME 'dbo',
    TABLE_NAME 'DailyTransactionsByUsage'
);

-- The streaming data source, a Kafka topic.
CREATE OR REPLACE FOREIGN STREAM "USAGE_DETAIL_LOG_STREAM" (
    transaction_time TIMESTAMP,
    transaction_id BIGINT,
    rate DOUBLE,
    amt DECIMAL(14,2),
    SQLSTREAM_POSITION_KEY VARCHAR(512)
)
SERVER "KafkaReaderServer"
OPTIONS (
    TOPIC 'test',
    "SEED_BROKERS" 'localhost',
    "PORT" '9092',
    WATERMARKED_SINK '"test"."ForeignDailyTransactions"',
    -- Read the last committed watermark from SQL Server
    -- as STARTING_TIME for the Kafka subscription.
    WATERMARKS 'LOCALDB.test.WATERMARKTABLE',
    PARSER 'CSV'
);
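
The statements above declare the source, the sink, and the watermark table, but not the pipeline that moves data between them. A minimal pipeline sketch follows; the pump name and the schema "test" are illustrative, and it assumes that, with the WATERMARKED_SINK and WATERMARKS options configured on the source, the server persists the position key to the watermark table as part of each sink commit:

-- A pump that copies rows from the Kafka source to the SQL Server sink.
-- The pump name and schema are hypothetical.
CREATE OR REPLACE PUMP "test"."DailyTransactionsPump" STOPPED AS
    INSERT INTO "ForeignDailyTransactions"
        (transaction_time, transaction_id, rate, amt)
    SELECT STREAM transaction_time, transaction_id, rate, amt
    FROM "USAGE_DETAIL_LOG_STREAM";

-- Start the pipeline.
ALTER PUMP "test"."DailyTransactionsPump" START;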

“SQLSTREAM_POSITION_KEY” is a watermark column that is pushed through the stream processing pipeline. When the results of the stream processing pipeline are stored in the Microsoft SQL Server table “DailyTransactionsByUsage”, the watermark column is also saved to a table called “SQLSTREAM_WATERMARKS” in Microsoft SQL Server.

During recovery, the query on the foreign stream “USAGE_DETAIL_LOG_STREAM” starts from the “initial position” saved in the Microsoft SQL Server table “SQLSTREAM_WATERMARKS” at the time of the crash.
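
As an illustrative check (the exact contents of POSITION_KEY are adapter-specific), the last committed position can be inspected directly in SQL Server:

-- Run against SQL Server: one row per watermarked sink, holding the
-- position from which the Kafka subscription will restart after a crash.
SELECT SINK_NAME, POSITION_KEY
FROM dbo.SQLSTREAM_WATERMARKS;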