Working with Partitioned Sources

Some sources may feed pipelines that are distributed across multiple nodes in a federation of s-Server instances. (You can achieve better performance and throughput by carefully distributing state, such as aggregation state, across duplicated pipelines.)

In these cases, if the pipeline contains any stateful operations, such as aggregations, joins, or custom Java functions, then the developer needs to ensure either that the state is properly partitioned across all instances of the pipeline, or that the downstream pipeline is configured to roll up aggregations to produce the pipeline's final result.
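One common way to keep stateful operations correct across duplicated pipelines is to route every record to a single pipeline instance based on its key, so that all state for a given key lives on exactly one instance. The sketch below illustrates the idea in Python; the instance count, routing function, and record layout are illustrative assumptions, not part of s-Server or StreamLab.

```python
from collections import defaultdict

NUM_INSTANCES = 2  # hypothetical number of duplicated pipeline instances

def instance_for(key: str) -> int:
    # Stable hash so the same key always routes to the same instance.
    return sum(key.encode()) % NUM_INSTANCES

# Per-instance aggregation state: key -> running count.
state = [defaultdict(int) for _ in range(NUM_INSTANCES)]

records = [("sensor-a", 1), ("sensor-b", 5), ("sensor-a", 2), ("sensor-b", 3)]
for key, _value in records:
    state[instance_for(key)][key] += 1

# Because each key is owned by exactly one instance, its count there is
# already complete -- no downstream roll-up is needed.
counts = {k: v for inst_state in state for k, v in inst_state.items()}
```

If records for the same key could land on different instances, each instance would hold only a partial count, and a downstream roll-up step would be required instead.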

Note: StreamLab does not currently support partitioned access for sources such as files and RDBMS sources.

The distributed option works seamlessly for stateless operations, such as CAST or Rename.

A stateful distributed pipeline can use StreamLab templates that involve stateful operations (such as aggregation) as long as the application developer takes responsibility for ensuring the proper distribution of the aggregation state. The distributed pipelines option is intended for advanced SQL developers who know how to run stateful aggregations in a distributed manner.

Best practice is to set up stateless ingestion/enrichment pipelines for non-partitioned and/or non-rewindable sources (files, RDBMS, sockets, WebSockets, HTTP), and then egress the ingested and/or enriched, time-sorted data through partitioned Kafka topics for time-series analytics such as aggregations and stream-stream joins.

If you are doing aggregations or other operations that involve making calculations across data from the entire Kafka topic, you will need to either:

  1. Route results of duplicated pipelines to a native stream sink and then create a new pipeline using this sink as a source. You can then perform aggregations in the new pipeline.
  2. Perform partial aggregations through distributed pipelines, route the partially aggregated results into a native stream, and add roll-up aggregation steps in a new pipeline.
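The second option can be sketched as follows: each duplicated pipeline emits mergeable partial aggregates (here, a sum and a count rather than a finished average) into a shared stream, and a new downstream pipeline rolls the partials up into the final result. This is a conceptual Python sketch under assumed function names, not s-Server SQL.

```python
def partial_aggregate(partition):
    # Runs inside each duplicated pipeline: emit mergeable partials
    # (sum, count), never the finished average.
    return (sum(partition), len(partition))

def roll_up(partials):
    # Runs in the new pipeline fed by the native stream sink:
    # merge the partials, then finish the aggregation.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Three duplicated pipelines, each seeing a slice of the Kafka topic.
partitions = [[10.0, 20.0], [30.0], [40.0, 50.0, 60.0]]
partials = [partial_aggregate(p) for p in partitions]
average = roll_up(partials)  # 210.0 / 6 == 35.0
```

Note that averaging the per-pipeline averages would be wrong here (the partitions have different sizes), which is why the partials must stay mergeable until the roll-up step.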