Context, problem, and forces
Our reactive, cloud-native systems are composed of bounded isolated components that rely on event streaming for inter-component communication. We have chosen to leverage value-added cloud services to implement our event streaming and stream processors. This empowers self-sufficient, full-stack teams to focus their efforts on the requirements of their components, but stream processor logic will still encounter bugs because developers are human. We endeavor to eliminate all inter-component synchronous communication, but stream processors ultimately need to perform intra-component synchronous communication with component resources. These resources can become unavailable for brief or extended periods.
Stream processors consume and process events in micro-batches and create checkpoints to mark their progress through the stream of events. When an unhandled error is encountered, a stream processor does not advance the checkpoint; it retries at the same position in the stream until it either succeeds or the bad event expires from the stream. For intermittent errors, this is likely the appropriate behavior, because it affords the failing resource time to self-heal and thus minimizes manual intervention. Even if the resource frequently produces intermittent errors, the processor will at least be making forward progress, provided the volume can still be processed before the events expire. However, if the error is unrecoverable, such as a bug, then the problem will not self-heal and the processor will retry until the event expires, after one to seven days. At this point, the processor will drop the failing event along with any other events that expire before the processor is able to catch up. Keep in mind that the data lake will be recording all events, so the events are not lost. But the processing of legitimate events is significantly held up, and manual intervention is necessary to replay all the dropped events.
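To make these mechanics concrete, here is a minimal sketch of a stream processor, assuming for illustration an AWS Lambda function subscribed to a Kinesis stream; the saveToComponentStore function is a hypothetical stand-in for a call to an intra-component resource. If that call throws and the error goes unhandled, the invocation fails, the checkpoint is not advanced, and the batch is retried.

```typescript
import { KinesisStreamEvent } from 'aws-lambda';

// Hypothetical stand-in for a synchronous call to an intra-component
// resource (e.g. a component-owned datastore) that may be unavailable.
const saveToComponentStore = async (payload: unknown): Promise<void> => {
  // domain-specific persistence logic would go here
};

export const handler = async (event: KinesisStreamEvent): Promise<void> => {
  for (const record of event.Records) {
    // Kinesis record data arrives base64 encoded.
    const payload = JSON.parse(
      Buffer.from(record.kinesis.data, 'base64').toString('utf8'),
    );

    // If this call throws and the error is not handled, the whole
    // invocation fails, the checkpoint is not advanced, and the entire
    // batch is retried until it succeeds or the records expire.
    await saveToComponentStore(payload);
  }
};
```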
Errors usually occur at the most inopportune time. When processing a micro-batch, some events in the batch will process successfully before a later event in the batch fails. In this case, if the error goes unhandled, then the entire batch is retried. This means that the events that were processed successfully will be reprocessed again and again until the problem is resolved and the checkpoint is advanced. There are plenty of approaches to minimize the probability of this situation, such as tuning the batch size, but none guarantees that it will not occur; many make the code overly complex and most have a negative impact on throughput.
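The toy simulation below illustrates this partial-batch failure; the record ids and the processRecord function are hypothetical, with record '4' standing in for an event that always triggers a bug. Because every retry starts from the beginning of the batch, records 1 through 3 are reprocessed on each attempt.

```typescript
// Hypothetical micro-batch of record ids.
const batch = ['1', '2', '3', '4', '5'];

const processRecord = async (id: string): Promise<void> => {
  if (id === '4') {
    // Simulates an unrecoverable bug encountered mid-batch.
    throw new Error(`bug while processing record ${id}`);
  }
  console.log(`processed record ${id}`);
};

const processBatch = async (records: string[]): Promise<void> => {
  for (const id of records) {
    await processRecord(id);
  }
};

const main = async (): Promise<void> => {
  for (let attempt = 1; attempt <= 3; attempt++) {
    try {
      await processBatch(batch);
      break; // success here is what would advance the checkpoint
    } catch (err) {
      // Records 1-3 were already processed, yet the whole batch retries.
      console.error(`attempt ${attempt} failed:`, (err as Error).message);
    }
  }
};

void main();
```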