PHPFixing

Monday, November 14, 2022

[FIXED] What is the best practice to retry messages from Dead letter Queue for Kafka

 November 14, 2022 · apache-kafka, error-handling

Issue

We are using Kafka as the messaging system between microservices. We have a Kafka consumer listening to a particular topic, which then publishes the data into another topic to be picked up by a Kafka connector that is responsible for publishing it into some data storage.

We are using Apache Avro as serialization mechanism.

We need to enable a dead letter queue (DLQ) to add fault tolerance to the Kafka consumer and the Kafka connector.

A message could move to the DLQ for multiple reasons:

  1. Bad format
  2. Bad data
  3. Throttling under a high volume of messages
  4. Publishing to the data store failed due to connectivity issues

For points 3 and 4 above, we would like to retry the messages from the DLQ.

What is the best practice here? Please advise.


Solution

Only push to the DLQ records that cause non-retryable errors, i.e. point 1 (bad format) and point 2 (bad data) in your example. For the format of the DLQ records, a good approach is to:

  • push to the DLQ the exact same Kafka record value and key as the original; do not wrap it inside any kind of envelope. This makes it much easier to reprocess with other tools during troubleshooting (e.g. with a new version of a deserializer or so).
  • add a set of Kafka headers to communicate metadata about the error; a few typical examples would be:
    • original topic name, partition, offset and Kafka timestamp of the record
    • exception or error message
    • name and version of the application that failed to process the record
    • time of the error
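As a concrete sketch of the header idea above, assuming the common Python client convention of headers as a list of (key, bytes) pairs. The `dlq.*` header names are illustrative, not a standard; pick any convention and keep it consistent across your services:

```python
import datetime

def build_dlq_headers(topic, partition, offset, timestamp_ms, error,
                      app_name, app_version):
    """Build Kafka headers describing why a record was sent to the DLQ.

    The original record key and value are produced to the DLQ unchanged;
    all error metadata travels in these headers.
    """
    headers = {
        "dlq.original.topic": topic,
        "dlq.original.partition": str(partition),
        "dlq.original.offset": str(offset),
        "dlq.original.timestamp": str(timestamp_ms),
        "dlq.error.message": str(error),
        "dlq.app.name": app_name,
        "dlq.app.version": app_version,
        "dlq.error.time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    # Kafka headers are (str, bytes) pairs.
    return [(k, v.encode("utf-8")) for k, v in headers.items()]
```

You would then produce to the DLQ with the original key and value untouched, e.g. `producer.produce("orders-service.dlq", key=msg.key(), value=msg.value(), headers=build_dlq_headers(...))` (where `orders-service.dlq` is a hypothetical topic name).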

Typically I use a single DLQ topic per service or application (not one per inbound topic, and not one shared across services). That tends to keep things independent and manageable.

Oh, and you probably want to put some monitoring and alert on the inbound traffic to the DLQ topic ;)

Point 3 (high volume) should, IMHO, be dealt with by some sort of auto-scaling, not with a DLQ. Always over-estimate (a bit) the number of partitions of the input topic, since the maximum number of instances of your service you can start is limited by that. A high volume of messages is not going to overload your service, since Kafka consumers explicitly poll for more messages when they decide to, so they never ask for more than the app can process. If there is a peak of messages, they simply keep piling up in the upstream Kafka topic.
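To illustrate why the partition count caps scaling, here is a simplified sketch of round-robin partition assignment (the real consumer-group rebalance protocol is more involved, but the upper bound is the same):

```python
def assign_partitions(num_partitions, consumer_ids):
    """Assign each partition to exactly one consumer, round-robin.

    Any consumer beyond the partition count ends up with nothing to do,
    which is why it pays to over-provision partitions a bit up front.
    """
    assignment = {c: [] for c in consumer_ids}
    for p in range(num_partitions):
        owner = consumer_ids[p % len(consumer_ids)]
        assignment[owner].append(p)
    return assignment
```

With 6 partitions and 8 consumer instances, two instances sit idle no matter how hard you scale out.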

Point 4 (connectivity) should be retried directly from the source topic, without any DLQ involved, since the error is transient. Dropping the message to a DLQ and picking up the next one is not going to solve anything: the connectivity issue will still be present, and the next message would likely be dropped as well. Reading (or not reading) a record from Kafka does not make it go away, so a record stored there is easy to read again later. You can program your service to move on to the next inbound record only once it has successfully written the resulting record to the outbound topic (see Kafka transactions: reading a topic actually involves a write operation, since the new consumer offsets need to be persisted, so you can tell your program to persist the new offsets and the output records as part of the same atomic transaction).
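A minimal sketch of retrying a transient failure in place, before committing the offset and moving on to the next record. Blocking the poll loop like this is safe precisely because the record stays in Kafka. The operation being retried (e.g. a hypothetical `publish_to_store` call) is a placeholder:

```python
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry a transient operation with exponential backoff.

    Re-raises the last exception if all attempts fail; only then is the
    record a candidate for escalation (alerting, manual replay) rather
    than a DLQ drop-and-continue.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

In the consumer loop you would call something like `retry_with_backoff(lambda: publish_to_store(record))` and commit the offset only after it returns.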

Kafka is more like a storage system (with just two operations: sequential reads and sequential writes) than a messaging queue; it is good at persistence, data replication, throughput, scale... (...and hype ;) ). It tends to be really good at representing data as a sequence of events, as in "event sourcing". If this microservice setup mostly needs asynchronous point-to-point messaging, and if most scenarios favor super-low latency and would rather drop messages than reprocess old ones (as the 4 points listed seem to suggest), maybe a lossy in-memory queuing system like Redis queues is more appropriate?



Answered By - Svend
Answer Checked By - Terry (PHPFixing Volunteer)

