Author: Josh Patterson
Date: July 16th, 2018
In this blog article we give tips and tricks on the basics of using Apache Avro with Apache Kafka. Topics include:
The Apache Avro website describes Avro as a data serialization system providing rich data structures in a compact binary format (and more). Data in Avro is described by a language-independent schema (usually written in JSON), and records are generally serialized to binary files; many times the schema is embedded in the data files themselves. The Apache Kafka book goes on to state:
"Apache Avro, which is a serialization framework originally developed for Hadoop. Avro provides a compact serialization format; schemas that are separate from the message payloads and that do not require code to be generated when they change; and strong data typing and schema evolution, with both backward and forward compatibility."
Narkhede, Shapira, and Palino, "Kafka: The Definitive Guide"
We can see an example of an Avro schema below:
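(The schema that originally appeared here is not reproduced; the record below is a minimal illustrative example, and all of the names in it are made up.)

```json
{
  "namespace": "com.example.avro",
  "type": "record",
  "name": "Customer",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```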
We can see from the Avro example above that there are four top-level attributes (beyond the individual fields listed inside the "fields" entry): namespace, type, name, and fields.
The Avro project's website offers:
The Apache Kafka book sets forth serialization as a key part of building a robust data platform, as described in the excerpt below:
"A consistent data format is important in Kafka, as it allows writing and reading messages to be decoupled. When these tasks are tightly coupled, applications that subscribe to messages must be updated to handle the new data format, in parallel with the old format. Only then can the applications that publish the messages be updated to utilize the new format. By using well-defined schemas and storing them in a common repository, the messages in Kafka can be understood without coordination."
Narkhede, Shapira, and Palino, "Kafka: The Definitive Guide"
While Protobuf and raw JSON are considered popular for Kafka in some circles1, many large enterprises prefer Avro as the serialization framework of choice to use with Apache Kafka, as outlined in Jay Kreps' blog article. Jay makes the argument for Avro on Confluent's platform (their distribution of Apache Kafka) based on the reasons he lays out in that post.
So now that we've made the argument for using Avro for serialization on Kafka, we need to dig into the "how" part of doing this. Avro has many subtleties to it, and saying "just use Avro" can prove daunting to new Kafka users. This article is meant to provide some notes on basic usage of Avro across producers, consumers, and streaming applications on Kafka. The key aspects of Avro usage in Kafka we'll focus on in this article are:
I felt a bit of deja vu when reading Gwen Shapira's story about building ETL pipelines for Data Warehouse Offloading while working on Hadoop projects (side note: we both worked in the field division at Cloudera previously). Schema-on-read (or really "whatever schema was embedded in the MapReduce code...") was the theme of the times (around 2010), and the world needed flexibility to handle the masses of heterogeneous types of data being offloaded into Hadoop. However, over time the world found that schema-on-read had some drawbacks and began to think about bringing order to the wild west of data pipelines. Enter the concept of the Schema Registry and schema management.
One of the nice aspects of the schema registry is that it watches messages passing through for new schemas and archives each new schema. Specifically, the class KafkaAvroSerializer does this, as explained in the Confluent documentation:
"When sending a message to a topic t, the Avro schema for the key and the value will be automatically registered in the Schema Registry under the subject t-key and t-value, respectively, if the compatibility test passes."
The schema is sent to the schema registry so that it can be retrieved later by other processes needing to decode the payload of a message using the same schema. The schema registry acts as a centralized authority on schema management for the Kafka platform.
The serializer used to produce messages to the Kafka cluster needs to match the deserializer that will be used when consuming them (otherwise bad things happen and exceptions are thrown). The behind-the-scenes details of storing the schema in the registry and pulling it up when required are handled by Kafka's serializers and deserializers (which is pretty handy). Having to manually keep track of schemas when dealing with lots of concurrent streams on a busy Kafka cluster quickly becomes a challenge. Leveraging Avro in conjunction with the Confluent Schema Registry easily solves a lot of issues that most enterprises don't take much time to think about (side note: enterprise data platforms commonly overlook small issues like these, which, ironically enough, can end up taking most of the engineering time).
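To make that concrete, below is a minimal sketch of the serializer and deserializer configuration on each side. The broker address, Schema Registry URL, and group id (localhost:9092, http://localhost:8081, avro-demo) are illustrative assumptions, not values from the original post.

```java
import java.util.Properties;

public class AvroSerdeConfig {

    // Producer side: KafkaAvroSerializer encodes the Avro payload and registers
    // the schema with the Schema Registry the first time it is seen.
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");
        return props;
    }

    // Consumer side: the deserializer must match, pointed at the same registry,
    // so it can look up the schema that was used to write each message.
    static Properties consumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "avro-demo");
        props.put("key.deserializer", "io.confluent.kafka.serializers.KafkaAvroDeserializer");
        props.put("value.deserializer", "io.confluent.kafka.serializers.KafkaAvroDeserializer");
        props.put("schema.registry.url", "http://localhost:8081");
        // Return generated SpecificRecord objects instead of GenericRecord
        props.put("specific.avro.reader", "true");
        return props;
    }
}
```

The serializer embeds the registered schema id in each message, and the deserializer uses that id to fetch the matching schema from the registry, which is why both sides must point at the same registry.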
So we've established a solid argument for not only using Avro on Kafka but also basing our schema management on the Confluent Schema Registry. Now let's take a look at patterns for Avro schema design and then the two ways to encode messages with Avro for Kafka: Generic Records and Specific Records.
We'll briefly highlight a few notes from this article about best practices for Avro schema design relative to usage in Kafka:
The Avro SpecificRecord API requires us to know at compile time what all of our schema fields will be, unlike the Avro GenericRecord API. If we need more flexibility at runtime, then we should likely consider the GenericRecord API instead. A high-level comparison between the two Avro API paths:
Below is a quick guide with examples of each API for the three types of Kafka processing nodes (producers, consumers, and streaming applications) commonly used with Avro. We're listing this out explicitly here because it's not always clear which API to use and where you will need the .avsc file (or not).
In this example we see a basic producer that is using the SpecificRecord API and the Maven Avro plugin to generate the Avro message class at compile time from the included .avsc file shown below:
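(The .avsc file from the original post is not reproduced here; the sketch below shows what a LogLine-style schema could look like, and the field names are assumptions rather than the original ones.)

```json
{
  "namespace": "JavaSessionize.avro",
  "type": "record",
  "name": "LogLine",
  "fields": [
    {"name": "ip", "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "url", "type": "string"}
  ]
}
```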
In the code below we see the producer code reading log lines from a file on disk and producing Kafka messages to a Kafka cluster, using the class (JavaSessionize.avro.LogLine) generated at compile time from the .avsc file shown above.
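(The original listing is not reproduced here either; the following is a minimal sketch that assumes the hypothetical LogLine fields above, a local broker and Schema Registry, and a hypothetical access.log input file and clicks topic.)

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

import JavaSessionize.avro.LogLine; // SpecificRecord class generated by the Avro Maven plugin

public class LogLineProducer {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // The Confluent Avro serializer handles schema registration and encoding
        props.put("key.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        Producer<String, LogLine> producer = new KafkaProducer<>(props);

        try (BufferedReader reader = new BufferedReader(new FileReader("access.log"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Build the generated SpecificRecord; setters are type-checked at compile time
                LogLine event = LogLine.newBuilder()
                        .setIp("127.0.0.1")                       // parsing of the raw line omitted
                        .setTimestamp(System.currentTimeMillis())
                        .setUrl(line)
                        .build();
                producer.send(new ProducerRecord<>("clicks", event));
            }
        } finally {
            producer.close();
        }
    }
}
```

A few lines are worth noting: the value.serializer and schema.registry.url settings are what tie the producer to the Schema Registry, and the LogLine schema is registered automatically the first time a record is sent to the topic.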
Now let's take a look at using generic Avro objects rather than generated Avro objects in the context of a Kafka producer example.
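Again, the original example is not reproduced here; the sketch below shows the same hypothetical LogLine data produced with the GenericRecord API, where the schema is parsed from a string at runtime and no generated class is required.

```java
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class GenericLogLineProducer {

    public static void main(String[] args) {
        // With GenericRecord the schema is loaded or defined at runtime instead of compile time
        String schemaString = "{\"namespace\": \"JavaSessionize.avro\", \"type\": \"record\", "
                + "\"name\": \"LogLine\", \"fields\": ["
                + "{\"name\": \"ip\", \"type\": \"string\"},"
                + "{\"name\": \"timestamp\", \"type\": \"long\"},"
                + "{\"name\": \"url\", \"type\": \"string\"}]}";
        Schema schema = new Schema.Parser().parse(schemaString);

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        Producer<String, GenericRecord> producer = new KafkaProducer<>(props);
        try {
            // Fields are set by name, so typos surface only at runtime
            GenericRecord event = new GenericData.Record(schema);
            event.put("ip", "127.0.0.1");
            event.put("timestamp", System.currentTimeMillis());
            event.put("url", "/index.html");

            producer.send(new ProducerRecord<>("clicks", event));
        } finally {
            producer.close();
        }
    }
}
```

The trade-off mirrors the comparison above: the GenericRecord path avoids the compile-time code generation step, at the cost of losing compile-time checking of field names and types.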
In this article we started out by introducing the basic concepts of Avro, made the case for using Avro with the Confluent Schema Registry as a best practice for Kafka, and then provided some best practices for SpecificRecord vs. GenericRecord API usage. We hope this information helps as an introduction to these concepts for any new Kafka practitioners out there.