Change Data Capture (CDC) With Embedded Debezium and Spring Boot

While working with data or replicating data sources, you probably have heard the term Change Data Capture (CDC). As the name suggests, “CDC” is a design pattern that continuously identifies and captures incremental changes to data. This pattern is used for real-time data replication across live databases to analytical data sources or read replicas. It can also be used to trigger events based on data changes, such as the OutBox pattern.

Most modern databases support CDC through transaction logs. A transaction log is a sequential record of all changes made to the database while the actual data is contained in a separate file.

What Is Debezium?

Debezium is a distributed platform built for CDC. It uses database transaction logs and creates event streams on row-level changes. Applications listening to these events can perform needed actions based on incremental data changes.

Debezium provides a library of connectors, supporting a variety of databases available today. These connectors can monitor and record row-level changes in the schemas of a database. They then publish the changes on to a streaming service like Kafka.

Normally, one or more connectors are deployed into a Kafka Connect cluster and are configured to monitor databases and publish data-change events to Kafka. A distributed Kafka Connect cluster provides the fault tolerance and scalability needed, ensuring that all the configured connectors are always running.

What Is Embedded Debezium?

Applications that don’t need the level of fault tolerance and reliability Kafka Connect offers or want to minimize the cost of using them to run the entire platform, can run Debezium connectors within the application. This is done by embedding the Debezium engine and configuring the connector to run within the application. On data change events, the connectors send them directly to the application.

Running Debezium With Spring Boot

Keeping the example simple, let’s have a Spring Boot application, “Student CDC Relay,” running embedded Debezium and tailing the transaction logs of a Postgres database that houses the “Student” table. The Debezium connector configured within the Spring Boot application invokes a method within the application when a database operation like Insert/Update/Delete are made on the “Student” table. The method acts on these events and syncs the data within the Student index on ElasticSearch.

Installing Tools

All the needed tools can be installed running the docker-compose file below. This starts the Postgres database on port 5432 and Elastic Search on port 9200(HTTP) and 9300(Transport).

version: “3.5”

services:
# Install postgres and setup the student database.
postgres:
container_name: postgres
image: debezium/postgres
ports:
— 5432:5432
environment:
— POSTGRES_DB=studentdb
— POSTGRES_USER=user
— POSTGRES_PASSWORD=password

# Install Elasticsearch.
elasticsearch:
container_name: elasticsearch
image: docker.elastic.co/elasticsearch/elasticsearch:6.8.0
environment:
— discovery.type=single-node
ports:
— 9200:9200
— 9300:9300

We use the image debezium/postgres because it comes prebuilt with the logical decoding feature. This is a mechanism that allows the extraction of changes that were committed to the transaction log, making CDC possible.

We use the image debezium/postgres because it comes prebuilt with the logical decoding feature. This is a mechanism that allows the extraction of changes that were committed to the transaction log, making CDC possible. The documentation for installing the plugin to Postgres can be found here.

Postgres Logical Decoding Concepts

47.2.1. Logical Decoding

Logical decoding is the process of extracting all persistent changes to a database’s tables into a coherent, easy to understand format which can be interpreted without detailed knowledge of the database’s internal state.

In PostgreSQL, logical decoding is implemented by decoding the contents of the write-ahead log, which describe changes on a storage level, into an application-specific form such as a stream of tuples or SQL statements.

47.2.2. Replication Slots

In the context of logical replication, a slot represents a stream of changes that can be replayed to a client in the order they were made on the origin server. Each slot streams a sequence of changes from a single database.

47.2.3. Output Plugins

Output plugins transform the data from the write-ahead log’s internal representation into the format the consumer of a replication slot desires.

47.2.4. Exported Snapshots

Understanding the Code

https://github.com/sohangp/embedded-debezium

The first step is to define the Maven dependencies in the pom.xml for debezium-embedded and debezium-connector. The sample reads the changes from Postgres, so we use the Postgres connector.

<dependency>
<groupId>io.debezium</groupId>
<artifactId>debezium-embedded</artifactId>
<version>${debezium.version}</version>
</dependency>
<dependency>
<groupId>io.debezium</groupId>
<artifactId>debezium-connector-postgres</artifactId>
<version>${debezium.version}</version>
</dependency>

Then, we configure the connector, which listens to changes on the Student table. We use the class PostgresConnector for the connector.class setting, which is provided by Debezium. This is the name of the Java class for the connector, which tails the source database.

More details on https://dzone.com/articles/change-data-capture-cdc-with-embedded-debezium-and

Developer, melomaniac, philospher, tea lover