3 Methods for Implementing Change Data Capture

Janaka Ekanayake 30 Jul, 2021 • 6 min read
This article was published as a part of the Data Science Blogathon

MsSQL | Oracle | Triggers and Short procedures

Change Data Capture 1
Photo by Edward Howell on Unsplash
If you are familiar with Change Data Capture AKA Change Data Tracking, you can just skip the introduction below and get straight to the implementation section. But, in case you’re not, let me introduce the concept first.

American Writer Mark Twain once said: “Data is like garbage”. Today some people say: data is the new oil. But I would say: data is a blooming flower.

Like a flower, data goes through different stages, and takes time to develop, or burst into bloom. Just as the time taken may differ from flower to flower, so different data sets develop at different rates. A single flower is less striking, but when it is part of a bouquet, it becomes really eye-catching. And so it is with data — that’s how companies like Google or Facebook are able to make a profit, even without charging for their services -they simply collect data ‘flowers’ and turn them into data ‘bouquets’.

However, this article is not about flowers 😊 so, let’s move right on to our topic. As I explained previously, the state of data is continuously changing over time. Change Data Capture (CDC) is a set of technologies that enable you to identify and capture the previous states of the data so that later, you have a snapshot of past data that you can refer to when taking necessary action. See the example below:

Change Data Capture in MsSQL
Illustration
With the increasing demand for big-data technologies, Microsoft introduced CDC with MsSQL Server 2008. Today, CDC is available in almost all popular database servers, including MsSQL, Oracle, CockroachDB, MongoDB, etc. But, manual implementation is just one way to get the job done. We can also automate the CDC using triggers and short procedures. In this article, I will discuss how to implement CDC both ways.

Why Change Data Capture is important

Yes, you guessed right, this technology is mostly used in the big-data domain to keep timely snapshots of streaming data in data warehouses. Here, we go through a process called ELT/ETL to insert our data into the data warehouse. This process is efficient with historic data, but when it comes to real-time data it causes too much latency to run complex queries. The easiest way to deal with real-time data is CDC, which enables us to keep our data warehouse up to date and make business decisions faster.

Let’s Get our Hands Dirty with Some Practical Examples

1. Implementing CDC with MsSQL

First, we need to create a new database.

CREATE DATABASE cdcDB

Next, I select the created database from the available databases dropdown, and create a few simple tables – just to demonstrate CDC implementation.

CREATE TABLE Student
(
StudentID int NOT NULL PRIMARY KEY,
FirstName varchar(255) NOT NULL,
LastName varchar(255) NOT NULL,
Age int,
ContactNo char(10) NOT NULL
)

After creating tables, I enable CDC.

EXEC sys.sp_cdc_enable_db

After CDC is enabled for the database, I then enable it on each table.

EXEC sys.sp_cdc_enable_table
@source_schema = 'dbo',
@source_name = 'Student',
@role_name = NULL,
@supports_net_changes = 1

Finally, I insert some values to my table by the following query.

INSERT INTO Student(StudentID,FirstName,LastName,Age,ContactNo)
VALUES
(10638389, 'Indrajith', 'Ekanayake', 21, 0713101658),
(10637382, 'Kamal', 'Suriyaarachchi', 22, 0765432210),
(10622388, 'Kasun', 'Chamara', 28, 0708998123),
(10638812, 'Chamara', 'Hettiarachchi', 20, 0772134446);

To confirm whether CDC has been properly implemented, we can just check either at our hierarchy panel or Jobs tab, under the SQL Server Agent.

To understand how and what data we are storing in the CDC tables, we can just update a few rows.

UPDATE Student
SET FirstName = 'Janaka', Age= '23'
WHERE StudentID = 10638389;

So, we are now looking at the CDC CT(Change Table) and we can identify that there are additional records, as seen below.

student id
CDC CT(Change Table)

2. Implementing CDC with Oracle

First, I need to mention that I’m running Oracle 11g Enterprise Edition using docker image and I’m writing queries using Oracle SQL Developer’s latest version on the host machine. The server and Oracle SQL developer are connected via port: 1521.

To implement CDC, I begin by creating a tablespace called “ts_cdcindrajith” in my cdcAssignment folder, inside my F drive.

create tablespace ts_cdcindrajith datafile 'F:cdcAssignment' size 300m;

Then, I create a new user called “cdcindrajith”, and grant all permissions to this user.

CREATE USER cdcindrajith IDENTIFIED by cdcindrajith DEFAULT TABLESPACE ts_cdcindrajith QUOTA UNLIMITED ON SYSTEM QUOTA UNLIMITED ON SYSAUX;
GRANT ALL PRIVILEGES TO cdcindrajith;

Next, I create a table called employees.

CREATE TABLE cdcindrajith.employees
(
EmpID int NOT NULL PRIMARY KEY,
EmpName varchar(255) NOT NULL,
Age int NOT NULL,
ContactNo char(10) NOT NULL
)

Then, I implement the CDC feature.

BEGIN
DBMS_CAPTURE_ADM.PREPARE_TABLE_INSTANTIATION(TABLE_NAME => 'cdcindrajith.employees');
END;

BEGIN
DBMS_CDC_PUBLISH.CREATE_CHANGE_SET(
change_set_name    => 'employees_set',
description        => 'Change set for employees change info',
change_source_name => 'SYNC_SOURCE');
END;

BEGIN
DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE(
owner             => 'cdcindrajith',
change_table_name => 'employees_ct',
change_set_name   => 'employees_set',
source_schema     => 'cdcindrajith',
source_table      => 'employees',
column_type_list  => 'EmpID int,
EmpName varchar(255) ,
Age int,
ContactNo char(10)',
capture_values    => 'both',
rs_id             => 'y',
row_id            => 'n',
user_id           => 'n',
timestamp         => 'n',
object_id         => 'n',
source_colmap     => 'y',
DDL_MARKERS =>       'n',
target_colmap     => 'y',
options_string    => 'TABLESPACE ts_cdcindrajith');
END;

I create a subscription called “employees_sub”.

BEGIN
DBMS_CDC_SUBSCRIBE.CREATE_SUBSCRIPTION(
change_set_name   => 'employees_set',
description       => 'Change data for employees',
subscription_name => 'employees_sub');
END;

After that, I create a View called “employees_view” and then activated the above-created subscription.

BEGIN
DBMS_CDC_SUBSCRIBE.SUBSCRIBE(
subscription_name => 'employees_sub',
source_schema     => 'cdcindrajith',
source_table      => 'employees',
column_list       => 'EmpID, EmpName, Age, ContactNo',
subscriber_view   => 'employees_view');
END;

BEGIN
DBMS_CDC_SUBSCRIBE.ACTIVATE_SUBSCRIPTION(
subscription_name => 'employees_sub');
END;

BEGIN
DBMS_CDC_SUBSCRIBE.EXTEND_WINDOW(
subscription_name => 'employees_sub');
END;

Finally, I insert data and also update some data, to check that the CDC has been correctly implemented.

INSERT INTO cdcindrajith.employees(EmpID,EmpName,Age,ContactNo)
VALUES
(10638389, 'Indrajith', 21, 0713101658);
INSERT INTO cdcindrajith.employees(EmpID,EmpName,Age,ContactNo)
VALUES
(10638390, 'Kumara', 27, 0711226661);
UPDATE cdcindrajith.employees
SET EmpName = 'Janaka'
WHERE EmpID = 10638389;

Towards the end, I check the results of my CDC table “employees_ct” to make sure that the CDC is correctly implemented.

SELECT EmpID, EmpName FROM cdcindrajith.employees_ct;
employee ct table
employees_ct table

3. Implementing CDC using Trigger (MsSQL)

First, we need to create a new database.

CREATE DATABASE empDB

Then, I create two tables: one is to maintain employee salary records, the other is to update the salary log, which does the change data capturing.

CREATE TABLE Salary
(
ID int identity(1,1) primary key NOT NULL,
SalDate datetime default GETDATE() NOT NULL,
Task BIGINT NULL,
PaidPerTask BIGINT NULL,
EmpName NCHAR(100) NOT NULL
)
CREATE TABLE SalaryLogs
(
ID INT PRIMARY KEY IDENTITY(1,1) NOT NULL,
SalDate DATETIME DEFAULT GETDATE() NOT NULL,
Query NCHAR(6) NOT NULL,
OldTask BIGINT NULL,
NewTask BIGINT NULL,
OldPaidPerTask BIGINT NULL,
NewPaidPerTask BIGINT NULL,
EmpName NCHAR(100) NOT NULL
)

I implement a Trigger called “salary_change” to save the salary logs.

GO
CREATE TRIGGER salary_change
ON Salary
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
DECLARE @operation CHAR(6)
SET @operation = CASE
WHEN EXISTS(SELECT * FROM inserted) AND EXISTS(SELECT * FROM deleted)
THEN 'Update'
WHEN EXISTS(SELECT * FROM inserted)
THEN 'Insert'
WHEN EXISTS(SELECT * FROM deleted)
THEN 'Delete'
ELSE NULL
END
IF @operation = 'Delete'
INSERT INTO SalaryLogs (Query, SalDate, OldTask, OldPaidPerTask, EmpName)
SELECT @operation, GETDATE(), d.Task, d.PaidPerTask, USER_Name()
FROM deleted d
IF @operation = 'Insert'
INSERT INTO SalaryLogs (Query, SalDate, NewTask, NewPaidPerTask, EmpName)
SELECT @operation, GETDATE(), i.Task, i.PaidPerTask, USER_Name()
FROM inserted i
IF @operation = 'Update'
INSERT INTO SalaryLogs (Query, SalDate, NewTask, OldTask, NewPaidPerTask,OldPaidPerTask, EmpName)
SELECT @operation, GETDATE(), d.Task, i.Task, d.PaidPerTask, i.PaidPerTask, USER_Name()
FROM deleted d, inserted i
END
GO

Finally, I insert some data into the salary table and then try to update and delete queries as well.

INSERT INTO Salary(SalDate,Task,PaidPerTask,EmpName)
VALUES
(2020-11-20, 4, 4000 , 'Indrajith'),
(2020-10-20, 11, 10000 , 'Kusum');
UPDATE Salary
SET EmpName = 'Janaka'
WHERE EmpName = 'Indrajith';
DELETE FROM Salary WHERE EmpName = 'Kusum';

Towards the end, let’s check the salary logs table to understand whether the log files are updated or not.

towards
SalaryLogs table
We see here that the log files are up to date.

Summary

In short, CDC identifies and captures data that has changed in tables of a source database as a result of CRUD operations. This is useful to people who need to export their data into a data warehouse or a business intelligence application. Changed data is maintained in CDC tables in the source database.

In this article, we briefly discussed what is Change Data Capture, why CDC is useful, and most importantly illustrated three methods for implementing CDC. I believe that the rapid evolution of big-data will have many more CDC implications in the future.

Thanks for reading!

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
Janaka Ekanayake 30 Jul 2021

Frequently Asked Questions

Lorem ipsum dolor sit amet, consectetur adipiscing elit,

Responses From Readers

Clear