Cloudera Operational Database Replication in a Nutshell

In a previous blog post, we provided a high-level overview of the Cloudera Replication Plugin and explained how it brings cross-platform replication with little configuration. In this post, we cover how the plugin can be applied in CDP clusters and how it enables strong authentication between systems that do not share mutual authentication trust.

Using Operational Database Replication Plugin

The Operational Database Replication Plugin is available both as a standalone plugin and as a component installed automatically by Cloudera Replication Manager. The plugin enables customers to set up near-real-time replication of HBase data from CDH/HDP/AWS EMR/Azure HDInsight clusters to CDP Private Cloud Base and/or CDP Operational Database (COD) in the Public Cloud. It is also deployed automatically when Cloudera Replication Manager is used to set up replication between CDP Private Cloud Base and COD, or between COD instances in the Public Cloud. Replication Manager can also combine the HBase snapshot feature with this plugin, so that replication of pre-existing data is managed in the same setup.

For installation instructions, refer to the HBase replication policy topic in the official Replication Manager documentation.

For legacy CDH/HDP versions, the plugin is provided as a parcel to be installed in the legacy cluster only. 

  • CDH 5.x
  • CDH 6.x
  • HDP 2.6
  • HDP 3.1
  • EMR 5.x & 6.x

The parcel is version-locked to the version-specific binaries. For each of the versions listed above, it must be acquired on a per-cluster basis. Contact your Cloudera sales team if you are interested in obtaining any of them.

Implementation Details

The problem solved by the Operational Database Replication Plugin is mutual authentication between clusters under different security configurations. As discussed in the previous blog post, default HBase replication requires that both clusters are either not configured for security at all, or are both configured with security. In the latter case, both clusters must either be in the same Kerberos realm or have cross-realm authentication set up in the Kerberos system. This is an extra challenge in the context of CDP, where each environment runs in a self-contained security realm. To understand this in more detail, we need to review how Apache HBase security is implemented.
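
The rule above can be written down as a small predicate. This is only an illustrative sketch, not HBase code: the `ClusterSecurity` type, its fields and the function name are hypothetical, and the realm names in the usage note are made up.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ClusterSecurity:
    """Hypothetical summary of a cluster's security setup (illustration only)."""
    kerberos_realm: Optional[str] = None          # None means security is disabled
    trusted_realms: FrozenSet[str] = frozenset()  # realms trusted via cross-realm setup


def default_replication_can_authenticate(source: ClusterSecurity,
                                         target: ClusterSecurity) -> bool:
    """Return True when the rule described above allows default replication."""
    source_secure = source.kerberos_realm is not None
    target_secure = target.kerberos_realm is not None
    if not source_secure and not target_secure:
        return True   # neither cluster is configured for security
    if source_secure != target_secure:
        return False  # one secured and one unsecured cluster is not allowed
    # Both are secured: same realm, or the target trusts the source realm.
    return (source.kerberos_realm == target.kerberos_realm
            or source.kerberos_realm in target.trusted_realms)
```

For example, a source in realm `ONPREM.EXAMPLE.COM` and a target in realm `CDP.EXAMPLE.COM` with no trust between them would return `False`, which is the situation the plugin is meant to address.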

Using SASL to establish trust

In HBase replication, RegionServers in the source cluster contact RegionServers in the target cluster via RPC connections. When security is enabled, authentication is performed during RPC connection establishment using the Simple Authentication and Security Layer (SASL) framework. HBase already provides the following built-in SASL authentication mechanisms: kerberos, digest and simple. When Kerberos is enabled, the target cluster expects credentials from the source cluster and validates them against its own KDC using the SASL kerberos mechanism. This relies on the Kerberos GSSAPI implementation to authenticate the provided credentials against the target cluster KDC. Trust for the source cluster principal must therefore have been established at the Kerberos system level, either by having both clusters' credentials in the same realm, or by making the target cluster KDC trust credentials from the source cluster realm (an approach commonly known as cross-realm authentication).

Extending HBase SASL authentication 

Luckily, SASL is designed to allow for custom authentication implementations. That means a SASL-based solution is possible if an additional SASL mechanism can be plugged into the set of built-in options mentioned above. With that aim, Cloudera proposed a refactoring of HBase's RPC layer, which was reviewed and accepted by the Apache HBase community in HBASE-23347.

Pluggable SASL Mechanism

With the changes introduced by HBASE-23347, additional SASL authentication mechanisms can be defined via HBase configuration to be used by the RPC layer. Incoming RPC connections declare the specific SASL type in the header, and the RPC server then selects the matching implementation to perform the actual authentication.
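
The selection step can be pictured as a lookup from the SASL type named in the connection header to a registered implementation. The following is a minimal conceptual sketch in Python, not the Java interfaces added by HBASE-23347: `RpcServerSketch`, both provider functions, the payload format and the hard-coded secret are all hypothetical.

```python
from typing import Callable, Dict

# A provider receives the raw payload sent by the client and returns the
# authenticated user name, or raises PermissionError.
AuthProvider = Callable[[bytes], str]


class RpcServerSketch:
    """Toy stand-in for an RPC server holding pluggable auth mechanisms."""

    def __init__(self) -> None:
        self._providers: Dict[str, AuthProvider] = {}

    def register(self, sasl_type: str, provider: AuthProvider) -> None:
        # In HBase the extra mechanisms come from configuration; here they
        # are registered by hand.
        self._providers[sasl_type] = provider

    def authenticate(self, header_sasl_type: str, payload: bytes) -> str:
        # Pick the implementation that matches the type named in the header.
        provider = self._providers.get(header_sasl_type)
        if provider is None:
            raise PermissionError(f"unsupported SASL type: {header_sasl_type}")
        return provider(payload)


def simple_provider(payload: bytes) -> str:
    # Toy "simple" mechanism: accepts the user name as sent by the client.
    return payload.decode("utf-8")


def custom_provider(payload: bytes) -> str:
    # Toy custom mechanism: expects "user:secret" (format is an assumption).
    user, _, secret = payload.decode("utf-8").partition(":")
    if secret != "example-secret":
        raise PermissionError("bad credentials")
    return user
```

A server built this way would be given both providers with `register("simple", simple_provider)` and `register("custom", custom_provider)`, after which `authenticate("custom", b"repl-user:example-secret")` dispatches to the custom implementation.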

Operational Database Replication Plugin

The Operational Database Replication Plugin implements its own custom SASL mechanism, allowing clusters in different Kerberos realms to communicate with minimal configuration effort (without the need for Kerberos cross-realm authentication). It extends HBase replication so that the source creates a SASL token of the Replication Plugin custom type, carrying the credentials of a pre-defined machine user on the target COD cluster. This type of user can easily be created from the Cloudera Management Console UI and then propagated to the Kerberos authentication authority underlying the COD cluster. Detailed instructions for creating replication machine users are covered in the pre-requirement steps section of the Cloudera Replication Manager documentation.

When the RPC server in the target reads the token and identifies it as a Replication Plugin type, the related credentials are parsed from the token and used for authentication.

Operational Database Replication Plugin uses PAM authentication to validate the machine user credentials. COD clusters are always provisioned with PAM authenticating against the CDP environment FreeIPA security domain.
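
The token flow described above can be sketched end to end as follows. The post does not describe the plugin's actual token layout, so everything here is an assumption made for illustration: the JSON body, the `TOKEN_TYPE` marker and the function names are hypothetical, the `validate` callback merely stands in for the PAM check on the target, and the encryption of credentials is left out entirely.

```python
import base64
import json
from typing import Callable, Tuple

TOKEN_TYPE = "REPLICATION_PLUGIN_SKETCH"  # hypothetical type marker


def build_token(user: str, password: str) -> bytes:
    """Source side: wrap the machine user credentials in a typed token."""
    body = {"type": TOKEN_TYPE, "user": user, "password": password}
    return base64.b64encode(json.dumps(body).encode("utf-8"))


def parse_token(token: bytes) -> Tuple[str, str]:
    """Target side: check the token type, then extract the credentials."""
    body = json.loads(base64.b64decode(token).decode("utf-8"))
    if body.get("type") != TOKEN_TYPE:
        raise ValueError("not a replication plugin token")
    return body["user"], body["password"]


def authenticate(token: bytes, validate: Callable[[str, str], bool]) -> str:
    """Target side: validate the parsed credentials (PAM in the real plugin)."""
    user, password = parse_token(token)
    if not validate(user, password):
        raise PermissionError(f"authentication failed for {user}")
    return user
```

Passing the validator in as a callback keeps the parsing step separate from the credential check, mirroring the split between reading the token and authenticating against the environment's security domain.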

Securing Machine User Credentials

A critical issue in this solution is that the source cluster must obtain the credentials of the machine user of the target cluster. For obvious reasons, these credentials must not be exposed in any way in the source configuration. They are also sent over the wire in the SASL token within the RPC connection, so they must be encrypted prior to transmission. The Replication Plugin provides its own tool to generate a jceks file that stores the machine user credentials in encrypted form. Once this file is created, it must be copied to both clusters and made readable by the hbase user only.

The diagram below shows a deployment overview of the Operational Database Replication Plugin components integrated with the standard HBase replication classes in the context of RegionServers. The pink boxes represent the replication and RPC connection code already provided by HBase, whilst the yellow boxes show the abstraction layer introduced by HBASE-23347. Finally, the orange classes highlight the relevant artifacts implementing the Operational Database Replication Plugin logic.

Cloudera Operational Database Replication plugin
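
As a small aid for the file permission requirement on the jceks file, the following sketch shows one way to restrict a file to its owner and to verify the result on a POSIX system. It is not part of the plugin: the function names are hypothetical, and it assumes the file is already owned by the hbase user (changing ownership is a separate administrative step).

```python
import os
import stat
import tempfile


def restrict_to_owner(path: str) -> None:
    """Allow read/write for the file owner only (mode 0600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


def is_owner_only(path: str) -> bool:
    """True when group and others have no permissions on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


if __name__ == "__main__":
    # Demonstrate on a throwaway file rather than a real credentials store.
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        demo_path = handle.name
    try:
        os.chmod(demo_path, 0o644)  # deliberately too permissive
        print("owner only before:", is_owner_only(demo_path))
        restrict_to_owner(demo_path)
        print("owner only after:", is_owner_only(demo_path))
    finally:
        os.remove(demo_path)
```

A check like `is_owner_only` could be run on each cluster after the file has been copied, to catch a copy step that widened the permissions.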

Conclusion

Replication is a valuable tool for implementing DR and DC migration solutions for HBase. As shown here, it has some caveats when dealing with clusters' security configurations. Yet the ability to migrate data from current "on-prem" deployments to CDP clusters in the cloud is imperative. The Cloudera Operational Database Replication plugin brings flexibility when integrating secured clusters, together with better maintainability for this security integration, since it is implemented entirely at the HBase level. Kerberos cross-realm authentication, in contrast, requires changes to the Kerberos system definition, which is often the responsibility of a completely different team with its own restrictive policies.

Try out the Operational Database template in Cloudera Data Platform (CDP)!

Wellington Chevreuil
Josh Elser
Krishna Maheshwari, Director of Product Management