Safely remove Kafka brokers from Amazon MSK provisioned clusters

Today, we are announcing broker removal capability for Amazon Managed Streaming for Apache Kafka (Amazon MSK) provisioned clusters, which lets you remove multiple brokers from your provisioned clusters. You can now reduce your cluster’s storage and compute capacity by removing sets of brokers, with no availability impact, data durability risk, or disruption to your data streaming applications. Amazon MSK is a fully managed Apache Kafka service that makes it easy for developers to build and run highly available, secure, and scalable streaming applications. Administrators can optimize the costs of their Amazon MSK clusters by reducing broker count and adapting the cluster capacity to the changes in the streaming data demand, without affecting their clusters’ performance, availability, or data durability.

You can use Amazon MSK as a core foundation to build a variety of real-time streaming applications and high-performance event-driven architectures. As business needs and traffic patterns change, cluster capacity is often adjusted to optimize costs. Amazon MSK provides flexibility and elasticity for administrators to right-size MSK clusters. You can increase broker count or the broker size to manage the surge in traffic during peak events or decrease the instance size of brokers of the cluster to reduce capacity. However, to reduce the broker count, earlier you had to undertake effort-intensive migration to another cluster.

With the broker removal capability, you can now remove multiple brokers from your provisioned clusters to meet the varying needs of your streaming workloads. During and post broker removal, the cluster continues to handle read and write requests from the client applications. MSK performs the necessary validations to safeguard against data durability risks and gracefully removes the brokers from the cluster. By using broker removal capability, you can precisely adjust MSK cluster capacity, eliminating the need to change the instance size of every broker in the cluster or having to migrate to another cluster to reduce broker count.

How the broker removal feature works

Before you execute the broker removal operation, you must make some brokers eligible for removal by moving all partitions off of them. You can use Kafka admin APIs or Cruise Control to move partitions to other brokers that you intend to retain in the cluster.

You choose which brokers to remove and move the partitions from those brokers to other brokers using Kafka tools. Alternatively, you may have brokers that are not hosting any partitions. Then use Edit number of brokers feature using the AWS Management Console, or the Amazon MSK API UpdateBrokerCount. Here are details on how you can use this new feature:

You can remove a maximum of one broker per Availability Zone (AZ) in a single broker removal operation. To remove more brokers, you can call multiple broker removal operations consecutively after the prior operation has been completed. You must retain at least one broker per AZ in your MSK cluster.
The target number of broker nodes in the cluster must be a multiple of the number of availability zones (AZs) in the client subnets parameter. For example, a cluster with subnets in two AZs must have a target number of nodes that is a multiple of two.
If the brokers you removed were present in the bootstrap broker string, MSK will perform the necessary routing so that the client’s connectivity to the cluster is not disrupted. You don’t need to make any client changes to change your bootstrap strings.
You can add brokers back to your cluster anytime using AWS Console, or the UpdateBrokerCount API.
Broker removal is supported on Kafka versions 2.8.1 and above. If you have clusters in lower versions, you must first upgrade to version 2.8.1 or above and then remove brokers.
Broker removal doesn’t support the t3.small instance type.
You will stop incurring costs for the removed brokers once the broker removal operation is completed successfully.
When brokers are removed from a cluster, their associated local storage is removed as well.

Considerations before removing brokers

Removing brokers from an existing Apache Kafka cluster is a critical operation that needs careful planning to avoid service disruption. When deciding how many brokers you should remove from the cluster, determine your cluster’s minimum broker count by considering your requirements around availability, durability, local data retention, and partition count. Here are a few things you should consider:

Check Amazon CloudWatch BytesInPerSec and BytesOutPerSec metrics for your cluster. Look for the peak load over a period of 1 month. Use this data with MSK sizing Excel file to identify how many brokers you need to handle your peak load. If the number of brokers listed in the Excel file is higher than the number of brokers that would remain after removing brokers, do not proceed with this operation. This indicates that removing brokers would result in too few brokers for the cluster, which can lead to availability impact for your cluster or applications.
Check UserPartitionExists metrics to verify that you have at least 1 empty broker per AZ in your cluster. If not, make sure to remove partitions from at least one broker per AZ before invoking the operation.
If you have more than one broker per AZ with no user partitions on them, MSK will randomly pick one of those during the removal operation.
Check the PartitionCount metrics to know the number of partitions that exist on your cluster. Check per broker partition limit. The broker removal feature will not allow the removal of brokers if the service detects that any brokers in the cluster have breached the partition limit. In that case, check if any unused topics could be removed instead to free up broker resources.
Check if the estimated storage in the Excel file exceeds the currently provisioned storage for the cluster. In that case, first provision additional storage on that cluster. If you are hitting per-broker storage limits, consider approaches like using MSK tiered storage or removing unused topics. Otherwise, avoid moving partitions to just a few brokers as that may lead to a disk full issue.
If the brokers you are planning to remove host partitions, make sure those partitions are reassigned to other brokers in the cluster. Use the kafka-reassign-partitions.sh tool or Cruise Control to initiate partition reassignment. Monitor the progress of reassignment to completion. Disregard the __amazon_msk_canary, __amazon_msk_canary_state internal topics, because they are managed by the service and will be automatically removed by MSK while executing the operation.
Verify the cluster status is Active, before starting the removal process.
Check the performance of the workload on your production environment after you move those partitions. We recommend monitoring this for a week before you remove the brokers to make sure that the other brokers in your cluster can safely handle your traffic patterns.
If you experience any impact on your applications or cluster availability after removing brokers, you can add the same number of brokers that you removed earlier by using the UpdateBrokerCount API, and then reassign partitions to the newly added brokers.
We recommend you test the entire process in a non-production environment, to identify and resolve any issues before making changes in the production environment.

Conclusion

Amazon MSK’s new broker removal capability provides a safe way to reduce the capacity of your provisioned Apache Kafka clusters. By allowing you to remove brokers without impacting availability, data durability, or disrupting your streaming applications, this feature enables you to optimize costs and right-size your MSK clusters based on changing business needs and traffic patterns. With careful planning and by following the recommended best practices, you can confidently use this capability to manage your MSK resources more efficiently.

Start taking advantage of the broker removal feature in Amazon MSK today. Review the documentation and follow the step-by-step guide to test the process in a non-production environment. Once you are comfortable with the workflow, plan and execute broker removal in your production MSK clusters to optimize costs and align your streaming infrastructure with your evolving workload requirements.

About the Authors

Vidhi Taneja is a Principal Product Manager for Amazon Managed Streaming for Apache Kafka (Amazon MSK) at AWS. She is passionate about helping customers build streaming applications at scale and derive value from real-time data. Before joining AWS, Vidhi worked at Apple, Goldman Sachs and Nutanix in product management and engineering roles. She holds an MS degree from Carnegie Mellon University.

Anusha Dasarakothapalli is a Principal Software Engineer for Amazon Managed Streaming for Apache Kafka (Amazon MSK) at AWS. She started her software engineering career with Amazon in 2015 and worked on products such as S3-Glacier and S3 Glacier Deep Archive, before transitioning to MSK in 2022. Her primary areas of focus lie in streaming technology, distributed systems, and storage.

Masudur Rahaman Sayem is a Streaming Data Architect at AWS. He works with AWS customers globally to design and build data streaming architectures to solve real-world business problems. He specializes in optimizing solutions that use streaming data services and NoSQL. Sayem is very passionate about distributed computing.

AWS Big Data Blog

Safely remove Kafka brokers from Amazon MSK provisioned clusters

How the broker removal feature works

Considerations before removing brokers

Conclusion

About the Authors

Resources

Follow