Building a highly visualizable RDS monitoring stack

Kun Yang · Published in Insight · Jun 8, 2020 · 7 min read

Mayhem is a gaming analytics company that offers gamers a platform to hold tournaments and view their own in-game statistics. Because gamers play together in a large community, many of them log in simultaneously, and the challenge for Mayhem is handling that traffic properly. If the latency of the gaming experience becomes too high, the result is lost users.

Mayhem’s platform

Large numbers of login requests can spike the load on a database service within a short time, and this challenge applies much more broadly than just gaming communities. A robust monitoring solution is the first step toward solving the problem, because it tells you what is going on with your system in time to react.

As an Insight Fellow, I partnered with Mayhem to build an Amazon Relational Database Service (RDS) monitoring stack with a 15-second refresh rate that can also be redeployed onto almost any system built around AWS RDS. Read on to learn how it was built.

Current limitations in monitoring RDS

Amazon RDS is the most widely used cloud database service on the market. To keep it secure, AWS doesn’t allow you to install other services on the RDS server; you can only access and visualize its data through CloudWatch, AWS’s monitoring service. Most of the time, the CloudWatch monitoring stack is sufficient for day-to-day usage. However, it becomes a problem when you find that CloudWatch’s default settings cannot satisfy your requirements. Below are several limitations that I’ve found with using CloudWatch.

CloudWatch cannot visualize detailed metrics

The most critical issue with CloudWatch is the low refresh rate of the default RDS monitoring metrics: they only update every 3 minutes. In Mayhem’s case, they needed to spin up a new RDS instance quickly, so the 3-minute update speed posed a real challenge.

AWS does allow you to enable the Enhanced Monitoring service, which provides many detailed metrics with a refresh interval of 15 seconds. However, as the following figure shows, these metrics are stored as JSON log files and cannot be visualized any further inside CloudWatch. This is the main reason to create our own monitoring stack: we need to export all of this important information and visualize it elsewhere.

The enhanced metrics in CloudWatch.

CloudWatch has limitations on its graph choices

Another issue with CloudWatch is that the plot choices it offers are limited. Comparing the dashboard panels of CloudWatch and Grafana, you can easily see which has the better visualizations.

Comparison Between CloudWatch and Grafana

Whole System Structure

CloudWatch cannot fully handle the monitoring task, so to solve this problem I built a monitoring stack for RDS using Prometheus and Grafana. My entire system follows the structure illustrated in the figure below. In this schematic, CloudWatch and RDS are the foundation we rely on, and all data must pass through CloudWatch. We will walk through this in detail in the next section.

System Structure

My monitoring stack contains Prometheus, Prometheus exporters, AlertManager, and Grafana. The reproducibility group includes Packer and Terraform. Lastly, I used a single EC2 instance running mysqlslap for testing. Next, I will break these down and discuss each piece one by one.

Core Monitoring Stack

Before we can build our own monitoring stack, we need to identify the stumbling blocks. Unfortunately, what I mentioned earlier as an advantage, the security of the service, is now the biggest problem. The most widely used solutions for monitoring a database server all involve installing additional services on the host. But, as I mentioned earlier, RDS doesn’t allow you to install any additional services on its servers, so we can’t use any of the traditional methods. The only way to build a monitoring stack is to stream the data out of CloudWatch.

To stream the data out, you can always use enterprise monitoring tools like Datadog. But if you want to save money and take advantage of the open-source community, Prometheus and Grafana are your best choice.

The monitoring stack

We use two different exporters to export data out of CloudWatch. The rds_exporter, maintained by Percona, pulls out all of the enhanced monitoring data. The official guidance just asks you to pull the exporter and use it, but that didn’t work in my case; a more stable way is to compile the exporter yourself with Go.
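
If you want to take the compile-it-yourself route, the steps look roughly like the sketch below. This assumes Go and git are installed; the build command and runtime flags may differ between releases, so treat it as a starting point rather than the exact procedure.

```bash
# Build Percona's rds_exporter from source (assumes Go and git are installed).
git clone https://github.com/percona/rds_exporter.git
cd rds_exporter
go build -o rds_exporter .     # or use the repo's Makefile if it provides one

# The exporter reads a config file listing the RDS instances (region + instance ID);
# see the project README for the exact schema, and check the supported flags first.
./rds_exporter --help
```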

The other exporter is the cloudwatch_exporter, officially maintained by Prometheus. The biggest difficulty of using multiple exporters is the configuration, because each is developed in a different language with a different configuration style. To wire everything together, running unit tests constantly and following the official Prometheus guidance is helpful.
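
As a concrete reference, here is a minimal sketch of a prometheus.yml that scrapes both exporters on one host. The ports shown are the defaults as I understand them (9042 for rds_exporter, 9106 for cloudwatch_exporter) and are assumptions; adjust them to whatever your exporters actually listen on.

```bash
# Write a minimal Prometheus scrape configuration for the two exporters.
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s            # matches the 15-second enhanced-monitoring cadence
scrape_configs:
  - job_name: rds_enhanced        # Percona rds_exporter (assumed default port 9042)
    static_configs:
      - targets: ['localhost:9042']
  - job_name: cloudwatch          # Prometheus cloudwatch_exporter (assumed default port 9106)
    static_configs:
      - targets: ['localhost:9106']
EOF
```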

Connecting Prometheus with Grafana is straightforward. Prometheus is a built-in data source type in Grafana, and all you need to do is expose the Prometheus port and add it to Grafana.
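
If you prefer to script this step instead of clicking through the UI, Grafana’s file-based provisioning can register the data source for you. This is a minimal sketch; the file path assumes a standard package install of Grafana (and write access to its provisioning directory) and Prometheus running on its default port 9090.

```bash
# Provision Prometheus as a Grafana data source (alternative to adding it in the UI).
cat > /etc/grafana/provisioning/datasources/prometheus.yml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090    # default Prometheus port
    isDefault: true
EOF
```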

Last but not least, I used AlertManager to send whatever my monitoring system spotted to the engineering team through Slack or PagerDuty.
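
A minimal AlertManager route for the Slack path might look like the sketch below. The webhook URL and channel are placeholders, not Mayhem’s real values; PagerDuty works the same way through a pagerduty_configs receiver.

```bash
# Route every alert Prometheus fires to a Slack channel via an incoming webhook.
cat > alertmanager.yml <<'EOF'
route:
  receiver: slack-notifications
receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ   # placeholder webhook URL
        channel: '#rds-alerts'                                   # placeholder channel
        send_resolved: true
EOF
```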

Metrics

I ran both high-volume query tests and slow-query tests ten times each to find the best metrics for catching spike-like traffic. Based on my observations, these are the most critical metrics:

CPU Usage

CPU usage spikes up with the smallest latency every time there is spike-like traffic. You can see that the CPU usage spikes up roughly one cycle earlier, which is at least 15 seconds. Sometimes those 15 seconds can save a lot of money.

Spike-like traffic
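
To turn this early signal into an actual page, you can attach a Prometheus alerting rule to the CPU metric and let AlertManager forward it. The sketch below assumes the cloudwatch_exporter naming convention (aws_rds_cpuutilization_average) and a hypothetical 80% threshold; check your own /metrics output for the exact metric name, and reference the file under rule_files in prometheus.yml.

```bash
# Hypothetical alerting rule that fires when RDS CPU stays above 80% for 30 seconds.
cat > rds_alerts.yml <<'EOF'
groups:
  - name: rds
    rules:
      - alert: RDSCPUSpike
        expr: aws_rds_cpuutilization_average > 80   # metric name and threshold are assumptions
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "RDS CPU above 80%, possible spike-like traffic"
EOF
```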

Average load number

The average load of the RDS system tells you how connections are being processed: you cannot turn off a temporary MySQL instance before all of the tail-end connections have been processed. This makes average load the key factor for recognizing the end of spike-like traffic.

Logging Stream

The logging stream not only tells you about errors and slow queries through their corresponding metrics; the general log also gives you an overall summary of the system’s behavior. The general log grows whenever the load on the whole system grows, not only during spike-like traffic. For example, the general logging stream will grow when a large number of slow queries arrive. However, those slow queries may not consume much of the system’s resources, so CPU usage and average load won’t change much.

Logging stream

Reproducibility Group

To make the system redeployable, I used HashiCorp Packer to pack my whole stack into an Amazon Machine Image (AMI). In AWS, an AMI is an image of an EC2 instance that can be launched directly, which means you can redeploy the monitoring stack simply by launching this image. Specifically, I used bash scripts for provisioning.
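
A trimmed-down Packer template for this kind of bake looks roughly like the following. The region, source AMI, instance type, and provisioning script name are illustrative placeholders rather than the values used in the real build.

```bash
# Bake the monitoring stack into an AMI with a shell provisioning script.
cat > monitoring-ami.json <<'EOF'
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami": "ami-0123456789abcdef0",
    "instance_type": "t3.medium",
    "ssh_username": "ubuntu",
    "ami_name": "rds-monitoring-stack-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "shell",
    "script": "provision.sh"
  }]
}
EOF
packer build monitoring-ami.json
```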

I used Terraform to launch all of my basic infrastructure. The Terraform configuration includes setting up IAM roles, launching and configuring RDS instances, setting up security groups and the VPC, and launching the configured AMI.
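
The fragment below is a hedged sketch of just the part that launches the baked AMI behind a security group; the AMI ID, ports, and CIDR blocks are placeholders, and the real configuration also covers the IAM roles, RDS instances, and VPC mentioned above.

```bash
# Launch the monitoring AMI behind a security group that exposes Grafana internally.
cat > main.tf <<'EOF'
provider "aws" {
  region = "us-east-1"
}

resource "aws_security_group" "monitoring" {
  name = "rds-monitoring"

  ingress {
    from_port   = 3000                  # Grafana's default port
    to_port     = 3000
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/16"]       # placeholder internal CIDR
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "monitoring" {
  ami                    = "ami-0123456789abcdef0"   # the AMI produced by Packer
  instance_type          = "t3.medium"
  vpc_security_group_ids = [aws_security_group.monitoring.id]
}
EOF
terraform init && terraform apply
```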

Testing

To test the effectiveness of my whole stack, I launched another EC2 instance for stress testing. To reproduce Mayhem’s pain points (users who surge in within a short period of time, and users who create slow queries), this dummy server runs mysqlslap, which sends a massive number of query requests within one minute as well as slow-query requests.
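
A representative mysqlslap invocation for the surge scenario is sketched below. The endpoint, credentials, and load numbers are placeholders rather than the exact parameters of the real test.

```bash
# Fire a burst of auto-generated queries at the RDS endpoint to simulate a login surge.
# The host below is a placeholder RDS endpoint; --password prompts interactively.
mysqlslap \
  --host=my-rds-instance.xxxxxx.us-east-1.rds.amazonaws.com \
  --user=admin --password \
  --concurrency=200 \
  --iterations=5 \
  --number-of-queries=20000 \
  --auto-generate-sql \
  --auto-generate-sql-load-type=mixed
```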

Key takeaways

This whole system enables a Grafana dashboard to monitor detailed RDS metrics at a 15-second refresh rate. It can also be redeployed onto almost any system built around AWS RDS, and its alerting gives the engineering team a way to stay warned and prepared.

The system is now up and running on Mayhem’s server as their first step toward overcoming spike-like traffic. Here is how the dashboard looks:

The Dashboard

Although my stack can now reliably capture every spike inside the RDS system, this is only an early step toward fully resolving this complicated issue. In the future, the monitoring should ideally serve as the primary trigger: whenever a certain threshold is met, a proxy auto-scales the necessary sub-systems. To get there, the next step is to add a proxy that can automatically launch and destroy temporary database servers and replicas.

To get the code and try my whole system, go to this GitHub repo.

I would like to thank Insight and Mayhem.gg for supporting me in building this monitoring stack. I would also like to thank Rahil Bhatnagar, my program director, for giving me tons of valuable ideas.

Kun Yang, Insight Fellow, Ph.D. Candidate at the University of Virginia

Interested in seeing more content like this? Sign up for our newsletter, and get updates on the latest resources and upcoming events.
