AWS Big Data Blog

Introducing persistent buffering for Amazon OpenSearch Ingestion

Amazon OpenSearch Ingestion is a fully managed, serverless pipeline that delivers real-time log, metric, and trace data to Amazon OpenSearch Service domains and OpenSearch Serverless collections.

Customers use Amazon OpenSearch Ingestion pipelines to ingest data from a variety of sources, both pull-based and push-based. When you ingest data from pull-based sources with Amazon OpenSearch Ingestion, such as Amazon Simple Storage Service (Amazon S3) and Amazon Managed Streaming for Apache Kafka (Amazon MSK), the source handles data durability and retention. Push-based sources, however, stream records directly to ingestion endpoints and typically have no way to persist data once it is generated.

To address this need for such sources, a common architectural pattern is to add a standalone persistent buffer for enhanced durability and reliability of data ingestion. A durable, persistent buffer can mitigate the impact of ingestion spikes, buffer data during downtime, and reduce the need to expand capacity with in-memory buffers, which can overflow. Customers use popular buffering technologies like Apache Kafka or RabbitMQ to add durability to the data flowing through their Amazon OpenSearch Ingestion pipelines. However, these tools add complexity to the data ingestion architecture and can be time-consuming to set up, right-size, and maintain.

Solution overview

Today we’re introducing persistent buffering for Amazon OpenSearch Ingestion to enhance data durability and simplify data ingestion architectures for Amazon OpenSearch Service customers. You can use persistent buffering to ingest data from all push-based sources supported by Amazon OpenSearch Ingestion without setting up a standalone buffer. These include HTTP sources and OpenTelemetry (OTel) sources for logs, traces, and metrics. Persistent buffering in Amazon OpenSearch Ingestion is serverless and scales elastically to meet the throughput needs of even the most demanding workloads. You can now focus on your core business logic when ingesting data at scale into Amazon OpenSearch Service, without worrying about the undifferentiated heavy lifting of provisioning and managing servers to add durability to your ingest pipeline.

Walkthrough

Enable persistent buffering

You can turn on persistent buffering for existing or new pipelines using the AWS Management Console, AWS Command Line Interface (AWS CLI), or AWS SDK. If you choose not to enable persistent buffering, the pipeline continues to use an in-memory buffer.
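For example, you can turn on the buffer for an existing pipeline with the AWS CLI. The following is a minimal sketch; the pipeline name is a placeholder, and the --buffer-options parameter follows the OpenSearch Ingestion UpdatePipeline API, so verify it against the current CLI reference for your version.

aws osis update-pipeline \
    --pipeline-name my-pipeline \
    --buffer-options PersistentBufferEnabled=true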

By default, persistent data is encrypted at rest with a key that AWS owns and manages for you. You can optionally choose your own customer managed AWS Key Management Service (AWS KMS) key to encrypt the data by selecting the checkbox labeled Customize encryption settings and then choosing Choose a different AWS KMS key. Note that if you choose a different KMS key, your pipeline needs additional permissions to decrypt and generate data keys. The following snippet shows an example AWS Identity and Access Management (IAM) permissions policy that needs to be attached to the role used by the pipeline.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "KeyAccess",
            "Effect": "Allow",
            "Action": [
              "kms:Decrypt",
              "kms:GenerateDataKeyWithoutPlaintext"
            ],
            "Resource": "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab"
        }
    ]
}
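If you configure the pipeline with the AWS CLI rather than the console, you can supply the customer managed key through the pipeline's encryption-at-rest settings. This sketch reuses the example key ARN from the policy above and a placeholder pipeline name; the --encryption-at-rest-options parameter follows the OpenSearch Ingestion UpdatePipeline API, so confirm it against the current CLI reference.

aws osis update-pipeline \
    --pipeline-name my-pipeline \
    --encryption-at-rest-options KmsKeyArn=arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab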

Provision for persistent buffering

Once persistent buffering is enabled, data is retained in the buffer for 72 hours. Amazon OpenSearch Ingestion keeps track of the data written into a sink and automatically resumes writing from the last successful checkpoint if there is an outage in the sink or another issue that prevents data from being written successfully. No additional services or components are needed for persistent buffers beyond the minimum and maximum OpenSearch Compute Units (OCUs) set for the pipeline. When persistent buffering is turned on, each Ingestion-OCU is capable of providing persistent buffering along with its existing ability to ingest, transform, and route data. Amazon OpenSearch Ingestion dynamically allocates the buffer from the minimum and maximum OCUs that you define for the pipeline.

The number of Ingestion-OCUs used for persistent buffering is dynamically calculated based on the source, the transformations applied to the streaming data, and the sink that the data is written to. Because a portion of the Ingestion-OCUs now goes to persistent buffering, you need to increase the minimum and maximum Ingestion-OCUs when turning it on in order to maintain the same ingestion throughput for your pipeline. The number of OCUs that you need with persistent buffering depends on the source that you ingest data from and on the type of processing that you perform on the data. The following table shows the number of OCUs that you need with persistent buffering, relative to the OCUs needed without it, for different sources and processors.

Sources and processors | Ingestion-OCUs with buffering, compared to the OCUs needed without persistent buffering for similar data throughput
HTTP with no processors | 3 times
HTTP with Grok | 2 times
OTel Logs | 2 times
OTel Trace | 2 times
OTel Metrics | 2 times
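As an illustration with hypothetical numbers, if an HTTP pipeline with no processors sustained your workload at 2 Ingestion-OCUs without persistent buffering, plan for roughly 2 × 3 = 6 Ingestion-OCUs after turning buffering on to achieve similar throughput.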

You have complete control over how to set up OCUs for your pipelines, and you can decide between increasing OCUs for higher throughput or reducing OCUs for cost control at lower throughput. Also, when you turn on persistent buffering, the minimum number of OCUs for a pipeline goes up from one to two.
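Continuing the hypothetical sizing above, you could raise the OCU limits when enabling persistent buffering with the AWS CLI. The pipeline name and unit values below are placeholders, and the --min-units and --max-units parameters follow the OpenSearch Ingestion UpdatePipeline API, so check them against the current CLI reference.

aws osis update-pipeline \
    --pipeline-name my-pipeline \
    --min-units 2 \
    --max-units 6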

Availability and pricing

Persistent buffering is available in all the AWS Regions where Amazon OpenSearch Ingestion is available as of November 17, 2023. These include US East (Ohio), US East (N. Virginia), US West (Oregon), US West (N. California), Europe (Ireland), Europe (London), Europe (Frankfurt), Asia Pacific (Tokyo), Asia Pacific (Sydney), Asia Pacific (Singapore), Asia Pacific (Mumbai), Asia Pacific (Seoul), and Canada (Central).

Ingestion-OCUs remain at the same price of $0.24 per hour. OCUs are billed on an hourly basis with per-minute granularity. You can control the costs OCUs incur by configuring the maximum number of OCUs that a pipeline is allowed to scale to.
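As a rough illustration with a hypothetical workload, a pipeline that averages 4 Ingestion-OCUs around the clock for a 30-day month would cost about 4 × 24 × 30 × $0.24 ≈ $691, excluding any other AWS charges.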

Conclusion

In this post, we showed you how to configure persistent buffering for Amazon OpenSearch Ingestion to enhance data durability and simplify the data ingestion architecture for Amazon OpenSearch Service. Refer to the documentation to learn about other capabilities provided by Amazon OpenSearch Ingestion to build a sophisticated architecture for your ingestion needs.


About the Authors

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Arjun Nambiar is a Product Manager with Amazon OpenSearch Service. He focuses on ingestion technologies that enable ingesting data from a wide variety of sources into Amazon OpenSearch Service at scale. Arjun is interested in large-scale distributed systems and cloud-native technologies, and is based out of Seattle, Washington.

Jay is a Customer Success Engineering leader for Amazon OpenSearch Service. He focuses on the overall customer experience with OpenSearch. Jay is interested in large-scale OpenSearch adoption and distributed data stores, and is based out of Northern Virginia.

Rich Giuli is a Principal Solutions Architect at Amazon Web Services (AWS). He works within a specialized group helping ISVs accelerate adoption of cloud services. Outside of work, Rich enjoys running and playing guitar.