Apache Pulsar Tiered Storage with Storj DCS

Kevin Leffew
May 20, 2021

Apache Pulsar is an open-source, cloud-native distributed tiered messaging system that is part of the Apache Software Foundation. This distributed messaging and streaming platform manages hundreds of billions of events per day and is widely deployed across enterprise-grade systems, including Splunk, Overstock.com, Verizon, Comcast, and Toast.

Pulsar features Tiered Storage that allows developers to offload the non-compacted data, making it economical to store for long periods of time.

Today, we would like to showcase an open-source integration and walk through the process of offloading event data from Pulsar to Storj DCS (Decentralized Cloud Storage).

"Pulsar has reshaped the way that the industry thinks about a modern messaging and event streaming architecture," said Chris Latimer, VP of Product, Streaming at DataStax. "At the same time, Storj's Decentralized Cloud Storage has pushed the boundaries of distributed persistence. When you combine these technologies you end up with a highly efficient solution both in terms of cost and performance."

Tiered storage enables a more efficient messaging stack

Event sourcing architectures commonly have developers keep messages forever, resulting in costly storage on NVMe disks.

Tiered Storage in Apache Pulsar solves this problem, making it easier to reduce the total cost of ownership of the data held in messaging systems while still guaranteeing its integrity and availability.

High-performance delivery requires expensive disks. As messages age, read performance matters less, so they can be offloaded to cheaper cloud storage.

In Apache Pulsar, messages are packed into an ordered list of segments stored in Apache BookKeeper. Any segment other than the one currently being written to can be offloaded (in this case, to the decentralized cloud). A namespace policy can automate when this offload is triggered:

$ bin/pulsar-admin namespaces set-offload-threshold --size 10M my-tenant/my-namespace
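
Offload can also be inspected and triggered by hand for a single topic. The following is a minimal sketch, assuming a running cluster and a topic named my-topic (a placeholder, not part of the integration). By default it only prints the commands; set DRY_RUN=0 to execute them from the Pulsar installation directory.

```shell
#!/bin/sh
# Placeholder topic name; substitute your own.
TOPIC="persistent://my-tenant/my-namespace/my-topic"

# With DRY_RUN=1 (the default) commands are printed, not executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Show the namespace's current automatic offload threshold.
run bin/pulsar-admin namespaces get-offload-threshold my-tenant/my-namespace

# Manually offload older segments until about 10M remains in BookKeeper.
run bin/pulsar-admin topics offload --size-threshold 10M "$TOPIC"

# Wait for the offload to complete and report its status.
run bin/pulsar-admin topics offload-status -w "$TOPIC"
```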

The default Pulsar MaxBlockSize of 64MB precisely matches the ideal ingest block size of the Storj DCS network. To reduce the number of orders on the decentralized network, a ReadBufferSize of 64MB is ideal, although the default Pulsar configuration of 1MB is also supported.
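
These two sizes are set in bytes in the broker configuration. The sketch below converts them and prints lines that could be placed in conf/broker.conf. The property names are assumed from Pulsar's S3 offloader and may differ between Pulsar versions, so check them against your release before use.

```shell
#!/bin/sh
# Convert a size given in MB (binary megabytes) to bytes.
mb_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

MAX_BLOCK_BYTES=$(mb_to_bytes 64)    # matches Pulsar's default block size
READ_BUFFER_BYTES=$(mb_to_bytes 64)  # raised from the 1MB default

# Property names are assumptions based on Pulsar's S3 offloader settings.
cat <<EOF
s3ManagedLedgerOffloadMaxBlockSizeInBytes=${MAX_BLOCK_BYTES}
s3ManagedLedgerOffloadReadBufferSizeInBytes=${READ_BUFFER_BYTES}
EOF
```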

Pulsar - Message Replay with Storj DCS

The ability to replay messages is critical when working with producers that may not be able to replay or may be of unknown reliability. Replay capability adds flexibility, allowing you to test, recover, or repair without relying on producers. New applications or algorithms that require historical data can be quickly synced to the current state.

Whatever the motivation, replay capability is a valuable addition for Pulsar messaging services.
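
One way to replay is to rewind a subscription's cursor, after which its consumers re-read messages from that point. The helper below is a sketch only: it assumes the pulsar-admin CLI is available and that its reset-cursor command accepts a relative time window. The function name and its arguments are illustrative, not part of Pulsar or the integration.

```shell
#!/bin/sh
# Path to the admin CLI; override PULSAR_ADMIN if it lives elsewhere.
PULSAR_ADMIN="${PULSAR_ADMIN:-bin/pulsar-admin}"

# rewind_subscription TOPIC SUBSCRIPTION WINDOW
# Moves the subscription cursor back by WINDOW (for example 3h or 7d) so that
# its consumers re-read messages from that point onward.
rewind_subscription() {
  "$PULSAR_ADMIN" topics reset-cursor "$1" -s "$2" -t "$3"
}
```

For example, rewind_subscription persistent://my-tenant/my-namespace/my-topic my-subscription 7d would rewind a week, where the topic and subscription names are placeholders.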

Getting Started: Distributed Messaging and the Distributed Cloud

DataStax has built an integration that enables Pulsar users to offload their messages to Storj DCS for tiered storage.

The Helm chart installation and deployment guide is located in the DataStax GitHub repo: https://github.com/datastax/pulsar-helm-chart/blob/master/helm-chart-sources/pulsar/values.yaml

Tiered storage can be configured in the storageOffload section of the values.yaml file. There is explicit support for Storj DCS, which is a provider of secure, decentralized storage. You can enable the Storj DCS S3 gateway in the extras configuration. The instructions for configuring the gateway are provided in the Storj DCS section of the values.yaml file.



If you'd like to try the integration without running any infrastructure locally, you can use the multi-tenant Storj Gateway, Gateway MT.

Storj S3 Gateway Driver configuration

  1. Create and log in to your free account on Storj DCS.
  2. Go to "Objects" and create your first S3 bucket.
  3. On your Dashboard, create your access grants (if you did not create an access grant during the initial setup wizard).
  4. While creating your credentials, select "Generate S3 Gateway Credentials" in the last step.


S3 Configuration for Apache Pulsar

We will use the credentials generated in the previous step (Storj Gateway MT console) to configure tiered storage offloading. Set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in conf/pulsar_env.sh:

export AWS_ACCESS_KEY_ID=ABC123456789
export AWS_SECRET_ACCESS_KEY=ded7db27a4558e2ea8bbf0bf37ae0e8521618f366c

"export" is important so that the variables are made available in the environment of spawned processes.

Alternatively, add the Java system properties aws.accessKeyId and aws.secretKey to PULSAR_EXTRA_OPTS in conf/pulsar_env.sh:
PULSAR_EXTRA_OPTS="${PULSAR_EXTRA_OPTS} ${PULSAR_MEM} ${PULSAR_GC}
-Daws.accessKeyId=ABC123456789
-Daws.secretKey=ded7db27a4558e2ea8bbf0bf37ae0e8521618f366c
-Dio.netty.leakDetectionLevel=disabled
-Dio.netty.recycler.maxCapacity.default=1000
-Dio.netty.recycler.linkCapacity=1024"
Or set the access credentials in ~/.aws/credentials, using the credentials generated in the Storj DCS Gateway:
[default]
aws_access_key_id=ABC123456789
aws_secret_access_key=ded7db27a4558e2ea8bbf0bf37ae0e8521618f366c

In each case, the credentials are resolved through the AWS "DefaultAWSCredentialsProviderChain".

  • The broker must be restarted for credentials specified in pulsar_env.sh to take effect.
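
Before restarting the broker, it can help to confirm that the credentials file is complete. The check below is a rough sketch, not part of Pulsar or the Storj tooling: the function name is illustrative, and it only looks for the two key lines without validating the profile section or the key values.

```shell
#!/bin/sh
# Return 0 if the given credentials file defines both keys, else 1.
has_aws_keys() {
  file="$1"
  [ -r "$file" ] || return 1
  grep -q '^aws_access_key_id *= *[^ ]' "$file" &&
    grep -q '^aws_secret_access_key *= *[^ ]' "$file"
}

# Honor the standard AWS override for the credentials file location.
CREDS="${AWS_SHARED_CREDENTIALS_FILE:-$HOME/.aws/credentials}"
if has_aws_keys "$CREDS"; then
  echo "credentials file looks complete: $CREDS"
else
  echo "missing key(s) or unreadable file: $CREDS"
fi
```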

Why Pulsar?

Put simply, Pulsar's tiered storage model and easy message replay make it a great tool that plays well with Storj DCS.

Data ingestion and messaging are the starting point for modern data applications. As data and machine learning continue to grow in importance, companies must make sure they have the right messaging and storage systems in place.

Apache Pulsar was designed with a multi-layer architecture in which each layer is scalable, distributed, and decoupled from the other layers. With Pulsar, you can add new topics as needed and seamlessly scale performance.

With Storj DCS, Pulsar developers gain a more economical solution stack that is better optimized for performant delivery at the edge.

Conclusion

Many companies and technologists have begun using Apache Pulsar for pub/sub messaging and integrating it into their application builds.

Martin Fowler's TechRadar has covered Pulsar, stating: "We're also looking to Pulsar to solve the problem of a never-ending log of messages for our large-scale data systems where events are expected to persist indefinitely, and subscribers can start consuming messages retrospectively. This is supported through a tiered storage model."

---------------------------------
We look forward to gathering feedback from the Storj DCS Community around this integration. If you are interested in integrating Storj DCS and Pulsar into your application stack, please reach out to us directly: partnerships@storj.io.