I am curious about how to connect to a sharded multi-node ClickHouse

ClickHouse sharding and replication are now set up across multiple nodes. I know that after sharding I need to create a database for the sharded cluster and send queries to that database, and that querying it requires a cluster_name. However, when I looked at the settings on the Snuba side, the available environment variables only seem to support a single-node ClickHouse.

How do I use a clustered ClickHouse? While digging through the repo, I found the following multi-node cluster definition in the Snuba test code.

Is there an environment variable for cluster_name that I can use when querying ClickHouse from Snuba? Or is there another way?
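
For context on the cluster side: the cluster names come from the remote_servers section of the ClickHouse server configuration. Assuming clickhouse-client can reach one of the nodes, a sketch like the following lists the clusters (and their shards and replicas) that the node knows about:

-- List the clusters this node knows about, with their shards and replicas
SELECT cluster, shard_num, replica_num, host_name, port
FROM system.clusters;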

With the existing single-node settings (no cluster_name), tables are created only on the one ClickHouse node that is set as the endpoint, and no sharding or replication is performed on the other ClickHouse nodes.

Basically, I am applying the environment variables that can be used with the Snuba images from the link below.
An environment variable corresponding to cluster_name is needed so that Snuba can work with a clustered ClickHouse.
Currently, when install.sh runs, tables are only created on the ClickHouse node that serves as the endpoint, not on the other nodes.

I found a way in the Sentry code to create tables with a sharded engine during migration.

But I can’t understand the instructions in the comments.

"""
Provides the SQL for create table operations.
If the operation is being performed on a cluster marked as multi node, the
replicated versions of the tables will be created instead of the regular ones.
The shard and replica values will be taken from the macros configuration for
multi node clusters, so these need to be set prior to running the migration.
If unsharded=True is passed, data will be replicated on every shard of the cluster.
"""
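
For reference, the {shard} and {replica} values the comment mentions come from the macros section of each node's ClickHouse configuration. Assuming the macros are set, a sketch like this shows what is configured on the node you are connected to:

-- Show the macros (e.g. shard and replica) configured on this node
SELECT macro, substitution FROM system.macros;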

How can I check the sharding and replication options and then migrate so that the tables are created with the replicated engine applied?

I also checked the ClickHouse-related issue in Snuba.

Is there any way to add a replicated or distributed table to this?

I’ll ping @lynnagara as she’s our expert on Snuba & Clickhouse.

Hi @seungjinlee. Unfortunately we only support single node ClickHouse installations out of the box currently. Some parts of the snuba codebase may refer to multi node clusters - this is because it’s a feature we started to build out and are planning to support in the future. However this isn’t on our immediate roadmap currently so I can’t give you a timeframe for it right now.

If you need to run replicated or distributed tables, the only way to do so currently is to manually create all of the ClickHouse tables yourself (and keep them up to date each time you update Snuba) - you will not be able to use Snuba's migration system.

Oh, I see. Still, I managed to solve the problem by creating the tables separately, the way you described.
Here's how I solved it:

  1. Set the "CLICKHOUSE_DATABASE" environment variable on the Snuba side so that all Sentry schemas and data are created or migrated into a specific database.

  2. Get the CREATE query for each table through Tabix (I recommend this method), or work from the metadata in the storage attached to the host (see the SHOW CREATE TABLE sketch after this list).

  3. When creating Replicated and Distributed tables, always add "ON CLUSTER" so that the tables are created on all shards and replicas at the same time.

  4. After creating the separate database, create the Replicated* and Distributed tables. Each engine in the MergeTree family has a corresponding Replicated* engine, so convert them one by one; for example, change ReplacingMergeTree to ReplicatedReplacingMergeTree. Simple Merge tables and MATERIALIZED VIEWs are kept as they are (at this point, the Merge table is based on the Distributed table).

  5. Wrap each replicated table once more with a Distributed table for sharding. Each Distributed table points to a replicated table; in my case, the sharding key is project_id.

  6. After all of the above is finished, insert the Sentry data you collected earlier into the Distributed tables and you can see that it gets resharded. (The name of each Distributed table must be the same as the original Sentry table name.)

  7. Finally, change the "CLICKHOUSE_DATABASE" environment variable on the Snuba side to point to the newly created database.

  8. Make sure the data is properly sharded and replicated (see the verification sketch after this list).
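
For step 2, besides Tabix, the original DDL can also be pulled straight from ClickHouse. This is only a sketch; it assumes the original Snuba tables live in a database named sentry, as in the example queries further down:

-- List the tables in the source database
SHOW TABLES FROM sentry;

-- Dump the original CREATE statement so its engine can be converted to the Replicated* counterpart
SHOW CREATE TABLE sentry.groupassignee_local;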
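
For step 8, these are the kinds of checks I mean; again just a sketch, assuming the table names from the example further down (sentrylab.groupassignee_local is the Distributed table) and a ClickHouse version that exposes the _shard_num virtual column on Distributed tables:

-- Row count per shard, read through the Distributed table
SELECT _shard_num, count() AS row_count
FROM sentrylab.groupassignee_local
GROUP BY _shard_num;

-- Replication health of the local Replicated* tables on this node
SELECT database, table, is_readonly, total_replicas, active_replicas
FROM system.replicas
WHERE database = 'sentrylab';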

Of course, as you mentioned, the table schemas change when the version is upgraded, so I may have to repeat this work for every upgrade. However, we decided that sharding was essential for production, so we went ahead with it and confirmed that it works without problems. Thank you for the answer!


For anyone looking into sharding, I'm leaving the ClickHouse cluster GitHub repo as a reference.

Table creation query example:

CREATE TABLE sentrylab.groupassignee_local_rep ON CLUSTER 'company_cluster' (
    offset UInt64,
    record_deleted UInt8,
    project_id UInt64,
    group_id UInt64,
    date_added Nullable(DateTime),
    user_id Nullable(UInt64),
    team_id Nullable(UInt64)
) ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/{cluster}/{shard}/groupassignee_local', '{replica}', offset)
ORDER BY (project_id, group_id)
SETTINGS index_granularity = 8192;

CREATE TABLE sentrylab.groupassignee_local ON CLUSTER 'company_cluster' AS sentrylab.groupassignee_local_rep
ENGINE = Distributed('company_cluster', sentrylab, groupassignee_local_rep, project_id);

INSERT INTO sentrylab.groupassignee_local SELECT * FROM sentry.groupassignee_local;
