I am curious about how to connect to a sharded multi-node ClickHouse

ClickHouse sharding and replication are now set up across multiple nodes. I know that after sharding I need to create a database for the sharded cluster and send queries to that database, and that querying it requires a cluster_name. However, when I looked at the settings on the Snuba side, the available environment variables only seem to support a single-node ClickHouse.

How do I use a clustered ClickHouse? While digging through the repo, I found the following multi-node cluster definition in the Snuba test code.

Is there an environment variable for cluster_name that I can use when querying ClickHouse from Snuba? Or is there another way?
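
For context on the cluster side: the cluster names come from the remote_servers section of the ClickHouse server configuration. Assuming clickhouse-client can reach one of the nodes, a sketch like the following lists the clusters (and their shards and replicas) that the node knows about:

-- List the clusters this node knows about, with their shards and replicas
SELECT cluster, shard_num, replica_num, host_name, port
FROM system.clusters;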

With the existing single-node settings (no cluster_name), tables are created only on the one ClickHouse node that is set as the endpoint, and no sharding or replication is performed on the other ClickHouse nodes.

Basically, I am applying the environment variables that can be used with the Snuba images from the link below.
An environment variable corresponding to cluster_name is needed so that Snuba can work with a clustered ClickHouse.
Currently, when install.sh runs, tables are only created on the ClickHouse node that serves as the endpoint, not on the other nodes.

I found a way in the Sentry code to create tables with a sharded engine during migration.

But I can’t understand the instructions in the comments.

"""
Provides the SQL for create table operations.
If the operation is being performed on a cluster marked as multi node, the
replicated versions of the tables will be created instead of the regular ones.
The shard and replica values will be taken from the macros configuration for
multi node clusters, so these need to be set prior to running the migration.
If unsharded=True is passed, data will be replicated on every shard of the cluster.
"""
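
For reference, the {shard} and {replica} values the comment mentions come from the macros section of each node's ClickHouse configuration. Assuming the macros are set, a sketch like this shows what is configured on the node you are connected to:

-- Show the macros (e.g. shard and replica) configured on this node
SELECT macro, substitution FROM system.macros;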

How can I check the sharding and replication options and then migrate so that the tables are created with the replicated engine applied?

I also checked the ClickHouse-related issue in Snuba.

Is there any way to add a replicated or distributed table to this?

I’ll ping @lynnagara as she’s our expert on Snuba & Clickhouse.

Hi @seungjinlee. Unfortunately we only support single node ClickHouse installations out of the box currently. Some parts of the snuba codebase may refer to multi node clusters - this is because it’s a feature we started to build out and are planning to support in the future. However this isn’t on our immediate roadmap currently so I can’t give you a timeframe for it right now.

If you need to run replicated or distributed tables, the only way to do so currently is to manually create all of the ClickHouse tables yourself (and keep them up to date each time you update Snuba) - you will not be able to use Snuba's migration system.

Oh, I see. Still, I managed to solve the problem by creating the tables separately, the way you described.
Here's how I solved it:

  1. Set the "CLICKHOUSE_DATABASE" environment variable on the Snuba side so that all Sentry schemas and data are created or migrated into a specific database.

  2. Get the CREATE query for each table through Tabix (I recommend this method), or work from the metadata in the storage attached to the host (see the SHOW CREATE TABLE sketch after this list).

  3. When creating Replicated and Distributed tables, always add "ON CLUSTER" so that the tables are created on all shards and replicas at the same time.

  4. After creating the separate database, create the Replicated* and Distributed tables. Each engine in the MergeTree family has a corresponding Replicated* engine, so convert them one by one; for example, change ReplacingMergeTree to ReplicatedReplacingMergeTree. Simple Merge tables and MATERIALIZED VIEWs are kept as they are (at this point, the Merge table is based on the Distributed table).

  5. Wrap each replicated table once more with a Distributed table for sharding. Each Distributed table points to a replicated table; in my case, the sharding key is project_id.

  6. After all of the above is finished, insert the Sentry data you collected earlier into the Distributed tables and you can see that it gets resharded. (The name of each Distributed table must be the same as the original Sentry table name.)

  7. Finally, change the "CLICKHOUSE_DATABASE" environment variable on the Snuba side to point to the newly created database.

  8. Make sure the data is properly sharded and replicated (see the verification sketch after this list).
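
For step 2, besides Tabix, the original DDL can also be pulled straight from ClickHouse. This is only a sketch; it assumes the original Snuba tables live in a database named sentry, as in the example queries further down:

-- List the tables in the source database
SHOW TABLES FROM sentry;

-- Dump the original CREATE statement so its engine can be converted to the Replicated* counterpart
SHOW CREATE TABLE sentry.groupassignee_local;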
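
For step 8, these are the kinds of checks I mean; again just a sketch, assuming the table names from the example further down (sentrylab.groupassignee_local is the Distributed table) and a ClickHouse version that exposes the _shard_num virtual column on Distributed tables:

-- Row count per shard, read through the Distributed table
SELECT _shard_num, count() AS row_count
FROM sentrylab.groupassignee_local
GROUP BY _shard_num;

-- Replication health of the local Replicated* tables on this node
SELECT database, table, is_readonly, total_replicas, active_replicas
FROM system.replicas
WHERE database = 'sentrylab';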

Of course, as you mentioned, the table schemas change when the version is upgraded, so I may have to repeat this work for every upgrade. However, we decided that sharding was essential for production, so we went ahead with it and confirmed that it works without problems. Thank you for the answer!


For anyone looking into sharding, I'm leaving the ClickHouse cluster GitHub repo as a reference.

Table creation query example:

CREATE TABLE sentrylab.groupassignee_local_rep ON CLUSTER 'company_cluster' (
    offset UInt64,
    record_deleted UInt8,
    project_id UInt64,
    group_id UInt64,
    date_added Nullable(DateTime),
    user_id Nullable(UInt64),
    team_id Nullable(UInt64)
) ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/{cluster}/{shard}/groupassignee_local', '{replica}', offset)
ORDER BY (project_id, group_id)
SETTINGS index_granularity = 8192;

CREATE TABLE sentrylab.groupassignee_local ON CLUSTER 'company_cluster' AS sentrylab.groupassignee_local_rep
ENGINE = Distributed('company_cluster', sentrylab, groupassignee_local_rep, project_id);

INSERT INTO sentrylab.groupassignee_local SELECT * FROM sentry.groupassignee_local;
