Isn't ES fragrant? Why do you want ClickHouse?

"

Elasticsearch is a real-time distributed search and analysis engine. Its bottom layer is built on Lucene. In short, Lucene has distributed functions by expanding its search capabilities.

ES usually provides end-to-end log / search analysis together with the other two open source components Logstash (log collection) and Kibana (dashboard), which is often referred to as ELK.

Clickhouse is a relational database for column storage developed by Russian search giant Yandex. Clickhouse is the most popular in the field of OLAP in the past two years and was open source in 2016.

ES is the most popular big data log and search solution. But in recent years, its Kwai Fu status has been challenged. Many companies have begun to migrate their log solutions from ES to Clickhouse.

Comparison of architecture and design

Lucenc is the bottom layer of ES, mainly to solve the problem of search. Search is a common problem to be solved in the field of big data, that is, how to find the required data according to the conditions in the massive amount of data. The core technology of search is inverted index and bloom filter.

Through distributed technology, ES directly solves the problems of search performance and high availability in cluster by using fragmentation and replica mechanism.

ElasticSearch is designed for distributed and has good scalability. In a typical distributed configuration, each node can be configured into different roles.

As shown in the figure above:

  • Client Node, which is responsible for API and data access, does not store / process data.

  • The Data Node is responsible for data storage and indexing.

  • Master Node, the management node, is responsible for the coordination of nodes in the Cluster and does not store data.

ClickHouse is a distributed ROLAP (relational OLAP) analysis engine based on MPP architecture. Each node has the same responsibility and is responsible for some data processing (no content is shared).

ClickHouse is a real columnar database management system (DBMS). In ClickHouse, data is always stored by column, including the process performed by vectors (vectors or column blocks).

The simplest and effective way to make query faster is to reduce the data scanning range and the size of data transmission, and column storage and data compression can help achieve the above two points.

Clickhouse uses log merge tree, sparse index and CPU functions (such as SIMD single instruction multiple data) to give full play to the advantages of hardware and realize efficient computing.

Clickhouse uses Zookeeper to coordinate among distributed nodes.

To support search, Clickhouse also supports Bloom filters.

Query and contrast actual combat

In order to compare the basic query capabilities of ES and Clickhouse, I wrote some codes to verify:

https://github.com/gangtao/esvsch

The test architecture is as follows:

The architecture is mainly composed of four parts:

①ES stack

The ES stack consists of a single node elastic container and a Kibana container. Elastic is one of the tested targets, and Kibana is used as a verification and auxiliary tool.

The deployment code is as follows:

version: '3.7'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.4.0
    container_name: elasticsearch
    environment:
      - xpack.security.enabled=false
      - discovery.type=single-node
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    cap_add:
      - IPC_LOCK
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
      - 9300:9300
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4096M
        reservations:
          memory: 4096M

  kibana:
    container_name: kibana
    image: docker.elastic.co/kibana/kibana:7.4.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - 5601:5601
    depends_on:
      - elasticsearch

volumes:
  elasticsearch-data:
    driver: local

②Clickhouse stack

The Clickhouse stack has a single node Clickhouse service container and a TabixUI as the client of the Clickhouse.

The deployment code is as follows:

version: "3.7"
services:
  clickhouse:
    container_name: clickhouse
    image: yandex/clickhouse-server
    volumes:
      - ./data/config:/var/lib/clickhouse
    ports:
      - "8123:8123"
      - "9000:9000"
      - "9009:9009"
      - "9004:9004"
    ulimits:
      nproc: 65535
      nofile:
        soft: 262144
        hard: 262144
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "localhost:8123/ping"]
      interval: 30s
      timeout: 5s
      retries: 3
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4096M
        reservations:
          memory: 4096M

  tabixui:
    container_name: tabixui
    image: spoonest/clickhouse-tabix-web-client
    environment:
      - CH_NAME=dev
      - CH_HOST=127.0.0.1:8123
      - CH_LOGIN=default
    ports:
      - "18080:80"
    depends_on:
      - clickhouse
    deploy:
      resources:
        limits:
          cpus: '0.1'
          memory: 128M
        reservations:
          memory: 128M

③ Data import stack

The data import part uses vector The vector developed by dev, which is similar to fluent D, can realize the flexible data import of data pipeline.

④ Test control stack

Test control I used Jupyter and the Python SDK of ES and Clickhouse to test the query.

After starting the stack of ES and Clickhouse with Docker compose, we need to import data. We use the generator function of Vector to generate syslog and import es and Clickhouse at the same time.

Before that, we need to create a table on the Clickhouse. The index of ES does not have a fixed pattern, so there is no need to create an index in advance.

The code for creating the table is as follows:

CREATE TABLE default.syslog(
    application String,
    hostname String,
    message String,
    mid String,
    pid String,
    priority Int16,
    raw String,
    timestamp DateTime('UTC'),
    version Int16
) ENGINE = MergeTree()
    PARTITION BY toYYYYMMDD(timestamp)
    ORDER BY timestamp
    TTL timestamp + toIntervalMonth(1);

After creating the table, we can start the vector and write data to the two stack s. The data pipeline of vector is defined as follows:

[sources.in]
  type = "generator"
  format = "syslog"
  interval = 0.01
  count = 100000

[transforms.clone_message]
  type = "add_fields"
  inputs = ["in"]
  fields.raw = "{{ message }}"

[transforms.parser]
  # General
  type = "regex_parser"
  inputs = ["clone_message"]
  field = "message" # optional, default
  patterns = ['^<(?P<priority>\d*)>(?P<version>\d) (?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z) (?P<hostname>\w+\.\w+) (?P<application>\w+) (?P<pid>\d+) (?P<mid>ID\d+) - (?P<message>.*)$']

[transforms.coercer]
  type = "coercer"
  inputs = ["parser"]
  types.timestamp = "timestamp"
  types.version = "int"
  types.priority = "int"

[sinks.out_console]
  # General
  type = "console"
  inputs = ["coercer"] 
  target = "stdout" 

  # Encoding
  encoding.codec = "json" 


[sinks.out_clickhouse]
  host = "http://host.docker.internal:8123"
  inputs = ["coercer"]
  table = "syslog"
  type = "clickhouse"

  encoding.only_fields = ["application", "hostname", "message", "mid", "pid", "priority", "raw", "timestamp", "version"]
  encoding.timestamp_format = "unix"

[sinks.out_es]
  # General
  type = "elasticsearch"
  inputs = ["coercer"]
  compression = "none" 
  endpoint = "http://host.docker.internal:9200" 
  index = "syslog-%F"

  # Encoding

  # Healthcheck
  healthcheck.enabled = true

Here is a brief introduction to the pipeline:

  • source.in: generate simulation data of syslog, generate 10w pieces, generation interval and 0.01 seconds.

  • transforms.clone_message: copy the original message so that the extracted information can retain the original message at the same time.

  • transforms.parser: use regular expressions to extract the fields of application, hostname, message, mid, pid, priority, timestamp and version according to the definition of syslog.

  • transforms.coercer: data type conversion.

  • sinks.out_console: print the generated data to the console for development and debugging.

  • sinks.out_clickhouse: send the generated data to the Clickhouse.

  • sinks.out_es: send the generated data to ES.

Run the Docker command to execute the pipeline:

docker run \
        -v $(mkfile_path)/vector.toml:/etc/vector/vector.toml:ro \
        -p 18383:8383 \
        timberio/vector:nightly-alpine

After the data is imported, let's make a comparison for the following queries. ES uses its own query language to query. Clickhouse supports SQL. I simply tested some common queries and compared their functions and performance.

Return all records:

# ES
{
  "query":{
    "match_all":{}
  }
}

# Clickhouse 
"SELECT * FROM syslog"

Match a single field:

# ES
{
  "query":{
    "match":{
      "hostname":"for.org"
    }
  }
}

# Clickhouse 
"SELECT * FROM syslog WHERE hostname='for.org'"

Match multiple fields:

# ES
{
  "query":{
    "multi_match":{
      "query":"up.com ahmadajmi",
        "fields":[
          "hostname",
          "application"
        ]
    }
  }
}

# Clickhouse,
"SELECT * FROM syslog WHERE hostname='for.org' OR application='ahmadajmi'"

Word lookup to find a field containing a specific word:

# ES
{
  "query":{
    "term":{
      "message":"pretty"
    }
  }
}

# Clickhouse
"SELECT * FROM syslog WHERE lowerUTF8(raw) LIKE '%pretty%'"

Range query to find records with version greater than 2:

# ES
{
  "query":{
    "range":{
      "version":{
        "gte":2
      }
    }
  }
}

# Clickhouse
"SELECT * FROM syslog WHERE version >= 2"

Found a record with a field:

# ES
{
  "query":{
    "exists":{
      "field":"application"
    }
  }
}

# Clickhouse
"SELECT * FROM syslog WHERE application is not NULL"

ES is a document type database. The mode of each document is not fixed, so a field does not exist; The Clickhouse field is null.

Regular expression query: query the data matching a regular expression:

# ES
{
  "query":{
    "regexp":{
      "hostname":{
        "value":"up.*",
          "flags":"ALL",
            "max_determinized_states":10000,
              "rewrite":"constant_score"
      }
    }
  }
}

# Clickhouse
"SELECT * FROM syslog WHERE match(hostname, 'up.*')"

Aggregate count to count the number of occurrences of a field:

# ES
{
  "aggs":{
    "version_count":{
      "value_count":{
        "field":"version"
      }
    }
  }
}

# Clickhouse
"SELECT count(version) FROM syslog"

Aggregate non duplicate values and find the number of all non duplicate fields:

# ES
{
  "aggs":{
    "my-agg-name":{
      "cardinality":{
        "field":"priority"
      }
    }
  }
}

# Clickhouse
"SELECT count(distinct(priority)) FROM syslog "

I use Python SDK to run the above query on two stacks for 10 times, and then count the performance results of the query.

We draw the response time distribution of all queries:

The comparison of total query time is as follows:

Through the test data, we can see that Clickhouse is obviously better than Elastic in the performance of most queries.

In common search scenarios such as Regex query and Term query, it is not inferior.

In the aggregation scenario, Clickhouse performs exceptionally well, giving full play to the advantages of liecun engine.

Note that my test did not have any optimizations, nor did I turn on the bloom filter for Clickhouse. It can be seen that Clickhouse is indeed a very excellent database, which can be used in some search scenarios.

Of course, ES also supports very rich query functions. There are only some very basic queries here. Some queries may not be expressed in SQL.

summary

By testing some basic queries, this paper compares the functions and performance of Clickhouse and elastic search.

The test results show that Clickhouse performs very well in these basic scenarios and its performance is better than es, which also explains why many companies should switch from ES to Clickhouse.

Author: Gang Tao

Editor: 51CTO technology stack | Tao Jialong

Source: zhuanlan zhihu. com/p/353296392

Keywords: Database Big Data ElasticSearch Distribution clickhouse

Added by shalinik on Mon, 24 Jan 2022 15:50:02 +0200