StarRocks troubleshooting process – import

Background

This document mainly describes how to troubleshoot import problems and how to resolve the most common ones. It briefly walks through the flow of each import method to help you understand the import process and locate problems. For details, please refer to the Import chapter of the documentation.

Troubleshooting process

Stream Load

Stream Load internal call chain

 

[Figure: Stream Load internal call chain]

 

Stream Load is a synchronous import method. The user sends an HTTP request to import a local file or data stream into StarRocks and waits for the system to return the import result, from which the success or failure of the import can be judged.
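A minimal Stream Load request looks like the sketch below; the host, port, credentials, database, table, and file name are placeholders to adapt to your environment:

# Hypothetical example: load the local file data.csv into table db1.tbl1.
# fe_host:8030 is the FE host and its http_port; the label makes the load traceable and idempotent.
curl --location-trusted -u root: \
    -H "label:demo_load_001" \
    -H "column_separator:," \
    -T ./data.csv \
    http://fe_host:8030/api/db1/tbl1/_stream_load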


  1. Judge by the return value, for example:
{
    "Status":"Fail",
    "BeginTxnTimeMs":1,
    "Message":"too many filtered rows",
    "NumberUnselectedRows":0,
    "CommitAndPublishTimeMs":0,
    "Label":"4682d766-0e53-4fce-b111-56a8d8bef390",
    "LoadBytes":69238389,
    "StreamLoadPutTimeMs":4,
    "NumberTotalRows":7077604,
    "WriteDataTimeMs":4350,
    "TxnId":33,
    "LoadTimeMs":4356,
    "ErrorURL":"http://192.168.10.142:8040/api/_load_error_log?file=__shard_2/error_log_insert_stmt_e44ae406-32c7-6d5c-a807-b05607a57cbf_e44ae40632c76d5c_a807b05607a57cbf",
    "ReadDataTimeMs":1961,
    "NumberLoadedRows":0,
    "NumberFilteredRows":7077604
}
  2. If the returned Status is not Success:

a. ErrorURL exists

curl the ErrorURL, for example:

curl "http://192.168.10.142:8040/api/_load_error_log?file=__shard_2/error_log_insert_stmt_e44ae406-32c7-6d5c-a807-b05607a57cbf_e44ae40632c76d5c_a807b05607a57cbf"

b. ErrorURL does not exist, for example:

{
    "TxnId":2271727,
    "Label":"4682d766-0e53-4fce-b111-56a8d8bef2340",
    "Status":"Fail",
    "Message":"Failed to commit txn 2271727. Tablet [159816] success replica num 1 is less then quorum replica num 2 while error backends 10012",
    "NumberTotalRows":1,
    "NumberLoadedRows":1,
    "NumberFilteredRows":0,
    "NumberUnselectedRows":0,
    "LoadBytes":575,
    "LoadTimeMs":26,
    "BeginTxnTimeMs":0,
    "StreamLoadPutTimeMs":0,
    "ReadDataTimeMs":0,
    "WriteDataTimeMs":21,
    "CommitAndPublishTimeMs":0
}

Find the load id of the import and the BE node it was scheduled to:

grep -w $TxnId fe.log|grep "load id"

#Output example:
2021-12-20 20:48:50,169 INFO (thrift-server-pool-4|138) [FrontendServiceImpl.streamLoadPut():809] receive stream load put request. db:ssb, tbl: demo_test_1, txn id: 1580717, load id: 7a4d4384-1ad7-b798-f176-4ae9d7ea6b9d, backend: 172.26.92.155

Check the specific cause on the corresponding BE node:

grep $load_id be.WARNING|less

Broker Load

 

[Figure: Broker Load import flow]

 

After the user submits an import task, the FE generates the corresponding plan and distributes it to multiple BEs according to the current number of BEs and the size of the file, and each BE executes part of the import. During execution, each BE pulls data through the Broker, preprocesses it, and imports it into the system. After all BEs have finished, the FE finally decides whether the import succeeded.
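For reference, a minimal Broker Load submission and status check might look like the sketch below; the label, HDFS path, broker name, and credentials are placeholders:

# Submit a Broker Load job (all names and paths are placeholders).
mysql -h fe_host -P 9030 -u root <<'SQL'
LOAD LABEL db1.label_demo_001
(
    DATA INFILE('hdfs://hdfs_host:8020/user/data/demo/*')
    INTO TABLE tbl1
    COLUMNS TERMINATED BY ','
)
WITH BROKER 'broker1'
('username' = 'hdfs_user', 'password' = 'hdfs_passwd')
PROPERTIES ('timeout' = '3600');
SQL

# Follow the job state (PENDING -> LOADING -> FINISHED / CANCELLED).
mysql -h fe_host -P 9030 -u root -e "SHOW LOAD WHERE LABEL = 'label_demo_001'\G"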


A Broker Load task goes through the states PENDING -> LOADING -> FINISHED (or CANCELLED). When the state is CANCELLED, troubleshooting is required.

  1. Run SHOW LOAD to check the task status, and follow up when the status is CANCELLED
  2. If the URL is not empty, use curl $URL to view the specific error information
  3. If the URL is empty, find the load id and the BE node through the FE log
  4. Check whether the HDFS file path is specified correctly; it can point to specific files or to all files in a directory
  5. For HDFS imports, check whether Kerberos authentication is required and properly configured
grep $JobId fe.log
  6. View the specific exception on the BE node
grep $load_id be.INFO

Possible values of type in ErrorMsg:

  • USER-CANCEL: task cancelled by user

  • ETL-RUN-FAIL: import task failed in ETL phase

  • ETL-QUALITY-UNSATISFIED: the data quality is unqualified, that is, the error rate exceeds max_filter_ratio

  • LOAD-RUN-FAIL: import task failed in LOADING phase

  • TIMEOUT: the import task did not complete within the specified timeout

  • UNKNOWN: UNKNOWN import error

Routine Load

                 +-----------------+
fe schedule job  |  NEED_SCHEDULE  |  user resume job
     +-----------+                 | <---------+
     |           |                 |           |
     v           +-----------------+           ^
     |                                         |
+------------+   user(system)pause job +-------+----+
|  RUNNING   |                         |  PAUSED    |
|            +-----------------------> |            |
+----+-------+                         +-------+----+
|    |                                         |
|    |           +---------------+             |
|    |           | STOPPED       |             |
|    +---------> |               | <-----------+
|   user stop job+---------------+    user stop job
|
|
|               +---------------+
|               | CANCELLED     |
+-------------> |               |
system error    +---------------+

The figure above shows the state machine of a Routine Load job

show routine load for db.job_name
MySQL [load_test]> SHOW ROUTINE LOAD\G;
*************************** 1. row ***************************
                  Id: 14093
                Name: routine_load_wikipedia
          CreateTime: 2020-05-16 16:00:48
           PauseTime: 2020-05-16 16:03:39
             EndTime: N/A
              DbName: default_cluster:load_test
           TableName: routine_wiki_edit
               State: PAUSED
      DataSourceType: KAFKA
      CurrentTaskNum: 0
       JobProperties: {"partitions":"*","columnToColumnExpr":"event_time,channel,user,is_anonymous,is_minor,is_new,is_robot,is_unpatrolled,delta,added,deleted","maxBatchIntervalS":"10","whereExpr":"*","maxBatchSizeBytes":"104857600","columnSeparator":"','","maxErrorNum":"1000","currentTaskConcurrentNum":"1","maxBatchRows":"200000"}
DataSourceProperties: {"topic":"starrocks-load","currentKafkaPartitions":"0","brokerList":"localhost:9092"}
    CustomProperties: {}
           Statistic: {"receivedBytes":162767220,"errorRows":132,"committedTaskNum":13,"loadedRows":2589972,"loadRowsRate":115000,"abortedTaskNum":7,"totalRows":2590104,"unselectedRows":0,"receivedBytesRate":7279000,"taskExecuteTimeMs":22359}
            Progress: {"0":"13824771"}
ReasonOfStateChanged: ErrorReason{code=errCode = 100, msg='User root pauses routine load job'}
        ErrorLogUrls: http://172.26.108.172:9122/api/_load_error_log?file=__shard_54/error_log_insert_stmt_e0c0c6b040c044fd-a162b16f6bad53e6_e0c0c6b040c044fd_a162b16f6bad53e6, http://172.26.108.172:9122/api/_load_error_log?file=__shard_55/error_log_insert_stmt_ce4c95f0c72440ef-a442bb300bd743c8_ce4c95f0c72440ef_a442bb300bd743c8, http://172.26.108.172:9122/api/_load_error_log?file=__shard_56/error_log_insert_stmt_8753041cd5fb42d0-b5150367a5175391_8753041cd5fb42d0_b5150367a5175391
            OtherMsg:
1 row in set (0.01 sec)

When the task status is PAUSED or CANCELLED, manual troubleshooting is required

When the task status is PAUSED:

  1. You can check ReasonOfStateChanged and locate the reason, such as "Offset out of range"
  2. If ReasonOfStateChanged is empty, check ErrorLogUrls to view the specific error information
curl ${ErrorLogUrls}
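After the cause has been fixed (for example the Kafka offset or the dirty data), the job can usually be resumed; a minimal sketch using the job from the example above (host and credentials are placeholders):

mysql -h fe_host -P 9030 -u root -e "RESUME ROUTINE LOAD FOR load_test.routine_load_wikipedia;"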

Spark Load

 

[Figure: Spark Load import flow]

 

Insert Into

INSERT INTO is another import method where problems are frequently encountered. Currently, INSERT INTO supports the following two forms:

  • Method 1: INSERT INTO table VALUES (...);
  • Method 2: INSERT INTO table1 SELECT ... FROM table2

Method 1 is not recommended for production use

Since INSERT INTO is a synchronous import method, the result is returned immediately after execution, and success or failure can be judged from the returned result.
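A minimal sketch of both forms (host, table, and column names are placeholders); if the statement fails, the error is returned directly to the client:

mysql -h fe_host -P 9030 -u root -e "
INSERT INTO tbl1 VALUES (1, 'a'), (2, 'b');   -- method 1, not recommended for production
INSERT INTO tbl1 SELECT id, name FROM tbl2;   -- method 2, recommended
"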

Flink-connector

The Flink connector writes to StarRocks by wrapping Stream Load; for the internal flow, refer to the Stream Load section above


Since the Flink connector uses Stream Load underneath, you can follow the Stream Load troubleshooting approach.

  1. First, search the Flink log for the keyword "_stream_load" to confirm that a Stream Load task was actually initiated (see the sketch after this list)
  2. Then search for the label of that Stream Load and check the import result returned for it, as shown below
{
    "Status":"Fail",
    "BeginTxnTimeMs":1,
    "Message":"too many filtered rows",
    "NumberUnselectedRows":0,
    "CommitAndPublishTimeMs":0,
    "Label":"4682d766-0e53-4fce-b111-56a8d8bef390",
    "LoadBytes":69238389,
    "StreamLoadPutTimeMs":4,
    "NumberTotalRows":7077604,
    "WriteDataTimeMs":4350,
    "TxnId":33,
    "LoadTimeMs":4356,
    "ErrorURL":"http://192.168.10.142:8040/api/_load_error_log?file=__shard_2/error_log_insert_stmt_e44ae406-32c7-6d5c-a807-b05607a57cbf_e44ae40632c76d5c_a807b05607a57cbf",
    "ReadDataTimeMs":1961,
    "NumberLoadedRows":0,
    "NumberFilteredRows":7077604
}
  3. Finally, refer to the Stream Load troubleshooting process above
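For steps 1 and 2, a hedged sketch of searching the Flink TaskManager log; the log path is a placeholder and depends on your deployment:

# Confirm a Stream Load request was sent and find its label and result.
grep "_stream_load" /path/to/flink/log/flink-*-taskexecutor-*.log | less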

Flink-CDC

Flink-CDC writes to StarRocks by wrapping Stream Load; for the internal flow, refer to the Stream Load section above

 

[Figure: Flink-CDC synchronization flow]

 

  1. When the Flink task doesn't report an error

Step 1: confirm that MySQL binlog is enabled. You can check it with SHOW VARIABLES LIKE 'log_bin';
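A minimal check against the source MySQL (host, port, and credentials are placeholders); the value should be ON:

mysql -h mysql_host -P 3306 -u root -p -e "SHOW VARIABLES LIKE 'log_bin';"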

Step 2: confirm that the versions of Flink, flink-cdc-connector, flink-starrocks-connector, and MySQL (5.7 or 8.0.x) meet the requirements. The major versions of Flink, flink-cdc-connector, and flink-starrocks-connector must be consistent, for example all 1.13

Step 3: determine step by step whether the problem is in reading the source table or in writing to StarRocks. The following SQL file is used for the demonstration; it is the flink-create.1.sql file generated in step 7 of the Flink-CDC synchronization procedure

CREATE DATABASE IF NOT EXISTS `test_db`;

CREATE TABLE IF NOT EXISTS `test_db`.`source_tb` (
  `id` STRING NOT NULL,
  `score` STRING NULL,
  PRIMARY KEY(`id`)
 NOT ENFORCED
) with (
  'username' = 'root',
  'password' = 'xxx',
  'database-name' = 'test',
  'table-name' = 'test_source',
  'connector' = 'mysql-cdc',
  'hostname' = '172.26.92.139',
  'port' = '8306'
);

CREATE TABLE IF NOT EXISTS `test_db`.`sink_tb` (
  `id` STRING NOT NULL,
  `score` STRING NULL,
  PRIMARY KEY(`id`)
 NOT ENFORCED
) with (
  'load-url' = 'sr_fe_host:8030',
  'sink.properties.row_delimiter' = '\x02',
  'username' = 'root',
  'database-name' = 'test_db',
  'sink.properties.column_separator' = '\x01',
  'jdbc-url' = 'jdbc:mysql://sr_fe_host:9030',
  'password' = '',
  'sink.buffer-flush.interval-ms' = '15000',
  'connector' = 'starrocks',
  'table-name' = 'test_tb'
);

INSERT INTO `test_db`.`sink_tb` SELECT * FROM `test_db`.`source_tb`;

Run the following command in the Flink installation directory to enter the Flink SQL client

bin/sql-client.sh

First, verify that reading the source table works

#Paste the SQL above to determine whether the problem is in querying the source table or in writing to StarRocks
CREATE DATABASE IF NOT EXISTS `test_db`;

CREATE TABLE IF NOT EXISTS `test_db`.`source_tb` (
  `id` STRING NOT NULL,
  `score` STRING NULL,
  PRIMARY KEY(`id`)
 NOT ENFORCED
) with (
  'username' = 'root',
  'password' = 'xxx',
  'database-name' = 'test',
  'table-name' = 'test_source',
  'connector' = 'mysql-cdc',
  'hostname' = '172.26.92.139',
  'port' = '8306'
);

#Verify whether the source is normal
select * from `test_db`.`source_tb`;

Then verify that writing to StarRocks works

CREATE TABLE IF NOT EXISTS `test_db`.`sink_tb` (
  `id` STRING NOT NULL,
  `score` STRING NULL,
  PRIMARY KEY(`id`)
 NOT ENFORCED
) with (
  'load-url' = 'sr_fe_host:8030',
  'sink.properties.row_delimiter' = '\x02',
  'username' = 'root',
  'database-name' = 'test_db',
  'sink.properties.column_separator' = '\x01',
  'jdbc-url' = 'jdbc:mysql://sr_fe_host:9030',
  'password' = '',
  'sink.buffer-flush.interval-ms' = '15000',
  'connector' = 'starrocks',
  'table-name' = 'test_tb'
);

INSERT INTO `test_db`.`sink_tb` SELECT * FROM `test_db`.`source_tb`;
  2. When the Flink task reports an error

Step 1: confirm that the Flink cluster is started. Some users forget to start the locally downloaded Flink; run ./bin/start-cluster.sh to start it (see the sketch below)
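A minimal sketch, assuming a local standalone deployment (the installation path is a placeholder):

cd /path/to/flink                 # your Flink installation directory
./bin/start-cluster.sh
# Confirm the JobManager and TaskManager processes are running.
jps | grep -E "StandaloneSessionClusterEntrypoint|TaskManagerRunner"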

Step 2: analyze the specific error message case by case

DataX

DataX writes to StarRocks by wrapping Stream Load; for the internal flow, refer to the Stream Load section above


Since DataX also uses Stream Load underneath, you can follow the Stream Load troubleshooting approach.

  1. First, search datax/log/yyyy-mm-dd/xxx.log for the keyword "_stream_load" to confirm that a Stream Load task was actually initiated

A. If no Stream Load was initiated, check datax/log/yyyy-mm-dd/xxx.log, then analyze and resolve the reported exceptions

B. If a Stream Load was initiated, search datax/log/yyyy-mm-dd/xxx.log for the label of that Stream Load and check the import result returned for it, as shown below

{
    "Status":"Fail",
    "BeginTxnTimeMs":1,
    "Message":"too many filtered rows",
    "NumberUnselectedRows":0,
    "CommitAndPublishTimeMs":0,
    "Label":"4682d766-0e53-4fce-b111-56a8d8bef390",
    "LoadBytes":69238389,
    "StreamLoadPutTimeMs":4,
    "NumberTotalRows":7077604,
    "WriteDataTimeMs":4350,
    "TxnId":33,
    "LoadTimeMs":4356,
    "ErrorURL":"http://192.168.10.142:8040/api/_load_error_log?file=__shard_2/error_log_insert_stmt_e44ae406-32c7-6d5c-a807-b05607a57cbf_e44ae40632c76d5c_a807b05607a57cbf",
    "ReadDataTimeMs":1961,
    "NumberLoadedRows":0,
    "NumberFilteredRows":7077604
}
  2. Next, refer to the Stream Load troubleshooting process above

Common problems

  1. "Failed to commit txn 2271727. Tablet [159816] success replica num 1 is less then quorum replica num 2 while error backends 10012",

To find the specific cause, follow the troubleshooting process above and check be.WARNING on the corresponding BE node for the specific exception

  2. close index channel failed / too many tablet versions

Imports are too frequent and compaction cannot merge the data in time, resulting in too many versions; the default limit is 1000 versions

Reduce the import frequency, or adjust the compaction strategy to speed up merging (after adjusting, observe memory and IO). Modify the following in be.conf:

base_compaction_check_interval_seconds = 10
cumulative_compaction_num_threads_per_disk = 4
base_compaction_num_threads_per_disk = 2
cumulative_compaction_check_interval_seconds = 2
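Depending on the version, some of these items may also be changeable at runtime through the BE update_config API (whether a given item is dynamically modifiable is an assumption to verify for your version; otherwise modify be.conf and restart the BE):

# be_host:8040 are placeholders for the BE host and its http_port.
curl -XPOST "http://be_host:8040/api/update_config?cumulative_compaction_check_interval_seconds=2"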
  3. Reason: invalid value '202123098432'.

The type of a column in the imported file does not match the type defined in the table

  4. the length of input is too long than schema

The length of a column in the imported file is incorrect, for example a fixed-length string exceeds the length set at table creation, or an int field exceeds 4 bytes

  5. actual column number is less than schema column number

After a line of the imported file is split by the specified separator, the number of columns is less than expected, usually because the separator is wrong

  6. actual column number is more than schema column number

After a line of the imported file is split by the specified separator, the number of columns is greater than expected, usually because the separator is wrong

  7. the frac part length longer than schema scale

The decimal part of a decimal column in the import file exceeds the specified length.

  8. the int part length longer than schema precision

The integer part of a decimal column in the import file exceeds the specified length.

  9. the length of decimal value is overflow

The length of a decimal column in the import file exceeds the specified length.

  10. There is no corresponding partition for this key

The value of the partition column of a row in the imported file is not within the partition range.

  11. Caused by: org.apache.http.ProtocolException: The server failed to respond with a valid HTTP response

The Stream Load port is configured incorrectly; it should be http_port

  12. With the Flink demo, the test database and table were created as required and the program reports no errors, but no data is written to the sink; how do I troubleshoot this?

It is probably because the BE cannot be reached. In the Stream Load wrapped by the Flink connector, the FE redirects the request to $be:$http_port after receiving it. When debugging locally, fe + http_port is usually reachable but be + http_port is not, so you need to open the firewall for be + http_port
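A quick reachability check from the machine running Flink (the host is a placeholder; 8040 is the default BE http_port):

nc -zv be_host 8040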

  13. Transaction commit successfully, but data will be visible later

This status also means the import has completed, but the data becomes visible with a delay because some publish operations timed out. You can also increase the FE configuration publish_version_timeout_second
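A hedged sketch of raising it at runtime; if the item is not dynamically modifiable in your version, set it in fe.conf and restart the FE (the host and the value 60 are placeholders):

mysql -h fe_host -P 9030 -u root -e "ADMIN SET FRONTEND CONFIG (\"publish_version_timeout_second\" = \"60\");"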

  14. get database write lock timeout

The FE thread pool may be exhausted. It is suggested to adjust the BE configuration thrift_rpc_timeout_ms = 10000 (default: 5000 ms)

  15. failed to send batch or TabletWriter add batch with unknown id

Please refer to the Import Overview / General system configuration / BE configuration section and adjust query_timeout and streaming_load_rpc_max_alive_time_sec appropriately

  16. LOAD-RUN-FAIL; msg:Invalid Column Name:xxx
  • If the data is in Parquet or ORC format, the column names in the file header must be consistent with the column names in the StarRocks table, for example:
(tmp_c1,tmp_c2)
SET
(
   id=tmp_c2,
   name=tmp_c1
)

This maps the columns named (tmp_c1, tmp_c2) in the Parquet or ORC file to the (id, name) columns of the StarRocks table. If there is no SET clause, the column names in COLUMNS are used for the mapping.

Note: if you use ORC files generated directly by some Hive versions, the header in the ORC file is not the Hive metadata but (_col0, _col1, _col2, ...), which may lead to the Invalid Column Name error; in that case you need to use SET to do the mapping.

  17. Can't get Kerberos realm

A: First, check that the /etc/krb5.conf file is configured on every machine that runs a broker.

If the error is still reported after configuration, add -Djava.security.krb5.conf=/etc/krb5.conf to the end of the JAVA_OPTS variable in the broker startup script.

  18. ORC data import failed: ErrorMsg: type:ETL_RUN_FAIL; msg:Cannot cast '<slot 6>' from VARCHAR to ARRAY<VARCHAR(30)>

The column names in the source file and in the StarRocks table are inconsistent. With a SET clause the system performs type inference and the subsequent cast fails; if the column names on both sides are made identical, no SET clause (and thus no cast) is needed and the import succeeds

  19. No source file in this table

The load found no source files for this table

  20. cause by: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]

Kerberos authentication failed. Use klist to check whether the ticket has expired and whether the account has access to the source data

  21. Reason: there is a row couldn't find a partition. src line: [];

A row in the imported data has a partition-column value that does not fall into any partition of the StarRocks table
