Sqoop: shallow in, shallow out

Sqoop

A tool for efficiently transferring bulk data between Hadoop and relational databases
Latest stable version: 1.4.7 (Sqoop2 is not recommended for production)
A top-level Apache project (graduated from the Apache Incubator)

In essence, it is just a command-line tool
In production, data imports and exports are basically done by assembling Sqoop commands
How it works underneath: imports and exports are translated into MapReduce programs, mainly by customizing the MapReduce InputFormat and OutputFormat

Prerequisites: common Linux commands, basic RDBMS operations, and the purpose and basic operation of Hadoop

Install Sqoop
Jar packages needed later (such as JDBC drivers) can be placed in the lib directory
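After installation, a quick sanity check (a minimal sketch; the host and credentials below are placeholders):

$ sqoop version
$ sqoop list-databases --connect jdbc:mysql://db.foo.com --username someuser -P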

1. sqoop import

1. Brief description

This is by far the most heavily used command. In production, data from relational databases is generally imported into the data warehouse through sqoop import
sqoop import synchronizes a single table to HDFS; each row of the table corresponds to one record in HDFS
File formats: data can be stored as text files (one record per line), or in binary form as Avro or SequenceFiles
The sqoop import-all-tables command can synchronize multiple tables at once (see the sketch below). sqoop-import and sqoop import are equivalent
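A minimal sketch of importing every table in a database with import-all-tables (connection details and paths are placeholders):

$ sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp \
    --username someuser -P \
    --warehouse-dir /warehouse/corp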

2. Command parameters

--connect <jdbc-uri>	JDBC connection string
--connection-manager <class-name>	Specify the connection manager class to use
--driver <class-name>	JDBC driver class to use, e.g. com.mysql.jdbc.Driver or com.microsoft.jdbc.sqlserver.SQLServerDriver
--hadoop-mapred-home <dir>	Override $HADOOP_MAPRED_HOME
--password-file	Read the password from a file. Although safer, it is not commonly used; clusters generally sit on an intranet
-P	Read the password from the console; not commonly used
--password <password>	Set the authentication password; commonly used
--username <username>	Set the authentication username
--verbose	Print more information while working
--relaxed-isolation	Set connection transaction isolation to read uncommitted for the mappers (default: read committed)
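A minimal sketch exercising the common connection arguments above (host, database, and credentials are placeholders):

$ sqoop import --connect jdbc:mysql://db.foo.com/corp \
    --username someuser --password somepass \
    --table EMPLOYEES \
    --verbose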

Other parameters:
--append	Append data to an existing dataset in HDFS
--as-avrodatafile	Import data as Avro data files
--as-sequencefile	Import data as SequenceFiles
--as-parquetfile	Import data as Parquet files
--as-textfile	Import data as plain text (default)
--columns <col,col,col...>	Columns to import from the table
--delete-target-dir	Delete the import target directory if it exists
--direct	Use the direct connector if one exists for the database
--fetch-size <n>	Number of entries to read from the database in one batch
-m,--num-mappers <n>	Use n map tasks to import in parallel
-e,--query <statement>	Import the results of the given query
--table <table-name>	Source table to import
--target-dir <dir>	HDFS destination directory
--where <where clause>	WHERE clause to use during import
-z,--compress	Enable compression
--compression-codec <c>	Use the given Hadoop codec (default gzip)
--null-string <null-string>	The string to write for a null value in a string column
--null-non-string <null-string>	The string to write for a null value in a non-string column
--split-by <column-name>	Table column used to split work units
--split-limit <n>	Upper limit for each split size. Applies only to integer and date columns. For date or timestamp fields, the value is calculated in seconds
--autoreset-to-one-mapper	If the table has no primary key and no --split-by column is configured, fall back to reading the data with a single map task

--null-string and --null-non-string are optional. If not specified, the string "null" is used by default
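A minimal sketch combining several of the options above (table name and paths are placeholders):

$ sqoop import --connect jdbc:mysql://db.foo.com/corp \
    --table EMPLOYEES \
    --target-dir /warehouse/employees \
    --delete-target-dir \
    --null-string '\\N' --null-non-string '\\N' \
    -z \
    -m 4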

3. Simple use

Demo 1: --table --columns --where

sqoop import \
--connect jdbc:mysql://db.foo.com/corp \
--table xxx \
--columns "a,b,c" \
--where "a > 20"

Demo 2: you can import the result set of an arbitrary SQL query with --query. You must use --target-dir to specify the target directory

2.1 If importing in parallel, you must also specify --split-by

$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  --split-by a.id --target-dir /user/foo/joinresults

2.2 If a single map task executes the query and imports serially, --split-by is not needed

$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  -m 1 --target-dir /user/foo/joinresults

Best practice: the column chosen with --split-by should split the data into parts that are reasonably uniform in size; the split size can be further capped with --split-limit, as in the sketch below. In general, do not set -m too high, otherwise you may saturate the CPU of your relational database and prevent it from serving other queries normally
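For example, a minimal sketch that caps each split of a date column at one day, i.e. 86400 seconds (table, column, and paths are placeholders):

$ sqoop import --connect jdbc:mysql://db.foo.com/corp \
    --query 'SELECT * FROM EMPLOYEES WHERE $CONDITIONS' \
    --split-by hire_date \
    --split-limit 86400 \
    --target-dir /user/foo/employees \
    -m 4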

Every time you start a Sqoop job, Sqoop copies the jars in $SQOOP_HOME/lib to the job cache

Demo 3: --warehouse-dir vs --target-dir

$ sqoop import --connect <connect-str> --table foo --warehouse-dir /shared
This command will write a set of files to the /shared/foo/ directory.

$ sqoop import --connect <connect-str> --table foo --target-dir /dest
This imports the files directly into the /dest directory.

4. Incremental import

Sqoop provides an incremental import mode that can be used to retrieve only rows newer than some previously imported set of rows.
Parameters:
--check-column (col)	Specifies the column to examine when determining which rows to import
--incremental (mode)	Specifies the mode; two modes are supported: append and lastmodified
--last-value (value)	Specifies the maximum value of the check column from the previous import

Comparison of the two modes (see the sketch below):
If the source table's rows get updated: lastmodified
If the source table's rows are never updated: append
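A minimal sketch of an append-mode incremental import (table, column, path, and last value are placeholders):

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table orders \
    --target-dir /warehouse/orders \
    --incremental append --check-column id --last-value 100000

At the end of the run Sqoop prints the new --last-value to use for the next run. A lastmodified example is shown in the sqoop-merge section below.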

5. Import data to Hive

Hive parameters:

--hive-import	Import tables into Hive (uses Hive's default delimiters if none are set)
--hive-overwrite	Overwrite existing data in the Hive table
--hive-table <table-name>	Target Hive table
--hive-drop-import-delims	Drop \n, \r, and \01 from string fields when importing into Hive
--hive-delims-replacement	Replace \n, \r, and \01 in string fields with a user-defined string when importing into Hive
--hive-partition-key	Name of the Hive partition key
--hive-partition-value <v>	String value of the Hive partition
--map-column-hive <map>	Override the default SQL-type-to-Hive-type mapping for the configured columns

If the rows of your database table contain string fields with Hive's default row delimiters (\n and \r characters) or column delimiters (\01 characters), the data imported with Sqoop will be problematic.
You can use the --hive-drop-import-delims option to drop those characters on import
You can also use --hive-delims-replacement to replace them with a string of your choice
Of course, this is only necessary if your Hive table uses the default delimiters.

Sqoop imports NULL values as the string null by default. However, Hive uses the string \N to represent NULL values
Pass the parameters --null-string and --null-non-string (see the sketch below):
--null-string '\\N' --null-non-string '\\N'
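A minimal sketch of a Hive import that handles both the delimiter and NULL issues above (database, table, and Hive table names are placeholders):

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --hive-import \
    --hive-table ods.employees \
    --hive-drop-import-delims \
    --null-string '\\N' --null-non-string '\\N'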

6. Practical cases

A basic import of the EMPLOYEES table in a database named corp:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES

Selecting specific columns from the EMPLOYEES table:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --columns "employee_id,first_name,last_name,job_title"
    
Control the import parallelism (using 8 parallel tasks):
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    -m 8

Storing data in SequenceFiles and setting the generated class name to com.foocorp.Employee:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --class-name com.foocorp.Employee --as-sequencefile

Specify the separator to use in text mode import:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --fields-terminated-by '\t' --lines-terminated-by '\n' \
    --optionally-enclosed-by '\"'

Import data into Hive: 
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --hive-import


Conditional import:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --where "start_date > '2010-01-01'"

Change split column from default primary key:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --split-by dept_id


Performing an incremental import of new data, after having already imported the first 100,000 rows of a table:
$ sqoop import --connect jdbc:mysql://db.foo.com/somedb --table sometable \
    --where "id > 100000" --target-dir /incremental_dataset --append


sqoop-export

1. Brief description

The export tool exports a set of files from HDFS back to RDBMS
The target table must already exist in the database
The input file is read according to the separator specified by the user and parsed into a set of records
The default action is to convert these into a set of INSERT statements that inject records into the database.
In update mode, Sqoop generates UPDATE statements that replace existing records in the database (see the sketch after the basic export below)

2. Basic export

sqoop export --connect jdbc:mysql://db.example.com/foo --table bar  \
    --export-dir /results/bar_data
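The update mode mentioned above is selected with --update-key (and optionally --update-mode). A minimal sketch (the key column is a placeholder):

sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data \
    --update-key id --update-mode allowinsert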

sqoop-merge

1. Brief description

The merge tool allows you to merge two datasets, where the entries in the newer dataset overwrite the entries in the older dataset
For example, an incremental import run in lastmodified mode will generate multiple datasets in HDFS,
with the updated data appearing in each successive dataset
The merge tool "flattens" the two datasets into one, keeping the latest available record for each primary key

In essence, when the same key has multiple rows of data, sqoop merge helps us select the latest one

2. Command parameters

--class-name	Specify the name of the record-specific class to use during the merge job.
--jar-file	Specify the name of the jar from which to load the record class.
Note: to parse the dataset and extract the key column, you must use the class auto-generated by the previous import; specify its name and jar with --class-name and --jar-file.

--merge-key	Specify the name of the column to use as the merge key
--new-data	Specify the path of the newer dataset.
--onto	Specify the path of the older dataset.
--target-dir	Specify the destination path for the merge job output.
Explanation:
The merge tool runs a MapReduce job
that takes two directories as input: a newer dataset and an older dataset, specified with --new-data and --onto respectively.
The output of the MapReduce job is placed in the HDFS directory specified by --target-dir

Sqoop merge is usually run after an incremental import in lastmodified mode (sqoop import --incremental lastmodified ...), as in the sketch below
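A minimal sketch of such a lastmodified import, producing a delta dataset that can be merged later (table, column, paths, and timestamp are placeholders):

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --target-dir /warehouse/employees_delta \
    --incremental lastmodified --check-column updated_time \
    --last-value '2022-02-20 00:00:00' \
    -m 1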

3. Basic use

$ sqoop merge --new-data newer --onto older --target-dir merged \
    --jar-file datatypes.jar --class-name Foo --merge-key id
This runs a MapReduce job in which the value of the id column of each row is used to join rows; rows in the new dataset take precedence over rows in the old dataset
The merge tool can be used with SequenceFile-, Avro-, and text-based incremental imports. The file types of the new and old datasets must be the same.

Best practices

Full import:

sqoop-import \
--connect 'jdbc:mysql://host:3306/db?useSSL=false&useCursorFetch=true&tinyInt1isBit=false&zeroDateTimeBehavior=convertToNull' \
--username xxx --password 'xxx' \
--target-dir /warehouse/ods/ods_db_order \
--delete-target-dir --null-string '\\N' --null-non-string '\\N' --fields-terminated-by "\001" --lines-terminated-by "\n" \
--hive-drop-import-delims \
--query 'SELECT id, created_time, updated_time, order_id FROM `order` WHERE $CONDITIONS' \
--split-by 'id' \
--fetch-size 10000 \
-m 1

Incremental import 1:

For data that does not change after it is written, use incremental import:
Approach: daily full import plus minute-level increments (the latest increment id is saved in ZooKeeper and used in the WHERE clause; the incremental data is saved under the .../table/mode=append directory)

sqoop-import     \
--connect 'jdbc:mysql://host:3306/db?useSSL=false&useCursorFetch=true&tinyInt1isBit=false&zeroDateTimeBehavior=convertToNull'     \
--username xx --password 'xx'     \
--target-dir cosn://earth-1212ddd/warehouse/ods/table/mode=append \
--delete-target-dir     \
--null-string '\\N' --null-non-string '\\N' --fields-terminated-by "\t" --lines-terminated-by "\n"     \
--hive-drop-import-delims     \
--query 'SELECT id,class_id FROM class_t  WHERE $CONDITIONS and  id > 666 '    \
--split-by 'id'     \
--fetch-size 10000     \
-m 4

Incremental import 2:

For data that does not change, use Sqoop's built-in incremental import directly:

sqoop-import \
--connect 'jdbc:mysql://host:3306/ocpx_elf?useSSL=false&useCursorFetch=true&tinyInt1isBit=false' \
--username xx --password xx \
--target-dir /warehouse/ods/user_t \
--null-string '\\N' --null-non-string '\\N' --fields-terminated-by "\t" --lines-terminated-by "\n" \
--hive-drop-import-delims \
--query 'SELECT id, name FROM user_t t1 WHERE $CONDITIONS ORDER BY id DESC' \
--split-by 't1.id' \
--fetch-size 10000 \
--incremental append \
--check-column id \
--last-value 666 \
-m 1
# the last value is stored in ZooKeeper and passed to --last-value on each run
