Linux text processing

1. awk

1.1 syntax

awk [options] 'BEGIN {cmd1; cmd2; ...} {cmd1; cmd2; ...} END {cmd1; cmd2; ...}' input_file

The BEGIN block runs once, before any input is read (optional).
The middle block runs once for each line (record) of the input.
The END block runs once, after all input has been processed (optional).
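The three stages can be seen in a minimal sketch that sums a column. The inline data and the variable name sum are made up for illustration:

```shell
# BEGIN initialises the accumulator, the middle block runs once per line,
# and END prints the result after the last line has been read.
total=$(printf 'a 1\nb 2\nc 3\n' |
    awk 'BEGIN { sum = 0 } { sum += $2 } END { print sum }')
echo "$total"   # prints 6
```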

You can also put the script in a file:

BEGIN {
    cmd1
    cmd2
    ...
}
{
    cmd1
    cmd2
    ...
}
END {
    cmd1
    cmd2
    ...
}

$ awk -f test.awk input_file

1.2 fields

$0		The entire line (record) of text
$1		The first data field in the line
$2		The second data field in the line
...

By default, fields are separated by runs of whitespace (spaces and tabs). Use the -Fsep option to change the field separator to sep, or set the FS variable:

$ awk -F: '{printf "%s\t\t\t%d\n", $1, $3}' /etc/passwd
	root			0
	daemon			1
	...

$ awk 'BEGIN {FS=":"} {printf "%s\t\t\t%d\n", $1, $3}' /etc/passwd
	root			0
	daemon			1
	...

1.3 variables

Common built-in variables

FS 				Input field separator
RS 				Input record separator
OFS 			Output field separator
ORS 			Output record separator
FIELDWIDTHS 	A space-separated list of numbers giving the exact width of each data field (gawk only)
NF 				Number of fields in the current record
...

The print command places the value of the OFS variable (a single space by default) between the comma-separated fields in its output.
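As a small illustration of FS, OFS and NF working together, the sketch below splits a made-up colon-separated record and joins the printed values with a dash:

```shell
# FS controls how the input is split; OFS is placed between the
# comma-separated arguments of print; NF is the field count of the record.
out=$(printf 'root:x:0:0\n' |
    awk 'BEGIN { FS = ":"; OFS = "-" } { print $1, $3, NF }')
echo "$out"   # prints root-0-4
```

Note that OFS is only used where print arguments are separated by commas; arguments written next to each other without commas are concatenated directly.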

The FIELDWIDTHS variable (a gawk extension) lets you split records by fixed column widths instead of by a field separator. Once FIELDWIDTHS is set, awk ignores the FS variable.

$ cat data.txt 
100200030000
400500060000

$ awk 'BEGIN {FIELDWIDTHS="3 4 5"} {print $1,$2,$3}' data.txt
100 2000 30000
400 5000 60000

1.4 patterns

Match patterns can be used to limit which records the program script acts on.

regular expression

/pattern/{cmds}

awk matches the regular expression against the entire record ($0), field separators included, so a pattern can span two fields.

$ cat data.txt 
hello world
hello linux

$ awk '/o w/{print $0}' data.txt 
hello world

Match operator

Allows you to restrict regular expression matches to specific data fields in the record.

Execute the script when the nth field matches the specified pattern:

$n ~ /pattern/{cmds}

Execute the script when the nth field does not match the specified pattern:

$n !~ /pattern/{cmds}
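A short sketch of both operators on made-up data. The word admin appears in the first field of one record and in the second field of another, so restricting the match to $2 changes the result:

```shell
data=$(printf 'alice admin\nbob user\nadmin guest\n')

# Names whose second field is exactly "admin".
match=$(printf '%s\n' "$data" | awk '$2 ~ /^admin$/ { print $1 }')

# Names whose second field is anything else.
nomatch=$(printf '%s\n' "$data" | awk '$2 !~ /^admin$/ { print $1 }')

echo "$match"     # alice
echo "$nomatch"   # bob, then admin
```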

comparison expression

You can also use comparison expressions as patterns.

x == y		x is equal to y
x != y		x is not equal to y
x <= y		x is less than or equal to y
x < y		x is less than y
x >= y		x is greater than or equal to y
x > y		x is greater than y

$ awk -F: '$4 == 0{print $1}' /etc/passwd
root

2. sed

2.1 line addressing

By default, sed applies its commands to every line of the input. To apply a command only to a specific line or range of lines, use line addressing.

address command

address {
    command1
    command2
    ...
    commandn
}
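The grouped form can be sketched as follows, applying two substitutions to line 2 only. The sample text is invented for the example:

```shell
# Both s commands inside the braces run only on line 2.
out=$(printf 'dog one\ndog two\ndog three\n' | sed '2{
s/dog/cat/
s/two/2/
}')
echo "$out"
```

Lines 1 and 3 pass through unchanged, and line 2 becomes "cat 2".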

The following addressing modes are supported:

Numeric addressing

There are two ways:

n      # Line n             
n1,n2  # Interval [n1, n2]

The first line is represented by 1 and the last line is represented by $.

$ sed '2s/dog/cat/' data1.txt
$ sed '2,3s/dog/cat/' data1.txt
$ sed '2,$s/dog/cat/' data1.txt

Text pattern addressing

There are two ways:

# All lines matching pattern
/pattern/

# From the line matching pattern1 through the line matching pattern2 (inclusive)
/pattern1/,/pattern2/

$ sed '/MyPattern/s/bash/csh/' /etc/passwd
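A minimal sketch of the range form, assuming hypothetical START and END marker lines. Every line from the first marker through the second gets a "> " prefix:

```shell
# The s command runs only on lines inside the /START/,/END/ range.
out=$(printf 'a\nSTART\nb\nEND\nc\n' | sed '/START/,/END/s/^/> /')
echo "$out"
```

The lines a and c fall outside the range and are printed unchanged.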

2.2 substitution

s/pattern/replacement/flags

By default, only the first match on each line is replaced.

Possible flags:

number: replace only the match with that ordinal on each line (for example, 2 replaces the second match);
g: replace all matches;
w file: write the lines on which a substitution was made to file.

$ sed 's/test/trial/' data4.txt
$ sed 's/test/trial/2' data4.txt
$ sed 's/test/trial/g' data4.txt
$ sed 's/test/trial/w test.txt' data5.txt
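To make the effect of the flags visible, here is a small sketch that runs the same substitution on one made-up line with no flag, with a number, and with g:

```shell
line='test test test'

first=$(echo "$line" | sed 's/test/trial/')     # no flag: first match only
second=$(echo "$line" | sed 's/test/trial/2')   # 2: second match only
all=$(echo "$line" | sed 's/test/trial/g')      # g: every match

echo "$first"    # trial test test
echo "$second"   # test trial test
echo "$all"      # trial trial trial
```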

2.3 delete line

$ sed 'd' data1.txt
$ sed '3d' data6.txt
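The d command accepts any of the address forms from section 2.1. A brief sketch with invented input, deleting once by line number and once by pattern:

```shell
# Delete line 2.
by_number=$(printf 'one\ntwo\nthree\n' | sed '2d')

# Delete every line that contains the letter t.
by_pattern=$(printf 'one\ntwo\nthree\n' | sed '/t/d')

echo "$by_number"    # one, then three
echo "$by_pattern"   # one
```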

2.4 insert and append lines

# Insert NewLine before the specified line
i\NewLine

# Insert NewLine after the specified line
a\NewLine

To insert multiple lines of text, end every line of the new text except the last with a backslash.

$ cat file
hello1
hello2

$ sed '2i\line1\
> line2' file
hello1
line1
line2
hello2
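The append command works the same way but places the text after the addressed line. A minimal sketch, reusing the two-line input from the example above and a made-up line of new text:

```shell
# a\ followed by a newline and the text appends after line 1.
out=$(printf 'hello1\nhello2\n' | sed '1a\
appended line')
echo "$out"
```

The new line appears between hello1 and hello2.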

2.5 line replacement

c\NewLine

To replace a line with multiple lines of text, end every line of the new text except the last with a backslash.

$ cat file
hello1
hello2

$ sed '1c\
> line1\
> line2' file
line1
line2
hello2

2.6 character mapping

y/inchars/outchars/

The first character in inchars is converted to the first character in outchars, the second to the second, and so on. inchars and outchars must be the same length.
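A small sketch of the mapping on an invented input, translating a, b and c to x, y and z:

```shell
# Each character is mapped independently: a->x, b->y, c->z.
out=$(echo 'abc cab' | sed 'y/abc/xyz/')
echo "$out"   # prints xyz zxy
```

Unlike s, the y command always applies to every occurrence on the line and does not use regular expressions.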

2.7 working with files

write file

w filename

Writes all lines in the matching address range to the file specified by filename.

$ sed '1,2w test.txt' data6.txt
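The following sketch shows what ends up in the output file. It works in a temporary directory created with mktemp, and the file names in.txt and out.txt are made up for the example:

```shell
tmp=$(mktemp -d)
printf 'line 1\nline 2\nline 3\n' > "$tmp/in.txt"

# Write lines 1 and 2 to out.txt; discard sed's normal output.
sed "1,2w $tmp/out.txt" "$tmp/in.txt" > /dev/null

saved=$(cat "$tmp/out.txt")
rm -rf "$tmp"
echo "$saved"   # line 1, then line 2
```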

read file

r filename

Reads the contents of the file and inserts them after each line in the matching address range.

$ sed '1,2r nums.txt' lines.txt
line 1
1
2
3
line 2
1
2
3
line 3

2.8 referencing the match in the replacement

The & symbol in the replacement part of the s command stands for the entire text matched by the pattern.

$ echo "The cat sleeps in his hat." | sed 's/.at/"&"/g'
The "cat" sleeps in his "hat".

\n refers to the text matched by the nth group, where groups are delimited by \( and \).

$ echo "The System Administrator manual" | sed 's/\(System\) Administrator/\1 User/'
The System User manual
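Groups become more useful when there are several of them. As a sketch on a made-up name, the command below captures two groups and swaps their order:

```shell
# \1 is the text before the comma, \2 the text after it.
out=$(echo 'Doe, John' | sed 's/\(.*\), \(.*\)/\2 \1/')
echo "$out"   # prints John Doe
```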

2.9 using variables

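Close the single-quoted script, put the variable in double quotes, and reopen the single quotes, in the form '"$var"'. The shell then expands the variable, while the rest of the script stays protected by single quotes.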

$ msg=hello
$ echo "world" | sed '1i\'"$msg"' '
hello 
world
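An alternative that is often simpler is to put the whole script in double quotes so the shell expands the variable directly. This is a sketch with a hypothetical variable named user; it assumes the value contains no characters that are special to sed, such as / or &:

```shell
user=alice

# Double quotes let the shell substitute $user before sed sees the script.
out=$(echo 'hello USER' | sed "s/USER/$user/")
echo "$out"   # prints hello alice
```

With double quotes, any literal $ or backslash in the sed script itself must be escaped for the shell, which is why the single-quote form is safer for longer scripts.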

3. grep

3.1 options

-i    Ignore case
-v    Invert the match: select the lines that do not match
-c    Print only the count of matching lines
-n    Prefix each matching line with its line number
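The options can be combined. A brief sketch on three invented lines:

```shell
data=$(printf 'Apple\nbanana\napple pie\n')

count=$(printf '%s\n' "$data" | grep -ic 'apple')     # case-insensitive count
others=$(printf '%s\n' "$data" | grep -iv 'apple')    # lines without a match
numbered=$(printf '%s\n' "$data" | grep -n 'apple')   # case-sensitive, numbered

echo "$count"      # 2
echo "$others"     # banana
echo "$numbered"   # 3:apple pie
```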

3.2 regular expressions

In grep's default basic regular expression syntax, several operators (?, +, |, { }, < and >) take effect only when preceded by a backslash.

Special characters

^	Start of line; negation when it is the first character inside [ ]
$	End of line
.	Any single character
|	Alternation (or)
<	Start of word
>	End of word

$ grep "^abc" data.txt 
$ grep "abc\$" data.txt 
$ grep "a.c" data.txt 
$ grep "adc\|456" data.txt 
$ grep "\<hijk" data.txt 
$ grep "efg\>" data.txt

Repetition and ranges

?						Matches the previous character 0 or 1 times
*						Matches the previous character 0 or more times
+						Matches the previous character 1 or more times
{m},{m,n},{m,},{,n}    Matches the previous character exactly m times, m to n times, at least m times, and at most n times, respectively
[]     Matches any one character from the specified set or range

$ grep "a\?b" data.txt
$ grep "a*b" data.txt
$ grep "a\+b" data.txt
$ grep "a\{2,\}b" data.txt
$ grep "e[a-zA-Z0-9]" data.txt
$ grep "e[^a-zA-Z0-9]" data.txt
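The backslashes can be dropped by switching to extended regular expressions with the -E option, which this article has not used so far. A sketch comparing the two syntaxes on made-up input:

```shell
# Basic syntax: the braces need backslashes.
bre=$(printf 'ab\naab\nb\n' | grep 'a\{2,\}b')

# Extended syntax (-E): the same interval without backslashes.
ere=$(printf 'ab\naab\nb\n' | grep -E 'a{2,}b')

# Extended syntax also takes | unescaped.
alt=$(printf 'adc\n123\n456\n' | grep -E 'adc|456')

echo "$bre"   # aab
echo "$ere"   # aab
echo "$alt"   # adc, then 456
```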

Standard character class

[:alnum:]	Letters and digits, equivalent to [A-Za-z0-9]
[:alpha:]	Letters, equivalent to [A-Za-z]
[:digit:]	Digits, equivalent to [0-9]
[:xdigit:]	Hexadecimal digits, equivalent to [0-9A-Fa-f]
[:blank:]	Space and tab
[:graph:]	Visible characters, ASCII 33 to 126
[:lower:]	Lowercase letters
[:upper:]	Uppercase letters
[:print:]	Printable characters
[:space:]	Whitespace characters, equivalent to [ \t\r\n\v\f]
[:punct:]	Punctuation characters
[:cntrl:]	ASCII control characters, codes 0 to 31 and 127

$ grep "e[[:alpha:]]" data.txt
$ grep "e[[:alpha:][:digit:]]" data.txt
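A character class is only valid inside a bracket expression, hence the double brackets. A short sketch on three invented lines shows how the class selects different lines:

```shell
# e followed by a digit.
digit=$(printf 'e1\nex\ne!\n' | grep 'e[[:digit:]]')

# e followed by a punctuation character.
punct=$(printf 'e1\nex\ne!\n' | grep 'e[[:punct:]]')

echo "$digit"   # e1
echo "$punct"   # e!
```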

Keywords: awk grep sed

Added by Deadman2 on Wed, 09 Mar 2022 11:21:23 +0200
