RedisJSON Chinese full-text retrieval

RedisJSON

  • RedisJSON has been popular online recently, and I expect most readers have heard of it. One performance post even claims that RedisJSON has burst onto the scene and outperforms ES and Mongo! How objective those several-hundred-fold improvements are is another matter. I am more interested in RedisJSON's JSON support, its full-text retrieval, and its Chinese word segmentation.

Install

1. The official website offers a 30-day free trial with 30 MB of memory, so you can create an instance for testing.

  • Use redis-cli to test the connection:
[root@server bin]# ./redis-cli -h redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com -p 17137 -a 123456
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> 

2. Alternatively, install the RedisJSON (ReJSON) module yourself

Download path: https://redis.com/redis-enter...

Installation: https://oss.redis.com/redisjs...

[root@server bin]# ./redis-server --loadmodule /opt/thunisoft/redis/redisjson/rejson.so 
82538:C 29 Dec 2021 18:41:09.585 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
82538:C 29 Dec 2021 18:41:09.585 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=82538, just started
82538:C 29 Dec 2021 18:41:09.585 # Configuration loaded
82538:M 29 Dec 2021 18:41:09.587 * monotonic clock: POSIX clock_gettime
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 6.2.6 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                  
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 82538
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           https://redis.io       
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

82538:M 29 Dec 2021 18:41:09.589 # Server initialized
82538:M 29 Dec 2021 18:41:09.589 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
82538:M 29 Dec 2021 18:41:09.591 * <ReJSON> version: 20006 git sha: db3329c branch: HEAD
82538:M 29 Dec 2021 18:41:09.591 * <ReJSON> Exported RedisJSON_V1 API
82538:M 29 Dec 2021 18:41:09.591 * <ReJSON> Enabled diskless replication
82538:M 29 Dec 2021 18:41:09.591 * <ReJSON> Created new data type 'ReJSON-RL'
82538:M 29 Dec 2021 18:41:09.591 * Module 'ReJSON' loaded from /opt/thunisoft/redis/redisjson/rejson.so
82538:M 29 Dec 2021 18:41:09.602 * Loading RDB produced by version 6.2.6
82538:M 29 Dec 2021 18:41:09.602 * RDB age 98297 seconds
82538:M 29 Dec 2021 18:41:09.603 * RDB memory usage when created 0.77 Mb
82538:M 29 Dec 2021 18:41:09.603 # Done loading RDB, keys loaded: 2, keys expired: 0.
82538:M 29 Dec 2021 18:41:09.603 * DB loaded from disk: 0.011 seconds
82538:M 29 Dec 2021 18:41:09.603 * Ready to accept connections

Modify redis.conf

/opt/thunisoft/redis/bin/redis.conf
-- add this line
loadmodule /opt/thunisoft/redis/redisjson/rejson.so

Then restart Redis; JSON.SET is now available.

[root@server bin]# sh start.sh 
[root@server bin]# ./redis-cli   -a 123456 
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
127.0.0.1:6379> JSON.SET jsonkey   .  '{"a":"b","c":["1","2","3"]}'
OK
127.0.0.1:6379> JSON.GET jsonkey
"{\"a\":\"b\",\"c\":[\"1\",\"2\",\"3\"]}"
127.0.0.1:6379> JSON.GET jsonkey .a
"\"b\""

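The doubled quoting in the replies above (for example "\"b\"") is redis-cli's rendering of a reply that is itself JSON text. Here is a minimal Python sketch of what a client would do with such a reply. The reply strings are copied from the session above, and no Redis connection is assumed:

```python
import json

# Raw reply bodies as a client would receive them: plain JSON text.
reply_doc = '{"a":"b","c":["1","2","3"]}'  # reply to: JSON.GET jsonkey
reply_a = '"b"'                            # reply to: JSON.GET jsonkey .a

# Decoding turns the JSON text into native Python values.
doc = json.loads(reply_doc)
a = json.loads(reply_a)

print(doc["c"])
print(a)
```

So a path that points at a string still comes back as a quoted JSON string and has to be decoded like any other JSON value.
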
JSON usage

JSON.SET

127.0.0.1:6379>  JSON.SET doc . '{"a":2, "b": 3}'
OK
  • JSON.SET is the command that sets a JSON value
  • doc is the key
  • . is the root of the JSON document; it is followed by the JSON data itself, passed as a string
  • With RedisJSON 2.0+, the path . can be replaced with $: JSON.SET doc $ '{"a":2, "b": 3}'

JSON.GET

  • JSON.GET reads a JSON value
127.0.0.1:6379>  JSON.GET doc
"{\"a\":2,\"b\":3}"
127.0.0.1:6379>  JSON.GET doc a
"2"
  • Get values from a nested structure
127.0.0.1:6379> JSON.SET doc $ '{"a":2, "b": 3, "nested": {"a": 4, "b": null},"c":{"b":4}}'
OK
127.0.0.1:6379> JSON.GET doc b
"3"
-- $..b returns the value of every b in the document
127.0.0.1:6379> JSON.GET doc $..b
"[3,null,4]"

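To see why $..b returns three values, the recursive descent can be imitated in plain Python. This is only a sketch of the idea, not RedisJSON's implementation, and the helper name descend is made up for illustration:

```python
import json

def descend(node, key):
    """Collect every value stored under `key`, at any depth, in document order."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # Keep searching inside the value, whatever its key was.
            found.extend(descend(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(descend(item, key))
    return found

doc = json.loads('{"a":2, "b": 3, "nested": {"a": 4, "b": null},"c":{"b":4}}')
matches = descend(doc, "b")
print(json.dumps(matches, separators=(",", ":")))  # prints [3,null,4]
```

The top-level b, the b inside nested, and the b inside c are all collected, which matches the reply in the session above.
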
JSON.STRAPPEND

  • JSON.STRAPPEND <key> [path] <json-string>
  • Appends the JSON string value to the string at path. If path is not provided, it defaults to the root.
127.0.0.1:6379> JSON.SET doc $ '{"a":"foo", "nested": {"a": "hello"}, "nested2": {"a": 31}}'
OK
127.0.0.1:6379> JSON.GET doc $
"[{\"a\":\"foo\",\"nested\":{\"a\":\"hello\"},\"nested2\":{\"a\":31}}]"
127.0.0.1:6379> 
127.0.0.1:6379> JSON.STRAPPEND doc $..a '"baz"'
1) (integer) 6
2) (integer) 8
3) (nil)
127.0.0.1:6379> JSON.GET doc $
"[{\"a\":\"foobaz\",\"nested\":{\"a\":\"hellobaz\"},\"nested2\":{\"a\":31}}]"

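The three-element reply above (6, 8, nil) has one entry per match of $..a: the new string length, or nil where the matched value is not a string. Here is a rough Python imitation of that behaviour, with a hypothetical helper strappend_all that is not part of any Redis client:

```python
def strappend_all(node, key, suffix, results=None):
    """Append `suffix` to every string stored under `key`; record one result per match."""
    if results is None:
        results = []
    if isinstance(node, dict):
        for k in node:
            if k == key:
                if isinstance(node[k], str):
                    node[k] += suffix
                    results.append(len(node[k]))  # new length of the string
                else:
                    results.append(None)          # match exists but is not a string
            strappend_all(node[k], key, suffix, results)
    elif isinstance(node, list):
        for item in node:
            strappend_all(item, key, suffix, results)
    return results

doc = {"a": "foo", "nested": {"a": "hello"}, "nested2": {"a": 31}}
lengths = strappend_all(doc, "a", "baz")
print(lengths)  # prints [6, 8, None]
print(doc)
```
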
JSON.DEL

127.0.0.1:6379>  JSON.SET doc $ '{"a": 1, "nested": {"a": 2, "b": 3}}'
OK
127.0.0.1:6379> JSON.get doc
"{\"a\":1,\"nested\":{\"a\":2,\"b\":3}}"
127.0.0.1:6379> 
127.0.0.1:6379> 
--delete
127.0.0.1:6379> JSON.DEL doc $..a
(integer) 2
127.0.0.1:6379> 
127.0.0.1:6379> JSON.get doc
"{\"nested\":{\"b\":3}}"

JSON.ARRAPPEND

Syntax: JSON.ARRAPPEND <key> <path> <json> [json ...]

Appends the JSON values after the last element of the array at path.

127.0.0.1:6379> JSON.SET doc $ '{"a":[1], "nested": {"a": [1,2]}, "nested2": {"a": 42}}'
OK
127.0.0.1:6379> JSON.ARRAPPEND doc $..a 3 4
1) (integer) 3
2) (integer) 4
3) (nil)
127.0.0.1:6379> JSON.GET doc $
"[{\"a\":[1,3,4],\"nested\":{\"a\":[1,2,3,4]},\"nested2\":{\"a\":42}}]"
  • A nested array in JSON can hold multiple records, much like the rows of a table

    127.0.0.1:6379> JSON.SET testarray .  '{"employees":[         {"name":"Alpha", "email":"alpha@gmail.com", "age":23},         {"name":"Beta", "email":"beta@gmail.com", "age":28},       {"name":"Gamma", "email":"gamma@gmail.com", "age":33},         {"name":"Theta", "email":"theta@gmail.com", "age":41}    ]}   '
    OK
    127.0.0.1:6379> 
    127.0.0.1:6379> 
    127.0.0.1:6379> 
    127.0.0.1:6379> JSON.get testarray
    "{\"employees\":[{\"name\":\"Alpha\",\"email\":\"alpha@gmail.com\",\"age\":23},{\"name\":\"Beta\",\"email\":\"beta@gmail.com\",\"age\":28},{\"name\":\"Gamma\",\"email\":\"gamma@gmail.com\",\"age\":33},{\"name\":\"Theta\",\"email\":\"theta@gmail.com\",\"age\":41}]}"
    

JSON.ARRINSERT

Syntax: JSON.ARRINSERT <key> <path> <index> <json> [json ...]

Inserts the JSON values into the array at path, before index.

127.0.0.1:6379> JSON.SET doc $ '{"a":[3], "nested": {"a": [3,4]}}'
OK
127.0.0.1:6379> JSON.ARRINSERT doc $..a 0 1 2 5
1) (integer) 4
2) (integer) 5
127.0.0.1:6379> JSON.GET doc $
"[{\"a\":[1,2,5,3],\"nested\":{\"a\":[1,2,5,3,4]}}]"

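The index semantics of JSON.ARRINSERT in the example above (index 0, values 1 2 5) can be mirrored with a Python slice assignment. This is a sketch for non-negative indexes only, and the function arrinsert is a made-up stand-in rather than a client API:

```python
def arrinsert(arr, index, *values):
    """Insert `values` before position `index` and return the new array length."""
    arr[index:index] = values
    return len(arr)

a = [3]
nested_a = [3, 4]

print(arrinsert(a, 0, 1, 2, 5), a)                # prints 4 [1, 2, 5, 3]
print(arrinsert(nested_a, 0, 1, 2, 5), nested_a)  # prints 5 [1, 2, 5, 3, 4]
```

The returned lengths 4 and 5 correspond to the two integers in the reply above, one per array matched by $..a.
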
Many more JSON commands are available; for reference see: https://oss.redis.com/redisjs...

JSON full text retrieval

Working with documents: https://developer.redis.com/h...

By default, Chinese text is not segmented into words; it is only split at punctuation such as commas. English text supports full-text retrieval as is.

Looking through the documentation shows that the language used for word segmentation can be specified when the index is created with FT.CREATE:

FT.CREATE {index}
    [ON {data_type}]
       [PREFIX {count} {prefix} [{prefix} ...]
       [FILTER {filter}]
       [LANGUAGE {default_lang}]
       [LANGUAGE_FIELD {lang_attribute}]
       [SCORE {default_score}]
       [SCORE_FIELD {score_attribute}]
       [PAYLOAD_FIELD {payload_attribute}]
    [MAXTEXTFIELDS] [TEMPORARY {seconds}] [NOOFFSETS] [NOHL] [NOFIELDS] [NOFREQS] [SKIPINITIALSCAN]
    [STOPWORDS {num} {stopword} ...]
    SCHEMA {identifier} [AS {attribute}]
        [TEXT [NOSTEM] [WEIGHT {weight}] [PHONETIC {matcher}] | NUMERIC | GEO | TAG [SEPARATOR {sep}] [CASESENSITIVE] [SORTABLE [UNF]] [NOINDEX]] |
        [VECTOR {algorithm} {count} [{attribute_name} {attribute_value} ...]] ...
  • Create an index on JSON documents
  • Use ON JSON for JSON documents, and declare text fields as TEXT
-- Create a new index: i_index1
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137>   FT.CREATE i_index1 ON JSON LANGUAGE chinese SCHEMA $.title TEXT 
OK
--insert data
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137>   JSON.SET myDoc $ '{"title": "Panlong District, Kunming City, Yunnan Province", "content": "bar1"}'
OK
-- Searching for Kunming City returns the document
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137>  FT.SEARCH i_index1 "Kunming City" LANGUAGE chinese 
1) (integer) 1
2) "myDoc"
3) 1) "$"
   2) "{\"title\":\"\xe5\x9b\x9b\xe5\xb7\x9d\xe7\x9c\x81\xe6\x88\x90\xe9\x83\xbd\xe5\xb8\x82\xe6\x88\x90\xe5\x8d\x8e\xe5\x8c\xba\",\"content\":\"bar1\"}"
  • How the text is segmented
  • The results below show that Yunnan Province, Kunming City, and Panlong District can each be found, but Yunnan, Kunming, and Panlong on their own cannot.

    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "Yunnan Province" LANGUAGE chinese 
    1) (integer) 1
    2) "myDoc"
    3) 1) "$"
     2) "{\"title\":\"\xe5\x9b\x9b\xe5\xb7\x9d\xe7\x9c\x81\xe6\x88\x90\xe9\x83\xbd\xe5\xb8\x82\xe6\x88\x90\xe5\x8d\x8e\xe5\x8c\xba\",\"content\":\"bar1\"}"
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137>  FT.SEARCH i_index1 "area" LANGUAGE chinese 
    1) (integer) 0
    
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137>  FT.SEARCH i_index1 "Yunnan" LANGUAGE chinese 
    1) (integer) 0
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "Kunming City" LANGUAGE chinese 
    1) (integer) 1
    2) "myDoc"
    3) 1) "$"
     2) "{\"title\":\"\xe5\x9b\x9b\xe5\xb7\x9d\xe7\x9c\x81\xe6\x88\x90\xe9\x83\xbd\xe5\xb8\x82\xe6\x88\x90\xe5\x8d\x8e\xe5\x8c\xba\",\"content\":\"bar1\"}"
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "Kunming" LANGUAGE chinese 
    1) (integer) 0
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137>       FT.SEARCH i_index1 "Panlong District" LANGUAGE chinese 
    1) (integer) 1
    2) "myDoc"
    3) 1) "$"
     2) "{\"title\":\"\xe5\x9b\x9b\xe5\xb7\x9d\xe7\x9c\x81\xe6\x88\x90\xe9\x83\xbd\xe5\xb8\x82\xe6\x88\x90\xe5\x8d\x8e\xe5\x8c\xba\",\"content\":\"bar1\"}"
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137>   FT.SEARCH i_index1 "Panlong" LANGUAGE chinese 
    1) (integer) 0
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137>   FT.SEARCH i_index1 "area" LANGUAGE chinese 
    1) (integer) 0
    
  • Test with Nanjing Yangtze River Bridge (南京市长江大桥)
  • For this title, searching for Nanjing, Yangtze River, or Bridge returns nothing, while searching for Nanjing City or Yangtze River Bridge returns the document. My guess is that the title is segmented into Nanjing City and Yangtze River Bridge.

    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> JSON.SET myDoc $ '{"title": "Nanjing Yangtze River Bridge", "content": "bar1"}'
    OK
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "Nanjing City" LANGUAGE chinese 
    1) (integer) 1
    2) "myDoc"
    3) 1) "$"
       2) "{\"title\":\"\xe5\x8d\x97\xe4\xba\xac\xe5\xb8\x82\xe9\x95\xbf\xe6\xb1\x9f\xe5\xa4\xa7\xe6\xa1\xa5\",\"content\":\"bar1\"}"
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137>  FT.SEARCH i_index1 "Yangtze River" LANGUAGE chinese
    1) (integer) 0
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "Bridge" LANGUAGE chinese 
    1) (integer) 0
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "Yangtze River Bridge" LANGUAGE chinese 
    1) (integer) 1
    2) "myDoc"
    3) 1) "$"
       2) "{\"title\":\"\xe5\x8d\x97\xe4\xba\xac\xe5\xb8\x82\xe9\x95\xbf\xe6\xb1\x9f\xe5\xa4\xa7\xe6\xa1\xa5\",\"content\":\"bar1\"}"
    redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "Nanjing" LANGUAGE chinese 
    1) (integer) 0
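
The pattern of hits and misses above is what an inverted index gives when it stores only whole tokens. The following Python sketch assumes the title was segmented into the two tokens 南京市 (Nanjing City) and 长江大桥 (Yangtze River Bridge), and that a query term must equal a stored token exactly. Both are simplifying assumptions, not a description of RediSearch internals:

```python
from collections import defaultdict

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index

def search(index, term):
    """Exact-token lookup: substrings of a token do not match."""
    return sorted(index.get(term, ()))

# Assumed segmentation of the title 南京市长江大桥.
index = build_index({"myDoc": ["南京市", "长江大桥"]})

for term in ["南京市", "长江大桥", "南京", "长江", "大桥"]:
    print(term, search(index, term))
```

Only the first two terms return myDoc. 南京, 长江, and 大桥 are substrings of stored tokens but were never stored themselves, so they return nothing.
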
    

You need to specify LANGUAGE chinese when creating an index

redisjson: https://oss.redis.com/redisea...

  • Supported languages for full-text retrieval:
arabic
armenian
danish
dutch
english
finnish
french
german
hungarian
italian
norwegian
portuguese
romanian
russian
serbian
spanish
swedish
tamil
turkish
yiddish
chinese (see below)

RediSearch uses Friso by default for Chinese word segmentation

Friso: an open-source Chinese word segmenter written in ANSI C that implements the popular MMSEG algorithm. Its fully modular design makes it easy to embed in other programs, such as MySQL and PHP. The source code compiles on a variety of platforms without modification, and it can segment both UTF-8 and GBK encoded text.

Friso word segmentation

  • I installed Friso and tested it directly; the results confirm the guess above
[root@server friso-1.6.1-release]# ./src/friso -init ./friso.ini 
Initialized in 0.340000sec
Mode: Complex
+-Version: 1.6.1 (UTF-8)
+-----------------------------------------------------------+
| friso - a chinese word segmentation writen by c.          |
| bug report email - chenxin619315@gmail.com.               |
| or: visit http://code.google.com/p/friso.                 |
|     java edition for http://code.google.com/p/jcseg       |
| type 'quit' to exit the program.                          |
+-----------------------------------------------------------+
friso>> Nanjing Yangtze River Bridge
 Word segmentation result:
Nanjing Yangtze River Bridge 
Done, cost < 0.000000sec

friso>> Panlong District, Kunming City, Yunnan Province
 Word segmentation result:
Panlong District, Kunming City, Yunnan Province 
Done, cost < 0.000000sec
friso>> 

Friso is based on the MMSEG algorithm, which is mainly forward maximum matching supplemented by several disambiguation rules.

MMSEG word segmentation: http://technology.chtsai.org/...

Starting from the beginning of the sentence and working left to right, each round identifies the different possible combinations of three consecutive words, then uses the four disambiguation rules below to pick the best candidate combination.
The first word of the chosen combination becomes the segmentation result of that round; the remaining two words are segmented again in the next round.
The advantage over traditional forward maximum matching is that context is taken into account: a word is no longer chosen by looking only at the word itself while ignoring the words that follow it.
The four disambiguation rules are:
1) The total length of the candidate word combination is the largest.
2) The average word length of the candidate combination is the largest.
3) The variation in word length within the candidate combination is the smallest.
4) The frequency statistic of the single-character words in the candidate combination is the highest.
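
The procedure and the four rules above can be sketched in Python. This is a simplified illustration, not Friso's code: the vocabulary is a hypothetical toy dictionary, rule 3 is implemented as the variance of word lengths, and rule 4 is reduced to a plain sum of assumed frequencies for single-character words.

```python
def words_at(text, pos, vocab, max_len):
    """Dictionary words starting at pos; fall back to the single character."""
    found = [text[pos:pos + n] for n in range(1, max_len + 1)
             if pos + n <= len(text) and text[pos:pos + n] in vocab]
    return found or [text[pos]]

def chunks_at(text, pos, vocab, max_len, depth=3):
    """All combinations of up to `depth` consecutive words starting at pos."""
    if depth == 0 or pos >= len(text):
        return [[]]
    return [[w] + rest
            for w in words_at(text, pos, vocab, max_len)
            for rest in chunks_at(text, pos + len(w), vocab, max_len, depth - 1)]

def score(chunk, freq):
    """Rules 1-4 as a tuple, compared in order; larger is better."""
    lengths = [len(w) for w in chunk]
    mean = sum(lengths) / len(lengths)
    variance = sum((n - mean) ** 2 for n in lengths) / len(lengths)
    single_freq = sum(freq.get(w, 0) for w in chunk if len(w) == 1)
    return (sum(lengths), mean, -variance, single_freq)

def segment(text, vocab, freq=None):
    """Pick the best three-word chunk, emit its first word, then repeat."""
    freq = freq or {}
    max_len = max(len(w) for w in vocab)
    pos, result = 0, []
    while pos < len(text):
        best = max(chunks_at(text, pos, vocab, max_len),
                   key=lambda chunk: score(chunk, freq))
        result.append(best[0])
        pos += len(best[0])
    return result

# Hypothetical toy dictionary, chosen only for this example.
VOCAB = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}
tokens = segment("南京市长江大桥", VOCAB)
print(tokens)  # prints ['南京市', '长江大桥']
```

With this toy dictionary, three candidates tie on rule 1 (total length 7), and rule 2 prefers the two-word combination because its average word length is 3.5. That yields 南京市 and 长江大桥, the same split guessed from the search results earlier.
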

Comparison with word segmentation in the abase database (SCWS)

  • SCWS segments much more finely, covering essentially every way of splitting the text into words
postgres=# select to_tsvector('testzhcfg', 'Nanjing Yangtze River Bridge');
                                      to_tsvector                                       
----------------------------------------------------------------------------------------
 'Nanjing':2 'Nanjing City':1 'large':9 'Bridge':6 'city':3 'Bridge':10 'river':8 'long':7 'Yangtze River':5 'Yangtze River Bridge':4
(1 row)

postgres=#  select to_tsvector('testzhcfg', 'Panlong District, Kunming City, Yunnan Province');
                                             to_tsvector                                             
-----------------------------------------------------------------------------------------------------
 'area':12 'Loong':11 'Yunnan':2 'Yunnan Province':1 'city':7 'Kun':6,10 'Panlong':9 'Panlong District':8 'Kunming':5 'Kunming City':4 'province':3
(1 row)
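
The effect of such fine-grained output on search can be illustrated by emitting every dictionary word found anywhere in the text, rather than one best split. This is not SCWS's algorithm, only a sketch of the idea, and the vocabulary is the same kind of hypothetical toy dictionary as before:

```python
def all_words(text, vocab):
    """Every substring of `text` that is a dictionary word, by start position."""
    return [text[i:j]
            for i in range(len(text))
            for j in range(i + 1, len(text) + 1)
            if text[i:j] in vocab]

# Hypothetical toy dictionary, chosen only for this example.
VOCAB = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}
fine_tokens = all_words("南京市长江大桥", VOCAB)
print(fine_tokens)
```

Because 南京, 长江, and 大桥 are now indexed alongside the longer words, queries for them would match. The naive enumeration also emits 市长 (mayor), which straddles the real word boundary; the SCWS output above does not contain it.
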

ES

ES has a dedicated word segmentation engine and supports a variety of analyzers; the IK analyzer is commonly used for Chinese.

Summary

1. RedisJSON supports full-text retrieval over JSON and uses Friso for Chinese word segmentation. The segmentation is fairly coarse, so some two-character words cannot be found by a query.

2. RedisJSON's JSON operations are fairly comprehensive, but it has not been out for long and there are few production use cases to be found online.

Keywords: Operation & Maintenance

Added by john1704 on Tue, 04 Jan 2022 22:59:13 +0200