Python Elasticsearch DSL query, filtering and aggregation operations

Basic concepts of Elasticsearch

Index: the logical area that Elasticsearch uses to store data. It is similar to a database in a relational database system. An index can be spread over one or more shards, and each shard may have multiple replicas.

Document: the unit of data stored in Elasticsearch, similar to a row of a table in a relational database.

A document consists of multiple fields. Fields with the same name in different documents must have the same type. A field can appear more than once in a document, which means it holds several values (it is multivalued).

Document type: to make querying easier, an index may contain multiple document types. A document type is similar to a table in a relational database. Note, however, that fields with the same name in different document types must be of the same type.

Mapping: similar to a schema definition in a relational database. It stores the mapping information for each field. Different document types have different mappings.

The following table compares some terms between Elasticsearch and a relational database:

Relational database       Elasticsearch
Database                  Index
Table                     Type
Row                       Document
Column                    Field
Schema                    Mapping
Index                     Everything is indexed
SQL                       Query DSL
SELECT * FROM table...    GET http://...
UPDATE table SET          PUT http://...

Introduction to Python Elasticsearch DSL

Connecting to Elasticsearch:

import elasticsearch

# Connect to a local node; the URL form is accepted by both older and newer clients
es = elasticsearch.Elasticsearch("http://127.0.0.1:9200")

Let's take a look at search. q is the search content; extra spaces in q have no effect on the results. size specifies the number of hits to return, from_ specifies the starting position, and filter_path restricts which parts of the response are returned. In this example the final result contains only _id and _type.

res_3 = es.search(index="bank", q="Holmes", size=1, from_=1)
res_4 = es.search(index="bank", q=" 39225    5686 ", size=1000, filter_path=['hits.hits._id', 'hits.hits._type'])
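To make the effect of filter_path more concrete, here is a minimal sketch of reading a filtered response. The `filtered` dictionary is a hand-written stand-in for what the second call above might return; its ids and type name are made up, and `extract_ids` is a hypothetical helper, not part of the client.

```python
# Hand-written stand-in for a response filtered with
# filter_path=['hits.hits._id', 'hits.hits._type'] (values are made up).
filtered = {
    "hits": {
        "hits": [
            {"_id": "1", "_type": "account"},
            {"_id": "7", "_type": "account"},
        ]
    }
}


def extract_ids(response):
    """Return the _id of every hit; tolerate a response without a hits key."""
    hits = response.get("hits", {}).get("hits", [])
    return [hit["_id"] for hit in hits]


print(extract_ids(filtered))
print(extract_ids({}))
```

Reading the hits defensively with get() is a design choice here: a filtered response contains only the requested keys, so code that expects the full response layout may fail.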

Query all data in the specified index:

Here index specifies the index to search. A string names a single index; a list names several indexes, such as index=["bank", "banner", "country"]; a wildcard pattern selects every index that matches it, such as index=["apple*"], which covers all indexes whose names start with apple.

You can also specify a particular doc type in the search.

from elasticsearch_dsl import Search

# execute() sends the request and returns a Response object
response = Search(using=es, index="index-test").execute()
print(response.to_dict())
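As a rough illustration of the wildcard index pattern described above, the following sketch uses Python's fnmatch to show which names a pattern such as apple* would select. This is only a local approximation (fnmatch also gives special meaning to ? and [...]); the real expansion happens on the Elasticsearch side, and the index names are made up.

```python
from fnmatch import fnmatchcase


def matching_indices(pattern, index_names):
    """Return the names that a '*' wildcard pattern would select."""
    return [name for name in index_names if fnmatchcase(name, pattern)]


# Hypothetical index names, for illustration only
names = ["apple-2021", "apple-2022", "bank", "pineapple"]
print(matching_indices("apple*", names))
```

Note that pineapple is not selected, because the pattern has to match from the start of the name.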

Query by field; multiple query conditions can be chained:

s = Search(using=es, index="index-test").query("match", sip="192.168.1.1")
s = s.query("match", dip="192.168.1.2")
response = s.execute()
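Chaining two match queries combines them so that both must hold. A minimal sketch of the request body this corresponds to is shown below, built as a plain dict; `chained_match_body` is a hypothetical helper written for this illustration, not a library function.

```python
import json


def chained_match_body(**field_values):
    """Build a request body in which every given match clause must hold."""
    clauses = [{"match": {field: value}} for field, value in field_values.items()]
    if len(clauses) == 1:
        # A single clause needs no bool wrapper
        return {"query": clauses[0]}
    return {"query": {"bool": {"must": clauses}}}


body = chained_match_body(sip="192.168.1.1", dip="192.168.1.2")
print(json.dumps(body, indent=2))
```

Calling to_dict() on the Search object before execute() is a convenient way to compare the body the library actually builds with this sketch.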

Multi-field query:

from elasticsearch_dsl.query import MultiMatch

multi_match = MultiMatch(query='hello', fields=['title', 'content'])
s = Search(using=es, index="index-test").query(multi_match)
response = s.execute()

print(response.to_dict())

You can also use a Q() object to query multiple fields. fields is a list of field names, and query is the value to search for.

from elasticsearch_dsl import Q

q = Q("multi_match", query="hello", fields=['title', 'content'])
s = Search(using=es, index="index-test").query(q)
response = s.execute()

print(response.to_dict())

The first parameter of Q() is the query type, such as match or bool.

q = Q('bool', must=[Q('match', title='hello'), Q('match', content='world')])
s = Search(using=es, index="index-test").query(q)
response = s.execute()

print(response.to_dict())

Q() objects can also be combined with the operators | (or), & (and) and ~ (not), which is another way of writing the bool query above.

s = Search(using=es, index="index-test")

q = Q("match", title='python') | Q("match", title='django')
print(q.to_dict())
# {"bool": {"should": [...]}}
response = s.query(q).execute()

q = Q("match", title='python') & Q("match", title='django')
print(q.to_dict())
# {"bool": {"must": [...]}}
response = s.query(q).execute()

q = ~Q("match", title="python")
print(q.to_dict())
# {"bool": {"must_not": [...]}}
response = s.query(q).execute()
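The way the |, & and ~ operators map onto bool clauses can be sketched with a toy class. `MiniQ` below is a made-up stand-in written for this illustration; it is not how elasticsearch_dsl implements Q, and it skips the simplifications the real library may apply when combining queries.

```python
class MiniQ:
    """Toy query object that wraps a query dict (illustration only)."""

    def __init__(self, body):
        self.body = body

    def __or__(self, other):
        # a | b: at least one clause should match
        return MiniQ({"bool": {"should": [self.body, other.body]}})

    def __and__(self, other):
        # a & b: both clauses must match
        return MiniQ({"bool": {"must": [self.body, other.body]}})

    def __invert__(self):
        # ~a: the clause must not match
        return MiniQ({"bool": {"must_not": [self.body]}})

    def to_dict(self):
        return self.body


python = MiniQ({"match": {"title": "python"}})
django = MiniQ({"match": {"title": "django"}})

print((python | django).to_dict())
print((python & django).to_dict())
print((~python).to_dict())
```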

Filtering: the example below uses a range filter. Here range is the filter type, timestamp is the name of the field to filter on, gte means greater than or equal to, and lt means less than; set the bounds as needed.

On the difference between term and match: term is an exact match against the indexed value, whereas match is a full-text query that analyzes (tokenizes) the input and returns a relevance score. (Because analyzed text is usually indexed in lowercase, a term query for a value that contains uppercase letters returns nothing, i.e. no hit, while match is case insensitive and returns the same result either way.)

import time

# s is an existing Search object
# Range filter combined with a match query
s = s.filter("range", timestamp={"gte": 0, "lt": time.time()}).query("match", country="in")
# Terms filter: exact match against any of the listed values
res_3 = s.filter("terms", balance_num=["39225", "5686"]).execute()
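The difference between term and match described above can be imitated with a toy inverted index. This is only a sketch under simplifying assumptions: `analyze` stands in for a real analyzer by lowercasing and splitting on whitespace, there is no scoring, and the two documents are made up.

```python
def analyze(text):
    """Rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def build_index(docs, field):
    """Map each indexed token to the set of document ids that contain it."""
    index = {}
    for doc_id, doc in docs.items():
        for token in analyze(doc[field]):
            index.setdefault(token, set()).add(doc_id)
    return index


def term_query(index, value):
    """Exact lookup: the value is compared with the indexed tokens as-is."""
    return index.get(value, set())


def match_query(index, text):
    """Analyze the input first, then collect docs containing any resulting token."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return hits


# Made-up documents, for illustration only
docs = {1: {"name": "Sherlock Holmes"}, 2: {"name": "John Watson"}}
index = build_index(docs, "name")

print(term_query(index, "Holmes"))   # no hit: the indexed token is "holmes"
print(term_query(index, "holmes"))   # hits document 1
print(match_query(index, "Holmes"))  # input is analyzed first, hits document 1
```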

Other ways of writing filters:

from elasticsearch_dsl import Search, Q

s = Search()
s = s.filter('terms', tags=['search', 'python'])
print(s.to_dict())
# {'query': {'bool': {'filter': [{'terms': {'tags': ['search', 'python']}}]}}}

# Equivalent, written as an explicit bool query on a fresh Search
s = Search()
s = s.query('bool', filter=[Q('terms', tags=['search', 'python'])])
print(s.to_dict())
# {'query': {'bool': {'filter': [{'terms': {'tags': ['search', 'python']}}]}}}

# exclude() negates the filter
s = Search()
s = s.exclude('terms', tags=['search', 'python'])
# or, equivalently
s = Search()
s = s.query('bool', filter=[~Q('terms', tags=['search', 'python'])])
print(s.to_dict())
# {'query': {'bool': {'filter': [{'bool': {'must_not': [{'terms': {'tags': ['search', 'python']}}]}}]}}}

Aggregations can be stacked on top of queries, filters and other operations; they are added through aggs.

A bucket is a group. The first parameter is the name of the group, which you choose yourself. The second parameter is the aggregation type, and the field argument specifies the field to group by.

The same applies to metric. The metric types include sum, avg, max, min and so on. Note that two types return several of these values at once: stats returns count, min, max, avg and sum, and extended_stats additionally returns values such as the variance and standard deviation.

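from elasticsearch_dsl import A

# Example 1: terms bucket with two stats metrics computed per bucket
s.aggs.bucket("per_country", "terms", field="timestamp").metric("sum_click", "stats", field="click").metric("sum_request", "stats", field="request")

# Example 2: terms bucket on a keyword field with one stats metric
s.aggs.bucket("per_age", "terms", field="click.keyword").metric("sum_click", "stats", field="click")

# Example 3: a top-level metric without a bucket
s.aggs.metric("sum_age", "extended_stats", field="impression")

# Example 4: a bucket without a metric (document counts only)
s.aggs.bucket("per_age", "terms", field="country.keyword")

# Example 5: aggregation over value ranges, built with A() and attached under a name of your choice
a = A("range", field="account_number", ranges=[{"to": 10}, {"from": 11, "to": 21}])
s.aggs.bucket("per_account_range", a)

res = s.execute()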

Finally, execute() still has to be called. Note that s.aggs modifies the Search object in place, so its return value should not be kept as the result (for example, res = s.aggs is wrong). The aggregation results are part of res, the response returned by execute(), under res.aggregations.
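What a terms bucket with a nested stats metric computes can be imitated in plain Python. The sketch below is an illustration only: the documents are made up, `stats` and `terms_with_stats` are hypothetical helpers, and a real response also contains fields such as doc_count that are left out here.

```python
from collections import defaultdict


def stats(values):
    """Return the five numbers that a stats metric reports."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }


def terms_with_stats(docs, bucket_field, metric_field):
    """Group docs by bucket_field, then compute stats of metric_field per group."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[bucket_field]].append(doc[metric_field])
    return {key: stats(values) for key, values in groups.items()}


# Made-up documents, for illustration only
docs = [
    {"country": "in", "click": 3},
    {"country": "in", "click": 5},
    {"country": "us", "click": 10},
]
result = terms_with_stats(docs, "country", "click")
print(result["in"])
print(result["us"])
```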

Sorting:

s = Search().sort(
    'category',
    '-title',
    {"lines" : {"order" : "asc", "mode" : "avg"}}
)
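A leading minus sign in a sort key, as in '-title', is shorthand for descending order. The sketch below shows one way such shorthand could be expanded into explicit sort clauses; `normalize_sort` is a hypothetical helper for illustration, not the library's own code.

```python
def normalize_sort(*keys):
    """Expand '-field' shorthand into explicit descending sort clauses."""
    clauses = []
    for key in keys:
        if isinstance(key, str) and key.startswith("-"):
            clauses.append({key[1:]: {"order": "desc"}})
        else:
            # Plain field names and explicit dicts are passed through unchanged
            clauses.append(key)
    return clauses


print(normalize_sort("category", "-title", {"lines": {"order": "asc", "mode": "avg"}}))
```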

Paging:

s = s[10:20]
# {"from": 10, "size": 10}
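The slice above maps onto the from and size parameters: the start of the slice becomes from, and the length of the slice becomes size. A minimal sketch of that arithmetic, using a hypothetical helper that only handles slices with explicit, non-negative bounds:

```python
def slice_to_paging(window):
    """Translate a slice with explicit, non-negative bounds into from/size."""
    if window.start is None or window.stop is None:
        raise ValueError("this sketch needs both start and stop")
    if window.start < 0 or window.stop < window.start:
        raise ValueError("bounds must be non-negative and ordered")
    return {"from": window.start, "size": window.stop - window.start}


print(slice_to_paging(slice(10, 20)))
```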

Some additional methods for interested readers:

s = Search()

# Set extra properties of the request body with the extra() method
s = s.extra(explain=True)

# Set query-string parameters with params()
# (search_type="count" was removed from Elasticsearch; this is one of the remaining values)
s = s.params(search_type="dfs_query_then_fetch")

# To limit the returned fields, use the source() method
# only return the selected fields
s = s.source(['title', 'body'])
# don't return any fields, just the metadata
s = s.source(False)
# explicitly include/exclude fields
s = s.source(includes=["title"], excludes=["user.*"])
# reset the field selection
s = s.source(None)

# Build a Search object from a dict
s = Search.from_dict({"query": {"match": {"title": "python"}}})

# Modify an existing Search object in place
s.update_from_dict({"query": {"match": {"title": "python"}}, "size": 42})

Added by beermaker74 on Thu, 06 Jan 2022 15:55:20 +0200