Open source vector database: Milvus

Milvus is an open source vector similarity search engine. It supports inserting, deleting, and updating vectors at TB scale, with near-real-time search, and is designed for flexibility, stability, reliability, and fast queries. Milvus integrates widely used vector index libraries such as Faiss, NMSLIB, and Annoy behind a simple, intuitive set of APIs, so you can choose a different index type for each scenario. Milvus can also filter on scalar data, which further improves recall and makes searches more flexible.

Features

  • Heterogeneous computing
  1. Vector search and index building are optimized for GPUs
  2. A single general-purpose server can search TB-scale data in milliseconds
  3. Dynamic data management
  • Support for mainstream index libraries, distance metrics, and monitoring tools
  1. Integrates vector index libraries such as Faiss, NMSLIB, and Annoy
  2. Supports quantization-based, graph-based, and tree-based indexes
  3. Similarity metrics include Euclidean distance (L2), inner product (IP), Hamming distance, Jaccard distance, and others
  4. Prometheus stores monitoring and performance metrics, and Grafana visualizes them
  • Near-real-time search
  1. By default, data inserted into Milvus becomes searchable within 1 second

Vector distance

Euclidean distance (L2)

Inner product (IP)

Jaccard distance

Tanimoto distance

Hamming distance
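
The metrics listed above can be sketched in plain Python to make their definitions concrete. This is only an illustration, not Milvus code: the function names are made up for this example, the binary metrics take lists of 0/1 values, and the Tanimoto distance is written as the negative base-2 logarithm of the Tanimoto coefficient, which is an assumption worth checking against the metric documentation for your Milvus version.

```python
import math


def euclidean_l2(a, b):
    """Euclidean (L2) distance between two float vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    """Inner product (IP); a larger value means the vectors are more similar."""
    return sum(x * y for x, y in zip(a, b))


def hamming(a, b):
    """Hamming distance: number of positions where two bit vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def _bit_overlap(a, b):
    """Return (bits set in both vectors, bits set in at least one vector)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both, either


def jaccard_distance(a, b):
    """Jaccard distance on bit vectors: 1 - |A and B| / |A or B|."""
    both, either = _bit_overlap(a, b)
    return 1.0 - both / either if either else 0.0


def tanimoto_distance(a, b):
    """Assumed form: -log2 of the Tanimoto coefficient of two bit vectors."""
    both, either = _bit_overlap(a, b)
    if either == 0:
        return 0.0          # two all-zero vectors are treated as identical
    if both == 0:
        return math.inf     # no shared bits: coefficient is 0
    return -math.log2(both / either)
```

L2 and IP apply to float vectors, while Hamming, Jaccard, and Tanimoto apply to binary vectors, so the metric has to match the kind of vector stored in the collection.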

Python SDK

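The examples below use the pymilvus 1.x API (the milvus module). A plain install now pulls pymilvus 2.x, whose API is different, so pin a 1.x release that matches your server version:

pip3 install "pymilvus<2"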

from milvus import Milvus, IndexType, MetricType, Status

# Connect with a host and port...
milvus = Milvus(host='localhost', port='19530')
# ...or, equivalently, with a URI
milvus = Milvus(uri='tcp://localhost:19530')

  • Create collection
    • Create a collection named test01 with a dimension of 256, an index_file_size of 1024 MB (the data file size that triggers automatic index creation), and Euclidean distance (L2) as the distance metric

param = {'collection_name': 'test01', 'dimension': 256, 'index_file_size': 1024, 'metric_type': MetricType.L2}
milvus.create_collection(param)

  • Delete collection
milvus.drop_collection(collection_name='test01')

  • Create partition
milvus.create_partition('test01', 'tag01')

  • Delete partition
milvus.drop_partition(collection_name='test01', partition_tag='tag01')

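  • Insert vectors into the collection
import random
# 20 random 256-dimensional vectors
vectors = [[random.random() for _ in range(256)] for _ in range(20)]

# insert() returns a status and the IDs assigned to the inserted vectors
status, ids = milvus.insert(collection_name='test01', records=vectors)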


  • Insert vectors with custom IDs
vector_ids = list(range(20))
status, ids = milvus.insert(collection_name='test01', records=vectors, ids=vector_ids)

  • Insert vectors into a partition
status, ids = milvus.insert('test01', vectors, partition_tag="tag01")

  • Delete vectors by ID
ids = list(range(20))

milvus.delete_entity_by_id(collection_name='test01', id_array=ids)

  • Create index
ivf_param = {'nlist': 16384}
milvus.create_index('test01', IndexType.IVF_FLAT, ivf_param)
  • Delete index
milvus.drop_index('test01')
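
An IVF-style index such as IVF_FLAT groups vectors into nlist clusters, and a search scans only the nprobe clusters closest to the query. The toy sketch below shows that idea in plain Python. It is not how Milvus or Faiss implement it: the function names are hypothetical, and the centroids are simply the first nlist vectors, whereas real indexes train them (typically with k-means).

```python
import math


def l2(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, nlist):
    """Assign every vector to the bucket of its nearest centroid.

    Simplification: the first nlist vectors serve as centroids.
    """
    centroids = [list(v) for v in vectors[:nlist]]
    buckets = [[] for _ in centroids]
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        buckets[nearest].append(vec_id)
    return centroids, buckets


def search_ivf(vectors, centroids, buckets, query, nprobe, top_k):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [vec_id for c in order[:nprobe] for vec_id in buckets[c]]
    candidates.sort(key=lambda vec_id: l2(query, vectors[vec_id]))
    return candidates[:top_k]


data = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0]]
centroids, buckets = build_ivf(data, nlist=2)
nearest_ids = search_ivf(data, centroids, buckets, [9, 9], nprobe=1, top_k=2)
```

A larger nprobe scans more buckets, which tends to improve recall at the cost of speed; a larger nlist makes each bucket smaller.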

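  • Search vectors
search_param = {'nprobe': 16}
# 5 random 256-dimensional query vectors
q_records = [[random.random() for _ in range(256)] for _ in range(5)]
# search() returns a status and, for each query vector, its top_k nearest results
status, results = milvus.search(collection_name='test01', query_records=q_records, top_k=2, params=search_param)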


top_k is the number of vectors nearest to the query vector that the search returns.
The valid range of top_k is [1, 16384].
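
To make the meaning of top_k concrete, here is a minimal brute-force sketch in plain Python that returns the top_k nearest vectors under L2 distance. It is an exact scan written for illustration only (the function name is hypothetical); an indexed Milvus search is approximate and works differently internally.

```python
import heapq
import math


def brute_force_top_k(vectors, query, top_k):
    """Return the top_k (id, distance) pairs nearest to query, closest first."""
    if not 1 <= top_k <= 16384:
        raise ValueError("top_k must be in [1, 16384]")
    scored = (
        (vec_id, math.sqrt(sum((x - y) ** 2 for x, y in zip(vec, query))))
        for vec_id, vec in enumerate(vectors)
    )
    # nsmallest keeps only the top_k closest candidates, sorted by distance
    return heapq.nsmallest(top_k, scored, key=lambda pair: pair[1])


stored = [[0, 0], [1, 0], [5, 5], [0, 2]]
hits = brute_force_top_k(stored, [0, 0], top_k=2)
```

If the collection holds fewer than top_k vectors, this sketch simply returns all of them.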

  • Search vectors in a partition
q_records = [[random.random() for _ in range(256)] for _ in range(5)]
status, results = milvus.search(collection_name='test01', query_records=q_records, top_k=1, partition_tags=['tag01'], params=search_param)

  • Flush data to disk (after a data change, data is flushed to disk automatically within 1 s; flush() forces it for the given collections)
milvus.flush(collection_name_array=['test01'])

  • Compact data segments

A collection can contain multiple data segments. When vectors in a segment are deleted, the space they occupy is not released automatically. Compacting the collection releases that space.

milvus.compact(collection_name='test01', timeout=1)
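
The reason compaction is needed can be sketched with a toy model of a segment in which a delete only marks a row, and the row stays in storage until the segment is rewritten. The Segment class below is hypothetical and says nothing about the actual Milvus storage format.

```python
class Segment:
    """Toy data segment: delete() only marks rows, compact() rewrites live rows."""

    def __init__(self, rows):
        self.rows = dict(rows)   # id -> vector, as held in storage
        self.deleted = set()     # ids marked as deleted but still stored

    def delete(self, ids):
        """Mark ids as deleted without freeing their storage."""
        self.deleted.update(i for i in ids if i in self.rows)

    def stored_count(self):
        """Rows that still occupy space."""
        return len(self.rows)

    def live_count(self):
        """Rows that are still visible to searches."""
        return len(self.rows) - len(self.deleted)

    def compact(self):
        """Rewrite the segment with only the live rows, releasing the space."""
        self.rows = {i: v for i, v in self.rows.items() if i not in self.deleted}
        self.deleted.clear()


segment = Segment({0: [0.1], 1: [0.2], 2: [0.3]})
segment.delete([0, 1])
before = (segment.stored_count(), segment.live_count())
segment.compact()
after = (segment.stored_count(), segment.live_count())
```

In this model, storage use drops only at the compact() step, which mirrors the behavior described above.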

Keywords: Big Data NLP milvus

Added by dibyendrah on Tue, 08 Mar 2022 16:49:49 +0200