A sharp tool for fast weight removal under the background of large amount of data

1, Foreword

When we use mainstream data structures such as Lists, Maps, Sets, Trees, etc., I can get the exact results, whether the data exists or not. Probabilistic data structure can provide a memory based method to quickly find a possible rather than exact result. This probabilistic data structure is the Bloom filter. The Bloom filter checks whether the value is "probably in the set" or "absolutely not in the set".

2.1 definitions

Bloom Filter was proposed by bloom in 1970. It is actually a long binary vector and a series of random mapping functions. Bloom filters can be used to retrieve whether an element is in a collection. Its advantage is that the spatial efficiency and query time are much better than the general algorithm, but its disadvantage is that it has a certain false recognition rate and deletion difficulty.

• binary vector is simply understood as a binary array. The values stored in this array are either 0 or 1. • Mapping function, which can map an element to a point in a Bit array. Therefore, through this point, we can judge whether there is this element in the collection.

2.2 basic ideas

• when an element is added to the set, the element is mapped to k points in a bit group through k hash functions, and they are set to 1. • When retrieving an element, map the element through the k hash functions to see if these positions are all 1, so as to know whether the element in the collection exists or not. If there is any 0 in these positions, the element must not exist; If they are all 1, the inspected element is likely to exist. • Bloom Filter is different from a single hash function mapping. Bloom Filter uses k hash functions, and each element corresponds to k bits. This reduces the probability of conflict.

2.3 characteristics

1) If an element is judged to exist, the element does not necessarily exist, but it must not exist when it is judged not to exist. That is, the bloom filter can only judge whether the data must not exist, but cannot judge whether the data must exist. 2) The bloom filter can add elements, but cannot delete elements. Because deleting elements will increase the misjudgment rate.

2.4 advantages and disadvantages

2.4.1 advantages

1. The binary array occupies less memory, and the insertion and query speed is very fast, with constant level. 2.Hash functions are not necessarily related to each other, which is convenient for parallel implementation by hardware. 3. Only 0 and 1 are stored without storing the element itself. It has advantages in some occasions where confidentiality requirements are very strict.

2.4.2 disadvantages

There is an error rate. As the number of stored elements increases, the miscalculation rate increases. (for example, in reality, whether you encounter normal emails that are also put into the spam directory and normal SMS are intercepted) you can add a small white list to store elements that may be misjudged.

Delete difficulty. An element mapped to k positions in the bit array is 1. It cannot be simply set to 0 when deleted, which may affect the judgment of other elements. Because the mapping of other elements may also be set to 1 in the same position. You can use Counting Bloom Filter to solve this problem.

2, Practical scheme

2.1 Guava implements BloomFilter

First, introduce the jar package of guava,

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>27.0.1-jre</version>
</dependency>

Create the following in Java code

    public static void main(String[] args) {
    // 1. Create a qualified bloom filter
    // Expected data volume 10000, error rate 0.0001
    BloomFilter<CharSequence> bloomFilter =
            BloomFilter.create(Funnels.stringFunnel(
                                    Charset.forName("utf-8")),10000, 0.0001);
    // 2. Add some data
    for (int i = 0; i < 5000; i++) {
        bloomFilter.put("" + i);
    }
    System.out.println("Data writing completed");
    // 3. Test results
    for (int i = 0; i < 10000; i++) {
        if (bloomFilter.mightContain("" + i)) {
            System.out.println(i + "existence");
        } else {
            System.out.println(i + "non-existent");
        }
    }
}

2.2 redistribution implementation scheme

public class RedissonBloomFilterDemo {

    public static void main(String[] args) {

        Config config = new Config();
        config.useSingleServer().setAddress("redis://127.0.0.1:6379");
        RedissonClient redisson = Redisson.create(config);

        RBloomFilter<String> bloomFilter = redisson.getBloomFilter("user");
        // Initialize the bloom filter, the estimated number of statistical elements is 55 million, and the expected error rate is 0.03
        bloomFilter.tryInit(55000000L, 0.03);
        bloomFilter.add("Tom");
        bloomFilter.add("Jack");
        System.out.println(bloomFilter.count());   //2
        System.out.println(bloomFilter.contains("Tom"));  //true
        System.out.println(bloomFilter.contains("Linda"));  //false
    }
}

3, Practical application scenario

Take the mailbox blacklist as an example. When a user registers or logs in, we need to judge whether the user belongs to the 1 billion blacklist mailboxes. If so, we need to intercept the user. Although the business rules are simple, the difficulty lies in the huge amount of data. How to store these 1 billion elements and quickly retrieve them among 10 elements

import com.google.common.base.Preconditions;
import com.google.common.hash.Funnel;
import com.google.common.hash.Hashing;
import org.springframework.data.redis.core.RedisTemplate;

public class RedisBloom<T> {

    private int numHashFunctions;

    private int bitSize;

    private Funnel<T> funnel;

    private RedisTemplate redisTemplate;

    private RedisBloom(){

    }

    /**
     *
     * @param funnel
     * @param expectedInsertions Estimated total number of elements
     * @param fpp Allowable error, for example, 0.0001, an error of one ten thousandth is allowed
     * @param redisTemplate
     */
    public RedisBloom(Funnel<T> funnel, int expectedInsertions, double fpp, RedisTemplate redisTemplate) {
        Preconditions.checkArgument(funnel != null, "funnel Cannot be empty");
        this.funnel = funnel;
        bitSize = optimalNumOfBits(expectedInsertions, fpp);
        numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, bitSize);
        this.redisTemplate = redisTemplate;
    }

    private int[] murmurHashOffset(T value) {
        int[] offset = new int[numHashFunctions];

        long hash64 = Hashing.murmur3_128().hashObject(value, funnel).asLong();
        int hash1 = (int) hash64;
        int hash2 = (int) (hash64 >>> 32);
        for (int i = 1; i <= numHashFunctions; i++) {
            int nextHash = hash1 + i * hash2;
            if (nextHash < 0) {
                nextHash = ~nextHash;
            }
            offset[i - 1] = nextHash % bitSize;
        }

        return offset;
    }

    /**
     * Calculate bit array length
     */
    private int optimalNumOfBits(long n, double p) {
        if (p == 0) {
            p = Double.MIN_VALUE;
        }
        return (int) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /**
     * Calculate the number of hash method executions
     */
    private int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }


    /**
     * Adds a value based on the given bloom filter
     */
    public void addByBloomFilter(String key, T value) {
        int[] offset = this.murmurHashOffset(value);
        for (int i : offset) {
            redisTemplate.opsForValue().setBit(key, i, true);
        }
    }

    /**
     * Judge whether the value exists according to the given bloom filter
     */
    public  boolean includeByBloomFilter(String key, T value) {
        int[] offset = murmurHashOffset(value);
        for (int i : offset) {
            if (!redisTemplate.opsForValue().getBit(key, i)) {
                return false;
            }
        }
        return true;
    }
}

RedisBloom is the key to the implementation of functions, including the core algorithm for computing bitmap. In fact, most of the code comes from the BloomFilterStrategies class in Guava library. However, because this class is specially used for Guava's BloomFilter class, it does not expose some important algorithm logic. Let's see how to use BloomFilter and addByBloomFilter together with redis. Add the element includeByBloomFilter to redis and check whether the element is in redis bloomFilter.

Keywords: Algorithm data structure

Added by LuiePL on Wed, 05 Jan 2022 21:44:15 +0200

Programming VIP