Determine whether the IP is in a large ip set (identify domestic IP)

Original address: https://www.ikaze.cn/article/65

The new requirements need to show different languages through ip. Because there are many ip addresses, the dictionary is not applicable. Here are some methods.

1. Through ip location database

Famous service providers include: ipip (paid) maxmind (paid), pure (free).

However, in this application scenario, we do not need specific location information. Similar schemes will waste unnecessary memory, so we give up.

2. Utilize ip continuity

The latter two methods have a premise: most of the ip address lists are continuous.

Here we have a list of domestic ip addresses (there are open source libraries, which are easy to find. In addition, the library I use has merged ip addresses into CIDR format).

We first convert the IP into a number that can be directly compared through binary, and then change the continuous IP into such a set (start_ip, end_ip), so that we can quickly find it by dichotomy.

import ipcalc

class ChinaIp:
    def __init__(self):
        self.data = []

    def load(self, cidr_file='data/china_ip_list.txt'):
        with open(cidr_file, 'r')as f:
            for s in f.readlines():
                self.add(s.strip())

    def add(self, cidr):
        n = ipcalc.Network(cidr)
        self.data.append((n.host_first().ip, n.host_last().ip))

    def search(self, ip):

        l = 0
        r = len(self.data) - 1
        while l <= r:
            mid = (l + r) // 2
            if self.data[mid][0] <= ip <= self.data[mid][1]:
                return True
            elif self.data[mid][0] > ip:
                r = mid - 1
            elif self.data[mid][1] < ip:
                l = mid + 1
            else:
                return False
        return False

    def __contains__(self, item):
        ip = ipcalc.IP(item).ip
        return self.search(ip)
        
china_ip = ChinaIp()
china_ip.load()
print('223.70.163.83' in china_ip)

3. Use the characteristics of CIDR

CIDR ^ is an address in the form of ^ x.x.x/n ^ which represents a group of ip addresses with the same network address, where n represents the first n bits as the network address.  

According to the characteristics of CIDR, we can draw the conclusion that the network address of the ip under the same CIDR is the same.  

Therefore, we can take out the network addresses of all domestic cidr addresses and put them in the dictionary; For an ip, try the possible network address (i.e. n) to see if it is in the dictionary.

import ipcalc

class ChinaIp(object):
    def __init__(self):
        self.data = {}


    def load(self, cidr_files='data/china_ip_list.txt'):
        with open(cidr_files, 'r')as f:
            cidr_list = f.readlines()

        for cidr in cidr_list:
            self.insert(cidr.strip())

    def insert(self, cidr):
        network = ipcalc.Network(cidr)
        self.data[str(network.netmask())]=True

    def search(self, netmask: str) -> bool:

        for i in netmask.split('.'):
            node = node.children.get(i, None)
            if node is None:
                return False
            if node.is_end:
                return True
        return False

    def __contains__(self, ip):
        for i in range(1,33):
            if self.search(str(ipcalc.Network(f'{ip}/{i}').netmask())):
                return True
        return False
        
china_ip = ChinaIp()
china_ip.load()
print('223.70.163.83' in china_ip)

This algorithm seems to have no problem, but in the actual test, the speed is much slower than the second one. The time-consuming place must cycle all n in the comparison, and the dichotomy can quickly eliminate the impossible part.

In this case, there are two optimization methods:

1. List of random n

class ChinaIp(object):

    ...
    
    def __contains__(self, ip):
        l = list(range(1, 33))
        random.shuffle(l)
        for i in l:
            if self.search(str(ipcalc.Network(f'{ip}/{i}').netmask())):
                return True
        return False

This method reduces the test time by more than half.

2. Exclude n that will not appear

class ChinaIp(object):
    def __init__(self):
        ...
        self.mask_list = []
    ...

    def insert(self, cidr):
        network = ipcalc.Network(cidr)
        self.data[str(network.netmask())] = True
        self.mask_set.add(network.mask)

    def __contains__(self, ip):

        for i in self.mask_set:
            netmask = str(ipcalc.Network(f'{ip}/{i}').netmask())
            if netmask in self.data:
                return True
        return False

In this way, the optimized speed is the same as that of the second one, but in practical application, it is also necessary to judge which one to use according to the ip list.

Keywords: ip

Added by kripz on Wed, 26 Jan 2022 15:37:19 +0200