Application of hash -- bitmap

bitmap

Bitmap concept

Give 4 billion non repeating unsigned integers, which have not been sorted. Given an unsigned integer, how to quickly judge whether a number is in these 4 billion numbers. [Tencent]

Idea 1: traversal, time complexity O ( N ) O(N) O(N)

Idea 2: sorting O ( N l o g N ) O(NlogN) O(NlogN), using binary search: O ( l o g N ) O(logN) O(logN)

But the problem is, 4 billion unsigned integers, 16G. Under great pressure.

Idea 3: put in set or unordered_set, and then find.

The problem is that the space cannot be supported. There is additional space for one node of the red black tree, and the space required for hashing is larger.

The correct idea is to solve it with bitmap.

Map a position with a bit. Just mark whether a value is in or not.

So for 4 billion, open 2 32 2^{32} 232 bits can be saved. (because unsigned integer size 2 32 − 1 2^{32}-1 232−1)

And the occupied space is 2 32 / 8 = 512 M 2^{32}/8=512M 232/8=512M.

Therefore, setting the size of bitset can be said to be set to (4294967295u) or directly (- 1).

Whether the data is in the given shaping data, and the result is in or out. There are just two states. Then a binary bit can be used to represent the information of whether the data exists. If the binary bit is 1, it means existence, and 0 means nonexistence.

Each bit can be used as a bool to determine whether a number exists or not.

Implementation of bitmap

#pragma once
namespace Y
{
	template<size_t N>/*Non type template parameters define variable sizes*/
	class BitSet
	{
	public:
		BitSet
		{
			_bits.resize( N/32 + 1 ,0 ); /*Round up*/
		}

		/*Mark the bit of x mapping as 1*/
		void Set(size_t x)
		{
			assert(x < N);
			/*Calculate the number of integers of the bits of the x mapping*/
			size_t i = x / 32;
			/*Calculate the number of bits of the x mapping in this integer*/
			size_t j = x % 32;

			//_ The j-th bit of bits[i] is marked as 1 and does not affect its other bits
			_bits[i] |= (1 << j);

		}
		/*Mark the bit of x mapping as 0*/
		void Reset(size_t x)
		{
			assert(x < N);
			size_t i = x / 32;
			size_t j = x % 32;
			_bits[i] &= (~(1 << j));
		}
		/*Is x 1 or 0*/
		bool Test(size_t x)
		{
			size_t i = x / 32;
			size_t j = x % 32;
			/*If the j-th bit is 1 and the result is not 0, it returns true*/
			/*If bit j is 0 and the result is 0, false is returned*/
			return _bits[i] & (1 << j);
		}
	private:
		vector<int> _bits;
	};

	void TestBitSet()
	{
		//BitSet<4294967295u> bs;
		BitSet<-1> bs;
	}
}

Related applications

  1. Given 10 billion integers, design algorithms to find integers that appear only once
  2. For two files, there are 10 billion integers respectively. We only have 1G memory. How to find the intersection of two files
  3. Bitmap application deformation: a file has 10 billion ints and 1G memory. The design algorithm finds all integers that occur no more than 2 times
  4. Data block bitmap (datamap) and index block bitmap (inodemap) in the file system.
  • For two files, there are 10 billion integers respectively. We only have 1G memory. How to find the intersection of two files.

Integers can only be 2 32 2^{32} 232, so 10 billion must be repeated.

Scheme 1: map all integer tags in the first file to a bitmap. Then read all integers in another file to determine whether they are in the bitmap. In is the number in the intersection, not in is not.

Scheme 2: from the cited example, we can get a bitmap of 512M, successively read all integer marks of the first file and map them to bitmap 1, then read all integer marks of the second file and map them to bitmap 2, and then combine the two bitmaps. (in turn with the integer in the bitmap), and the integer mapped with the bit of 1 after completion is the intersection.

  • Given 10 billion integers, design algorithms to find integers that appear only once

The first idea:

If the flag bit is in or out, one bit is OK.

Then, the following states of an integer marked:

0 times, 1 time, 2 times or more. 00, 01 and 10 respectively. It can be marked with only 2 bit s.

Therefore, the original 32 bits of a bitmap can be transformed into 2 bits to mark a value / 32 /32 /32 become / 16 /16 /16 is enough.

Check two bits during Set. If it is 00, it becomes 01. If it is 01, it becomes 10. If it is 10, it remains unchanged.

The second idea:

Do not transform the original bitmap. Open two bitmaps. Finally, traverse the two bitmaps. Count all the integers marked with 01 status.

Bitset<-1> _bs1,_bs2;
void Set(size_t x)
{
    //00 -> 01
    if(!_bs1.Test(x) && !_bs2.Test(x))
    {
        _bs2.set(x);
    }
    else if(!_bs1.Test(x) && _bs2.Test(x)) //01->10
    {
        _bs1.Set(x);
        _bs2.ReSet(x);
    }
    else if( _bs1.Test(x) && !_bs2.Test(x))// 10->>10
    {
        //Do not handle
    }
    else{
        assert(false);
    }
}
  • Bitmap application deformation: a file has 10 billion ints and 1G memory. The design algorithm finds all integers that occur no more than 2 times

Similar to the previous question. For two bitmaps, those that appear 3 times or more are designed as 11.

Finally, traverse the two bitmaps. Count all the integers marked with 01 and 10 status.

Bitset<-1> _bs1,_bs2;
void Set(size_t x)
{
    //00 -> 01
    if(!_bs1.Test(x) && !_bs2.Test(x))
    {
        _bs2.set(x);
    }
    else if(!_bs1.Test(x) && _bs2.Test(x)) //01->10
    {
        _bs1.Set(x);
        _bs2.ReSet(x);
    }
    else if( _bs1.Test(x) && !_bs2.Test(x))// 10->>10
    {
        _bs2.Set(x);
    }
    else{
        //No treatment;
    }
}

Therefore, in case of multiple, consider how many bits can be combined.

Keywords: C++ Algorithm data structure

Added by phphunger on Sat, 29 Jan 2022 20:58:50 +0200