Solution to the problem of massive data duplication checking

1. Four ways to deal with massive data problems

Divide and conquer
- Basically, the divide and conquer idea can solve the problem of dealing with massive data, but generally it will not be the optimal solution, but it can be used as a baseline, and the sub problem can be gradually optimized to achieve a better solution. The traditional merge sort is the divide and conquer idea, which involves a large number of files that cannot be loaded into memory, sorting and other problems, which can be solved by this method.
- Applicable scenario: a large amount of data cannot be loaded into memory

There is a file with a large number of integers, 5 billion integers and a memory limit of 400 M，Find the duplicate elements in the file and the number of repetitions.
1G=1024x1024x1024=10 7374 1824 About 1 billion
50 The memory occupied by 100 million integers is about= 50/10 G *4 (An integer four bytes) = 20G
 The idea of divide and conquer method: large files are divided into small files, so that each small file can be loaded into memory, find the corresponding repeated elements, and write the results to a file storing repeated elements.
So the number of small files = 20G/400M About 52 small files:
data0.txt
.....
data51.txt

Facilitate the elements of large files, and put each element into the small file corresponding to the sequence number according to the hash mapping function :data %52 = file_index

Hash
- Personally, I think Hash is the most rough way, but it is rough and efficient. The only disadvantage is that it consumes memory and needs to load all data into memory.
- Applicable scenario: fast search. The total amount of data required can be put into memory

int main()
{
	/* 
	Suppose that the original data to be checked is put in this vector
	In order to make the program run faster and produce results, the amount of data is reduced here
	*/
	vector<int> vec;
	for (int i = 0; i < 100000; ++i)
	{
		vec.push_back(rand());
	}

	// The hash table is used to solve the duplicate checking. Because only duplicate checking is used, the unordered set is used to solve the problem
	unordered_set<int> hashSet;
	for (int val : vec)
	{
		// Find val in hash table
		auto it = hashSet.find(val);
		if (it != hashSet.end())
		{
			cout << *it << "Is the first duplicate data" << endl;
			break; // If you want to find all the repeated numbers, you don't need to break here
		}
		else
		{
			// Can't find
			hashSet.insert(val);
		}
	}

	return 0;
}

Bit (bit set or BitMap)
- Bitmap method is to use a bit (0 or 1) to store the state of data. It is more suitable for problem scenarios with simple state, large amount of data and low memory utilization. To solve the problem with the bitmap method, you first need to know the maximum value in the data to be processed, and then open up an array of char type according to the size = (maxNumber / 8) (byte)+1. When you need to find out whether an element exists in the bitmap, you first need to calculate the bits in the array corresponding to the number, and then read the value. 0 indicates that it does not exist and 1 indicates that it already exists. See the specific application in the following questions.
  
  The bitmap method has a big disadvantage, that is, there is not much data, but the maximum value is very large. For example, there are 10 integers, and the maximum value is 1 billion, so you have to calculate the size of the bitmap array according to the number of 1 billion, which is a waste of memory space
- Applicable scenario: fast data search and duplicate judgment can be carried out
- Skill link: performance misunderstanding of Bloom filter

#include <iostream>
#include <vector>
#include <unordered_set>
using namespace std;
int main()
{
	/* 
	Suppose that the original data to be checked is put in this vector
	In order to make the program run faster and produce results, the amount of data is reduced here
	*/
	vector<int> vec;
	for (int i = 0; i < 100000; ++i)
	{
		vec.push_back(rand());
	}

	// Solve problems with bitmap method
	typedef unsigned int uint;
	uint maxNumber = 1000000000;
	int size = maxNumber / 8 + 1;
	char *p = new char[size]();

	for (uint i = 0; i < vec.size(); ++i)
	{
		// Calculates the array subscript that an integer should place
		int index = vec[i] / 8; 
		// Calculate the bit of the corresponding byte
		int offset = vec[i] % 8;
		// Get the value of the corresponding bit
		int v = p[index] & (1 << offset);
		if (0 != v)
		{
			cout << vec[i] << "Is the first duplicate data" << endl;
			break; // If you want to find all the repeated numbers, you don't need to break here
		}
		else
		{
			// It means that the data does not exist. Setting the corresponding position to 1 means that the data is recorded
			p[index] = p[index] | (1 << offset);
		}
	}
	delete[]p;
	return 0;
}

Heap
- Heap sorting is a general solution to the TopN problem, which can meet most of the problems of finding the best value. Readers need to master the basic operation and idea of heap.
- Applicable scenario: deal with the problem of TopN (maximum or minimum) in massive data. It is required that N is small, so that the heap can be put into memory
- Skill link: sorting algorithm Heap sorting

About heap: Heap sorting algorithm_ LIJIWEI0611 blog - CSDN blog

top k problem
top k problems can be roughly divided into two categories:
1. In a group of data, find the top k with the largest value or the top k with the smallest value
2. In a set of data, find the largest number k or the smallest number K.

Small root pile and large root pile
If the data with large top k is found in the small root heap, and the data with small top k is found in the large root heap, such problems can be well solved by using the heap structure. Taking the first 10 data in a group of data as an example, the idea is to create a small root heap structure (compare the smallest element in the current heap every time. If it is larger than the smallest, delete the minimum value and add the value to the queue, so that the last 10 are the largest 10), and then read the 10 values into the heap, Then traverse the remaining elements and compare them with the heap top elements in turn. If they are larger than the heap top elements, delete the heap top elements and add the current elements to the small root heap. After the element traversal is completed, the remaining 10 elements in the heap are the 10 elements with the largest value.

In C++STL, container adapter priority_queue is a large root heap by default. You can change the template type to get a small root heap, which is often used. The example code is as follows:

#include <iostream>
#include <queue>
#include <vector>
#include <functional>
using namespace std;
int main()
{
  /*
  Find the first 10 numbers with the largest element value in the vector container
  */
  vector<int> vec;
  for (int i = 0; i < 100000; ++i)
  {
    vec.push_back(rand() + i);
  }

  // Define small root heap
  priority_queue<int, vector<int>, greater<int>> minHeap;
  // First put 10 elements into the small root heap
  int k = 0;
  for (; k < 10; ++k)
  {
    minHeap.push(vec[k]);
  }

  /*
  Traverse the remaining elements and compare them with the top element in turn. If they are larger than the top element,
  Then delete the top element, add the current element to the small root heap, and the element traversal is completed,
  The remaining 10 elements in the heap are the top 10 elements
  */
  for (; k < vec.size(); ++k)
  {
    if (vec[k] > minHeap.top())
    {
      minHeap.pop();
      minHeap.push(vec[k]);
    }
  }

  // Print results
  while (!minHeap.empty())
  {
    cout << minHeap.top() << " ";
    minHeap.pop();
  }
  cout << endl;

  return 0;
}

Then, the data with small top k is the same as the above principle. The difference is that a large root heap is used, and when comparing the elements with the elements at the top of the heap, it is necessary to judge that they are less than and then replace them (because small elements are to be found, large value elements are to be eliminated).

If you are looking for the k-th element or the k-th element, the processing method is the same as the above code, but only the top element of the heap can be read in the end, because such a problem only finds an element that meets the conditions

Fast row partition function

Introduction to Algorithms Introduction to algorithm: selection algorithm with expectation of linear time_ LIJIWEI0611 blog - CSDN blog
In the fast sorting division function, a cardinality will be selected, the numbers less than the cardinality will be adjusted to the left, and the numbers greater than the cardinality will be adjusted to the right. Finally, the position of the cardinality is the number with the smallest m. if we find the number with the smallest k, the situation is as follows:

1. When k == m, it means that the smallest number k we are looking for has been found
2. When k > m, we need to recursively perform the above operation on the number sequence on the right of the radix until the condition in step 1 is true
3. When k < m, we need to recursively perform the above operation on the number sequence on the left of the cardinality until the condition in step 1 is true

Therefore, when solving the k-th largest number or the k-th smallest number, you can also use the fast sorting division function to solve recursively. The code example is as follows

#include <iostream>
#include <vector>
using namespace std;

/*
Fast sorting partition function, select the element of arr[i] as the cardinality, and put the element less than arr[i]
Adjust to the left, adjust the elements greater than arr[i] to the right, and return the subscript of the cardinality position
*/
int partation(vector<int> &arr, int i, int j)
{
	int k = arr[i];
	while (i < j)
	{
		while (i < j && arr[j] >= k)
			j--;
		if (i < j)
			arr[i++] = arr[j];

		while (i < j && arr[i] < k)
			i++;
		if (i < j)
			arr[j--] = arr[i];
	}
	arr[i] = k;
	return i;
}
/*
params:
1.vector<int> &arr: Container for storing elements
2.int i:Starting subscript of data range
3.int j:End subscript of data range
4.int k:Element k
 Function Description: recursively solve the k-th smallest number through the fast sorting segmentation function and return its value
*/
int selectNoK(vector<int> &arr, int i, int j, int k)
{
	int pos = partation(arr, i, j);
	if (pos == k-1)
		return arr[pos];
	else if (pos < k-1)
		return selectNoK(arr, pos + 1, j, k);
	else
		return selectNoK(arr, i, pos-1, k);
}
int main()
{
	/*
	Find the element value of the 10th smallest element in the vector container
	*/
	vector<int> vec;
	for (int i = 0; i < 100000; ++i)
	{
		vec.push_back(rand() + i);
	}
	
	// selectNoK returns the value of the 10th smallest element
	cout << selectNoK(vec, 0, vec.size()-1, 10) << endl;
	return 0;
}

Comprehensive application of duplicate checking and top k problem

If the problem is to find the top 10 digits with the largest number of repetitions in a group of numbers, the problem is to make hash Statistics (duplicate check operation), and then find the top k problem according to the hash statistics results. The following code example demonstrates how to quickly find the top 10 digits with the largest number of repetitions in a group of data. The code is as follows

#include <iostream>
#include <vector>
#include <queue>
#include <unordered_map>
#include <functional>
using namespace std;
// In a set of numbers, find the top 10 with the most repetitions
int main()
{
	// Use vec to store the number to be processed
	vector<int> vec;
	for (int i = 0; i < 200000; ++i)
	{
		vec.push_back(rand());
	}

	// Count the number of repetitions of all numbers. key: number value, value: number of repetitions
	unordered_map<int, int> numMap;
	for (int val : vec)
	{
		/* Find the val number in the map. If val does not exist, numMap[val] will insert a [val, 0]
		Such a return value, and then + +, get a set of new data such as [val, 1]
		If val exists, numMap[val] just returns the number of second repetitions corresponding to the val number, which is directly++*/
		numMap[val]++;
	}

	// First define a small root heap
	using P = pair<int, int>;
	using FUNC = function<bool(P&, P&)>;
	using MinHeap = priority_queue<P, vector<P>, FUNC>;
	MinHeap minheap([](auto &a, auto &b)->bool {
		return a.second > b.second; // Customize how small root heap elements are compared in size
	});

	// Stack k data first
	int k = 0;
	auto it = numMap.begin();

	// First, read 10 data from the map table to the small root heap, and establish the small root heap of top 10. The smallest element is at the top of the heap
	for (; it != numMap.end() && k < 10; ++it, ++k)
	{
		minheap.push(*it);
	}

	// Traverse the elements from K+1 to the end and compare them with the elements at the top of the heap
	for (; it != numMap.end(); ++it)
	{
		// If the number of repetitions of the current element in the map table is greater than the number of repetitions of the heap top element, it is replaced
		if (it->second > minheap.top().second)
		{
			minheap.pop();
			minheap.push(*it);
		}
	}
	// What is left in the heap is the top k with the largest number of repetitions
	while (!minheap.empty())
	{
		auto &pair = minheap.top();
		cout << pair.first << " : " << pair.second << endl;
		minheap.pop();
	}
	return 0;
}

If the memory size is limited in the problem, for example, there are 2 billion integers, the memory limit is 400M, and the top 10 numbers with the highest number of requests for de duplication are analyzed. 2 billion integers, about 8G in size, certainly cannot be loaded into the memory at one time, then the idea of divide and conquer can be used at this time, Divide the 2 billion integers in the file into 50 small files through hash mapping, then each file has about 40 million integers and the size is about 150M. At this time, the numbers of small files can be loaded into memory in one line, and then solve and merge the final results in sections to obtain the top 10 numbers with the highest number of repetitions. The code is shown as follows:

When memory is limited, the top 10 integers calculated by hash mapping + hash statistics + small root heap

#define _CRT_SECURE_NO_WARNINGS
#include <iostream>
#include <vector>
#include <queue>
#include <unordered_map>
#include <functional>
using namespace std;
// Large files are divided into small files (hash mapping) + hash statistics + small root heap (fast scheduling can also achieve the same time complexity)
int main()
{
  /*In order to quickly view the results, the amount of data is reduced here*/
  FILE *pf1 = fopen("data.dat", "wb");
  for (int i = 0; i < 20000; ++i)
  {
    int data = rand();
    if (data < 0)
      cout << data << endl;
    fwrite(&data, 4, 1, pf1);
  }
  fclose(pf1);


  // Open the original file where the data is stored
  FILE *pf = fopen("data.dat", "rb");
  if (pf == nullptr)
    return 0;

  // Due to the reduction of the amount of original data, the number of files divided here is also reduced, including 11 small files
  const int FILE_NO = 11;
  FILE *pfile[FILE_NO] = { nullptr };
  for (int i = 0; i < FILE_NO; ++i)
  {
    char filename[20];
    sprintf(filename, "data%d.dat", i + 1);
    pfile[i] = fopen(filename, "wb+");
  }

  // Hash mapping maps the data in large files to small files
  int data;
  while (fread(&data, 4, 1, pf) > 0)
  {
    int findex = data % FILE_NO;
    fwrite(&data, 4, 1, pfile[findex]);
    cout << "data:" << data << " file:" << findex << endl;
  }

  // Define a linked hash table
  unordered_map<int, int> numMap;
  // First define a small root heap
  using P = pair<int, int>;
  using FUNC = function<bool(P&, P&)>;
  using MinHeap = priority_queue<P, vector<P>, FUNC>;
  MinHeap minheap([](auto &a, auto &b)->bool {
    return a.second > b.second; // Customize how small root heap element sizes are compared
    });

  // Solve the top 10 numbers of small files in sections, and find the final result
  for (int i = 0; i < FILE_NO; ++i)
  {
    // Recover the file pointer of the small file to the starting position
    fseek(pfile[i], 0, SEEK_SET);

    while (fread(&data, 4, 1, pfile[i]) > 0)
    {
      numMap[data]++;
    }

    int k = 0;
    auto it = numMap.begin();

    // If the heap is empty, add 10 data to the heap first
    if (minheap.empty())
    {
      // First, read 10 data from the map table to the small root heap, and establish the small root heap of top 10. The smallest element is at the top of the heap
      for (; it != numMap.end() && k < 10; ++it, ++k)
      {
        minheap.push(*it);
      }
    }

    // Traverse the elements from K+1 to the end and compare them with the elements at the top of the heap
    for (; it != numMap.end(); ++it)
    {
      // If the number of repetitions of the current element in the map table is greater than the number of repetitions of the heap top element, it is replaced
      if (it->second > minheap.top().second)
      {
        minheap.pop();
        minheap.push(*it);
      }
    }

    // Clear the hash table for data statistics of the next small file
    numMap.clear();
  }

  // What is left in the heap is the top k with the largest number of repetitions
  while (!minheap.empty())
  {
    auto &pair = minheap.top();
    cout << pair.first << " : " << pair.second << endl;
    minheap.pop();
  }

  return 0;
}

Keywords: Algorithm data structure

Added by Jayson on Sun, 12 Dec 2021 17:03:19 +0200

Programming VIP

Solution to the problem of massive data duplication checking

1. Four ways to deal with massive data problems

Divide and conquer

Hash

Bit (bit set or BitMap)

Heap

Fast row partition function

Comprehensive application of duplicate checking and top k problem

Popular Keywords