[classic topics] classics in Classics -- TopK problem

Problem introduction

Please find the minimum / maximum number of k in a pile of data.

The title description is very simple. How many ideas do you have to realize it?

 
 

Solution 1 - simple sorting

First of all, you can think of a very simple idea: sort the data from small to large, and take the first k numbers.

Don't go into details, just look at the code:

class Solution {
    public int[] getLeastNumbers(int[] arr, int k) {
    	// sort
        Arrays.sort(arr);
        // Take the first k number
        int[] res = new int[k];
        for (int i = 0; i < k; i++) {
            res[i] = arr[i];
        }
        return res;
    }
}

Analysis: time complexity O(nlogn), that is, sorting overhead. It's not hard to think of a problem. We only need to get the minimum K numbers, but we don't require that the K numbers and other n-k numbers should be orderly inside - this idea seems to have redundant sorting operations.

 
 

Solution 2 - small top reactor

Then, we thought of a key data structure - heap.

Don't we just put all the data into the small top heap and pop up k numbers?

class Solution {
    public int[] getLeastNumbers(int[] arr, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        // All data into heap
        for (int n : arr) {
            heap.offer(n);
        }
        // Number of pop-up k
        int[] res = new int[k];
        int index = 0;
        while (index < k) {
            res[index++] = heap.poll();
        }
        return res;
    }
}

Analysis: this idea is actually no different from the simple sorting idea. For each element, the cost of entering and leaving the heap is O(logn), so the time complexity is still O(nlogn). This idea is even worse because it also uses more O(n) heap space.

 
 

Solution 3 - large top reactor

Next, it is the correct use of heap > <!

We maintain a large top heap with a size of K and put the data into the heap in turn. When the size of the heap exceeds K, an extra element will pop up; The pop-up element is the maximum value in the current heap, and it can never be included in the minimum k elements; The k elements in the final heap are the smallest k elements in all data.

class Solution {
    public int[] getLeastNumbers(int[] arr, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>((o1, o2) -> o2 - o1);
        for (int n : arr) {
            heap.offer(n);
            if (heap.size() > k) {
                heap.poll();
            }
        }
        int[] res = new int[k];
        int index = 0;
        while (!heap.isEmpty()) {
            res[index++] = heap.poll();
        }
        return res;
    }
}

Further optimization can be carried out: if the current element is greater than or equal to the element at the top of the heap, it will be skipped directly. Anyway, it will be in and out of the heap.

class Solution {
    public int[] getLeastNumbers(int[] arr, int k) {
        if (arr.length == 0 || k == 0) {
            return new int[0];
        }
        PriorityQueue<Integer> heap = new PriorityQueue<>((o1, o2) -> o2 - o1);
        for (int n : arr) {
            if (heap.size() < k) {
                heap.offer(n);
                continue;
            }
            if (n < heap.peek()) {
                heap.poll();
                heap.offer(n);
            }
        }
        int[] res = new int[k];
        int index = 0;
        while (!heap.isEmpty()) {
            res[index++] = heap.poll();
        }
        return res;
    }
}

Analysis: compared with small top reactor, what is the essence of large top reactor optimization? The size of the heap is fixed to K without loading all n elements. Therefore, the overhead of entering and leaving the heap is reduced to O(logk), and the total time complexity is O(nlogk). In addition, the spatial complexity is also reduced from O(n) to O(k).

 
 

Solution 4 - quick sort

This kind of thinking is very ingenious and requires solid basic knowledge. Let's follow the thinking and experience it.

The requirement of this question is to find the minimum k number in the left and right data.

It is equivalent to: divide the data into two groups. The first group has smaller values and the latter group has larger values, but order is not required within these two groups.

The idea of quick sorting is to divide the data into two groups: the first group is all smaller than the benchmark, and the latter group is all larger than the benchmark, but order is not required within these two groups.

Inspired by this, the following algorithm ideas are obtained:

1) Perform a quick sort on the data, and the final benchmark value (left == right) falls at the subscript position of mid;

2) At this time, the position of the benchmark value is the final position after the overall sorting is completed (this requires you to have a deep understanding of quick sorting);

3) If k == mid, it means that arr[k] is the smallest number k+1, and the first k numbers are the smallest K numbers;

4) If K < mid, it means that the number k+1 is in the left array, and then the left array is recursive

5) If k > mid, it means that the k+1 smallest number is in the right array, and then recurse the right array

Look at the code:

class Solution {
    public int[] getLeastNumbers(int[] arr, int k) {
        if (arr.length == k) {
            return arr;
        }
        return quickSort(arr, 0, arr.length - 1, k);
    }

    private int[] quickSort(int[] arr, int L, int R, int k) {
        int left = L;
        int right = R;
        int temp = arr[left];
        while (left < right) {
            while (left < right && arr[right] >= temp) {
                right--;
            }
            arr[left] = arr[right];
            while (left < right && arr[left] <= temp) {
                left++;
            }
            arr[right] = arr[left];
        }
        arr[left] = temp;
        if (k < left) {
            return quickSort(arr, L, left - 1, k);
        }
        if (k > left) {
            return quickSort(arr, left + 1, R, k);
        }
        return Arrays.copyOf(arr, k);
    }
}

Analysis: each time, it will compare the subscript position and k of the benchmark, and recurse on this basis. Each time, the part to be sorted will be halved. The time complexity O(n) can be obtained by summing an isometric sequence. The spatial complexity is the recursion depth O(logn).

 
 

Analysis and summary

1) Usage scenario of quick sort idea: divide the data into two parts according to a certain feature, one in the front and the other in the back, but the order is not considered inside the two parts.

2) Using the idea of heap, time complexity O(nlogk), using the idea of quick sort, time complexity O(n).

3) Is the idea of quick sort better than that of heap? This is true in terms of time complexity, but the idea of quick sort has spatial limitations: heap can process a large amount of data in the form of stream, while quick sort requires all data to be stored first; When the memory is not enough, the heap is the optimal solution to the TopK problem.

 
 
 
 
 
 
 
 
 
 
 
 

E N D END END

Keywords: Algorithm Interview quick sort

Added by jeny on Thu, 30 Dec 2021 17:49:56 +0200