LeetCode 295. Find Median from Data Stream

This article is part of the "conquering LeetCode" series, which officially began on August 12, 2021. Because some questions on LeetCode are locked, this series will last at least until the date when all unlocked questions are brushed; As LeetCode continues to create new questions, the end date of this series may be forever. In this series of problem brushing articles, I will not only explain a variety of problem-solving ideas and their optimization, but also use a variety of programming languages to solve the problem. When it comes to the general solution, I will summarize the corresponding algorithm template.

In order to facilitate running, debugging and sharing code files on PC, I have also established relevant warehouses: https://github.com/memcpy0/LeetCode-Conquest . In this warehouse, you can not only see LeetCode's original title link, Title Solution code, title solution article link, similar title induction, general solution summary, etc., but also see important information such as the occurrence frequency of the original title and related enterprises. If there are other preferred solutions, they can also be shared with others.

As the content of this series of articles may be updated at any time, you are welcome to pay attention and collect Conquer LeetCode series article directory A memo.

The median is the middle value in an ordered integer list. If the size of the list is even, there is no middle value and the median is the mean of the two middle values.

  • For example, for arr = [2,3,4], the median is 3.
  • For example, for arr = [2,3], the median is (2 + 3) / 2 = 2.5.

Implement the MedianFinder class:

  • MedianFinder() initializes the MedianFinder object.
  • void addNum(int num) adds the integer num from the data stream to the data structure.
  • double findMedian() returns the median of all elements so far. Answers within 10-5 of the actual answer will be accepted.

Example 1:

Input
["MedianFinder", "addNum", "addNum", "findMedian", "addNum", "findMedian"]
[[], [1], [2], [], [3], []]
Output
[null, null, null, 1.5, null, 2.0]

Explanation
MedianFinder medianFinder = new MedianFinder();
medianFinder.addNum(1);    // arr = [1]
medianFinder.addNum(2);    // arr = [1, 2]
medianFinder.findMedian(); // return 1.5 (i.e., (1 + 2) / 2)
medianFinder.addNum(3);    // arr[1, 2, 3]
medianFinder.findMedian(); // return 2.0

Constraints:

  • -105 <= num <= 105
  • There will be at least one element in the data structure before calling findMedian.
  • At most 5 * 104 calls will be made to addNum and findMedian.

Follow up:

  • If all integer numbers from the stream are in the range [0, 100], how would you optimize your solution?
  • If 99% of all integer numbers from the stream are in the range [0, 100], how would you optimize your solution?

The median is the number in the middle of a sequence table. If the list length is even, the median is the average of the middle two numbers. Design a data structure that supports the following two operations:

  • void addNum(int num) - adds an integer from the data stream to the data structure.
  • double findMedian() - returns the median of all current elements.

Advanced:

  • If all integers in the data stream are in the range of 0 to 100, how will you optimize your algorithm?
  • If 99% of the integers in the data stream are in the range of 0 to 100, how will you optimize your algorithm?

Solution reactor

This is a classic data structure problem. The maximum number of calls is 5 * 104, so use O ( n log ⁡ n ) O(n\log n) O(nlogn) level algorithm. Specifically, two priority queues (heaps) are used to maintain the entire data flow. The heap for maintaining the data on the left half of the data flow is left and the heap for maintaining the data on the right half of the data flow is right. Therefore, it is obvious that if left is the large root heap and right is the small root heap, the top elements of the two heaps are candidates for the current median, which can be used O ( 1 ) O(1) O(1) complexity gets the current median:

  • When the data stream length is odd, make left one more element than right. At this time, the dynamic median is the heap top element of left;
  • When the data stream length is even, the left and right sizes are the same, and the dynamic median is the average of the top elements of the two heaps

In order to ensure the order between the two heaps (so as to meet the order of the element sequence) and the size relationship (so as to obtain the median), we should deal with it in odd and even cases when addNum:

  • The size of the left and right heaps before insertion is the same, which means that the length of the data stream before insertion is even, and the length after insertion becomes odd. At this time, it is expected that there will be one more element in the left heap than in the right heap after insertion, and the double heap will remain in order. Therefore, further classification discussion:
    • If they are all empty heaps (you can only check whether right is empty), it means that the first element is inserted and left is added directly;
    • If right is not empty and num < = right. Top(), it indicates that the insertion position of num will not be in the right of the second half, but also directly add left;
    • If right is not empty and num > right. Top(), it indicates that the insertion position of num is in the right of the second half. At this time, pop up the right heap top element and put it into the left, and then put num into right.
  • The sizes of the left and right heaps are different before insertion, that is, there is one more element in the left heap, which means that the length of the data stream before insertion is odd and the length after insertion becomes even. At this time, it is expected that the number of elements in the two heaps after insertion is equal and the double heap remains orderly. Therefore, further classification discussion:
    • If num > = left. Top(), it indicates that the insertion position of num is not in the left of the first half, and right is added directly;
    • If num < left. Top(), it indicates that the insertion position of num is in the left of the first half. At this time, pop up the left heap top element and put it into right, and then put num into left.

In the following code, the time complexity of the addNum method is O ( log ⁡ n ) O(\log n) O(logn), the time complexity of findmedium method is O ( 1 ) O(1) O(1), the overall spatial complexity is O ( n ) O(n) O(n) :

/C++ version
class MedianFinder {
private:
    priority_queue<int, vector<int>, less<int>> left; //Big root pile
    priority_queue<int, vector<int>, greater<int>> right; //Small root pile
public:
    /** initialize your data structure here. */
    MedianFinder() {}
    /* left Maintain the left half of the sequence and right maintain the right half
    ** 0 <= left.size() - right.size() <= 1
    ** When the left and right stacks are equal in size, add a new number. The data stream length is an odd number. Take the median and take left.top()
    **   - When both are empty (the right heap is empty), it is directly added to the left heap
    **   - When none is empty:
    **       - If num < = right. Top(), it should be put into the left heap
    **       - If num > right.top(), you should take out right.top() and put it into the left heap, and put num into the right heap
    ** When the size difference between the left and right stacks is 1 (left > right), add a new number, and the data stream len gt h is even,
    ** Take the median (left.top()+right.top())/2.0
    **   - If num > = left. Top(), it should be put into the right heap
    **   - If num < left.top(), take out left.top() and put it into the right heap, and put num into the left heap
    */
    void addNum(int num) {
        if (left.size() == right.size()) {
            if (right.empty() || num <= right.top()) left.push(num);
            else {
                left.push(right.top()); right.pop();
                right.push(num);
            }
        } else {
            if (num >= left.top()) right.push(num);
            else {
                right.push(left.top()); left.pop();
                left.push(num);
            }
        }
    }
    double findMedian() {
        return left.size() == right.size() ? 
                ((left.top() + right.top()) / 2.0) : left.top();
    }
};
//Execution time: 284 ms, beating 58.06% of users in all C + + submissions
//Memory consumption: 114.3 MB, beating 41.31% of users in all C + + submissions
//Jave version
class MedianFinder {
    private PriorityQueue<Integer> left = new PriorityQueue<>((a, b) -> b - a);
    private PriorityQueue<Integer> right = new PriorityQueue<>((a, b) -> a - b);
    /** initialize your data structure here. */
    public MedianFinder() {}
    
    public void addNum(int num) {
        if (left.size() == right.size()) {
            if (right.isEmpty() || num <= right.peek()) {
                left.add(num);
            } else {
                left.add(right.poll());
                right.add(num);
            } 
        } else {
            if (num >= left.peek()) {
                right.add(num);
            } else {
                right.add(left.poll());
                left.add(num);
            }
        }
    }
    
    public double findMedian() {
        if (left.size() == right.size()) return (left.peek() + right.peek()) / 2.0;
        else return left.peek();
    }
}
//Execution time: 91 ms, beating 63.61% of users in all Java submissions
//Memory consumption: 58.5 MB, beating 74.80% of users in all Java submissions

Advanced

If all integers in the data stream are in the range of 0 to 100, how will you optimize your algorithm?

Specifically, a bucket sequence with a length of 101 is established, the occurrence times of each integer are counted respectively, and the total length of the data stream is recorded. Each time you look for the median, first calculate its number, and then scan all buckets from front to back to get the answer.

This approach does not have much optimization in time, because the addNum method in the double heap solution is O ( log ⁡ n ) O(\log n) The O(logn) and findmedium methods are O ( 1 ) O(1) O(1), addNum method in bucket counting solution is O ( 1 ) O(1) O(1). The findmedium method is O ( 101 ) O(101) O(101) . Limit call 5 × 1 0 4 5 \times 10^4 five × Under 104 times, the amount of addNum calculation of double heap solution is still lower than that of bucket count findMedian. If the influence of constants is taken into account, the time gap between the two will not be large, but more space optimization, from O ( n ) O(n) O(n) decreases to O ( 101 ) O(101) O(101) .

If 99% of the integers in the data stream are in the range of 0 to 100, how will you optimize your algorithm?

On the basis of the solution of the previous question 1 % 1\% 1% of the data is solved by sentinel mechanism, that is, an ordered sequence is maintained on both sides of the smallest bucket and the largest bucket, representing negative infinity and positive infinity. The actual code is as follows:

//Java version
class MedianFinder {
    int[] arr = new int[101];
    int a, b, c; //a. C is the number of elements in the sentry bucket, and b is the number of elements in the [0100] range
    TreeMap<Integer, Integer> h = new TreeMap<>(), t = new TreeMap<>();
    public void addNum(int num) {
        if (num >= 0 && num <= 100) { //99% data of advanced questions 1 and 2
            ++arr[num];
            ++b; //count
        } else if (num < 0) { //Add to the leftmost sentinel bucket
            h.put(num, h.getOrDefault(num, 0) + 1);
            ++a; //count
        } else if (num > 100) { //Add to the far right sentinel bucket
            t.put(num, t.getOrDefault(num, 0) + 1);
            ++c; //count
        }
    }
    public double findMedian() {
        int size = a + b + c; //Total length of data stream
        if (size % 2 == 0) return (find(size / 2) + find(size / 2 + 1)) / 2.0;
        return find(size / 2 + 1) * 1.0;
    }
    private int find(int n) { //Find the element at position n
        if (n <= a) {
            for (int num : h.keySet()) {
                n -= h.get(num);
                if (n <= 0) return num;
            }
        } else if (n <= a + b) { 
            n -= a;
            for (int i = 0; i <= 100; ++i) { //Search method of advanced first question
                n -= arr[i];
                if (n <= 0) return i;
            }
        } else {
            n -= a + b;
            for (int num : t.keySet()) {
                n -= t.get(num);
                if (n <= 0) return num;
            }
        }
        return -1; //never visited
    }
}

Since most of the real data are not distributed in the range of [0, 100], they will TLE and can only pass 17 / 21 test cases.

Keywords: Algorithm leetcode

Added by bapan on Fri, 03 Sep 2021 23:11:17 +0300