TADM2E 4.15


Part A

It helps to know what you can't do first. The (2n-3) algorithm is a simple scan through the array with max1 and max2 variables, comparing each newly encountered element against max2 and possibly max1. The worst case is:

  • 1 compare to add the first two elements
  • 2 compares for each of the remaining n-2 elements

This gives a total of 2n-3 compares in the worst case, though on randomized data you would expect an average of about n + log2(n) compares. As per the problem description, this worst-case performance knocks the linear scan out of the picture.

Similarly, we could heapify the entire array, which is guaranteed to take at most 2n compares. The first element of the resulting array is the largest, and the second largest will be either the second or third element, so one more compare yields it, for a total of <= 2n+1. On randomized data heapify typically finishes somewhat below its bound, but its worst case is no better than the linear scan's, so this is still an unacceptable solution.
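A minimal sketch of this approach (function and helper names are ours), building a max-heap in place with Floyd's bottom-up heapify:

def second_largest_via_heapify(a):
    # Assumes len(a) >= 3. Bottom-up heapify costs <= 2n compares; the
    # second largest must then be one of the root's two children.
    n = len(a)

    def sift_down(i):
        while True:
            left, right, largest = 2 * i + 1, 2 * i + 2, i
            if left < n and a[left] > a[largest]:
                largest = left
            if right < n and a[right] > a[largest]:
                largest = right
            if largest == i:
                return
            a[i], a[largest] = a[largest], a[i]
            i = largest

    for i in range(n // 2 - 1, -1, -1):
        sift_down(i)
    return a[1] if a[1] > a[2] else a[2]  # one final compare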

One correct answer is to use the tournament algorithm to find the 2nd largest element. In the tournament algorithm we pair up the elements and find the winner of each pair, then recursively do the same with the winners, like a knockout football tournament. The second largest key must be one of the keys that lost directly to the winner, since it loses exactly once, and only to a larger key. The winner plays ceil(log2(n)) matches, so finding the best of its direct losers takes ceil(log2(n)) - 1 further compares, for a total of n + ceil(log2(n)) - 2. While not strictly a heap, the structure of the tournament solution is somewhat heaplike and likely what Skiena had in mind for a valid solution.
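A minimal sketch of the tournament (function name and pairing details are ours; assumes at least two keys):

def second_largest_tournament(keys):
    # Each surviving player carries the list of values it beat directly.
    players = [(k, []) for k in keys]
    while len(players) > 1:
        next_round = []
        for i in range(0, len(players) - 1, 2):  # one compare per match: n - 1 in total
            (a, a_beat), (b, b_beat) = players[i], players[i + 1]
            if a > b:
                next_round.append((a, a_beat + [b]))
            else:
                next_round.append((b, b_beat + [a]))
        if len(players) % 2:  # odd player out advances on a bye
            next_round.append(players[-1])
        players = next_round
    winner, losers = players[0]
    # The winner played at most ceil(log2(n)) matches, so this max() costs
    # at most ceil(log2(n)) - 1 further compares.
    return max(losers)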

Alternatively, we could split the set of keys into two equal halves and find the largest key in each of them (2 * (n/2 - 1) comparisons). Then we compare the two maximums (1 comparison) and repeat the largest-key search in the half with the larger maximum, this time excluding said maximum (n/2 - 2 comparisons). After that we once again compare the candidates from both halves (1 comparison), and the larger of them is the second largest key in the whole set. This approach always requires 3n/2 - 2 comparisons, which is better than the linear scan's worst case of 2n - 3 comparisons, but worse than the linear scan's average of n + log2(n) comparisons.
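A sketch of this two-halves approach (names are ours; assumes an even number of distinct keys, n >= 4):

def second_largest_split(keys):
    n = len(keys)
    lo, hi = keys[:n // 2], keys[n // 2:]
    max_lo, max_hi = max(lo), max(hi)  # 2 * (n/2 - 1) compares
    if max_lo > max_hi:  # 1 compare
        winner_half, best, other_max = lo, max_lo, max_hi
    else:
        winner_half, best, other_max = hi, max_hi, max_lo
    runner_up = max(k for k in winner_half if k != best)  # n/2 - 2 compares
    return runner_up if runner_up > other_max else other_max  # 1 compare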


An instrumented implementation of the linear scan, verifying that it performs at most 2*n - 3 comparisons:

import random

selection_size = 750  # n
limit = 2 * selection_size - 3  # 2*n - 3
t_lst = random.sample(range(1, 10000), selection_size)  # Choosing n random positive numbers
max_1, max_2, comparisons_amount = 0, 0, 0

for element in t_lst:  # 0 + 1 + 2*(n - 2) = 2*n - 3 comparisons in the worst case
    if max_1 == 0:  # max_1 not yet set (inputs are positive, so 0 is a safe sentinel), 0 comparisons
        max_1 = element
        continue

    if max_2 == 0:  # max_1 initialized, max_2 not initialized, 1 comparison
        if max_1 < element:
            max_1, max_2 = element, max_1
        else:
            max_2 = element

        comparisons_amount += 1
        continue

    if element > max_1:  # both maxes initialized; 2 comparisons counted as a safe upper bound
        if max_1 > max_2:
            max_2 = max_1

        max_1 = element
    elif element > max_2:
        max_2 = element

    comparisons_amount += 2

assert comparisons_amount <= limit

--Bkarpov96 (talk) 17:39, 19 June 2020 (UTC)

Part B

One possibility for the third largest key is to heapify, for <= 2n compares. The largest key will be the first array element. The second largest key will be either array element 2 or 3, requiring one compare to determine. The third largest key must be a child of the first or second largest in the heap, so (if the second largest was array element 2) it is among array elements 3, 4, and 5. Picking the largest of these three candidates requires two additional compares, bringing the total compare count to <= 2n+3.
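A sketch of that candidate selection (names are ours), given an array already arranged as a max-heap, e.g. by the heapify sketch in Part A:

def third_largest_in_max_heap(a):
    # Assumes len(a) >= 3; indices here are 0-based, so the root's
    # children sit at 1 and 2 rather than 2 and 3.
    second_i = 1 if a[1] > a[2] else 2  # 1 compare locates the second largest
    sibling = 3 - second_i  # the root's other child
    kids = (2 * second_i + 1, 2 * second_i + 2)  # children of the second largest
    candidates = [a[sibling]] + [a[i] for i in kids if i < len(a)]
    return max(candidates)  # at most 2 more compares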

There is also a tournament solution. The first and second largest are found in n + ceil(log2(n)) - 2 compares as above, while the third largest must have lost directly to either the first or the second, so it can be picked as the best remaining loser to either of them.
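Extending the Part A tournament sketch (names are ours; assumes at least three distinct keys), we can record every player's direct victims and pick the third largest from those lists:

def top_three_tournament(keys):
    beat = {k: [] for k in keys}  # value -> values it beat directly
    alive = list(keys)
    while len(alive) > 1:
        nxt = []
        for i in range(0, len(alive) - 1, 2):
            a, b = alive[i], alive[i + 1]
            w, l = (a, b) if a > b else (b, a)
            beat[w].append(l)
            nxt.append(w)
        if len(alive) % 2:
            nxt.append(alive[-1])
        alive = nxt
    first = alive[0]
    second = max(beat[first])
    # The third largest lost exactly once, to a larger key: to first or second.
    return first, second, max([x for x in beat[first] if x != second] + beat[second])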

Tradeoffs

While the tournament has the best worst-case compare count, it comes at the expense of additional storage space for the wins and losses. The heapify option, while having a larger compare count, is in-place, and in a real implementation this may make it worth using.

The trivial 'scan and compare' approach is by far the most likely one for any real piece of software to use. Its average compare count on random data is similar to that of heapify, but without the complex logic and nested loops. Further, it does not reorder the data or keep additional storage around, and it has excellent cache locality.