TADM2E 4.45

From Algorithm Wiki
Jump to: navigation, search


We can see from inspection that looking at pairwise combinations of indices tells us nothing. So we know 3 indices (one from each list) must be looked at in combination. We can also observe that the correct answer may involve indices at any position in the three lists (e.g., the solution may involve the final entry in each list).

Let's call the three lists of positions/indices A, B, and C.

To begin, consider an abstract triplet <Ai, Bj, Ck> where Ai < Bj < Ck. For example, <1, 3, 5>.

Incrementing j to j+1 can only result in an equally good (e.g., <1, 4, 5>) or worse (<1, 6, 5>) solution.

Similarly, incrementing k to k+1 can only result in a worse solution (e.g., <1, 3, 7>).

Therefore, incrementing i to i+1 provides the only possibility of finding a better solution (e.g., <2, 3, 5>), but may also produce a worse solution (<8, 3, 5>).

So, if we start with the triplet formed by the first element in each of the three lists, calculate the word span, and then loop by incrementing the index of the word position that occurs earliest in the text, we can proceed orderly though the lists of word positions, ignoring only combinations that are guaranteed not to produce a better solution.

Note that this approach also allows us to terminate the search early in the case where the triplet element representing the word with earliest occurrence in text (i.e., that which we would next increment) is the final entry in one of the word position lists (as we have shown above that all remaining unexplored combinations must have a greater span).

Applying this algorithm to the example given in the text results in combinations being tested in the following order:

<1, 3, 2>: 1-3*
<4, 3, 2>: 2-4
<4, 3, 6>: 3-6
<4, 9, 6>: 4-9
<5, 9, 6>: 6-9 

The algorithm terminates here, as 5 is the lowest value and the word1 index list has no more elements).

The answer is the span 1-3, which we find by examining only 5 possible combinations.

The algorithm executes in linear time, O(n) = O(|A| + |B| + |C|), and can be easily extended to an arbitrary number of words.

Alternative Solution

This solution is using the same principle to the one above but is less efficient because of the initial merge.

Let's use the example given, for words say A, B, C, in the problem but not get too tied to the specific values. It helps to think about sorting the search string words by their indexes in the document:

A  C  B  A  A  C  *  *  B  B  *  *  *  *  C
1  2  3  4  5  6  7  8  9  10 11 12 13 14 15

In the code example this is just a call to sort but since each word index list is sorted we could use MergeSort which means this portion of the code is $ \mathcal{O}(n log(k)) $ for $ n $ indexes and $ k $ lists.

Once we have the list we push each element into a hash with a separate key for each word and update our estimate of the snippet span that includes all the words.

Here's python code for one simple implementation.

import sys
def smallest_snippet(*args):
    args -- K lists of word positions for K words
    >>> smallest_snippet([1,10], [2,20], [3,30])
    [1, 3]
    >>> smallest_snippet([1,9,27], [6,10,19], [8,12,14])
    [8, 10]
    >>> smallest_snippet([1,4,11,27], [3,6,10,19], [5,8,12,14])
    [3, 5]
    >>> smallest_snippet([1,4,5], [3,9,10], [2,6,15])
    [1, 3]
    master = []
    for i in range(len(args)):
        master.extend(map(lambda j: (i,j), args[i])) # ith word, jth index
    master.sort(lambda x,y: cmp(x[1], y[1]))         # TODO: mergesort
    tops = {}                                        # { word i: index j }
    best = [master[0][1], master[-1][1]]
    minspan = best[-1] - best[0] + 1
    # update span after each new index tuple
    for (i,j) in master:
        tops[i] = j
        if len(tops) == len(args):
            curr = [min(tops.values()), max(tops.values())]
            if curr[1] - curr[0] < minspan:
                minspan = curr[1] - curr[0]
                best = curr
    return best
if __name__ == "__main__":
    import doctest