# Difference between revisions of "TADM2E 4.45"

m (fix typo) |
|||

Line 54: | Line 54: | ||

sys.exit() | sys.exit() | ||

</source> | </source> | ||

+ | |||

+ | ---- | ||

+ | |||

+ | |||

+ | Solution 2: | ||

+ | |||

+ | We can see from inspection that looking at pairwise combinations of indices tells us nothing. So we know 3 indices (one from each list) must be looked at in combination. We can also observe that the correct answer may involve indices at any position in the three lists (e.g., the solution may involve the final entry in each list). | ||

+ | |||

+ | Let's call the three lists of positions/indices A, B, and C. | ||

+ | |||

+ | To begin, consider an abstract triplet <A<sub>i</sub>, B<sub>j</sub>, C<sub>k</sub>> where A<sub>i</sub> < B<sub>j</sub> < C<sub>k</sub>. For example, <1, 3, 5>. | ||

+ | |||

+ | Incrementing j to j+1 can only result in an equally good (e.g., <1, 4, 5>) or worse (<1, 6, 5>) solution. | ||

+ | |||

+ | Similarly, incrementing k to k+1 can only result in a worse solution (e.g., <1, 3, 7>). | ||

+ | |||

+ | Therefore, incrementing i to i+1 provides the only possibility of finding a better solution (e.g., <2, 3, 5>), but may also produce a worse solution (<8, 3, 5>). | ||

+ | |||

+ | So, if we start with the triplet formed by the first element in each of the three lists, calculate the word span, and then loop by incrementing the index of the word position that occurs earliest in the text, we can proceed orderly though the lists of word positions, ignoring only combinations that are guaranteed not to produce a better solution. | ||

+ | |||

+ | Note that this approach also allows us to terminate the search early in the case where the triplet element representing the word with earliest occurrence in text (i.e., that which we would next increment) is the final entry in one of the word position lists (as we have shown above that all remaining unexplored combinations must have a greater span). | ||

+ | |||

+ | Applying this algorithm to the example given in the text results in combinations being tested in the following order: | ||

+ | |||

+ | <1, 3, 2>: 1-3<sup>*</sup> | ||

+ | <4, 3, 2>: 2-4 | ||

+ | <4, 3, 6>: 3-6 | ||

+ | <4, 9, 6>: 4-9 | ||

+ | <5, 9, 6>: 6-9 | ||

+ | |||

+ | The algorithm terminates here, as 5 is the lowest value and the word1 index list has no more elements). | ||

+ | |||

+ | The answer is the span 1-3, which we find by examining only 5 possible combinations. | ||

+ | |||

+ | The algorithm executes in linear time, O(n) = O(|A| + |B| + |C|), and can be easily extended to an arbitrary number of words. |

## Revision as of 14:16, 15 July 2016

Let's use the example given, for words say *A, B, C*, in the problem but not get too tied to the specific values. It helps to think about sorting the search string words by their indexes in the document:

A C B A A C * * B B * * * * C 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

In the code example this is just a call to sort but since each word index list is sorted we could use `MergeSort`

which means this portion of the code is $ \mathcal{O}(n) $ for $ n $ indexes.

Once we have the list we push each element into a hash with a separate key for each word and update our estimate of the snippet *span* that includes all the words.

Here's python code for one simple implementation.

import sys def smallest_snippet(*args): ''' args -- K lists of word positions for K words >>> smallest_snippet([1,10], [2,20], [3,30]) [1, 3] >>> smallest_snippet([1,9,27], [6,10,19], [8,12,14]) [8, 10] >>> smallest_snippet([1,4,11,27], [3,6,10,19], [5,8,12,14]) [3, 5] >>> smallest_snippet([1,4,5], [3,9,10], [2,6,15]) [1, 3] ''' master = [] for i in range(len(args)): master.extend(map(lambda j: (i,j), args[i])) # ith word, jth index master.sort(lambda x,y: cmp(x[1], y[1])) # TODO: mergesort tops = {} # { word i: index j } best = [master[0][1], master[-1][1]] minspan = best[-1] - best[0] + 1 # update span after each new index tuple for (i,j) in master: tops[i] = j if len(tops) == len(args): curr = [min(tops.values()), max(tops.values())] if curr[1] - curr[0] < minspan: minspan = curr[1] - curr[0] best = curr return best if __name__ == "__main__": import doctest doctest.testmod() sys.exit()

Solution 2:

We can see from inspection that looking at pairwise combinations of indices tells us nothing. So we know 3 indices (one from each list) must be looked at in combination. We can also observe that the correct answer may involve indices at any position in the three lists (e.g., the solution may involve the final entry in each list).

Let's call the three lists of positions/indices A, B, and C.

To begin, consider an abstract triplet <A_{i}, B_{j}, C_{k}> where A_{i} < B_{j} < C_{k}. For example, <1, 3, 5>.

Incrementing j to j+1 can only result in an equally good (e.g., <1, 4, 5>) or worse (<1, 6, 5>) solution.

Similarly, incrementing k to k+1 can only result in a worse solution (e.g., <1, 3, 7>).

Therefore, incrementing i to i+1 provides the only possibility of finding a better solution (e.g., <2, 3, 5>), but may also produce a worse solution (<8, 3, 5>).

So, if we start with the triplet formed by the first element in each of the three lists, calculate the word span, and then loop by incrementing the index of the word position that occurs earliest in the text, we can proceed orderly though the lists of word positions, ignoring only combinations that are guaranteed not to produce a better solution.

Note that this approach also allows us to terminate the search early in the case where the triplet element representing the word with earliest occurrence in text (i.e., that which we would next increment) is the final entry in one of the word position lists (as we have shown above that all remaining unexplored combinations must have a greater span).

Applying this algorithm to the example given in the text results in combinations being tested in the following order:

<1, 3, 2>: 1-3^{*}<4, 3, 2>: 2-4 <4, 3, 6>: 3-6 <4, 9, 6>: 4-9 <5, 9, 6>: 6-9

The algorithm terminates here, as 5 is the lowest value and the word1 index list has no more elements).

The answer is the span 1-3, which we find by examining only 5 possible combinations.

The algorithm executes in linear time, O(n) = O(|A| + |B| + |C|), and can be easily extended to an arbitrary number of words.