TADM2E 3.29

From Algorithm Wiki
Jump to: navigation, search

Using a Hashtable

Scan the web page word by word and for each one, except the first, take the pair formed by the previous word + the current one. Use that pair as the key to a hashtable, where you store the frequency of that pair. If the entry is not already in the hash table store it with value 1, if it is there increment the current value by one. Keep the highest frequency seen and update it wherever you see a greater one. In the end return the greatest frequency seen.

Complexity, where N = number of words in the text:

Time: O(N) (you only iterate over each word once)

Space: O(N) (Worst case is when every word is distinct and thus you have no repeated pairs, and that would take N storage on the hash table)


Using a Trie

The solution with the hashtable will only work in time $ O(n) $ if the hashtable contains $ m=n $ slots (in general, lookup in a hashtable takes $ O(n/m) $ time). Still, this approach could work well if $ m $ is chosen wisely.

Another approach would be to store the text in a trie and save the number of occurrences in the leaf nodes. The implementation below iterates through the text letter by letter (in $ O(n) $ time), using two word pointers word1 and word2 that point to the position in the trie of the current word and the concatenation of the current with the last word, respectively.

If the text is long and natural, it is expected to contain many repetitions and common prefixes, and an efficient implementation would use relatively little memory. Another advantage is that no parameters have to be pre-defined (such as $ m $ in the example using the hashtable).

class Trie:
 
    def __init__(self, parent=None):
        self.parent = parent
        self.children = {}
        self.n = 0
 
    def add(self, c):
        return self.children.setdefault(c, Trie(self))
 
    def letter(self, child):
        for c in self.children:
            if self.children[c] == child:
                return c
        return None
 
    def spell(self):
        if self.parent:
            return self.parent.spell() + self.parent.letter(self)
        return ''
 
def wrap(text):
    """iterates through text, yielding lowercase letter or single ' '"""
    inword = False
    for c in text:
        c = c.lower()
        if ord(c) >= ord('a') and ord(c) <= ord('z'):
            yield c
            inword = True
        else:
            if inword:
                yield ' '
            inword = False
    if inword:
        yield ' '
 
def pairmax(text):
    """finds the word pair with maximum occurrence"""
    root = word2 = Trie()
    best = word1 = None
    for c in wrap(text):
        if c == ' ':
            if word1:
                word2.n += 1
                if best is None or word2.n > best.n:
                    best = word2
                word2 = word1.add(' ')
                word1 = root
            else:
                word1 = root
                word2 = word2.add(' ')
        else:
            word2 = word2.add(c)
            if word1:
                word1 = word1.add(c)
    return best.spell(), best.n
 
 
if __name__ == '__main__':
    import urllib2, sys
 
    urls = sys.argv[1:]
    if not urls:
        urls = ['http://www.rfc-editor.org/rfc/rfc19%02d.txt' % i for i in range(30, 40)]
 
    for url in urls:
        word, n = pairmax(urllib2.urlopen(url).read())
        sys.stdout.write('%s : %dx "%s"\n' %(url, n, word))