I have a dictionary with each item containing a quote in another dictionary:
{ 'example1': {'quote': 'And it must follow, as the night the day, thou canst not then be false to any man.\n'}, 'example2': {'quote': 'What a piece of work is man! how noble in reason!.\n'} }
I need to completely remove each entry whose quote contains a badword, not just checking if the string contains the badword but if it matches the full word. For instance, following the above example, considering as to be a badword, it should remove example1 but not example2 (that contains reASon).
filter.py
#!/usr/bin/env python2.7 # -*- coding: utf-8 -*- import re def filter_bad_words(entries): f = open("badwords.txt", "r") badwords = f.readlines() # remove new lines from each word for i in range(len(badwords)): badwords[i] = badwords[i].strip('\n') original_count = len(entries) for key, item in entries.items(): quote = item['quote'] if any(findWholeWord(x)(quote) for x in badwords): del entries[key] print "Remove: %s" % quote print "Removed %s items." % (original_count - len(entries)) def findWholeWord(w): """ removes exact word""" return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search def main(): with open("quotes.txt", "r") as f: quotes = f.readlines() entries = {} for key, quote in enumerate(quotes): entry_key = "example{}".format(key) entries[entry_key] = {'quote': quote} filter_bad_words(entries) if __name__ == "__main__": main()
quotes.txt
To be, or not to be: that is the question. Neither a borrower nor a lender be; For loan oft loses both itself and friend, and borrowing dulls the edge of husbandry. This above all: to thine own self be true. Though this be madness, yet there is method in 't. That it should come to this!. There is nothing either good or bad, but thinking makes it so. What a piece of work is man! how noble in reason! how infinite in faculty! in form and moving how express and admirable! in action how like an angel! in apprehension how like a god! the beauty of the world, the paragon of animals! . The lady doth protest too much, methinks. In my mind's eye. A little more than kin, and less than kind. The play 's the thing wherein I'll catch the conscience of the king. And it must follow, as the night the day, thou canst not then be false to any man. Brevity is the soul of wit. Doubt that the sun doth move, doubt truth to be a liar, but never doubt I love. Rich gifts wax poor when givers prove unkind. Do you think I am easier to be played on than a pipe? I will speak daggers to her, but use none. When sorrows come, they come not single spies, but in battalions. All the world 's a stage, and all the men and women merely players. They have their exits and their entrances; And one man in his time plays many parts. Can one desire too much of a good thing?. I like this place and willingly could waste my time in it. How bitter a thing it is to look into happiness through another man's eyes! Blow, blow, thou winter wind! Thou art not so unkind as man's ingratitude. True is it that we have seen better days. For ever and a day. The fool doth think he is wise, but the wise man knows himself to be a fool. Now is the winter of our discontent. A horse! a horse! my kingdom for a horse!. Conscience is but a word that cowards use, devised at first to keep the strong in awe. So wise so young, they say, do never live long. Off with his head! An honest tale speeds best, being plainly told. The king's name is a tower of strength. The world is grown so bad, that wrens make prey where eagles dare not perch. O Romeo, Romeo! wherefore art thou Romeo?. It is the east, and Juliet is the sun. Good Night, Good night! Parting is such sweet sorrow, that I shall say good night till it be morrow. What's in a name? That which we call a rose by any other name would smell as sweet. Wisely and slow; they stumble that run fast. Tempt not a desperate man. For you and I are past our dancing days. O! she doth teach the torches to burn bright. It seems she hangs upon the cheek of night like a rich jewel in an Ethiope's ear. See, how she leans her cheek upon her hand! O that I were a glove upon that hand, that I might touch that cheek!. Not stepping o'er the bounds of modesty.
badwords.txt
people history way art world information map two family government health system computer meat year thanks music person reading method data food understanding theory law bird literature problem software control knowledge power ability economics love internet television science library nature fact product idea temperature investment area society activity story industry media thing oven community definition safety quality
It does its job but I find it extremely slow when dealing with more than 100,000 entries, so I would appreciate suggestions to improve its performance.
(I’ve setup a repo to make it easier for testing.)
Answer
The current time complexity is O(N∗M) where N is the number of quotes and M is the number of bad words. For every single word in a quote you are iterating over all the bad words to check if there is a match.
We can do better than that. What if you would initialize bad words as a set and would just lookup if a word is there – the lookup itself would be constant time – O(1) which would make the overall complexity O(N+M) – we still need O(M) to initially make a set of bad words.
Also, I would use a more appropriate and robust nltk.word_tokenize()
for word tokenization.
Modified code:
from nltk import word_tokenize
def filter_bad_words(entries):
with open("badwords.txt", "r") as f:
badwords = set(word.strip() for word in f)
filtered_entries = {}
for key, item in entries.items():
quote = item['quote']
words = word_tokenize(quote)
if not any(word in badwords for word in words):
filtered_entries[key] = {'quote': quote}
print("Removed %s items." % (len(entries) - len(filtered_entries)))
return filtered_entries
def main():
with open("quotes.txt", "r") as f:
entries = {
"example{}".format(index): {'quote': quote}
for index, quote in enumerate(f)
}
print(filter_bad_words(entries))
if __name__ == "__main__":
main()
Attribution
Source : Link , Question Author : marcanuy , Answer Author : Toby Speight