Removing entries from a dictionary containing bad words

I have a dictionary with each item containing a quote in another dictionary:

{
  'example1': {'quote': 'And it must follow, as the night the day, thou canst not then be false to any man.\n'},
  'example2': {'quote': 'What a piece of work is man! how noble in reason!.\n'}
}

I need to completely remove each entry whose quote contains a badword, not just checking if the string contains the badword but if it matches the full word. For instance, following the above example, considering as to be a badword, it should remove example1 but not example2 (that contains reASon).

filter.py

#!/usr/bin/env python2.7
# -*- coding: utf-8 -*-

import re

def filter_bad_words(entries):

    f = open("badwords.txt", "r")
    badwords = f.readlines()

    # remove new lines from each word
    for i in range(len(badwords)):
            badwords[i] = badwords[i].strip('\n')

    original_count = len(entries)
    for key, item in entries.items():
        quote = item['quote']
        if any(findWholeWord(x)(quote) for x in badwords):
            del entries[key]
            print "Remove: %s" % quote
    print "Removed %s items." % (original_count - len(entries))

def findWholeWord(w):
    """ removes exact word"""
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search

def main():
    with open("quotes.txt", "r") as f:
        quotes = f.readlines()
        entries = {}
        for key, quote in enumerate(quotes):
            entry_key = "example{}".format(key)
            entries[entry_key] = {'quote': quote}
        filter_bad_words(entries)

if __name__ == "__main__":
        main()

quotes.txt

To be, or not to be: that is the question.
Neither a borrower nor a lender be; For loan oft loses both itself and friend, and borrowing dulls the edge of husbandry.
This above all: to thine own self be true.
Though this be madness, yet there is method in 't.
That it should come to this!.
There is nothing either good or bad, but thinking makes it so.
What a piece of work is man! how noble in reason! how infinite in faculty! in form and moving how express and admirable! in action how like an angel! in apprehension how like a god! the beauty of the world, the paragon of animals! .
The lady doth protest too much, methinks.
In my mind's eye.
A little more than kin, and less than kind.
The play 's the thing wherein I'll catch the conscience of the king.
And it must follow, as the night the day, thou canst not then be false to any man.
Brevity is the soul of wit.
Doubt that the sun doth move, doubt truth to be a liar, but never doubt I love.
Rich gifts wax poor when givers prove unkind.
Do you think I am easier to be played on than a pipe?
I will speak daggers to her, but use none.
When sorrows come, they come not single spies, but in
battalions.
All the world 's a stage, and all the men and women merely players. They have their exits and their entrances; And one man in his time plays many parts.
Can one desire too much of a good thing?.
I like this place and willingly could waste my time in it.
How bitter a thing it is to look into happiness through another man's eyes!
Blow, blow, thou winter wind! Thou art not so unkind as man's ingratitude.
True is it that we have seen better days.
For ever and a day.
The fool doth think he is wise, but the wise man knows
himself to be a fool.
Now is the winter of our discontent.
A horse! a horse! my kingdom for a horse!.
Conscience is but a word that cowards use, devised at first to keep the strong in awe.
So wise so young, they say, do never live long.
Off with his head!
An honest tale speeds best, being plainly told.
The king's name is a tower of strength.
The world is grown so bad, that wrens make prey where eagles dare not
perch.
O Romeo, Romeo! wherefore art thou Romeo?.
It is the east, and Juliet is the sun.
Good Night, Good night! Parting is such sweet sorrow, that I shall say good night till it be morrow.
What's in a name? That which we call a rose by any other name would smell as sweet.
Wisely and slow; they stumble that run fast.
Tempt not a desperate man.
For you and I are past our dancing days.
O! she doth teach the torches to burn bright.
It seems she hangs upon the cheek of night like a rich jewel in an Ethiope's ear.
See, how she leans her cheek upon her hand! O that I were a glove upon that hand, that I might touch that cheek!.
Not stepping o'er the bounds of modesty.

badwords.txt

people
history
way
art
world
information
map
two
family
government
health
system
computer
meat
year
thanks
music
person
reading
method
data
food
understanding
theory
law
bird
literature
problem
software
control
knowledge
power
ability
economics
love
internet
television
science
library
nature
fact
product
idea
temperature
investment
area
society
activity
story
industry
media
thing
oven
community
definition
safety
quality

It does its job but I find it extremely slow when dealing with more than 100,000 entries, so I would appreciate suggestions to improve its performance.

(I’ve setup a repo to make it easier for testing.)

Answer

The current time complexity is O(NM) where N is the number of quotes and M is the number of bad words. For every single word in a quote you are iterating over all the bad words to check if there is a match.

We can do better than that. What if you would initialize bad words as a set and would just lookup if a word is there – the lookup itself would be constant time – O(1) which would make the overall complexity O(N+M) – we still need O(M) to initially make a set of bad words.

Also, I would use a more appropriate and robust nltk.word_tokenize() for word tokenization.

Modified code:

from nltk import word_tokenize


def filter_bad_words(entries):
    with open("badwords.txt", "r") as f:
        badwords = set(word.strip() for word in f)

    filtered_entries = {}
    for key, item in entries.items():
        quote = item['quote']
        words = word_tokenize(quote)

        if not any(word in badwords for word in words):
            filtered_entries[key] = {'quote': quote}

    print("Removed %s items." % (len(entries) - len(filtered_entries)))
    return filtered_entries


def main():
    with open("quotes.txt", "r") as f:
        entries = {
            "example{}".format(index): {'quote': quote}
            for index, quote in enumerate(f)
        }

        print(filter_bad_words(entries))


if __name__ == "__main__":
    main()

Attribution
Source : Link , Question Author : marcanuy , Answer Author : Toby Speight

Leave a Comment