As seen in this recent XKCD strip and this recent blog post from Peter Norvig (and a Slashdot story featuring the latter), “regex golf” (which might better be called the regular expression separation problem) is the puzzle of defining the shortest possible regular expression that accepts every word in set A and no word in set B. Norvig’s post includes an algorithm for generating a reasonably short candidate, and he notes that his approach involves solving an NP-complete Set Cover problem, but he’s also careful to point out that his approach doesn’t consider every possible regular expression, and of course his isn’t necessarily the only algorithm, so his solutions aren’t guaranteed to be optimal, and it’s also possible that some other assuredly polynomial-time algorithm could find equivalent or better solutions.
For concreteness’ sake and to avoid having to solve the optimization question, I think the most natural formulation of Regular Expression Separation would be:
Given two (finite) sets A and B of strings over some alphabet Σ, is there a regular expression of length ≤k that accepts every string in A and rejects every string in B?
Is anything known about the complexity of this particular separation problem? (Note that since I’ve specified A and B as finite sets of strings, the natural notion of size for the problem is the total lengths of all strings in A and B; this swamps any contribution from k). It seems highly likely to me that it is NP-complete (and in fact, I would expect the reduction to be to some sort of cover problem) but a few searches haven’t turned up anything particularly useful.
Assuming the TCS-variant of regex, the problem is indeed NP-complete.
We assume that our regexes contain
- letters from Σ, matching themselves,
- +, denoting union,
- ⋅, denoting concatenation,
- ∗, denoting Kleene-Star,
- λ, matching the empty string
and nothing else. Length of a regex is defined as the number of characters from Σ. As in the comic strip, we consider a regex to match a word, if it matches a substring of the word. (Changing any of these assumptions should only influence the complexity of the construction below, but not the general result.)
That it is in NP is straightforward, as explained in the comments (verify a candidate-RE by translating it into an NFA and running that on all words from A and B).
In order to show NP-hardness, we reduce Set cover:
Given a universe U and a collection C of subsets of U, is there a set C′⊆C of size ≤k so that ⋃S∈C′S=U?
We translate an input for Set cover into one for regex golf as follows:
- Σ contains one character for each subset in C and one additional character (denoted x in the following).
- A contains one word for each element e of U. The word consists of exactly the characters representing subsets in C that contain e (in arbitrary order).
- B contains the single word x.
- k is simply carried over.
This reduction is obviously in P and equivalence is also quite simple to see:
- If c1,…,ck is a solution for the set cover instance, the regex c1+⋯+ck is a solution to regex golf.
- A regex matching the empty subword would match x. Thus, any regex solving the golf problem has to contain at least one letter from each of the words in A. Hence, if the golf instance is solvable, there is a set of at most k letters from Σ so that each word in A is covered by this set of letters. By construction, the corresponding set of subsets from C is a solution to the set cover instance.