The Scenario
The year is 3000. Aliens have invaded and enslaved nearly the entire human race. Thankfully, you have somehow managed to stay safe and out of sight from your wouldbe alien overlords, and are currently holed up in a small house in a rural part of the country. However, supplies are beginning to wear thin, and before long you know you're going to need to retrieve these 4 different items from the abandoned buildings around you if you want to survive.
You've been hiding out here for a long time, and as such are rather familiar with the area. You know exactly where to find these items, but traveling to the different buildings for supplies is going to be dangerous. While traveling between buildings, you're at risk of being abducted by a roving UFO's tractor beam, vaporized, or worse! Analyzing a map of the area indicates that there are 4 buildings within equal distance from your safe house that combined have all of the resources you need.
Additionally, your current circumstances indicate that you only need one of each item. Retrieving more than one copy of each item is simply not going to increase your chances of survival. With this in mind, which of these buildings should you visit to get all the items you need in the minimum amount of trips? At a glance, it should be easy to tell that you should visit building 1 and building 4 to get the supplies you need, and that the minimum number of trips that you'll need to make to get all of the supplies is 2. However, what if you wanted to gather 100 items from 20 different buildings? Well, you'd need to come up with an algorithm of course! As we'll soon find out, hidden within this relatively silly scenario is a deceptively hard problem known as the Set Cover Search Problem. Before we try and come up with an algorithm however, let's take a step back and see if we can turn this problem into something a bit more generic.
Abstracting The Problem
To start off with, we can think of our items and buildings as a set of subsets \(S = \{S_0, S_1, S_2\ ...\}\), and the union of all subsets in \(S\) as the universal set \(\mathbb{U}\). In mathematical notation:
\[ \mathbb{U} = \bigcup\limits^n _{i=0} S_i \]
where \(n\) is the number of subsets in \(S\).
\[e.g.\]
The universal set \(\mathbb{U}\) of \(\{\{1, 2\},\ \{4, 5\},\ \{2, 3\}\} = \{1, 2, 3, 4, 5\}\)
Next, let's use \(O\) to denote the optimal (or minimum) set cover of any given \(S\). This is in contrast to a suboptimal set cover, which is a set cover solution that contains more than the minimum amount of subsets needed to achieve full set coverage.
\[e.g.\]
The optimal set cover solution \(O\) of \(S\) where \(S = \{\{1, 2\}, \{3, 4\}, \{2, 3\}\}\) is \(\{\{1, 2\}, \{3, 4\}\}\)
Lastly, we're going to use the function \(p(S)\) to denote the power set of \(S\). The power set is defined as the set of all subsets of any given \(S\), including the empty set \(\emptyset\) and \(S\) itself.
\[e.g.\]
\[ p(\{1, 2\}) = \{\{\emptyset\},\ \{1\},\ \{2\},\ \{1, 2\}\} \]
Devising An Algorithm
With these tools at our disposal, lets see if we can come up with an algorithm. To start, one can readily deduce that if the power set \(p(S)\) contains all the subsets of \(S\), and our optimal solution \(O \subseteq S\), then \(O \in p(S)\). Therefore, every possible set cover in \(S\) (be it optimal or suboptimal) can be found within \(p(S)\) as well. As a result, finding \(O\) is as simple as iterating through all of the sets in \(p(S)\) to find the set \(S\) with the least amount of subsets where \(\bigcup\limits^n _i S_i = \mathbb{U}\)
Ultimately, our solution is going to look something like this:
def optimal_set_cover(S)
P = p(S)
U = universe(S)
O = S
for each set Pi in P
Pi_U = universe(Pi)
if Pi_U == U and len(Pi) < len(O)
O = Pi
return O
Alright! With some psuedocode to guide our path, let's try and implement a working solution in Python.
Implementation
For starters we're going to need to write a function to generate power sets. As generating power sets isn't the focus of this post, we're simply going to use a recipe directly from the python itertools recipe book to do it for us.


>>> [s for s in power_set([1, 2, 3])]
[(), (1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
As you can see from the sample output, this is going to generate a list of tuples instead of sets like we might expect. We could easily modify this function to convert these tuples into sets, but doing so would significantly increase our final run time complexity. Given that our code will be able handle these tuples the same as if they were sets (as they're both iterables), this is an unnecessary step that I have intentionally left out.
We're also going to need a function to generate the universe of any given \(S\). This will work pretty much exactly as described in the previous section of this post; simply iterate through \(S\) (labeled as "sos" in our code) and union each subset together to form the universe \(\mathbb{U}\).


>>> universe([{1, 2, 3}, {4, 5}, {6}])
{1, 2, 3, 4, 5, 6}
With these two functions at our disposal, we now have everything we need to implement our optimal set cover algorithm:


>>> optimal_set_cover([{1, 2}, {2, 3}, {3, 4}])
[{1, 2}, {3, 4}]
Alright, problem solved! Time to go home!
...Right?
Analyzing Our Implementation
Unfortunately, as you may have guessed when I mentioned deriving our answer from the power set of \(S\), this algorithm is very, very slow. So slow that any attempt to generate a solution for more than 50 subsets is going to be realistically impossible for any computer to calculate; and unfortunately, it gets worse. This is about as fast as generating optimal set covers gets. Unfortunately, no algorithm exists that can generate an optimal set cover without having to calculate \(p(S)\) in the worst case. As generating \(p(S)\) has a worst case run time complexity of \(O(2^n)\), all optimal set cover algorithms are doomed to run in exponential time. Because of this, the set cover search problem is said to fall under the class of NPComplete problems, a class of problems that are all extremely difficult to compute.
To put into perspective how incredibly slow an algorithm with a worst case run time complexity of to \(O(2^n)\) really is, lets write some benchmarks. To aid in benchmarking our code, I've written a little helper function generate random sets of subsets.


>>> generate_random_sos(5, 3)
[{4, 5}, {3, 4}, {1, 5}]
Using the timeit library, we were able to generate the following results.
optimal_set_cover(generate_random_sos(10, 10)) # 0.0028 seconds
optimal_set_cover(generate_random_sos(15, 15)) # 0.113 seconds
optimal_set_cover(generate_random_sos(20, 20)) # 5.08 seconds
optimal_set_cover(generate_random_sos(25, 25)) # 251.313 seconds
optimal_set_cover(generate_random_sos(30, 30)) # 6092.618 seconds
All benchmarks were recorded on a 2014 Macbook Pro.
Extrapolating upon this trend, we can estimate that calculating the optimal set cover for any given \(S\) with 50 subsets and 50 unique elements would take over 200 years to compute!
So that's it right? No solution exists that can realistically compute this problem for anything larger than a handful of sets and we're all doomed.
...Right?
Well, not all hope is lost if we're alright with moving the goal posts a little. Instead of trying to find the absolute optimal set cover for any given set of subsets, what if we were okay with a solution that was closeenough? In other words, a suboptimal solution. If so, we're back in business because there just so happens to be an algorithm that can generate a pretty good suboptimal set covers (how good is beyond the scope of this post, but you can read more here if you'd like) that runs in \(O(n^2)\) time!
Taking A Greedy Approach
Key to most approximation algorithms is a solid heuristic, and this problem is no exception! One simple heuristic we could use is simply the size of each subset. After all, the larger the subset, the larger the amount of potentially uncovered elements that might exist in that subset. Using this heuristic, lets try implementing a greedy algorithm that orders the subsets from largest to smallest, and then iteratively unions them together until we've achieved full set coverage.


>>> suboptimal_set_cover_1([{1, 2, 3}, {4, 5}, {3}])
[{1, 2, 3}, {4, 5}] # Perfect!
suboptimal_set_cover_1([{1, 2}, {3, 4}, {2, 3}, {2}])
[{2, 3}, {3, 4}, {1, 2}] # Okay...
>>> suboptimal_set_cover_1([{1, 2, 3}, {2, 3, 4}, {4, 5, 6}])
[{4, 5, 6}, {2, 3, 4}, {1, 2, 3}] # Yikes!
So, our solution isn't perfect, but it's not the worst! Additionally, it's lightning fast compared to our previous algorithm, operating with a worst case run time complexity of \(O(nlog(n))\) (the time complexity of sorting \(S\) by the size of each subset). Notably, our algorithm really suffers when faced with an \(S\) where most of the subsets are of a similar size. In the special case where all subsets are the exact same size, our algorithm completely falls flat! In such a situation, all subsets are weighted equally according to our heuristic and our algorithm is unable make any meaningful decisions.
Fine Tuning Our Heuristic
Fortunately, we can do better than this by slightly changing our heuristic. What if instead of picking the largest subsets first, we instead picked the subsets that contain the largest amount of uncovered elements first? This should help us avoid situations where most of the subsets are of the same size.


>>> suboptimal_set_cover_2([{1, 2, 3}, {4, 5}, {3}])
[{1, 2, 3}, {4, 5}] # Perfect!
suboptimal_set_cover_2([{1, 2}, {3, 4}, {2, 3}, {2}])
[{1, 2}, {3, 4}] # Perfect!
>>> suboptimal_set_cover_2([{2, 3, 4, 5}, {1, 2, 3, 4}, {4, 5, 6}, {1, 2, 3}])
[{2, 3, 4, 5}, {1, 2, 3, 4}, {4, 5, 6}] # Not perfect but I'll take it.
One unfortunate side effect of this implementation is that it does increase the run time complexity to \(O(n^2)\) This is a result of having to recalculate the subset with the largest set difference from the known universe at each iteration. However, this is a fine price to pay for greatly improved performance.
Conclusion
If you've managed to make it all the way to the bottom of this post without skipping over any sections, congratulations! Hopefully it didn't take you as long to read as it did for me to write. This is my first time writing something as technical and mathy as this, so if you spot any issues (and I'm sure there are), please shoot me an email so that I can correct it. If you manage to come up with any other heuristics or significantly faster algorithms used to generate optimal or suboptimal set cover solutions, I'd love to hear about it, and may even include it at the bottom of this post (with your permission of course) for future readers to see.
Lastly, be sure to check back on the 1st of March for what will hopefully be an exciting post on using Shannon Entropy to find secrets that have been accidentally committed to source code control systems (like Github).