A simple algorithm for generating a dataset based on combinations of multiple sets

Use case background

I was looking into putting together a sample dataset related to a range of airline routes and their related information. With that I was aiming to create a table containing an exhaustive set of combinations of an airline's operating origin and destination routes along with all the possible variations related to those O&Ds tickets such as codeshare, language, class options etc...

The content of the generated data set isn't meant to be random data, but should come from a set of options available for each column. For example, the program is fed the options below

Origin code: [DXB, DWC, ABC, KFC]
Destination code: [ABC, BJG, KBJ, DWC]
Class: [First, Business, Economy, General]
Language: [EN, AR, FR]
Tier: [1, 2, 3]
Return: [yes, no]

The resulting data set, would end up having unique rows generated baed on all the possible combinations in one row for all possible sets of input. In the case of airline routes, each row would constitute a representation of a "ticket" with an inventory of ticket options. Example output would look something like this:

[ DXB, ABC, First, EN, 1, yes ],
[ DXB, ABC, Business, EN, 1, yes],
[ DXB, ABC, Business, EN, 2, yes],
[ DXB, ABC, Business, EN, 3, yes],

etc...

For the above example, the number of resulting unique rows would be the multiplication of the length of each column options set. In this case: 4 * 4 * 4 * 3 * 3 * 2 = 1152. For the actual use case, origin And destination code are definitely much more (around) 200 which obviously would result in a much larger dataset.

Options available

For generating datasets related to the above case, there are obviously several libraries that can be included for building that. Most of the libraries out there usually also include the option of generating random data with different data types.

In my case, I have the constraint of limiting the amount of dependencies as much as possible especially that the use case is pretty basic.

The problem is mainly around "multiplying" different set items with each other to generate rows as permutations of all possible combinations for column values. That's usually known as the cartesian product.

Utilizing recursion

Following a recursive approach, we'd reduce the problem down to generating all combinations from 2 sets (denoted by left and right), then bubble that up and apply the result against one more set to generate more combinations against of the rest of the option sets.

Here is the implementation in Javascript, the function below expects a 2 dimensional array as input.

let buildCombo = (input) => {
    if (input.length == 1) {
        return input[0];
    } else {
        let result = [];
        let remainingCombo = buildCombo(input.slice(1));

        for (let leftItem of remainingCombo) {
            for (let rightItem of input[0]) {
                result.push([rightItem].concat(leftItem));
            }
        }

        return result;
    }
}

Example usage:

let contentOptions = [
    ['DXB', 'DWC', 'XYZ', 'KFC', 'BCJ', 'PPP'],
    ['DXB', 'DWC', 'XYZ', 'KFC', 'BCJ', 'PPP'],
    ['AR', 'EN', 'FR'],
    ['First', 'Business', 'Economy'],
    ['yes', 'no'],
    ['yes', 'no']
];

let results = buildCombo(contentOptions);

Cleaning the results

As promised, the above function generates all possible combinations. One caveat though, is that in some cases the data needs to fit the use case constraints. For example, in the dataset I'm trying to generate, it wouldn't make sense to have rows containing an origin airport code equal to a destination airport code because obviously, for any ticket it's impossible for a trip to be from and to itself.

This can be easily solved by applying a filter function on the resulting data to weed out the duplicates, for example:

let cleanup = results.filter(row => row[0] !== row[1]);

From that point onward, it would be easy to export that into CSV after escaping the data.

Hope you find this tip useful, happy to hear your comments on the solution and any suggested improvements!

Sailing through a sea of code

A simple algorithm for generating a dataset based on combinations of multiple sets

Use case background

Options available

Utilizing recursion

Cleaning the results

Topics