Thursday, June 11, 2020

Extensible Database Tester in Python, with Permutations

Here is some Python code that allows you to generate all combinations of options by choosing one item from an arbitrary number of different lists.  Not only that, but each option is actually a data structure that also includes lots of metadata about the ramifications of the option, in addition to the option value itself.

Let's say you're looking to run performance testing on different types of tables and underlying data arrangements in Amazon Redshift Spectrum.  You can devise a template SQL query that can be configurable in multiple ways.  For instance, you could choose between multiple tables or views from which to query data (say, if comparing Redshift to Redshift Spectrum performance), and you also want to see the performance of count(1) vs. count(*) to see if one uses more data or works quicker on Redshift vs. Spectrum.  Thus, you already have two lists of options, and need to come up with the four permutations of these options.

The Groundwork


Basically, to lay some ground rules, each option contains both the value to be placed into the query, as well as a tag that can be appended to a string in order to designate the specific query that was run when looking back at query performance later.  A single list of choices, with expandability for more lists, looks like this:

config.query_options = {
    "table_choice": [
        {"value": "my_redshift_table", "tag": "rst"},
        {"value": "my_spectrum_table", "tag": "spt"}
    ]
}

One can continue adding more lists in that same parent, the configuration option dictionary, as a sibling of table_choice.  It's interesting to have a tag field for each option because in Redshift and Spectrum, there are a series of tables that may exist under the pg_catalog schema.  One of these tables is called stl_query, which actually lists queries run on the Redshift cluster and how long they took to run.  (This assumes you have such logging enabled for the cluster.)  As such, if you want to refer back to this data later, you can concatenate all the tags from the selected options and SELECT it as a column in order to refer to the specific query later.  This could be implemented by means of:

SELECT 'tag-cat1a-cat2b' as tag, other_columns_I_actually_care_about FROM table ETC...

And this could be a mechanism of indicating that option A was picked for category 1, and option B was picked for category 2.  Now, you could search the stl_query field for metrics on the original query, but then you would have to synthesize the whole query once again to search for how it ran given the specific set of options.  But with the tag, now you can easily search for the performance of a query with a given set of options:

SELECT * from pg_catalog.stl_query WHERE querytxt ilike '%tag-cat1a-cat2b%';

While we're at it, let's have a look at what the body of configuration templates looks like.  These templates are what all the options above get stuffed into.

config.queries = {
    "aggregation-query": {
        "params": ["count_type", "table_choice"],
        "query": (lambda params : f"""
            select count({params['count_type']['value']})
            from {params['table_choice']['value']} tbl
            where tbl.evt_yr = '2020'
            and tbl.evt_mo = '6'
            and tbl.evt_day = '11';""")
    }
}

Above, we have defined a single SQL query, to which both configuration lists of count_type and table_choice are important.  (As you will see later, we can pick and choose what lists we want, in case we don't want all combinations.). Then, the actual query itself is a heredoc where the options are filled in by means of a formatted string.  The use of a lambda function allows us to pass in any combination relevant to the query, no matter how many parameters it has (i.e. it doesn't matter if the options were chosen from all the lists or not.)  This allows for code reuse by not forcing a fixed number of parameters like you would have if you were trying to pass parameters directly into a string formatter such as %s or .format().  We can even pass in options with "extras", i.e. more than just value and tag, in case some options have additional ramifications.

Methods to the Madness of Concocting Combinations


There are easy ways to come up with all combinations already with list comprehensions.  However, these are inflexible since the number of lists must be fixed, and it is not easily expandable into other types of data structures like dictionaries.  Or, even if the above isn't true, it starts to look quite complex and becomes difficult to read.

We can rely on the itertools library to help us with making the permutations:

    params = [[{desired_key: option} for option in config.query_options[desired_key]] for desired_key in desired_keys]
    param_combos = list(itertools.product(*params))

The first line creates a dictionary object consisting of, following the example configuration above: {"table_choice": {"value": "my_redshift_table", "tag": "rst"}}

The desired_keys list is something belonging to each configuration template, and designates exactly what parameter lists are important to it.  That is to say, if you were to have multiple groups of choices in the structure at the top of this post, but one group doesn't matter to a particular template (e.g. query), then just omit it from this list.  Anyway, the dictionary object created by the above code snippet is nested in an array of similar dictionary objects for each entry in the table_choice list, and then multiple such arrays exist in a parent array for each subsequent list of configurations, if specified.

If you imagine this boiled down into just literals, you would have [[a, b, c], [1, 2]].  The characters abc1, and 2 represent five different dictionaries.  They are grouped into lists based on what list of options they come from.  Each of these dictionaries looks like this:

{"desired_key": {"tag": "cat1a", "value": "its_value"}}

The dictionaries ab, and c are derived from the first group of options (e.g. the table_choice), thus the desired_key field would actually say table_choice for these items.  The dictionaries 1 and 2 are derived from the second list of options (e.g. the count_type), thus the desired_key field would actually say count_type for these items.

The second line is where the magic happens in creating the iterations.  The product function takes items inside each sub-list and pairs them with each item inside the other sub-lists to form all possible combinations of items from each list (note that order doesn't matter, so these are not permutations).  This function produces a list of tuples, where each tuple contains items from each list.  Again, with just literals (from the array in the previous paragraph), you would get [(a, 1), (a, 2), (b, 1), (b, 2), (c, 1), (c, 2)].  However, our objects are a lot more complex than that, so you can imagine taking the dictionary structure in the previous paragraph and substituting each letter and number in the above list of tuples with that dictionary structure.  It gets pretty lengthy to write out!

Now that we have made all the combinations, we have to consider how to use them to fill in template placeholders in our long string values.  First, let's consider how to make our tag concatenation, i.e. the tag-cat1a-cat2b from above.  To do this, note that each combination made by product is a tuple in a list.  Consider each tuple in the list as the variable param_set, and then use the join function over a list comprehension as such:

suffix = '-'.join([i[list(i)[0]]['tag'] for i in param_set])

Because each dictionary item in the param_set tuple contains one key, and just one key, but it is unknown (it could be either table_choice or count_type in this example), we can abstract that away by transforming the dictionary into a list and asking for its first element.  This is what list(i)[0] does.  Once we have that, we can fetch the tag item inside each dictionary i in the tuple, i of course representing the one selected option from each list making up the particular combination param_set.  Then, the join function puts them all together in one long string.

The next step is to actually fill in the configuration template with the desired values.  Unfortunately, it won't be easy to do this with arbitrary keys cooped up inside tuples, so let's extract the dictionary object from each tuple and merge them together.  We can do this by creating an empty dict, and then iterating over each tuple to update the new dict with the dicts inside all of the tuples.

params = {}
for i in param_set:
    params.update(i)
initial_query = config.queries[key]['query'](params)

The final line calls a lambda function that exists inside a dictionary object listing each parameter template.  For convenience, here is the configuration template again:

config.queries = {
    "aggregation-query": {
        "params": ["count_type", "table_choice"],
        "query": (lambda params : f"""
            select count({params['count_type']['value']})
            from {params['table_choice']['value']} tbl
            where tbl.evt_yr = '2020'
            and tbl.evt_mo = '6'
            and tbl.evt_day = '11';""")
    }
}

In this case, the key is aggregation-query, and the query is the lambda function, and we pass in params as an argument to the lambda function, where params contains the selected option from each option list for this given combination.

The Whole Enchilada


This is what the entire code looks like.  If you're looking for an example or sample in this tutorial, look no further.

def make_params(desired_keys):
    params = [[{desired_key: option} for option in config.query_options[desired_key]] for desired_key in desired_keys]
    param_combos = list(itertools.product(*params))
    return param_combos

for key in config.queries.keys():
    print("Testing key " + key)
    if (config.queries[key]['params'] == []):
        param_combos = [()]
    else:
        param_combos = make_params(config.queries[key]['params'])

    for param_set in param_combos:
        print(param_set)
        suffix = '-'.join([i[list(i)[0]]['tag'] for i in param_set])
        name = key + ("-" + suffix) if (suffix != "") else key
        params = {}
        for i in param_set:
            params.update(i)
        initial_query = config.queries[key]['query'](params)

        identity_string = f"select '{name}' as name, "
        query = initial_query.replace("select ", identity_string, 1)
        cur.execute(query)

Accommodating Templates that Need No Options


"What if the parameter template doesn't have any parameters," you ask?  First off, to illustrate what that would look like, the configuration template for a query with no desired options would look like this:

{
    "no-params-needed": {
        "params": [],
        "query": (lambda params : f"""select 1;""")
    }
}

In this case, before the for loop runs that finds each param_set from within your param_combos (i.e. the params list that is empty), you need to set param_combos to a list with a single empty dict object in it, so it can see that there is one param_set that is empty.  It will construct an empty suffix, and will pass in an empty dictionary into a lambda function that won't even pay attention to the dictionary at all.

What about additional parameters in your options?


Let's say that some of your Rosetta Spectrum tables are based on Spark dataframes that were partitioned in various ways.  Some tables have daily partitions, whereas others are monthly.  As such, the daily tables have event_day as a partition column, whereas the monthly tables don't.  To select on day, you will need to utilize parsing a timestamp column.  Here is what that looks like:

Daily partitioned data: WHERE event_day = '10'
Monthly partitioned data: WHERE DATE_PART('day', tbl.event_timestamp) = 10

To account for the different WHERE clauses you must use in your queries, note these aren't individual options themselves, but they are in fact tied to the selection of a particular option.  As such, you need to specify these extra parameters with the option for which table you are going to select from, knowing whether it's backed by daily or monthly data.

You can define dictionaries for each partition type, as such:

extras_daily = {
    "event_day": "event_day = '10'",
}

extras_monthly = {
    "event_day": "DATE_PART('day', tbl.event_timestamp) = 10",
}

Note that if you're trying to select on a range of dates, you can add another key to both of these dictionaries to write the query to look at a range over event_day or event_timestamp, depending on what is required by the partition type.

In any event, you incorporate this into your options dictionaries as such:

config.query_options = {
    "table_choice": [
       {"value": "daily_table", "tag": "dy_tbl", "extras": extras_daily},
       {"value": "monthly_table", "tag": "mo_tbl", "extras": extras_monthly}
    ]
}

And into your configuration template and Lambda function as such:

config.queries = {
    "aggregation-query": {
        "params": ["count_type", "table_choice"],
        "query": (lambda params : f"""
            select count({params['count_type']['value']})
            from {params['table_choice']['value']} tbl
            where tbl.evt_yr = '2020'
            and tbl.evt_mo = '6'
            and {params['table_choice']['extras']['event_day']};""")
    }
}

And voila!  Now you have a highly flexible, customizable way to create combinations of different options from different lists into a configuration template.