Multi-columns, shortcut strings and subset transformations in DataFrameMacros.jl v0.2

julia
Published

December 28, 2021

DataFrameMacros.jl is a Julia package that makes it easier to manipulate DataFrames, by rewriting code into source-function-sink expressions that conform to DataFrames.jl’s more verbose mini-language. In version v0.2 (and v0.2.1) I have added a couple new features that are powerful, but not immediately obvious. This post takes a closer look at the new functionality.

The new features are multi-column specifiers, shortcut strings for renaming and subset transformations.

Multi-column specifiers

So far, DataFrameMacros.jl only supported statements with single-column specifiers. For example, @select(df, :x + 1) or @combine(df, $column_variable * $2). The expressions :x, $column_variable and $2 all refer to one column each. The underlying source-function-sink expression that DataFrameMacros.jl created was therefore always of the form source => function => sink. For many tasks this is perfectly sufficient, but other times one wants to execute the same function over a set of similar or related columns.

DataFrames.jl has a neat way to run the same function on a set of columns. This is done by using the .=> operator, to broadcast over a set or sets of columns and create an array of source => function => sink expressions. For example, you could compute the sum for each column in a DataFrame with transform(df, names(df, All()) .=> sum), or in the recent v1.3 release even with transform(df, All() .=> sum).

Now, the trick that DataFrameMacros.jl v0.2.1 uses is to change the underlying representation from source => function => sink to source(s) .=> function(s) .=> sink(s). This doesn’t break the existing functionality, because scalars in Julia broadcast just fine, so it’s no problem to say something like combine(df, :x .=> sum .=> "y") - even though broadcasting doesn’t add anything if only scalars participate.

Where it gets interesting is when collections of columns are used. With the change to source(s) .=> function(s) .=> sink(s) you are now free to use column expressions that refer to multiple columns. The only restriction is that the shapes of source(s), function(s) and sink(s) have to be compatible for broadcasting.

There are multiple ways in which you can reference multiple columns at once, and they are closely related to what x can be in the function names(df, x). For example, All(), Between(x, y) and Not(args...) are now recognized directly as multi-column specifiers by DataFrameMacros, without having to mark them with the usual $ sign. Then you can use any Type T marked by $, which selects all columns whose elements are subtypes of T, for example $Real or $String. You can use a regex that selects all columns with matching names, for example $(r"a") for any column with the letter a. Of course it’s also possible to just pass an array of column names, for example $["a", "b"].

Here are a few practical examples:

using DataFrameMacros
using DataFrames

df = DataFrame(
    name = ["alice", "bob", "charlie"],
    age = [20, 31, 42],
    country = ["andorra", "brazil", "croatia"],
    salary = [9999, 6666, 3333],
)

3 rows × 4 columns

name age country salary
String Int64 String Int64
1 alice 20 andorra 9999
2 bob 31 brazil 6666
3 charlie 42 croatia 3333

We can transform both String columns at once and both Int columns at once, by using the Type multi-column specifier.

@select df begin
    uppercasefirst($String)
    Float64($Int)
end

3 rows × 4 columns

name_uppercasefirst country_uppercasefirst age_Float64 salary_Float64
String String Float64 Float64
1 Alice Andorra 20.0 9999.0
2 Bob Brazil 31.0 6666.0
3 Charlie Croatia 42.0 3333.0

We can try out the All() specifier by reversing the element order of each column. We need the @c flag so reverse acts on each column vector and not each column element. This works the same way with the Between and Not selectors.

@select df @c reverse(All())

3 rows × 4 columns

name_reverse age_reverse country_reverse salary_reverse
String Int64 String Int64
1 charlie 42 croatia 3333
2 bob 31 brazil 6666
3 alice 20 andorra 9999

We can combine multi-column specifiers with single-column specifiers, they can always broadcast together because scalars work together with any shape. For example, let’s say we have a column with tax rates and four columns with quarterly gains and we want to compute the quarterly taxes.

df = DataFrame(
    year = [2019, 2020, 2021],
    tax_rate = [0.19, 0.20, 0.21],
    income_q1 = [2000, 3000, 4000],
    income_q2 = [2100, 3100, 4100],
    income_q3 = [2200, 3200, 4200],
    income_q4 = [2300, 3300, 4300],
)

3 rows × 6 columns

year tax_rate income_q1 income_q2 income_q3 income_q4
Int64 Float64 Int64 Int64 Int64 Int64
1 2019 0.19 2000 2100 2200 2300
2 2020 0.2 3000 3100 3200 3300
3 2021 0.21 4000 4100 4200 4300

Then we can simply multiply the tax rate with the four income columns at once, which we select with the Between selector.

@select(df, :tax_rate * Between(3, 6))

3 rows × 4 columns

tax_rate_income_q1_* tax_rate_income_q2_* tax_rate_income_q3_* tax_rate_income_q4_*
Float64 Float64 Float64 Float64
1 380.0 399.0 418.0 437.0
2 600.0 620.0 640.0 660.0
3 840.0 861.0 882.0 903.0

Another option to select the columns would be to use a regex. We have to mark it with $ so that DataFrameMacros knows to treat it as a column specifier.

@select(df, :tax_rate * $(r"income"))

3 rows × 4 columns

tax_rate_income_q1_* tax_rate_income_q2_* tax_rate_income_q3_* tax_rate_income_q4_*
Float64 Float64 Float64 Float64
1 380.0 399.0 418.0 437.0
2 600.0 620.0 640.0 660.0
3 840.0 861.0 882.0 903.0

Now one issue is that the resulting column names are very ugly. We could specify the new names directly as a vector. Remember that the expression is source(s) .=> function(s) .=> sink(s) so we can also broadcast a vector of sinks. The string "taxes_q1" will be the sink associated with the first element from the regex selector, and so on.

@select(df,
    ["taxes_q1", "taxes_q2", "taxes_q3", "taxes_q4"] = :tax_rate * $(r"income"))

3 rows × 4 columns

taxes_q1 taxes_q2 taxes_q3 taxes_q4
Float64 Float64 Float64 Float64
1 380.0 399.0 418.0 437.0
2 600.0 620.0 640.0 660.0
3 840.0 861.0 882.0 903.0

But writing out strings like that is error prone, especially if the order of columns can change. So it would be better to transform the original column names. DataFrames allows to use anonymous functions for this, the input for the function is a vector with all column names used in the expression. We can split off the "q1" part from the second column in each expression (the income column) and prefix with "taxes_":

@select(df, (names -> "taxes_" * split(names[2], "_")[2]) = :tax_rate * $(r"income"))

3 rows × 4 columns

taxes_q1 taxes_q2 taxes_q3 taxes_q4
Float64 Float64 Float64 Float64
1 380.0 399.0 418.0 437.0
2 600.0 620.0 640.0 660.0
3 840.0 861.0 882.0 903.0

Broadcasting with more than one dimension

You are not technically limited to broadcasting one vector of columns with scalar columns, you can even evaluate two- or higher-dimensional grids of column combinations if you like. For example, if you had two different tax rates and three income categories, you could compute all six tax columns with one expression. Here we extract the income columns first so we can make them into a row-vector with permutedims, which will form a 2D grid when broadcasted together with the column vector with the two tax columns.

df = DataFrame(
    year = [2019, 2020, 2021],
    tax_a = [0.19, 0.20, 0.21],
    tax_b = [0.22, 0.23, 0.24],
    income_a = [2000, 3000, 4000],
    income_b = [2100, 3100, 4100],
    income_c = [2200, 3200, 4200],
)

income_cols = permutedims(names(df, r"income"))

@select(df, $(r"tax") * $income_cols)

3 rows × 6 columns (omitted printing of 1 columns)

tax_a_income_a_* tax_b_income_a_* tax_a_income_b_* tax_b_income_b_* tax_a_income_c_*
Float64 Float64 Float64 Float64 Float64
1 380.0 440.0 399.0 462.0 418.0
2 600.0 690.0 620.0 713.0 640.0
3 840.0 960.0 861.0 984.0 882.0

The column names are again not ideal, which brings us to another new feature.

Shortcut strings for renaming

Often, we want to give new columns names that are just simple combinations of column names used to compute them. In the last example, a better name than tax_a_income_a_* could be tax_a_on_income_b.

If DataFrameMacros encounters a string literal as the sink which contains "{}", "{1}" or "{2}" and up, it translates this into a renaming function that pastes the input column names at the respective locations. Here’s the last example again with such a shortcut string:

@select(df, "{1}_on_{2}" = $(r"tax") * $income_cols)

3 rows × 6 columns (omitted printing of 1 columns)

tax_a_on_income_a tax_b_on_income_a tax_a_on_income_b tax_b_on_income_b tax_a_on_income_c
Float64 Float64 Float64 Float64 Float64
1 380.0 440.0 399.0 462.0 418.0
2 600.0 690.0 620.0 713.0 640.0
3 840.0 960.0 861.0 984.0 882.0

Subset transformations

The third new feature goes hand in hand with a new addition in DataFrames v1.3. Now you can call transform! or select! on the view returned by subset(df, some_subset_expression, view = true), and this will mutate the underlying DataFrame only in the selected rows. If new columns are added, all rows outside the subset are filled with missing values.

In base DataFrames, you need to first create a subset view, then mutate it, then continue on with the original DataFrame. Here’s the DataFrame we start with

df = DataFrame(x = 1:4, y = 5:8)

4 rows × 2 columns

x y
Int64 Int64
1 1 5
2 2 6
3 3 7
4 4 8

Now we subset some rows and increment the y values by 10 there. We also create new z values:

subset_view = subset(df, :x => ByRow(>=(3)), view = true)
transform!(
    subset_view,
    :y => ByRow(x -> x + 10) => :y,
    :x => (x -> x * 3) => :z
)
df

4 rows × 3 columns

x y z
Int64 Int64 Int64?
1 1 5 missing
2 2 6 missing
3 3 17 9
4 4 18 12

In DataFrameMacros v0.2, you can now use a more convenient syntax that plays well with Chain.jl or other piping mechanisms, where you only want to use functions that return the DataFrame you work with, not a subset view. You can simply pass a @subset expression to @transform! or @select! after the DataFrame argument. This @subset expression doesn’t take its own DataFrame argument as usual, that’s implied to be the DataFrame that is being transformed. The returned object after mutating the selected rows is the original DataFrame. You can see how much more concise the same operation becomes:

df = DataFrame(x = 1:4, y = 5:8)
@transform!(df, @subset(:x >= 3), :y = :y + 10, :z = 3 * :x)

4 rows × 3 columns

x y z
Int64 Int64 Int64?
1 1 5 missing
2 2 6 missing
3 3 17 9
4 4 18 12

Summary

That concludes the overview of the three new features, multi-column specifiers, shortcut strings for renaming and subset transformations. Especially multi-column specifiers with their implicit broadcasting might need a moment to wrap your head around, but I think you’ll find them very convenient. I hope you enjoy using the new release!