Documentation
DataFrameMacros.DataFrameMacros
— Module
DataFrameMacros offers macros which transform expressions for DataFrames functions that use the source .=> function .=> sink
mini-language. The supported functions are @transform
/@transform!
, @select/@select!
, @groupby
, @combine
, @subset
/@subset!
, @sort
/@sort!
and @unique
.
All macros have signatures of the form:
@macro(df, args...; kwargs...)
Each positional argument in args
is converted to a source .=> function .=> sink
expression for the transformation mini-language of DataFrames. By default, all macros execute the given function by-row; only @combine
executes by-column. There is automatic broadcasting across all column specifiers, so it is possible to directly use multi-column specifiers such as {All()}
, {Not(:x)}
, {r"columnname"}
and {startswith("prefix")}
.
For example, the following pairs of expressions are equivalent:
transform(df, :x .=> ByRow(x -> x + 1) .=> :y)
@transform(df, :y = :x + 1)
select(df, names(df, All()) .=> ByRow(x -> x ^ 2))
@select(df, {All()} ^ 2)
combine(df, :x .=> (x -> sum(x) / 5) .=> :result)
@combine(df, :result = sum(:x) / 5)
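The equivalence of these pairs can be checked directly. The following is a minimal sketch, assuming DataFrames and DataFrameMacros are installed; the three-row frame `df` and the variable names are made up for illustration.

```julia
using DataFrames
using DataFrameMacros

# Hypothetical example data, not taken from the package documentation.
df = DataFrame(x = 1:3)

# By-row transformation: plain DataFrames call vs. macro form.
plain_transform = transform(df, :x .=> ByRow(x -> x + 1) .=> :y)
macro_transform = @transform(df, :y = :x + 1)

# By-column aggregation: @combine hands the whole column to sum.
plain_combine = combine(df, :x .=> (x -> sum(x) / 5) .=> :result)
macro_combine = @combine(df, :result = sum(:x) / 5)

println(plain_transform == macro_transform)
println(plain_combine == macro_combine)
```

Comparing the results with `==` checks both column names and values, so it also confirms that the macro form produces the same sink names as the explicit call.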
Column references
Each positional argument must be of the form [sink =] some_expression
. Columns can be referenced within sink
or some_expression
using a Symbol
, a String
, or an Int
. Any column identifier that is not a Symbol
must be wrapped with {}
. Wrapping with {}
also lets you use variables or expressions that evaluate to column identifiers.
The five expressions in the following code block are equivalent.
using DataFrames
using DataFrameMacros
df = DataFrame(x = 1:3)
@transform(df, :y = :x + 1)
@transform(df, :y = {"x"} + 1)
@transform(df, :y = {1} + 1)
col = :x
@transform(df, :y = {col} + 1)
cols = [:x, :y, :z]
@transform(df, :y = {cols[1]} + 1)
Multi-column references
You can also use multi-column specifiers. For example @select(df, sqrt({Between(2, 4)}))
acts as if the function sqrt
is applied along each column that belongs to the group selected by Between(2, 4)
. Because the source-function-sink complex is connected by broadcasted pairs like source .=> function .=> sink
, you can use multi-column specifiers together with single-column specifiers in the same expression. For example, @select(df, {All()} + :x)
would compute df.some_column + df.x
for each column in the DataFrame df
.
If you use {{}}
, the multi-column expression is not broadcast, but given as a tuple so you can aggregate over it. For example, sum({{All()}})
calculates the sum of all columns once, while sum({All()})
would apply sum
to each column separately.
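The difference between the two bracket forms can be sketched as follows. This is an illustration under assumptions: the frame `df` and the sink name `total` are hypothetical, and the comments describe the by-row default of `@select` as stated in the text above.

```julia
using DataFrames
using DataFrameMacros

# Hypothetical example data.
df = DataFrame(a = 1:3, b = 4:6)

# {All()} broadcasts: sum is applied to each column's values separately,
# so the result has one output column per input column.
per_column = @select(df, sum({All()}))

# {{All()}} passes the values of all selected columns together as a tuple,
# so sum runs once per row across columns a and b.
across = @select(df, :total = sum({{All()}}))

println(ncol(per_column))
println(across.total)
```

Because the double-brace form yields a single result, a single sink name such as `:total` is enough, which is not the case for the broadcast form.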
Sink names in multi-column scenarios
For most complex function expressions, DataFrames concatenates all names of the columns that you used to create a new sink column name, which looks like col1_col2_function
. It's common that you want to use a different naming scheme, but you can't write @select(df, :x = {All()} + 1)
because then every new column would be named x
and that is disallowed. There are several options to deal with the problem of multiple new columns:
You can use a Vector of strings or symbols such as ["x", "y", "z"] = sqrt({All()}). The length has to match the number of columns in the multi-column specifier(s). This is the most direct way to specify multiple names, but it doesn't use the names of the source columns dynamically.
You can use DataFrameMacros' string shortcut syntax. If a string literal contains one or more {} brackets, it's treated as an anonymous function that takes in column names and splices them into the string. {} is equivalent to {1}, but you can access further names with {2} and so on, if more than one column is used in the function. In the example above, you could rename all columns with @select(df, "sqrt_of_{}" = sqrt({All()})).
You can use {1}, {2}, etc. in any expression to refer to the first, second, etc. column name. Like with the shortcut string syntax, {} is the same as {1}.
For example:
julia> df = DataFrame(a_1 = 1:3, b_1 = 4:6)
3×2 DataFrame
Row │ a_1 b_1
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> @transform(df, "result_" * split({}, "_")[1] = sqrt({All()}))
3×4 DataFrame
Row │ a_1 b_1 result_a result_b
│ Int64 Int64 Float64 Float64
─────┼──────────────────────────────────
1 │ 1 4 1.0 2.0
2 │ 2 5 1.41421 2.23607
3 │ 3 6 1.73205 2.44949
Passing multiple expressions
Multiple expressions can be passed as multiple positional arguments, or alternatively as separate lines in a begin end
block. You can use parentheses, or omit them. The following expressions are equivalent:
@transform(df, :y = :x + 1, :z = :x * 2)
@transform df :y = :x + 1 :z = :x * 2
@transform df begin
:y = :x + 1
:z = :x * 2
end
@transform(df, begin
:y = :x + 1
:z = :x * 2
end)
Modifier macros
You can modify the behavior of all macros using modifier macros, which are not real macros but only signal changed behavior for a positional argument to the outer macro.
macro | meaning |
---|---|
@byrow | Switch to by-row processing. |
@bycol | Switch to by-column processing. |
@passmissing | Wrap the function expression in passmissing. |
@astable | Collect all :symbol = expression expressions into a NamedTuple of the form (; symbol = expression, ...) and set the sink to AsTable. |
Example @bycol
To compute a centered column with @transform
, you need access to the whole column at once and signal this with the @bycol
modifier.
using Statistics
using DataFrames
using DataFrameMacros
julia> df = DataFrame(x = 1:3)
3×1 DataFrame
Row │ x
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
julia> @transform(df, :x_centered = @bycol :x .- mean(:x))
3×2 DataFrame
Row │ x x_centered
│ Int64 Float64
─────┼───────────────────
1 │ 1 -1.0
2 │ 2 0.0
3 │ 3 1.0
Example @passmissing
Many functions need to be wrapped in passmissing
to correctly return missing
if any input is missing
. This can be achieved with the @passmissing
modifier macro.
julia> df = DataFrame(name = ["alice", "bob", missing])
3×1 DataFrame
Row │ name
│ String?
─────┼─────────
1 │ alice
2 │ bob
3 │ missing
julia> @transform(df, :name_upper = @passmissing uppercasefirst(:name))
3×2 DataFrame
Row │ name name_upper
│ String? String?
─────┼─────────────────────
1 │ alice Alice
2 │ bob Bob
3 │ missing missing
Example @astable
In DataFrames, you can return a NamedTuple
from a function and then automatically expand it into separate columns by using AsTable
as the sink value. To simplify this process, you can use the @astable
modifier macro, which collects all statements of the form :symbol = expression
in the function body into a NamedTuple
, and sets the sink argument to AsTable
.
julia> df = DataFrame(name = ["Alice Smith", "Bob Miller"])
2×1 DataFrame
Row │ name
│ String
─────┼─────────────
1 │ Alice Smith
2 │ Bob Miller
julia> @transform(df, @astable begin
s = split(:name)
:first_name = s[1]
:last_name = s[2]
end)
2×3 DataFrame
Row │ name first_name last_name
│ String SubString… SubString…
─────┼─────────────────────────────────────
1 │ Alice Smith Alice Smith
2 │ Bob Miller Bob Miller
The @astable
modifier also works with tuple destructuring syntax, so the previous example can be shortened to:
@transform(df, @astable :first_name, :last_name = split(:name))
DataFrameMacros.@combine
— Macro
@combine(df, args...; kwargs...)
The @combine
macro builds a DataFrames.combine
call. Each expression in args
is converted to a src .=> function .=> sink
construct that conforms to the transformation mini-language of DataFrames.
Keyword arguments kwargs
are passed down to combine
but have to be separated from the positional arguments by a semicolon ;
.
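As a sketch of how a keyword argument is forwarded, the following passes `keepkeys`, a keyword of `DataFrames.combine` for grouped data, after the semicolon. The frame `df` and the names `with_keys` and `without_keys` are made up for illustration.

```julia
using DataFrames
using DataFrameMacros

# Hypothetical example data with a grouping column g.
df = DataFrame(g = ["a", "a", "b"], x = [1, 2, 3])
gdf = groupby(df, :g)

# Default: the grouping column is kept in the result.
with_keys = @combine(gdf, :total = sum(:x))

# The keyword after the semicolon is passed down to combine,
# which then drops the grouping column.
without_keys = @combine(gdf, :total = sum(:x); keepkeys = false)

println(names(with_keys))
println(names(without_keys))
```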
The transformation logic for all DataFrameMacros macros is explained in the DataFrameMacros
module docstring, accessible via ?DataFrameMacros
.
DataFrameMacros.@select!
— Macro
@select!(df, args...; kwargs...)
The @select!
macro builds a DataFrames.select!
call. Each expression in args
is converted to a src .=> function .=> sink
construct that conforms to the transformation mini-language of DataFrames.
Keyword arguments kwargs
are passed down to select!
but have to be separated from the positional arguments by a semicolon ;
.
The transformation logic for all DataFrameMacros macros is explained in the DataFrameMacros
module docstring, accessible via ?DataFrameMacros
.
@subset argument
You can pass a @subset
expression as the second argument to @select!
, between the input argument and the source-function-sink expressions. Then, the call is equivalent to first taking a subset
of the input with view = true
, then calling select!
on the subset and returning the mutated input. If the input is a GroupedDataFrame
, the parent DataFrame
is returned.
df = DataFrame(x = 1:5, y = 6:10)
@select!(df, @subset(:x > 3), :y = 20, :z = 3 * :x)
DataFrameMacros.@select
— Macro
@select(df, args...; kwargs...)
The @select
macro builds a DataFrames.select
call. Each expression in args
is converted to a src .=> function .=> sink
construct that conforms to the transformation mini-language of DataFrames.
Keyword arguments kwargs
are passed down to select
but have to be separated from the positional arguments by a semicolon ;
.
The transformation logic for all DataFrameMacros macros is explained in the DataFrameMacros
module docstring, accessible via ?DataFrameMacros
.
DataFrameMacros.@subset!
— Macro
@subset!(df, args...; kwargs...)
The @subset!
macro builds a DataFrames.subset!
call. Each expression in args
is converted to a src .=> function .=> sink
construct that conforms to the transformation mini-language of DataFrames.
Keyword arguments kwargs
are passed down to subset!
but have to be separated from the positional arguments by a semicolon ;
.
The transformation logic for all DataFrameMacros macros is explained in the DataFrameMacros
module docstring, accessible via ?DataFrameMacros
.
DataFrameMacros.@subset
— Macro
@subset(df, args...; kwargs...)
The @subset
macro builds a DataFrames.subset
call. Each expression in args
is converted to a src .=> function .=> sink
construct that conforms to the transformation mini-language of DataFrames.
Keyword arguments kwargs
are passed down to subset
but have to be separated from the positional arguments by a semicolon ;
.
The transformation logic for all DataFrameMacros macros is explained in the DataFrameMacros
module docstring, accessible via ?DataFrameMacros
.
DataFrameMacros.@transform!
— Macro
@transform!(df, args...; kwargs...)
The @transform!
macro builds a DataFrames.transform!
call. Each expression in args
is converted to a src .=> function .=> sink
construct that conforms to the transformation mini-language of DataFrames.
Keyword arguments kwargs
are passed down to transform!
but have to be separated from the positional arguments by a semicolon ;
.
The transformation logic for all DataFrameMacros macros is explained in the DataFrameMacros
module docstring, accessible via ?DataFrameMacros
.
@subset argument
You can pass a @subset
expression as the second argument to @transform!
, between the input argument and the source-function-sink expressions. Then, the call is equivalent to first taking a subset
of the input with view = true
, then calling transform!
on the subset and returning the mutated input. If the input is a GroupedDataFrame
, the parent DataFrame
is returned.
df = DataFrame(x = 1:5, y = 6:10)
@transform!(df, @subset(:x > 3), :y = 20, :z = 3 * :x)
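To make the effect concrete, the sketch below runs the same call and inspects the mutated columns. It assumes a DataFrames version that supports in-place transformations on views; the comments describe the behavior expected from the description above, where untouched rows of a newly created column are filled with missing.

```julia
using DataFrames
using DataFrameMacros

df = DataFrame(x = 1:5, y = 6:10)
@transform!(df, @subset(:x > 3), :y = 20, :z = 3 * :x)

# Only rows 4 and 5 satisfy :x > 3, so only their y values are overwritten.
println(df.y)

# The new column z has values only for the subset rows;
# the remaining rows hold missing.
println(df.z)
```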
DataFrameMacros.@transform
— Macro
@transform(df, args...; kwargs...)
The @transform
macro builds a DataFrames.transform
call. Each expression in args
is converted to a src .=> function .=> sink
construct that conforms to the transformation mini-language of DataFrames.
Keyword arguments kwargs
are passed down to transform
but have to be separated from the positional arguments by a semicolon ;
.
The transformation logic for all DataFrameMacros macros is explained in the DataFrameMacros
module docstring, accessible via ?DataFrameMacros
.
DataFrameMacros.@unique
— Macro
@unique(df, args...; kwargs...)
The @unique
macro builds a DataFrames.unique
call. Each expression in args
is converted to a src .=> function .=> sink
construct that conforms to the transformation mini-language of DataFrames.
Keyword arguments kwargs
are passed down to unique
but have to be separated from the positional arguments by a semicolon ;
.
The transformation logic for all DataFrameMacros macros is explained in the DataFrameMacros
module docstring, accessible via ?DataFrameMacros
.