DataFrameMacros.jl
DataFrameMacros.jl offers macros for DataFrame manipulation with a syntax geared towards clarity, brevity and convenience. Each macro translates expressions into the more verbose source => function => sink
mini-language from DataFrames.jl.
Here is a simple example:
julia> using DataFrameMacros, DataFrames
julia> df = DataFrame(name = ["Mary Louise Parker", "Thomas John Fisher"])
2×1 DataFrame Row │ name │ String ─────┼──────────────────── 1 │ Mary Louise Parker 2 │ Thomas John Fisher
julia> result = @transform(df, :middle_initial = split(:name)[2][1] * ".")
2×2 DataFrame Row │ name middle_initial │ String String ─────┼──────────────────────────────────── 1 │ Mary Louise Parker L. 2 │ Thomas John Fisher J.
Unlike DataFrames.jl, most operations are row-wise by default. This often results in cleaner code that's easier to understand and reason about, especially when string or object manipulation is involved. Such operations often don't have a clean broadcasting syntax, for example, somestring[2]
is easier to read than getindex.(somestrings, 2)
. The same is true for someobject.property
and getproperty.(someobjects, :property)
.
The following macros are currently available:
@transform
/@transform!
@select
/@select!
@groupby
@combine
@subset
/@subset!
@sort
/@sort!
@unique
Together with Chain.jl, you get a convient syntax for chains of transformations:
using DataFrameMacros
using DataFrames
using Chain
using Random
using Statistics
Random.seed!(123)
df = DataFrame(
id = shuffle(1:5),
group = rand('a':'b', 5),
weight_kg = randn(5) .* 5 .+ 60,
height_cm = randn(5) .* 10 .+ 170)
result = @chain df begin
@subset(:weight_kg > 50)
@transform(:BMI = :weight_kg / (:height_cm / 100) ^ 2)
@groupby(iseven(:id), :group)
@combine(:mean_BMI = mean(:BMI))
@sort(sqrt(:mean_BMI))
end
show(result)
4×3 DataFrame
Row │ id_iseven group mean_BMI
│ Bool Char Float64
─────┼────────────────────────────
1 │ false a 19.0728
2 │ true a 20.4405
3 │ false b 22.097
4 │ true b 22.9701
Design choices
These are the most important aspects that differ from other packages (DataFramesMeta.jl in particular):
- All macros except
@combine
work row-wise by default. This reduces syntax complexity in most cases because no broadcasting is necessary. A flag macro (@c
or@r
) can be used to switch between row/column-based mode when needed. @groupby
and@sort
allow using arbitrary expressions including multiple columns, without having to@transform
first and repeat the new column names.- Column expressions are interpolated into the macro with
$
. All column expressions are broadcasted implicitly to create a collection of src-func-sink pairs. This allows to use multi-column specifiers likeAll()
orNot(:x)
where the specified transformation is executed for each column. - Keyword arguments to the macro-underlying functions work by separating them from column expressions with the
;
character. - Target column names are written with
:
symbols to avoid visual ambiguity (:newcol = ...
). This also allows to useAsTable
as a target like in DataFrames.jl. - The flag macro can also include the character
m
to switch on automaticpassmissing
in row-wise mode. - There is also a
@t
flag macro, which extracts every:sym = expression
expression and collects the new symbols in a named tuple, while setting the target toAsTable
.