Tutorial
In this tutorial, we'll get to know the macros of DataFrameMacros while working with the well-known Titanic dataset from Kaggle.
Loading the data
The titanic function returns the DataFrame with data about passengers of the Titanic.
julia> using DataFrameMacros, DataFrames, Statistics
julia> df = DataFrameMacros.titanic()
891×12 DataFrame
 Row │ PassengerId  Survived  Pclass  Name                               Sex     Age        SibSp  Parch  Ticket            Fare     Cabin    Embarked
     │ Int64        Int64     Int64   String                             String  Float64?   Int64  Int64  String            Float64  String?  String?
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │           1         0       3  Braund, Mr. Owen Harris            male         22.0      1      0  A/5 21171          7.25    missing  S
   2 │           2         1       1  Cumings, Mrs. John Bradley (Flor…  female       38.0      1      0  PC 17599          71.2833  C85      C
   3 │           3         1       3  Heikkinen, Miss. Laina             female       26.0      0      0  STON/O2. 3101282   7.925   missing  S
   4 │           4         1       1  Futrelle, Mrs. Jacques Heath (Li…  female       35.0      1      0  113803            53.1     C123     S
  ⋮  │      ⋮          ⋮        ⋮                     ⋮                    ⋮         ⋮        ⋮      ⋮           ⋮             ⋮        ⋮        ⋮
 889 │         889         0       3  Johnston, Miss. Catherine Helen …  female  missing        1      2  W./C. 6607        23.45    missing  S
 890 │         890         1       1  Behr, Mr. Karl Howell              male         26.0      0      0  111369            30.0     C148     C
 891 │         891         0       3  Dooley, Mr. Patrick                male         32.0      0      0  370376             7.75    missing  Q
                                                                                                                                884 rows omitted
@select
The simplest operation one can do is to select columns from a DataFrame. DataFrames.jl has the select function for that purpose and DataFrameMacros has the corresponding @select macro. We can pass symbols or strings with the names of the columns we're interested in.
julia> @select(df, :Name, :Age, :Survived)
891×3 DataFrame
 Row │ Name                               Age        Survived
     │ String                             Float64?   Int64
─────┼────────────────────────────────────────────────────────
   1 │ Braund, Mr. Owen Harris                 22.0         0
   2 │ Cumings, Mrs. John Bradley (Flor…       38.0         1
   3 │ Heikkinen, Miss. Laina                  26.0         1
   4 │ Futrelle, Mrs. Jacques Heath (Li…       35.0         1
  ⋮  │                 ⋮                      ⋮         ⋮
 889 │ Johnston, Miss. Catherine Helen …  missing           0
 890 │ Behr, Mr. Karl Howell                   26.0         1
 891 │ Dooley, Mr. Patrick                     32.0         0
                                              884 rows omitted
We can also compute new columns with @select. We can either specify a new column name ourselves, or let DataFrames choose one automatically.
For example, we can extract the last name from each name string by splitting at the comma.
julia> @select(df, :last_name = split(:Name, ",")[1])
891×1 DataFrame
 Row │ last_name
     │ SubStrin…
─────┼───────────
   1 │ Braund
   2 │ Cumings
   3 │ Heikkinen
   4 │ Futrelle
  ⋮  │     ⋮
 889 │ Johnston
 890 │ Behr
 891 │ Dooley
 884 rows omitted
The split function operates on a single string, so for this expression to work on the whole column :Name, there must be an implicit broadcast expansion happening. In DataFrameMacros, every macro but @combine works by-row by default. The expression that the @select macro creates is equivalent to the following ByRow construct:
select(df, :Name => ByRow(x -> split(x, ",")[1]) => :last_name)
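To see what by-row means without any DataFrame machinery, here is a minimal sketch in plain Julia. The helper name last_name and the two sample strings are made up for illustration; the only point is that applying a single-string function to every element of a column is an ordinary broadcast.

```julia
# Two sample values standing in for the :Name column (illustrative data).
passenger_names = ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"]

# A scalar function: it only knows how to handle one string.
last_name(name) = split(name, ",")[1]

# The dot broadcasts the scalar function over the vector, one element per row.
last_names = last_name.(passenger_names)
```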
@transform
Another thing we can try is to categorize every passenger as a child or an adult, with the boundary at 18 years.
Let's use the @transform macro this time, which keeps all existing columns and appends the new ones.
julia> @transform(df, :type = :Age >= 18 ? "adult" : "child")
ERROR: TypeError: non-boolean (Missing) used in boolean context
This command fails because some passengers have no age recorded, and the ternary operator ... ? ... : ... (a shortcut for if ... else ... end) cannot operate on missing values.
The @m passmissing flag macro
One option is to remove the missing values beforehand, but then we would have to delete rows from the dataset. A simpler option is to let the expression pass missing values through by using the special flag macro @m.
julia> @transform(df, :type = @m :Age >= 18 ? "adult" : "child")
891×13 DataFrame
 Row │ PassengerId  Survived  Pclass  Name                               Sex     Age        SibSp  Parch  Ticket            Fare     Cabin    Embarked  type
     │ Int64        Int64     Int64   String                             String  Float64?   Int64  Int64  String            Float64  String?  String?   String?
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │           1         0       3  Braund, Mr. Owen Harris            male         22.0      1      0  A/5 21171          7.25    missing  S         adult
   2 │           2         1       1  Cumings, Mrs. John Bradley (Flor…  female       38.0      1      0  PC 17599          71.2833  C85      C         adult
   3 │           3         1       3  Heikkinen, Miss. Laina             female       26.0      0      0  STON/O2. 3101282   7.925   missing  S         adult
   4 │           4         1       1  Futrelle, Mrs. Jacques Heath (Li…  female       35.0      1      0  113803            53.1     C123     S         adult
  ⋮  │      ⋮          ⋮        ⋮                     ⋮                    ⋮         ⋮        ⋮      ⋮           ⋮             ⋮        ⋮        ⋮         ⋮
 889 │         889         0       3  Johnston, Miss. Catherine Helen …  female  missing        1      2  W./C. 6607        23.45    missing  S         missing
 890 │         890         1       1  Behr, Mr. Karl Howell              male         26.0      0      0  111369            30.0     C148     C         adult
 891 │         891         0       3  Dooley, Mr. Patrick                male         32.0      0      0  370376             7.75    missing  Q         adult
                                                                                                                                         884 rows omitted
This is equivalent to a DataFrames construct in which the function is wrapped in passmissing:
transform(df, :Age => ByRow(passmissing(x -> x >= 18 ? "adult" : "child")) => :type)
This way, if any input argument is missing, the function returns missing, too.
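The idea behind passmissing can be sketched in a few lines of plain Julia. The wrapper below, skip_if_missing, is a hypothetical stand-in written for this illustration and handles only single-argument functions; the real passmissing comes from Missings.jl and is more general.

```julia
# Hypothetical single-argument stand-in for passmissing: return `missing`
# before the wrapped function ever sees a missing value.
skip_if_missing(f) = x -> ismissing(x) ? missing : f(x)

categorize = skip_if_missing(age -> age >= 18 ? "adult" : "child")

# Sample ages (illustrative data); the second passenger has no age recorded.
ages = [22.0, missing, 4.0]
types = categorize.(ages)   # the row with a missing age stays missing
```

Without the wrapper, the second element would reach the ternary operator and raise the TypeError shown earlier.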
@subset
To retain only rows that fulfill certain conditions, you can use the @subset macro. For this macro it does not make sense to specify sink column names, because derived columns do not appear in the result. If there are missing values, you can use the @m flag to pass them through the boolean condition, and add the keyword argument skipmissing = true, which the underlying subset function requires in order to drop such rows.
julia> @subset(df, @m startswith(:Name, "M") && :Age > 50; skipmissing = true)
7×12 DataFrame
 Row │ PassengerId  Survived  Pclass  Name                         Sex     Age       SibSp  Parch  Ticket       Fare     Cabin    Embarked
     │ Int64        Int64     Int64   String                       String  Float64?  Int64  Int64  String       Float64  String?  String?
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │           7         0       1  McCarthy, Mr. Timothy J      male        54.0      0      0  17463        51.8625  E46      S
   2 │         153         0       3  Meo, Mr. Alfonzo             male        55.5      0      0  A.5. 11206    8.05    missing  S
   3 │         318         0       2  Moraweck, Dr. Ernest         male        54.0      0      0  29011        14.0     missing  S
   4 │         457         0       1  Millet, Mr. Francis Davis    male        65.0      0      0  13509        26.55    E38      S
   5 │         493         0       1  Molson, Mr. Harry Markland   male        55.0      0      0  113787       30.5     C30      S
   6 │         673         0       2  Mitchell, Mr. Henry Michael  male        70.0      0      0  C.A. 24580   10.5     missing  S
   7 │         773         0       2  Mack, Mrs. (Mary)            female      57.0      0      0  S.O./P.P. 3  10.5     E77      S
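Written without the macro, the same pattern could look roughly like the sketch below. The three-row table is invented for the example, and the hand-written subset call is an approximation of what @subset with @m sets up, not its literal expansion.

```julia
using DataFrames

# Illustrative data: the second passenger has no age recorded.
people = DataFrame(
    Name = ["McCarthy, Mr. Timothy J", "Moran, Mr. James", "Braund, Mr. Owen Harris"],
    Age = [54.0, missing, 22.0],
)

# passmissing turns the condition into `missing` when any input is missing;
# skipmissing = true tells subset to drop those rows instead of erroring.
older_m = subset(
    people,
    [:Name, :Age] => ByRow(passmissing((name, age) -> startswith(name, "M") && age > 50));
    skipmissing = true,
)
```

Only the first row satisfies both conditions; the row with the missing age is dropped rather than causing an error.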
@groupby
The groupby function in DataFrames does not use the src => function => sink mini-language; it requires you to create any columns you want to group by beforehand. In DataFrameMacros, the @groupby macro works like a combination of transform and groupby, so that you can create columns and group by them in one stroke.
For example, we could group the passengers based on whether their last name begins with a letter from the first or the second half of the alphabet.
julia> @groupby(df, :alphabet_half = :Name[1] <= 'M' ? "first" : "second")
GroupedDataFrame with 2 groups based on key: alphabet_half
First Group (570 rows): alphabet_half = "first"
 Row │ PassengerId  Survived  Pclass  Name                               Sex     Age        SibSp  Parch  Ticket            Fare     Cabin    Embarked  alphabet_half
     │ Int64        Int64     Int64   String                             String  Float64?   Int64  Int64  String            Float64  String?  String?   String
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │           1         0       3  Braund, Mr. Owen Harris            male         22.0      1      0  A/5 21171          7.25    missing  S         first
   2 │           2         1       1  Cumings, Mrs. John Bradley (Flor…  female       38.0      1      0  PC 17599          71.2833  C85      C         first
   3 │           3         1       3  Heikkinen, Miss. Laina             female       26.0      0      0  STON/O2. 3101282   7.925   missing  S         first
   4 │           4         1       1  Futrelle, Mrs. Jacques Heath (Li…  female       35.0      1      0  113803            53.1     C123     S         first
  ⋮  │      ⋮          ⋮        ⋮                     ⋮                    ⋮         ⋮        ⋮      ⋮           ⋮             ⋮        ⋮        ⋮           ⋮
 567 │         888         1       1  Graham, Miss. Margaret Edith       female       19.0      0      0  112053            30.0     B42      S         first
 568 │         889         0       3  Johnston, Miss. Catherine Helen …  female  missing        1      2  W./C. 6607        23.45    missing  S         first
 569 │         890         1       1  Behr, Mr. Karl Howell              male         26.0      0      0  111369            30.0     C148     C         first
 570 │         891         0       3  Dooley, Mr. Patrick                male         32.0      0      0  370376             7.75    missing  Q         first
                                                                                                                                               562 rows omitted
⋮
Last Group (321 rows): alphabet_half = "second"
 Row │ PassengerId  Survived  Pclass  Name                               Sex     Age        SibSp  Parch  Ticket            Fare     Cabin    Embarked  alphabet_half
     │ Int64        Int64     Int64   String                             String  Float64?   Int64  Int64  String            Float64  String?  String?   String
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │           8         0       3  Palsson, Master. Gosta Leonard     male          2.0      3      1  349909            21.075   missing  S         second
   2 │          10         1       2  Nasser, Mrs. Nicholas (Adele Ach…  female       14.0      1      0  237736            30.0708  missing  C         second
   3 │          11         1       3  Sandstrom, Miss. Marguerite Rut    female        4.0      1      1  PP 9549           16.7     G6       S         second
   4 │          13         0       3  Saundercock, Mr. William Henry     male         20.0      0      0  A/5. 2151          8.05    missing  S         second
  ⋮  │      ⋮          ⋮        ⋮                     ⋮                    ⋮         ⋮        ⋮      ⋮           ⋮             ⋮        ⋮        ⋮           ⋮
 318 │         880         1       1  Potter, Mrs. Thomas Jr (Lily Ale…  female       56.0      0      1  11767             83.1583  C50      C         second
 319 │         881         1       2  Shelley, Mrs. William (Imanita P…  female       25.0      0      1  230433            26.0     missing  S         second
 320 │         885         0       3  Sutehall, Mr. Henry Jr             male         25.0      0      0  SOTON/OQ 392076    7.05    missing  S         second
 321 │         886         0       3  Rice, Mrs. William (Margaret Nor…  female       39.0      0      5  382652            29.125   missing  Q         second
                                                                                                                                               313 rows omitted
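For comparison, the two-step DataFrames version that @groupby saves you from writing could look like the following sketch. The two-row table is made up for the example, and the code is a hand-written equivalent under that assumption, not the macro's actual expansion.

```julia
using DataFrames

# Illustrative data with one name from each half of the alphabet.
people = DataFrame(Name = ["Braund, Mr. Owen Harris", "Palsson, Master. Gosta Leonard"])

# Step 1: create the grouping column by-row with transform.
with_half = transform(
    people,
    :Name => ByRow(name -> name[1] <= 'M' ? "first" : "second") => :alphabet_half,
)

# Step 2: group by the column that now exists.
groups = groupby(with_half, :alphabet_half)
```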
begin ... end syntax
You can of course group by multiple columns; in that case, just add more positional arguments. To write more readable code, we can arrange multiple arguments as lines in a begin ... end block instead of as comma-separated positional arguments.
julia> group = @groupby df begin
           :alphabet_half = :Name[1] <= 'M' ? "first" : "second"
           :Sex
       end
GroupedDataFrame with 4 groups based on keys: alphabet_half, Sex
First Group (368 rows): alphabet_half = "first", Sex = "male"
 Row │ PassengerId  Survived  Pclass  Name                               Sex     Age        SibSp  Parch  Ticket            Fare     Cabin    Embarked  alphabet_half
     │ Int64        Int64     Int64   String                             String  Float64?   Int64  Int64  String            Float64  String?  String?   String
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │           1         0       3  Braund, Mr. Owen Harris            male         22.0      1      0  A/5 21171          7.25    missing  S         first
   2 │           5         0       3  Allen, Mr. William Henry           male         35.0      0      0  373450             8.05    missing  S         first
   3 │           6         0       3  Moran, Mr. James                   male    missing        0      0  330877             8.4583  missing  Q         first
   4 │           7         0       1  McCarthy, Mr. Timothy J            male         54.0      0      0  17463             51.8625  E46      S         first
  ⋮  │      ⋮          ⋮        ⋮                     ⋮                    ⋮         ⋮        ⋮      ⋮           ⋮             ⋮        ⋮        ⋮           ⋮
 365 │         884         0       2  Banfield, Mr. Frederick James      male         28.0      0      0  C.A./SOTON 34068  10.5     missing  S         first
 366 │         887         0       2  Montvila, Rev. Juozas              male         27.0      0      0  211536            13.0     missing  S         first
 367 │         890         1       1  Behr, Mr. Karl Howell              male         26.0      0      0  111369            30.0     C148     C         first
 368 │         891         0       3  Dooley, Mr. Patrick                male         32.0      0      0  370376             7.75    missing  Q         first
                                                                                                                                               360 rows omitted
⋮
Last Group (112 rows): alphabet_half = "second", Sex = "female"
 Row │ PassengerId  Survived  Pclass  Name                               Sex     Age        SibSp  Parch  Ticket            Fare     Cabin    Embarked  alphabet_half
     │ Int64        Int64     Int64   String                             String  Float64?   Int64  Int64  String            Float64  String?  String?   String
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │          10         1       2  Nasser, Mrs. Nicholas (Adele Ach…  female       14.0      1      0  237736            30.0708  missing  C         second
   2 │          11         1       3  Sandstrom, Miss. Marguerite Rut    female        4.0      1      1  PP 9549           16.7     G6       S         second
   3 │          15         0       3  Vestrom, Miss. Hulda Amanda Adol…  female       14.0      0      0  350406             7.8542  missing  S         second
   4 │          19         0       3  Vander Planke, Mrs. Julius (Emel…  female       31.0      1      0  345763            18.0     missing  S         second
  ⋮  │      ⋮          ⋮        ⋮                     ⋮                    ⋮         ⋮        ⋮      ⋮           ⋮             ⋮        ⋮        ⋮           ⋮
 109 │         876         1       3  Najib, Miss. Adele Kiamie "Jane"   female       15.0      0      0  2667               7.225   missing  C         second
 110 │         880         1       1  Potter, Mrs. Thomas Jr (Lily Ale…  female       56.0      0      1  11767             83.1583  C50      C         second
 111 │         881         1       2  Shelley, Mrs. William (Imanita P…  female       25.0      0      1  230433            26.0     missing  S         second
 112 │         886         0       3  Rice, Mrs. William (Margaret Nor…  female       39.0      0      5  382652            29.125   missing  Q         second
                                                                                                                                               104 rows omitted
@combine
We can compute summary statistics on groups using the @combine macro. This is the only macro that works by-column by default because aggregations are most commonly computed on full columns, not on each row.
For example, we can compute survival rates for the groups we created above.
julia> @combine(group, :survival_rate = mean(:Survived))
4×3 DataFrame
 Row │ alphabet_half  Sex     survival_rate
     │ String         String  Float64
─────┼──────────────────────────────────────
   1 │ first          male         0.214674
   2 │ first          female       0.752475
   3 │ second         male         0.143541
   4 │ second         female       0.723214
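A by-column aggregation of this kind corresponds to a plain combine call in which the function receives each group's whole column. The sketch below uses a made-up four-row table and is meant as a hand-written counterpart, not as the macro's exact expansion.

```julia
using DataFrames, Statistics

# Illustrative data: two passengers per group.
passengers = DataFrame(
    Sex = ["male", "female", "male", "female"],
    Survived = [0, 1, 1, 1],
)

# mean is applied once per group to the full :Survived vector (no ByRow).
rates = combine(groupby(passengers, :Sex), :Survived => mean => :survival_rate)
```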
@chain
The @chain macro from Chain.jl is useful to build sequences of operations. It is not included in DataFrameMacros but works well with it.
In a chain, the first argument of each function or macro call is by default the result from the previous line.
julia> using Chain
julia> @chain df begin
           @select(:Sex, :Age, :Survived)
           dropmissing(:Age)
           @groupby(:Sex, :age_range = floor(Int, :Age/10) * 10 : ceil(Int, :Age/10) * 10 - 1)
           @combine(:survival_rate = mean(:Survived))
           @sort(first(:age_range), :Sex)
       end
17×3 DataFrame
 Row │ Sex     age_range  survival_rate
     │ String  UnitRang…  Float64
─────┼──────────────────────────────────
   1 │ female  0:9            0.633333
   2 │ male    0:9            0.59375
   3 │ female  10:19          0.772727
   4 │ male    10:19          0.125
  ⋮  │   ⋮         ⋮            ⋮
  15 │ female  60:69          1.0
  16 │ male    60:69          0.0833333
  17 │ male    70:79          0.0
                         10 rows omitted
This example also uses the @sort macro, which is useful when you want to sort by values that are derived from different columns, but which you don't want to include in the DataFrame.
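The arithmetic in the :age_range expression is easy to check on single values. The function names age_range and decade below are made up for this sketch. One thing the check shows is that an age that is an exact multiple of ten yields an empty range, because floor and ceil agree there; decade is one possible variant that always spans ten years, offered under the assumption that full decades are what is wanted.

```julia
# Same arithmetic as the :age_range expression above, as a plain function.
age_range(age) = floor(Int, age / 10) * 10 : ceil(Int, age / 10) * 10 - 1

# Hypothetical variant: derive both ends from the lower bound only.
function decade(age)
    lower = floor(Int, age / 10) * 10
    return lower:(lower + 9)
end

typical = age_range(22.0)    # floor gives 20, ceil gives 30, so 20:29
boundary = age_range(30.0)   # floor and ceil both give 30, so 30:29 (empty)
```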
The @c flag macro
Some @transform or @select calls require access to whole columns at once. One scenario is computing a z-score. Because @transform and @select work by-row by default, you need to add the @c flag macro to signal that you want to work by-column. This is the exact opposite of DataFrames, where you work by-column by default and signal by-row behavior with the ByRow wrapper.
julia> @select(
           dropmissing(df, :Age),
           :age_z = @c (:Age .- mean(:Age)) ./ std(:Age))
714×1 DataFrame
 Row │ age_z
     │ Float64
─────┼───────────
   1 │ -0.530005
   2 │  0.57143
   3 │ -0.254646
   4 │  0.364911
  ⋮  │     ⋮
 712 │ -0.736524
 713 │ -0.254646
 714 │  0.158392
 707 rows omitted
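In plain DataFrames the by-column default means the function simply receives the whole vector, so no wrapper is needed. The sketch below uses three made-up ages chosen so that the result is easy to verify by hand; it is a hand-written counterpart to the @c call, not its literal expansion.

```julia
using DataFrames, Statistics

# Illustrative data: mean is 20 and the sample standard deviation is 10.
ages = DataFrame(Age = [10.0, 20.0, 30.0])

# The anonymous function gets the full :Age column, so mean and std
# are computed over all rows before the element-wise arithmetic.
z = select(ages, :Age => (a -> (a .- mean(a)) ./ std(a)) => :age_z)
```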
The @t flag macro
If a computation should return multiple columns, DataFrames allows you to do this by returning a NamedTuple and setting the sink argument to AsTable. To streamline this process you can use the @t flag macro. It rewrites the expression so that every :symbol = expression assignment it contains becomes a field of a returned NamedTuple like (symbol = expression, symbol2...), and it sets the sink argument to AsTable.
julia> @select(df, @t begin
           nameparts = split(:Name, r"[\s,]+")
           :title = nameparts[2]
           :first_name = nameparts[3]
           :last_name = nameparts[1]
       end)
891×3 DataFrame
 Row │ title      first_name  last_name
     │ SubStrin…  SubStrin…   SubStrin…
─────┼──────────────────────────────────
   1 │ Mr.        Owen        Braund
   2 │ Mrs.       John        Cumings
   3 │ Miss.      Laina       Heikkinen
   4 │ Mrs.       Jacques     Futrelle
  ⋮  │     ⋮          ⋮           ⋮
 889 │ Miss.      Catherine   Johnston
 890 │ Mr.        Karl        Behr
 891 │ Mr.        Patrick     Dooley
                        884 rows omitted
You can also use tuple destructuring syntax with the @t macro. This can often make assignments of multiple columns even more terse:
julia> @select(df, @t begin
           :last_name, :title, :first_name, rest... = split(:Name, r"[\s,]+")
       end)
891×3 DataFrame
 Row │ last_name  title      first_name
     │ SubStrin…  SubStrin…  SubStrin…
─────┼──────────────────────────────────
   1 │ Braund     Mr.        Owen
   2 │ Cumings    Mrs.       John
   3 │ Heikkinen  Miss.      Laina
   4 │ Futrelle   Mrs.       Jacques
  ⋮  │     ⋮          ⋮          ⋮
 889 │ Johnston   Miss.      Catherine
 890 │ Behr       Mr.        Karl
 891 │ Dooley     Mr.        Patrick
                        884 rows omitted
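The NamedTuple and AsTable combination that @t builds on can be written by hand as in the sketch below. The one-row table and the helper name split_name are assumptions made for the example; the code illustrates the underlying DataFrames mechanism rather than the exact code the macro generates.

```julia
using DataFrames

# Illustrative single-row table.
people = DataFrame(Name = ["Braund, Mr. Owen Harris"])

# Hypothetical helper: returns one NamedTuple per row.
function split_name(name)
    nameparts = split(name, r"[\s,]+")
    return (title = nameparts[2], first_name = nameparts[3], last_name = nameparts[1])
end

# With AsTable as the sink, each NamedTuple field becomes its own column.
parts = select(people, :Name => ByRow(split_name) => AsTable)
```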
Multi-column specifications
So far we have only accessed a single column with each column specifier, like :Survived. But often, a transformation should be applied to a whole set of columns.
In DataFrameMacros, the source-function-sink pair construct being created is automatically broadcast over all column specifiers. This means that an expression marked by $ may evaluate not only to a single column identifier but also to a multi-column identifier. The broadcasting is invisible as long as you use only single-column identifiers, because broadcasting over singular objects results in a singular source-function-sink expression.
Possible identifiers are n-dimensional arrays of strings, symbols or integers, and all valid inputs to the DataFrames.names(df, specifier) function. Examples of these are All(), Not(:x), Between(:x, :z), any Type, or any Function that returns true or false given a column name String.
Let's look at a few basic examples. Here's a simple selection of columns without transformation:
julia> @select(df, $(Between(:Name, :Age)))
891×3 DataFrame
 Row │ Name                               Sex     Age
     │ String                             String  Float64?
─────┼──────────────────────────────────────────────────────
   1 │ Braund, Mr. Owen Harris            male         22.0
   2 │ Cumings, Mrs. John Bradley (Flor…  female       38.0
   3 │ Heikkinen, Miss. Laina             female       26.0
   4 │ Futrelle, Mrs. Jacques Heath (Li…  female       35.0
  ⋮  │                 ⋮                    ⋮         ⋮
 889 │ Johnston, Miss. Catherine Helen …  female  missing
 890 │ Behr, Mr. Karl Howell              male         26.0
 891 │ Dooley, Mr. Patrick                male         32.0
                                            884 rows omitted
Or another example with a Function that selects all columns ending with "e":
julia> @select(df, $(endswith("e")))
891×3 DataFrame
 Row │ Name                               Age        Fare
     │ String                             Float64?   Float64
─────┼───────────────────────────────────────────────────────
   1 │ Braund, Mr. Owen Harris                 22.0   7.25
   2 │ Cumings, Mrs. John Bradley (Flor…       38.0  71.2833
   3 │ Heikkinen, Miss. Laina                  26.0   7.925
   4 │ Futrelle, Mrs. Jacques Heath (Li…       35.0  53.1
  ⋮  │                 ⋮                      ⋮         ⋮
 889 │ Johnston, Miss. Catherine Helen …  missing    23.45
 890 │ Behr, Mr. Karl Howell                   26.0  30.0
 891 │ Dooley, Mr. Patrick                     32.0   7.75
                                             884 rows omitted
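Because these specifiers are resolved like inputs to DataFrames.names, you can check what a given specifier will match by calling names directly. The one-row table below is made up for the sketch; note that its Age column has no missing values, so unlike in the Titanic data its element type is a plain Float64.

```julia
using DataFrames

# Illustrative table with two string columns and two float columns.
small = DataFrame(
    Name = ["Braund, Mr. Owen Harris"],
    Sex = ["male"],
    Age = [22.0],
    Fare = [7.25],
)

between_cols = names(small, Between(:Name, :Age))   # positional range of columns
real_cols    = names(small, Real)                   # columns whose eltype is <: Real
e_cols       = names(small, endswith("e"))          # predicate on the name string
not_name     = names(small, Not(:Name))             # everything except :Name
```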
The next step is to actually compute with the selected columns. The resulting DataFrames mini-language construct is sources .=> function[s] .=> sinks where, in the default case, there is just a single function, even when using multiple columns.
For example, we can select all columns whose element type is a subtype of Real and convert them to Float32. The Age column is not among them, because its element type Union{Missing, Float64} is not a subtype of Real:
julia> @select(df, Float32($Real))
891×6 DataFrame
 Row │ PassengerId_Float32  Survived_Float32  Pclass_Float32  SibSp_Float32  Parch_Float32  Fare_Float32
     │ Float32              Float32           Float32         Float32        Float32        Float32
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────
   1 │                 1.0               0.0             3.0            1.0            0.0        7.25
   2 │                 2.0               1.0             1.0            1.0            0.0       71.2833
   3 │                 3.0               1.0             3.0            0.0            0.0        7.925
   4 │                 4.0               1.0             1.0            1.0            0.0       53.1
  ⋮  │          ⋮                  ⋮                ⋮               ⋮              ⋮             ⋮
 889 │               889.0               0.0             3.0            1.0            2.0       23.45
 890 │               890.0               1.0             1.0            0.0            0.0       30.0
 891 │               891.0               0.0             3.0            0.0            0.0        7.75
                                                                                         884 rows omitted
On the left-hand side of left_expression = right_expression, we can also create a multi-column-specifier object in order to choose a collection of column names for the result of right_expression. We can splice collections of existing names in with $, which makes it easy to create new names based on old ones. For example, to continue with the Float32 example, we could lowercase each column name and append a _32 suffix instead of relying on the automatic renaming.
julia> @select(df, lowercase.($Real) .* "_32" = Float32($Real))
891×6 DataFrame
 Row │ passengerid_32  survived_32  pclass_32  sibsp_32  parch_32  fare_32
     │ Float32         Float32      Float32    Float32   Float32   Float32
─────┼─────────────────────────────────────────────────────────────────────
   1 │            1.0          0.0        3.0       1.0       0.0   7.25
   2 │            2.0          1.0        1.0       1.0       0.0  71.2833
   3 │            3.0          1.0        3.0       0.0       0.0   7.925
   4 │            4.0          1.0        1.0       1.0       0.0  53.1
  ⋮  │       ⋮              ⋮           ⋮         ⋮         ⋮         ⋮
 889 │          889.0          0.0        3.0       1.0       2.0  23.45
 890 │          890.0          1.0        1.0       0.0       0.0  30.0
 891 │          891.0          0.0        3.0       0.0       0.0   7.75
                                                           884 rows omitted
Just to reiterate, this expression amounts to something close to:
select(df, DataFrameMacros.stringargs(df, Real) .=> ByRow(Float32) .=> lowercase.(DataFrameMacros.stringargs(df, Real) .* "_32"))
The stringargs function handles the conversion from input object to column names and is almost equivalent to using DataFrames.names, except that Symbols, Strings, and collections thereof are passed through as-is.
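How the broadcast .=> operator expands such a specification can be inspected in plain Julia, without a DataFrame. In the sketch below the two source names are written out by hand instead of being looked up from a table, and Float32 is used directly rather than wrapped in ByRow, so this only illustrates the shape of the resulting pairs.

```julia
# Source column names, written out by hand for this illustration.
sources = ["Survived", "Pclass"]

# Sink names derived from the source names, as in the example above.
sinks = lowercase.(sources) .* "_32"

# The single function is treated as a scalar and repeated for every column.
specs = sources .=> Float32 .=> sinks
```

Each element of specs is one source => function => sink entry of the kind that select accepts.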
We can see the broadcasting aspect better by combining column specifiers of different length in one expression. Let's pretend for example, that we wanted to have columns that compute interactions of multiple numeric variables, such as age with survival status or passenger class:
julia> @select(df, :Age * $[:Survived, :Pclass])
891×2 DataFrame
 Row │ Age_Survived_*  Age_Pclass_*
     │ Float64?        Float64?
─────┼──────────────────────────────
   1 │            0.0          66.0
   2 │           38.0          38.0
   3 │           26.0          78.0
   4 │           35.0          35.0
  ⋮  │       ⋮              ⋮
 889 │      missing       missing
 890 │           26.0          26.0
 891 │            0.0          96.0
                    884 rows omitted
As you can see, the :Age column was multiplied element-wise with each of the other two columns.
This also works with n-dimensional arrays. For example, to multiply multiple columns in all possible combinations, we can use one column vector and one row vector:
julia> @select(df, $[:Survived, :Pclass] * $(permutedims([:Survived, :Pclass])))
891×4 DataFrame
 Row │ Survived_Survived_*  Pclass_Survived_*  Survived_Pclass_*  Pclass_Pclass_*
     │ Int64                Int64              Int64              Int64
─────┼────────────────────────────────────────────────────────────────────────────
   1 │                   0                  0                  0                9
   2 │                   1                  1                  1                1
   3 │                   1                  3                  3                9
   4 │                   1                  1                  1                1
  ⋮  │          ⋮                   ⋮                  ⋮                 ⋮
 889 │                   0                  0                  0                9
 890 │                   1                  1                  1                1
 891 │                   0                  0                  0                9
                                                                  884 rows omitted
The sink specifier can be an n-dimensional array as well, which is finally flattened into a sequence of columns going column-first.
julia> @select(df, ["a" "c"; "b" "d"] = $[:Survived, :Pclass] * $(permutedims([:Survived, :Pclass])))
891×4 DataFrame
 Row │ a      b      c      d
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     0      0      0      9
   2 │     1      1      1      1
   3 │     1      3      3      9
   4 │     1      1      1      1
  ⋮  │   ⋮      ⋮      ⋮      ⋮
 889 │     0      0      0      9
 890 │     1      1      1      1
 891 │     0      0      0      9
                  884 rows omitted
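The shape and ordering involved here can be checked in plain Julia. In the sketch below, the pairs merely stand in for the two source columns of each product; it shows how a column vector broadcast against a row vector yields a 2×2 grid, and how vec flattens a matrix column-first, which is the order in which the names a, b, c, d were assigned above.

```julia
cols = [:Survived, :Pclass]

# Column vector against row vector: every combination, as a 2×2 matrix.
combos = cols .=> permutedims(cols)

# Flattening goes down the first column, then down the second.
flat = vec(combos)
sink_names = vec(["a" "c"; "b" "d"])
```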
The left-hand side doesn't have to match the size of the right-hand side expression (remember, we're broadcasting). If you have more names than source columns, the source columns are simply copied multiple times.
julia> @select(df, ["a", "b", "c"] = :Survived)
891×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     0      0      0
   2 │     1      1      1
   3 │     1      1      1
   4 │     1      1      1
  ⋮  │   ⋮      ⋮      ⋮
 889 │     0      0      0
 890 │     1      1      1
 891 │     0      0      0
           884 rows omitted