**id_creation** is mandatory and is the first rule, which initiates the
harmonization process. This rule allows the user to provide the column used as
the reference for each observation (row).
`r DT_id_creation`
Notes:
* Usually, the harmonized variable is a standardized identifier generated
from the input identifier.
* If the dataset does not have any identifier column, the user can create
an index *before harmonization* and provide it as the variable to use.
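
For a dataset without an identifier column, an index can be added in base R before harmonization. This is a minimal sketch: the data frame `dataset_A` and the column name `index` are hypothetical, and any unique, non-missing column would serve the same purpose.

```r
# Hypothetical dataset without an identifier column.
dataset_A <- data.frame(
  age = c(34, 51, 28),
  sex = c("F", "M", "F"))

# Create a row index before harmonization to use as the reference column.
dataset_A$index <- seq_len(nrow(dataset_A))
```

The column `index` can then be given as the input variable of the **id_creation** rule.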
The harmonized variable is generated by replicating one input variable.
`r DT_direct_mapping`
Note:
* One and only one variable can be replicated at a time.
The harmonized variable is generated by recoding values from one input variable.
`r DT_recode`
Notes:
* One and only one variable can be recoded at a time.
* The variable to be recoded must be (at least partially) a categorical
variable. To recode a continuous variable (to create brackets, for example),
use **case_when** instead.
* If all categories are recoded to the same categories (recode(1 = 1 ; 2 = 2)),
prefer **direct_mapping** instead.
* Separate each value/code with an equal sign **=**
* Separate each element with a semi-colon **;** .
* Use **ELSE = NA** to attribute NA to all of the other values.
If an equal sign already exists in the data, use **\_=** to escape it. Likewise,
if a semi-colon already exists in the data, use **\_;** to escape it.
```
recode(
  "banana ; apple" = "fruits"  _;
  "salad ; potato" = "veggies" _;
  "bread ; pasta"  = "carbs" )

recode(
  "1000 (='high')" _= 3 ;
  " 500 (='mid')"  _= 2 ;
  " 200 (='low')"  _= 1 )
```
Multiple numerical values can be grouped using R syntax and recoded together.
```
recode(
0 = "low" ;
c(1:10) = "mid" ;
c(-7, -99) = NA )
```
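
The grouped recode above can be mirrored in base R to make its expected result concrete. This sketch only illustrates the mapping and is not how the package evaluates the rule; the vector `var_x` is hypothetical and contains only values listed in the rule.

```r
# Hypothetical input values, including the missing-value codes -7 and -99.
var_x <- c(0, 3, 10, -7, -99, 0)

# Start from NA, then fill in each recoded group.
harmonized <- rep(NA_character_, length(var_x))
harmonized[var_x == 0]      <- "low"
harmonized[var_x %in% 1:10] <- "mid"
# c(-7, -99) = NA: these positions simply stay NA.

harmonized
```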
If the recoding requires more complex codification, use **case_when** or **other** instead.
The harmonized variable is generated from one or more if-else conditions,
using one or more input variables.
`r DT_case_when`
Notes:
* Multiple variables can be used to combine their values using case_when. Separate
each of them in **input_variables** by a semi-colon **;**
* If only one variable is used, and is (or seems) a categorical
variable, use **recode** or **direct_mapping** instead.
* Any conditional statement ("if ... equals, is greater than, is not, ...") can be used in this function.
Separate the statement from its code with a tilde **~**
* Separate each element with a semi-colon **;** .
* Use **ELSE ~ NA** to attribute NA to all of the other values.
**case_when** is sensitive to the data type. All the codes generated by the statements
must have the same data type, including the NA.
```
case_when(
  var_x == 1                 ~ 1L ;
  var_x != 0 & !is.na(var_y) ~ 0L ;
  ELSE                       ~ NA_integer_ )

case_when(
  var_x == 1                 ~ "1" ;
  var_x != 0 & !is.na(var_y) ~ "0" ;
  ELSE                       ~ NA_character_ )
```
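
The first example above can be traced in base R to show which code each observation receives and why a typed NA keeps the result in a single data type. This is a sketch of the logic only, under the assumption that statements are evaluated in order and the first match wins; the vectors `var_x` and `var_y` are hypothetical.

```r
# Hypothetical input variables.
var_x <- c(1, 0, 2, NA)
var_y <- c(NA, 5, 3, 4)

# ELSE ~ NA_integer_: start from an integer NA so every code shares one type.
harmonized <- rep(NA_integer_, length(var_x))

# Apply the conditions in reverse order so that the first one takes precedence.
# which() drops positions where the condition itself evaluates to NA.
harmonized[which(var_x != 0 & !is.na(var_y))] <- 0L
harmonized[which(var_x == 1)] <- 1L

harmonized
typeof(harmonized)
```

Rows that match no statement, or whose statements evaluate to NA, keep the integer NA set at the start.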
If the statement requires more complex codification, use **other** instead.
The harmonized variable is generated by setting the same value for all observations, not taken from an input variable.
`r DT_paste`
Notes:
* This function does not require any variable. The user must provide
**\_\_BLANK\_\_** as a placeholder.
* Usually, the harmonized variable is a standardized identifier for the whole
dossier, which is needed when the harmonized datasets are aggregated into a
pooled harmonized dataset.
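
The pooling scenario in the last note can be sketched in base R. The data frames, the column `study_id`, and its values are hypothetical; the point is only that a constant column keeps the origin of each row visible after pooling.

```r
# Hypothetical harmonized datasets from two studies.
harmonized_A <- data.frame(age = c(34, 51))
harmonized_B <- data.frame(age = c(28, 40, 63))

# Same value for every observation of a given dataset.
harmonized_A$study_id <- "study_A"
harmonized_B$study_id <- "study_B"

# After pooling, each row can still be traced back to its dataset.
pooled <- rbind(harmonized_A, harmonized_B)
table(pooled$study_id)
```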
The harmonized variable is generated by applying an operation to one or more input variables.
`r DT_operation`
Notes:
* Multiple variables can be combined using **operation**. Separate
each of them in **input_variables** by a semi-colon **;**
* If the operation is (or seems) simple, prefer **case_when**, **recode** or **direct_mapping**
instead.
* Any library called in the script must be installed on the user's machine (and loaded).
To call a function from a specific library, use the double colon **::** in the formula.
```
lubridate::year(var_x)
```
* If the operation requires more complex codification, use **other** instead.
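
An arithmetic operation on two input variables might look like the following base R sketch. The variables `weight_kg` and `height_cm` are hypothetical and only illustrate the kind of expression meant here.

```r
# Hypothetical input variables.
weight_kg <- c(70, 82.5)
height_cm <- c(175, 160)

# Body mass index computed from two input variables.
bmi <- weight_kg / (height_cm / 100)^2
round(bmi, 1)
```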
The harmonized variable is generated from a non-standard or complex processing
rule, not covered by other rule categories.
`r DT_other`
Note:
* This feature is equivalent to running local code or a function in an R script.
If an assignment is needed to modify the user's environment, use
the double assignment **<<-** to place the result in the user environment. Make sure
you keep control of your environment when using the **other** function.
```
my_harmo_var <- runif(20) + ... # complex lines of code
# double assignment to modify the environment.
harmonized_dossier$DATASET$variable_F <<- my_harmo_var
```
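
The difference between `<-` and `<<-` can be seen in a small base R sketch. The function `run_rule()` and the object `harmonized_values` are hypothetical stand-ins for code evaluated inside a function and for an object in the user environment.

```r
# Hypothetical object living in the user's environment.
harmonized_values <- NULL

run_rule <- function() {
  local_result <- c(1, 2, 3) * 2      # computed inside the function
  # `<-` here would only create a local copy; `<<-` updates the outer object.
  harmonized_values <<- local_result
  invisible(NULL)
}

run_rule()
harmonized_values
```

Because `<<-` overwrites objects outside the function, it can silently replace an existing object of the same name, which is why the environment needs to be kept under control.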
The **other** function can also be used to source code from a different script where complex
harmonization processes are written.
```
source("my_file.R")
```
These additional features allow the user to handle specific cases. They ensure that each line
is complete and that no argument is missing from the function to perform.
`r DT_impundebla`
Notes:
* *\_\_BLANK\_\_* : If no variable is needed to generate the harmonized variable
(for example, when using the rule category **paste** or **other**).
* *impossible* : If the research project does not collect the DataSchema variable,
if the collected data cannot be used to generate it, or if this is unknown.
* *undetermined* : If the user needs further investigation or future information
to complete the harmonization, they can use this feature without being blocked in
the process.