**id_creation** is mandatory and the first rule that initiates the harmonization process. This rule allows the user to provide the column used as a reference per observation (row).

`r DT_id_creation`

Notes:

* Usually, the harmonized variable is a standardized identifier generated from the input identifier.
* If the dataset does not have an identifier column, the user can create an index *before harmonization* and provide it as the variable to use.
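Creating an index before harmonization can be sketched in base R as follows. The data frame `my_dataset` and the column name `index` are hypothetical names chosen for illustration.

```r
# Hypothetical input dataset that has no identifier column.
my_dataset <- data.frame(
  age = c(34, 51, 27),
  sex = c("F", "M", "F")
)

# Create a row index before harmonization; this column can then be
# provided as the input variable of the id_creation rule.
my_dataset$index <- seq_len(nrow(my_dataset))
```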

With **direct_mapping**, the harmonized variable is generated by replicating one input variable.

`r DT_direct_mapping`

Note:

* One and only one variable can be replicated at a time.

With **recode**, the harmonized variable is generated by recoding values from one input variable.

`r DT_recode`

Notes:

* One and only one variable can be recoded at a time.
* The variable to be recoded must be, at least partially, a categorical variable. To recode a continuous variable (for example, to create brackets), use **case_when** instead.
* If every category is recoded to itself (recode(1 = 1 ; 2 = 2)), prefer **direct_mapping** instead.
* Separate each value from its code with an equal sign **=**.
* Separate each element with a semicolon **;**.
* Use **ELSE = NA** to assign NA to all other values.

If the values themselves contain an equal sign, use **\_=** as the separator in place of **=**. Likewise, if they contain a semicolon, use **\_;** in place of **;**.

```
recode(
  "banana ; apple" = "fruits"  _;
  "salad ; potato" = "veggies" _;
  "bread ; pasta"  = "carbs"
)

recode(
  "1000 (='high')" _= 3 ;
  "500 (='mid')"   _= 2 ;
  "200 (='low')"   _= 1
)
```

Values can be grouped using R syntax to recode several numerical values at once.

```
recode(
  0           = "low" ;
  c(1:10)     = "mid" ;
  c(-7, -99)  = NA
)
```

If the recoding requires more complex codification, use **case_when** or **other** instead.
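As a rough illustration of what the numeric recode example above describes (`0 = "low"`, `c(1:10) = "mid"`, `c(-7, -99) = NA`), here is a base R sketch of the same mapping. It is not how the recode rule is implemented; `var_x` and its values are hypothetical, and the sketch also sets any unlisted value to NA, as if `ELSE = NA` had been added.

```r
# Hypothetical input values, including two missing-value codes.
var_x <- c(0, 3, 10, -7, -99, 0)

# Start from NA so that the missing-value codes (-7, -99) and any value
# not listed below remain NA.
harmonized <- rep(NA_character_, length(var_x))

harmonized[var_x == 0]      <- "low"  # 0       = "low"
harmonized[var_x %in% 1:10] <- "mid"  # c(1:10) = "mid"

harmonized
```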

With **case_when**, the harmonized variable is generated from one or more if-else conditions, using one or more input variables.

`r DT_case_when`

Notes:

* Multiple variables can be combined using case_when. Separate them in **input_variables** with a semicolon **;**.
* If only one variable is used, and it is (or seems to be) a categorical variable, use **recode** or **direct_mapping** instead.
* Any statement ("if ... equals, is greater than, is not, ...") can be used in this function. Separate the statement from its code with a tilde **~**.
* Separate each element with a semicolon **;**.
* Use **ELSE ~ NA** to assign NA to all other values.

**case_when** is sensitive to the data type. Every code generated by the statements must have the same data type, including the NA.

```
case_when(
  var_x == 1                 ~ 1L ;
  var_x != 0 & !is.na(var_y) ~ 0L ;
  ELSE                       ~ NA_integer_
)

case_when(
  var_x == 1                 ~ "1" ;
  var_x != 0 & !is.na(var_y) ~ "0" ;
  ELSE                       ~ NA_character_
)
```

If the statement requires more complex codification, use **other** instead.
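The type sensitivity described above can be explored with `dplyr::case_when()`, which uses the same `condition ~ value` form. This is a sketch under the assumption that the rule behaves like the dplyr function; the vectors `var_x` and `var_y` are hypothetical, and plain dplyr uses commas and `TRUE` where the rule uses semicolons and `ELSE`.

```r
library(dplyr)

# Hypothetical input variables.
var_x <- c(1, 2, 0, NA)
var_y <- c(5, 5, 5, 5)

# All right-hand sides are integers, including the typed NA.
harmonized <- case_when(
  var_x == 1                 ~ 1L,
  var_x != 0 & !is.na(var_y) ~ 0L,
  TRUE                       ~ NA_integer_
)

# Mixing an integer code with a character code is rejected.
mixed <- tryCatch(
  case_when(var_x == 1 ~ 1L, TRUE ~ "0"),
  error = function(e) "type error"
)
```

Keeping every code in one type (all integer or all character) avoids this kind of failure.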

With **paste**, the harmonized variable is generated by setting the same value for all observations; the value is not taken from an input variable.

`r DT_paste`

Notes:

* This function does not require any variable. The user must provide **\_\_BLANK\_\_** as a placeholder.
* Usually, the harmonized variable is a standardized identifier for the whole dossier, used when the harmonized datasets are aggregated into a pooled harmonized dataset.

With **operation**, the harmonized variable is generated by applying an operation to one or more input variables.

`r DT_operation`

Notes:

* Multiple variables can be combined using operation. Separate them in **input_variables** with a semicolon **;**.
* If the operation is (or seems) simple, prefer **case_when**, **recode** or **direct_mapping** instead.
* Any package called in the **operation** script must be installed (and loaded) on the user's machine. To specify which package a function comes from, use the double colon **::** in the formula.

  ```
  lubridate::year(var_x)
  ```

* If the operation requires more complex codification, use **other** instead.
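One way to confirm that a package is available before relying on a `::` call is sketched below. The check with `requireNamespace()` is a suggestion rather than something the rule requires, and `var_x` is a hypothetical date variable.

```r
# Hypothetical date variable.
var_x <- as.Date(c("2001-05-17", "1998-11-02"))

# Only call the package function if the package is installed.
if (requireNamespace("lubridate", quietly = TRUE)) {
  # The '::' prefix states explicitly which package 'year' comes from.
  harmonized <- lubridate::year(var_x)
} else {
  message("Package 'lubridate' is not installed; install it before processing.")
}
```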

With **other**, the harmonized variable is generated from a non-standard or complex processing rule not covered by the other rule categories.

`r DT_other`

Note:

* This feature is equivalent to running local code or a function in an R script.

If an assignment is needed to modify the user's environment, use the double assignment operator **<<-** to place the result in that environment. Take care to keep your environment under control when using the **other** function.

```
my_harmo_var <- runif(20) + ...  # complex lines of code

# double assignment to modify the environment.
harmonized_dossier$DATASET$variable_F <<- my_harmo_var
```

The **other** function can also be used to source code from a separate script in which complex harmonization processes are written.

```
source("my_file.R")
```
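The difference that **<<-** makes can be seen in a small base R sketch. The wrapper `run_other_rule()` is hypothetical and only stands in for code evaluated outside the user's environment; the contents of `harmonized_dossier` are made up for illustration.

```r
# Hypothetical harmonized dossier: a named list holding one dataset.
harmonized_dossier <- list(DATASET = data.frame(id = 1:3))

# Hypothetical wrapper standing in for code run by the 'other' rule.
run_other_rule <- function() {
  my_harmo_var <- c(10, 20, 30)
  # A plain '<-' here would only change a local copy; '<<-' updates the
  # object found in the enclosing environment.
  harmonized_dossier$DATASET$variable_F <<- my_harmo_var
  invisible(NULL)
}

run_other_rule()
harmonized_dossier$DATASET$variable_F
```

Because `<<-` changes objects outside the code that runs it, it is worth checking the environment before and after the call.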

These additional features let the user handle specific cases. They ensure that each line is complete and that no argument is missing from the function to perform.

`r DT_impundebla`

Notes:

* *\_\_BLANK\_\_*: use when no variable is needed to generate the harmonized variable (for example, with the rule category **paste** or **other**).
* *impossible*: use when the research project does not collect the DataSchema variable, when the collected data cannot be used to generate it, or when this is unknown.
* *undetermined*: use when further investigation or future information is needed to harmonize; this lets the user continue without being blocked in the process.