The Data Processing Elements (DPE) is a table that defines and documents the processing used to generate harmonized datasets. It is typically prepared as an Excel spreadsheet. Each row indicates whether an input dataset can generate a DataSchema variable and, if so, how the input variables are processed to generate the harmonized variable as defined in the DataSchema. This page explains the basic methods for filling out the DPE so that Rmonize functions can use it correctly.
The DPE is typically an Excel file that you open locally on your computer and fill in row by row to define the harmonization rules. It contains at least 5 mandatory columns, plus one additional column for your own documentation. The process cannot work if any of the mandatory columns is missing.
id_creation is mandatory and is the first rule applied in the harmonization process. It lets the user specify the column used as the reference (identifier) for each observation (row).
Notes:
Usually, the harmonized variable is a standardized identifier generated from the input identifier.
If the dataset does not have an identifier column, the user can create an index before harmonization and provide this index as the variable to use.
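As an illustration of the last note, the following base R sketch adds a row index to a dataset that has no identifier column. The data frame and the column name `index` are hypothetical; any unique column name would do, and that name is what would then be given to id_creation.

```r
# Hypothetical input dataset with no identifier column.
dataset <- data.frame(
  age = c(34, 51, 27),
  sex = c("F", "M", "F")
)

# Add a row index to serve as the identifier.
# The column name `index` is illustrative, not required by Rmonize.
dataset$index <- seq_len(nrow(dataset))

# Check that the new identifier is unique before using it.
stopifnot(!anyDuplicated(dataset$index))
```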
With direct_mapping, the harmonized variable is generated by replicating one input variable.
With recode, the harmonized variable is generated by recoding values from one input variable.
Notes:
One and only one variable can be recoded at a time.
The variable to be recoded must be, at least partially, a categorical variable. To recode a continuous variable (for example, to create brackets), use case_when instead.
If every category is recoded to itself (for example recode(1 = 1 ; 2 = 2)), prefer direct_mapping instead.
Separate each value/code pair with an equal sign =
Separate each element with a semi-colon ;
Use ELSE = NA to assign NA to all other values.
If an equal sign already exists in the data, use _= to escape it. Likewise, if a semi-colon already exists in the data, use _; to escape it.
recode(
"banana _; apple" = "fruits" ;
"salad _; potatoe" = "veggies" ;
"bread _; pasta" = "carbs" )
recode(
"1000 (_='high')" = 3 ;
" 500 (_='mid')" = 2 ;
" 200 (_='low')" = 1 )
Several numerical values can be grouped using R syntax so that they are recoded to the same value.
recode(
0 = "low" ;
c(1:10) = "mid" ;
c(-7, -99) = NA )
If the recoding requires more complex logic, use case_when or other instead.
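To make the numeric example above concrete, here is a minimal base R sketch of the mapping it describes. It is not how Rmonize evaluates the rule internally; the vector `var_x` and its values are made up, and leaving unmatched values as NA is an assumption of this sketch.

```r
# Hypothetical input values, including two missing-value codes (-7, -99)
# and one value (42) that no recoding element covers.
var_x <- c(0, 3, 10, -7, -99, 42)

# Start from NA so that unmatched values stay missing (assumption of this sketch).
harmonized <- rep(NA_character_, length(var_x))

# 0 = "low"
harmonized[var_x %in% 0] <- "low"
# c(1:10) = "mid"
harmonized[var_x %in% 1:10] <- "mid"
# c(-7, -99) = NA
harmonized[var_x %in% c(-7, -99)] <- NA_character_

print(harmonized)
```

Writing the grouped values as `c(1:10)` or `c(-7, -99)` in the rule keeps one line per output category instead of one line per input value.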
With case_when, the harmonized variable is generated from one or more if-else conditions, using one or more input variables.
Notes:
Multiple variables can be used to combine their values using case_when. Separate each of them in input_variables by a semi-colon ;
If only one variable is used, and is (or seems) a categorical variable, use recode or direct_mapping instead.
Any conditional statement ("if … equals, is greater than, is not, …") can be used in this function. Separate the statement from its resulting value with a tilde ~
Separate each element with a semi-colon ;
Use ELSE ~ NA to assign NA to all other values.
case_when is sensitive to the data type. Every value produced by the statements must have the same data type, including the NA (for example NA_integer_ or NA_character_).
case_when(
var_x == 1 ~ 1L ;
var_x != 0 & !is.na(var_y) ~ 0L ;
ELSE ~ NA_integer_ )
case_when(
var_x == 1 ~ "1" ;
var_x != 0 & !is.na(var_y) ~ "0" ;
ELSE ~ NA_character_ )
If the statement requires more complex codification, use other instead.
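The integer example above can be tried outside the DPE with `dplyr::case_when()`, which the rule's syntax resembles. This is a sketch under assumptions: the input data are invented, and the DPE keyword ELSE is assumed to play the role of dplyr's catch-all `TRUE ~` condition. It requires the dplyr package.

```r
# Hypothetical input data with the two variables used in the rule.
input <- data.frame(
  var_x = c(1, 2, 0, 2),
  var_y = c(NA, 5, 5, NA)
)

# Conditions are checked in order; the first one that is TRUE wins.
# All right-hand sides are integers, including the typed NA.
harmonized <- dplyr::case_when(
  input$var_x == 1                        ~ 1L,
  input$var_x != 0 & !is.na(input$var_y)  ~ 0L,
  TRUE                                    ~ NA_integer_  # assumed equivalent of ELSE
)

print(harmonized)
```

Keeping every result of the same type (here integer) avoids the type conflicts mentioned in the notes.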
With paste, the harmonized variable is generated by setting the same value for all observations; the value is not taken from an input variable.
Notes:
This function does not require any variable. The user must provide __BLANK__ as a placeholder.
Usually, the harmonized variable is a standardized identifier for the whole dataset, which becomes useful when the harmonized datasets are aggregated into a pooled harmonized dataset.
With operation, the harmonized variable is generated by applying an operation to one or more input variables.
Notes:
Multiple variables can be combined in a single operation. Separate each of them in input_variables by a semi-colon ;
If the processing is (or seems) simple, prefer case_when, recode or direct_mapping instead.
Any package called in the formula must be installed on the user's machine (and loaded). To call a function from a specific package, use the double colon :: in the formula.
lubridate::year(var_x)
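Because a formula such as the one above fails if its package is not available, it can help to check this before running the harmonization. The helper below is a hypothetical convenience written in base R, not an Rmonize function; the package names passed to it are only examples.

```r
# Hypothetical helper: return the packages from `pkgs` that cannot be loaded
# on this machine. requireNamespace() returns TRUE when a package's namespace
# can be loaded, and FALSE otherwise.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

# Example: packages that the formulas in a DPE would call with `::`.
# Replace with the packages your own formulas use, e.g. "lubridate".
needed <- c("stats", "utils")
print(missing_packages(needed))
```

An empty result means every listed package can be called with `::` in a formula.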
With other, the harmonized variable is generated from a non-standard or complex processing rule that is not covered by the other rule categories.
Note:
If the code needs to modify the user's environment, use the double assignment operator <<- to place the result in that environment. Make sure you keep control of your environment when using the other function.
my_harmo_var <- runif(20) + ... # complex lines of code
# double assignment (<<-) to modify the user's environment.
harmonized_dossier$DATASET$variable_F <<- my_harmo_var
The other function can also be used to source code from a separate script in which complex harmonization processes are written.
source("my_file.R")
These additional features allow the user to handle specific cases. They ensure that each row is complete and that no argument is missing from the function to be run.
Notes:
__BLANK__ : Use when no input variable is needed to generate the harmonized variable (for example with the rule category paste or other).
impossible : Use when the research project does not collect the information required for the DataSchema variable, when the collected data cannot be used to generate it, or when this is unknown.
undetermined : Use when further investigation or future information is needed to complete the harmonization. This lets the user continue without being blocked in the process.