Chapter 4 Data Manipulation
This chapter aims at introducing the {tidyverse}: you'll learn how to manipulate and transform your data, going from simple transformations (e.g. renaming columns, sorting tables, relocating variables, etc.) to complex ones (e.g. transposing parts of the table, combining/merging tables, etc.). Such transformations are illustrated through simple examples that are relevant to Sensory and Consumer Science.
4.1 Why Manipulate Data?
Before starting, note that most functions used in this chapter come from the {tidyverse}. Also, many of the transformations proposed are not being saved: if you want to apply these changes to your data set, please visit Section 2.3.3. Let's start with loading this package:
library(tidyverse)
In sensory science, different data collection tools (e.g. different devices, software, methodologies, etc.) may provide the same data in different ways. Also, different statistical analyses may require having the data structured differently.
A simple example to illustrate this latter point is the analysis of liking data.
Let C consumers provide their hedonic assessments of P samples. To evaluate whether samples have received different mean liking scores at the population level, an ANOVA is performed on a long, thin table with 3 columns (consumer, sample, and liking score), where the C×P combinations are spread across rows.
However, to assess whether consumers share the same preference patterns at the individual level, internal preference mapping or cluster analysis is performed. Both these analyses require as input a short and wide table with the P products in rows and the C consumers in columns.
Another example of data manipulation consists in summarizing data, for instance by computing the mean liking score per product, or by generating frequency tables (e.g. the distribution of the liking scores by product). In this case, the transformation alters the data, as the individual differences are lost. Ultimately, the output table is smaller, as it would here contain only P rows and 2 columns.
For these reasons, it is essential to learn to manipulate data and transition from one structure to another (when possible).
4.2 Tidying Data
Hadley Wickham (Wickham (2014)) defined tidy data as “data sets that are arranged such that each variable is a column and each observation (or case) is a row.” Depending on the statistical unit to consider and the analyses to perform, data may need to be manipulated to be presented in a tidy form.
4.2.1 Simple Manipulations
We define here as simple manipulations any data transformations that can easily be performed in other software such as Excel (using copy-paste, sorting and filtering, creating a pivot table, etc.). However, we strongly recommend performing any sort of transformation in R, as this reduces the risk of errors, is typically faster and more reliable, and is reusable if you need to perform the same operations on similar data in the future (including updated versions of the current data set). Moreover, these operations will become easier and more natural to you as you get familiar with them. Most importantly, performing these transformations in R does not alter the original data, meaning that changes can be reverted (which is not possible when you alter the raw data directly).
4.2.1.1 Handling Columns
4.2.1.1.1 Renaming Variables
Before transforming any data, we need data. So let's start by importing the biscuits_sensory_profile.xlsx file. For importing the data, the packages {here} and {readxl} are being used. Here, the data are saved in an object called sensory:
library(here)
library(readxl)

file_path <- here("data", "biscuits_sensory_profile.xlsx")
sensory <- read_xlsx(file_path, sheet="Data")
The first simple transformation we consider consists of renaming one or multiple variables. This procedure can easily be done using the rename() function from {dplyr}. In each of the examples below, we use the names() function to show just the names of the resulting data set.
sensory %>%
  names()
## [1] "Judge" "Product"
## [3] "Shiny" "External color intensity"
## [5] "Color evenness"
In sensory, let's rename Judge into Panellist, and Product into Sample (here we apply the transformations without saving the results, so the original data set remains unchanged). To do so, we indicate in rename() that the new name replaces the old name as follows: rename(new_name = old_name). Additionally, we can apply multiple changes by simply separating them with a comma:
sensory %>%
  rename(Panellist = Judge, Sample = Product) %>%
  names()
## [1] "Panellist" "Sample"
## [3] "Shiny" "External color intensity"
## [5] "Color evenness"
Alternatively, it is also possible to rename a column using its position:
sensory %>%
  rename(Consumer=1, Biscuit=2) %>%
  names()
## [1] "Consumer" "Biscuit"
## [3] "Shiny" "External color intensity"
## [5] "Color evenness"
If this renaming procedure should be applied to many variables following a structured form (e.g. transforming names into snake_case, CamelCase, etc.; see https://en.wikipedia.org/wiki/Letter_case#Use_within_programming_languages for more information), the {janitor} package comes in handy thanks to its clean_names() function and its case parameter:
library(janitor)

sensory %>%
  clean_names(case="snake") %>%
  names()
## [1] "judge" "product"
## [3] "shiny" "external_color_intensity"
## [5] "color_evenness"
Note that the {janitor} package offers many options, and although the transformation is performed here on all the variables, it is possible to apply it to certain variables only.
4.2.1.1.2 Re-Organizing Columns
Another simple transformation consists in re-organizing the data set, either by re-ordering, adding, and/or removing columns.
For re-ordering columns, relocate() is used. This function allows re-positioning a (set of) variable(s) before or after another variable. Re-using the sensory data set, let's position all the variables starting with 'Qty' between Product and Shiny. This can be specified in two different ways: either by positioning them after Product, or before Shiny:
sensory %>%
  relocate(starts_with("Qty"), .after=Product) %>%
  names()

sensory %>%
  relocate(starts_with("Qty"), .before=Shiny) %>%
  names()
## [1] "Judge" "Product"
## [3] "Qty of inclusions" "Qty of inclusions in mouth"
## [5] "Shiny"
## [1] "Judge" "Product"
## [3] "Qty of inclusions" "Qty of inclusions in mouth"
## [5] "Shiny"
4.2.1.1.3 Removing/Selecting Columns
Another very important function regarding column transformations is select() from {dplyr} (see Section 2.2.1 for a justification of the particular writing dplyr::select()), which allows selecting a set of variables by simply informing which variables should be kept in the data. Let's limit ourselves to selecting Judge, Product, and Shiny:
sensory %>%
  dplyr::select(Judge, Product, Shiny)
## # A tibble: 99 x 3
## Judge Product Shiny
## <chr> <chr> <dbl>
## 1 J01 P01 52.8
## 2 J01 P02 48.6
## 3 J01 P03 48
## 4 J01 P04 46.2
## 5 J01 P05 0
## 6 J01 P06 0
## 7 J01 P07 5.4
## 8 J01 P08 0
## 9 J01 P09 0
## 10 J01 P10 53.4
## # ... with 89 more rows
When a long series of consecutive variables should be kept, the : operator can be used. Let's keep all the variables going from Cereal flavor to Dairy flavor:
sensory %>%
  dplyr::select(Judge, Product, `Cereal flavor`:`Dairy flavor`)
## # A tibble: 99 x 6
## Judge Product `Cereal flavor` `RawDough flavor` `Fatty flavor` `Dairy flavor`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 J01 P01 24.6 28.2 13.8 0
## 2 J01 P02 25.8 28.8 7.2 0
## 3 J01 P03 30 26.4 0 0
## 4 J01 P04 16.2 28.2 0 0
## 5 J01 P05 21.6 27.6 6.6 0
## 6 J01 P06 28.8 27.6 0 0
## 7 J01 P07 18 21 6 0
## 8 J01 P08 23.4 27 7.2 0
## 9 J01 P09 28.8 26.4 7.2 0
## 10 J01 P10 32.4 27.6 12 21
## # ... with 89 more rows
However, when only one (or a few) variable needs to be removed, it is easier to specify which one to remove rather than informing all the ones to keep. This is done using the - sign. For instance, to keep everything except Shiny and Melting:
sensory %>%
  dplyr::select(-c(Shiny, Melting))
The selection process can be further informed through functions such as starts_with(), ends_with(), and contains(), which select variables whose names start with, end with, or contain a certain (sequence of) character(s), respectively. To illustrate this, let's only keep the variables that start with 'Qty':
sensory %>%
  dplyr::select(starts_with("Qty"))
## # A tibble: 99 x 2
## `Qty of inclusions` `Qty of inclusions in mouth`
## <dbl> <dbl>
## 1 9.6 27.6
## 2 10.8 22.2
## 3 7.8 10.2
## 4 0 13.2
## 5 0 18
## 6 27.6 26.4
## 7 0 9
## 8 23.4 25.8
## 9 21 29.4
## 10 18 30
## # ... with 89 more rows
Rather than selecting variables based on their names, we can also select them based on their position (e.g. dplyr::select(2:5) to keep the variables at positions 2 to 5).
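As a quick self-contained sketch (using a small hypothetical tibble rather than sensory, so the snippet runs on its own), position-based selection looks like this:

```r
library(dplyr)

# Hypothetical mini table standing in for the sensory data
toy <- tibble(Judge = "J01", Product = "P01",
              Shiny = 52.8, Sticky = 37.2, Melting = 33.6)

# Keep the variables at positions 2 to 4 (Product, Shiny, and Sticky)
toy %>%
  dplyr::select(2:4) %>%
  names()
```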
Variables can also be selected through rules thanks to the where() function. Let's consider the situation in which we only want to keep the variables that are nominal (character in R), which automatically keeps Judge and Product:
sensory %>%
  dplyr::select(where(is.character))
## # A tibble: 99 x 2
## Judge Product
## <chr> <chr>
## 1 J01 P01
## 2 J01 P02
## 3 J01 P03
## 4 J01 P04
## 5 J01 P05
## 6 J01 P06
## 7 J01 P07
## 8 J01 P08
## 9 J01 P09
## 10 J01 P10
## # ... with 89 more rows
dplyr::select() is a very powerful function that facilitates complex variable selections through very intuitive helpers. Ultimately, it can also be used to relocate() and even rename() variables, as shown in the example below:
sensory %>%
  dplyr::select(Panellist = Judge, Sample = Product, Shiny:Thickness, -contains("olor"))
More examples illustrating the use of dplyr::select() are provided throughout the book. In particular, in the next section, another important function called across() will be introduced. This function can also be associated with dplyr::select() to give even more freedom in our selection. An example of such an application is given in …
4.2.1.1.4 Creating Columns
In some cases, new variables need to be created from existing ones. Examples of such situations include taking the quadratic term of a sensory attribute to test for curvature, or simply defining a new variable as the sum of (or difference between) two or more others. Such a variable is created through the mutate() function from the {dplyr} package. This function takes as inputs the name of the variable to create and the formula that defines it. Let's create two new variables: one called Shiny2, which corresponds to Shiny squared, and one called StiMelt, which corresponds to Sticky + Melting. Since we only use these three variables, let's first reduce the data set to them with select() to improve readability:
sensory %>%
  dplyr::select(Shiny, Sticky, Melting) %>%
  mutate(Shiny2 = Shiny^2,
         StiMelt = Sticky + Melting)
## # A tibble: 99 x 5
## Shiny Sticky Melting Shiny2 StiMelt
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 52.8 37.2 33.6 2788. 70.8
## 2 48.6 35.4 36 2362. 71.4
## 3 48 37.2 8.4 2304 45.6
## 4 46.2 21.6 34.2 2134. 55.8
## 5 0 39 34.8 0 73.8
## 6 0 36 21.6 0 57.6
## 7 5.4 22.8 34.8 29.2 57.6
## 8 0 36.6 19.2 0 55.8
## 9 0 37.2 19.8 0 57
## 10 53.4 35.4 32.4 2852. 67.8
## # ... with 89 more rows
If you want to transform a variable, say by changing its type or re-writing its content, you can use mutate() and assign to the new variable the same name as the original one. This will overwrite the existing column with the new one. To illustrate this, let's transform Product from upper case to lower case. This is done by mutating Product into the lowercase version of Product (tolower(Product)):

sensory %>%
  mutate(Product = tolower(Product))
mutate() being one of the most important functions of the {dplyr} package, more examples of its use are presented throughout this book.
Since performing mathematical computations on non-numerical columns is not possible, conditions can easily be added through mutate() combined with across(). Let's imagine we want to round all variables to 0 decimals, which can only be applied to numerical variables. To do so, we mutate() across() all variables that are numeric (using where(is.numeric)):
sensory %>%
  mutate(across(where(is.numeric), round, digits=0))
## # A tibble: 99 x 34
## Judge Product Shiny `External color inten~` `Color evenness` `Qty of inclus~`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 J01 P01 53 30 23 10
## 2 J01 P02 49 30 13 11
## 3 J01 P03 48 46 17 8
## 4 J01 P04 46 46 38 0
## 5 J01 P05 0 23 49 0
## 6 J01 P06 0 50 24 28
## 7 J01 P07 5 7 17 0
## 8 J01 P08 0 52 49 23
## 9 J01 P09 0 43 18 21
## 10 J01 P10 53 37 11 18
## # ... with 89 more rows, and 28 more variables: `Surface defects` <dbl>,
## # `Print quality` <dbl>, Thickness <dbl>, `Color contrast` <dbl>,
## # `Overall odor intensity` <dbl>, `Fatty odor` <dbl>, `Roasted odor` <dbl>,
## # `Cereal flavor` <dbl>, `RawDough flavor` <dbl>, `Fatty flavor` <dbl>,
## # `Dairy flavor` <dbl>, `Roasted flavor` <dbl>,
## # `Overall flavor persistence` <dbl>, Salty <dbl>, Sweet <dbl>, Sour <dbl>,
## # Bitter <dbl>, Astringent <dbl>, Warming <dbl>, ...
In case only a selection of the numerical variables should be rounded, we could also replace where(is.numeric) with a vector (using c()) containing the names of the variables to round:
sensory %>%
  dplyr::select(Shiny, Sticky, Melting) %>%
  mutate(across(c("Shiny","Sticky"), round, digits=0))
4.2.1.1.5 Merging and Separating Columns
It can happen that some columns of a data set contain (character) entries that combine several pieces of information. For instance, we could imagine coding the names of our panelists as FirstName_LastName or Gender_Name, and we would want to separate them into two columns to distinguish the different pieces of information (i.e. FirstName and LastName, or Gender and Name, respectively). In other situations, we may want to merge information present in multiple columns into one.
For illustration, let's consider the information stored in the Product Info sheet of biscuits_sensory_profile.xlsx. This table includes information regarding the biscuits, more precisely their Protein and Fiber content (Low or High). After importing the data, let's merge these two columns so that both pieces of information are stored in one column called ProtFib. To do so, unite() (from {tidyr}) is used. This function takes as its first element the name of the new variable, followed by all the columns to unite, and the separator to use between these elements (here, '-'):
(prod_info <- read_xlsx(file_path, sheet="Product Info") %>%
   unite(ProtFib, Protein, Fiber, sep="-"))
## # A tibble: 11 x 3
## Product ProtFib Type
## <chr> <chr> <chr>
## 1 P01 Low-Low Trial
## 2 P02 Low-High Trial
## 3 P03 High-High Trial
## 4 P04 High-High Trial
## 5 P05 High-Low Trial
## 6 P06 High-Low Trial
## 7 P07 Low-High Trial
## 8 P08 High-Low Trial
## 9 P09 High-High Trial
## 10 P10 Low-Low Commercial product
## 11 POpt High-Low Optimized trial
By default, unite() removes from the data set the individual variables that have been merged. To keep these original variables, the parameter remove = FALSE can be used.
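For instance (a small sketch on a hypothetical two-factor design, not the actual Product Info sheet), remove = FALSE keeps Protein and Fiber next to the new ProtFib column:

```r
library(tidyr)

# Hypothetical design table mimicking the Product Info structure
design <- tibble::tibble(Product = c("P01", "P02"),
                         Protein = c("Low", "High"),
                         Fiber   = c("High", "Low"))

# With remove = FALSE, the original columns are kept alongside ProtFib
design %>%
  unite(ProtFib, Protein, Fiber, sep = "-", remove = FALSE)
```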
Although it is not needed for combining columns here, it is worth mentioning an additional package that can be used to combine elements: the {glue} package, which provides interesting alternatives to the usual paste() and paste0() functions.
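As a brief illustration (a sketch that is not part of the original example), {glue} interpolates variables directly inside a string, whereas paste0() requires splitting the string into pieces:

```r
library(glue)

product <- "P02"
score <- 48.6

# Base R: the string is assembled from separate pieces
paste0("Sample ", product, " received a mean liking score of ", score)

# {glue}: variables are interpolated inside curly braces
glue("Sample {product} received a mean liking score of {score}")
```

Both calls produce the same sentence; the {glue} version is often easier to read when many variables are involved.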
To reverse the changes (saved here in prod_info) and separate a column into different variables, separate() (from {tidyr}) is used. Similarly to unite(), separate() takes as its first parameter the name of the variable to split, followed by the names of the different segments to generate, and of course the separator defined by sep. In our example, this is done as follows:
prod_info %>%
  separate(ProtFib, c("Protein","Fiber"), sep="-")
## # A tibble: 11 x 4
## Product Protein Fiber Type
## <chr> <chr> <chr> <chr>
## 1 P01 Low Low Trial
## 2 P02 Low High Trial
## 3 P03 High High Trial
## 4 P04 High High Trial
## 5 P05 High Low Trial
## 6 P06 High Low Trial
## 7 P07 Low High Trial
## 8 P08 High Low Trial
## 9 P09 High High Trial
## 10 P10 Low Low Commercial product
## 11 POpt High Low Optimized trial
4.2.1.1.6 Conditions
In some cases, the new column to create depends directly on the value(s) of one or more columns present in the data. An example of such a situation consists in categorizing a continuous variable into groups, say by converting the age (in years) of the participants into age groups. For such simple examples, pre-existing functions (e.g. cut() in this situation) can be used.
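To hint at how such a pre-defined function works (a self-contained sketch with made-up ages), cut() converts a numeric vector into classes based on break points:

```r
# Hypothetical ages (in years) of six participants
age <- c(19, 23, 34, 45, 52, 67)

# The breaks define the class limits; by default the upper border is included
cut(age, breaks = c(0, 25, 45, Inf), labels = c("18-25", "26-45", "46+"))
```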
However, in other situations, pre-defined functions do not exist and the transformation should be done manually using conditions.
Let's illustrate this by converting the Shiny variable (from sensory) from numeric to classes. Since the scale used is a 60pt scale, let's start by creating a class called Low if the score is lower than 30, and High otherwise. Here, a pre-defined function (e.g. cut()) is intentionally not used, as a manual transformation is preferred. Instead, mutate() is associated with ifelse(), which works as follows: ifelse(condition, result if condition is TRUE, result if condition is FALSE).
sensory %>%
  dplyr::select(Shiny) %>%
  mutate(ShinyGroup = ifelse(Shiny < 30, "Low", "High"))
## # A tibble: 99 x 2
## Shiny ShinyGroup
## <dbl> <chr>
## 1 52.8 High
## 2 48.6 High
## 3 48 High
## 4 46.2 High
## 5 0 Low
## 6 0 Low
## 7 5.4 Low
## 8 0 Low
## 9 0 Low
## 10 53.4 High
## # ... with 89 more rows
Let's imagine the same variable should now be split into three levels: Low, Medium, and High. Such an additional group can be obtained by nesting an ifelse() condition within the existing ifelse() condition (we use 48 instead of 40 as the upper limit for Medium so that all three groups appear in the results displayed on screen):
sensory %>%
  dplyr::select(Shiny) %>%
  mutate(ShinyGroup = ifelse(Shiny < 20, "Low",
                             ifelse(Shiny < 48, "Medium", "High")))
## # A tibble: 99 x 2
## Shiny ShinyGroup
## <dbl> <chr>
## 1 52.8 High
## 2 48.6 High
## 3 48 High
## 4 46.2 Medium
## 5 0 Low
## 6 0 Low
## 7 5.4 Low
## 8 0 Low
## 9 0 Low
## 10 53.4 High
## # ... with 89 more rows
Since there are only 3 conditions in total here, only two nested ifelse() calls are required, which keeps the code manageable. However, in more complex situations (say, 10 different conditions), such a solution quickly becomes tedious to read, to track, and to debug if errors are made. Instead, the use of an alternative function called case_when() is preferred. In the previous case, the same conditions would be written as follows:
sensory %>%
  dplyr::select(Shiny) %>%
  mutate(ShinyGroup = case_when(Shiny < 20 ~ "Low",
                                between(Shiny, 20, 48) ~ "Medium",
                                Shiny > 48 ~ "High"))
## # A tibble: 99 x 2
## Shiny ShinyGroup
## <dbl> <chr>
## 1 52.8 High
## 2 48.6 High
## 3 48 Medium
## 4 46.2 Medium
## 5 0 Low
## 6 0 Low
## 7 5.4 Low
## 8 0 Low
## 9 0 Low
## 10 53.4 High
## # ... with 89 more rows
This provides the same results as previously, except for the exact value 48, which was assigned to High in the ifelse() example and to Medium in the case_when() example. This is due to the way between() considers its borders (both are included).
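A quick check (a small sketch, not part of the original text) confirms that between() includes both of its borders, which is why the exact value 48 falls into Medium here:

```r
library(dplyr)

between(48, 20, 48)    # both borders are included, so this returns TRUE
between(20, 20, 48)    # TRUE as well
between(48.1, 20, 48)  # FALSE: just above the upper border
```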
4.2.1.2 Handling Rows
After manipulating columns, the next logical step is to manipulate rows. Such operations include four aspects:
- Re-arranging the rows in a logical way;
- Selecting certain rows;
- Filtering entries based on given variables;
- Splitting the data into sub-groups based on the entries of a given variable.
4.2.1.2.1 Re-arranging Rows
Re-arranging rows is done through the arrange() function from the {dplyr} package. This function sorts the data in ascending order. To arrange the data in descending order, the desc() function is also required.
Let's re-arrange the data by Judge and Product, Judge being sorted in ascending order whereas Product is sorted in descending order:
sensory %>%
  arrange(Judge, desc(Product))
## # A tibble: 99 x 34
## Judge Product Shiny `External color inten~` `Color evenness` `Qty of inclus~`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 J01 POpt 4.8 33.6 15.6 32.4
## 2 J01 P10 53.4 36.6 11.4 18
## 3 J01 P09 0 42.6 18 21
## 4 J01 P08 0 51.6 48.6 23.4
## 5 J01 P07 5.4 6.6 17.4 0
## 6 J01 P06 0 50.4 24 27.6
## 7 J01 P05 0 23.4 49.2 0
## 8 J01 P04 46.2 45.6 37.8 0
## 9 J01 P03 48 45.6 17.4 7.8
## 10 J01 P02 48.6 30 13.2 10.8
## # ... with 89 more rows, and 28 more variables: `Surface defects` <dbl>,
## # `Print quality` <dbl>, Thickness <dbl>, `Color contrast` <dbl>,
## # `Overall odor intensity` <dbl>, `Fatty odor` <dbl>, `Roasted odor` <dbl>,
## # `Cereal flavor` <dbl>, `RawDough flavor` <dbl>, `Fatty flavor` <dbl>,
## # `Dairy flavor` <dbl>, `Roasted flavor` <dbl>,
## # `Overall flavor persistence` <dbl>, Salty <dbl>, Sweet <dbl>, Sour <dbl>,
## # Bitter <dbl>, Astringent <dbl>, Warming <dbl>, ...
4.2.1.2.2 Selecting Rows
The next step is to select a subset of the data by keeping certain rows only. If the position of the rows to keep is known, this information can be used directly with the slice() function. Let's select from sensory all the data related to P01. A quick look at the data informs us that it corresponds to rows 1 to 89, with a step of 11:
sensory %>%
  slice(seq(1, 89, 11))
## # A tibble: 9 x 34
## Judge Product Shiny `External color intens~` `Color evenness` `Qty of inclus~`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 J01 P01 52.8 30 22.8 9.6
## 2 J02 P01 44.4 34.2 14.4 18.6
## 3 J03 P01 40.2 23.4 9 7.8
## 4 J04 P01 37.8 26.4 15 23.4
## 5 J05 P01 43.8 30 12.9 0
## 6 J06 P01 43.2 27 22.8 31.2
## 7 J07 P01 44.4 18.6 10.8 25.8
## 8 J08 P01 25.2 28.2 6.6 16.8
## 9 J09 P01 45.6 18 28.2 6.6
## # ... with 28 more variables: `Surface defects` <dbl>, `Print quality` <dbl>,
## # Thickness <dbl>, `Color contrast` <dbl>, `Overall odor intensity` <dbl>,
## # `Fatty odor` <dbl>, `Roasted odor` <dbl>, `Cereal flavor` <dbl>,
## # `RawDough flavor` <dbl>, `Fatty flavor` <dbl>, `Dairy flavor` <dbl>,
## # `Roasted flavor` <dbl>, `Overall flavor persistence` <dbl>, Salty <dbl>,
## # Sweet <dbl>, Sour <dbl>, Bitter <dbl>, Astringent <dbl>, Warming <dbl>,
## # `Initial hardness` <dbl>, Brittle <dbl>, Crunchy <dbl>, ...
This is a manual way to select data. However, this procedure may generate an erroneous subset if the row order in the data changes. To avoid such mistakes, a more stable procedure for filtering data is proposed in the next section.
4.2.1.2.3 Filtering Data
To define a subset of the data, the filter() function is used. This function requires an argument expressed as a test, meaning that the outcome should either be TRUE (keep the row) or FALSE (discard the row) depending on whether the condition is verified or not. In R, equality is tested with the double equal sign ==. Let's filter the data to only keep the rows related to sample P02:
sensory %>%
  filter(Product == "P02")
## # A tibble: 9 x 34
## Judge Product Shiny `External color intens~` `Color evenness` `Qty of inclus~`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 J01 P02 48.6 30 13.2 10.8
## 2 J02 P02 39.6 32.4 18 19.8
## 3 J03 P02 39 18.6 13.2 9
## 4 J04 P02 39.6 41.4 33 25.2
## 5 J05 P02 33.6 30.6 22.8 0
## 6 J06 P02 38.4 30.6 25.8 36
## 7 J07 P02 25.8 22.2 12.6 19.8
## 8 J08 P02 47.4 29.4 18 5.4
## 9 J09 P02 40.2 30.6 26.4 0
## # ... with 28 more variables: `Surface defects` <dbl>, `Print quality` <dbl>,
## # Thickness <dbl>, `Color contrast` <dbl>, `Overall odor intensity` <dbl>,
## # `Fatty odor` <dbl>, `Roasted odor` <dbl>, `Cereal flavor` <dbl>,
## # `RawDough flavor` <dbl>, `Fatty flavor` <dbl>, `Dairy flavor` <dbl>,
## # `Roasted flavor` <dbl>, `Overall flavor persistence` <dbl>, Salty <dbl>,
## # Sweet <dbl>, Sour <dbl>, Bitter <dbl>, Astringent <dbl>, Warming <dbl>,
## # `Initial hardness` <dbl>, Brittle <dbl>, Crunchy <dbl>, ...
Other relevant test operators are the following:
- !(Product == "P02") or Product != "P02" means 'different from', and will keep all samples except P02;
- Product %in% my_vector keeps any value included in the vector my_vector (e.g. Product %in% c("P01","P02","P03") keeps all data from P01, P02, and P03).
In some cases, the tests to perform are more complex as they require multiple conditions. There are two ways of combining conditions:
- & (read 'and') is multiplicative, meaning that all the conditions need to be true (Product == "P02" & Shiny > 40);
- | (read 'or') is additive, meaning that only one of the conditions needs to be true (Product == "P03" | Shiny > 40).
This system of conditions is particularly useful when you have missing values, as you can remove all the rows that contain missing values for a given variable. Since we do not have missing values here, let's create some by replacing all the Shiny evaluations that are larger than 40 with missing values. This is done using ifelse(), which takes three arguments (in this order): the test to perform (here Shiny > 40), the instruction if the test passes (here, replace with NA), and the instruction if the test fails (here, keep the value stored in Shiny).
sensory_na <- sensory %>%
  dplyr::select(Judge, Product, Shiny) %>%
  mutate(Shiny = ifelse(Shiny > 40, NA, Shiny))
In a second step, we filter out all missing values from Shiny. In practice, this is done by keeping all the values that are not missing:
sensory_na %>%
  filter(!is.na(Shiny))
This procedure removed 20 rows since the original table had 99 rows and 3 columns, whereas the filtered table only has 79 rows and 3 columns.
4.2.1.2.4 Splitting Data
After filtering data, the next logical step is to split the data into subsets based on a given variable (e.g. by gender). For this purpose, one could consider applying filter() to each subgroup; this is what we did in a previous example when we filtered the data for sample P02 only. Of course, the same procedure could be repeated until all the other sub-groups are created. However, this solution becomes tedious as the number of samples increases. Instead, we prefer to use split(), which takes as arguments the data and the column to split by:
sensory %>%
  split(.$Product)
This function creates a list of n elements (n being the number of samples), each element corresponding to the data related to one sample. Such a list can then be used in automated analyses by applying the same computations to each sub-data set through the map() function, as will be illustrated in Section REF.
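To give a flavor of that workflow (a hedged sketch on a small hypothetical table; the full treatment comes later in the book), map() from {purrr} applies the same computation to each element of the list returned by split():

```r
library(dplyr)
library(purrr)

# Hypothetical mini version of the sensory table
toy <- tibble(Product = c("P01", "P01", "P02", "P02"),
              Shiny   = c(52.8, 44.4, 48.6, 39.6))

# Split by product, then compute the mean Shiny score within each sub-table
toy %>%
  split(.$Product) %>%
  map(~ mean(.x$Shiny))
```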
4.2.2 Reshaping Data
Reshaping the data itself is done through pivoting, which allows transitioning from a long and thin table to a short and wide table and vice versa. To illustrate this, consider a fictive example in which 3 consumers provide their liking scores for 2 products: stacking the consumer, product, and liking columns vertically corresponds to the long and thin format. Now imagine we need to re-structure the data so that products are displayed in rows and consumers in columns: this corresponds to the short and wide format.
As we will see in the following sections, it is very easy to transition from one version to the other thanks to pivot_longer() and pivot_wider(), both being functions from {tidyr}.
4.2.2.1 Pivoting Longer
Currently, our sensory data table has as many rows as Judge x Product combinations, the different attributes being spread across columns. However, in certain situations it is relevant to have all the attributes stacked vertically, meaning that the table will have Judge x Product x Attribute rows. Such a transformation is done with pivot_longer(), which takes as inputs the attributes to pivot, the name of the variable that will contain their names (names_to), and the name of the variable that will contain their entries (values_to).
With pivot_longer() and any other function that requires selecting variables, it is often easier to deselect the variables that we do not want to include than to select all the variables of interest. Throughout the book, both solutions are considered.
sensory %>%
  pivot_longer(Shiny:Melting, names_to="Attribute", values_to="Score")
## # A tibble: 3,168 x 4
## Judge Product Attribute Score
## <chr> <chr> <chr> <dbl>
## 1 J01 P01 Shiny 52.8
## 2 J01 P01 External color intensity 30
## 3 J01 P01 Color evenness 22.8
## 4 J01 P01 Qty of inclusions 9.6
## 5 J01 P01 Surface defects 22.8
## 6 J01 P01 Print quality 48.6
## 7 J01 P01 Thickness 38.4
## 8 J01 P01 Color contrast 37.8
## 9 J01 P01 Overall odor intensity 16.8
## 10 J01 P01 Fatty odor 6.6
## # ... with 3,158 more rows
This transformation converts a table of 99 rows and 34 columns into a table of 3168 (99×32) rows and 4 columns. In the pivoted table, the attribute names (stored in Attribute) appear in the same order as in the original table.
In case the attribute names follow a standard structure, say 'attribute_name modality' as is the case in sensory for some attributes, an additional parameter of pivot_longer() becomes handy, as it can split the newly created Attribute variable into, say, Attribute and Modality. To illustrate this, let's reduce sensory to Judge, Product, and all the variables that end with 'odor' or 'flavor' (all other variables being discarded). While pivoting this subset of columns, we automatically split the attribute names into Attribute and Modality by informing the separator between the names (here, a space):
sensory %>%
  dplyr::select(Judge, Product, ends_with("odor"), ends_with("flavor")) %>%
  pivot_longer(-c(Judge, Product),
               names_to=c("Attribute","Modality"), values_to="Score",
               names_sep=" ")
## # A tibble: 693 x 5
## Judge Product Attribute Modality Score
## <chr> <chr> <chr> <chr> <dbl>
## 1 J01 P01 Fatty odor 6.6
## 2 J01 P01 Roasted odor 15.6
## 3 J01 P01 Cereal flavor 24.6
## 4 J01 P01 RawDough flavor 28.2
## 5 J01 P01 Fatty flavor 13.8
## 6 J01 P01 Dairy flavor 0
## 7 J01 P01 Roasted flavor 24.6
## 8 J01 P02 Fatty odor 5.4
## 9 J01 P02 Roasted odor 20.4
## 10 J01 P02 Cereal flavor 25.8
## # ... with 683 more rows
This parameter combines the power of both pivot_longer() and separate() in one unique process.
Note that more complex transformations through the use of regular expressions (and the names_pattern
option) can be considered. More information on this topic is provided in REF CHAPTER TEXTUAL (TO CHECK IF IT IS THE CASE!).
It can happen that, after pivot_longer(), further transformations performed on the long and thin table do not maintain the original order of the variables (usually, the names are re-ordered alphabetically). If such re-ordering should not happen because it would impact the final results, there is a simple workaround that ensures that the order is kept. The solution consists in transforming the newly created variable into a factor which takes as levels the elements in the order they appear in the original data (use fct_inorder() to maintain the order as shown in the data). To illustrate this, let's compare the two following results:

sensory %>%
  dplyr::select(Judge, Product, Shiny, Salty, Bitter, Light) %>%
  pivot_longer(-c("Judge","Product"), names_to="Variable", values_to="Score") %>%
  split(.$Variable) %>%
  names()

sensory %>%
  dplyr::select(Judge, Product, Shiny, Salty, Bitter, Light) %>%
  pivot_longer(-c("Judge","Product"), names_to="Variable", values_to="Score") %>%
  mutate(Variable = fct_inorder(Variable)) %>%
  split(.$Variable) %>%
  names()
Other examples using this trick appear throughout this book (e.g. see Section REF).
As an alternative to pivot_longer()
, the package called {reshape2}
provides a function called melt()
which pivots automatically the entire set of numerical variables, the qualitative variables being considered as id variables. If performed on a matrix with row names, the new table will have two columns containing the row and column names.
library(reshape2)
melt(sensory)
4.2.2.2 Pivoting Wider
The complementary/opposite function to pivot_longer()
is pivot_wider()
. This function pivots data horizontally, hence reducing the number of rows and increasing the number of columns. In this case, the two main parameters to provide are the column from which the new column names are taken (names_from
), and the column containing the values used to fill the table (values_from
).
From the previous example, we could set names_from = Attribute
and values_from = Score
to return to the original format of sensory
. However, let’s reduce the data set to Product
, Judge
, and Shiny
only, and let’s pivot the Judge
and Shiny
columns:
sensory %>%
  dplyr::select(Judge, Product, Shiny) %>%
  pivot_wider(names_from = Judge, values_from = Shiny)
## # A tibble: 11 x 10
## Product J01 J02 J03 J04 J05 J06 J07 J08 J09
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 P01 52.8 44.4 40.2 37.8 43.8 43.2 44.4 25.2 45.6
## 2 P02 48.6 39.6 39 39.6 33.6 38.4 25.8 47.4 40.2
## 3 P03 48 36 35.4 15 37.2 33 16.2 15 39
## 4 P04 46.2 36 48 38.4 47.4 37.8 27 54 48.6
## 5 P05 0 5.4 35.4 10.2 21 17.4 0 15 21
## 6 P06 0 13.8 0 0 7.8 21 5.4 7.2 6.6
## 7 P07 5.4 41.4 37.2 30.6 46.2 26.4 28.2 29.4 47.4
## 8 P08 0 0 21 7.8 13.8 15 0 10.8 12
## 9 P09 0 0 0 0 9.6 9 6.6 0 6.6
## 10 P10 53.4 13.8 34.2 16.2 36 32.4 3 20.4 25.8
## 11 POpt 4.8 10.2 36 12 17.4 13.2 10.2 17.4 30.6
This procedure creates a table with as many rows as there are products, and as many columns as there are panelists (+1 since the product information is also in a column).
These procedures are particularly useful in consumer studies, since pivot_longer()
and pivot_wider()
allow restructuring the data for analyses such as ANOVA (pivot_longer()
output) and preference mapping or clustering (pivot_wider()
structure).
It is important to notice that the pivot_wider()
format potentially contains more data. Let’s imagine the sensory test was performed following an incomplete design, meaning that each panelist did not evaluate all the samples. Although the long and thin structure would not show missing values (the entire rows without data being removed), the shorter and wider version would contain missing values for the products that panelists did not evaluate. If the user wants to automatically replace these missing values with a fixed value, this is possible through the parameter values_fill
(e.g. values_fill=0
would replace each missing value with a 0). Additionally, after pivoting the data, if multiple entries exist for a row-column combination, pivot_wider()
will return a list of elements. An example illustrating such a situation, and how to handle it, is presented in the next section.
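To make the values_fill behavior concrete, here is a small sketch: the slice(-1) step is purely illustrative, simulating one missing evaluation in the otherwise complete sensory data set.

```r
# Illustrative sketch: simulate an incomplete design by dropping one row,
# then fill the resulting missing value with 0 via values_fill
sensory %>%
  dplyr::select(Judge, Product, Shiny) %>%
  slice(-1) %>%   # hypothetical: remove one evaluation to create a gap
  pivot_wider(names_from = Judge, values_from = Shiny, values_fill = 0)
```

Without values_fill, the cell for the removed combination would contain NA instead of 0.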
4.2.3 Transformation that Alters the Data
In some cases, the final table to generate requires altering the data, by (say) computing the mean across multiple values, or counting the number of occurrences of factor levels. In other words, we summarize the information, which also tends to reduce the size of the table. It is hence no surprise that the function used for such data reduction is called summarise()
(or summarize()
, both notations work) and belongs to the {dplyr}
package.
4.2.3.1 Introduction to Summary Statistics
In practice, summarise()
applies a function (whether it is mean()
, or a simple count using n()
, for instance) on a set of values. Let’s compute the mean on all numerical variables of sensory
:
sensory %>%
  summarise(across(where(is.numeric), mean))
## # A tibble: 1 x 32
## Shiny `External color int~` `Color evenness` `Qty of inclus~` `Surface defec~`
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 23.9 33.7 28.2 20.6 23.3
## # ... with 27 more variables: `Print quality` <dbl>, Thickness <dbl>,
## # `Color contrast` <dbl>, `Overall odor intensity` <dbl>, `Fatty odor` <dbl>,
## # `Roasted odor` <dbl>, `Cereal flavor` <dbl>, `RawDough flavor` <dbl>,
## # `Fatty flavor` <dbl>, `Dairy flavor` <dbl>, `Roasted flavor` <dbl>,
## # `Overall flavor persistence` <dbl>, Salty <dbl>, Sweet <dbl>, Sour <dbl>,
## # Bitter <dbl>, Astringent <dbl>, Warming <dbl>, `Initial hardness` <dbl>,
## # Brittle <dbl>, Crunchy <dbl>, `Fatty in mouth` <dbl>, Light <dbl>, ...
As can be seen, the grand mean is computed for each attribute. It can also be noticed that all the variables not involved in the computation have been removed (e.g. Judge
and Product
, as they are not numerical variables).
If multiple functions should be applied, we can perform all the transformations simultaneously as follows:
sensory %>%
  summarise(across(where(is.numeric), list(min=min, max=max)))
## # A tibble: 1 x 64
## Shiny_min Shiny_max `External color intens~` `External colo~` `Color evennes~`
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 54 6.6 55.2 6.6
## # ... with 59 more variables: `Color evenness_max` <dbl>,
## # `Qty of inclusions_min` <dbl>, `Qty of inclusions_max` <dbl>,
## # `Surface defects_min` <dbl>, `Surface defects_max` <dbl>,
## # `Print quality_min` <dbl>, `Print quality_max` <dbl>, Thickness_min <dbl>,
## # Thickness_max <dbl>, `Color contrast_min` <dbl>,
## # `Color contrast_max` <dbl>, `Overall odor intensity_min` <dbl>,
## # `Overall odor intensity_max` <dbl>, `Fatty odor_min` <dbl>, ...
In this example, each attribute appears twice, with the suffixes _min
and _max
providing the minimum and maximum value for that attribute.
It would be a good exercise to restructure this table using pivot_longer()
with names_sep
, followed by pivot_wider()
, to build a new table that shows for each attribute (in rows) the minimum and the maximum in two different columns.
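A possible solution to this exercise could look as follows (a sketch: it assumes the attribute names themselves contain no underscore, so the generated names split cleanly on the _min/_max suffix):

```r
# Possible solution sketch: one row per attribute, min and max in columns
sensory %>%
  summarise(across(where(is.numeric), list(min = min, max = max))) %>%
  pivot_longer(everything(), names_to = c("Attribute", "Statistic"),
               names_sep = "_", values_to = "Value") %>%
  pivot_wider(names_from = Statistic, values_from = Value)
```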
By following the same principles, many other functions can be applied, whether they are built into R or created by the user.
Here is a selection of useful descriptive functions to consider with summarise()
:
- mean(), median() (or more generally quantile()) for the mean and median (or any other quantile);
- sd() and var() for the standard deviation and the variance;
- min(), max(), range() (provides both the min and max), or diff(range()) (for the difference between min and max);
- n() and sum() for the number of counts and the sum, respectively.
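As a sketch of how these functions combine, several statistics can be requested in a single summarise() call (Shiny is used here as an example attribute):

```r
# Sketch: several summary statistics of Shiny in one summarise() call
sensory %>%
  summarise(n      = n(),
            median = median(Shiny),
            sd     = sd(Shiny),
            spread = diff(range(Shiny)))
```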
4.2.3.2 Introduction to Grouping
Often the interest is not in the grand mean, but in the mean per product (say), or per product and panelist (for tests with duplicates). In such cases, summarize()
should aggregate sets of values per product, or per product and panelist, respectively. Such grouping information can be passed on through group_by()
17.
sensory %>%
  group_by(Product) %>%
  summarise(across(where(is.numeric), mean)) %>%
  ungroup()
## # A tibble: 11 x 33
## Product Shiny `External color intensity` `Color evenness` `Qty of inclusions`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 P01 41.9 26.2 15.8 15.5
## 2 P02 39.1 29.5 20.3 14
## 3 P03 30.5 43.6 30.7 17.6
## 4 P04 42.6 43.3 37.7 15.1
## 5 P05 13.9 17.3 36.1 26.7
## 6 P06 6.87 40.7 25.5 28
## 7 P07 32.5 24.1 22.9 17.9
## 8 P08 8.93 45.1 42.3 23.5
## 9 P09 3.53 39.9 27.7 22.3
## 10 P10 26.1 24.7 23.1 21.9
## 11 POpt 16.9 36.1 28.1 24.5
## # ... with 28 more variables: `Surface defects` <dbl>, `Print quality` <dbl>,
## # Thickness <dbl>, `Color contrast` <dbl>, `Overall odor intensity` <dbl>,
## # `Fatty odor` <dbl>, `Roasted odor` <dbl>, `Cereal flavor` <dbl>,
## # `RawDough flavor` <dbl>, `Fatty flavor` <dbl>, `Dairy flavor` <dbl>,
## # `Roasted flavor` <dbl>, `Overall flavor persistence` <dbl>, Salty <dbl>,
## # Sweet <dbl>, Sour <dbl>, Bitter <dbl>, Astringent <dbl>, Warming <dbl>,
## # `Initial hardness` <dbl>, Brittle <dbl>, Crunchy <dbl>, ...
This procedure creates a tibble with 11 rows (product) and 33 columns (32 sensory attributes + 1 column including the product information) which contains the mean per attribute for each sample, also known as the sensory profiles of the products.
In some cases, the data should not be aggregated across rows, but within rows. It is then important to specify that each computation should be done per row by using rowwise()
prior to performing the transformation. For instance, to extract the minimum between Shiny
, Salty
, and Bitter
for each row, we could write the following code:
sensory %>%
  dplyr::select(Judge, Product, Shiny, Salty, Bitter) %>%
  rowwise() %>%
  mutate(Min = min(Shiny, Salty, Bitter))
4.2.3.3 Illustrations of Data Manipulation
Let’s review the different transformations presented here by generating the sensory profiles of the samples through different approaches18.
In the previous example, we’ve seen how to obtain the sensory profiles using summarise()
across()
all numerical variables. If only a selection of the attributes is required, the same process applies by simply informing which attributes to transform:
sensory %>%
  group_by(Product) %>%
  summarise(across(Shiny:Melting, mean)) %>%
  ungroup()
The list of attributes to include can also be stored in an external vector:
sensory_attr <- colnames(sensory)[5:ncol(sensory)]
sensory %>%
  group_by(Product) %>%
  summarise(across(all_of(sensory_attr), mean)) %>%
  ungroup()
A different approach consists in combining summarise()
with pivot_longer()
and pivot_wider()
. This process requires summarizing only one column, by Product and Attribute:
sensory %>%
  pivot_longer(Shiny:Melting, names_to="Attribute", values_to="Scores") %>%
  mutate(Attribute = fct_inorder(Attribute)) %>%
  group_by(Product, Attribute) %>%
  summarise(Scores = mean(Scores)) %>%
  pivot_wider(names_from=Attribute, values_from=Scores) %>%
  ungroup()
Here, we transformed Attribute
into a factor using fct_inorder()
to ensure that the double pivoting procedure maintains the original order. Without this line of code, the final table would have the columns reordered alphabetically.
As you can see, R provides the following message: summarise()
has grouped output by ‘Product’. You can override using the .groups
argument. This message is purely informative, and can be hidden by adding the following line at the start of your script: options(dplyr.summarise.inform = FALSE)
.
What would happen if we omitted to summarise()
the data between the two pivoting functions? In that case, we also remove the column Judge
since the means should be computed across panelists…
sensory %>%
  pivot_longer(Shiny:Melting, names_to="Attribute", values_to="Scores") %>%
  dplyr::select(-Judge) %>%
  pivot_wider(names_from=Attribute, values_from=Scores)
As can be seen, each variable is of type list, in which each cell contains dbl [9]
: these are the scores provided by the 9 panelists for that product and attribute. Since we ultimately want the mean of these 9 values to generate the sensory profiles, a solution comes directly within pivot_wider()
through the parameter values_fn
, which applies the function provided to each set of values:
sensory %>%
  pivot_longer(Shiny:Melting, names_to="Attribute", values_to="Scores") %>%
  dplyr::select(-Judge) %>%
  pivot_wider(names_from=Attribute, values_from=Scores, values_fn=mean)
Through this simple example, we’ve seen that the same results can be obtained in different ways. Keep this in mind, as you may find the solution to your own challenges by simply considering a different path.
4.2.4 Combining Data from Different Sources
It often happens that the data to analyze are stored in different files and need to be combined or merged. Depending on the situation, different solutions are required.
4.2.4.1 Binding Vertically
Let’s start with a simple example where the tables match in terms of variables, and should be combined vertically. To illustrate this situation, the data stored in the file excel_scrap.xlsx is used. This file contains a fictitious example in which 12 assessors evaluated 2 samples on 3 attributes in triplicate, each replication being stored in a different sheet.
The goal here is to read the data stored in the different sheets, and to combine them vertically in one unique file for further analysis. Let’s start with importing the data:
path <- file.path("data", "excel_scrap.xlsx")
session1 <- read_xlsx(path, sheet=1)
session2 <- read_xlsx(path, sheet=2)
session3 <- read_xlsx(path, sheet=3)
To combine the tables vertically, we could use the base R function rbind()
. However, we prefer bind_rows()
from {dplyr}
, since it better controls the columns by ensuring that their order is respected. Moreover, if one of the tables contains a variable that the others don’t, that variable is kept, and missing values are introduced where the information is absent. Additionally, bind_rows()
allows keeping track of the origin of the data through the parameter .id
. This is of particular interest in this example, since a new Session
column can be created (and used) to distinguish between tables. This trick avoids losing useful information that is not directly available within the data: if it were, the parameter .id
could be ignored.
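Under these assumptions (the three sessions imported above, none of which carries its own Session column), the combination could be sketched as:

```r
# Sketch: stack the three sessions vertically and track their origin
# in a new Session column created via the .id parameter
sessions <- bind_rows(session1, session2, session3, .id = "Session")
```

Session then takes the values "1", "2", and "3" according to the table each row came from; naming the arguments (e.g. bind_rows(S1 = session1, ...)) would use those names instead.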
This solution works fine, especially since there were only 3 files to combine. Ultimately, we would prefer to automate the reading process so that all the files are directly imported and combined. This more efficient solution is presented in 8.4.3.
4.2.4.2 Binding Horizontally
In other cases, the tables to combine contain different information (variables) on the same entities (rows), and the tables should be merged horizontally. To do so, a first solution consists in using the functions cbind()
({base}
) and/or bind_cols()
({dplyr}
). However, these functions require that the tables to combine already have their rows in the exact same order (no check is being done) and be of the same size.
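As a minimal illustration of this constraint, consider two hypothetical tibbles whose rows are assumed to be already aligned:

```r
# bind_cols() simply juxtaposes columns without checking row identity:
# the rows must already be in the same order and of the same number
A <- tibble(Product = c("P01", "P02"), Liking = c(6, 7))
B <- tibble(Weight  = c(10.2, 9.9))
bind_cols(A, B)
```

If the rows of B were shuffled, bind_cols() would silently attach the wrong weights, which is why the *_join() functions below are generally safer.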
If that is not the case, merging tables should be done using merge()
({base}
), or preferably through the different *_join()
functions from {dplyr}
. For illustration, let’s consider these two tables to merge:
Depending on the merging degree to consider between tables X and Y, there are different *_join()
versions to consider:
- full_join() keeps all the cases from X and Y, regardless of whether they are present in the other table (when absent, missing values are introduced) [corresponds to merge(all=TRUE)];
- inner_join() only keeps the common cases, i.e. cases that are present in both X and Y [corresponds to merge(all=FALSE)];
- left_join() keeps all the cases from X [corresponds to merge(all.x=TRUE, all.y=FALSE)];
- right_join() keeps all the cases from Y [corresponds to merge(all.x=FALSE, all.y=TRUE)];
- anti_join() only keeps the elements from X that are not present in Y (this is particularly useful if you have a tibble Y of elements that you would like to remove from X).
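As an illustrative sketch, the behaviour of these joins can be compared on two hypothetical tibbles X and Y sharing the key Product:

```r
# Hypothetical tables: X and Y overlap on P02 and P03 only
X <- tibble(Product = c("P01", "P02", "P03"), Liking = c(6, 7, 5))
Y <- tibble(Product = c("P02", "P03", "P04"), Weight = c(10.2, 9.9, 10.5))

full_join(X, Y, by = "Product")   # P01 to P04; NA where a side is absent
inner_join(X, Y, by = "Product")  # P02 and P03 only
left_join(X, Y, by = "Product")   # P01 to P03
anti_join(X, Y, by = "Product")   # P01 only
```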
The merging procedure requires the user to provide a key, i.e. a (set of) variable(s) used to combine the tables. For each unique element defined by the key, a line is created. When needed, rows of a table are duplicated. Within the different *_join()
functions, the key is informed through the by
parameter, which may contain one or more variables with the same or different names.
For illustration, let’s use the data set called biscuits_consumer_test.xlsx, which contains three tabs:
file_path <- here("data", "biscuits_consumer_test.xlsx")
excel_sheets(file_path)
## [1] "Biscuits" "Time Consumption" "Weight"
The three sheets contain the following information, which need to be combined:
- Biscuits: The consumers’ evaluation of the 10 products and their assessment on liking, hunger, etc. at different moments of the test.
- Time Consumption: The number of cookies eaten and the time required to evaluate them in each sitting.
- Weight: The weight associated with each cookie.
Let’s start by combining Time Consumption and Weight so that we can compute the total weight of biscuits eaten by each consumer in each sitting. In this case, the joining procedure is done by Product
since the weight is only provided for each product. The total weight eaten (Amount
) is then computed by multiplying the number of cookies eaten (Nb biscuits
) by Weight
:
time <- read_xlsx(file_path, sheet="Time Consumption")
weight <- read_xlsx(file_path, sheet="Weight")

(consumption <- time %>%
  full_join(weight, by="Product") %>%
  mutate(Amount = `Nb biscuits` * Weight))
## # A tibble: 1,070 x 7
## Judge Product `Time (min)` `Nb biscuits` `Qty drink (%)` Weight Amount
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 J1 1 10min51 5 100 10.2 51.2
## 2 J1 2 11min29 6 100 10.3 61.9
## 3 J1 3 8min25 4 100 10.2 40.8
## 4 J1 4 6min57 8 100 10.5 83.7
## 5 J1 5 3min09 3 100 10.2 30.6
## 6 J1 6 8min35 4 100 8.12 32.5
## 7 J1 7 8min05 5 100 10.3 51.3
## 8 J1 8 4min49 1 100 9.88 9.88
## 9 J1 9 10min58 4 100 9.53 38.1
## 10 J1 10 16min33 6 100 12.5 75
## # ... with 1,060 more rows
As can be seen, the Weight
information stored in the Weight sheet has been replicated for every evaluation of each sample by each respondent.
The next step is to merge this table with Biscuits
. Since both data sets contain the full evaluation of the cookies (each consumer evaluating each product), the joining needs to be done by Judge
and Product
simultaneously. A quick look at the data shows two important things:
1. In Biscuits, the consumer names only contain the numbers, whereas in consumption they also contain a J in front of the number: this needs to be fixed, as the names must be identical to be merged; otherwise they will be considered as different entries and missing values will be introduced. In practice, this is done by mutating Consumer and pasting a J in front of the number using str_c() (paste0() would work equally well).
2. The columns containing the product (Samples and Product) and consumer (Consumer and Judge) information are named differently in the two data sets. We could rename these columns in one data set to match the other, but instead we will keep both names and inform full_join() of the correspondence. This is done through the by parameter as follows: "name in dataset 1" = "name in dataset 2".
(biscuits <- read_xlsx(file_path, sheet="Biscuits") %>%
  mutate(Consumer = str_c("J", Consumer)) %>%
  full_join(consumption, by=c("Consumer"="Judge", "Samples"="Product")))
The three data sets are now flawlessly joined into one that can be further manipulated and/or analysed.
Bibliography
Note that some of the examples presented in this chapter emphasize the how to?, not the why?, and are not necessarily chosen to convey a scientific meaning.↩︎
For more details about the data set and/or data importation, please see Section 8.4.↩︎
By default, between(x, value1, value2) considers value1 <= x <= value2.↩︎
For numerical order, this is simply re-arranging the values from the lowest to the highest. For strings, the entries are sorted alphabetically unless the variable is a factor, in which case the order of the levels of that factor is used.↩︎
We strongly recommend that you ungroup() blocks of code that include group_by() once the computations are done, to avoid any unexpected results: otherwise, further computations may be applied per group when they should be performed on the full data.↩︎
It is important to realize that any data manipulation challenge can be tackled in many different ways, so don’t be afraid to think outside the box when solving them.↩︎