Chapter 12 Machine Learning
Artificial Intelligence (AI), and Machine Learning (ML) in particular, have gained a lot of attention in recent years. With the increase in data availability, data storage, and computing power, many techniques that were mere dreams a few decades ago are now easily accessible and widely used. And of course, the sensory and consumer science field is not an exception to this rule, as we start seeing more and more ML applications…although in our case, we do not have Big Data; instead, we have diverse data! For many of us, AI and ML seem to be a broad and complex topic. This assertion is true, and in fact it would deserve a whole book dedicated just to it. However, our intention in this chapter is to introduce and demystify the concept of ML, by:
- explaining the differences between supervised and unsupervised ML models,
- showing that we have already been using it for a long time,
- extending it to more advanced techniques,
- highlighting its main applications in the field.
To do so, some basic code and steps are provided to help the reader get familiar with such approaches. Throughout this chapter, more specialized resources are provided for those who have the courage and motivation to dig deeper into this topic.
12.1 Introduction
Machine Learning is currently a hot topic in the sensory and consumer science field. It is one of the most game-changing technological advancements to support CPG companies in the development of new products, playing a considerable role in speeding up the steps involved in the R&D process (and at the same time reducing their costs). In today’s fast-moving and increasingly competitive corporate world, companies that are embracing, adopting, and opening their minds to digital transformation and artificial intelligence (AI), moving towards the age of automation, are not one but many steps ahead of their competitors.
Machine Learning (ML) is a branch of AI based on the idea that systems can learn from data and have the capability to evolve. Generally speaking, ML refers to various programming techniques that are able to process large amounts of data and extract useful information from it. It refers to a method of data analysis that builds intelligent algorithms that can automatically improve through the experience gained from the data and identify patterns or make decisions with minimal human intervention, without being explicitly programmed. ML focuses on using data and algorithms to mimic the way humans learn, gradually improving their accuracy.
Defining the objectives, or the situation where ML would bring value, is the very first step of the process. Once that is clear, the next step is to collect data or dig into historical data sets to understand what information is available and/or has to be obtained. The data varies according to the situation, but it may refer to product composition or formulation, instrumental measurements (e.g., pH, color, rheology, GC-MS, etc.), sensory attributes (e.g., creaminess, sweetness, bitterness, texture, consistency, etc.), consumer behavior (e.g., frequency of consumption/use, consumption/use situation, etc.), demographics (e.g., age, gender, skin type, etc.), and consumer responses (e.g., liking, CATA questions, JAR questions, etc.).
The data set size and quality are very important as they directly impact the model’s robustness. Specific recommendations according to the situation, data type, and objectives are as follows. In general, the higher the number of objects (usually the number of samples) the better, with 12-15 being the minimum recommended. The number of measurements (instrumental, sensory, and/or consumer measurements) and the number of consumers evaluating the products are also very relevant to the model’s quality. In sensory and consumer science studies, the number of measurements is usually sufficient, so there is nothing to worry about here. In practice, it is recommended to have a minimum of 100 consumers, although the more the better. Regarding data quality, one of the most important aspects (besides the standardization of data collection) is the variability of the samples: the larger the variability between samples, the broader the space the model will cover. Additionally, it is strongly recommended to capture the consumers’ individual differences, not only through demographic information, but also through Just About Right (JAR) or Ideal Profile Method (IPM) questions. Ultimately, within-subject designs (sequential monadic designs) provide better quality models.
Depending on the aims of the analysis, there are some variations on how to group or classify ML algorithms. They can be divided into three prominent methods: supervised learning, unsupervised learning, and reinforcement learning. In this section, we mainly focus on the first two approaches.
12.2 Machine Learning Methods
Supervised learning is the most popular type of machine learning and uses labeled data: it is a process that involves providing input data as well as the correct output data to the machine learning model. The ultimate goal of the algorithm is to find a mapping function that maps the input variables to the output variables. In supervised learning, models are initially trained using a subset of the data (training data set) and afterwards the model is tested and validated using the remaining parts of the data (test data set and validation data set). Once this process is done, the model can continuously improve, discovering new patterns and relationships as it trains itself on new data sets.
A very common situation where supervised learning is widely used is to predict consumer responses (e.g., liking, perception, benefit) based on analytical measurements and/or sensory data. In this situation, ML models can provide insights and directions on how to improve product performance, and also have the ability to predict consumer response based on analytical and/or sensory data, working as a powerful screening tool where only the prototypes or products with the highest potential are moved to the next level, which may be the consumer testing. Another common situation is the use of supervised machine learning to predict the product sensory profile or consumer response based on the formulation or ingredients of a product. In this situation, ML is of great support for developers, who can more easily understand and get clear directions on what has to be changed in the formulation to improve the product sensory profile or consumer performance.
Unlike supervised learning, unsupervised learning uses unlabeled or untagged data, meaning inputs for which the output values are not known. In this case, users do not need to supervise the model; instead, the algorithm operates independently to find patterns and trends, trying to learn from the data distribution the distinguishing features and associations through similarity and dissimilarity measurements. Its ability to discover hidden patterns and extract similarity/difference information is what makes it the ideal solution for exploratory analysis and consumer segmentation, extracting insights from the data sets.
Unsupervised machine learning is commonly used to segment consumers into groups based on their similarities related to a variety of factors such as shopping or usage behavior, attitudes, interests, liking, or preferences. As consumers in the same market segment tend to respond similarly, the segmentation process is a key strategy for companies to understand and effectively tailor their products or marketing approaches for different target groups. Similarly, unsupervised models are also widely used to classify products into groups that are homogeneous in their analytical, sensory, and/or consumer attributes, in order to identify different segments in the market.
Let’s get started with unsupervised learning first.
12.2.1 Unsupervised learning
In the sensory and consumer science field, unsupervised learning models are usually used for Clustering and Dimensionality Reduction.
Clustering is a technique used to discover, in high-dimensional data, groups of observations that are similar to each other and significantly different from the rest. In other words, it is a method that groups unlabeled data based on similarities and differences, in such a way that objects with the most similarities remain in a group and have few or no similarities with the objects of another group. As previously mentioned, a very common application is to cluster consumers and classify groups of products based on their similarities/dissimilarities. Clustering algorithms can be categorized into a few types: exclusive (e.g., k-means), hierarchical, and probabilistic clustering (e.g., Gaussian Mixture Models). The first two are the most used and best known in the sensory field.
Dimensionality reduction is a technique used to transform a high-dimensional space into a low-dimensional space that retains some meaningful properties of the original data. It refers to the process of reducing the number of features/attributes in a data set while maintaining as much of the variation as possible. This method is important not only as a technique for data visualization, but also as a pre-processing step to reduce training time and computational resources, improve ML algorithms’ performance/accuracy due to less misleading data, avoid problems of overfitting, find latent variables, and remove redundant features and noise, among others. There are numerous dimensionality reduction methods that can be used according to the data type and different requirements. The most common and well-known dimensionality reduction methods used in the sensory and consumer science field are the ones that apply linear transformations, including Principal Component Analysis (PCA) and Factor Analysis (FA).
12.2.1.1 Clustering and Dimensionality Reduction
Let’s start by installing/loading the necessary packages and preparing the dataset for the cluster analysis, with PCA being used as a pre-processing method. We will be using a wine dataset, available from the rattle package (here read directly from the UCI Machine Learning Repository), that consists of the results of a chemical analysis of wines grown in a specific area of Italy. Three types of wine are represented in the 178 samples, with the results of 13 chemical analyses recorded for each sample.
library(tidyverse)
library(rattle)
# load data
wine.fl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
wine <- read.csv(wine.fl, header = FALSE)

# Names of the variables
wine.names <- c("Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium",
                "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins",
                "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline")
colnames(wine)[2:14] <- wine.names
colnames(wine)[1] <- "Wine_type"

wine_dataset <- tibble(wine)
A dimensionality reduction technique (PCA) will be applied before clustering to allow coherent patterns to be detected more clearly. As is well known, PCA retains most of the information/variance of the data set, and by removing features with low variance, a more robust clustering can be generated. In other words, this technique reveals distinct groups if the data exhibit clustering, and by retaining the components with the highest variance, the clusters tend to be more visible.
As the data set being used contains variables with different units/scales, PCA with scaling will be applied. We will be using the {FactoMineR} package to perform the PCA and {factoextra} to easily extract and visualize the results.
library(FactoMineR)
library(factoextra)

# PCA on the standardized variables
pca_results <- PCA(wine_dataset, scale.unit = TRUE)

# Scree plot of the percentage of explained variance
pca_eig <- fviz_eig(pca_results, addlabels = TRUE, ylim = c(0, 50))

# Biplot of individuals and variables
pca_plot <- fviz_pca_biplot(pca_results, repel = TRUE,
                            label = "var",
                            col.var = "red",
                            col.ind = "black")
pca_plot

# Keep the coordinates of the first two principal components
reduced_dataset <- data.frame(pca_results$ind$coord[, 1:2]) %>%
  tibble()
The PCA shows that there are apparently 3 distinct groups of wines.
Let’s now use its results for the clustering analysis. Although agglomerative hierarchical clustering is more common in sensory and consumer research, it was already shown in Chapter 10, so for this example we decided to use a different solution: k-means clustering. K-means clustering is also a very well-known unsupervised machine learning algorithm, which partitions a given data set in a way that minimizes the total intra-cluster variation. Both approaches (hierarchical clustering and k-means) are widely used; the main differences between them are the rule used to form the clusters and the requirement, for k-means, to pre-specify the number of clusters to be produced. Detailed information about clustering methods and analysis can be found in the book Practical Guide to Cluster Analysis in R: Unsupervised Machine Learning by Alboukadel Kassambara (Kassambara (2017)).
Here we use k-means clustering as an example. The first step of k-means clustering is to estimate the optimal number of clusters (k), and this can be conveniently done using the function fviz_nbclust() from the {factoextra} package. The plot represents the variance within the clusters, and the bend, also called the “elbow,” indicates that additional clusters beyond the third have little value. In our situation, we will therefore classify the observations into 3 clusters.
fviz_nbclust(reduced_dataset, kmeans, method = "wss")
As the k-means algorithm starts with k randomly selected centroids, it is recommended to initially set a seed, through the use of the function set.seed(), to make the results reproducible. We will then perform k-means clustering with k = 3. In the kmeans() function, we can also choose the number of random sets, which means the number of times R will try different random starting assignments. The default is 1, but to get more stable results we will specify nstart = 20. The function fviz_cluster(), from the {factoextra} package, will be used for the visualization.
set.seed(123)
km_dw <- kmeans(reduced_dataset, 3, nstart = 20)
fviz_cluster(
list(data = reduced_dataset, cluster = km_dw$cluster),
ellipse.type = "norm",
geom = "point",
stand = FALSE
)
In the resulting plot, we can easily see the observations represented by points, with the color and ellipse identifying each of the 3 clusters. We suggest you re-run the code to determine the optimal number of clusters and perform the cluster analysis without dimensionality reduction: you will see how much more clearly the patterns were detected when this technique was applied.
12.2.2 Supervised learning
There are many ways to carry out supervised machine learning, and it is a big topic in itself. We recommend reading “Hands-On Machine Learning with R” by Bradley Boehmke and Brandon Greenwell (https://bradleyboehmke.github.io/HOML/) for more in-depth information. In sensory and consumer science, supervised learning is commonly carried out using some type of regression, where we use, for instance, consumer ratings as the output (target) and product information (i.e., sensory profiles and analytical measurements) as the input.
12.2.2.1 Regression
Regression methods approximate the target variable with a (usually linear) combination of predictor variables (see footnote 47 at the end of this chapter). There are multiple regression algorithms, varying by the type of data they can handle, the type of target variable, and additional aspects such as the ability to perform dimensionality reduction. We will take a walk through the ones most relevant for sensory and consumer science.
Linear regression: The simplest and most popular variant is linear regression, in which a continuous target variable is approximated as a linear combination of predictors in a way that minimizes the sum of squared errors (SSE). It can, for example, be used to predict consumer liking of a product based on its sensory profile, but the user has to keep in mind that linear regression can in some cases return values outside the reasonable range of target values. This can be addressed by capping predictions to the desired range. Functions in R to apply linear regression are lm() and glm(), or parsnip::linear_reg() %>% parsnip::set_engine("lm") when using a tidymodels workflow.
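As a minimal sketch (not part of the original workflow), the example below fits a linear regression on simulated data; the data frame sensory_data and the attributes sweetness, creaminess, and liking are hypothetical names used only for illustration, and the {tidyverse} is assumed to be loaded as earlier in the chapter.

# Hypothetical example: predict liking from two simulated sensory attributes
set.seed(123)
sensory_data <- tibble(
  sweetness  = runif(30, 0, 10),
  creaminess = runif(30, 0, 10)
) %>%
  mutate(liking = 3 + 0.4 * sweetness + 0.2 * creaminess + rnorm(30, sd = 0.5))

liking_lm <- lm(liking ~ sweetness + creaminess, data = sensory_data)
summary(liking_lm)

# If needed, cap predictions to the rating scale (here assumed to be 1-9)
predict(liking_lm, sensory_data) %>% pmin(9) %>% pmax(1)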
Logistic regression: Logistic regression is an algorithm which, by use of the logistic transformation, allows the same approach as linear regression to be applied to cases with binary target variables. It can be used in R with glm(family = "binomial") or parsnip::logistic_reg() %>% parsnip::set_engine("glm") when using a tidymodels workflow.
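A minimal hypothetical sketch in the same spirit as above: liker is a made-up binary target (1 when a simulated latent score is positive), used only to show the glm() call with a binomial family.

# Hypothetical example: model the probability of a consumer being a "liker"
set.seed(123)
logit_data <- tibble(
  sweetness  = runif(100, 0, 10),
  bitterness = runif(100, 0, 10)
) %>%
  mutate(liker = as.integer(2 + 0.5 * sweetness - 0.6 * bitterness + rnorm(100) > 0))

liker_glm <- glm(liker ~ sweetness + bitterness, data = logit_data, family = "binomial")
summary(liker_glm)

# Predicted probabilities of being a "liker" for the first few consumers
head(predict(liker_glm, type = "response"))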
Penalized regression: It often happens that the data we want to use for modeling have a lot of predictor variables, possibly with many high correlations. In such cases, linear/logistic regression may become unstable and produce unreasonable predictions. This can be addressed by the use of so-called penalized regression. It is a special case where, instead of minimizing the pure error term, the algorithm minimizes both the error and the regression coefficients at the same time. This leads to more stable predictions.
There are three variations of penalized regression, and all of them can be accessed via the function glmnet::glmnet() (where β denotes the regression coefficients and λ is a penalty parameter to be set by the user or determined from cross-validation):
- Ridge regression (L2 penalty) minimizes SSE + λ Σ β² (the sum of squared coefficients) and drives the coefficients towards smaller values.
- Lasso regression (L1 penalty) minimizes SSE + λ Σ |β| (the sum of absolute coefficients) and forces some of the coefficients to vanish, which can be used for variable selection.
- Elastic-net regression is a combination of the two previous variants, with a mixing parameter α controlling the relative weight of the L1 and L2 penalties.

Penalized regression can also be run in a tidymodels workflow with parsnip::linear_reg() %>% parsnip::set_engine("glmnet").
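As an illustrative sketch (assuming the {glmnet} package is installed), the code below fits an elastic-net model to the wine data loaded earlier in this chapter, predicting Alcohol from the other chemical measurements; cv.glmnet() chooses λ by cross-validation.

library(glmnet)

# Predictor matrix and target vector from the wine data used earlier
x <- wine_dataset %>% select(-Wine_type, -Alcohol) %>% as.matrix()
y <- wine_dataset$Alcohol

# alpha = 0 is ridge, alpha = 1 is lasso, values in between give elastic-net
cv_fit <- cv.glmnet(x, y, alpha = 0.5)
coef(cv_fit, s = "lambda.min")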
MARS: One limitation of all the methods mentioned so far is that they assume a linear relationship between the predictor and target variables. Multivariate adaptive regression splines (MARS) address this by modeling non-linearities with piecewise linear functions. This gives a nice balance between simplicity and the ability to fit complex data, for example inverted-U-shaped relationships where there is a maximal point from which the function decreases in both directions. In R this model can be accessed via the earth::earth() function.
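A minimal sketch, assuming the {earth} package is installed: here MARS models one chemical measurement of the wine data as a possibly non-linear function of a few others (the choice of variables is arbitrary and only for illustration).

library(earth)

# MARS fit with piecewise linear basis functions
mars_fit <- earth(`Color intensity` ~ Hue + Alcohol + Proline, data = wine_dataset)
summary(mars_fit)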
PLS: In the case of multiple target variables, one can apply partial least squares (PLS) regression which, similarly to PCA, looks for components that maximize the explained variance of the predictors, but at the same time also maximize their correlation with the target variables. PLS can be applied with the plsr() function from the {pls} package, or in a tidymodels workflow with plsmod::pls() %>% parsnip::set_engine("mixOmics").
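A minimal sketch assuming the {pls} package is installed: two chemical measurements of the wine data are predicted jointly from a handful of others, with the variables chosen arbitrarily for illustration.

library(pls)

# PLS regression with two target variables and two components
pls_fit <- plsr(cbind(Alcohol, Proline) ~ Hue + Flavanoids + Magnesium + Ash,
                data = wine_dataset, ncomp = 2, scale = TRUE)
summary(pls_fit)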
There are many other ways to perform machine learning as well. Below is a brief introduction to some of them.
12.2.2.2 K-nearest neighbors
A very simple, yet useful and robust algorithm that works for both numeric and nominal target variables is K-nearest neighbors. The idea is that for every new observation we want to predict, the algorithm finds the K closest points in the training set and uses either their mean value (for numeric targets) or their most frequent value (for nominal targets) as the prediction. This algorithm can be used with the kknn::kknn() function or in a tidymodels workflow with parsnip::nearest_neighbor() %>% parsnip::set_engine("kknn").
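A minimal sketch assuming the {kknn} package is installed: the wine type is predicted from a few chemical measurements using a simple random split of the wine data; the split and the chosen predictors are arbitrary and only for illustration.

library(kknn)

knn_data <- wine_dataset %>% mutate(Wine_type = as.factor(Wine_type))

set.seed(123)
in_train <- sample(seq_len(nrow(knn_data)), size = 120)

# 5-nearest-neighbors classification of the held-out wines
knn_fit <- kknn(Wine_type ~ Alcohol + Flavanoids + Hue + Proline,
                train = knn_data[in_train, ],
                test  = knn_data[-in_train, ],
                k = 5)

# Confusion table: predicted vs. actual wine type on the held-out observations
table(predicted = fitted(knn_fit), actual = knn_data$Wine_type[-in_train])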
12.2.2.3 Decision trees
A decision tree models the data by splitting the training set into smaller subsets, in such a way that each split is done on a predictor variable so that it maximizes the difference in the target variable between the subsets. One important advantage of decision trees is that they can model complex relationships and interactions between predictors. To use a decision tree in R, one can use rpart::rpart() or, in a tidymodels workflow, parsnip::decision_tree() %>% parsnip::set_engine("rpart").
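A minimal sketch assuming the {rpart} package is installed: a classification tree for the wine type, again with an arbitrary subset of predictors for illustration.

library(rpart)

tree_data <- wine_dataset %>% mutate(Wine_type = as.factor(Wine_type))

# Classification tree for the wine type
tree_fit <- rpart(Wine_type ~ Alcohol + Flavanoids + `Color intensity` + Proline,
                  data = tree_data, method = "class")
print(tree_fit)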
12.2.2.4 Black boxes
So-called black boxes are a class of models whose structure is too complex to directly interpret the relationship between the predictor variables and the value predicted by the model. Their advantage is usually the ability to model more complicated data than interpretable models can, but they carry a greater risk of overfitting (fitting to the noise in the training data). Also, the lack of a clear interpretation may not be acceptable in some business-specific use cases. The latter problem can be addressed by the use of explanation algorithms that will be discussed in a later part of this chapter.
12.2.2.5 Random forests
A random forest is a set of decision trees, each one trained on a random subset of the observations and/or predictors. The final prediction is an average of the individual trees’ predictions, in such a way that increasing the number of trees increases the precision of the outcome. A random forest minimizes the limitations of the decision tree algorithm, reducing the overfitting of data sets and increasing precision.
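A minimal sketch assuming the {ranger} package is installed (the same engine used in the tidymodels example below); the out-of-bag error gives a quick idea of the classification performance.

library(ranger)

rf_data <- wine_dataset %>% mutate(Wine_type = as.factor(Wine_type))

# Random forest of 500 trees predicting the wine type from all other variables
rf_fit <- ranger(dependent.variable.name = "Wine_type", data = rf_data, num.trees = 500)
rf_fit$prediction.error   # out-of-bag misclassification rate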
12.2.3 Practical Guide to Machine Learning
Now that we have a general idea of the machine learning approach, we will try to build a simple machine learning model in the sensory and consumer context.
R contains fantastic systems for building machine learning models. The tidymodels framework (https://www.tidymodels.org/) is a collection of packages for modeling and machine learning using tidyverse principles, and is among the most popular frameworks in R for machine learning. In this context, we strongly recommend the book “Tidy Modeling with R” by Max Kuhn and Julia Silge (https://www.tmwr.org/).
Tidymodels contains several core packages, including rsample, parsnip, recipes, workflows, tune, yardstick, broom, and dials, along with some other specialized packages. These packages will help you build a variety of models in R and check their performance.
Just like the tidyverse, you don’t need to install/load all the packages separately: installing/loading {tidymodels} will load all the packages you will need for building machine learning models.
library(tidymodels)
We will continue to use the wine data set to build our supervised model using the tidymodels packages. Let’s prepare the data before we go into the details of the modeling.
12.2.3.1 Preparing data for ML
First, we need to convert the column Wine_type to a factor, as this column will be the target later.
wine_classification_dataset <- wine_dataset %>%
  mutate(Wine_type = as.factor(Wine_type))
12.2.3.2 Sampling the data
The function rsample::initial_split() takes the original data and saves the information on how to make the partitions. We set the strata argument to "Wine_type" to conduct a stratified split. This ensures that our training and test data sets keep roughly the same proportions of all wine types as in the original data, despite the imbalance we noticed in our class variable. prop is set to 0.7 as we are splitting the data into 70% for training and 30% for testing.
After the initial_split, we can use the training() and testing() functions to obtain the two data frames for training and testing.
initial_split <- initial_split(data = wine_classification_dataset, strata = "Wine_type", prop = 0.7)

wine_train <- training(initial_split)
wine_testing <- testing(initial_split)
12.2.3.3 Cross Validation
Cross-validation is an important step for checking the model quality later. To do that, we need to use a resampling method to create a series of data sets similar to the training/testing split before building the model. For each data set, one subset is used for creating the model, and the other subset is used to measure the performance. Also note that resampling is always applied to the training set of the data defined earlier.
We use vfold_cv() to resample the data. Here we begin with a 5-fold cross-validation. strata is set to Wine_type to conduct stratified sampling, so that each resample is created within the stratification variable.
wine_cv <- wine_train %>% vfold_cv(v = 5, strata = Wine_type)
Now we have a resampled wine_cv for further building the model.
12.2.3.4 Choose ML method using “recipe”
The {recipes} package is part of tidymodels. It contains a rich set of data manipulation tools which can be used to preprocess data and to define roles for each variable (e.g., outcome and predictor). To add a recipe, we simply use the function recipe(), which has two arguments: a formula and the data. Any variable on the left-hand side of the tilde (~) is considered the model outcome. In our example, we want to use a machine learning model to predict the type of the wine, therefore Wine_type is the target on the left-hand side of the ~. On the right-hand side of the tilde are the predictors. One can write out all the variables, but an easier option is to use the dot (.) to indicate all other variables as predictors.
# Recipe: Wine_type is the outcome, all other variables are predictors
model_recipe <- wine_train %>%
  recipe(Wine_type ~ .)
We will use a random forest classifier for the wine data. rand_forest() has 3 hyper-parameters (mtry, trees, min_n) which can be tuned to achieve the best results. We set all of them as tune() for now. Model tuning is the process of finding the best values for these parameters. Most often you start with the default values and change them along the way. Since the model is not executed when created, these parameters can be marked with the tune() function, which provides a simple placeholder for the value.
# Random forest model definition, with hyper-parameters set as tuning placeholders
rf_spec <- rand_forest(
  mtry = tune(),
  trees = tune(),
  min_n = tune()) %>%
  set_mode("classification") %>%
  set_engine(engine = "ranger")
12.2.3.5 Set the whole Process into a workflow
Let’s combine the model and the recipe into a single workflow() object to better manage the two R objects.
rf_wf <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(rf_spec)
12.2.3.6 Tuning the parameters
We created placeholders for tuning hyper-parameters earlier. Now it’s time to define the scope of the search and choose the method for searching the parameter space, in this case grid_regular().
params_grid <- rf_spec %>%
  parameters() %>%
  update(mtry = mtry(range = c(1, 2)),
         trees = trees(range = c(10, 200))) %>%
  grid_regular(levels = 5)
Now that we have defined the hyper-parameter search range, let’s look for the best combination using the tune_grid() function. We will leave the test set aside and use the cross-validation set for this purpose, so that during training the model will not see the data on which we will be making predictions.
We can use the autoplot() function to take a quick look at the tuning object.
Next, we will obtain a reliable estimate of the quality of the model. We first select the best combination of parameters using the select_best() function. The quality of the model can be estimated with a number of metrics. Here we decided to use roc_auc (Area Under the Receiver Operating Characteristic Curve).
# Best parameters searching
tuning <- tune_grid(
  rf_wf,
  resamples = wine_cv,
  grid = params_grid
)

autoplot(tuning)

params_best <- select_best(tuning, "roc_auc")
12.2.3.7 Train the final model
Now that we have found the best parameters for the model, we can train the final model on the entire training set.
# Finalize model
final_model <- rf_wf %>%
  finalize_workflow(params_best) %>%
  fit(wine_train)
12.2.3.8 Assess the model quality
A very important part of building machine learning models is assessing their quality. To do so, we first use the model to predict on the testing data set (wine_testing), which the model has not seen yet. predict() is a generic function for putting the prediction results from models into convenient tables for further analysis. We combine the wine_testing data with the column of predicted Wine_type from the model by adding predict(final_model, wine_testing) in the pipe.
validation_data_pred <- wine_testing %>%
  bind_cols(predict(final_model, .))
The validation_data_pred object is a data frame containing the testing data, with Wine_type as the actual wine type and a .pred_class column with the predicted Wine_type at the end. Now we can finally judge the quality of the model by comparing them.
A very convenient way to visualize the classification results is to use a confusion matrix. A confusion matrix is a table where each row represents the instances in the actual class, while each column represents the instances in a predicted class. Using the autoplot() function, we can see that our wine type prediction is almost a 100% match!
cm <- conf_mat(validation_data_pred, Wine_type, .pred_class)
autoplot(cm, type = "heatmap")
Footnote
47. This is a bit of a simplification. In some cases it is a transformation of a combination of predictors that approximates the target variable; an example of this is logistic regression.