[!! remember to **download the data** for this practical sessions: <http://www.jackdellequerce.com/data/GenRiz44.txt> !!]{style="color:red"}

[!! you should download the data into the data/ folder inside the introduction_to_gwas repository !!]{style="color:red"}

```{r}
# THERE ARE MULTIPLE OPTIONS TO DOWNLOAD THE DATA

## 1) FROM THE COMMAND LINE (LINUX TERMINAL, WINDOWS DOS) - UNCOMMENT IF NEEDED
## cd ../data
## wget http://www.jackdellequerce.com/data/GenRiz44.txt

## 2) USING SYSTEM CALLS WITHIN R - UNCOMMENT IF NEEDED
# setwd("../data")
# system2("wget", "http://www.jackdellequerce.com/data/GenRiz44.txt")
# setwd("../3.imputation")

## 3) FROM THE COURSE WEBSITE https://filippob.github.io/introduction_to_gwas/model_extensions_data

## 4) FROM THE SLACK CHANNEL OF THE COURSE
```

# KNN: K-Nearest Neighbors (for classification)

K-nearest neighbors (KNN) is a classification method that attempts to estimate the conditional distribution of the observations (e.g. vector $Y$) given a set of features (matrix $\mathbf{X}$):

$$
Pr(Y=j|X=x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i=j)
$$

Let's look at this:

-   $K$ is the chosen **number of neighbors** to consider
-   $x_0$ is the **observation** we want to classify (we have features but not know the class it belongs to)
-   $\mathcal{N}_0$ is the neighborhood around $x_0$
-   $I(y_i=j)$ is the number of neighbors in $\mathcal{N}_0$ belonging to class $j$

The observation $x_0$ is then classified to the class with the largest probability ("Bayes rule").

## Distance function

A crucial aspect in KNN classification is of course how you *define the distance* between observations. A common choice is the Euclidean distance, which basically takes the square root of the sum of **squared differences in feature values** between any two points:

1.  Euclidean distance:

$$
D_E(x_i,x_j) = \sqrt{ \sum_{m=1}^M (x_{i,m}-x_{j,m})^2 }
$$

The Euclidean distance, though, is designed for **continuous features**, and is therefore not very suited for categorical features. Luckily, other distance functions exist for **categorical features**, like the Hamming distance function:

2.  Hamming distance (categorical features):

$$
D_H(x_i,x_j) = \frac{1}{M} \sum_{m=1}^M I(x_{i,m} \neq x_{j,m})
$$

In this brief practical/demonstration, we'll go through an illustration of how to apply KNN to the problem of imputing missing SNP genotypes in rice (*Oryza sativa*).

1.  Raw data
2.  Introduce (artificial) missing genotypes
3.  Imputation with R package
4.  Step by step KNN imputation: write our own code!

```{r,echo=FALSE,results='hide', message=FALSE}

#Demonstrating the use of KNN to impute missing SNP genotypes
library("knitr")
library("tidymodels") #for kNN imputation
library("data.table")
```

### The dataset

We'll be using SNP genotype data from 12 samples genotyped with the "**GeneChip Rice 44K SNP**".

```{r, echo=FALSE}
## read the data
rice44 <- fread("../data/GenRiz44.txt",header=TRUE)
```

```{r, echo=FALSE}
## recode from AA/BB/-- to -1/1/NA
oldcode <- c("AA","BB","--")
neucode <- c(-1,1,NA)

M <- apply(rice44[,5:ncol(rice44)],2,function(x) neucode[match(x,oldcode)])
print(paste("number of SNPs before filtering:", nrow(M)))
```

A summary of the real missing rate in the data is given below:

```{r, echo=FALSE}
missingRate <- apply(M,1,function(x) length(x[is.na(x)])/length(x))
summary(missingRate)
```

Now we select only SNPs without missing data:

```{r, echo=FALSE}
M <- M[missingRate==0,]
colnames(M) <- paste("sample",seq(1:ncol(M)),sep="_")
print(paste("n. of SNPs after filtering:",nrow(M)))
```

## MEASURING THE ACCURACY OF IMPUTATION

The objective of this demonstration is to show to you how the accuracy of imputing missing SNP genotypes can be measured:

-   we will be using the R package `tidymodels` with built-in functions for KNN imputation
-   we will remove real missing SNP genotypes from the data, so to have a starting point with 0 missing: we know the complete picture!
-   we will then "mask" some SNP genotypes by artificially turning them to missing: in this way, we will be able to compare the imputed genotypes/alleles with the known true genotypes/alleles and calculate the error rate

All real missing data are edited out of the dataset. This leaves `r nrow(M)` SNP. For the sake of the exercise, initially a much smaller subset with 100 SNPs is extracted.

```{r}
df <- M %>% as_tibble() %>% rowwise()
M <- df %>% mutate(m = sd(c_across(everything()))) %>% filter(m > 0) ## remove monomorphic SNP
rm(df)
print(paste("n. of SNPs after filtering:",nrow(M)))
```

```{r}
## setting seed
seed = 777
set.seed(seed)

matrix_M <- M %>% select(-m) %>% as.matrix()
vec <- sample(x = nrow(matrix_M), size = 100)
Mreduced <- t(matrix_M[vec,])
print(Mreduced[1:10,1:12])
```

Now, we randomly introduce 10% artificially missing SNP genotypes. In this way, we will be able to measure the accuracy of our imputation method.

```{r}
nSNP <- ncol(Mreduced)
nInd <- nrow(Mreduced)
schwellenWert <- 0.1 # 10% missing data

missing_matrix = matrix(runif(nInd*nSNP), ncol=nSNP) 
missing_matrix[missing_matrix < schwellenWert] = NA
missing_matrix[!is.na(missing_matrix)] = 0
injected_data = Mreduced + missing_matrix

actualMissingRate <- length(injected_data[is.na(injected_data)])/(nInd*nSNP) 

missing_map = missing_matrix
print(actualMissingRate)
```

The actual missing rate is `r actualMissingRate`.

### Imputation with the R package `tidymodels`

We'll now impute missing genotypes using the **tidymodels** *R package*. In this package there is the *step_impute_knn()* function, which uses by default the *Gower distance*, and allows the user to specify the number of neighbors $K$. The measure of \`\`probability'' of belonging to any given class is the **average** (continuous variables) or the **mode** (**majority vote**, categorical variables).

```{r, warning=TRUE, message=FALSE}
missing_data <- as.data.frame(injected_data)

rec <- recipe(missing_data) %>% 
  update_role(all_numeric(), new_role = "predictor")

ratio_recipe <- rec %>%
  step_impute_knn(all_predictors(), neighbors = 3)

ratio_recipe2 <- prep(ratio_recipe, training = missing_data, verbose = FALSE)
imputed = juice(ratio_recipe2)

results <- round(imputed[is.na(missing_map)])==Mreduced[is.na(missing_map)]
accuracy <- length(results[results==TRUE])/length(results)
print(accuracy)
```

[The accuracy of imputation with the R function *step_impute_knn()* from the *tidymodels* package is: <strong> `r accuracy`</strong>]{style="background-color:yellow"}

Comparison with **mean imputation** on the same data below:

```{r label='mean_imputation', warning=TRUE, message=FALSE}
missing_data <- as.data.frame(injected_data)

rec <- recipe(missing_data) %>% 
  update_role(all_numeric(), new_role = "predictor")

ratio_recipe <- rec %>%
  step_impute_mean(all_predictors())

ratio_recipe2 <- prep(ratio_recipe, training = missing_data, verbose = FALSE)
imputed = juice(ratio_recipe2)

results <- round(imputed[is.na(missing_map)])==Mreduced[is.na(missing_map)]
accuracy <- length(results[results==TRUE])/length(results)
print(accuracy)
```

## STEP-BY-STEP KNNI IMPLEMENTATION [optional]

Let's have some fun!!

We'll now write our own code. Instead of the Euclidean distance function, we will use the **Hamming distance** function, which is better suited for categorical features. SNP genotypes are categorical features (e.g. **AA/AB/BB**).

First, here's the Hamming distance function:

```{r}
#calculate distance matrix and nearest neighbours
Hamming <- function(ssr_data) {
  
  z <- matrix(0, nrow = nrow(ssr_data), ncol = nrow(ssr_data))
  
  for (k in 1:(nrow(ssr_data) - 1)) {
    for (l in (k + 1):nrow(ssr_data)) {
      z[k, l] <- sum(ssr_data[k, ] != ssr_data[l, ], na.rm=TRUE)
      z[l, k] <- z[k, l]
    }
  }
  
  return(z)
}
```

Again, we choose a small value of $K$ ($K=3$), given the small size of the exercise (and being sure of avoiding the "*curse of dimensionality*" :-)). The objective is to impute missing SNP genotypes at one SNP locus, using all remaining available information (except the target SNP locus).

```{r,echo=FALSE}
k <- 3

## take one SNP locus to impute e use the m-1 SNPs to train
y <- 0
while(length(y[is.na(y)])<2) { ## we want to have at least two missing SNP genotypes
  
  i <- sample(1:ncol(injected_data),1)
  X <- as.matrix(injected_data[,-i]) #global matrix of features (train + test sets)
  y <- injected_data[,i]
}
  
D <- Hamming(X)
#D <- 2-gVanRaden(X)
row.names(D) <- names(y)
colnames(D) <- names(y)
```

Let's have a look at the first elements of the matrix of Hamming distances (the smaller the number, the closer the samples):

```{r,echo=FALSE}
print(kable(D))
```

```{r}
print(kable(y))
```

This is the vector of SNP genotypes to be imputed: `r y`.

Non-missing genotypes at this SNP locus are the "*training*" observations, to be used for the imputation of the missing data points. The subset of distances for the missing observations is extracted and ordered by reverse distance

```{r}
testIDS <- names(y[is.na(y)])
trainIDS <- names(y[!is.na(y)])

NN <- apply(D[as.character(testIDS),as.character(trainIDS)],1,order)
NN <- t(NN)

print(NN)
```

We see here the indexes of the neighbors for the "test" observations (samples with missing data to impute), ordered from closest to farthest (mind you, missing elements from `y` are skipped).

Below we can see, for the first "test" sample (sample whose missing genotype we want to impute) the genotypes for the first `k` neighbors (that will be used for KNN-imputation by majority vote):

```{r}
neighbs <- NN[1,1:k] ## first k neighbors for first "test" sample
print(neighbs)
print(y[trainIDS][neighbs]) ## neighbor genotypes of first k samples 
```

We'll now use KNN to impute the missing genotypes, based on the most frequent genotype in the $K$ neighbors.

```{r}
ergebnisse <- apply(NN[,1:k, drop=F], 1, function(nn) {
    tab <- table(y[trainIDS][nn]); 
    maxClass <- names(which.max(tab))
    prob <- tab[maxClass]/k; 
    pred <- as.integer(maxClass); 
    return(c(pred,prob)) 
})

```

And these are the results for one single SNP locus:

```{r}
ergebnisse <- as.data.frame(t(ergebnisse))
names(ergebnisse) <- c("pred","prob")
ergebnisse[ergebnisse$pred==0,]$prob <- 1-ergebnisse[ergebnisse$pred==0,]$prob
print(kable(ergebnisse))
```

Now we can apply this code to the entire dataset, to impute missing genotypes at all SNP loci.

```{r}
imputedM <- matrix(rep(NA,nInd*nSNP),nrow=nInd)

for(i in 1:ncol(injected_data)) {
  
  X <- as.matrix(injected_data[,-i]) #global matrix of features (train + test sets)
  y <- injected_data[,i]
  k <- 3
  
  if(length(y[is.na(y)])<1) {
    
    imputedM[,i] <- y
  } else {
    
    D <- Hamming(X)
    row.names(D) <- names(y)
    colnames(D) <- names(y)
    
    testIDS <- names(y[is.na(y)])
    trainIDS <- names(y[!is.na(y)])
    
    if(length(testIDS)!=1) {
      
      NN <- apply(D[as.character(testIDS),as.character(trainIDS)],1,order)
      NN <- t(NN)
      ids <- row.names(NN) #for the output file
      
      ergebnisse <- apply(NN[,1:k, drop=F], 1, function(nn) {
        tab <- table(y[trainIDS][nn]); 
        maxClass <- names(which.max(tab))
        pred <- as.integer(maxClass); 
        return(pred)
      })
      y[testIDS] <- ergebnisse[testIDS]
    } else {
      
      n <- order(D[testIDS,trainIDS])[1:k]
      tab <- table(y[trainIDS][n]); 
      maxClass <- names(which.max(tab))
      prob <- tab[maxClass]/k; 
      pred <- as.integer(maxClass); 
      y[testIDS] <- pred
    }

    imputedM[,i] <- y
  }
}

#Accuracy of imputation
results <- imputedM[is.na(missing_map)]==Mreduced[is.na(missing_map)]
accuracy <- length(results[results==TRUE])/length(results)
print(accuracy)
```

[The accuracy of imputation with our R code is:<strong> `r accuracy`</strong>]{style="background-color:yellow"}

This is expected to be higher than the accuracy obtained with the native *knnImputation()* function, since we used the Hamming distance which is more appropriate for SNP genotype data.

NB: important caveat: the R code shown here is intended for demonstration only, and has not been optimized nor tested for applications.