!! remember to **download the data** for this practical sessions: http://www.jackdellequerce.com/data/GenRiz44.txt !! !! you should download the data into the data/ folder inside the introduction_to_gwas repository !! {r} # THERE ARE MULTIPLE OPTIONS TO DOWNLOAD THE DATA ## 1) FROM THE COMMAND LINE (LINUX TERMINAL, WINDOWS DOS) - UNCOMMENT IF NEEDED ## cd ../data ## wget http://www.jackdellequerce.com/data/GenRiz44.txt ## 2) USING SYSTEM CALLS WITHIN R - UNCOMMENT IF NEEDED # setwd("../data") # system2("wget", "http://www.jackdellequerce.com/data/GenRiz44.txt") # setwd("../3.imputation") ## 3) FROM THE COURSE WEBSITE https://filippob.github.io/introduction_to_gwas/ ## 4) FROM THE SLACK CHANNEL OF THE COURSE  # KNN: K-Nearest Neighbors (for classification) K-nearest neighbors (KNN) is a classification method that attempts to estimate the conditional distribution of the observations (e.g. vector $Y$) given a set of features (matrix $\mathbf{X}$): $$Pr(Y=j|X=x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i=j)$$ Let's look at this: - $K$ is the chosen **number of neighbors** to consider - $x_0$ is the **observation** we want to classify (we have features but not know the class it belongs to) - $\mathcal{N}_0$ is the neighborhood around $x_0$ - $I(y_i=j)$ is the number of neighbors in $\mathcal{N}_0$ belonging to class $j$ The observation $x_0$ is then classified to the class with the largest probability ("Bayes rule"). ## Distance function A crucial aspect in KNN classification is of course how you *define the distance* between observations. A common choice is the Euclidean distance, which basically takes the square root of the sum of **squared differences in feature values** between any two points: 1. Euclidean distance: $$D_E(x_i,x_j) = \sqrt{ \sum_{m=1}^M (x_{i,m}-x_{j,m})^2 }$$ The Euclidean distance, though, is designed for **continuous features**, and is therefore not very suited for categorical features. Luckily, other distance functions exist for **categorical features**, like the Hamming distance function: 2. Hamming distance (categorical features): $$D_H(x_i,x_j) = \frac{1}{M} \sum_{m=1}^M I(x_{i,m} \neq x_{j,m})$$ In this brief practical/demonstration, we'll go through an illustration of how to apply KNN to the problem of imputing missing SNP genotypes in rice (*Oryza sativa*). 1. Raw data 2. Introduce (artificial) missing genotypes 3. Imputation with R package 4. Step by step KNN imputation: write our own code! {r,echo=FALSE,results='hide', message=FALSE} #Demonstrating the use of KNN to impute missing SNP genotypes library("tidymodels") #for kNN imputation library("data.table")  ### The dataset We'll be using SNP genotype data from 12 samples genotyped with the "**GeneChip Rice 44K SNP**". A summary of the real missing rate in the data is given below: {r, echo=FALSE} rice44 <- fread("../data/GenRiz44.txt",header=TRUE) oldcode <- c("AA","BB","--") neucode <- c(-1,1,NA) M <- apply(rice44[,5:ncol(rice44)],2,function(x) neucode[match(x,oldcode)]) print(paste("number of SNPs before filtering:", nrow(M))) missingRate <- apply(M,1,function(x) length(x[is.na(x)])/length(x)) summary(missingRate) M <- M[missingRate==0,] colnames(M) <- paste("sample",seq(1:ncol(M)),sep="_") print(paste("n. of SNPs after filtering:",nrow(M)))  ## MEASURING THE ACCURACY OF IMPUTATION The objective of this demonstration is to show to you how the accuracy of imputing missing SNP genotypes can be measured: - we will be using the R package tidymodels with built-in functions for KNN imputation - we will remove real missing SNP genotypes from the data, so to have a starting point with 0 missing: we know the complete picture! - we will then "mask" some SNP genotypes by artificially turning them to missing: in this way, we will be able to compare the imputed genotypes/alleles with the known true genotypes/alleles and calculate the error rate All real missing data are edited out of the dataset. This leaves r nrow(M) SNP. For the sake of the exercise, initially a much smaller subset with 100 SNPs is extracted. {r} df <- M %>% as_tibble() %>% rowwise() M <- df %>% mutate(m = sd(c_across(everything()))) %>% filter(m > 0) ## remove monomorphic SNP rm(df) M <- M %>% select(-m) %>% as.matrix() vec <- sample(x = nrow(M), size = 100) Mreduced <- t(M[vec,]) print(Mreduced[1:10,1:12])  Now, we randomly introduce 10\% artificially missing SNP genotypes. In this way, we will be able to measure the accuracy of our imputation method. {r} nSNP <- ncol(Mreduced) nInd <- nrow(Mreduced) schwellenWert <- 0.1 # 10% missing data missing_matrix = matrix(runif(nInd*nSNP), ncol=nSNP) missing_matrix[missing_matrix < schwellenWert] = NA missing_matrix[!is.na(missing_matrix)] = 0 injected_data = Mreduced + missing_matrix actualMissingRate <- length(injected_data[is.na(injected_data)])/(nInd*nSNP) missing_map = missing_matrix print(actualMissingRate)  The actual missing rate is r actualMissingRate. ### Imputation with the R package tidymodels We'll now impute missing genotypes using the **tidymodels** *R package*. In this package there is the *step_impute_knn()* function, which uses by default the *Gower distance*, and allows the user to specify the number of neighbors $K$. The measure of probability'' of belonging to any given class is the **average** (continuous variables) or the **mode** (**majority vote**, categorical variables). {r, warning=FALSE, message=FALSE} missing_data <- as.data.frame(injected_data) rec <- recipe(missing_data) %>% update_role(all_numeric(), new_role = "predictor") ratio_recipe <- rec %>% step_impute_knn(all_predictors(), neighbors = 9) ratio_recipe2 <- prep(ratio_recipe, training = missing_data, verbose = FALSE) imputed = juice(ratio_recipe2) results <- imputed[is.na(missing_map)]==Mreduced[is.na(missing_map)] accuracy <- length(results[results==TRUE])/length(results) print(accuracy)  The accuracy of imputation with the R function *step_impute_knn()* from the *tidymodels* package is: r accuracy ## STEP-BY-STEP KNNI IMPLEMENTATION [optional] Let's have some fun!! We'll now write our own code. Instead of the Euclidean distance function, we will use the **Hamming distance** function, which is better suited for categorical features. SNP genotypes are categorical features (e.g. **AA/AB/BB**). First, here's the Hamming distance function: {r} #calculate distance matrix and nearest neighbours Hamming <- function(ssr_data) { z <- matrix(0, nrow = nrow(ssr_data), ncol = nrow(ssr_data)) for (k in 1:(nrow(ssr_data) - 1)) { for (l in (k + 1):nrow(ssr_data)) { z[k, l] <- sum(ssr_data[k, ] != ssr_data[l, ], na.rm=TRUE) z[l, k] <- z[k, l] } } return(z) }  Again, we choose a small value of $K$ ($K=3$), given the small size of the exercise (and being sure of avoiding the "*curse of dimensionality*" :-)). The objective is to impute missing SNP genotypes at one SNP locus, using all remaining available information (except the target SNP locus). {r,echo=FALSE} k <- 3 ## take one SNP locus to impute e use the m-1 SNPs to train y <- 0 while(length(y[is.na(y)])<2) { i <- sample(1:ncol(injected_data),1) X <- as.matrix(injected_data[,-i]) #global matrix of features (train + test sets) y <- injected_data[,i] } D <- Hamming(X) #D <- 2-gVanRaden(X) row.names(D) <- names(y) colnames(D) <- names(y)  Let's have a look at the first elements of the matrix of Hamming distances: {r,echo=FALSE} print(D) print(y)  This is the vector of SNP genotypes to be imputed: r y. Non-missing genotypes at this SNP locus are the "*training*" observations, to be used for the imputation of the missing data points. The subset of distances for the missing observations is extracted and ordered by reverse distance {r} testIDS <- names(y[is.na(y)]) trainIDS <- names(y[!is.na(y)]) NN <- apply(D[as.character(testIDS),as.character(trainIDS)],1,order) NN <- t(NN) print(NN)  We see here the indexes of the neighbors for the "test" observations (samples with missing data to impute), ordered from closest to farthest (mind you, missing elements from y are skipped). Below we can see, for the first "test" sample (sample whose missing genotype we want to impute) the genotypes for the first k neighbors (that will be used for KNN-imputation by majority vote): {r} neighbs <- NN[1,1:k] ## first k neighbors for first "test" sample print(neighbs) print(y[trainIDS][neighbs]) ## neighbor genotypes of first k samples  We'll now use KNN to impute the missing genotypes, based on the most frequent genotype in the $K$ neighbors. {r} ergebnisse <- apply(NN[,1:k, drop=F], 1, function(nn) { tab <- table(y[trainIDS][nn]); maxClass <- names(which.max(tab)) prob <- tab[maxClass]/k; pred <- as.integer(maxClass); return(c(pred,prob)) })  And these are the results for one single SNP locus: {r} ergebnisse <- as.data.frame(t(ergebnisse)) names(ergebnisse) <- c("pred","prob") ergebnisse[ergebnisse$pred==0,]$prob <- 1-ergebnisse[ergebnisse$pred==0,]$prob print(ergebnisse)  Now we can apply this code to the entire dataset, to impute missing genotypes at all SNP loci. {r} imputedM <- matrix(rep(NA,nInd*nSNP),nrow=nInd) for(i in 1:ncol(injected_data)) { X <- as.matrix(injected_data[,-i]) #global matrix of features (train + test sets) y <- injected_data[,i] k <- 3 if(length(y[is.na(y)])<1) { imputedM[,i] <- y } else { D <- Hamming(X) row.names(D) <- names(y) colnames(D) <- names(y) testIDS <- names(y[is.na(y)]) trainIDS <- names(y[!is.na(y)]) if(length(testIDS)!=1) { NN <- apply(D[as.character(testIDS),as.character(trainIDS)],1,order) NN <- t(NN) ids <- row.names(NN) #for the output file ergebnisse <- apply(NN[,1:k, drop=F], 1, function(nn) { tab <- table(y[trainIDS][nn]); maxClass <- names(which.max(tab)) pred <- as.integer(maxClass); return(pred) }) y[testIDS] <- ergebnisse[testIDS] } else { n <- order(D[testIDS,trainIDS])[1:k] tab <- table(y[trainIDS][n]); maxClass <- names(which.max(tab)) prob <- tab[maxClass]/k; pred <- as.integer(maxClass); y[testIDS] <- pred } imputedM[,i] <- y } } #Accuracy of imputation results <- imputedM[is.na(missing_map)]==Mreduced[is.na(missing_map)] accuracy <- length(results[results==TRUE])/length(results) print(accuracy)  The accuracy of imputation with our R code is: r accuracy This is expected to be higher than the accuracy obtained with the native *knnImputation()* function, since we used the Hamming distance which is more appropriate for SNP genotype data. NB: important caveat: the R code shown here is intended for demonstration only, and has not been optimized nor tested for applications.