I’m working with a dataset of trace-metals concentrations in different streams, and I wanted to see the overall mean concentration for each metal, in each stream. I used a heatmap to plot a grid of streams vs. metals, with a color shading in each cell representing the mean concentration.
Light blue cells are lower concentrations and dark red cells are higher concentrations (grey cells contain no data). Since some metals occur at much higher concentrations than others (by a few orders of magnitude), all the data have been scaled (more on the methods below). That is why the heatmap does not have a legend with actual values: it's purely a high-low gradient.
Metals are in alphabetical order down the left-hand side. Streams are across the top, sorted so that the streams with the overall highest metals concentrations are on the left, going to the overall lowest metals concentrations on the right.
There are several webpages with instructions on how to build a heatmap like this in R, using ggplot2, and I’ve made my own modifications to both the aesthetics and the data-handling.
Colors: I found I wasn't seeing enough of the differences with a single-color scale, so I used a light-blue to dark-red scale that adds a contrasting color gradient to the light/dark gradient. (Blue and red remain distinguishable under the most common forms of color blindness, and the light/dark difference between the two ends helps as well.)
Data scaling: There's more than one way to scale data. Range scaling puts every metal on the same range, setting each metal's minimum value to 0 and its maximum value to 1. Standardizing instead centers each metal's data on its mean and divides by its standard deviation, so every metal ends up with a mean of 0 and a variance of 1, but the minimum and maximum are not pinned to fixed values. Standardizing seems to be a common practice, using the scale() function. The above plot was created this way. You'll notice that the mid-tones are the most common colors in the plot.
The plot below was created by scaling only the range and leaving the variance alone, using the rescale() function from the scales package. This conveys more of the skewness in my data: some metals have a lot of low values and one or two high outliers, which shows up as a row of blue with a couple of dark reds. In the variance-standardized version above, those show up as mostly mid-tone rather than blue. I'm still in the early data-exploratory stages, so I can't say which is a more useful or accurate portrayal of my data. I built this function because I wanted to easily use both scaling options.
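To make the difference between the two scaling options concrete, here is a minimal sketch on a single made-up metal with mostly low values and one high outlier. The vector conc is invented for illustration and is not from my dataset.

```r
library(scales)  # provides rescale()

# Hypothetical concentrations for one metal: mostly low, one high outlier
conc <- c(1, 2, 2, 3, 50)

# scale(): subtract the mean, then divide by the standard deviation
z <- as.numeric(scale(conc))

# rescale(): map the minimum to 0 and the maximum to 1
r <- rescale(conc, to = c(0, 1))

round(z, 2)  # low values sit a little below 0, the outlier well above it
round(r, 2)  # low values sit close to 0, the outlier is exactly 1
```

Both results are linear transformations of the same numbers, so the pattern within one metal is identical. The visual difference most likely comes from the heatmap sharing one color gradient across all metals: after rescale() every metal spans 0 to 1, while after scale() each metal has its own minimum and maximum, so a skewed metal's low values can land in the middle of the overall color range.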
Below is the function I used to create the heatmap. It currently has no error-checking/reporting capabilities, so be sure your data are formatted the way it needs! It needs a long-format (also called molten) dataframe or tibble, with each observation on a separate row. Column names have to be “X”, “Y”, and “Value” (order doesn't matter). “X” will go across the top (streams, in my examples), and “Y” will go down the left (metals, in my examples).
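As a rough illustration of the expected input, the sketch below turns made-up raw measurements into a long-format tibble of means with the X, Y, and Value column names. The names raw, stream, metal, conc, and heat_input are all hypothetical, and it assumes a dplyr version recent enough to support the .groups argument of summarize().

```r
library(dplyr)

# Hypothetical raw measurements: one row per sample
raw <- tibble(
  stream = c("North", "North", "South", "South", "South"),
  metal  = c("Cu", "Cu", "Cu", "Zn", "Zn"),
  conc   = c(1.0, 3.0, 4.0, 10.0, 30.0)
)

# One mean per stream-metal combination, renamed to the X / Y / Value
# columns that the heatmap function expects
heat_input <- raw %>%
  group_by(stream, metal) %>%
  summarize(Value = mean(conc, na.rm = TRUE), .groups = "drop") %>%
  rename(X = stream, Y = metal)

heat_input
```

In this made-up example there is no zinc measurement for the North stream, so that combination is simply absent from heat_input; a missing combination like this is what ends up as a grey cell in the heatmap.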
You can pass the function only the dataframe/tibble, and it uses its default behaviors of how to handle the data. It defaults to using the scale() method to scale the data, and to sorting the X (streams) but not the Y (metals). Any of these can be changed with the method, reorder.X, and reorder.Y arguments. (Unfortunately the method argument is not universal enough to use any function name you pass in, so the only working options are currently “scale” or “rescale”.)
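One possible way to lift that restriction, sketched here as an assumption rather than something the function currently does, is to look up the scaling function with match.fun() and apply it column by column. The helper scale_columns and the matrix m are hypothetical names used only for this illustration.

```r
library(scales)  # only needed for the rescale() example below

# Hypothetical helper (not part of the heatmap function): apply any
# vector-scaling function to each column of a numeric matrix.
scale_columns <- function(mat, method = "rescale", ...) {
  f <- match.fun(method)  # accepts a function or its name as a string
  apply(mat, 2, function(col) as.numeric(f(col, ...)))
}

# Made-up values: rows are streams, columns are metals
m <- cbind(Cu = c(1, 2, 3), Zn = c(10, 20, 60))

scale_columns(m, "rescale", to = c(0, 1))  # range-scale each column
scale_columns(m, "scale")                  # standardize each column
```

The as.numeric() wrapper is there because scale() returns a one-column matrix when given a vector, whereas rescale() returns a plain vector. Any function passed this way would need to take a numeric vector and return one of the same length.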
Note that this function requires both the “tidyverse” and “scales” packages to be loaded.
## Heatmap
# This function requires a dataframe/tibble
# with columns: X, Y, Value
# Known methods are "scale" or "rescale".
# Default is to reorder X but not Y.
# Returns a ggplot2 object ready to plot or modify further.
# Requires the tidyverse and scales packages to be loaded.
plot.heatmap <- function(data, method = "scale", reorder.X = TRUE, reorder.Y = FALSE) {

  ## Convert to matrix-like format, scale, and reassemble
  # One row per X, one column per Y
  d.matrix <- data %>%
    ungroup() %>%
    spread(key = Y, value = Value)

  # Scale data according to the chosen method
  if (method == "scale") {
    # Center each Y column and divide by its standard deviation
    d <- d.matrix %>%
      select(-X) %>%
      scale() %>%
      as_tibble() %>%
      mutate(X = d.matrix$X) %>%
      gather(key = Y, value = Value, -X)
  } else if (method == "rescale") {
    # Map each Y column's minimum to 0 and maximum to 1
    d <- d.matrix %>%
      mutate_if(is.numeric, rescale, to = c(0, 1)) %>%
      gather(key = Y, value = Value, -X)
  } # end methods if/else

  ## Y and X need to become factors, reordered or not
  if (reorder.X) {
    # Total for each X of all Ys' scaled values
    total.X <- d %>%
      group_by(X) %>%
      summarize(total = sum(Value, na.rm = TRUE)) %>%
      arrange(-total)
    d$X <- ordered(d$X, levels = total.X$X)
  } else {
    d$X <- as.factor(d$X)
  } # end if/else to reorder X

  if (reorder.Y) {
    # Total for each Y of all Xs' scaled values
    total.Y <- d %>%
      group_by(Y) %>%
      summarize(total = sum(Value, na.rm = TRUE)) %>%
      arrange(-total)
    d$Y <- ordered(d$Y, levels = total.Y$Y)
  } else {
    d$Y <- as.factor(d$Y)
  } # end if/else to reorder Y

  ## Plot
  hm <- ggplot(data = d, aes(x = X, y = Y, fill = Value)) +
    geom_tile() +
    scale_fill_gradient(low = "lightblue", high = "darkred") +
    scale_x_discrete(position = "top") +
    scale_y_discrete(limits = rev(levels(d$Y))) +
    theme(legend.position = "none",
          axis.text.x = element_text(angle = 45, hjust = 0)) +
    labs(y = NULL, x = NULL)

  hm
}

# Daniel Nidzgorski, July 2017
# https://dnidzgorski.wordpress.com/r-heatmap/
# CC BY-SA
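A short usage sketch, assuming the function above has already been run or sourced in the current session. The tibble example_means and its values are made up for illustration, and the exists() check is only there so the snippet does nothing if plot.heatmap() has not been defined yet.

```r
library(tidyverse)
library(scales)

# Made-up mean concentrations, already in long format (X, Y, Value)
example_means <- tibble(
  X     = rep(c("Stream A", "Stream B", "Stream C"), each = 3),
  Y     = rep(c("Cu", "Fe", "Zn"), times = 3),
  Value = c(0.5, 120,  8,
            0.2,  90,  3,
            1.1, 400, 15)
)

# Call the function only if it has already been defined in this session
if (exists("plot.heatmap")) {
  # Defaults: scale() method, streams sorted, metals left in alphabetical order
  hm_default <- plot.heatmap(example_means)

  # Range scaling instead, with metals sorted as well
  hm_rescaled <- plot.heatmap(example_means, method = "rescale", reorder.Y = TRUE)

  # The return value is a ggplot2 object, so further layers can be added
  print(hm_rescaled + labs(title = "Range-scaled means (example data)"))
}
```

Because the function returns the plot rather than drawing it, the result can be stored, modified with additional ggplot2 layers, or passed to ggsave().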