Category Archives: R

R: Heatmap plots

I’m working with a dataset of trace-metals concentrations in different streams, and I wanted to see the overall mean concentration for each metal, in each stream. I used a heatmap to plot a grid of streams vs. metals, with a color shading in each cell representing the mean concentration.

Heatmap - wet weather metals

Light blue values are lower concentrations, dark red values are higher concentrations (grey cells contain no data). Since some metals occur at much higher concentrations than others (by a few orders of magnitude), all the data have been scaled (more on the methods below) — which is why the heatmap does not have a legend with actual values. It’s purely a high-low gradient.

Metals are in alphabetical order down the left-hand side. Streams are across the top, sorted so that the streams with the overall highest metals concentrations are on the left, going to the overall lowest metals concentrations on the right.

There are several webpages with instructions on how to build a heatmap like this in R, using ggplot2, and I’ve made my own modifications to both the aesthetics and the data-handling. Continue reading

Advertisement

Leave a comment

Filed under Data graphs, R

R: How to fix column names containing spaces

A basic rule of R is to avoid naming data-frame columns using names that contain spaces. R will accept a name containing spaces, but the spaces then make it impossible to reference the object in a function. Neither single or double quotes can work around this problem, and other data structures also share this limitation. (Note that commas also cause similar problems, as do many special characters.)

> select(x, Date, Depth, Total Phosphorus)
Error: unexpected symbol in "select(x, Date, Depth, Total Phosphorus"
> select(x, Date, Depth, "Total Phosphorus")
Error: All select() inputs must resolve to integer column positions.
The following do not:
*  "Total Phosphorus"

When I’m creating my own column names it’s easy to avoid the issue, but often I’m working with data that comes from other sources — and contains spaces in problematic places.

At the moment I’m working with lab data where every water-quality parameter (nitrogen, phosphorus, copper, temperature, etc.) is its own line in the dataset, with one column for the parameter name and another column for the value — often called “long” or “molten” format. The names-with-spaces problem comes when I spread out the data into “wide” format, where each row is a complete sampling date and each parameter becomes its own column.

Many of the parameters have spaces (and sometimes commas) in the name the lab uses, such as “Ammonia Nitrogen” or “Copper, Dissolved”. When I use the spread() function (from the “tidyr” package), these become column names containing spaces and commas. After trying a few approaches, I found a simple, elegant way to do it in a single line using the “stringr” package.

To fix a dataframe (or any other named structure) “x” that already has names containing spaces and/or commas:

names(x)<-str_replace_all(names(x), c(" " = "." , "," = "" ))

This replaces spaces with periods, and commas with nothing (to avoid creating double periods like “Copper..Dissolved”). Spaces and commas are the only two problematic characters I have in my dataset; if you have others, add them to the str_replace_all() call. The result from a names() call is a vector of strings, which the stringr functions can manipulate easily before passing it back through the names() function.

Another approach is to remove the spaces from the long/molten dataset before spreading it out. It’s very similar code, only on a column of parameter names:

y$Parameter<-str_replace_all(y$Parameter, c(" " = "." , "," = "" ))

 

UPDATE: I’ve discovered that tibbles — the tidyverse version of the dataframe — can handle names with spaces and other special characters. To refer to those non-standard names in functions, surround the name with backticks (on my keyboard, the backtick is on the  ~  key):

`​Copper, Dissolved`

 

Even though I’m working primarily with tibbles now, I’m going to keep renaming my columns so they’re easier to work with as I go through the rest of the data analysis. I prefer not to have to type the backticks all the time.

4 Comments

Filed under R