Tag Archives: data frame

R: How to fix column names containing spaces

A basic rule of R is to avoid naming data-frame columns using names that contain spaces. R will accept a name containing spaces, but the spaces then make it impossible to reference the object in a function. Neither single or double quotes can work around this problem, and other data structures also share this limitation. (Note that commas also cause similar problems, as do many special characters.)

> select(x, Date, Depth, Total Phosphorus)
Error: unexpected symbol in "select(x, Date, Depth, Total Phosphorus"
> select(x, Date, Depth, "Total Phosphorus")
Error: All select() inputs must resolve to integer column positions.
The following do not:
*  "Total Phosphorus"

When I’m creating my own column names it’s easy to avoid the issue, but often I’m working with data that comes from other sources — and contains spaces in problematic places.

At the moment I’m working with lab data where every water-quality parameter (nitrogen, phosphorus, copper, temperature, etc.) is its own line in the dataset, with one column for the parameter name and another column for the value — often called “long” or “molten” format. The names-with-spaces problem comes when I spread out the data into “wide” format, where each row is a complete sampling date and each parameter becomes its own column.

Many of the parameters have spaces (and sometimes commas) in the name the lab uses, such as “Ammonia Nitrogen” or “Copper, Dissolved”. When I use the spread() function (from the “tidyr” package), these become column names containing spaces and commas. After trying a few approaches, I found a simple, elegant way to do it in a single line using the “stringr” package.

To fix a dataframe (or any other named structure) “x” that already has names containing spaces and/or commas:

names(x)<-str_replace_all(names(x), c(" " = "." , "," = "" ))

This replaces spaces with periods, and commas with nothing (to avoid creating double periods like “Copper..Dissolved”). Spaces and commas are the only two problematic characters I have in my dataset; if you have others, add them to the str_replace_all() call. The result from a names() call is a vector of strings, which the stringr functions can manipulate easily before passing it back through the names() function.

Another approach is to remove the spaces from the long/molten dataset before spreading it out. It’s very similar code, only on a column of parameter names:

y$Parameter<-str_replace_all(y$Parameter, c(" " = "." , "," = "" ))

 

UPDATE: I’ve discovered that tibbles — the tidyverse version of the dataframe — can handle names with spaces and other special characters. To refer to those non-standard names in functions, surround the name with backticks (on my keyboard, the backtick is on the  ~  key):

`​Copper, Dissolved`

 

Even though I’m working primarily with tibbles now, I’m going to keep renaming my columns so they’re easier to work with as I go through the rest of the data analysis. I prefer not to have to type the backticks all the time.

Advertisement

4 Comments

Filed under R