We all work with data frames and it is important that we know how we can reshape them, as necessary to meet our needs. I think that there are, at least, four routine tasks that we need to be able to accomplish:
- subsetting
- sorting
- casting
- melting
Obviously, there is a wide array of possibilities; I’ll just mention a few, which I regularly use.
Subsetting the data
Subsetting means selecting the records (rows) or the variables (columns) which satisfy certain criteria. Let’s take the ‘students.csv’ dataset, which is available on one of my repositories. It is a database of student’s marks in a series of exams for different subjects.
students <- read.csv("https://www.casaonofri.it/_datasets/students.csv", header=T)
head(students)
## Id Subject Date Mark Year HighSchool
## 1 1 AGRONOMY 10/06/2002 30 2001 HUMANITIES
## 2 2 AGRONOMY 08/07/2002 24 2001 AGRICULTURE
## 3 3 AGRONOMY 24/06/2002 30 2001 AGRICULTURE
## 4 4 AGRONOMY 24/06/2002 26 2001 HUMANITIES
## 5 5 AGRONOMY 23/01/2003 30 2001 HUMANITIES
## 6 6 AGRONOMY 09/09/2002 28 2001 AGRICULTURE
Let’s say that we want a new dataset, which contains only the students with marks higher than 28.
subData <- subset(students, Mark >= 28)
head(subData)
## Id Subject Date Mark Year HighSchool
## 1 1 AGRONOMY 10/06/2002 30 2001 HUMANITIES
## 3 3 AGRONOMY 24/06/2002 30 2001 AGRICULTURE
## 5 5 AGRONOMY 23/01/2003 30 2001 HUMANITIES
## 6 6 AGRONOMY 09/09/2002 28 2001 AGRICULTURE
## 11 11 AGRONOMY 09/09/2002 28 2001 SCIENTIFIC
## 17 17 AGRONOMY 10/06/2002 30 2001 HUMANITIES
Let’s make it more difficult and extract the records were mark is ranging from 26 to 28 (margins included. Look at the AND clause):
subData <- subset(students, Mark <= 28 & Mark >=26)
head(subData)
## Id Subject Date Mark Year HighSchool
## 4 4 AGRONOMY 24/06/2002 26 2001 HUMANITIES
## 6 6 AGRONOMY 09/09/2002 28 2001 AGRICULTURE
## 7 7 AGRONOMY 24/02/2003 26 2001 HUMANITIES
## 8 8 AGRONOMY 09/09/2002 26 2001 SCIENTIFIC
## 10 10 AGRONOMY 08/07/2002 27 2001 HUMANITIES
## 11 11 AGRONOMY 09/09/2002 28 2001 SCIENTIFIC
Now we are interested in those students who got a mark ranging from 26 to 28 in MATHS (please note the equality relationship written as ‘==’):
subData <- subset(students, Mark <= 28 & Mark >=26 &
Subject == "MATHS")
head(subData)
## Id Subject Date Mark Year HighSchool
## 115 115 MATHS 15/07/2002 26 2001 AGRICULTURE
## 124 124 MATHS 16/09/2002 26 2001 SCIENTIFIC
## 138 138 MATHS 04/02/2002 27 2001 HUMANITIES
## 144 144 MATHS 10/02/2003 27 2001 HUMANITIES
## 145 145 MATHS 04/07/2003 27 2002 HUMANITIES
## 146 146 MATHS 28/02/2002 28 2001 AGRICULTURE
Lets’ look for good students either in MATHS or in CHEMISTRY (OR clause; note the ‘|’ operator):
subData <- subset(students, Mark <= 28 & Mark >=26 &
Subject == "MATHS" |
Subject == "CHEMISTRY")
head(subData)
## Id Subject Date Mark Year HighSchool
## 64 64 CHEMISTRY 18/06/2003 20 2002 AGRICULTURE
## 65 65 CHEMISTRY 06/06/2002 21 2001 HUMANITIES
## 66 66 CHEMISTRY 20/02/2003 21 2002 HUMANITIES
## 67 67 CHEMISTRY 20/02/2003 18 2002 AGRICULTURE
## 68 68 CHEMISTRY 04/06/2002 28 2001 OTHER SCHOOL
## 69 69 CHEMISTRY 26/06/2002 23 2001 ACCOUNTING
We can also select columns; for example we may want to display only the ‘Subject’, ‘Mark’ and ‘HighSchool’ columns:
subData <- subset(students, Mark <= 28 & Mark >=26 &
Subject == "MATHS" |
Subject == "CHEMISTRY",
select = c(Subject, Mark, HighSchool))
head(subData)
## Subject Mark HighSchool
## 64 CHEMISTRY 20 AGRICULTURE
## 65 CHEMISTRY 21 HUMANITIES
## 66 CHEMISTRY 21 HUMANITIES
## 67 CHEMISTRY 18 AGRICULTURE
## 68 CHEMISTRY 28 OTHER SCHOOL
## 69 CHEMISTRY 23 ACCOUNTING
We can as well drop the unwanted columns:
subData <- subset(students, Mark <= 28 & Mark >=26 &
Subject == "MATHS" |
Subject == "CHEMISTRY",
select = c(-Id,
-Date,
-Year))
head(subData)
## Subject Mark HighSchool
## 64 CHEMISTRY 20 AGRICULTURE
## 65 CHEMISTRY 21 HUMANITIES
## 66 CHEMISTRY 21 HUMANITIES
## 67 CHEMISTRY 18 AGRICULTURE
## 68 CHEMISTRY 28 OTHER SCHOOL
## 69 CHEMISTRY 23 ACCOUNTING
Using the function ‘subset()’ is very easy. However, we might have higher flexibility if we subset by using indices. We already know that the notation ‘dataframe[i,j]’ returns the element in the i-th row and j-th column in a data frame. We can of course replace i and j with some subsetting rules. For example, taking the exams where the mark is comprised between 25 and 29 is done as follows:
subData <- students[(students$Mark <= 29 & students$Mark >=25),]
head(subData)
## Id Subject Date Mark Year HighSchool
## 4 4 AGRONOMY 24/06/2002 26 2001 HUMANITIES
## 6 6 AGRONOMY 09/09/2002 28 2001 AGRICULTURE
## 7 7 AGRONOMY 24/02/2003 26 2001 HUMANITIES
## 8 8 AGRONOMY 09/09/2002 26 2001 SCIENTIFIC
## 10 10 AGRONOMY 08/07/2002 27 2001 HUMANITIES
## 11 11 AGRONOMY 09/09/2002 28 2001 SCIENTIFIC
This is useful to quickly edit the data. For example, if we want to replace all marks from 25 to 29 with NAs (Not Available), we can simply do:
subData <- students
subData[(subData$Mark <= 29 & subData$Mark >=25), "Mark"] <- NA
head(subData)
## Id Subject Date Mark Year HighSchool
## 1 1 AGRONOMY 10/06/2002 30 2001 HUMANITIES
## 2 2 AGRONOMY 08/07/2002 24 2001 AGRICULTURE
## 3 3 AGRONOMY 24/06/2002 30 2001 AGRICULTURE
## 4 4 AGRONOMY 24/06/2002 NA 2001 HUMANITIES
## 5 5 AGRONOMY 23/01/2003 30 2001 HUMANITIES
## 6 6 AGRONOMY 09/09/2002 NA 2001 AGRICULTURE
Please note that I created a new dataset to make the replacement, in order not to modify the original dataset. Of course, I can use the ‘is.na()’ function to find the missing data and edit them.
subData[is.na(subData$Mark) == T, "Mark"] <- 0
head(subData)
## Id Subject Date Mark Year HighSchool
## 1 1 AGRONOMY 10/06/2002 30 2001 HUMANITIES
## 2 2 AGRONOMY 08/07/2002 24 2001 AGRICULTURE
## 3 3 AGRONOMY 24/06/2002 30 2001 AGRICULTURE
## 4 4 AGRONOMY 24/06/2002 0 2001 HUMANITIES
## 5 5 AGRONOMY 23/01/2003 30 2001 HUMANITIES
## 6 6 AGRONOMY 09/09/2002 0 2001 AGRICULTURE
Sorting the data
Sorting is very much like subsetting by indexing. I just need to use the ‘order’ function. For example, let’s sort the students dataset by mark:
sortedData <- students[order(students$Mark), ]
head(sortedData)
## Id Subject Date Mark Year HighSchool
## 51 51 BIOLOGY 01/03/2002 18 2001 HUMANITIES
## 67 67 CHEMISTRY 20/02/2003 18 2002 AGRICULTURE
## 76 76 CHEMISTRY 24/02/2003 18 2002 OTHER SCHOOL
## 79 79 CHEMISTRY 18/06/2003 18 2002 AGRICULTURE
## 82 82 CHEMISTRY 18/07/2002 18 2001 AGRICULTURE
## 83 83 CHEMISTRY 23/01/2003 18 2001 SCIENTIFIC
We can also sort by decreasing order:
sortedData <- students[order(-students$Mark), ]
head(sortedData)
## Id Subject Date Mark Year HighSchool
## 1 1 AGRONOMY 10/06/2002 30 2001 HUMANITIES
## 3 3 AGRONOMY 24/06/2002 30 2001 AGRICULTURE
## 5 5 AGRONOMY 23/01/2003 30 2001 HUMANITIES
## 17 17 AGRONOMY 10/06/2002 30 2001 HUMANITIES
## 18 18 AGRONOMY 10/06/2002 30 2001 AGRICULTURE
## 19 19 AGRONOMY 09/09/2002 30 2001 AGRICULTURE
We can obviously use multiple keys. For example, let’s sort by subject within marks:
sortedData <- students[order(-students$Mark, students$Subject), ]
head(sortedData)
## Id Subject Date Mark Year HighSchool
## 1 1 AGRONOMY 10/06/2002 30 2001 HUMANITIES
## 3 3 AGRONOMY 24/06/2002 30 2001 AGRICULTURE
## 5 5 AGRONOMY 23/01/2003 30 2001 HUMANITIES
## 17 17 AGRONOMY 10/06/2002 30 2001 HUMANITIES
## 18 18 AGRONOMY 10/06/2002 30 2001 AGRICULTURE
## 19 19 AGRONOMY 09/09/2002 30 2001 AGRICULTURE
If I want to sort in decreasing order on a character variable (such as Subject), I need to use the helper function ‘xtfrm()’:
sortedData <- students[order(-students$Mark, -xtfrm(students$Subject)), ]
head(sortedData)
## Id Subject Date Mark Year HighSchool
## 116 116 MATHS 01/07/2002 30 2001 OTHER SCHOOL
## 117 117 MATHS 18/06/2002 30 2001 ACCOUNTING
## 118 118 MATHS 09/07/2002 30 2001 AGRICULTURE
## 121 121 MATHS 18/06/2002 30 2001 ACCOUNTING
## 123 123 MATHS 09/07/2002 30 2001 HUMANITIES
## 130 130 MATHS 07/02/2002 30 2001 SCIENTIFIC
Casting the data
When we have a dataset in the LONG format, we might be interested in reshaping it in the WIDE format. This is the same as what the ‘pivot table’ function in Excel does. For example, take the ‘rimsulfuron.csv’ dataset in my repository. This contains the results of a randomised block experiment, where we have 16 herbicides in four blocks. The dataset is in the LONG format, with one row per plot.
rimsulfuron <- read.csv("https://www.casaonofri.it/_datasets/rimsulfuron.csv", header=T)
head(rimsulfuron)
## Herbicide Plot Code Block Box WeedCover Yield
## 1 Rimsulfuron (40) 1 1 1 1 27.8 85.91
## 2 Rimsulfuron (45) 2 2 1 1 27.8 93.03
## 3 Rimsulfuron (50) 3 3 1 1 23.0 86.93
## 4 Rimsulfuron (60) 4 4 1 1 42.8 52.99
## 5 Rimsulfuron (50+30 split) 5 5 1 2 15.1 71.36
## 6 Rimsulfuron + thyfensulfuron 6 6 1 2 22.9 75.28
Lets’put this data frame in the WIDE format, with one row per herbicide and one column per block. To do so, I usually use the ‘cast()’ function in the ‘reshape’ package.
library(reshape)
castData <- cast(Herbicide ~ Block, data = rimsulfuron,
value = "Yield")
head(castData)
## Herbicide 1 2 3 4
## 1 Alachlor + terbuthylazine 12.06 49.58 41.34 16.37
## 2 Hand-Weeded 77.58 92.08 86.59 99.63
## 3 Metolachlor + terbuthylazine (pre) 51.77 52.10 49.46 34.67
## 4 Pendimethalin (post) + rimsuulfuron (post) 94.82 87.72 102.05 101.94
## 5 Pendimethalin (pre) + rimsulfuron (post) 65.51 88.72 95.52 82.39
## 6 Rimsulfuron (40) 85.91 91.09 111.42 93.15
Other similar functions are available within the ‘reshape2’ and ‘tidyr’ packages.
Melting the data
In this case we do the reverse: we transform the dataset from WIDE to LONG format. For this task, I like the ‘melt()’ function in the ‘reshape2’ package, which requires a data frame as input. I would like to use the ‘castData’ object which we have just created by using the ‘cast()’ function above. Unfortunately, this object has a ‘cast_df’ class. Therefore, in order to avoid weird results, I need to change ‘castData’ into a data frame, by using the ‘as.data.frame()’ function.
library(reshape2)
castData <- as.data.frame(castData)
mdati <- melt(castData,
variable.name = "Block",
value.name = "Yield",
id=c("Herbicide"))
head(mdati)
## Herbicide Block Yield
## 1 Alachlor + terbuthylazine 1 12.06
## 2 Hand-Weeded 1 77.58
## 3 Metolachlor + terbuthylazine (pre) 1 51.77
## 4 Pendimethalin (post) + rimsuulfuron (post) 1 94.82
## 5 Pendimethalin (pre) + rimsulfuron (post) 1 65.51
## 6 Rimsulfuron (40) 1 85.91
Have a nice work with these functions!
Andrea Onofri
Department of Agricultural, Food and Environmental Sciences
University of Perugia (Italy)
Borgo XX Giugno 74
I-06121 - PERUGIA