install.packages("tidyverse")
Getting Started with Data Wrangling in R
Data Wrangling in R, dplyr tutorial, tidyr tutorial, R programming, Data cleaning, Data manipulation in R, Filtering rows in R, Summarizing data in R, Grouping data in R, Reshaping data in R, Wide to long format in R, Pivot tables in R, Handling missing values in R, mutate function in R, R data analysis, mtcars dataset, iris dataset, tidyverse tutorial, Beginner’s guide to dplyr, Beginner’s guide to tidyr
Data wrangling is a critical step in the data analysis process. It involves cleaning, reshaping, and preparing data for analysis. In R, the dplyr
and tidyr
packages from the tidyverse
ecosystem are popular tools for efficient and intuitive data manipulation.
This tutorial will guide you through the basics of using dplyr
and tidyr
for common data wrangling tasks, using built-in R datasets.
Why Use dplyr and tidyr?
dplyr
: Streamlines data manipulation tasks like filtering, selecting, and summarizing data.tidyr
: Specializes in reshaping data, making it easy to pivot between wide and long formats.
Both packages use a consistent syntax and integrate seamlessly with other tidyverse tools.
Setting Up Your Environment
Before we start, ensure you have the tidyverse package installed. If not, you can install it with:
Load the necessary libraries:
library(dplyr)
library(tidyr)
Exploring the Built-in Dataset: mtcars
We’ll use the mtcars
dataset, which contains data on different car models. Start by loading the dataset:
data(mtcars)
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Task 1: Filtering Rows with dplyr
Filtering helps you extract rows based on conditions. For example, to filter cars with more than 20 miles per gallon (mpg
):
<- mtcars %>%
mtcars_filtered filter(mpg > 20)
head(mtcars_filtered)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Explanation:
filter(mpg > 20)
selects rows where thempg
column is greater than 20.
Task 2: Selecting Columns with dplyr
To select specific columns, use the select()
function. For example, to select mpg
, cyl
, and hp
columns:
<- mtcars %>%
mtcars_selected select(mpg, cyl, hp)
head(mtcars_selected)
mpg cyl hp
Mazda RX4 21.0 6 110
Mazda RX4 Wag 21.0 6 110
Datsun 710 22.8 4 93
Hornet 4 Drive 21.4 6 110
Hornet Sportabout 18.7 8 175
Valiant 18.1 6 105
Explanation:
select()
keeps only the specified columns.
Task 3: Summarizing Data with dplyr
Summarization aggregates data. For instance, calculate the average horsepower (hp
) for the dataset:
<- mtcars %>%
average_hp summarise(mean_hp = mean(hp))
average_hp
mean_hp
1 146.6875
Explanation:
summarise(mean_hp = mean(hp))
computes the mean of thehp
column and renames itmean_hp
.
Task 4: Grouping and Summarizing
Combine group_by()
and summarise()
to analyze subsets of data. For example, calculate the average mpg
for each cylinder group:
<- mtcars %>%
mpg_by_cyl group_by(cyl) %>%
summarise(mean_mpg = mean(mpg))
mpg_by_cyl
# A tibble: 3 × 2
cyl mean_mpg
<dbl> <dbl>
1 4 26.7
2 6 19.7
3 8 15.1
Explanation:
group_by(cyl)
groups the data bycyl
.summarise(mean_mpg = mean(mpg))
calculates the mean ofmpg
for each group.
Task 5: Adding New Columns with mutate()
Use mutate()
to create new columns. For instance, add a column for the weight in kilograms (wt_kg
):
<- mtcars %>%
mtcars_with_kg mutate(wt_kg = wt * 453.592)
head(mtcars_with_kg)
mpg cyl disp hp drat wt qsec vs am gear carb wt_kg
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1188.411
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1304.077
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1052.333
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1458.298
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 1560.356
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1569.428
Explanation:
mutate(wt_kg = wt * 453.592)
creates a new column by converting the weight (wt
) from pounds to kilograms.
Task 6: Reshaping Data with tidyr
Pivoting from Wide to Long Format
Convert the iris
dataset from wide to long format, focusing on the Sepal.Length
and Sepal.Width
columns:
<- iris %>%
iris_long pivot_longer(cols = starts_with("Sepal"),
names_to = "Measurement",
values_to = "Value")
head(iris_long)
# A tibble: 6 × 5
Petal.Length Petal.Width Species Measurement Value
<dbl> <dbl> <fct> <chr> <dbl>
1 1.4 0.2 setosa Sepal.Length 5.1
2 1.4 0.2 setosa Sepal.Width 3.5
3 1.4 0.2 setosa Sepal.Length 4.9
4 1.4 0.2 setosa Sepal.Width 3
5 1.3 0.2 setosa Sepal.Length 4.7
6 1.3 0.2 setosa Sepal.Width 3.2
Explanation:
pivot_longer(cols = starts_with("Sepal"))
reshapes all columns starting with “Sepal” into a long format.names_to
specifies the new column for variable names.values_to
specifies the column for values.
Pivoting from Long to Wide Format
Convert the data back to wide format:
<- iris_long %>%
iris_wide pivot_wider(names_from = Measurement,
values_from = Value)
head(iris_wide)
# A tibble: 6 × 5
Petal.Length Petal.Width Species Sepal.Length Sepal.Width
<dbl> <dbl> <fct> <list> <list>
1 1.4 0.2 setosa <dbl [8]> <dbl [8]>
2 1.3 0.2 setosa <dbl [4]> <dbl [4]>
3 1.5 0.2 setosa <dbl [7]> <dbl [7]>
4 1.7 0.4 setosa <dbl [1]> <dbl [1]>
5 1.4 0.3 setosa <dbl [3]> <dbl [3]>
6 1.5 0.1 setosa <dbl [2]> <dbl [2]>
Task 7: Handling Missing Data
Simulate missing data and handle it using replace_na()
:
<- mtcars %>%
mtcars_with_na mutate(hp = ifelse(row_number() %% 5 == 0, NA, hp))
<- mtcars_with_na %>%
mtcars_cleaned replace_na(list(hp = 0))
head(mtcars_cleaned)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 0 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Explanation:
replace_na(list(hp = 0))
replaces missing values in thehp
column with0
.
Conclusion
In this tutorial, you’ve learned how to: - Filter, select, and summarize data with dplyr
. - Reshape data with tidyr
. - Handle missing values efficiently.
Data wrangling is essential for preparing datasets for analysis. Mastering these tools will save you time and make your analysis more effective.
Ready to dive deeper into R? Explore more tutorials on InsightR and start building your data manipulation expertise today!
Resources
Let us know your thoughts in the comments or reach out with questions!