install.packages("tidyverse")Getting Started with Data Wrangling in R
Data Wrangling in R, dplyr tutorial, tidyr tutorial, R programming, Data cleaning, Data manipulation in R, Filtering rows in R, Summarizing data in R, Grouping data in R, Reshaping data in R, Wide to long format in R, Pivot tables in R, Handling missing values in R, mutate function in R, R data analysis, mtcars dataset, iris dataset, tidyverse tutorial, Beginner’s guide to dplyr, Beginner’s guide to tidyr
Data wrangling is a critical step in the data analysis process. It involves cleaning, reshaping, and preparing data for analysis. In R, the dplyr and tidyr packages from the tidyverse ecosystem are popular tools for efficient and intuitive data manipulation.
This tutorial will guide you through the basics of using dplyr and tidyr for common data wrangling tasks, using built-in R datasets.
Why Use dplyr and tidyr?
dplyr: Streamlines data manipulation tasks like filtering, selecting, and summarizing data.tidyr: Specializes in reshaping data, making it easy to pivot between wide and long formats.
Both packages use a consistent syntax and integrate seamlessly with other tidyverse tools.
Setting Up Your Environment
Before we start, ensure you have the tidyverse package installed. If not, you can install it with:
Load the necessary libraries:
library(dplyr)
library(tidyr)Exploring the Built-in Dataset: mtcars
We’ll use the mtcars dataset, which contains data on different car models. Start by loading the dataset:
data(mtcars)
head(mtcars) mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Task 1: Filtering Rows with dplyr
Filtering helps you extract rows based on conditions. For example, to filter cars with more than 20 miles per gallon (mpg):
mtcars_filtered <- mtcars %>%
filter(mpg > 20)
head(mtcars_filtered) mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Explanation:
filter(mpg > 20)selects rows where thempgcolumn is greater than 20.
Task 2: Selecting Columns with dplyr
To select specific columns, use the select() function. For example, to select mpg, cyl, and hp columns:
mtcars_selected <- mtcars %>%
select(mpg, cyl, hp)
head(mtcars_selected) mpg cyl hp
Mazda RX4 21.0 6 110
Mazda RX4 Wag 21.0 6 110
Datsun 710 22.8 4 93
Hornet 4 Drive 21.4 6 110
Hornet Sportabout 18.7 8 175
Valiant 18.1 6 105
Explanation:
select()keeps only the specified columns.
Task 3: Summarizing Data with dplyr
Summarization aggregates data. For instance, calculate the average horsepower (hp) for the dataset:
average_hp <- mtcars %>%
summarise(mean_hp = mean(hp))
average_hp mean_hp
1 146.6875
Explanation:
summarise(mean_hp = mean(hp))computes the mean of thehpcolumn and renames itmean_hp.
Task 4: Grouping and Summarizing
Combine group_by() and summarise() to analyze subsets of data. For example, calculate the average mpg for each cylinder group:
mpg_by_cyl <- mtcars %>%
group_by(cyl) %>%
summarise(mean_mpg = mean(mpg))
mpg_by_cyl# A tibble: 3 × 2
cyl mean_mpg
<dbl> <dbl>
1 4 26.7
2 6 19.7
3 8 15.1
Explanation:
group_by(cyl)groups the data bycyl.summarise(mean_mpg = mean(mpg))calculates the mean ofmpgfor each group.
Task 5: Adding New Columns with mutate()
Use mutate() to create new columns. For instance, add a column for the weight in kilograms (wt_kg):
mtcars_with_kg <- mtcars %>%
mutate(wt_kg = wt * 453.592)
head(mtcars_with_kg) mpg cyl disp hp drat wt qsec vs am gear carb wt_kg
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1188.411
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1304.077
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1052.333
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1458.298
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 1560.356
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1569.428
Explanation:
mutate(wt_kg = wt * 453.592)creates a new column by converting the weight (wt) from pounds to kilograms.
Task 6: Reshaping Data with tidyr
Pivoting from Wide to Long Format
Convert the iris dataset from wide to long format, focusing on the Sepal.Length and Sepal.Width columns:
iris_long <- iris %>%
pivot_longer(cols = starts_with("Sepal"),
names_to = "Measurement",
values_to = "Value")
head(iris_long)# A tibble: 6 × 5
Petal.Length Petal.Width Species Measurement Value
<dbl> <dbl> <fct> <chr> <dbl>
1 1.4 0.2 setosa Sepal.Length 5.1
2 1.4 0.2 setosa Sepal.Width 3.5
3 1.4 0.2 setosa Sepal.Length 4.9
4 1.4 0.2 setosa Sepal.Width 3
5 1.3 0.2 setosa Sepal.Length 4.7
6 1.3 0.2 setosa Sepal.Width 3.2
Explanation:
pivot_longer(cols = starts_with("Sepal"))reshapes all columns starting with “Sepal” into a long format.names_tospecifies the new column for variable names.values_tospecifies the column for values.
Pivoting from Long to Wide Format
Convert the data back to wide format:
iris_wide <- iris_long %>%
pivot_wider(names_from = Measurement,
values_from = Value)
head(iris_wide)# A tibble: 6 × 5
Petal.Length Petal.Width Species Sepal.Length Sepal.Width
<dbl> <dbl> <fct> <list> <list>
1 1.4 0.2 setosa <dbl [8]> <dbl [8]>
2 1.3 0.2 setosa <dbl [4]> <dbl [4]>
3 1.5 0.2 setosa <dbl [7]> <dbl [7]>
4 1.7 0.4 setosa <dbl [1]> <dbl [1]>
5 1.4 0.3 setosa <dbl [3]> <dbl [3]>
6 1.5 0.1 setosa <dbl [2]> <dbl [2]>
Task 7: Handling Missing Data
Simulate missing data and handle it using replace_na():
mtcars_with_na <- mtcars %>%
mutate(hp = ifelse(row_number() %% 5 == 0, NA, hp))
mtcars_cleaned <- mtcars_with_na %>%
replace_na(list(hp = 0))
head(mtcars_cleaned) mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 0 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Explanation:
replace_na(list(hp = 0))replaces missing values in thehpcolumn with0.
Conclusion
In this tutorial, you’ve learned how to: - Filter, select, and summarize data with dplyr. - Reshape data with tidyr. - Handle missing values efficiently.
Data wrangling is essential for preparing datasets for analysis. Mastering these tools will save you time and make your analysis more effective.
Ready to dive deeper into R? Explore more tutorials on InsightR and start building your data manipulation expertise today!
Resources
Let us know your thoughts in the comments or reach out with questions!