Getting Started with Data Wrangling in R

Data Wrangling

R Programming

Data Cleaning

Tutorials

A beginner-friendly guide to data wrangling in R using the powerful dplyr and tidyr packages. Learn how to filter, summarize, and reshape datasets step by step.

Author

InsightR

Published

December 1, 2024

Keywords

Data Wrangling in R, dplyr tutorial, tidyr tutorial, R programming, Data cleaning, Data manipulation in R, Filtering rows in R, Summarizing data in R, Grouping data in R, Reshaping data in R, Wide to long format in R, Pivot tables in R, Handling missing values in R, mutate function in R, R data analysis, mtcars dataset, iris dataset, tidyverse tutorial, Beginner’s guide to dplyr, Beginner’s guide to tidyr

Data wrangling is a critical step in the data analysis process. It involves cleaning, reshaping, and preparing data for analysis. In R, the dplyr and tidyr packages from the tidyverse ecosystem are popular tools for efficient and intuitive data manipulation.

This tutorial will guide you through the basics of using dplyr and tidyr for common data wrangling tasks, using built-in R datasets.

Why Use dplyr and tidyr?

dplyr: Streamlines data manipulation tasks like filtering, selecting, and summarizing data.
tidyr: Specializes in reshaping data, making it easy to pivot between wide and long formats.

Both packages use a consistent syntax and integrate seamlessly with other tidyverse tools.

Setting Up Your Environment

Before we start, ensure you have the tidyverse package installed. If not, you can install it with:

install.packages("tidyverse")

Load the necessary libraries:

library(dplyr)
library(tidyr)

Exploring the Built-in Dataset: mtcars

We’ll use the mtcars dataset, which contains data on different car models. Start by loading the dataset:

data(mtcars)
head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Task 1: Filtering Rows with dplyr

Filtering helps you extract rows based on conditions. For example, to filter cars with more than 20 miles per gallon (mpg):

mtcars_filtered <- mtcars %>% 
  filter(mpg > 20)
head(mtcars_filtered)

                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2

Explanation:

filter(mpg > 20) selects rows where the mpg column is greater than 20.

Task 2: Selecting Columns with dplyr

To select specific columns, use the select() function. For example, to select mpg, cyl, and hp columns:

mtcars_selected <- mtcars %>% 
  select(mpg, cyl, hp)
head(mtcars_selected)

                   mpg cyl  hp
Mazda RX4         21.0   6 110
Mazda RX4 Wag     21.0   6 110
Datsun 710        22.8   4  93
Hornet 4 Drive    21.4   6 110
Hornet Sportabout 18.7   8 175
Valiant           18.1   6 105

Explanation:

select() keeps only the specified columns.

Task 3: Summarizing Data with dplyr

Summarization aggregates data. For instance, calculate the average horsepower (hp) for the dataset:

average_hp <- mtcars %>% 
  summarise(mean_hp = mean(hp))
average_hp

   mean_hp
1 146.6875

Explanation:

summarise(mean_hp = mean(hp)) computes the mean of the hp column and renames it mean_hp.

Task 4: Grouping and Summarizing

Combine group_by() and summarise() to analyze subsets of data. For example, calculate the average mpg for each cylinder group:

mpg_by_cyl <- mtcars %>% 
  group_by(cyl) %>% 
  summarise(mean_mpg = mean(mpg))
mpg_by_cyl

# A tibble: 3 × 2
    cyl mean_mpg
  <dbl>    <dbl>
1     4     26.7
2     6     19.7
3     8     15.1

Explanation:

group_by(cyl) groups the data by cyl.
summarise(mean_mpg = mean(mpg)) calculates the mean of mpg for each group.

Task 5: Adding New Columns with mutate()

Use mutate() to create new columns. For instance, add a column for the weight in kilograms (wt_kg):

mtcars_with_kg <- mtcars %>% 
  mutate(wt_kg = wt * 453.592)
head(mtcars_with_kg)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb    wt_kg
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 1188.411
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 1304.077
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 1052.333
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 1458.298
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 1560.356
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 1569.428

Explanation:

mutate(wt_kg = wt * 453.592) creates a new column by converting the weight (wt) from pounds to kilograms.

Task 6: Reshaping Data with tidyr

Pivoting from Wide to Long Format

Convert the iris dataset from wide to long format, focusing on the Sepal.Length and Sepal.Width columns:

iris_long <- iris %>% 
  pivot_longer(cols = starts_with("Sepal"), 
               names_to = "Measurement", 
               values_to = "Value")
head(iris_long)

# A tibble: 6 × 5
  Petal.Length Petal.Width Species Measurement  Value
         <dbl>       <dbl> <fct>   <chr>        <dbl>
1          1.4         0.2 setosa  Sepal.Length   5.1
2          1.4         0.2 setosa  Sepal.Width    3.5
3          1.4         0.2 setosa  Sepal.Length   4.9
4          1.4         0.2 setosa  Sepal.Width    3  
5          1.3         0.2 setosa  Sepal.Length   4.7
6          1.3         0.2 setosa  Sepal.Width    3.2

Explanation:

pivot_longer(cols = starts_with("Sepal")) reshapes all columns starting with “Sepal” into a long format.
names_to specifies the new column for variable names.
values_to specifies the column for values.

Pivoting from Long to Wide Format

Convert the data back to wide format:

iris_wide <- iris_long %>% 
  pivot_wider(names_from = Measurement, 
              values_from = Value)
head(iris_wide)

# A tibble: 6 × 5
  Petal.Length Petal.Width Species Sepal.Length Sepal.Width
         <dbl>       <dbl> <fct>   <list>       <list>     
1          1.4         0.2 setosa  <dbl [8]>    <dbl [8]>  
2          1.3         0.2 setosa  <dbl [4]>    <dbl [4]>  
3          1.5         0.2 setosa  <dbl [7]>    <dbl [7]>  
4          1.7         0.4 setosa  <dbl [1]>    <dbl [1]>  
5          1.4         0.3 setosa  <dbl [3]>    <dbl [3]>  
6          1.5         0.1 setosa  <dbl [2]>    <dbl [2]>

Task 7: Handling Missing Data

Simulate missing data and handle it using replace_na():

mtcars_with_na <- mtcars %>% 
  mutate(hp = ifelse(row_number() %% 5 == 0, NA, hp))

mtcars_cleaned <- mtcars_with_na %>% 
  replace_na(list(hp = 0))
head(mtcars_cleaned)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360   0 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Explanation:

replace_na(list(hp = 0)) replaces missing values in the hp column with 0.

Conclusion

In this tutorial, you’ve learned how to: - Filter, select, and summarize data with dplyr. - Reshape data with tidyr. - Handle missing values efficiently.

Data wrangling is essential for preparing datasets for analysis. Mastering these tools will save you time and make your analysis more effective.

Ready to dive deeper into R? Explore more tutorials on InsightR and start building your data manipulation expertise today!

Resources

Let us know your thoughts in the comments or reach out with questions!