SyntaxStudy
Sign Up
R Grouping, Summarising, and Joins
R Beginner 2 min read

Grouping, Summarising, and Joins

group_by() annotates a data frame with grouping information so that subsequent operations are performed within each group independently. When combined with summarise(), it produces a grouped summary data frame with one row per group. Common summary functions include n() (row count), n_distinct() (unique count), mean(), median(), sd(), min(), max(), sum(), and first()/last(). The .groups argument controls whether to keep grouping in the result; setting it to "drop" removes all grouping, which is usually what you want after a final summarise(). The across() helper inside mutate() or summarise() applies the same transformation to multiple columns selected by name or by type predicate. This replaces the older _at, _if, and _all scoped variants and is the idiomatic way to apply a function to many columns at once. where(is.numeric) selects all numeric columns; c(col1, col2) selects specific columns by name. dplyr provides six join functions that mirror SQL semantics. left_join() keeps all rows from the left table and matches from the right; inner_join() keeps only rows with matches in both; full_join() keeps all rows from both; right_join() is the mirror of left_join(); semi_join() keeps rows from the left that have a match in the right (filtering only, no new columns); and anti_join() keeps rows from the left that have no match in the right (useful for finding unmatched records). The by argument specifies the key columns.
Example
library(dplyr)

employees <- data.frame(
    id     = 1:8,
    name   = c("Alice","Bob","Carol","Dave","Eve","Frank","Gina","Hank"),
    dept   = c("HR","IT","IT","Finance","HR","IT","Finance","HR"),
    salary = c(55000,72000,48000,95000,61000,88000,102000,58000),
    years  = c(3,7,2,12,5,9,15,4),
    active = c(TRUE,TRUE,FALSE,TRUE,TRUE,TRUE,TRUE,FALSE)
)

# group_by + summarise
employees |>
    group_by(dept) |>
    summarise(
        headcount   = n(),
        avg_salary  = mean(salary),
        max_salary  = max(salary),
        avg_tenure  = mean(years),
        .groups     = "drop"
    )

# across() — apply function to multiple columns
employees |>
    group_by(dept) |>
    summarise(across(where(is.numeric), mean, .names = "avg_{.col}"),
              .groups = "drop")

# count() shortcut
employees |> count(dept, sort = TRUE)
employees |> count(dept, active)

# Joins
departments <- data.frame(
    dept    = c("HR", "IT", "Finance", "Legal"),
    manager = c("Susan", "Tom", "Rachel", "Eric"),
    budget  = c(200000, 500000, 800000, 150000)
)

# left_join — all employees, matched dept info
left_join(employees, departments, by = "dept")

# inner_join — only employees in departments table
inner_join(employees, departments, by = "dept")

# anti_join — departments with no employees
anti_join(departments, employees, by = "dept")

# semi_join — employees whose dept appears in departments
semi_join(employees, filter(departments, budget > 400000), by = "dept")

# Full pipeline example
employees |>
    filter(active) |>
    left_join(departments, by = "dept") |>
    group_by(dept, manager) |>
    summarise(total_salary = sum(salary),
              headcount    = n(),
              .groups      = "drop") |>
    mutate(salary_pct = total_salary / budget) |>
    arrange(desc(total_salary))