Frequency distribution 1

Qualitative variables

Published: September 30, 2024

1 Definitions

1.1 Raw data

  • It is defined as the original set of measurements or observations collected from a study or an experiment without being organized, summarized, or manipulated.

1.2 Ordered array

  • It is a set of data arranged in ascending or descending order.

  • For example, consider the following set of raw data (5, 3, 7, 2, 8, 4, 6, 1)

    • The ordered array of the data is (1, 2, 3, 4, 5, 6, 7, 8).

    • This can be done using the sort() function in R as follows:

      # create a vector of raw data
      raw_data <- c(5, 3, 7, 2, 8, 4, 6, 1)
      
      # sort the raw data in ascending order
      ordered_array <- sort(raw_data)
      
      # display the ordered array
      ordered_array
      [1] 1 2 3 4 5 6 7 8
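  • The same function can also produce a descending array. The following is a minimal sketch that reuses the raw data from above and relies on the `decreasing` argument of the base R `sort()` function; the object name `descending_array` is only illustrative.

```r
# raw data from the example above
raw_data <- c(5, 3, 7, 2, 8, 4, 6, 1)

# sort the raw data in descending order
descending_array <- sort(raw_data, decreasing = TRUE)

# display the descending array
descending_array
# [1] 8 7 6 5 4 3 2 1
```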

1.3 Frequency distribution

  • It is used to summarize raw data by grouping the data into classes and counting the number of observations in each class.
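  • As a rough illustration of grouping and counting, the sketch below splits the eight raw values from the ordered-array example into two classes and counts the observations in each. The class boundaries (0, 4, 8) and the object names are assumptions chosen only for this example.

```r
# raw data from the ordered-array example
raw_data <- c(5, 3, 7, 2, 8, 4, 6, 1)

# group the values into two classes: (0, 4] and (4, 8]
# (the breaks are arbitrary and chosen for illustration only)
classes <- cut(raw_data, breaks = c(0, 4, 8))

# count the number of observations in each class
class_freq <- table(classes)
class_freq
```

  • With these breaks, each class contains four of the eight observations.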

2 Frequency distribution of categorical variables

  • These variables can be summarized by counting the number of observations in each category (this count is known as the absolute frequency).

  • The relative frequency of each category is the proportion of observations in that category relative to the total number of observations.

  • For example, consider the variable vs (engine type) in the mtcars dataset:

    • This variable contains two categories: 0 (V-shaped) and 1 (Straight).

    • The absolute and relative frequencies of the variable vs can be calculated step by step using base R functions, or directly using summary functions from the vtable and gtsummary packages:

      • Load the dataset and display the variable

        # load the mtcars dataset
        data(mtcars)
        
        # display the values of the vs variable
        mtcars$vs
         [1] 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 1
      • Check the type (class) of the variable

        # check the class of the vs variable
        class(mtcars$vs)
        [1] "numeric"
      • Convert the variable to a factor and modify labels

        # convert the variable to a factor
        mtcars$vs <- 
          factor(
            mtcars$vs,
            levels = c(0, 1),
            labels = c("V-shaped", "Straight")
          )
        
        # display the new labels
        levels(mtcars$vs)
        [1] "V-shaped" "Straight"
      • Create a frequency distribution table

        # create a frequency distribution table for the vs variable
        freq_tbl <- table(mtcars$vs)
        freq_tbl
        
        V-shaped Straight 
              18       14 
      • Calculate the relative frequency

        # calculate the relative frequency
        rel_freq <- prop.table(freq_tbl)
        rel_freq
        
        V-shaped Straight 
          0.5625   0.4375 
      • Display the relative frequency as a percentage

        # display as percentage
        rbind(
          names(rel_freq), 
          paste0( 
            round(       
              (rel_freq * 100), 
              digits = 1
            ), 
            "%"
          )
        )
             [,1]       [,2]      
        [1,] "V-shaped" "Straight"
        [2,] "56.2%"    "43.8%"   

      • Using the vtable package:

      library(vtable)
      library(dplyr)
      
      mtcars %>% 
          select(vs) %>%  # select the vs variable 
          sumtable( 
              labels = 'Engine Type', # change the label of the vs variable 
              digits = 3, # set the number of digits 
          )
      Summary Statistics
      Variable        N    Percent
      Engine Type     32
      ... V-shaped    18   56.2%
      ... Straight    14   43.8%

      • Using the gtsummary package:

      library(gtsummary)
      library(dplyr)
      
      mtcars %>% 
          select(vs) %>%  # select the vs variable 
          tbl_summary(
              label = list(vs = "Engine Type"),  # change the label of the vs variable
              digits = vs ~ c(0, 1)  # set the number of digits, 0 for frequency and 1 for relative frequency
          )
      Characteristic    N = 32¹
      Engine Type
          V-shaped      18 (56.3%)
          Straight      14 (43.8%)
      ¹ n (%)
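  • The absolute and relative frequencies can also be collected into one table with base R. The sketch below works on a fresh copy of mtcars so that it does not depend on the earlier steps; the object names (`mt`, `engine`, `abs_n`, `rel_p`, `freq_summary`) are illustrative assumptions, not part of any package.

```r
# work on a fresh copy of the dataset
mt <- datasets::mtcars
engine <- factor(mt$vs, levels = c(0, 1), labels = c("V-shaped", "Straight"))

# absolute frequency: number of observations in each category
abs_n <- table(engine)

# relative frequency: each count divided by the total number of observations
rel_p <- abs_n / sum(abs_n)

# combine both into a single summary table
freq_summary <- data.frame(
  engine_type = names(abs_n),
  n           = as.vector(abs_n),
  proportion  = as.vector(rel_p),
  percent     = as.vector(rel_p) * 100
)
freq_summary
```

  • The proportions always sum to 1 (and the percentages to 100), which is a quick check that no category was dropped.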

2.1 Graphical representation

2.1.1 Bar chart

  • This type of chart is used to represent the frequency distribution of categorical variables.

  • It consists of a rectangular bar (column) for each category, where the height of the bar represents the absolute frequency or relative frequency of that category.

  • The bars can be arranged horizontally or vertically.

  • For ordinal scale variables, it is better to arrange the bars on the \(x\)-axis based on the order of the categories.

  • The frequency distribution of the vs variable can be represented using a bar chart as follows:

    • Using base R:

    # create a bar chart for the vs variable
    barplot(
      freq_tbl,
      main = "Frequency distribution of the Engine Type",
      xlab = "Engine Type",
      ylab = "Frequency",
      col = c("skyblue", "lightgreen"),
      border = "black",
      names.arg = c("V-shaped", "Straight"),
      ylim = c(0, 20),
      width = c(0.2,0.2), 
      xlim = c(0,1), 
      space = 1
      )

    • Using ggplot2 (vertical bars):

    library(tidyverse)
    library(ggthemes)
    library(patchwork)
    
    # create a data frame from the frequency table
    freq_tbl_df <- 
      data.frame(
        engine_type =
          factor(
            names(freq_tbl),
            levels = names(freq_tbl)
          ),
        freq = as.numeric(freq_tbl)
      )
    
    # create a new column for the relative frequency
    total_freq <- sum(freq_tbl)
    freq_tbl_df <- 
      freq_tbl_df %>%
        mutate(
          rel_freq = 
            round(
              (freq / total_freq) * 100,
              digits = 1
            )
        )
    
    # create a bar chart with frequency on y-axis
    fig1 <- 
      ggplot( 
        freq_tbl_df,
        aes(
          x = engine_type,
          y = freq,
          fill = engine_type
        )
     ) +
     geom_col(width = 0.5) + # you can also use geom_bar(stat = "identity")
     labs(
       title = "Frequency distribution \n of the Engine Type",
       x = "Engine Type",
       y = "Frequency"
     ) +
     scale_fill_manual(
       values = c("skyblue", "lightgreen")
     ) +
     scale_y_continuous(
       limits = c(0,20),
       expand = c(0, 0),
     ) +
     theme_few() +
     theme(
       axis.text.x = element_blank(), 
       legend.position = "none",
       plot.title = element_text(hjust = 0.5)
     )
    
    # create a bar chart with relative frequency on y-axis
    fig2 <- 
      ggplot(
        freq_tbl_df,
        aes(
          x = engine_type,
          y = rel_freq,
          fill = engine_type
        )
     ) +
     geom_col(width = 0.5) + 
     labs(
       title = "Relative Frequency \n distribution of the Engine Type",
       x = "Engine Type",
       y = "Relative Frequency (%)", 
       fill = "Engine Type"
     ) +
     scale_fill_manual(
     values = c("skyblue", "lightgreen")
     ) +
     scale_y_continuous(
       limits = c(0,60),
       expand = c(0, 0),
     ) +
     theme_few() +
     theme(
       axis.text.x = element_blank(),
       plot.title = element_text(hjust = 0.5)
     )
    
     fig1 + plot_spacer() + fig2 + plot_layout(widths = c(1, 0.05, 1))

    • Using ggplot2 (horizontal bars):

    # use the same code as in the previous ggplot2 example and add coord_flip() to the ggplot object
    
    ggplot(
      freq_tbl_df,
      aes(
        x = engine_type,
        y = rel_freq,
        fill = engine_type
      )
    ) +
    geom_col(width = 0.5) + 
    labs(
      title = "Relative Frequency distribution of the Engine Type",
      x = "Engine Type",
      y = "Relative Frequency (%)", 
      fill = "Engine Type"
    ) +
    scale_fill_manual(
      values = c("skyblue", "lightgreen")
    ) +
    scale_y_continuous(
      limits = c(0,60),
      expand = c(0, 0),
    ) +
    coord_flip() +
    theme_few() +
    theme(
      axis.text.y = element_blank(), 
    )

  • Bar charts can also be used to compare the frequency distribution of a categorical variable across different groups.

  • For example, the following bar chart shows the differences in the frequency distribution of the vs variable across the am variable (transmission type; 0: automatic and 1: manual) in the mtcars dataset:

    library(tidyverse)
    library(ggthemes)
    
    ggplot(
      mtcars,
      aes(
        x = factor(am),
        fill = factor(vs),
      )
    ) +
    geom_bar(
      position = position_dodge(width = 0.55), # add small gap between bars 
      width = 0.5
    ) +
    labs(
      title = "Frequency distribution of Engine Type by Transmission Type",
      x = "Transmission Type",
      y = "Frequency",
      fill = "Engine Type", 
    ) +
    scale_fill_manual(
      values = c("skyblue", "lightgreen")
    ) +
    scale_x_discrete(
      labels = c("Automatic", "Manual")  # change the labels of the x-axis
    ) +
    scale_y_continuous(
      expand = c(0, 0), 
      limits = c(0, 15)
    ) +
    theme_few() +
    theme(
      legend.position = "top",
      plot.title = element_text(hjust = 0.5)
    )

    • The above figure shows that the V-shaped engine is more common in cars with automatic transmission compared to manual transmission.
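    • The counts behind such a grouped bar chart can be checked with a two-way frequency table. The following is a minimal sketch using base R `table()` and `prop.table()` on a fresh copy of mtcars; the object names are illustrative assumptions.

```r
# work on a fresh copy of the dataset
mt <- datasets::mtcars
engine <- factor(mt$vs, levels = c(0, 1), labels = c("V-shaped", "Straight"))
transmission <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# two-way frequency table: rows = transmission type, columns = engine type
cross_tbl <- table(transmission, engine)
cross_tbl

# row-wise relative frequencies (%): share of each engine type
# within each transmission type
row_prop <- prop.table(cross_tbl, margin = 1)
round(row_prop * 100, digits = 1)
```

    • In this table, 12 of the 19 automatic cars have a V-shaped engine, compared with 6 of the 13 manual cars.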
Note
  • It is recommended to start the baseline of the y-axis at zero to avoid inaccurate interpretations.
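
  • For ordinal variables, the order of the bars can be controlled through the factor levels. The sketch below uses a small hypothetical set of responses (not taken from mtcars); the values and object names are assumptions for illustration only.

```r
# hypothetical ordinal responses (not from the mtcars dataset)
responses <- c("Medium", "Low", "High", "Medium", "Low", "Medium")

# declare the natural order of the categories explicitly;
# without `levels`, factor() would sort them alphabetically (High, Low, Medium)
severity <- factor(
  responses,
  levels = c("Low", "Medium", "High"),
  ordered = TRUE
)

# table() and barplot() follow the order of the factor levels
severity_freq <- table(severity)
barplot(
  severity_freq,
  main = "Frequency distribution of a hypothetical ordinal variable",
  xlab = "Category",
  ylab = "Frequency",
  ylim = c(0, 4)
)
```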

2.1.2 Pie chart

  • It is also used to represent the frequency distribution of categorical variables.

  • It consists of a circle divided into parts, where each part represents a category.

  • The central angle of each part equals the relative frequency of the category multiplied by \(360^\circ\) (the total angle of a circle).

  • For instance, the relative frequency of the V-shaped engine in the above example is \(\displaystyle \frac{18}{32} = 0.5625\), which corresponds to an angle of \(\displaystyle 0.5625 \times 360^\circ = 202.5^\circ\).

  • The pie chart is suitable for variables with a small number of categories; it becomes cluttered and hard to read when the number of categories is large.

  • If the number of observations is small, it is recommended to display the absolute frequency on the chart rather than the percentage, because percentages based on only a few observations can be misleading.

    • Using base R:

    # extract the relative frequencies from the data frame created above
    percentages <- freq_tbl_df$rel_freq
    
    # create labels with the percentages
    labels <- 
      paste0(c("V-shaped", "Straight"), 
             "\n", 
             percentages, 
             "%"
      )
    
    # create a pie chart 
    pie(
     freq_tbl,
     main = "Pie chart of the Engine Type",
     col = c("skyblue", "lightgreen"),
     labels = labels,  # Use the percentage labels
     clockwise = TRUE
    )
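
    • When absolute frequencies are preferred on the chart, the labels can be built from the counts instead of the percentages. The following is a minimal sketch with base R; the label format ("n = ...") and the object names are assumptions.

```r
# work on a fresh copy of the dataset
mt <- datasets::mtcars
engine <- factor(mt$vs, levels = c(0, 1), labels = c("V-shaped", "Straight"))
engine_freq <- table(engine)

# build labels that show the absolute frequency of each category
count_labels <- paste0(names(engine_freq), "\n", "n = ", as.vector(engine_freq))

# create a pie chart labelled with counts
pie(
  engine_freq,
  main = "Pie chart of the Engine Type",
  col = c("skyblue", "lightgreen"),
  labels = count_labels,
  clockwise = TRUE
)
```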

    • Using ggplot2:

    library(ggplot2)
    library(ggthemes)
    
    ggplot(
      freq_tbl_df,
      aes(
        x = "",
        y = rel_freq,
        fill = engine_type
      )
    ) +
    geom_bar(
      stat = "identity",
      width = 1
    ) +
    coord_polar("y", start = 0) +
    labs(
      title = "Pie chart of the Engine Type",
      fill = "Engine Type"
    ) +
    scale_fill_manual(
      values = c("skyblue", "lightgreen")
    ) +
    theme_void() +
    theme(
      legend.position = "none", 
      plot.title = element_text(hjust = 0.5)
    ) +
    geom_text(
      aes(label = labels), # `labels` is the character vector created in the base R example above
      position = position_stack(vjust = 0.5)
    )
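
  • The angle calculation described above can be reproduced for every category at once. The sketch below starts from the absolute frequencies of the vs variable (18 and 14); the object names are illustrative assumptions.

```r
# absolute frequencies of the two engine types
engine_freq <- c("V-shaped" = 18, "Straight" = 14)

# relative frequency of each category
engine_rel <- engine_freq / sum(engine_freq)

# central angle of each slice in degrees
slice_angle <- engine_rel * 360
slice_angle
# V-shaped: 202.5, Straight: 157.5
```

  • The angles sum to 360 degrees, so the slices together fill the whole circle.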

3 References

  • Daniel, W. W. and Cross, C. L. (2013). Biostatistics: A Foundation for Analysis in the Health Sciences, Tenth edition. Wiley.

  • Heumann, C., Schomaker, M., and Shalabh (2022). Introduction to Statistics and Data Analysis: With Exercises, Solutions and Applications in R. Springer.

  • Lane, D. M. et al. (2019). Introduction to Statistics. Online Edition. Retrieved September 14, 2024, from https://openstax.org/details/introduction-statistics

