Meta Data Science

By Massoud Seifi, Ph.D. Data Scientist

Plotting the Frequency Distribution Using R

Introduction

R is an open source language and environment for statistical computing and graphics. It is an implementation of the S language, which was developed at Bell Laboratories by John Chambers and colleagues. R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, and clustering. It is an interpreted language that can be used through a command-line interpreter: for example, if a user types “2+2” at the R command prompt and presses Enter, the computer replies with “4”. R is freely available under the GNU General Public License.

Plotting The Frequency Distribution

Frequency distribution

A frequency distribution shows the number of occurrences of each category of a categorical variable. For example, in a sample of users and their favourite colors, we can find out how many users like each color.
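As a minimal sketch of this idea, assuming a small made-up vector of answers (the `answers` vector below is hypothetical and is not the sample data used in this post), counting the occurrences of each category could look like this:

```r
# Hypothetical answers from six users (not the sample data set)
answers <- c('Blue', 'Red', 'Blue', 'Green', 'Blue', 'Red')

# Count how many times each distinct value occurs
counts <- table(answers)
print(counts)

# Look up the count for one specific color
counts[['Blue']]   # 3
```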

Data set

Suppose we have a data set of 30 records, each with a user ID, a favorite color, and a gender:

Sample Set (sample.csv)
ID,Color,Gender
792141B,Blue,Male
795156A,Blue,Male
795156B,Blue,Male
795156C,Blue,Male
795156E,Blue,Male
795156G,Blue,Female
795156I,Blue,Male
795156J,White,Male
795156M,Red,Male
795156N,Blue,Male
795156O,Green,Male
795156P,Red,Male
795156Q,Blue,Male
795156S,White,Male
795156T,Blue,Male
795156W,Red,Female
800731A,Red,Male
800731C,Blue,Male
800731E,Blue,Male
800731F,Blue,Female
800731G,Red,Male
800731I,Blue,Female
800731K,Blue,Female
800731M,Blue,Male
800731N,Blue,Female
800731O,Blue,Female
800731Q,Blue,Male
800731W,Blue,Male
800731X,Red,Male
800731Y,Red,Male

Reading the csv file

Let’s start by reading the CSV file:

data <- read.csv(file = 'sample.csv', header = TRUE, sep = ',')

The first argument, which is mandatory, is the name of the file. The second argument indicates whether the first row contains the column names, and the third argument specifies the delimiter. The command above reads the CSV file and assigns the result to a variable called “data”.
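To try these arguments without a file on disk, here is a small sketch that reads three rows copied from the sample through the `text` argument of read.csv. The names `csv_text` and `demo` are only illustrative, and `stringsAsFactors = TRUE` is an addition that is not in the command above: since R 4.0, text columns are read as plain character vectors unless it is set.

```r
# Three rows in the same layout as sample.csv
csv_text <- 'ID,Color,Gender
792141B,Blue,Male
795156G,Blue,Female
795156J,White,Male'

# header = TRUE: the first row holds the column names
# sep = ',': fields are separated by commas
# stringsAsFactors = TRUE: read text columns as factors (an assumption
# made here so that later factor-based output looks the same)
demo <- read.csv(text = csv_text, header = TRUE, sep = ',',
                 stringsAsFactors = TRUE)

str(demo)    # structure of the data frame: 3 observations of 3 variables
nrow(demo)   # 3
```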

You can use the following command to see the list of column names:

names(data)

which results in:

[1] "ID"     "Color"  "Gender"

Or you can use the following command to see a summary of the data:

summary(data)
   ID       Color       Gender
 792141B: 1   Blue :20   Female: 7
 795156A: 1   Green: 1   Male  :23
 795156B: 1   Red  : 7
 795156C: 1   White: 2
 795156E: 1
 795156G: 1
 (Other):24

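As you can see, the summary shows the number of occurrences of each color. Note that this output assumes the columns were read as factors, which was the default before R 4.0; in newer versions, pass stringsAsFactors = TRUE to read.csv to get the same summary.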
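The difference between character and factor columns can be seen on a single vector. The sketch below rebuilds the color column from the counts in the summary (the variable name `colors_chr` is illustrative, and the order of values does not match sample.csv):

```r
# Color values rebuilt from the counts shown in the summary
colors_chr <- c(rep('Blue', 20), 'Green', rep('Red', 7), rep('White', 2))

# On a character vector, summary() reports only length, class and mode
summary(colors_chr)

# On a factor, summary() reports the number of occurrences of each level
color_counts <- summary(factor(colors_chr))
print(color_counts)
```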

Table function

table() uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels.

table(data$Color)
 Blue Green   Red White
   20     1     7     2
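Since table() can cross-classify more than one factor, the same call also works with two columns. The following is a sketch only: the `demo` data frame rebuilds the Color and Gender columns from the counts in the sample rather than reading sample.csv, so the row order differs from the file.

```r
# Color and Gender rebuilt from the counts in the sample
# (23 men first, then 7 women; row order differs from sample.csv)
demo <- data.frame(
  Color  = c(rep('Blue', 14), 'Green', rep('Red', 6), rep('White', 2),
             rep('Blue', 6), 'Red'),
  Gender = c(rep('Male', 23), rep('Female', 7))
)

# One factor: plain counts, sorted from most to least frequent
sort(table(demo$Color), decreasing = TRUE)

# Two factors: one cell per combination of Color and Gender
by_gender <- table(demo$Color, demo$Gender)
print(by_gender)
```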

Plotting

Now we can plot it easily using the barplot command:

barplot(table(data$Color))

Save the plot as an image

I can see the plot on my machine, but to put it here on my weblog, I have to save it as an image:

dev.copy(png, 'freq.png')
dev.off()
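dev.copy needs a plot that is already on screen. In a script with no display, one alternative is to open the PNG device first and draw into it. This is a sketch with assumed values: the `counts` vector stands in for table(data$Color), and the output path and the 600 x 400 pixel size are arbitrary choices.

```r
# Stand-in for table(data$Color), using the counts from the sample
counts <- c(Blue = 20, Green = 1, Red = 7, White = 2)

# Open a PNG device, draw into it, then close it to write the file
out_file <- file.path(tempdir(), 'freq.png')
png(out_file, width = 600, height = 400)
barplot(counts)
dev.off()

file.exists(out_file)   # TRUE once the device has been closed
```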

Here you go…

Factor

The factor function is used to create a factor (or category) from a vector.

factor(data$Color)
 [1] Blue  Blue  Blue  Blue  Blue  Blue  Blue  White Red   Blue  Green Red
[13] Blue  White Blue  Red   Red   Blue  Blue  Blue  Red   Blue  Blue  Blue
[25] Blue  Blue  Blue  Blue  Red   Red
Levels: Blue Green Red White

The levels are the set of unique values in the vector.

Now, suppose that “Yellow” was also an option for the users, but nobody chose it as their favourite color. We can use the factor command to customize the categories:

factor(data$Color, levels = c('Blue', 'Green', 'Yellow', 'Red', 'White'))
 [1] Blue  Blue  Blue  Blue  Blue  Blue  Blue  White Red   Blue  Green Red
[13] Blue  White Blue  Red   Red   Blue  Blue  Blue  Red   Blue  Blue  Blue
[25] Blue  Blue  Blue  Blue  Red   Red
Levels: Blue Green Yellow Red White

Now, we can see Yellow in the frequency distribution:

table(factor(data$Color, levels = c('Blue','Green','Yellow','Red','White')))
  Blue  Green Yellow    Red  White
    20      1      0      7      2
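One thing to watch when setting the levels by hand: a value that is not listed in the levels becomes NA and drops out of the counts. The sketch below uses a hypothetical `answers` vector in which 'Purple' is not an allowed level:

```r
# Hypothetical answers; 'Purple' is not among the allowed levels
answers <- c('Blue', 'Purple', 'Red', 'Blue')
f <- factor(answers, levels = c('Blue', 'Green', 'Yellow', 'Red', 'White'))

table(f)                    # 'Purple' is left out of the counts
sum(is.na(f))               # 1: number of values that matched no level
table(f, useNA = 'ifany')   # adds an <NA> entry so nothing is lost
```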

And we can see it on the plot:

barplot(table(factor(data$Color, levels = c('Blue', 'Green', 'Yellow', 'Red', 'White'))))

If you want to see proportions instead of counts (multiply by 100 for percentages), you can try this:

t <- table(factor(data$Color, levels = c('Blue', 'Green', 'Yellow', 'Red', 'White')))
barplot(t / sum(t))
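Base R also has prop.table(), which performs the same division. The sketch below rebuilds the counts from the sample (instead of reading sample.csv) and converts them to percentages rounded to one decimal; the names `counts` and `pct` are illustrative.

```r
# Color counts rebuilt from the sample, with 'Yellow' as an empty level
counts <- table(factor(
  c(rep('Blue', 20), 'Green', rep('Red', 7), rep('White', 2)),
  levels = c('Blue', 'Green', 'Yellow', 'Red', 'White')
))

# prop.table() divides each count by the total, like counts / sum(counts)
pct <- round(100 * prop.table(counts), 1)
print(pct)   # Blue 66.7, Green 3.3, Yellow 0, Red 23.3, White 6.7

barplot(pct, ylab = 'Percentage of users')
```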

Filtering

Now, let’s imagine that we want to plot the frequency distribution of favourite colors for men and women separately. The following commands create two subsets of the data by filtering on gender and store them in two different variables. Don’t forget the comma: it tells R to select rows and keep all columns.

men <- data[data$Gender == 'Male',]

women <- data[data$Gender == 'Female',]
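The same filtering can also be written with subset() or split(). This is a sketch on a tiny hypothetical data frame (`demo`, with made-up IDs), not on the sample file:

```r
# Tiny hypothetical data frame in the same layout as the sample
demo <- data.frame(
  ID     = c('u1', 'u2', 'u3', 'u4'),
  Color  = c('Blue', 'Red', 'Blue', 'White'),
  Gender = c('Male', 'Female', 'Female', 'Male')
)

# Selects the same rows as demo[demo$Gender == 'Female', ]
women <- subset(demo, Gender == 'Female')
nrow(women)   # 2

# split() returns a named list with one data frame per gender
groups <- split(demo, demo$Gender)
names(groups)   # "Female" "Male"
```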

Now we can plot the distributions separately:

l <- c('Blue', 'Green', 'Yellow', 'Red', 'White')

barplot(table(factor(men$Color, levels = l)), main = 'Men')

barplot(table(factor(women$Color, levels = l)), main = 'Women')
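To compare the two plots directly, they can be drawn next to each other with par(mfrow = ...). In this sketch the vectors `men_colors` and `women_colors` are rebuilt from the counts in the sample rather than taken from the filtered data frames, and the shared y-axis limit of 15 is an arbitrary choice that fits the largest count.

```r
l <- c('Blue', 'Green', 'Yellow', 'Red', 'White')

# Favourite colors per gender, rebuilt from the counts in the sample
men_colors   <- c(rep('Blue', 14), 'Green', rep('Red', 6), rep('White', 2))
women_colors <- c(rep('Blue', 6), 'Red')

# Split the plotting area into 1 row and 2 columns, remembering the old setting
old_par <- par(mfrow = c(1, 2))

# The same ylim on both panels keeps the bar heights comparable
barplot(table(factor(men_colors, levels = l)), main = 'Men', ylim = c(0, 15))
barplot(table(factor(women_colors, levels = l)), main = 'Women', ylim = c(0, 15))

# Restore the previous layout
par(old_par)
```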

Colors and Labels

Do you like colors and labels?! Here you go…

l <- c('Blue','Green','Yellow','Red','White')

barplot(table(factor(data$Color, levels = l)), col = c('blue', 'green', 'yellow', 'red', 'white'), xlab = 'Favourite Color', ylab = 'Number Of Users')
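The counts themselves can also be written above the bars. barplot() returns the x position of each bar's midpoint, which text() can use. This is a sketch with counts rebuilt from the sample; the 15% headroom in `ylim` is an arbitrary choice to leave room for the labels.

```r
# Color counts rebuilt from the sample, with 'Yellow' as an empty level
counts <- table(factor(
  c(rep('Blue', 20), 'Green', rep('Red', 7), rep('White', 2)),
  levels = c('Blue', 'Green', 'Yellow', 'Red', 'White')
))

# barplot() returns the midpoint of each bar on the x axis
mids <- barplot(counts,
                col  = c('blue', 'green', 'yellow', 'red', 'white'),
                ylim = c(0, max(counts) * 1.15),   # headroom for the labels
                xlab = 'Favourite Color', ylab = 'Number Of Users')

# Write each count just above its bar (pos = 3 places the text above)
text(x = mids, y = as.vector(counts), labels = as.vector(counts), pos = 3)
```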
