
Meta Data Science

By Massoud Seifi, Ph.D. Data Scientist

Plotting the Frequency Distribution Using R

Introduction

R is an open source language and environment for statistical computing and graphics. It is an implementation of the S language, which was developed at Bell Laboratories by John Chambers and colleagues. R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others. It is also an interpreted language and can be used through a command-line interpreter: for example, if a user types “2+2” at the R command prompt and presses Enter, the computer replies with “4”. R is freely available under the GNU General Public License.

Plotting The Frequency Distribution

Frequency distribution

A frequency distribution shows the number of occurrences in each category of a categorical variable. For example, in a sample set of users with their favourite colors, we can find out how many users like a specific color.

Data set

Suppose we have a data set of 30 records, each with a user ID, favorite color, and gender:

Sample Set (sample.csv)
ID,Color,Gender
792141B,Blue,Male
795156A,Blue,Male
795156B,Blue,Male
795156C,Blue,Male
795156E,Blue,Male
795156G,Blue,Female
795156I,Blue,Male
795156J,White,Male
795156M,Red,Male
795156N,Blue,Male
795156O,Green,Male
795156P,Red,Male
795156Q,Blue,Male
795156S,White,Male
795156T,Blue,Male
795156W,Red,Female
800731A,Red,Male
800731C,Blue,Male
800731E,Blue,Male
800731F,Blue,Female
800731G,Red,Male
800731I,Blue,Female
800731K,Blue,Female
800731M,Blue,Male
800731N,Blue,Female
800731O,Blue,Female
800731Q,Blue,Male
800731W,Blue,Male
800731X,Red,Male
800731Y,Red,Male

Reading the csv file

Let’s start with reading the csv file:

data <- read.csv(file = 'sample.csv', header = TRUE, sep = ',')

The first argument, which is mandatory, is the name of the file. The second argument indicates whether the first row contains the column names, and the third argument specifies the delimiter. The above command reads the csv file and assigns the result to a variable called “data”.

You can use the following command to see the list of column names:

names(data)

which returns:

[1] "ID"     "Color"  "Gender"

Or you can use the following command to see a summary of the data:

summary(data)

   ID       Color       Gender
 792141B: 1   Blue :20   Female: 7
 795156A: 1   Green: 1   Male  :23
 795156B: 1   Red  : 7
 795156C: 1   White: 2
 795156E: 1
 795156G: 1
 (Other):24

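As you can see, the summary shows the number of occurrences of each color. (This per-level summary requires the columns to be factors; in R 4.0.0 and later, pass stringsAsFactors = TRUE to read.csv to get it.)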
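Here is a minimal sketch of inspecting the data right after import. It assumes only the first few rows of sample.csv, inlined through the `text` argument so that it runs without the file, and it sets `stringsAsFactors = TRUE` explicitly so that the result does not depend on the R version.

```r
# First rows of sample.csv, inlined so the example runs without the file.
csv_text <- "ID,Color,Gender
792141B,Blue,Male
795156G,Blue,Female
795156J,White,Male
795156M,Red,Male
795156O,Green,Male"

# stringsAsFactors = TRUE turns the text columns into factors, so summary()
# reports per-level counts regardless of the R version.
data <- read.csv(text = csv_text, header = TRUE, sep = ",",
                 stringsAsFactors = TRUE)

str(data)      # column types and a preview of the values
head(data, 3)  # first three records
summary(data)  # per-level counts for the factor columns
```

str() is a quick way to confirm that Color and Gender were read as factors before you rely on the summary.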

Table function

table() uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels.

table(data$Color)

 Blue Green   Red White
   20     1     7     2

Plotting

Now we can plot it easily using the barplot command:

barplot(table(data$Color))
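The counts can also be sorted before plotting, which often makes a bar plot easier to read. The sketch below rebuilds the color column from the counts shown above (an assumption made only to keep the example self-contained) and sorts the table from most to least frequent.

```r
# Color column rebuilt from the counts above: 20 Blue, 1 Green, 7 Red, 2 White.
colors <- rep(c("Blue", "Green", "Red", "White"), times = c(20, 1, 7, 2))

counts <- table(colors)

# Sort the categories from most to least frequent.
sorted_counts <- sort(counts, decreasing = TRUE)
print(sorted_counts)

# Name of the most common category.
names(sorted_counts)[1]

# Bars now appear in descending order of frequency.
barplot(sorted_counts)
```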

Save the plot as an image

I can see the plot on my machine, but to put it here on my weblog, I have to save it as an image:

dev.copy(png, 'freq.png')
dev.off()

Here you go…
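dev.copy() copies whatever is currently drawn on the active device. As an alternative, you can open a PNG device first and draw straight into the file. The sketch below assumes a temporary output path and an arbitrary image size; neither comes from the steps above.

```r
counts <- table(rep(c("Blue", "Green", "Red", "White"), times = c(20, 1, 7, 2)))

# Hypothetical output path; a temporary directory keeps the example tidy.
out_file <- file.path(tempdir(), "freq.png")

png(out_file, width = 600, height = 400)  # open a PNG device (size in pixels)
barplot(counts)                           # draw into the file, not a window
dev.off()                                 # close the device to write the file

file.exists(out_file)
```

This form is handy in scripts, where there may be no plot window to copy from.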

Factor

The factor function is used to create a factor (or category) from a vector.

factor(data$Color)

 [1] Blue  Blue  Blue  Blue  Blue  Blue  Blue  White Red   Blue  Green Red
[13] Blue  White Blue  Red   Red   Blue  Blue  Blue  Red   Blue  Blue  Blue
[25] Blue  Blue  Blue  Blue  Red   Red
Levels: Blue Green Red White

The levels are the set of unique values in the vector.

Now, suppose that “Yellow” was also an option for the users but nobody has chosen it as the favourite color. We can use the factor command to customize the categories:

factor(data$Color, levels = c('Blue', 'Green', 'Yellow', 'Red', 'White'))

 [1] Blue  Blue  Blue  Blue  Blue  Blue  Blue  White Red   Blue  Green Red
[13] Blue  White Blue  Red   Red   Blue  Blue  Blue  Red   Blue  Blue  Blue
[25] Blue  Blue  Blue  Blue  Red   Red
Levels: Blue Green Yellow Red White

Now, we can see Yellow in the frequency distribution:

table(factor(data$Color, levels = c('Blue','Green','Yellow','Red','White')))

  Blue  Green Yellow    Red  White
    20      1      0      7      2

And we can see it on the plot:

barplot(table(factor(data$Color, levels = c('Blue', 'Green', 'Yellow', 'Red', 'White'))))
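Here is a short sketch of how customized levels behave. The color vector is rebuilt from the counts above to keep the example self-contained, and "Purple" is a made-up value used only to show what happens to entries that are not among the declared levels.

```r
colors <- rep(c("Blue", "Green", "Red", "White"), times = c(20, 1, 7, 2))
all_levels <- c("Blue", "Green", "Yellow", "Red", "White")

f <- factor(colors, levels = all_levels)

levels(f)            # includes "Yellow" although no record uses it
counts <- table(f)   # unused levels are counted as 0
counts[["Yellow"]]

# droplevels() removes levels that never occur.
levels(droplevels(f))

# Values outside the declared levels silently become NA, so watch for typos.
sum(is.na(factor(c("Blue", "Purple"), levels = all_levels)))
```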

If you want to see the proportions instead of the counts (multiply by 100 for percentages), you can try this:

t <- table(factor(data$Color, levels = c('Blue', 'Green', 'Yellow', 'Red', 'White')))
barplot(t / sum(t))
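Base R also provides prop.table() for the same calculation. The sketch below rebuilds the color column from the counts above (an assumption for self-containment) and converts the proportions to rounded percentages; the axis label and limits are my own choices.

```r
colors <- rep(c("Blue", "Green", "Red", "White"), times = c(20, 1, 7, 2))
counts <- table(factor(colors,
                       levels = c("Blue", "Green", "Yellow", "Red", "White")))

props <- prop.table(counts)    # same as counts / sum(counts)
pct <- round(100 * props, 1)   # percentages with one decimal place
print(pct)

barplot(pct, ylab = "Percentage of users", ylim = c(0, 100))
```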

Filtering

Now, let’s imagine that we want to plot the frequency distribution of favourite colors for men and women separately. The following commands create two subsets of the data by filtering on gender and store them in two different variables (don’t forget the comma!):

men <- data[data$Gender == 'Male',]

women <- data[data$Gender == 'Female',]

Now we can plot the distributions separately:

l <- c('Blue', 'Green', 'Yellow', 'Red', 'White')

barplot(table(factor(men$Color, levels = l)), main = 'Men')

barplot(table(factor(women$Color, levels = l)), main = 'Women')
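Instead of filtering into two data frames, table() can cross-classify both columns at once. This is a sketch rather than a replacement for the approach above: the data frame is rebuilt from the per-gender counts in the sample set (14 Blue, 1 Green, 6 Red, 2 White for men; 6 Blue, 1 Red for women).

```r
# Per-gender color counts rebuilt from the sample set.
data <- data.frame(
  Color  = c(rep(c("Blue", "Green", "Red", "White"), times = c(14, 1, 6, 2)),
             rep(c("Blue", "Red"), times = c(6, 1))),
  Gender = rep(c("Male", "Female"), times = c(23, 7))
)
data$Color <- factor(data$Color,
                     levels = c("Blue", "Green", "Yellow", "Red", "White"))

# Two-way contingency table: one row per gender, one column per color.
by_gender <- table(data$Gender, data$Color)
print(by_gender)

# beside = TRUE draws the genders side by side instead of stacked.
barplot(by_gender, beside = TRUE, legend.text = rownames(by_gender),
        xlab = "Favourite Color", ylab = "Number Of Users")
```

A single grouped plot makes the two distributions easier to compare than two separate plots with different y-axis scales.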

Colors and Labels

Do you like colors and labels?! Here you go…

l <- c('Blue','Green','Yellow','Red','White')

barplot(table(factor(data$Color, levels = l)) , col = c('blue', 'green', 'yellow', 'red', 'white'), xlab = 'Favourite Color', ylab = 'Number Of Users')
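If you also want the counts printed above the bars, one option is to use the bar midpoints that barplot() returns. In the sketch below, the color vector is rebuilt from the counts above, and the title and y-axis headroom are my own choices.

```r
colors <- rep(c("Blue", "Green", "Red", "White"), times = c(20, 1, 7, 2))
l <- c("Blue", "Green", "Yellow", "Red", "White")
counts <- table(factor(colors, levels = l))

# barplot() returns the x positions of the bar midpoints.
# ylim leaves some headroom so the label above the tallest bar fits.
mids <- barplot(counts, col = tolower(l), ylim = c(0, max(counts) * 1.2),
                main = "Favourite Colors",
                xlab = "Favourite Color", ylab = "Number Of Users")

# Write each count just above its bar (pos = 3 means "above").
text(x = mids, y = as.vector(counts), labels = as.vector(counts), pos = 3)
```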

Lookup Table for Inferring Facebook Account Creation Date From Facebook User ID

In my previous post, I explained how we can estimate the account creation date of Facebook accounts that have a 15 digit UID without having to call the Facebook API and just based on the user’s Facebook UID.

The table below shows the correlation between Facebook UID and Facebook account creation date for the sample set that I analysed. The table is in CSV format, with the following columns:

Facebook UID, Account Creation Date(timestamp), Account Creation Date(date).

Note #1: To respect the users’ privacy, I have hidden the last 5 digits of the UIDs. You may replace ‘x’ with ‘0’ and it should not cause any problem.

Note #2: For more accurate results, this table should be kept up to date.

Correlation between Facebook UID and Facebook Account Creation Date (fbid_accountage.csv)
1000053132xxxxx,1361330314,2013-02-19
1000049732xxxxx,1357606484,2013-01-07
1000047422xxxxx,1354021840,2012-11-27
1000047355xxxxx,1353417806,2012-11-20
1000046843xxxxx,1352662415,2012-11-11
1000046015xxxxx,1350999158,2012-10-23
1000035523xxxxx,1349467776,2012-10-05
1000040785xxxxx,1343717040,2012-07-30
1000041143xxxxx,1342928040,2012-07-21
1000038945xxxxx,1338526722,2012-05-31
1000036032xxxxx,1331873652,2012-03-15
1000031133xxxxx,1320583505,2011-11-06
1000031024xxxxx,1320464096,2011-11-04
1000029834xxxxx,1318571069,2011-10-13
1000029345xxxxx,1315974235,2011-09-13
1000026042xxxxx,1309652553,2011-07-02
1000023280xxxxx,1306728328,2011-05-29
1000024582xxxxx,1304995827,2011-05-09
1000023732xxxxx,1303537065,2011-04-22
1000022413xxxxx,1302326877,2011-04-08
1000022328xxxxx,1300810582,2011-03-22
1000019352xxxxx,1295628516,2011-01-21
1000014241xxxxx,1285972221,2010-10-01
1000013861xxxxx,1281882953,2010-08-15
1000014436xxxxx,1280116994,2010-07-25
1000012117xxxxx,1276055448,2010-06-08
1000010697xxxxx,1274090432,2010-05-17
1000010425xxxxx,1272438522,2010-04-28
1000008600xxxxx,1268201411,2010-03-09
1000008113xxxxx,1267667333,2010-03-03
1000006286xxxxx,1266618961,2010-02-19
1000006189xxxxx,1263726284,2010-01-17
1000006449xxxxx,1262406605,2010-01-01
1000003298xxxxx,1261112448,2009-12-17
1000005651xxxxx,1259793952,2009-12-02
1000005426xxxxx,1259605238,2009-11-30
1000005072xxxxx,1258400669,2009-11-16
1000004668xxxxx,1257502719,2009-11-06
1000002286xxxxx,1252567838,2009-09-10
1000001160xxxxx,1250562107,2009-08-17
1000001568xxxxx,1250382196,2009-08-15
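Here is a minimal sketch of how such a table could be used in R. It inlines only five rows copied from the table above. Linear interpolation between rows is my own assumption about how to fill the gaps, and `estimate_created` is a hypothetical helper name.

```r
# A few rows copied from the table above, inlined to keep this self-contained.
csv_text <- "1000053132xxxxx,1361330314,2013-02-19
1000038945xxxxx,1338526722,2012-05-31
1000023732xxxxx,1303537065,2011-04-22
1000010425xxxxx,1272438522,2010-04-28
1000001160xxxxx,1250562107,2009-08-17"

lookup <- read.csv(text = csv_text, header = FALSE,
                   col.names = c("uid", "timestamp", "date"),
                   colClasses = c("character", "numeric", "character"))

# Replace the masked digits with zeros; 15-digit values fit exactly in a double.
lookup$uid <- as.numeric(gsub("x", "0", lookup$uid))
lookup <- lookup[order(lookup$uid), ]

# Hypothetical helper: linearly interpolate the timestamp for a UID.
# rule = 2 clamps UIDs outside the table to the nearest known row.
estimate_created <- function(uid) {
  ts <- approx(lookup$uid, lookup$timestamp, xout = uid, rule = 2)$y
  as.POSIXct(ts, origin = "1970-01-01", tz = "UTC")
}

estimate_created(c(100001042500000, 100003000000000))
```

With the full table, sorting by UID matters because a few rows are out of order, and duplicated UID prefixes would need to be resolved before interpolating.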

Inferring Facebook Account Creation Date From Facebook User ID

Calling the Facebook API is a (relatively) slow operation, especially if you have to call it multiple times. So, when possible, it is a good idea to get the information you need without making API calls.

Here I show you how to figure out the creation date of a Facebook account without having to call the Facebook API, just based on the user’s Facebook UID.

The Bad Way To Do It

As I explained in my previous post, it is possible to estimate the Facebook account creation date by retrieving the date of the user’s oldest post. This method has a few drawbacks:

Drawback #1: You must have the ‘read_stream’ permission, which is an extended Facebook permission to read the user’s post stream. From a user’s point of view, this sounds scarier than the other basic permissions you probably ask for.

Drawback #2: As an extended permission, it triggers a second permission screen that dramatically increases the UX friction for users. (You want low-friction UX.)

Drawback #3: Walking the entire post stream to determine the account age is very costly for such a simple piece of information. (You have to call the Facebook API over and over and over and over … again, since the post stream is paginated. I.e., this is at best an O(n) operation, where “n” relates to the user’s activity on Facebook.)

My Search For A Better Way

To overcome these issues, I tried to find an alternative, asynchronous approach. I wondered whether it is possible to estimate a Facebook account creation date by looking at the Facebook user ID. I couldn’t find any official documentation on how Facebook generates new user IDs or how they do that in a scalable fashion. One answer I could find was from Jack Lindamood, Software Engineer at Facebook 2008-2012, which I found here:

‘Lots’ of MySQL DBs. Each with their own unique number. Also, each has an autoincrement table. Then it’s just some math on the autoincrement value + unique_number * some_cap_per_db (it’s a bit more complicated due to special cases, but that’s pretty much how it works).

Another explanation was from Justin Mitchell, former engineering manager. He explains here the history of Facebook user ID numbering system:

Facebook’s user ID schema reflects the history of the site as it transitioned from a single-server single-school operation to 400 million users. User ID assignment has gone through several phases, notably:

Harvard only. Facebook (or thefacebook.com, as it was called back then) was opened up to Harvard running off a single box that had mysql and apache. IDs were auto-incremented, starting at 4 (hi Zuck).

Other schools. Other schools were initially completely separate sites, operating on their own boxes. IDs were still auto-increment per SQL box, but each server/school had a different prefix. For instance, all Columbia IDs are between 100000-199999 and all Stanford IDs are between 200000-299999. You can determine what school any early Facebook user attended based on his or her user ID.

High schools. Someone must have figured out that this ID system didn’t scale very well, so Facebook changed its DB layout when high schools were introduced. While all the college users maintained their current DB, high school users were randomly assigned to one of many many high school DBs. These users IDs hash to the correct database, rather than simply being floor(ID / 100000).

Open registration. Facebook maintained a similar layout once open reg was launched, except the new databases weren’t signified as “high school.”

64 bit. Given Facebook’s growth rate, it was estimated that the entire world would be on the site by 2011, overflowing 32-bit space. While we considered limiting the site to the first 4-billion people to register and lobbying governments to reduce the world’s population, the growth team pushed pretty hard to just increase the ID space to 64-bit.

Using Facebook UIDs For Predictions

So it seems that new Facebook IDs are 64 bits and contain 15 digits. There is a post dating from October 2007 that mentions that Facebook had planned this a long time ago, but according to this post from May 2009, Facebook was going to release 64-bit user IDs in 2009.

I studied the correlation between Facebook user ID and account creation date for a tiny sample set of 77 Facebook accounts. 41 accounts in this sample set had a user ID containing 15 digits, and the remaining 36 had a user ID with fewer than 15 digits. The figures below illustrate this correlation separately for 64-bit UIDs (left) and old-style UIDs (right).

The graph on the left is for the new(er) Facebook UIDs. The graph on the right is for the old style Facebook IDs. You can see that the correlation between Facebook UID and its creation date is a lot better for the new(er) Facebook UIDs than the old ones.

In other words, there is an interesting correlation between Facebook user ID and account creation date for 64-bit user IDs (see the figure on the left). Also, in this sample set, accounts with old UIDs are more than 800 days old (see the figure on the right). The overlap between the two graphs might correspond to a period when Facebook was moving from old UIDs to 64-bit ones.

Therefore, as an alternative approach to estimating the Facebook account creation date, we may leverage the monotonically increasing property of 64-bit Facebook user IDs and create a table of bounds that would give us at least a quarterly estimate of the creation date for the account, an appropriate level of granularity for this purpose. Taking this approach will reduce the number of permissions your application needs, dramatically decrease the processing time, and remove a source of variability in the time it takes to deliver a response.
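A table of bounds like this could be queried with findInterval(). The sketch below uses five rows from my sample table as lower bounds, with the hidden digits set to 0. The helper name `estimate_quarter` and the choice to label a UID with the quarter of its closest lower bound are assumptions made for illustration.

```r
# Lower bounds taken from rows of the sample table (masked digits set to 0),
# sorted by UID.
bounds <- data.frame(
  uid  = c(100000116000000, 100001042500000, 100002373200000,
           100003894500000, 100005313200000),
  date = as.Date(c("2009-08-17", "2010-04-28", "2011-04-22",
                   "2012-05-31", "2013-02-19"))
)

# Hypothetical helper: label a UID with the year and quarter of its lower bound.
estimate_quarter <- function(uid) {
  idx <- findInterval(uid, bounds$uid)  # 0 means "below the first bound"
  idx[idx == 0] <- NA                   # no estimate for UIDs below the table
  d <- bounds$date[idx]
  ifelse(is.na(d), NA_character_, paste(format(d, "%Y"), quarters(d)))
}

estimate_quarter(c(100002500000000, 100005000000000, 42))
```

Because each UID is mapped to the date of the bound below it, the estimate errs on the early side. A denser table narrows that gap.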

Update (March 14, 2013): See here to download the data set.