Add Unique ID Column to CSV Files

Sep 22nd, 2015

If you want to add a unique id column to the end of a CSV file, you can simply do it via a bash script following the steps below:

Count the number of lines in the input file using wc command and save it to a variable.
Create a temoprary file using tempfile command.
Using uuidgen command, generate n(=#lines) unique ids and save them to the temporary file.
Merge the input file with the temporary file containing unique ids using paste command.
Remove the temporary file.

Add unique ids to a CSV file (gen_uid.sh) download

#!/bin/bash
# Author: Massoud Seifi, Ph.D. @ MetaDataScience.com

# Count the number of lines in the input file
lines=$(wc -l < "$1")

# Generate unique ids and store them in a temporary file
uids=$(tempfile)
for i in `seq 1 $lines`; do uuidgen -t; done > "$uids"

# Insert the unique ids at the end of each line of the input file
paste --delimiter "," "$1" "$uids"

# Remove the temporary file
rm "$uids"

For example, running the script on the sample CSV file below:

Sample CSV file (test.csv) download

Mike,Smith,29
Jane,Clark,18
Brian,Jones,21

$./gen_uid.sh test.csv

The output will be:

Mike,Smith,29,e802dbee-617a-11e5-9c46-0862664fe7a7
Jane,Clark,18,e802ef30-617a-11e5-9c46-0862664fe7a7
Brian,Jones,21,e80302e0-617a-11e5-9c46-0862664fe7a7

If first line of your CSV file is the column headings, you can use code below:

Add unique ids to a CSV file with headers (gen_uid_header.sh) download

#!/bin/bash
# Author: Massoud Seifi, Ph.D. @ MetaDataScience.com

# Count the number of lines in the input file
lines=$(wc -l < "$1")

# Generate unique ids and store them in a temporary file
uids=$(tempfile)
for i in `seq 1 $lines`; do if [ $i == 1 ]; then echo "uid"; else uuidgen -t; fi; done > "$uids"

# Insert the unique ids at the end of each line of the input file
paste --delimiter "," "$1" "$uids"

# Remove the temporary file
rm "$uids"

Sample CSV file with headers (test_header.csv) download

first_name,last_name,age
Mike,Smith,29
Jane,Clark,18
Brian,Jones,21

$./gen_uid_heade.sh test_header.csv

first_name,last_name,age,uid
Mike,Smith,29,76de1cc4-617c-11e5-9c46-0862664fe7a7
Jane,Clark,18,76de2de0-617c-11e5-9c46-0862664fe7a7
Brian,Jones,21,76de3f2e-617c-11e5-9c46-0862664fe7a7

Random Sampling From Very Large Files

Feb 27th, 2014

Random Sampling

Random sampling from a set of entities means any entity has the same chance of selection as any other such entities. Suppose we want to randomly select $k$ lines from a large text file containing hundreds of millions of lines. We desire that the probability of being selected be the same for every line in the file.

Algorithm 1

The first approach which comes in mind is to

Count the number of lines in the file,
Create a sorted random set of $k$ integers between 1 and number of lines in the file,
Iterate over the random integers and read the file line by line. Pick the line if the line number matches one of the the random integers.

This algorithm in Python is shown below.

Random Sampling 1 (random_sampler1.py) download

import random

def random_sampler(filename, k):
	sample = []
	with open(filename, 'rb') as f:
		linecount = sum(1 for line in f)
		f.seek(0)

		random_linenos = sorted(random.sample(xrange(linecount), k), reverse = True)
		lineno = random_linenos.pop()
		for n, line in enumerate(f):
			if n == lineno:
				sample.append(line.rstrip())
				if len(random_linenos) > 0:
					lineno = random_linenos.pop()
				else:
					break
	return sample

Algorithm 2: Reservoir sampling

As you see, in previous algorithm, we scan the file two times. First time for counting the number of lines in the file, and second time to select random lines. There are some algorithms which even work without knowing in advance the total number of items. One classical algorithm form Alan Waterman called Reservoir sampling is exposed in the second volume of Donald Knuth’s “The Art of Computer Programming”.

Suppose we want to select $k$ items from a set of items. We start by filling the “reservoir” with the first $k$ items, and then for each $i^{th}$ item remaining in the set, we generate a random number $r$ between $1$ and $i$. If $r$ is less than $k$, we replace the $r^{th}$ item of the reservoir with the $i^{th}$ item of the set. We continue processing items until we reach the end of the set.

Random Sampling 2 (random_sampler2.py) download

import random

def random_sampler(filename, k):
	sample = []
	with open(filename) as f:
		for n, line in enumerate(f):
			if n < k:
				sample.append(line.rstrip())
			else:
				r = random.randint(0, n)
				if r < k:
					sample[r] = line.rstrip()
	return sample

It is easy to prove by induction that this approach works and each line has the same probability of being selected as the other lines:

Suppose we need to collect a random sample of $k$ items from a list of items coming as an online stream. We desire that after seeing $n$ item, each item in the sample set had $\frac{k}{n}$ chance to be there.

For example, suppose $k=10$. According to the algorithm, the first $10$ items go directly to the reservoir. So for them, the probability of being selected is $\frac{10}{10} = 1 \checkmark$.

Now, suppose the $11^{th}$ item comes. The desired probability is now $\frac{k}{n} = \frac{10}{11}$. We have:

According to the reservoir sampling algorithm above, the probability of $11^{th}$ item to being selected is $\frac{10}{11} \checkmark$.
For the items already in the reservoir, the chance of being in the sample set and also remaining in the sample set after seeing the $11^{th}$ item, is their previous probability to be there, multiple the probability of not being replace by the $11^{th}$. So we have:

Pr = Probability that a selected item remains in the reservoir

= Previous probability to be there * Probability of not being replaced

= Previous probability to be there * ( 1 - Probability of being replaced by $11^{th}$ item)

The chance that an item in the reservoir being replaced with $11^{th}$ item is the probability of $11^{th}$ item to be selected, which is $\frac{10}{11}$, multiple the probability of being the replacement candidate between 10 items, which is $\frac{1}{10}$. So we have: $Pr = \frac{10}{10}*(1-\frac{10}{11}*\frac{1}{10})=\frac{10}{11} \checkmark$ .

Likewise, for the $12^{th}$ item we have:

Probability of $12^{th}$ item to being selected is $\frac{10}{12} \checkmark$.
For the items already in the reservoir: $Pr = \frac{10}{11}*(1-\frac{10}{12}*\frac{1}{10})=\frac{10}{12} \checkmark$

And this can be extended for the $n^{th}$ item. Although reservoir sampling is an interesting approach but it is too slow for our problem here.

Algorithm 3

There is another interesting approach when the lines have approximately the same length (for example, we deal with a huge list of email addresses). In this case, there is a correlation between line numbers and the file size. So, we can use the algorithm below:

Random Sampling 3 (random_sampler3.py) download

import random

def random_sampler(filename, k):
	sample = []
	with open(filename, 'rb') as f:
		f.seek(0, 2)
		filesize = f.tell()

		random_set = sorted(random.sample(xrange(filesize), k))

		for i in xrange(k):
			f.seek(random_set[i])
			# Skip current line (because we might be in the middle of a line) 
			f.readline()
			# Append the next line to the sample set 
			sample.append(f.readline().rstrip())

	return sample

Basically, we get the file size. Create a sorted random set of k random positions in the file (between 1 and the file size). For each random position, we seek that position, skip a line, and put the next line to the sample set.

Benchmark

The table below shows the elapsed time for selecting 1000 lines from a large (~ 40M lines) and a very large file(~ 300M lines) for each algorithm. We see that the algorithm 3 is much faster. As I mentioned before, the only assumption is that the lines should have approximately the same length.

Algorithm	File 1 (~ 40M lines)	File 2 (~ 300M lines)
random_sampler1.py	6.641s	1m14.184s
random_sampler2.py	50.406s	6m51.078s
random_sampler3.py	0.019s	3.119s

How to Get Twitter Username From Twitter ID

Aug 13th, 2013

Twitter ID and Username

The Twitter ID is a unique number identifying an account on Twitter. Upon registration, users can also choose a username (a.k.a. screen name or @handle). However an account can never change its Twitter ID, but it can change its username.

Getting Twitter username from Twitter ID using the Twitter API (rate limited)

I was trying to lookup some Twitter IDs and find out what their corresponding usernames are. My first approach was using the code below using http://twitter.com/users/show/[ID]?format=json as the endpoint:

Getting Twitter username from Twitter ID using the Twitter API (limited) (TwitterUserIdToScreenName_limited.php) download

<?php
/**
 * Getting the Twitter usename from Twitter ID
 * This approach does not work in large scale. It's limited to 150 requests per hour.
 *
 * @author Massoud Seifi, Ph.D. @ MetaDataScience.com
 */

class TwitterUserIdToScreenName {

	protected $href_base = 'http://twitter.com/users/show/';

	public function getScreenNameFromUserId($twitter_user_id){

		if (!is_numeric($twitter_user_id)){
		    return false;
		}

		$href = $this->href_base . $twitter_user_id . '?format=json';

		$ch = curl_init();
		curl_setopt($ch, CURLOPT_URL, $href);
		curl_setopt($ch, CURLOPT_HEADER         , false);
		curl_setopt($ch, CURLOPT_RETURNTRANSFER , true);
		curl_setopt($ch, CURLOPT_FOLLOWLOCATION , true);

		$result = curl_exec($ch);
		curl_close($ch);

		$json = json_decode($result, true);
		
		if (isset($json['screen_name'])){
			return $json['screen_name'];
		}

	}

}

$p = new TwitterUserIdToScreenName();

while ($line = fgets(STDIN)){
	$id = trim($line);
	$screen_name = $p->getScreenNameFromUserId($id);
	if ( is_string($screen_name) ){
		echo "$id,$screen_name\n";
	}
}

This code calls Twitter API which has two issues:

It needs an access token after the release of Twitter API version 1.1.
The number of requests is limited to 150 requests per hour.

Getting Twitter username from Twitter ID without using the Twitter API (scalable)

I tried another approach using http://twitter.com/account/redirect_by_id?id= which doesn’t call the Twitter API and consequently has no rate limit. This approach worked fine for me and I could easily look up thousands of Twitter IDs. Here is the code:

Getting Twitter username from Twitter ID without using the Twitter API (TwitterUserIdToScreenName.php) download

<?php
/**
 * Getting the Twitter usename from Twitter ID
 * This code works fine in large scale.
 *
 * @author Massoud Seifi, Ph.D. @ MetaDataScience.com
 */

class TwitterUserIdToScreenName {

	protected $href_base = 'http://twitter.com/account/redirect_by_id?id=';

	public function getScreenNameFromUserId($twitter_user_id)
	{

		if (!is_numeric($twitter_user_id)){
		    return false;
		}

		$href = $this->href_base . $twitter_user_id;

		$ch = curl_init();
		curl_setopt($ch, CURLOPT_URL, $href);
		curl_setopt($ch, CURLOPT_NOBODY, true);
		curl_setopt($ch, CURLOPT_HEADER         , true);
		curl_setopt($ch, CURLOPT_RETURNTRANSFER , true);
		curl_setopt($ch, CURLOPT_FOLLOWLOCATION , true);

		$result = curl_exec($ch);
		$info =  curl_getinfo($ch);
		curl_close($ch);

		if (isset($info['url'])){
			return  str_replace( '/', '', parse_url($info['url'], PHP_URL_PATH));
		}

	}

}

$p = new TwitterUserIdToScreenName();

while ($line = fgets(STDIN)){
	$id = trim($line);
	$screen_name = $p->getScreenNameFromUserId($id);
	if ( is_string($screen_name) ){
		echo "$id,$screen_name\n";
	}
}

Usage Example:

> echo 1180576736 | php TwitterUserIdToScreenName.php
1180576736,AccessToken

or you can lookup a list of IDs:

> cat list_IDs | php TwitterUserIdToScreenName.php

← Older Blog Archives

Meta Data Science

By Massoud Seifi, Ph.D. Data Scientist