#!/bin/bash# Author: Massoud Seifi, Ph.D. @ MetaDataScience.com# Count the number of lines in the input filelines=$(wc -l < "$1")# Generate unique ids and store them in a temporary fileuids=$(tempfile)for i in `seq 1 $lines`; do uuidgen -t; done > "$uids"# Insert the unique ids at the end of each line of the input filepaste --delimiter ",""$1""$uids"# Remove the temporary filerm "$uids"
For example, running the script on the sample CSV file below:
If first line of your CSV file is the column headings, you can use code below:
Add unique ids to a CSV file with headers (gen_uid_header.sh)download
123456789101112131415
#!/bin/bash# Author: Massoud Seifi, Ph.D. @ MetaDataScience.com# Count the number of lines in the input filelines=$(wc -l < "$1")# Generate unique ids and store them in a temporary fileuids=$(tempfile)for i in `seq 1 $lines`; do if[$i== 1 ]; then echo"uid"; else uuidgen -t; fi; done > "$uids"# Insert the unique ids at the end of each line of the input filepaste --delimiter ",""$1""$uids"# Remove the temporary filerm "$uids"
Sample CSV file with headers (test_header.csv)download
Random sampling from a set of entities means any entity has the same chance of selection as any other such entities.
Suppose we want to randomly select $k$ lines from a large text file containing hundreds of millions of lines. We desire that the probability of being selected be the same for every line in the file.
Algorithm 1
The first approach which comes in mind is to
Count the number of lines in the file,
Create a sorted random set of $k$ integers between 1 and number of lines in the file,
Iterate over the random integers and read the file line by line.
Pick the line if the line number matches one of the the random integers.
As you see, in previous algorithm, we scan the file two times. First time for counting the number of lines in the file, and second time to select random lines.
There are some algorithms which even work without knowing in advance the total number of items. One classical algorithm form Alan Waterman called Reservoir sampling is exposed in the second volume of Donald Knuth’s “The Art of Computer Programming”.
Suppose we want to select $k$ items from a set of items. We start by filling the “reservoir” with the first $k$ items, and then for each $i^{th}$ item remaining in the set, we generate a random number $r$ between $1$ and $i$. If $r$ is less than $k$, we replace the $r^{th}$ item of the reservoir with the $i^{th}$ item of the set. We continue processing items until we reach the end of the set.
It is easy to prove by induction that this approach works and each line has the same probability of being selected as the other lines:
Suppose we need to collect a random sample of $k$ items from a list of items coming as an online stream. We desire that after seeing $n$ item, each item in the sample set had $\frac{k}{n}$ chance to be there.
For example, suppose $k=10$. According to the algorithm, the first $10$ items go directly to the reservoir. So for them, the probability of being selected is $\frac{10}{10} = 1 \checkmark$.
Now, suppose the $11^{th}$ item comes. The desired probability is now $\frac{k}{n} = \frac{10}{11}$. We have:
According to the reservoir sampling algorithm above, the probability of $11^{th}$ item to being selected is $\frac{10}{11} \checkmark$.
For the items already in the reservoir, the chance of being in the sample set and also remaining in the sample set after seeing the $11^{th}$ item, is their previous probability to be there, multiple the probability of not being replace by the $11^{th}$. So we have:
Pr = Probability that a selected item remains in the reservoir
= Previous probability to be there * Probability of not being replaced
= Previous probability to be there * ( 1 - Probability of being replaced by $11^{th}$ item)
The chance that an item in the reservoir being replaced with $11^{th}$ item is the probability of $11^{th}$ item to be selected, which is $\frac{10}{11}$, multiple the probability of being the replacement candidate between 10 items, which is $\frac{1}{10}$. So we have: .
Likewise, for the $12^{th}$ item we have:
Probability of $12^{th}$ item to being selected is $\frac{10}{12} \checkmark$.
For the items already in the reservoir:
And this can be extended for the $n^{th}$ item. Although reservoir sampling is an interesting approach but it is too slow for our problem here.
Algorithm 3
There is another interesting approach when the lines have approximately the same length (for example, we deal with a huge list of email addresses). In this case, there is a correlation between line numbers and the file size. So, we can use the algorithm below:
importrandomdefrandom_sampler(filename,k):sample=[]withopen(filename,'rb')asf:f.seek(0,2)filesize=f.tell()random_set=sorted(random.sample(xrange(filesize),k))foriinxrange(k):f.seek(random_set[i])# Skip current line (because we might be in the middle of a line) f.readline()# Append the next line to the sample set sample.append(f.readline().rstrip())returnsample
Basically, we get the file size. Create a sorted random set of k random positions in the file (between 1 and the file size). For each random position, we seek that position, skip a line, and put the next line to the sample set.
Benchmark
The table below shows the elapsed time for selecting 1000 lines from a large (~ 40M lines) and a very large file(~ 300M lines) for each algorithm. We see that the algorithm 3 is much faster. As I mentioned before, the only assumption is that the lines should have approximately the same length.
The Twitter ID is a unique number identifying an account on Twitter.
Upon registration, users can also choose a username (a.k.a. screen name or @handle).
However an account can never change its Twitter ID, but it can change its username.
Getting Twitter username from Twitter ID using the Twitter API (rate limited)
I was trying to lookup some Twitter IDs and find out what their corresponding usernames are. My first approach was using the code below using
http://twitter.com/users/show/[ID]?format=json as the endpoint:
Getting Twitter username from Twitter ID using the Twitter API (limited) (TwitterUserIdToScreenName_limited.php)download
<?php/** * Getting the Twitter usename from Twitter ID * This approach does not work in large scale. It's limited to 150 requests per hour. * * @author Massoud Seifi, Ph.D. @ MetaDataScience.com */classTwitterUserIdToScreenName{protected$href_base='http://twitter.com/users/show/';publicfunctiongetScreenNameFromUserId($twitter_user_id){if(!is_numeric($twitter_user_id)){returnfalse;}$href=$this->href_base.$twitter_user_id.'?format=json';$ch=curl_init();curl_setopt($ch,CURLOPT_URL,$href);curl_setopt($ch,CURLOPT_HEADER,false);curl_setopt($ch,CURLOPT_RETURNTRANSFER,true);curl_setopt($ch,CURLOPT_FOLLOWLOCATION,true);$result=curl_exec($ch);curl_close($ch);$json=json_decode($result,true);if(isset($json['screen_name'])){return$json['screen_name'];}}}$p=newTwitterUserIdToScreenName();while($line=fgets(STDIN)){$id=trim($line);$screen_name=$p->getScreenNameFromUserId($id);if(is_string($screen_name)){echo"$id,$screen_name\n";}}
This code calls Twitter API which has two issues:
It needs an access token after the release of Twitter API version 1.1.
The number of requests is limited to 150 requests per hour.
Getting Twitter username from Twitter ID without using the Twitter API (scalable)
I tried another approach using http://twitter.com/account/redirect_by_id?id= which doesn’t call the Twitter API and consequently has no rate limit. This approach worked fine for me and I could easily look up thousands of Twitter IDs. Here is the code:
Getting Twitter username from Twitter ID without using the Twitter API (TwitterUserIdToScreenName.php)download
<?php/** * Getting the Twitter usename from Twitter ID * This code works fine in large scale. * * @author Massoud Seifi, Ph.D. @ MetaDataScience.com */classTwitterUserIdToScreenName{protected$href_base='http://twitter.com/account/redirect_by_id?id=';publicfunctiongetScreenNameFromUserId($twitter_user_id){if(!is_numeric($twitter_user_id)){returnfalse;}$href=$this->href_base.$twitter_user_id;$ch=curl_init();curl_setopt($ch,CURLOPT_URL,$href);curl_setopt($ch,CURLOPT_NOBODY,true);curl_setopt($ch,CURLOPT_HEADER,true);curl_setopt($ch,CURLOPT_RETURNTRANSFER,true);curl_setopt($ch,CURLOPT_FOLLOWLOCATION,true);$result=curl_exec($ch);$info=curl_getinfo($ch);curl_close($ch);if(isset($info['url'])){returnstr_replace('/','',parse_url($info['url'],PHP_URL_PATH));}}}$p=newTwitterUserIdToScreenName();while($line=fgets(STDIN)){$id=trim($line);$screen_name=$p->getScreenNameFromUserId($id);if(is_string($screen_name)){echo"$id,$screen_name\n";}}