The Power of sampling

To create a random sample you need

If these two conditions are satisfied a random sample of size n can be created by first creating a randomly ordered vector of integers from 1 to N, and then picking the first n-elements of this vector and use this sample to access the n records.

To extract subsets from relational databases you use queries, typically written in SQL. To extract a random set of elements from a relational database you need a sequential identifier, i.e. a vector of values ranging from 1 to the number of records in the database without gaps. Most naturally occuring data comes without sequential identifiers, which is why relational databases often are defined so that for each record added to the database, an auto-incremented id-variable is added to the new record. Since such a id-variable is not random (it is perfectly correlated to the order in which the rows were added), it cannot be used directly to create a random sample, you have to create a random set of elements and use that set to create a random sample of your data. This post describes the situation where you have some non-sequential identifier in your large database, and need to create a random sample.

Since the dataset is large, we cannot rely on methods that require the full vectors to be loaded in RAM.

Use case: tweets

Tweets have a natural identifier twitterid, which can be used to access individual tweets (satisfying the first condition above).

select id from tweets INTO OUTFILE '/tmp/ids.tsv';
random.vector <- sample.int(n)
write.table(random.vector, file = "/tmp/rids.tsv")
paste /tmp/rids.tsv ids.tsv > mapping.tsv;
alter table tweets ADD COLUMN random_id int(11) FIRST;
create table mapping(randomid INT(11), twitterid BIGINT(20), primary key (`twitterid`)) engine=MYISAM;
LOAD DATA INFILE '/tmp/mapping.tsv' INTO TABLE mapping;
update tweets inner join mapping on tweets.id = mapping.twitterid set tweets.random_id = mapping.randomid

comments powered by Disqus


Back to the index

Blog roll

R-bloggers, Debian Weekly
Valid XHTML 1.0 Strict [Valid RSS] Valid CSS! Emacs Muse Last modified: oktober 17, 2019