Big Data and the theory of database optimisation

Executive summary: when the data become big, partitioning performs better than indices, and you should migrate from normalisation to duplication of data.

Introduction

Conventional wisdom says:

Thou shalt normalise your database!

This dogma has been around for more than 50 years. When you type "why normalisation is" into the search field of your browser, what completions will it suggest?

It will not suggest: "harming performance".

Normalisation is the structuring of data so that each datum appears in only one place in the database.
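As a minimal, hypothetical illustration (the tables and columns here are made up and are not the dataset used below): in a normalised schema, a user's name is stored exactly once, in a users table, and every post refers to it via userid instead of repeating the name in each row.

create table users (
    userid   int unsigned not null primary key,
    username varchar(64)  not null
) engine=myisam;

create table posts (
    postid   int unsigned not null primary key,
    threadid int unsigned not null,
    userid   int unsigned not null  -- refers to users.userid, so the name is never duplicated
) engine=myisam;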

Storing big data is one problem, but hard disk storage is cheap nowadays. Accessing and analysing the data is another problem. When a database is too big to be held in the RAM of a computer, there are clever ways to organise the data so that queries will be answered quickly even if the data must be read from a hard disk. When the data become really big, not only is the data itself too big to be held in RAM, but even the indices of where each record is stored are too big to fit in RAM.

I suggest two strategies to speed up access to big data stored in a local relational database:

Partitioning

The database engine used here, the MyISAM engine of MySQL, allows each table to be partitioned into 1024 parts, and a rule determines in which of these parts a particular record will be stored. When processing a query, only the relevant partition of the table needs to be opened and searched for the record.
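A sketch of what such a partitioned table could look like in MySQL 5.x (the table name, the column types and the choice of HASH partitioning are assumptions for illustration, not the actual definitions used here):

create table posts_by_user (
    postid   int unsigned not null,
    threadid int unsigned not null,
    userid   int unsigned not null
) engine=myisam
-- the rule: the hash of userid modulo 1024 decides which partition a row lands in
partition by hash(userid)
partitions 1024;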

Weak normalisation

In small-to-normal-sized databases, an index of all records in all tables in the database can be held in RAM, so information on where to find the records needed to process a query can be gathered quickly; the processing time is then mainly the time it takes to actually fetch those records. When the number of records grows to the point where not even an index of where the records are stored can be held in RAM, partitioning becomes more efficient than indices. However, when a database is partitioned, the rule for where to store a particular record gives primacy to one of the variables in the database, and if a query is based on, or depends on, another variable, partitioning actually increases the time it takes to fetch the records, since these are now spread over a lot of files, each of which has to be opened.
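With MySQL 5.x you can see the effect with explain partitions, here applied to the hypothetical table sketched above: when the where clause uses the partitioning column, only one partition is listed; otherwise every partition must be opened and scanned.

-- pruned to a single partition: the query filters on the partitioning column
explain partitions select postid from posts_by_user where userid = 982;

-- not pruned: the query filters on another column, so all 1024 partitions are opened
explain partitions select postid from posts_by_user where threadid = 3165;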

The solution to this problem is to use weak normalisation, that is, to create two tables that hold the same information but are partitioned on different variables.

For example, the relation between users, posts and threads could be structured in one table like the following:

postid  threadid  userid
 12616   1102245    1776
 20504      3165    4115
 22744      3624     358
 38016      6315    3791
 38024      6315    2296
 57056      9601    2880
  5440       689     450
  5448       689     431
 21040      3270     548
  8360      1060     382

Here, postid is the only variable that has unique values for each row; values of both threadid and userid will appear in more than one row, e.g. threadid = 689 appears twice in this extract.

The questions that we will put to these data are not about posts, but about users and threads (threads indicate subjects, i.e. contexts for the posts). A question could be "In which threads has user 393931 posted?" or "Which users have posted in thread 100029?". The first question can only be answered quickly if there is an index of where in the table all records with a particular value of userid are found, and the second question requires a different index, one which tells, for each threadid, where all records with that threadid are found. The traditional way of optimising the table for such queries is to add one index for postid, one index for threadid and one index for userid, and to leave the table unpartitioned.
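That traditional setup could be created like this (the index names are made up for illustration):

alter table thread_user_post
    add index idx_postid (postid),
    add index idx_threadid (threadid),
    add index idx_userid (userid);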

The way to optimise for big data is to normalise only weakly (i.e. to accept duplication of information) and to make two tables with the same contents, which differ only in the variable on which they are partitioned.

In this scenario, the table above is turned into two tables: one partitioned on threadid and one partitioned on userid, and based on the variable in the question we select the proper table for the query.
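A sketch of how the two tables could be defined and filled (the column types, the HASH partitioning scheme and the insert statements are assumptions; the table names are the ones used in the timings below):

create table thread_user_post_partitioned (
    postid   int unsigned not null,
    threadid int unsigned not null,
    userid   int unsigned not null
) engine=myisam
partition by hash(threadid)
partitions 1024;

create table user_thread_post_partitioned (
    postid   int unsigned not null,
    threadid int unsigned not null,
    userid   int unsigned not null
) engine=myisam
partition by hash(userid)
partitions 1024;

-- duplicate the data into both tables; the contents are identical,
-- only the partitioning variable differs
insert into thread_user_post_partitioned select postid, threadid, userid from thread_user_post;
insert into user_thread_post_partitioned select postid, threadid, userid from thread_user_post;

Queries that filter on threadid then go to thread_user_post_partitioned, and queries that filter on userid go to user_thread_post_partitioned.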

Let us compare the query times and storage requirements for the traditional way (strong normalisation and multiple indices on one table) and the alternative way (weak normalisation in combination with partitions, and no indices).

The dataset(s) to which these queries were applied had 44,534,432 rows, and the server used was a very slow one.

Strong normalisation, indices, one table

select count(threadid) from thread_user_post where userid = 483737; (2.88 sec)
select postid from thread_user_post where userid = 982; (5.3 sec)
select distinct(threadid) from thread_user_post where userid = 982; (5.31 sec)
select count(postid) from thread_user_post where threadid = 711660; (2.01 sec)
select count(postid) from thread_user_post where threadid = 86525; (1.29 sec)

Weak normalisation, partitioning, no indices, two tables

select count(threadid) from user_thread_post_partitioned where userid = 483737; (0.06 sec)
select postid from user_thread_post_partitioned where userid = 982; (0.21 sec)
select distinct(threadid) from user_thread_post_partitioned where userid = 982; (0.30 sec)
select count(postid) from thread_user_post_partitioned where threadid = 711660; (0.07 sec)
select count(postid) from thread_user_post_partitioned where threadid = 86525; (0.09 sec)

Weak normalisation with partitioning is far faster: 15-50 times faster in the tests above.

Strong normalisation should, in theory, require the least storage, since each datum is stored in only one table. However, when indices are added to a table, the storage requirements increase.

Somewhat unexpectedly, the two partitioned tables occupy LESS hard disk space than the one table with its indices.

Conclusion: When the data get big, partitioning performs better than indices, and you should migrate from normalisation to duplication of data.
