Friday 24 September 2010

Nested database systems - how I see them

This article is about the nested data model for analytical DBMS systems: how I see it, and why I am going to help develop an open source implementation of it.

We were inspired by the paper describing the nested data store developed by Google and shared with us mere mortals: the Dremel paper. I recommend going through the text, at least briefly, before reading further. Alternatively, you can read the BigQuery home page; BigQuery is the public frontend for Dremel.
Let's define this concept in a nutshell, compare it with its closest known relatives, and then see what we can gain from this model.
First of all, a record in this model is not flat (as in an RDBMS) but hierarchical. Elements can be either scalars or lists. For example, one record can contain a person's ID, regular personal data, a list of all their previous jobs, and a list of previous addresses. And we have a query language that enables analysis of data in this form.
Example of a record:

Name: John
Age: 44
JOB:
    company: Google
    from: 1999
    to: 2002
JOB:
    company: Microsoft
    from: 1999
    to: 2002

A query to count the people who have ever worked for Google can be written as follows:
Select count(*) from People where JOB.company="Google"
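
To make the semantics of such a query concrete, here is a minimal Python sketch of the record above and the count. It is only an illustration: the `people` list, the second record ("Mary"), and the helper `count_where_job_company` are my assumptions, not part of Dremel or BigQuery, and a real engine would evaluate the query over its own storage format rather than over Python dictionaries.

```python
# Hypothetical in-memory stand-in for the People table.
people = [
    {
        "Name": "John",
        "Age": 44,
        "JOB": [  # repeated (list-valued) element
            {"company": "Google", "from": 1999, "to": 2002},
            {"company": "Microsoft", "from": 1999, "to": 2002},
        ],
    },
    {
        "Name": "Mary",  # invented record, only here to show a non-match
        "Age": 31,
        "JOB": [{"company": "IBM", "from": 2005, "to": 2009}],
    },
]


def count_where_job_company(records, company):
    """Count records that have at least one JOB entry for the given company."""
    return sum(
        1
        for record in records
        if any(job["company"] == company for job in record.get("JOB", []))
    )


print(count_where_job_company(people, "Google"))
```

Note that a person is counted once no matter how many of their jobs match, which is the reading of "people who have ever worked for Google" assumed here.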

You may ask: why is this good? We could use a relational model to hold all this information. I agree, we could, and it would be perfectly flexible and easy to access. The main problem is scalability. Retrieving all the information about a person requires a join. In a serious data warehouse, this join becomes a disaster when we need to do some analysis.
So we come to the conclusion I want to present: with native support for the nested data model, we can store a one-to-many relationship pre-joined, without replicating the "one" side of the relationship.
If we compare the performance of such an engine with that of an RDBMS engine, two things stand out: many data models will have one table instead of several, and more queries will be completed by one-pass algorithms. For big data volumes this can be a game-changing advantage.
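
The contrast between the two layouts can be sketched in Python. The first function mimics the relational layout, with two flat tables and a simple hash join; the second scans pre-joined nested records once. All table names, field names, and data here are assumptions made for illustration, and the sketch says nothing about how a real engine would execute either plan.

```python
from collections import defaultdict

# Relational layout: two flat tables linked by person_id (hypothetical data).
persons = [
    {"person_id": 1, "name": "John"},
    {"person_id": 2, "name": "Mary"},
]
jobs = [
    {"person_id": 1, "company": "Google"},
    {"person_id": 1, "company": "Microsoft"},
    {"person_id": 2, "company": "IBM"},
]


def names_by_company_join(persons, jobs, company):
    """Hash join: index the jobs table, then probe it for every person."""
    jobs_by_person = defaultdict(list)
    for job in jobs:  # pass over the jobs table to build the index
        jobs_by_person[job["person_id"]].append(job)
    return [
        person["name"]
        for person in persons  # pass over the persons table, probing the index
        if any(j["company"] == company for j in jobs_by_person[person["person_id"]])
    ]


# Nested layout: jobs live inside each person record, so the "one" side
# is stored exactly once and no join index is needed.
people_nested = [
    {"name": "John", "jobs": [{"company": "Google"}, {"company": "Microsoft"}]},
    {"name": "Mary", "jobs": [{"company": "IBM"}]},
]


def names_by_company_nested(people, company):
    """Single pass over the nested records."""
    return [
        person["name"]
        for person in people
        if any(j["company"] == company for j in person["jobs"])
    ]
```

On a single machine the difference looks small, but the join index is the part that has to be built, held in memory, or moved between nodes once the tables no longer fit on one machine.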

In fact, some application logs are hierarchical by nature. For example, a web application session contains its clicks as well as some session-level information and statistics. Using the nested model enables native analysis of such logs.
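
As a rough illustration, a session log in nested form might look like the Python sketch below. The field names (`session_id`, `user_agent`, `clicks`, `url`) and the sample data are hypothetical; the point is only that session-level and click-level questions can both be answered from the same records without joining a sessions table to a clicks table.

```python
# Hypothetical nested session log: each session carries its own clicks.
sessions = [
    {
        "session_id": "s1",
        "user_agent": "Firefox",
        "clicks": [{"url": "/home"}, {"url": "/search"}],
    },
    {
        "session_id": "s2",
        "user_agent": "Chrome",
        "clicks": [{"url": "/home"}],
    },
]


def clicks_per_url(sessions):
    """Click-level analysis: count clicks for each URL across all sessions."""
    counts = {}
    for session in sessions:
        for click in session["clicks"]:
            counts[click["url"]] = counts.get(click["url"], 0) + 1
    return counts


def average_clicks_per_session(sessions):
    """Session-level analysis: mean number of clicks in a session."""
    if not sessions:
        return 0.0
    return sum(len(s["clicks"]) for s in sessions) / len(sessions)


print(clicks_per_url(sessions))
print(average_clicks_per_session(sessions))
```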

I am going to take a very active part in developing an open source implementation of this wonderful concept. I see it as an opportunity to take part in developing a state-of-the-art analytical engine aimed at working in the cloud. In my opinion, such experience is invaluable to a professional interested in the design and development of scalable cloud-based systems. And I hardly see another way to get such experience if you are not working in the MS/Google/IBM database labs.

The resulting system is supposed to have the same scalability as MapReduce while providing interactive response times. This is not a dream but a capability already achieved internally at Google. So the task presents a serious challenge, though there is proof that it is possible.