While reading Google's Percolator paper I noticed that Percolator supports observers - the capability to run specific logic when data in a particular column changes. This reminded me of triggers. Ordinary database triggers - the ones common knowledge says are something bad to use.
I will not list the disadvantages of triggers here - they are well known. Let's look at their positive side instead.
I realized that triggers are an almost perfect decoupling facility. They are a declarative way to attach logic that is invoked when some row (object) in my database (data model) changes.
I am not aware of OO languages or frameworks which allow such flexibility. I can assume that aspect oriented programming allows something similar, although I doubt it is a full match.
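To make the idea concrete, here is a minimal PostgreSQL-style sketch. The people and salary_audit tables and the audit logic are hypothetical, purely to show the declarative attachment:

-- Hypothetical tables, purely for illustration.
create table people (
    id     serial primary key,
    name   text,
    salary numeric
);

create table salary_audit (
    person_id  integer,
    old_salary numeric,
    new_salary numeric,
    changed_at timestamp default now()
);

-- The logic to run when a row changes.
create or replace function log_salary_change() returns trigger as $$
begin
    insert into salary_audit (person_id, old_salary, new_salary)
    values (old.id, old.salary, new.salary);
    return new;
end;
$$ language plpgsql;

-- The declarative part: attach the logic to the column change.
-- Code that updates people never needs to know the audit exists.
create trigger salary_change
    after update of salary on people
    for each row execute procedure log_salary_change();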
If you know of a way to achieve the same or a similar capability without triggers - I would be happy to hear about it.
Saturday, 16 October 2010
Friday, 24 September 2010
Nested database systems - how I see them
This article is about the nested data model for analytical DBMS systems: how I see it, and why I am going to help develop it as open source.
We were inspired by the paper describing the nested data store developed by Google and shared with us mere mortals: the Dremel paper. I would recommend going through the text, at least briefly, before reading further. Alternatively you can read this:
Big Query home page
It is the public frontend for Dremel.
Let's define this concept in a nutshell, compare it with its closest known relatives, and then see what we can gain from this model.
First of all, a record in this model is not flat (as in an RDBMS) but hierarchical. Elements can be scalars or lists. For example, one record can contain a person's ID, regular personal data, a list of all their previous jobs, and a list of previous addresses. And there is a query language which enables analysis of data in this form.
Example of the record:
Name: John
Age: 44
JOB:
    company: Google
    from: 1999
    to: 2002
JOB:
    company: Microsoft
    from: 1999
    to: 2002
To get the people who ever worked for Google, the query can be written as follows:
Select count(*) from People where JOB.Company = "Google"
You may ask - why is this good? We could keep all this information in a relational model. I agree - we could, and it would be perfectly flexible and easy to access. The main problem is scalability: retrieving the information about a person will require a join, and in a serious data warehouse this join becomes a disaster once we need to do some analysis.
So we come to the conclusion I want to present - with native support for the nested data model we can store a one-to-many relationship pre-joined, without replicating the "one" side of the relationship.
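To make the contrast concrete, here is a sketch of the same data in a relational model (the people and jobs tables are hypothetical). The question now requires a join, plus a DISTINCT to avoid counting a person once per matching job:

-- Hypothetical relational equivalent of the nested record above.
create table people (person_id integer primary key, name text, age integer);
create table jobs (
    person_id integer references people,
    company   text,
    from_year integer,
    to_year   integer
);

-- The same question as before now needs a join:
select count(distinct p.person_id)
from people p
join jobs j on j.person_id = p.person_id
where j.company = 'Google';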
If we peek into the performance of such an engine versus an RDBMS engine: many data models will fit into one table instead of several, and more queries will be completed by one-pass algorithms. For big data volumes this can be a game-changing advantage.
In fact, some application logs are hierarchical by nature - for example, a web application session contains its clicks, as well as some session-level information and statistics. The nested model enables native analysis of such logs.
I am going to take a very active part in the development of an open source implementation of this wonderful concept. I see it as an opportunity to take part in developing a state of the art analytical engine aimed at the cloud. In my opinion such experience is invaluable to a professional interested in scalable, cloud-based system design and development, and I hardly see another way to get it if you are not working in the MS/Google/IBM database labs.
The resulting system is supposed to have the same scalability as MapReduce while providing interactive response times. This is not a dream but a capability Google has already achieved internally. So the task presents a serious challenge, but there is proof that it is possible.
Monday, 8 March 2010
TPCH on Postgresql
This article is intended for people who are somewhat familiar with TPCH and interested to find out what the obstacles to running it on Postgresql are, and how to overcome them.
As a part of my research for Petascan, I am studying the performance of TPCH on Postgresql.
It is common knowledge that vanilla Postgresql is not capable of processing all the TPCH queries. Indeed, tests show that some queries do not return in reasonable time, even at scale factor 1. So I took a deep breath and dived into optimizing the queries as well as tuning the server.
It turned out that the main reason for the poor performance is that Postgresql prefers the nested loop operator for some of the queries. In my understanding, nested loops are efficient when there is a small number of relevant rows in the driving table. It looks like the optimizer does not have accurate information about the number of relevant rows, and therefore its choice is not always optimal. I would describe hash joins as not always the best, but a much safer choice than nested loops. So I started to look for a way to get rid of the nested loops in my query plans.
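The planner's choice is easy to inspect with EXPLAIN. A quick sketch, assuming the TPCH schema is loaded (the predicate is illustrative):

-- See which join operator the planner picked for a part/lineitem join:
explain
select count(*)
from lineitem, part
where p_partkey = l_partkey
  and p_brand = 'Brand#22';
-- Look for "Nested Loop" vs "Hash Join" nodes in the plan output.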
Fortunately Postgresql has an option to disable nested loops completely. It can be done per session by executing:
SET enable_nestloop = off;
or globally by putting enable_nestloop = false into postgresql.conf. Disabling nested loops improves performance for the majority of the TPCH queries, but Q17 and Q20 still take too much time. So I decided to rewrite them manually into equivalent queries which Postgresql can process better.
Q17. Original query is:
select
    sum(l_extendedprice) / 7.0 as avg_yearly
from
    lineitem,
    part
where
    p_partkey = l_partkey
    and p_brand = 'Brand#22'
    and p_container = 'LG CAN'
    and l_quantity < (select 0.2 * avg(l_quantity)
                      from lineitem
                      where l_partkey = p_partkey);
The rewritten version looks like this:
select
sum(l_extendedprice) / 7.0 as avg_yearly
from
lineitem,
part,
(select
l_partkey as tmp_partkey , 0.2 * avg(l_quantity) as part_avg
from
lineitem
group by l_partkey
) avg_quantity
where
p_partkey = l_partkey
and p_brand = 'Brand#22'
and p_container = 'LG CAN'
and l_quantity < tmp_partkey =" p_partkey;">
Pic 1. Q17 query execution plan before the change
Pic 2. Q17 query execution plan after the change
I am not 100% sure what exactly caused the improvement. It looks like the 3-way join is much less efficient than the 2-way join. In any case, the speedup is at least 20x; I was not patient enough to wait for the result of the non-optimized query. I would be happy if you correct me or give more insight here.
A similar optimization was done for Q20, as sketched below.
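I will not reproduce my exact Q20 text here, but the idea is the same as for Q17. The sketch below is a reconstruction assuming the standard TPCH Q20 shape, with illustrative parameter values: the correlated 0.5 * sum(l_quantity) subquery is replaced by a derived table grouped by part and supplier.

-- Reconstruction of the Q20 idea (parameter values are illustrative):
-- precompute per (part, supplier) quantity sums once, instead of
-- running a correlated subquery for every partsupp row.
select s_name, s_address
from supplier, nation
where s_suppkey in (
        select ps_suppkey
        from partsupp,
             (select l_partkey, l_suppkey,
                     0.5 * sum(l_quantity) as half_qty
              from lineitem
              where l_shipdate >= date '1994-01-01'
                and l_shipdate < date '1995-01-01'
              group by l_partkey, l_suppkey) sums
        where ps_partkey = l_partkey
          and ps_suppkey = l_suppkey
          and ps_availqty > half_qty
          and ps_partkey in (select p_partkey from part
                             where p_name like 'forest%'))
  and s_nationkey = n_nationkey
  and n_name = 'CANADA'
order by s_name;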
The last important observation concerns the sensitivity of Postgresql to the amount of free system memory. From my experience with commercial RDBMS systems, if you give the database server enough memory it performs better, mostly by caching the data (I intentionally omit other uses of memory by database servers, for simplicity). Contrary to many commercial RDBMS servers, Postgresql relies on the OS file system cache: it expects the OS to cache frequently accessed data. So query execution times are seriously affected by the amount of free system memory, since the OS uses free memory as a file cache. I deduce from this that Postgresql should run on a standalone server, since it is not capable of "defending" its cache against other memory consumers. I am curious whether this caching architecture is the best choice.
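For reference, the relevant knobs in postgresql.conf look roughly like this; the values below are illustrative, not tuned recommendations:

# postgresql.conf (illustrative values)
shared_buffers = 256MB        # Postgresql's own, deliberately modest cache
effective_cache_size = 1GB    # planner hint: expected size of the OS file cache
work_mem = 32MB               # per-operation sort/hash memory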
Here is a summary of the actions which had an impact on TPCH performance:
- Disable nested loops
- Rewrite Q17 and Q20
- Free as much system memory as possible
The test was performed on a very modest system (Windows XP, Core 2 Duo 2.13 GHz, one 7200 RPM SATA drive). The scale factor was modest as well: 1 GB.
As the next step of this research I will look for the balance between CPU and IO for TPCH on Postgresql, then move to bigger scales. The goal is optimal price/performance for TPCH on Postgresql.
Thursday, 4 February 2010
Relational vs non-Relational databases - paper review
I just finished reading a very interesting paper, "No Relation: The Mixed Blessings of Non-Relational Databases" by Ian Thomas Varley, which compares relational and non-relational databases. I enjoyed the reading very much.
The work describes the principal RDBMS services from the perspective of their replacements in the non-RDBMS world. The services discussed in this perspective are the ACID properties and the expressive power of the SQL language.
The author seems to be a bit biased towards non-relational databases, but the text is still going to be very interesting to anyone exploring this new area of database management.
In my opinion the paper offers a very good classification and organization of the information on the subject.
The bottom line - read this paper if you want to clarify your understanding of the RDBMS alternatives.
Sunday, 24 January 2010
The Memcache vs Datastore on google app engine
I was curious to find out what the benefits of using the GAE caching mechanism are, and how it compares to the datastore.
My expectation was that memcache is much cheaper than the datastore: it should be order(s) of magnitude faster and should allow a much bigger quota of calls. If that were true, memcache would be an ideal place to cache datastore data. The reality appears to be different: memcache is only 2-3 times faster and has a smaller quota than the datastore.
We performed several tests, storing and retrieving a fixed (100 byte) data value with both mechanisms. The times of several tries were averaged.
The results of the first calls were dismissed, since they represent application warm-up and not the speed of the services being tested.
The results are:
Capability            | Datastore         | Memcache
----------------------|-------------------|----------------
Read data             | 20 milliseconds   | 13 milliseconds
Store data            | 40 milliseconds   | 13 milliseconds
Access by key         | Yes               | Yes
Lookup by attributes  | Yes               | No
Expiration capability | No                | Yes
Max data size         | 1MB               | 1MB
Quota (free)          | 10,000,000 calls  | 8,600,000 calls
Quota (billed)        | 140,000,000 calls | 96,000,000 calls
Now let's try to understand the meaning of these numbers. Are they high or low?
Speed:
Tens of milliseconds per data read or write is similar to the performance you would expect from a regular database server accessing indexed data.
So neither the datastore nor memcache is faster than a regular RDBMS. They are also not slower. Memcache is 2-3 times faster than the datastore.
Number of calls:
The quotas allow something like 9 memcache calls per 10 datastore calls. I could not deduce any special relation between the two services from these numbers.
The above considerations suggest the following conclusions:
Memcache is simply another mechanism which should be used when:
a) Loss of data is not critical. In other words - data can be rebuilt.
b) Expiration capability is useful.
c) Access by key is the only way the data can be accessed.
Memcache does not give a significant speedup over the datastore, unless the results of many datastore calls are cached as one memcache item. At the same time it can be a significant speedup for data acquired in some other way.
And some speculation on my usual question - what lesson is Google trying to teach us here? Use the persistent store when data is significant, and use the caching mechanism when data is transient.
Tuesday, 22 December 2009
Impressions from the development of the applications under Google App Engine
My impressions from developing applications under Google App Engine are quite positive.
It is finally Java. Development on the familiar J2EE platform is pleasant. It also gives a sense of freedom to move to another platform if needed.
The good part: it is very easy to kick off development. The Eclipse plugin does everything and smiles at you. No more headaches integrating the debugger with Tomcat or another container. Most of the debugging is done locally, which is very convenient. Deployment to GAE is performed with one button click.
Briefly, the limitations are the following: 30 seconds per request, no threads, and the mandatory use of the datastore to persist data. The first two are more or less easy to understand. The third one is a game changer: no more SQL. We are given a distributed hash map wrapped with JDO. Its capabilities are impressive - especially the ability to query by non-key properties. Here comes an interesting feature: mandatory indexing. You must have an index on every property used in a WHERE clause. The nice part is that the development server builds them automatically by analyzing the queries issued during the debugging phase. The main limitation, widely discussed over the net, is the maximum of 1000 objects per result set. It sounds weird, and it is the first time I have seen such an intrusion into my programming freedom. But instead of complaining I would like to analyze why such a limitation is enforced.
I think Google is trying to enforce proper practice for building scalable applications in the cloud. They ask us to build a pure OLTP system: do small things at a time, use async requests to other services when you need more work done per request (here I refer to the URL fetching service), spawn small async tasks whenever possible (good for multicore architectures). All this sounds perfectly reasonable to me. However, I already hear the questions - what about batch processing? How to do analytics? How to send mailings to many users? How to perform a heavy upgrade process? And these complaints are perfectly valid: you cannot build a system without those long, CPU-intensive tasks. But I do not think that an OLTP-optimized framework and runtime can be suitable for this type of processing. I assume that Google will introduce some other framework for batch processes. We can speculate that it might be a kind of MapReduce, which is perfect for massive processing while not providing any kind of real-time response.
To summarize: it seems that by now Google has realized that building both short and long processing procedures in the same framework is not good practice, and it tries to keep us from that practice by brute force.
And we are yet to see the second part of GAE - the slow and powerful one. One could imagine that it will look more like a bulldozer than a jet plane.
Friday, 18 December 2009
Datastore performance
While working on my AppEngine-based project I made a mini benchmark of the datastore.
The operations under test were updating an object in the datastore and reading an object by key from the datastore.
Both operations took from 20 to 40 milliseconds. That is not very fast. It means that a usual web service cannot make more than a few datastore operations while serving a request. Taking into account the lack of multithreading, this latency makes it problematic to build complicated web applications composed of different services.