Monday, 8 March 2010

TPCH on Postgresql

This article intended for the people who are somewhat familiar with the TPCH and interested to find out what are obstacles for running it on Postgresql, and how to overcome them.
As a part of my research for the Petascan, I study performance of the TPCH on Postgresql.
It is a common knowledge that vanilla Postgresql is not capable of processing all TPCH queries. Indeed tests show some queries do not return in reasonable time, even for the scale factor 1. So I took a deep breath and dived into the optimization of the queries as well as into server tuning.

It turned out that the main reason of the poor performance is that Postgresql's prefers the nested loops operator for part of queries. In my understanding – nested loops are efficient when there is a small number of relevant rows in the driving table. It looks like optimizer does not have accurate information about the number of relevant rows and therefore its choice is not always optimal. I can define hash joins as not always the best, but much safer choice than the nested loops. So I started to look for the way to get rid of the nested loops in my query plans.
Fortunately Postgresql has an option to disable nested loops completely. It can be done by executing the following command:

enable_nestloop = false

or in several other ways. Disabling of nested loops improves performance for majority of TPCH queries. But a Q17 and Q20 still take too much time. So I decided to manually rewrite them to equivalent queries, which can be better processed by the Postgresql.

Q17. Original query is:
 
select
sum(l_extendedprice) / 7.0 as avg_yearly
from
lineitem,
part
where
p_partkey = l_partkey
and p_brand = 'Brand#22'
and p_container = 'LG CAN'
and l_quantity < (select 0.2 * avg(l_quantity)
from
lineitem
where
l_partkey = p_partkey
);





The rewritten version looks like this:

select
sum(l_extendedprice) / 7.0 as avg_yearly
from
lineitem,
part,
(select
l_partkey as tmp_partkey , 0.2 * avg(l_quantity) as part_avg
from
lineitem
group by l_partkey
) avg_quantity

where
p_partkey = l_partkey
and p_brand = 'Brand#22'
and p_container = 'LG CAN'
and l_quantity < tmp_partkey =" p_partkey;">




Pic 1. Q17 query execution plan before the change








Pic 2. Q17 query execution plan after the change


I am not 100% sure what exactly caused the improvement. It looks like 3 way join is much less efficient than 2 way join. Anyway - speedup is at least 20 times. I was not patient enough to wait for the result of the non optimized query. I would be happy if you correct me and give more insights here.
For the Q20 the similar optimization was done.

The last important observation concerns the sensitivity of Postgresql to the amount of the system free memory. From my experience with commercial RDBMS systems, you should give enough memory to the database server and it will perform better, mostly by caching the data. I intentionally omit other usage of the memory by database servers, for simplicity. Contrary to many commercial RDBMS servers Postgre SQL relies on the OS file system caching. It expects OS to cache frequently accessed data. So queries execution time is seriously affected by the amount of the system free memory. OS uses free memory as a file cache. I can deduce from it – that Postgresql should be used as a standalone server, since it is not capable of “defending” its cache against other memory consumers. I am curious if this caching architecture is a best choice?

Here is the summary of the actions which had an impact on the TPCH performance:
  • Disable Nested loops
  • Rewrite Q17, Q20
  • Free as much system memory as possible.
The following article was of great help in setting up the test: Preparations to use TPCH data on Postgresql

Test was performed on a very modest system (Windows XP, Core 2 Duo 2.13 GHz and one SATA drive with 7200 RPM). The scale factor was modest as well: 1 GB.
As a next step of the research I will look for the balance between CPU and IO for the TPCH on Postgresql, then move to bigger scales. The goal will be optimal price/performance of the TPCH on Postgresql.