<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-3820162693712878902</id><updated>2011-08-30T13:28:12.018-07:00</updated><category term='tuning'/><category term='postgres'/><category term='TPCH'/><category term='optimization'/><title type='text'>Database aspects</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://dbaspects.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3820162693712878902/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://dbaspects.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>David Gruzman</name><uri>http://www.blogger.com/profile/12513781804510839403</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>9</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-3820162693712878902.post-262856162675745417</id><published>2010-10-16T08:51:00.000-07:00</published><updated>2010-10-17T15:00:40.297-07:00</updated><title type='text'>Database triggers - as (almost) perfect decoupling facililty</title><content type='html'>While reading the Google's Percolator paper I have noticed that Percolator supports observers - the capability to run specific logic&lt;br /&gt;when data in the particular column has changed. This reminded me of triggers. Usual database triggers. As a common knowledge - something bad to use.&lt;br /&gt;&lt;br /&gt; I will not list here the disadvantages of the triggers - they are well known. Lets look at the positive&lt;br /&gt;side of them.&lt;br /&gt;I realized  that the triggers are almost perfect decoupling facility. It is a declarative way to set&lt;br /&gt;the logic to be invoked when some row (object) in my data base (data model) has changed.&lt;br /&gt;I am not aware  of  OO languages / frameworks which  do allow such perfect flexibility. I can assume that aspect oriented&lt;br /&gt;programming will allow something similar, although I  doubt  that it will be the match.&lt;br /&gt;If you  have heard of  the way to achieve the same or similar capability without usage of the triggers - I would be happy to know.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3820162693712878902-262856162675745417?l=dbaspects.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbaspects.blogspot.com/feeds/262856162675745417/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbaspects.blogspot.com/2010/10/database-triggers-as-almost-perfect.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3820162693712878902/posts/default/262856162675745417'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3820162693712878902/posts/default/262856162675745417'/><link rel='alternate' type='text/html' href='http://dbaspects.blogspot.com/2010/10/database-triggers-as-almost-perfect.html' title='Database triggers - as (almost) perfect decoupling facililty'/><author><name>David Gruzman</name><uri>http://www.blogger.com/profile/12513781804510839403</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3820162693712878902.post-2020983715606452545</id><published>2010-09-24T01:13:00.001-07:00</published><updated>2010-09-24T01:21:52.276-07:00</updated><title type='text'>Nested database systems - how I see them</title><content type='html'>This article is about nested data model for the analytical DBMS systems, how I see it and why I am going to develop it open source.&lt;br /&gt;&lt;br /&gt;We were inspired by this paper describing nested data store developed by google and shared with us - mere mortals: &lt;a href="http://sergey.melnix.com/pub/melnik_VLDB10.pdf"&gt;Dremel paper&lt;/a&gt;    I would recommend to go through the text ,  at least briefly, before reading further. Alternatively you can read this: &lt;br /&gt;&lt;a href="http://code.google.com/apis/bigquery/"&gt;Big Query home page&lt;/a&gt;&lt;br /&gt; it is public frontend for the Dremel.&lt;br /&gt;Lets define this concept in a nutshell, compare it with closest known species and then see - what can we gain from this model.&lt;br /&gt;First of all,  the record in this model is not flat (like in RDBMS) but hierarchical. Elements can be both scalar and lists. For example one record can contain the person’s ID,  regular personal data, list of all its previous jobs, list of  previous addresses.  And we have a query language which enables data analysis in this form.&lt;br /&gt;Example of the record:&lt;br /&gt;&lt;br /&gt;Name: John&lt;br /&gt;Age: 44&lt;br /&gt;JOB:&lt;br /&gt;&amp;nbsp &amp;nbsp company: Google&lt;br /&gt;&amp;nbsp &amp;nbsp from: 1999&lt;br /&gt;&amp;nbsp &amp;nbsp to:   2002&lt;br /&gt;JOB:&lt;br /&gt;&amp;nbsp &amp;nbsp company: Microsoft&lt;br /&gt;&amp;nbsp &amp;nbsp from: 1999&lt;br /&gt;&amp;nbsp &amp;nbsp to:   2002&lt;br /&gt;&lt;br /&gt;The  query can be written as follows:&lt;br /&gt;Select count(*) from People where JOB.Company=”Google”&lt;br /&gt;To get people who ever worked for  Google.           &lt;br /&gt;&lt;br /&gt;You can ask - why is it  good? We can have relational model with all this information. I agree - we can. And it will be perfectly flexible and easy to access. The main problem is a scalability. Retrieving information about the person will require the join. In case of the serious data warehouse - this join will be a disaster, if we need to do some analysis.&lt;br /&gt;So we have come  to the conclusion I want to present - having native support for the nested data model we can store One-to-many relationship pre-joined, without replication of the “One” side of the relationship.&lt;br /&gt;If we try to peek into performance analysis of such engine vs RDBMS engine - many data models will have one table, instead of a few, and more queries will be completed by the one pass algorithms.  For the big data volumes it can be game-changing advantage.&lt;br /&gt;&lt;br /&gt;In fact  some application logs are hierarchical by their nature - for example web application session contain its clicks, as well as some session level information and statistics.  Usage of nested model will enable native analysis of them.&lt;br /&gt;&lt;br /&gt;I am going to take very active part in the development of the open source implementation of this wonderful concept.  I see  it as an opportunity to take part in the developing state of the art analytical engine aimed to work in the cloud. In my opinion such experience is invaluable to the professional interested in scalable cloud base system design and development. And I hardly see the other way to get such experience when you are not working in MS/Google/IBM database labs.&lt;br /&gt;&lt;br /&gt;The resulting system is supposed to have the same scalability as Map-Reduce, while providing interactive response time. It is not a dream - but capability achieved in Google internally. So this task presents a  serious challenge, though  there is a proof  that it is possible.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3820162693712878902-2020983715606452545?l=dbaspects.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbaspects.blogspot.com/feeds/2020983715606452545/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbaspects.blogspot.com/2010/09/nested-database-systems-how-i-see-them.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3820162693712878902/posts/default/2020983715606452545'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3820162693712878902/posts/default/2020983715606452545'/><link rel='alternate' type='text/html' href='http://dbaspects.blogspot.com/2010/09/nested-database-systems-how-i-see-them.html' title='Nested database systems - how I see them'/><author><name>David Gruzman</name><uri>http://www.blogger.com/profile/12513781804510839403</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3820162693712878902.post-7138489173394644289</id><published>2010-03-08T13:53:00.000-08:00</published><updated>2010-03-12T03:43:01.509-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='tuning'/><category scheme='http://www.blogger.com/atom/ns#' term='TPCH'/><category scheme='http://www.blogger.com/atom/ns#' term='optimization'/><category scheme='http://www.blogger.com/atom/ns#' term='postgres'/><title type='text'>TPCH on Postgresql</title><content type='html'>This article intended for the people who are somewhat familiar with the TPCH and interested to find out what are obstacles for running it on Postgresql, and how to overcome them.&lt;br /&gt;As a part of my research for the Petascan, I study performance of the TPCH on Postgresql.&lt;br /&gt;It is a common knowledge that vanilla Postgresql is not capable of processing all TPCH queries. Indeed tests show some queries do not return in reasonable time, even for the scale factor 1. So I took a deep breath and dived into the optimization of the queries as well as into server tuning.&lt;br /&gt;&lt;br /&gt;It turned out that the main reason of the poor performance is that Postgresql's prefers the nested loops operator for part of queries. In my understanding – nested loops are efficient when there is a small number of relevant rows in the driving table. It looks like optimizer does not have accurate information about the number of relevant rows  and therefore its choice is not always optimal. I can define hash joins as not always the best, but much safer choice than the nested loops. So I started to look for the way to get rid of the nested loops in my query plans.&lt;br /&gt;Fortunately Postgresql has an option to disable nested loops completely. It can be done by executing the following command:&lt;br /&gt;&lt;br /&gt;enable_nestloop = false&lt;br /&gt;&lt;br /&gt;or in several other ways. Disabling of nested loops improves performance for majority of TPCH queries. But a Q17 and Q20  still take too much time. So I decided to manually rewrite them to equivalent queries, which can be better processed by the Postgresql.&lt;br /&gt;&lt;br /&gt;Q17. Original query is:&lt;br /&gt;&lt;pre&gt; &lt;br /&gt;select&lt;br /&gt; sum(l_extendedprice) / 7.0 as avg_yearly&lt;br /&gt;from&lt;br /&gt; lineitem,&lt;br /&gt; part&lt;br /&gt;where&lt;br /&gt; p_partkey = l_partkey&lt;br /&gt; and p_brand = 'Brand#22'&lt;br /&gt; and p_container = 'LG CAN'&lt;br /&gt; and l_quantity &lt; &lt;span style="font-weight: bold;"&gt;(select 0.2 * avg(l_quantity)  &lt;br /&gt;                            from&lt;br /&gt;          lineitem&lt;br /&gt;                            where&lt;br /&gt;                             l_partkey = p_partkey&lt;br /&gt;                     ); &lt;/span&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The rewritten version looks like this:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;select&lt;br /&gt; sum(l_extendedprice) / 7.0 as avg_yearly&lt;br /&gt;from&lt;br /&gt; lineitem,&lt;br /&gt; part,&lt;br /&gt; &lt;span style="font-weight: bold;"&gt;(select&lt;br /&gt;   l_partkey as tmp_partkey , 0.2 * avg(l_quantity) as part_avg&lt;br /&gt;  from&lt;br /&gt;   lineitem&lt;br /&gt;   group by l_partkey&lt;br /&gt; ) avg_quantity&lt;/span&gt;&lt;br /&gt;where&lt;br /&gt; p_partkey = l_partkey&lt;br /&gt; and p_brand = 'Brand#22'&lt;br /&gt; and p_container = 'LG CAN'&lt;br /&gt; and l_quantity &lt; tmp_partkey =" p_partkey;"&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_SNVQDBU1Flg/S5VzV0k7NoI/AAAAAAAADPU/PpKDNqZGHgo/s1600-h/q17_plan_before.JPG"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; cursor: pointer; width: 320px; height: 184px;" src="http://2.bp.blogspot.com/_SNVQDBU1Flg/S5VzV0k7NoI/AAAAAAAADPU/PpKDNqZGHgo/s320/q17_plan_before.JPG" alt="" id="BLOGGER_PHOTO_ID_5446386143102449282" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Pic 1. Q17 query execution plan before the change&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_SNVQDBU1Flg/S5VyljaQ2LI/AAAAAAAADPM/FC3ol8hB6tI/s1600-h/q17_plan_after.JPG"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 320px; height: 107px;" src="http://2.bp.blogspot.com/_SNVQDBU1Flg/S5VyljaQ2LI/AAAAAAAADPM/FC3ol8hB6tI/s320/q17_plan_after.JPG" alt="" id="BLOGGER_PHOTO_ID_5446385313860606130" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Pic 2. Q17 query execution plan after the change&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;I am not 100% sure what exactly caused the improvement. It looks like 3 way join is much less efficient than 2 way join. Anyway - speedup is at least 20 times. I was not patient enough to wait for the result of the non optimized query. I would be happy if you correct me and give more insights here.&lt;br /&gt;For the Q20 the similar optimization was done.&lt;br /&gt;&lt;br /&gt;The last important observation concerns the sensitivity of Postgresql to the amount of the system free memory. From my experience with commercial RDBMS systems, you should give enough memory to the database server and it will perform better, mostly by caching the data. I intentionally omit other usage of the memory by database servers, for simplicity. Contrary to many commercial RDBMS servers Postgre SQL relies on the OS file system caching. It expects OS to cache frequently accessed data. So queries execution time is seriously affected by the amount of the system free memory. OS uses free memory as a file cache. I can deduce from it – that Postgresql should be used as a standalone server, since it is not capable of “defending” its cache against other memory consumers. I am curious if this caching architecture is a best choice?&lt;br /&gt;&lt;br /&gt;Here is the summary of the actions which had an impact on the TPCH performance:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Disable Nested loops&lt;/li&gt;&lt;li&gt;Rewrite Q17, Q20&lt;/li&gt;&lt;li&gt;Free as much system memory as possible.&lt;/li&gt;&lt;/ul&gt;The following article was of great help in setting up the test:  &lt;a href=http://www.phacai.com/preparations-to-use-tpch-data-on-postgresql&gt;   Preparations to use TPCH data on Postgresql &lt;/a&gt; &lt;br /&gt;&lt;br /&gt;Test was performed on a very modest system (Windows XP, Core 2 Duo 2.13 GHz and one SATA drive with 7200 RPM). The scale factor was modest as well: 1 GB.&lt;br /&gt;As a next step of the research I will look for the balance between CPU and IO for the TPCH on Postgresql, then move to bigger scales. The goal will be optimal price/performance of the TPCH on Postgresql.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3820162693712878902-7138489173394644289?l=dbaspects.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbaspects.blogspot.com/feeds/7138489173394644289/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbaspects.blogspot.com/2010/03/tpch-on-postgresql.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3820162693712878902/posts/default/7138489173394644289'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3820162693712878902/posts/default/7138489173394644289'/><link rel='alternate' type='text/html' href='http://dbaspects.blogspot.com/2010/03/tpch-on-postgresql.html' title='TPCH on Postgresql'/><author><name>David Gruzman</name><uri>http://www.blogger.com/profile/12513781804510839403</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_SNVQDBU1Flg/S5VzV0k7NoI/AAAAAAAADPU/PpKDNqZGHgo/s72-c/q17_plan_before.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3820162693712878902.post-7973459149556874767</id><published>2010-02-04T07:39:00.000-08:00</published><updated>2010-02-04T10:51:33.916-08:00</updated><title type='text'>Relational vs non-Relational databases - paper review</title><content type='html'>I just finished reading of the very interesting paper &lt;a href="http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf"&gt;“No Relation: The Mixed Blessings of Non-Relational Databases”&lt;/a&gt; by Ian Thomas Varley which compares a relational and a non-relational databases. I enjoyed the reading very much.&lt;br /&gt;The work describes principal RDBMS services in the perspective of their replacements in the non RDBMS world. The services discussed  in this perspective are ACID properties and the expressive power of the SQL language. &lt;br /&gt;&lt;br /&gt;The author seems to be  a bit biased towards non-relational databases, but this text still going to be very interesting to anyone exploring this new area in the  database management. &lt;br /&gt;&lt;br /&gt;In my opinion the paper suggests  very good classification and organization of the information on the subject.&lt;br /&gt;&lt;br /&gt;     The bottom line – read this article if you want to clarify  your understating of RDBMS alternatives.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3820162693712878902-7973459149556874767?l=dbaspects.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbaspects.blogspot.com/feeds/7973459149556874767/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbaspects.blogspot.com/2010/02/relational-vs-non-relational-databases.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3820162693712878902/posts/default/7973459149556874767'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3820162693712878902/posts/default/7973459149556874767'/><link rel='alternate' type='text/html' href='http://dbaspects.blogspot.com/2010/02/relational-vs-non-relational-databases.html' title='Relational vs non-Relational databases - paper review'/><author><name>David Gruzman</name><uri>http://www.blogger.com/profile/12513781804510839403</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3820162693712878902.post-7984163124803736676</id><published>2010-01-24T06:41:00.000-08:00</published><updated>2010-01-26T14:18:15.954-08:00</updated><title type='text'>The Memcache vs Datastore  on google app engine</title><content type='html'>I was curious to know - what are benefits of using GAE caching mechanism, how it compares to the datastore.&lt;br /&gt; My expectations were that memchace is much cheaper then datastore. It should be  order(s) of magnitude faster and should  have  much bigger quota of calls to make. If my expectations were true, memcache would be an ideal place to cache datastore data. The reality appears to be different ... . Memcache is only 2-3 times faster and has less quota then datastore.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt; We have performed several tests of certain (100 bytes) data value storing and retrieving by both mechanisms. The performance time of several tries has been averaged.&lt;br /&gt;  Results of the first calls where dismissed since they represent  application warm-up and not the  speed of the services being tested.&lt;br /&gt;&lt;br /&gt;The results are:&lt;br /&gt;&lt;style type="text/css"&gt;.nobrtable br { display: none }&lt;/style&gt;&lt;br /&gt;&lt;div class="nobrtable"&gt;&lt;br /&gt;&lt;br /&gt;&lt;table style="text-align: left; width: 10px; height: 40px;" border="1" &gt;&lt;br /&gt; &lt;tbody&gt;&lt;br /&gt;   &lt;tr&gt;&lt;br /&gt;     &lt;td&gt;Capability&lt;/td&gt;&lt;br /&gt;     &lt;td&gt;Datastore&lt;/td&gt;&lt;br /&gt;     &lt;td&gt;Memcache&lt;/td&gt;&lt;br /&gt;   &lt;/tr&gt;&lt;br /&gt;   &lt;tr&gt;&lt;br /&gt;     &lt;td&gt;Read data&lt;/td&gt;&lt;br /&gt;     &lt;td&gt;20 milliseconds&lt;/td&gt;&lt;br /&gt;     &lt;td&gt;13 milliseconds&lt;/td&gt;&lt;br /&gt;   &lt;/tr&gt;&lt;br /&gt;   &lt;tr&gt;&lt;br /&gt;     &lt;td&gt;Store data&lt;/td&gt;&lt;br /&gt;     &lt;td&gt;40 milliseconds&lt;/td&gt;&lt;br /&gt;     &lt;td&gt;13 milliseconds&lt;/td&gt;&lt;br /&gt;   &lt;/tr&gt;&lt;br /&gt;&lt;br /&gt;   &lt;tr&gt;&lt;br /&gt;     &lt;td&gt;Access by the key&lt;/td&gt;&lt;br /&gt;     &lt;td&gt;Yes&lt;/td&gt;&lt;br /&gt;     &lt;td&gt;Yes&lt;/td&gt;&lt;br /&gt;   &lt;/tr&gt;&lt;br /&gt;   &lt;tr&gt;&lt;br /&gt;     &lt;td&gt;Lookup by attributes&lt;/td&gt;&lt;br /&gt;     &lt;td&gt;Yes&lt;/td&gt;&lt;br /&gt;     &lt;td&gt;No&lt;/td&gt;&lt;br /&gt;   &lt;/tr&gt;&lt;br /&gt;   &lt;tr&gt;&lt;br /&gt;     &lt;td&gt;Expiration capability&lt;/td&gt;&lt;br /&gt;     &lt;td&gt;No&lt;br /&gt;&lt;/td&gt;&lt;br /&gt;     &lt;td&gt;Yes&lt;/td&gt;&lt;br /&gt;   &lt;/tr&gt;&lt;br /&gt;   &lt;tr&gt;&lt;br /&gt;     &lt;td&gt;Max Data size&lt;/td&gt;&lt;br /&gt;     &lt;td&gt;1MB&lt;/td&gt;&lt;br /&gt;     &lt;td&gt;1MB&lt;/td&gt;&lt;br /&gt;   &lt;/tr&gt;&lt;br /&gt;   &lt;tr&gt;&lt;br /&gt;     &lt;td&gt;Quote (free)&lt;/td&gt;&lt;br /&gt;     &lt;td&gt;10,000,000&lt;/td&gt;&lt;br /&gt;     &lt;td&gt;8,600,000&lt;/td&gt;&lt;br /&gt;   &lt;/tr&gt;&lt;br /&gt;   &lt;tr&gt;&lt;br /&gt;     &lt;td&gt;Quote (Billed)&lt;/td&gt;&lt;br /&gt;     &lt;td&gt;140,000,000&lt;/td&gt;&lt;br /&gt;     &lt;td&gt;96,000,000&lt;/td&gt;&lt;br /&gt;   &lt;/tr&gt;&lt;br /&gt;&lt;br /&gt; &lt;/tbody&gt;&lt;br /&gt;&lt;/table&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;Now lets try to understand  the meaning  of  these numbers. Are they high or low?&lt;br /&gt;&lt;br /&gt;Speed:&lt;br /&gt;Dozens of microseconds for read / write data is similar to the performance you would expect to from the regular database server when accessing indexed data.&lt;br /&gt;So neither datastore nor memcache are faster then regular RDBMS. They are also not slower. The memcache is 2-3 times faster then the datastore.&lt;br /&gt;&lt;br /&gt;Number of calls:&lt;br /&gt;We have something like 9 memcache calls for 10 datastore calls. I could not deduce  any special relation between these two services based on these numbers.&lt;br /&gt;The  above considerations suggest the  following conclusions:&lt;br /&gt;&lt;br /&gt;Memcache is simply another mechanism which should be used when:&lt;br /&gt;&lt;br /&gt;a) Loss of data is not critical. In other words - data can be rebuilt.&lt;br /&gt;b) Expiration capability is useful.&lt;br /&gt;c) Access by key is the only way the data can be accessed.&lt;br /&gt;&lt;br /&gt;Memcache is not a significant speedup to the datastore, unless results of many datastore calls are cached as one memcache item. At the same time it might&lt;br /&gt;be significant speedup for data  acquired  some other way.&lt;br /&gt;And some speculation on my usual question - what lesson Google tries to teach us here?   - Use a persistent store when data is significant, and use a caching&lt;br /&gt;mechanism when data is transient.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3820162693712878902-7984163124803736676?l=dbaspects.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbaspects.blogspot.com/feeds/7984163124803736676/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbaspects.blogspot.com/2010/01/memcache-vs-datastore-on-google-app.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3820162693712878902/posts/default/7984163124803736676'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3820162693712878902/posts/default/7984163124803736676'/><link rel='alternate' type='text/html' href='http://dbaspects.blogspot.com/2010/01/memcache-vs-datastore-on-google-app.html' title='The Memcache vs Datastore  on google app engine'/><author><name>David Gruzman</name><uri>http://www.blogger.com/profile/12513781804510839403</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3820162693712878902.post-3592054276913220953</id><published>2009-12-22T10:59:00.000-08:00</published><updated>2009-12-22T13:11:34.873-08:00</updated><title type='text'>Impressions from the development of the applications under Google App Engine</title><content type='html'>Impressions from the development of the applications under Google App Engine are quite positive. &lt;br /&gt;It is finally java. The development on familiar j2ee platform is pleasant. It also gives the sense of freedom to move to other platform if needed.&lt;br /&gt;The good part: very easy to kick off the development. Eclipse plugin does everything and smiles to you. No more headache integrating debugger with tomcat or other container. Most of the debugging is done locally, which is very convenient.  Deployment to GAE is performed by one button click.&lt;br /&gt;Briefly, the limitations are following:  30 seconds per request, no threads and the mandatory use of datastore to persist data.  First two are more or less easy to understand. The third one is a game changer: no more SQL. We are given the distributed hash map wrapped with JDO.  Its capabilities are impressive - especially the ability to query by non key properties. Here comes interesting feature of mandatory indexing. You must have index on all properties used in WHERE clause. The nice part is that the development server builds them automatically by analyzing queries during the debugging phase. The main limitation, widely discussed over the net is maximum of 1000 objects per result set. It sounds weird and it is a first time I see such an intrusion into my programming freedom.   But instead of complaining I would like to analyze why such limitation is enforced. &lt;br /&gt;I think Google is trying to ensure proper practice on building scalable applications in the cloud. They ask us to do pure OLTP system. Do small things at a time. Use async requests for other services when you need more work done per request (here I refer to URL fetching service). Spawn small async tasks whenever possible  (good for multicore architecture).  All this sounds perfectly reasonable to me. However I already hear questions – what about the batch processing? How to do analytics? How to make mailings to many users? How to perform heavy upgrade process? And these complaints are perfectly valid. You can not build a system without those long, CPU intensive tasks. But I do not think that the OLTP optimized framework and runtime can be suitable for this type of processing. I assume that Google will introduce some other framework for the batch processes. We can speculate that it might be a  kind of MapReduce, which is  perfect for performing  massive processing,  while  does not provide any kind of real time response.&lt;br /&gt;To summarize things:  It seems that by now Google has realized that  building both short and long processing procedures in the same framework is not a good practice and  tries to  prevent us from using this practice by  brute force. &lt;br /&gt;And we are yet to see the second part of the GAE - slow and powerful.     One could imagine that it will look rather like a bulldozer, not a jet plane.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3820162693712878902-3592054276913220953?l=dbaspects.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbaspects.blogspot.com/feeds/3592054276913220953/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbaspects.blogspot.com/2009/12/impressions-from-development-of.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3820162693712878902/posts/default/3592054276913220953'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3820162693712878902/posts/default/3592054276913220953'/><link rel='alternate' type='text/html' href='http://dbaspects.blogspot.com/2009/12/impressions-from-development-of.html' title='Impressions from the development of the applications under Google App Engine'/><author><name>David Gruzman</name><uri>http://www.blogger.com/profile/12513781804510839403</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3820162693712878902.post-3690950566379444355</id><published>2009-12-18T07:09:00.000-08:00</published><updated>2009-12-18T07:16:26.572-08:00</updated><title type='text'>Datastore performance</title><content type='html'>During work on my project based on AppEngine I have made a mini benchmark of the datastore. &lt;br /&gt;The operation under the test was update of the object in the datastore, and read object by the key from the datastore.&lt;br /&gt;Both operation took from 20 to 40 milliseconds. It is not that fast. It mean that usual web service can not make more then a few access operations during request serving. Taking into account lack of muftithreading this latency makes problematic building complicated web applications build from the different services.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3820162693712878902-3690950566379444355?l=dbaspects.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbaspects.blogspot.com/feeds/3690950566379444355/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbaspects.blogspot.com/2009/12/datastore-performance.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3820162693712878902/posts/default/3690950566379444355'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3820162693712878902/posts/default/3690950566379444355'/><link rel='alternate' type='text/html' href='http://dbaspects.blogspot.com/2009/12/datastore-performance.html' title='Datastore performance'/><author><name>David Gruzman</name><uri>http://www.blogger.com/profile/12513781804510839403</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3820162693712878902.post-2840003463580710581</id><published>2009-12-06T11:46:00.001-08:00</published><updated>2009-12-06T11:47:33.978-08:00</updated><title type='text'></title><content type='html'>&lt;meta equiv="CONTENT-TYPE" content="text/html; charset=utf-8"&gt;&lt;title&gt;&lt;/title&gt;&lt;meta name="GENERATOR" content="OpenOffice.org 2.2  (Win32)"&gt;&lt;meta name="AUTHOR" content="David Gruzman"&gt;&lt;meta name="CREATED" content="20091206;21424987"&gt;&lt;meta name="CHANGED" content="16010101;0"&gt;&lt;style type="text/css"&gt; 	&lt;!-- 		@page { size: 8.5in 11in; margin: 0.79in } 		P { margin-bottom: 0.08in } 	--&gt; 	&lt;/style&gt; &lt;p style="margin-bottom: 0in; font-weight: bold;"&gt;Distributed Hash Map (DHT)  research - setting evaluation criteria.&lt;/p&gt; &lt;p style="margin-bottom: 0in;"&gt;In this post I want to set up formal criteria to compare different DHT systems:&lt;/p&gt; &lt;p style="margin-bottom: 0in;"&gt;1) &lt;span style="font-weight: bold;"&gt;Queries support&lt;/span&gt;. It is very important topic in my opinion, since the complex queries is a main&lt;/p&gt; &lt;p style="margin-bottom: 0in;"&gt;advantage of the RDBMS solutions against DHT. The main things to check here I see  &lt;/p&gt; &lt;p style="margin-bottom: 0in;"&gt;- lookup by the key, lookup by the attributes, lookup by the logical conditions over attribute. The way to  emulate join will be a separate topic of research.&lt;/p&gt; &lt;p style="margin-bottom: 0in;"&gt;2) &lt;span style="font-weight: bold;"&gt;Transactions support&lt;/span&gt;. It is not exactly the black and white, since there are different parts of the ACID properties implemented with some limitations. Each property of ACID will be evaluated separately.&lt;/p&gt; &lt;p style="margin-bottom: 0in;"&gt;3) &lt;span style="font-weight: bold;"&gt;Performance&lt;/span&gt;. There is no industry standard benchmark for the DHT performance. So I will measure simple response  &lt;/p&gt; &lt;p style="margin-bottom: 0in;"&gt;time for the available types of queries. It will not be regular benchmarking because I can not run it on the same hardware.&lt;/p&gt; &lt;p style="margin-bottom: 0in;"&gt;But if difference in orders of magnitude will be discovered it can give some hints what solution is faster.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;If you see other important criteria to be evaluated – I will add them to the evaluation process.&lt;/p&gt; &lt;p style="margin-bottom: 0in;"&gt;&lt;br /&gt;&lt;/p&gt; &lt;p style="margin-bottom: 0in;"&gt;&lt;br /&gt;&lt;/p&gt; &lt;p style="margin-bottom: 0in;"&gt;&lt;br /&gt;&lt;/p&gt; &lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3820162693712878902-2840003463580710581?l=dbaspects.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbaspects.blogspot.com/feeds/2840003463580710581/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbaspects.blogspot.com/2009/12/distributed-hash-map-dht-research.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3820162693712878902/posts/default/2840003463580710581'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3820162693712878902/posts/default/2840003463580710581'/><link rel='alternate' type='text/html' href='http://dbaspects.blogspot.com/2009/12/distributed-hash-map-dht-research.html' title=''/><author><name>David Gruzman</name><uri>http://www.blogger.com/profile/12513781804510839403</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3820162693712878902.post-487375408275329644</id><published>2009-11-07T01:20:00.000-08:00</published><updated>2009-12-06T13:17:28.448-08:00</updated><title type='text'></title><content type='html'>&lt;p style="margin: 0cm 0cm 10pt;" class="MsoNormal"&gt;&lt;span style="font-size:100%;"&gt;&lt;b&gt;&lt;span style="line-height: 115%;" lang="EN-IE"&gt;&lt;span style="font-family:Calibri;"&gt;The look into RDBMS versus DHT (Distributed Hash Table) &lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p style="margin: 0cm 0cm 10pt;" class="MsoNormal"&gt;&lt;span style="line-height: 115%;font-size:100%;"  lang="EN-IE"&gt;&lt;span style="font-family:Calibri;"&gt;Nearly all of the modern web-based information systems are using data storage engines of some kind. Some are using RDBMS’es and others Distributed Hash Tables. &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p style="margin: 0cm 0cm 10pt;" class="MsoNormal"&gt;&lt;span style=";font-family:Calibri;font-size:100%;"  &gt;&lt;span style="line-height: 115%;" lang="EN-IE"&gt;However most of the biggest players on large scale markets are using DHT’s for their data storage requirement&lt;/span&gt;&lt;span style="line-height: 115%;" lang="EN-IE"&gt;,&lt;/span&gt;&lt;span style="line-height: 115%;" lang="EN-IE"&gt; i.e.: &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p style="margin: 0cm 0cm 10pt;" class="MsoNormal"&gt;&lt;span style="line-height: 115%;font-size:100%;"  lang="EN-IE"&gt;&lt;span style="font-family:Calibri;"&gt;Amazon uses Dynamo (http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html). &lt;/span&gt;&lt;/span&gt;&lt;/p&gt; &lt;p style="margin: 0cm 0cm 10pt;" class="MsoNormal"&gt;&lt;span style="line-height: 115%;font-size:100%;"  lang="EN-IE"&gt;&lt;span style="font-family:Calibri;"&gt;Facebook uses Cassandra (http://en.wikipedia.org/wiki/Cassandra_(database))&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt; &lt;p style="margin: 0cm 0cm 10pt;" class="MsoNormal"&gt;&lt;span style="line-height: 115%;font-size:100%;"  lang="EN-IE"&gt;&lt;span style="font-family:Calibri;"&gt;Google is using BigTable (http://labs.google.com/papers/bigtable.html). &lt;/span&gt;&lt;/span&gt;&lt;/p&gt; &lt;p style="margin: 0cm 0cm 10pt;" class="MsoNormal"&gt;&lt;span style="line-height: 115%;font-size:100%;"  lang="EN-IE"&gt;&lt;span style="font-family:Calibri;"&gt;Over the next few posts I’m trying to analyse differences between the two and identify cases where each approach Is more suitable. &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p style="margin: 0cm 0cm 10pt;" class="MsoNormal"&gt;&lt;span style="line-height: 115%;font-size:100%;"  lang="EN-IE"&gt;&lt;span style="font-family:Calibri;"&gt;If you are aware of other real world usages of DHTs please drop a line so I can add them to my analysis. &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3820162693712878902-487375408275329644?l=dbaspects.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbaspects.blogspot.com/feeds/487375408275329644/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dbaspects.blogspot.com/2009/11/detailed-look-into-rdbms-versus-dht.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3820162693712878902/posts/default/487375408275329644'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3820162693712878902/posts/default/487375408275329644'/><link rel='alternate' type='text/html' href='http://dbaspects.blogspot.com/2009/11/detailed-look-into-rdbms-versus-dht.html' title=''/><author><name>David Gruzman</name><uri>http://www.blogger.com/profile/12513781804510839403</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
