Metabase Frequently Asked Questions
Author: Manuel Lemos (mlemos-at-acm.org)
Version control: @(#) $Id: faq.documentation,v 1.3 2006/10/18 17:22:45 mlemos Exp $
The Metabase project was started in late 1998. It was meant to address the needs of development of a Web based ERP (Enterprise Resource Planning) platform. That platform had as one of its goals to not be tied to any particular database. This way it could interest a user base with a wide range of database preferences.
The first public release of Metabase occurred in early 2000. It supported 5 database APIs: MySQL, PostgreSQL, Oracle, ODBC and mini-SQL. Later, other developers started contributing driver classes to support other database APIs.
PEAR MDB is a friendly fork of the Metabase project. It was proposed by Manuel Lemos in 2002 in order to provide a database abstraction package to PEAR users. It allows the development of database independent applications that comply with the PEAR coding rules.
PEAR MDB was started by Lukas Smith and provides a wrapper that emulates the PEAR DB package, in order to allow smooth migration of PEAR DB based applications to PEAR MDB.
Lukas also started a follow-up project named PEAR MDB2. It enhances PEAR MDB but it is not totally backwards compatible with it. PEAR MDB2 also provides a PEAR DB wrapper.
No. Metabase continues to be independently developed, mainly by Manuel Lemos. Its development is motivated mainly by the use in applications built with the Metastorage object-relational mapping code generation tool.
Although Metabase is almost feature complete as a database abstraction package, more features may be added to address the needs of the Metastorage project or the requests of Metabase users.
Currently, Metabase does not provide a way to retrieve query result rows as associative arrays. There is no way to provide a database independent solution that works in all situations.
The problem is that some databases perform changes in the names of the columns of the queries, like truncating or changing the case of the column names.
In a future release of Metabase there will be a portable solution that lets you assign names to result set columns, making it possible to access result set rows using associative arrays in a way that works with all databases.
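Until such a feature is available, a portable workaround is to list the columns explicitly in the SELECT and map positional rows to names that the application controls, so no driver-reported column name is ever trusted. Here is a minimal sketch of the idea in Python with the built-in sqlite3 module (the table and column names are hypothetical; the same pattern applies to any positional result set):

```python
import sqlite3

# Fetch rows positionally and zip them with names the application chooses,
# instead of trusting driver-reported column names (which some databases
# truncate or case-fold).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

names = ["id", "name"]  # chosen by the application, not by the driver
rows = [dict(zip(names, row))
        for row in conn.execute("SELECT id, name FROM users")]
# rows is now a list of associative arrays keyed by our own names
```

Because the column order in the SELECT list is under the application's control, this mapping is portable across databases regardless of how each driver reports column names.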
No. Metabase is not inherently slow. The benchmarks that conclude that Metabase is slow are usually cooked up by the authors of other database abstraction packages. So it is only natural that the benchmark tests are intentionally distorted to prove that those other abstractions are better. That is a trick used to attract less informed users who are unable to verify the facts.
Often those benchmarks employ tricks like using the slowest methods to call the Metabase API.
Also, those benchmarks just perform many repeated SQL SELECT queries on the same table, which makes the database server hit the same query cache or file system caches. That makes the queries take much less time to execute than in real world situations.
In real world database applications many read and write accesses to multiple database tables happen concurrently. This puts database clients on hold most of the time, regardless how fast such database client APIs can be.
Consequently, these forged benchmark tests make minor performance differences between each database API appear much more noticeable than they really are for most database applications. Besides, those minor differences are the cost of the greater level of database independence that Metabase provides, and usually those tests are only performed with a single type of database.
This situation usually happens when you are using a multi-process Web server, like Apache 1.x, and your applications use persistent connections.
It may happen even when your site is not too busy, if the site pages use too many static content items, like images or CSS style sheet files.
What happens is that when your Web server experiences a surge of many users accessing your site simultaneously, Apache needs to fork more processes to deal with the extra load.
Each new process may eventually create a new database connection. Each database connection may consume a lot of memory. When the physical memory is exhausted, your server starts using virtual memory and the server becomes very slow.
You could use non-persistent database connections, but that would slow down the database access because the re-connection overhead is often very high.
Alternatively, you can use a connection pool manager tool. That is a separate piece of middleware that sits between the Web server and the database. It establishes up to a limited number of database connections and forwards the connection requests and responses. After a surge, it can shut down unused connections to save memory.
There are connection pool managers that work with PHP, like SQL Relay. This solution works, but it obviously adds permanent overhead because all information exchanged between the database client and the server has to be forwarded by the connection pool manager middleware, making everything a bit slower.
Since the excessive number of Web server processes is often caused by the high number of static content files (images and CSS) that are served by Apache, a more efficient solution consists of serving the static content from a separate Web server.
Static content can be served more efficiently by a multi-threaded Web server. Multi-threaded Web servers take much less memory to run and can serve many concurrent accesses. Examples of Web servers that can run in multi-threaded mode are Apache 2 and thttpd.
The separate static content server can run on the same machine as the main Web server, as long as it runs on a different TCP port or IP address. Serving content on a port other than the default HTTP port (80) is not the most recommended solution, because some institutions set their firewalls to block any traffic to non-standard ports. Therefore it is always recommended to use a spare IP address, if possible.
Once you move the static content to a separate Web server, your main Apache 1.x server will only be serving pages generated by your PHP database driven application scripts. It will not open more persistent database connections than the limit of processes allowed by Apache.
To make sure your physical memory is not exhausted by excessive database connections, you should adjust the Apache configuration to a reasonable limit on the number of processes. That is done by setting the MaxClients Apache configuration option. You should also disable the KeepAlive option.
To make sure that the number of running processes returns to a reasonable number a while after a surge of accesses, you should also adjust the values of the options MaxSpareServers, StartServers and MinSpareServers. Here is an example of Apache configuration option values:
MaxClients 80
StartServers 20
MinSpareServers 20
MaxSpareServers 40
KeepAlive off
If you are having performance problems with your database application, you need to focus on "using more the brain and less the muscles". Smarter is better than stronger.
In practice this means that simply switching to a more expensive database server, or to a database abstraction package that is claimed to be faster, will not solve your problems.
You need to analyze the roots of your problems before you make any changes to your applications. Do not start making changes before you are sure of the benefits.
In general, the most effective improvements are those that provide greater performance at the expense of greater usage of certain resources, like more memory or disk space. You need to be wise and balance the pros and cons of each move.
For instance, using indexes on table columns involved in expensive SELECT queries is common performance advice. It trades more disk space, and possible INSERT or UPDATE performance degradation, for faster SELECT queries.
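The trade-off can be observed directly by asking the database for its query plan before and after creating the index. As a sketch, using Python with the built-in sqlite3 module and a hypothetical orders table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, "c%d" % (i % 100), i * 1.5) for i in range(1000)])

def query_plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes the strategy.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Without an index, the WHERE clause forces a full table scan.
before = query_plan("SELECT * FROM orders WHERE customer = 'c7'")

# The index costs disk space and slows INSERT/UPDATE slightly,
# but lets the SELECT search the index instead of scanning the table.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
after = query_plan("SELECT * FROM orders WHERE customer = 'c7'")
```

On other database servers the mechanism is the same, even though the command to inspect the plan differs (for example EXPLAIN in MySQL and PostgreSQL).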
Often, the greatest bottleneck of database applications is operations that are lengthy or repeated too many times. Therefore, the most effective way to improve the performance of database applications is to avoid accessing the database as much as you can.
Most database applications have two kinds of database performance problems, depending on whether they are mostly read-only or mostly transactional (i.e. information in the database is changed very often).
Here follow two generic solutions that focus on avoiding database access as much as possible, depending on whether your database applications are mostly read-only or mostly transactional.
Most database driven sites are mostly read-only. That is the case of those sites that publish content stored in a database, but each content article does not change frequently.
In that case, you can cache that content in intermediate storage. That avoids accessing the database repeatedly to retrieve the same information.
One common solution for single server sites is to cache the content in local disk files or shared memory. Shared memory is faster, but its capacity may be too limited. Disk file based caches are often a good compromise between capacity and access speed. Disk files are also cached by the local file system, so they often provide access performance close to what you achieve with shared memory.
If you have a site that is served by a cluster of servers, there are more adequate solutions for using distributed caches, like using memcached servers.
There is a class named File cache that provides safe means to store and retrieve arbitrary content in cache files. It employs safe locking to prevent cache file corruption when multiple concurrent accesses attempt to update and retrieve the same cache file.
You can use this class or other cache manager software to store whole result sets held in array variables, using the PHP serialize or pack functions, or any other serialization function that turns the result set values into a string.
However, if you are using result set data from multiple database queries to generate a single HTML page, it is often more efficient to cache the data of the whole HTML page than to cache just the result sets. This way you not only avoid accessing the database, but also avoid the overhead of rebuilding whole pages.
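A minimal sketch of this page caching idea follows, in Python (the File cache class mentioned above is PHP; this only illustrates the technique, with a hypothetical render step standing in for the database queries and page building):

```python
import hashlib
import os
import tempfile
import time

CACHE_DIR = tempfile.mkdtemp()  # in practice, a fixed cache directory
TTL = 300                       # seconds before a cached page expires

def cache_path(key):
    # Hash the key so any page name maps to a safe file name.
    return os.path.join(CACHE_DIR, hashlib.sha1(key.encode()).hexdigest())

def get_cached_page(key):
    path = cache_path(key)
    try:
        if time.time() - os.path.getmtime(path) < TTL:
            with open(path) as f:
                return f.read()
    except OSError:
        pass  # missing or unreadable file: treat as a cache miss
    return None

def store_page(key, html):
    # Write to a temporary file and rename it into place: the rename is
    # atomic, so concurrent readers never see a partially written file.
    path = cache_path(key)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(html)
    os.replace(tmp, path)

def render_page(key):
    page = get_cached_page(key)
    if page is None:
        # Imagine the expensive database queries and page building here.
        page = "<html>expensive page for %s</html>" % key
        store_page(key, page)
    return page
```

The first request for a page pays the full database cost; until the entry expires, later requests are served straight from the file system cache.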
Some database applications need to perform complex operations that take a lot of time to execute. Often such operations need to update one or more large tables with one or more complex indexes.
If the applications do not need to present the results of such operations immediately after they are requested, it is usually better to defer the execution of the operations to a background process.
Such a process takes job requests, usually on a first-come first-served basis. Therefore, there must be a queueing system that stores the details of the requested jobs in the order of arrival of the requests.
There is a kind of software dedicated to job queue processing that deals with the complexity of message passing, job distribution and response delivery.
If your database application needs to queue just a few kinds of jobs, you can solve this problem simply by creating a table with a few fields that are enough to hold the details of the requested jobs.
Then you can start a background process that runs once in a while, launched periodically for instance by tools like cron, and processes all queued jobs at once. This way, the database server is only loaded during the period while the background process is executing the queued jobs.
Job queuing also avoids the problem of having too many concurrent processes delaying each other while trying to update the same large or complex tables.
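The table-based queue described above can be sketched as follows, in Python with sqlite3 (the table layout and job kinds are hypothetical; a real deployment would use the application's own database, with the worker started by cron):

```python
import sqlite3

# Minimal sketch of a job queue backed by a single table, assuming one
# background worker drains the queue periodically.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,  -- records arrival order
    kind TEXT NOT NULL,                    -- what kind of job to run
    payload TEXT NOT NULL,                 -- details of the requested job
    done INTEGER NOT NULL DEFAULT 0)""")

def enqueue(kind, payload):
    # Called by the Web application: cheap, no heavy work happens here.
    conn.execute("INSERT INTO jobs (kind, payload) VALUES (?, ?)",
                 (kind, payload))
    conn.commit()

def process_queued_jobs(handler):
    # Called by the background worker: first-come first-served order.
    pending = conn.execute(
        "SELECT id, kind, payload FROM jobs WHERE done = 0 ORDER BY id"
    ).fetchall()
    for job_id, kind, payload in pending:
        handler(kind, payload)
        conn.execute("UPDATE jobs SET done = 1 WHERE id = ?", (job_id,))
    conn.commit()
    return len(pending)

enqueue("rebuild_index", "orders")
enqueue("send_mail", "user 42")
processed = process_queued_jobs(lambda kind, payload: None)
```

Because only the worker touches the large tables, the expensive updates are serialized instead of having many Web requests contend for the same rows and indexes.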