Dynamic Web Content: What you can do in Postgresql that you cannot do in Oracle.
May 27th, 2007 byWith the advent of dynamic online content comes the need to create, destroy and manage this content in more automated and efficient ways. I have discovered some very nifty ways to do this in Postgresql which cannot be done as well, as easily, or even at all in Oracle. As it turns out, Postgresql is very well suited for dynamic database generation, for any size data set.
Let’s say you have a web site where users create dynamic content. You have these basic choices when determining how to store your content:
1. Place everything in a monolithic database (one large database, many indexed fields), sorting by key/indexed fields. User content can be distinguished by a unique user id, or maybe there is no need to distinguish user content at all. It depends on the data being stored, and the applied use of that data.
2. Dynamically create a database from a fixed (or even from a dynamic) database schema. Store all user data in a separate database, and metadata in a common (master) database.
Option 1 is the most common. Everyone doing web content seems to want to default to this model, maybe because it is the easiest to roll out quickly, maybe because they think they have no other choice.
In some cases, it’s an appropriate model based on the data set. An example that comes to mind which makes this model appropriate would be weather reporting. In a weather database, it probably doesn’t matter who recorded last night’s rainfall around the world, as long as the person had legitimate access, and recorded the data correctly. It also probably makes no sense to break the data out into different databases by zip code, region, or state. The purpose of collecting weather data is not only to report local weather conditions, but to also measure and compare the data statewide, nationwide, or worldwide. This is a contrived but decent example of a data set best being served by a monolithic database.
An example of data which can/should be broken out into separate databases is an inventory application. Let’s say you provide online inventory information for many companies, where each company is unrelated to another. Companies may even have security concerns, or may not want other companies to have access to their hardware inventory data, for example, because it may reveal trade secrets regarding what proprietary hardware designs they have devised for their business needs. Companies may also want to separate and secure their inventory data because competitors can derive sales information form this data.
A monolithic database could be used in this model, but how do you guarantee one client that their data is secure, and will not be exposed to another client, even by accident? Complex selective encryption methods could be used on the monolithic model, but why bother? Breaking the data out into separate client databases is ideal for this data model for many for these reasons:
1. Security: Data is inherently separate. Separate databases can easily be encrypted if necessary. Logins/passwords to each database can be isolated to the individual database.
2. Data Management: Individual client data can easily be archived, exported or purged to meet clients needs.
3. Local Database Support: The model inherently supports database servers owned and run by the client. If the client chooses to switch to a model where they run and maintain your software behind their own firewall, the client’s database can easily be exported, encrypted, sent to the client, decrypted, and imported on their server.
Of course, these issues can be dealt with in the monolithic model, with much wrestling and convoluted scripting/management, but again, why make the job of the developers and database admin more difficult?
Using the above example, let’s say we have an inventory database schema for Postgresql which looks something like this, in a file called client_db_schema (Highlight the text if you have trouble seeing it all):
CREATE SEQUENCE simple_inventory_id_seq # The sequence number is for DBAPI 2.0 compliance.
INCREMENT BY 1
MINVALUE 1
NO MAXVALUE
CACHE 1;
CREATE TABLE simple_inventory (
simple_inventory_id integer unique DEFAULT nextval(('simple_inventory_id_seq'::text)::regclass) NOT NULL PRIMARY KEY,
item_number integer unique NOT NULL,
item_description character varying NOT NULL,
(...any other fields you can imagine)
);
(more client-specific tables go here)
After installing Postgresql, run the createuser command:
createuser -s queen_bean
It will prompt you for a password.
The “queen_bean” super user can create or destroy databases, create other users, etc. She really is the queen.
To create a new inventory database from Python and psycopg, do this:
import os
import psycopg
super_db = psycopg.connect('dbname=template1' + ' user='queen_bean' + ' password=whatever the password is' + ' port=' + str(5432) + ' host=localhost')
cursor = super_db.cursor()
# Notice that 'template1' is the actual name of the Postgresql internal master database.
# Connect to this to do database create/delete operations, but don't delete the template1 database itself.
super_db.conn.set_isolation_level(0) # This tells Postgresql that we're about to do high level, 'narrow' admin operations.
for i in range(1,6):
client_string = "client%d" % i
#cursor.execute("drop database %s_inventory;" % client_string) #optional
cursor.execute("create database %s_inventory with encoding = 'UTF-8' ;" % client_string)
super_db.close()
To do this on the command line instead of in Python, you’d type this:
psql template1 queen_bean psql> create database client1_inventory; psql> create database client2_inventory; etc... pqsl> \\q
This loop created five different, empty client databases. Now we need to add the tables from the client database schema to each database:
# This is the schema file we created above. It's in the current directory, but could be anywhere (change the ./ component).
new_tables = os.path.abspath(os.path.join(os.path.dirname(__file__),"./client_db_schema"))
# Read this file into memory:
fo = open(new_tables)
commands = ' '.join(fo.readlines())
fo.close()
for i in range(1,6):
client_string = "client%d" % i
client_db = psycopg.connect('dbname=' client_string + '_inventory' + ' user='queen_bean' + ' password=whatever the password is' + ' port=' + str(5432) + ' host=localhost')
cursor = client_db.cursor()
cursor.execute(commands)
client_db.close()
The command line equivalent would be:
psql client1_inventory queen_bean < ./client_db_schema psql client2_inventory queen_bean < ./client_db_schema etc... \\q
The above code is an oversimplification of some db concepts. But it works, and it demonstrates how simple it is to dynamically generate Postgresql databases for client data management.
In Postgresql, all databases are managed under the same data area on the disk. They are all accessible through the same database instance, or daemon. This is why we only connected once to the template1 master database to create the client databases.
In Oracle, the terminology is slightly different. A ‘database’ is understood to be the running instance, or daemon. The ‘instance’ is an individual database as Postgresql would know it. To create new Oracle ‘instances’, or databases requires shutting the entire Oracle daemon down! This means that all users accessing all oracle instances at that moment lose connectivity, because a new database is being dynamically generated.
In any database tool, a shutdown is not trivial. Connection cleanup, index cleanup, memory flush to disk, and garbage collection, amongst other things, happen during shutdown. This means that shutdown could take minutes to perform, making this model impossible to use under Oracle. Being experienced in both Oracle and Postgresql, I could list many other reasons why you would not want to use Oracle over Postgresql on any given day for any given project, but that is material for yet another article.
The above code implies some other neat features as well. Since the database schema comes from a file, that file could be generated in real time, from FORM data. So not only is the database dynamically created, but the schema can change in real time as well.
Beware of bogus arguments against this model:
Gripe: “But a separate connection is required for each user, for each database. This means thousands of connections, which could cause problems”
Answer: ZZZTTTT! Wrong. Connections which are open but inactive take up very few, if any, database resources. Plus, in a dynamic web design, the connection is being opened and closed immediately after use. So connections never linger, because of the threaded model of a web server. Plus, pg_pool does a good job at managing connections, if for some reason they are left open (usually programming bugs and crashes cause this). Unlike Oracle, a process’ database connections are all closed when the process closes. So bouncing an HTTP server will forcably cause db connection cleanup in Postgresql. In Oracle, connections are stored in the IPC table (Inter process communication table, a system resource), and they must be manually closed/cleared.
Gripe: “The multiple database model is inefficient and will negatively affect performance.”
Answer: BBBBBLLLAAP! Wrong. Testing has revealed that it is no more or less efficient than the monolithic database. This model is used very effectively in production web sites.
I hope this serves as a handy primer, and encourages you to play. Feel free to post questions.
~G~



May 27th, 2007 at 11:35 pm
I was going to post about your MySQL woes, but I see you’re also using PostgreSQL, more power to you! :-)
Well, I know I’m preachin’ to the converted now, but I’ll let you know about the comparison I just published, more ammo for the right choice. ;-)
MySQL-PostgreSQL comparison
http://www.teknico.net/devel/myvspg/index.en.html
May 28th, 2007 at 6:36 am
That was a great article, thank you. I don’t use MySQL, but I was looking for concrete reasons why I shouldn’t use it. Many people here seem like MySQL Fans, so I didn’t want to offend. I was going to just overstep the subject :)
Thanks again.
May 29th, 2007 at 5:04 am
Have you never heard of Oracle Label Security ?
May 29th, 2007 at 5:37 am
No, probably because it’s a recent $20,000 add-on. Label based security seems to be row locking. How does row locking even come close to the great choices in solutions that Postgresql provides? It doesn’t. But it seems to be the Oracle admin’s solution to everything. Postgresql is leaps and bounds ahead, in design and flexibility.
Thanks for trying.
~G~
June 18th, 2007 at 10:42 pm
I tend toward the monolithic database design for a couple of different reasons.
If I end up needing (or even just wanting) data from across the whole data set later, having everything in one is usually easier to query, and nearly always faster.
A lot of times I end up writing my own authentication code anyhow, so splitting up the data into different databases does not really help all that much. I’ll even do crazy stuff like make my own subclass of DBI (yeah, I still use PERL) that interfaces with my authentication mechanism and automatically restricts the results to the appropriate subset.
Often I am doing work for customers who are using web hosting ($5.99 per month for 2000 gigabytes of transfer!) and on hosted servers, there is generally a cap of 5 or 10 mysql databases. If I were to dynamically create databases per customer or something, it would not be too long before the hosting plan ran out!
June 19th, 2007 at 7:27 am
Andrew, good points. It is true that if your data set requires a lot of reconciliation between client data sets, the monolithic model may work best for you. I personally have used both models, depending on the requirements of the project and the data set.
Re: the hosted server cap, I wonder why they would want to find a balance between the number of instances and the size of each instance. It would be great if the limits could be treated like a disk quota, filling it either through max number of files or max bytes, with reasonable caps on each.
I do my own hosting, so I have this flexibility. Thank you for the info.
September 14th, 2007 at 5:16 pm
I’ve actually experienced having to transition from the multiple database model (used for pretty much the reasons you listed) to a monolithic model in a production environment using postgresql.
I was actually pondering the merits of such a change when I found this blog.
Anyways, the reason that we had to move away from a multiple database model is that connection pooling in a single process became a problem if it had to access multiple databases. I was never able to figure out how to get a connection pool to switch between databases and therefore we either ended up with too many connection (wasting connection overhead) or not enough in our pools.
Secondary was having to update each database individually as schema’s changed, or running queries which required data from multiple databases (or all).
Performance was actually worse (due to large tables), or no better, it was the other two issues which forced us back to a monolithic design. Perhaps a better pooling mechanism could solve the first issue (any suggestions?), but the second problem seems here to stay.
September 15th, 2007 at 10:16 am
Hi clineb,
I have found that certain models work best with the one-db-per-client/data-set and certain models do not.
The first question I asked myself is how much data, if any, would be shared between clients. If lateral searching across client data will be done frequently, the monolithic model would serve you better. If client data is infrequently searched laterally, the individual dbs can be quite handy. If a monolithic db is too big, multiple dbs combined with a process which collects and constructs metadata tables may be a good solution for you.
Connection pooling is definitely an issue when many dbs are involved. It requires grouping of pools, which is more painful to manage than one flat pool. But if your app is such that only a few connections are open at any given time per client per database, this is easier to automate, and becomes manageable. If your app is such that so many clients have so many connections per database, it sounds like you need a bigger, distributed hardware model, and a distributed db model to match. If it’s distributed, you won’t run out of file descriptors, and you will at some level end up with different databases (which are reconciled/converted to some form of metadata by a separate daemon).
Schema changes across many dbs don’t bother me too much for this reason. As a rule, I script schema changes anyway, and do a practice run on the test databases. Then the schema change is scheduled and run in production in a scripted, organized, reproducible fashion. if I run that script on one or one thousand databases, I am only concerned with how long it takes to finish, and whether or not the db has to be offline to complete the schema change. No other aspect of it bothers me, since I can automate the entire process and kick it off.
There is no silver bullet configuration which works best for all apps and data sets. If data sets are too large for one table, it may have to be split up into different tables or other sets of metadata. For each split either to a separate table or database, there is a price to pay in the form of an additional join or an additional file descriptor. Reindexing and vacuuming needs to be performed to match the frequency of inserts/deletes. The frequency of inserts vs. updates affects the type of indexing to choose.
The list of things to consider goes on and on. This is why I do not mind evolving databases to match the new demands of the data sets. I think it is impossible to predict all possible uses of your data, and it’s part of the evolution of any database model to rework these issues over time.
That being said, I wish you luck 8) If you want to have an in-depth connection pooling discussion using pgpool or the SqlAlchemy pooling options, post your email address here and we can continue this discussion.
Gloria
January 1st, 2008 at 9:10 pm
This is a useful article, and is broadly correct about its tradeoffs, except I think you may have the wrong idea about Oracle in some ways.
Firstly, on terminology: a “database” in Oracle terminology is the set of data files, control files, and configuration. An instance is the set of running processes that mount the database and share memory, known as the “SGA”, or system global area. A database may typically only have multiple instances if it is running in a Real Application Cluster. These are the definitions used in the Oracle Concepts manual which is freely available on the Web.
To fulfill your option #2 in Oracle, one creates different user accounts with their own schemas. These are pretty much equivalent to multiple “databases” in PostgreSQL. There are many benefits to running multiple applications on a single Oracle instance, particularly in administrative overhead: oracle has a lot of features for tuning the resource consumption of every user, so one can stop “runaway queries” due to bugs in the web software etc. Of course, one can do this with other databases using UNIX tools, one could argue the amount of knobs available in Oracle makes for more precision tuning (which has its own tradeoffs, of course).
As for connections, I haven’t really seen problems about connections lingering in the IPC table, though my experience is on the Java application server side. Usually either we use the web framework’s connection pool or Oracle’s shared server feature…