With the advent of dynamic online content comes the need to create, destroy and manage this content in more automated and efficient ways. I have discovered some very nifty ways to do this in Postgresql which cannot be done as well, as easily, or even at all in Oracle. As it turns out, Postgresql is very well suited for dynamic database generation, for any size data set.
Let’s say you have a web site where users create dynamic content. You have these basic choices when determining how to store your content:
1. Place everything in a monolithic database (one large database, many indexed fields), sorting by key/indexed fields. User content can be distinguished by a unique user id, or maybe there is no need to distinguish user content at all. It depends on the data being stored, and the applied use of that data.
2. Dynamically create a database from a fixed (or even from a dynamic) database schema. Store all user data in a separate database, and metadata in a common (master) database.
Option 1 is the most common. Everyone doing web content seems to want to default to this model, maybe because it is the easiest to roll out quickly, maybe because they think they have no other choice.
In some cases, it’s an appropriate model based on the data set. An example that comes to mind which makes this model appropriate would be weather reporting. In a weather database, it probably doesn’t matter who recorded last night’s rainfall around the world, as long as the person had legitimate access, and recorded the data correctly. It also probably makes no sense to break the data out into different databases by zip code, region, or state. The purpose of collecting weather data is not only to report local weather conditions, but to also measure and compare the data statewide, nationwide, or worldwide. This is a contrived but decent example of a data set best being served by a monolithic database.
An example of data which can/should be broken out into separate databases is an inventory application. Let’s say you provide online inventory information for many companies, where each company is unrelated to another. Companies may even have security concerns, or may not want other companies to have access to their hardware inventory data, for example, because it may reveal trade secrets regarding what proprietary hardware designs they have devised for their business needs. Companies may also want to separate and secure their inventory data because competitors can derive sales information form this data.
A monolithic database could be used in this model, but how do you guarantee one client that their data is secure, and will not be exposed to another client, even by accident? Complex selective encryption methods could be used on the monolithic model, but why bother? Breaking the data out into separate client databases is ideal for this data model for many for these reasons:
1. Security: Data is inherently separate. Separate databases can easily be encrypted if necessary. Logins/passwords to each database can be isolated to the individual database.
2. Data Management: Individual client data can easily be archived, exported or purged to meet clients needs.
3. Local Database Support: The model inherently supports database servers owned and run by the client. If the client chooses to switch to a model where they run and maintain your software behind their own firewall, the client’s database can easily be exported, encrypted, sent to the client, decrypted, and imported on their server.
Of course, these issues can be dealt with in the monolithic model, with much wrestling and convoluted scripting/management, but again, why make the job of the developers and database admin more difficult?
Using the above example, let’s say we have an inventory database schema for Postgresql which looks something like this, in a file called client_db_schema (Highlight the text if you have trouble seeing it all):
CREATE SEQUENCE simple_inventory_id_seq # The sequence number is for DBAPI 2.0 compliance.
INCREMENT BY 1
MINVALUE 1
NO MAXVALUE
CACHE 1;
CREATE TABLE simple_inventory (
simple_inventory_id integer unique DEFAULT nextval(('simple_inventory_id_seq'::text)::regclass) NOT NULL PRIMARY KEY,
item_number integer unique NOT NULL,
item_description character varying NOT NULL,
(...any other fields you can imagine)
);
(more client-specific tables go here)
After installing Postgresql, run the createuser command:
createuser -s queen_bean
It will prompt you for a password.
The “queen_bean” super user can create or destroy databases, create other users, etc. She really is the queen.
To create a new inventory database from Python and psycopg, do this:
import os
import psycopg
super_db = psycopg.connect('dbname=template1' + ' user='queen_bean' + ' password=whatever the password is' + ' port=' + str(5432) + ' host=localhost')
cursor = super_db.cursor()
# Notice that 'template1' is the actual name of the Postgresql internal master database.
# Connect to this to do database create/delete operations, but don't delete the template1 database itself.
super_db.conn.set_isolation_level(0) # This tells Postgresql that we're about to do high level, 'narrow' admin operations.
for i in range(1,6):
client_string = "client%d" % i
#cursor.execute("drop database %s_inventory;" % client_string) #optional
cursor.execute("create database %s_inventory with encoding = 'UTF-8' ;" % client_string)
super_db.close()
To do this on the command line instead of in Python, you’d type this:
psql template1 queen_bean
psql> create database client1_inventory;
psql> create database client2_inventory;
etc...
pqsl> \\q
This loop created five different, empty client databases. Now we need to add the tables from the client database schema to each database:
# This is the schema file we created above. It's in the current directory, but could be anywhere (change the ./ component).
new_tables = os.path.abspath(os.path.join(os.path.dirname(__file__),"./client_db_schema"))
# Read this file into memory:
fo = open(new_tables)
commands = ' '.join(fo.readlines())
fo.close()
for i in range(1,6):
client_string = "client%d" % i
client_db = psycopg.connect('dbname=' client_string + '_inventory' + ' user='queen_bean' + ' password=whatever the password is' + ' port=' + str(5432) + ' host=localhost')
cursor = client_db.cursor()
cursor.execute(commands)
client_db.close()
The command line equivalent would be:
psql client1_inventory queen_bean < ./client_db_schema
psql client2_inventory queen_bean < ./client_db_schema
etc...
\\q
The above code is an oversimplification of some db concepts. But it works, and it demonstrates how simple it is to dynamically generate Postgresql databases for client data management.
In Postgresql, all databases are managed under the same data area on the disk. They are all accessible through the same database instance, or daemon. This is why we only connected once to the template1 master database to create the client databases.
In Oracle, the terminology is slightly different. A ‘database’ is understood to be the running instance, or daemon. The ‘instance’ is an individual database as Postgresql would know it. To create new Oracle ‘instances’, or databases requires shutting the entire Oracle daemon down! This means that all users accessing all oracle instances at that moment lose connectivity, because a new database is being dynamically generated.
In any database tool, a shutdown is not trivial. Connection cleanup, index cleanup, memory flush to disk, and garbage collection, amongst other things, happen during shutdown. This means that shutdown could take minutes to perform, making this model impossible to use under Oracle. Being experienced in both Oracle and Postgresql, I could list many other reasons why you would not want to use Oracle over Postgresql on any given day for any given project, but that is material for yet another article.
The above code implies some other neat features as well. Since the database schema comes from a file, that file could be generated in real time, from FORM data. So not only is the database dynamically created, but the schema can change in real time as well.
Beware of bogus arguments against this model:
Gripe: “But a separate connection is required for each user, for each database. This means thousands of connections, which could cause problems”
Answer: ZZZTTTT! Wrong. Connections which are open but inactive take up very few, if any, database resources. Plus, in a dynamic web design, the connection is being opened and closed immediately after use. So connections never linger, because of the threaded model of a web server. Plus, pg_pool does a good job at managing connections, if for some reason they are left open (usually programming bugs and crashes cause this). Unlike Oracle, a process’ database connections are all closed when the process closes. So bouncing an HTTP server will forcably cause db connection cleanup in Postgresql. In Oracle, connections are stored in the IPC table (Inter process communication table, a system resource), and they must be manually closed/cleared.
Gripe: “The multiple database model is inefficient and will negatively affect performance.”
Answer: BBBBBLLLAAP! Wrong. Testing has revealed that it is no more or less efficient than the monolithic database. This model is used very effectively in production web sites.
I hope this serves as a handy primer, and encourages you to play. Feel free to post questions.
~G~