Databases are full of tables. Sprawling tables. Rectangular, with columns and rows. Flatter than Iowa. There are no files or folders in a database; no junk drawers, no hidden pockets where you can put stuff.
The information you want to store can be intertwined, and users' needs can seem to overlap. Laying out a database can feel like flattening an orange peel.
Memes can help us understand what databases do, and how to use them. First I'll introduce databases with a spreadsheet analogy. Then I'll describe some type theory, and show how it helps with database design.
A database is like a spreadsheet with a rigid structure. Databases have tables, like spreadsheets have sheets. Each table has a defined set of columns; each row has exactly those fields.
The set of tables with their columns is called the database's "schema".
Performance can be similar between databases and spreadsheets.
Say the rows of your sheet are ordered by some column. If you want to find a row using that column, you can grab the scrollbar and very quickly get there.
On the other hand, if you want to find a row by any other column, the table isn't ordered for you. For those searches you can't use your scroll bar; you have to use your down arrow key.
Like spreadsheets, database tables can be ordered by columns of theirs. Such a column is called a "key" of the table. A table can have multiple keys. If it does, the database will save duplicate copies of the table, ordered different ways. The database does this under the hood - as a programmer you just declare a table's keys and then use them.
All tables have at least one "primary key". No two rows of a table may have the same primary key value - you can tell any two rows of a table apart by their primary keys.
A type is a kind of thing. If you can say "here's an X", then X is a type.
Month is a type. You can tell it is, because you can say "here's a month". And here's a month: October. October is called an "instance", or a "proof" of month. Integer is a type, and 3 is an integer.
You can say that text is a type, and that "may the force be with you" is (a) text. Mass nouns can be types too.
Not all types have instances. Find me a 9-foot-tall person, if you can. And some types can never be instanced: you can't make a square circle.
Types are requirements we can communicate to each other. Once a type has been communicated, when someone says "I have an instance of that type we talked about" we all know what that means. If you somehow did construct a square circle, you could show me and I'd be very impressed.
Integer is a type, but number is not. If you gave me an irrational number and I was ancient Greek, I would exclaim "That's not a number!" Something can only be a type to the extent that everybody agrees what the instances are.
A table is a type. The rows of the table are the type's instances. That is, a table is a kind of thing and its rows are the ones of them that exist.
Let's say you have a set of products that you sell. Here's a picture of a Product table that your database might have.
Column names are along the top. Keys are in bold.
Each row of this table contains everything you need to form an "instance" or a "proof" of Product: a title, a cost, and an image to show the user.
One Table To Rule Them All?
Let's say we're building a networking site. At first, we'll have users who log in and edit their profiles. If they subscribe, we will also show them how many times other users have viewed their profile.
For each user we have to store:
- their login info (login name, and hashed password)
- their profile info (first name, last name, favorite letter)
- whether they have subscribed
- number of profile views they have enjoyed from other users
It seems simple enough to put all of this stuff into one table. That could look something like this:
|Login Name||Password Hash||First Name||Last Name||Favorite Letter||Subscriber||Profile View Count|
And sure, the job could be done with this table. But it has some pitfalls.
Database queries (reads and writes) access whole rows at a time. Whenever you view someone's profile, the application had better not show you their password hash. And whenever you update your own profile, it had better not let you make yourself a subscriber, or raise your profile view count.
When there are too many columns in one table, overly-powerful queries are available to the application. To avoid passing this power on to its users, special code has to be written to blank out or exclude certain fields at certain points.
Whitelisting or blacklisting fields like this, you can end up with field list after field list, for different users in different situations. You risk becoming Santa Claus, checking and re-checking them all. Typically, a user should be allowed to read from (or write to) whole rows of a table at a time. Permissions should be decided table-by-table and row-by-row, but not field-by-field.
Another shortcoming of that table, is that Profile View Count is just a single number. There is no "audit trail" so to speak. We have to trust that it has been incremented correctly, and could have a hard time correcting things if it hasn't been.
One Type per Table
We can find a better schema for our social networking site, just by making tables out of types.
A user is someone who logs in. A profile is personal information about a user. User and Profile are both types, so they should each have their own tables.
Here's a better way to start to organize our database (for now leaving out subscriptions and profile views):
|Login Name||Password Hash|
|User||First Name||Last Name||Favorite Letter|
Now, the User table is used for nothing but logging people in. The Profile table is what holds their personal info. The User column in the Profile table is a "foreign key". Given a User, you can quickly find all (one) of their Profiles.
Permissions are now at the row level.
If you want to view someone's profile, you hand the application their login name and the application hands you their Profile row.
Say you want to update a profile. You hand the application an entire Profile row. If the profile is yours (if your login name matches the User field of the update) then the application performs the update on your behalf - otherwise it won't.
Propositions are Types
A proposition is a statement that can hold. Here are some:
- Brunhilda's profile has been viewed twice
- Hildebrandt Keaton is a subscriber
Propositions are not "true or false". Either they are known to hold (we have a proof), or they are not known to hold (we don't have a proof). Proving a proposition false is not a thing. The closest you can come, would be proving that a proposition leads to a contradiction. That's like proving it "had better not be true".
For example, 2+2=5 cannot be proven false. Seriously. Not even by a mathematician. All we know is that it would be a disaster if 2+2=5 weren't false. Check out Godel's Incompleteness Theorems to see this in more detail.
Earlier I said if you can have "an X" then X is a type.
Propositions are also types. An "instance" of a proposition is a proof that it holds true.
This means 2+2=4 and 2+2=5 are really both types. The instances of 2+2=4 are convincing proofs that it's true. And 2+2=5 hopefully has no instances.
All types can be seen as propositions this way. Check out the similarity between these sentences:
- 2+2=4! Don't believe me? (2+2) = (3+1) = (4+0) = 4.
- The user keaton_h has a profile! Don't believe me? *hands over table row*.
- Month! Don't believe me? October.
The proposition "There is a user" is proven by each row of the User table. Each row of the Profile table is a proof of the proposition "A user has a profile".
These three types:
- keaton_h has subscribed
- brunhilda has subscribed
- bugby123 has subscribed
differ only by the user they refer to. Given any user U, "U has subscribed" is a type. Earlier I said "a table is a type". So, to support Subscriptions do we really have to add a whole table to our database for every user?
No, we do not. A table is not a type. A table is really a "type family". Each row is an instance of a type that may incorporate its keys.
Here's our subscriber table:
This table can't be considered a type - it's a type family. It's keyed by a User column, in fact its only column. Given that user U, each row has type "U has subscribed".
In the same way, our Profile table is keyed by a User column; we can think of it as the type family "Given a user U, U has a profile". If Hildebrandt Keaton links me to his profile page, then I give the application "keaton_h". It looks in the Profile table for proof of the type "keaton_h has a profile", and gives me the one it finds.
And our User table corresponds to the type family "Given a login name L, there is a user called L". When someone logs in, they give the application a login name and a password. The application finds the evidence that there is a user by that name. If it does find this evidence, it'll have a hashed password to use to check the password that they entered.
Each table represents a kind of fact. The columns in a table's schema are the pieces of evidence that you require, as proof of that kind of fact.
Suppose we get a confirmation number when we process payments for users that subscribe. We could add a column to our Subscriber table:
|User||Payment Confirmation Number|
The application often looks for instances of "U is a subscriber" for various users U. Only rarely, if there's a dispute or something, will it need to look up the full proof including the confirmation number.
But the confirmation number is there; it's what we require as proof that a user is a subscriber. No number, no row.
Profile views, we can think of as a type family too. Each profile has received a number of profile views. That is: "Given a profile P, P has received a number of views."
This type family depends on a profile, so we need a table that's keyed by Profile. The type family also includes "a number" - that must be a column as well.
But permissions seem impossible with this table. We can't just give every user write permissions to everybody else's ProfileViewsReceived row. We need to come up with a table that users can write to for themselves. Something like this would work:
Each user can add rows to this table as long as the User field is themselves, and the Profile field is someone else. The database can provide values for the ID column.
Users can write to UserViewedProfile but we want to read from ProfileViewsReceived. How do we get the data there?
Spreadsheets give you formula cells. Each time you change the value of a cell, the formula cell is adjusted automatically by the spreadsheet program.
Databases sometimes let you define "Views". More than formula cells, these are whole formula tables. If you know how you want to transform and combine your data, you can usually declare Views, which are kept up to date as your application runs.
With a view in place, users can insert profile views as well as query for totals.
We've finished our database schema!
An application can run any code as a user inserts a row. Whenever a user inserts a hypothetical EmailSent row, the application could send an email, for example.
Inserts can trigger all kinds of side effects like this - sending payments or emails, posting to facebook, releasing the kraken, etc. The rows act as "proof" that the side effects occurred.
Views can be emulated this way too. If our database didn't support views, we could have our application update our ProfileViewsReceived table whenever a user inserts a UserViewedProfile row.
Some would say that the only things you should let users do are insert and read rows - no updating or deleting. Rather than updating their Profile, they insert a ProfileUpdate. You now must code Profile as a view that aggregates together all of the ProfileUpdates.
Doing this you end up with a perfect audit trail; you have saved all user actions. This lets you recover from a number of mistakes you can make as a programmer.
Not all applications need a perfect audit trail, of course. It's up to you to draw the line.
A database is a place to write stuff down. You give it whatever schema you want, and then plug in whatever data you want. Then you can read your data back out.
Type theory is full of ideas and techniques that help us make the most out of databases. I've presented it here informally, but used some of the jargon, and tried to describe things from the ground up, in a constructive
Now, go apply some type theory!