Users Table Mutations

Three ways of building and altering a users table

Synopsis

John Doe just started a new job as a CTO in a Uber-like startup. He will have to scale, disrupt and innovate. But for now, he needs a users table.

Day 1: Users are simple entities with ID, email, password, name and role (customer, driver or admin).
Day 2: John adds a phone number to all users.
Day 3: New use cases, business is growing. Users need country, gender and optional company.
Day 4: Let’s add a bio to driver users only.
Day 5: Some clients want to add custom properties to their customers account.

John could have met the day-to-day needs in an infinite number of ways. Let’s examine 3 scenarii amongst others (John chooses PostgreSQL).

1st Scenario — Bigmouth Buffalo’s Path

Day 1

John creates a users table with 6 columns: id (uuid, primary key), email (varchar), password (varchar), role (enum: customer, driver, admin), created_at and updated_at (timestamps).

Day 2

John alters the users table and adds a phone column (varchar, nullable). Phone is nullable: the users created on day one do not have a phone number, and John will maybe never know their phone numbers.

Day 3

John alters again the users table and adds country (varchar, nullable), gender (enum, nullable), and company (varchar, nullable).

Day 4

Biographies are only for drivers. There are at least two possibilities:

Create a nullable bio text column (NULL for non-drivers)
Because it’s only for drivers, create a drivers_biographies table with FK on users and a bio text column.

Each possibility has its pros and cons, John chooses to add a new table because he dislikes having too much NULL and he is certain he never will add biography to other account types. Maybe he is wrong.

Day 5

Clients can now ask for custom fields on their users. So it’s some kind of dynamic variable properties. John could add nullable new column on demand on its users table (client ask for users’ pets names, John adds a pet_name column). But it does not sound scalable to him, maybe he is wrong.

So, he decided to create two new tables: properties and users_properties. The properties table has 2 columns:

id (uuid)
label (varchar) Example: “Pet name”

The users_properties has 3 columns:

user_id (foreign key on _users_ table)
property_id (foreign key on _users_properties_ table) Example: the id of “Pet name”
value (uh… Something? it depends. So, Text I guess?!)

Now every time a client ask for new properties on its users, John’s application adds a new entry in properties table. Then, for each new user with custom properties, some lines are added in users_properties table for its own properties (EAV-like pattern).

Post-mortem — Mary Foobar analysis

There seems to be some problems with this approach:

company is an optional column which is empty most of time. There is only one for now, but what about day 12?
drivers_biographies is a one to one table, it’s overkill.
Mixing users’ data in two tables is a mess. Why are some fields (company, gender) in users table and some other (mobile_phone, pet_name) in users_properties table (with a different schema)? John’s database architecture could be considered inconsistent.
Querying a user with its properties would be harder than usual.
Mapping to objects (assuming John uses this kind of thing) would be complicated too.
There are 4 tables at day 5, only for describing users.
Every new column in users table has to be nullable: theses columns were added after the first insertions of users, which have no value.

Quick and dirty implementation of the Bigmouth Buffalo’s Path.

Maybe John could have built a better architecture.

2nd Scenario —Bluefish’s path

Day 1

John is foresighted. As of Buffalo Mouth’s Path, he creates a users table with id, email, password and role. But he also creates immediately a users_properties table (Entity-Attribute-Value model), similar to Wordpress approach. users_properties is a table with user_id (foreign key on users table), key (varchar, example: name, company, phone, etc.), and value (text).

Day 2, 3, 4, 5

John is relaxed, he does not have to alter its table schemas. Its program just adds new users in database with their new properties. He created an admin panel to quickly add new properties.

Post mortem — Mary Foobar Analysis

There is one benefit over the previous design. It’s simple to explain and understand, each property is located in one unique table. EAV is a well known pattern, even if John had not heard about it, he would have invented it himself. There are only two tables for describing users metadata, and there will not be more. But there are also caveats:

Simple queries become complex and unreadable versus a classic table with one column by property. SELECT name FROM users WHERE age=50 AND gender='f' becomes:

SELECTname.meta_value AS name,FROMusers_properties age,users_properties name,users_properties genderWHERE age.key = ’age’AND age.value = ’50'AND age.user_id = name.user_idAND gender.user_id = age.user_idAND gender.key = ’gender’ and gender.value = ’f’

The table is not readable with human eye. SQL is about rows and columns, the model is destroyed by EAV.

Unreadable random table found on Google Images

Duplicate data is harder to detect, no data types, no data hierarchy, no consistency, etc.

3rd Scenario — Rainbow Trout’s path

Day 1

As many else, John took an interest years ago in “NoSQL”. He tried and abandoned MongoDB to return to its first love, PostgreSQL. He remembers feeling schema-less data can have benefits in some cases, especially in variable metadata. So this time, John creates a users table with id, email, password, role, metadata (JSONB) created_at and updated_at. Its metadata column is schema-less, he could store objects like:

{"phone": "+33612345678", "company": "My company", "gender": "f"}

Day 2, 3, 4, 5

Table schema does not change.

Post mortem — Mary Foobar Analysis

There are some benefits with this design over previous paths:

Only one table for describing one entity: users.
Queries are easy to write: select * from users where metadata->'age'=50
Data is easy to read in one row per user (to be fair, the JSON part is a bit harder to read if it grows).
EAV is avoided for user defined fields.

There are some warnings too:

John should remember to add a GIN index on metadata.
John should not add relation in metadata. With a JSONB column, there is a great temptation to add everything in that column. If John adds a new entity, he has to create a new table, not add a sub-property like:

{ "bookings": [{ "id": "aaa-bb-cc", "date": "2017–01–12", "duration": "10 days"},{ "id": "xxx-yy-zz", "date": "2017–02–10", "duration": "1 day"},]}

Final Thoughts

In my tinkerer career, I created users table following these three paths (not precisely, but similar) and some other: we could think about dynamically adding column and tables, changing for another DBMS, etc. For the last few months, I’ve been following the third path. I’m sure there is no « right » way, and this path has many hidden caveats too (I should read more) but I’m OK with it for my day-to-day work. I migrated a legacy system with 60+ half-useless tables to about 10 tables by removing metadata tables. Not sure it’s better anyway, and I’m not saying less table is better, but in this specific case, I think code and database are easier to read, create, update and delete. I know I will discover a new path in a few month and be ashamed of what I have done.

The Rainbow Trout’s path is OK for users table, I don’t think it’s a good idea to add JSONB everywhere on every table every time. It may help to store metadata only. Not relations.
I did not speak about indexes, that’s not the matter, but please use index everywhere.
I only talked about PostgreSQL, maybe SQL is not the thing to build a users database.
I’m still astonish I did not find an obvious best way to build a users table after all these years. Did I miss something?

Feel free to comment with advices, feedback and criticism. I would be really happy to learn more.

Sorry for long post, here is a potato