Eventual consistency for SQL: Normalization
Using a log-structured schema, we can merge SQL databases to achieve eventual consistency.
Previously, I introduced eventual consistency for SQL. This post illustrates how to normalize an eventually consistent SQL database.
To demonstrate how to normalize for eventual consistency, let's design a database for a Twitter clone, consisting of users and statuses. A traditional schema for a Twitter clone looks like:
CREATE TABLE Users (
username VARCHAR(255) PRIMARY KEY,
email VARCHAR(255),
phone VARCHAR(255),
location VARCHAR(255),
confirmed BOOLEAN NOT NULL,
salt VARCHAR(255) NOT NULL,
hashed_password VARCHAR(255) NOT NULL
);
CREATE TABLE Statuses (
content VARCHAR(140) NOT NULL,
created_at DATE DEFAULT ( SYSDATE ) NOT NULL,
username VARCHAR(255) REFERENCES Users (username) NOT NULL,
PRIMARY KEY (content, created_at, username)
);
CREATE TABLE Follows (
follower_username VARCHAR(255) REFERENCES Users (username) NOT NULL,
followee_username VARCHAR(255) REFERENCES Users (username) NOT NULL,
PRIMARY KEY (follower_username, followee_username)
);
The first step in normalizing for eventual consistency is to identify state changes in your data. For example, a user becomes confirmed after clicking a link in an email or text message. Under the schema above, the following `UPDATE` statement would get executed:
UPDATE Users
SET confirmed = true
WHERE username = 'kmarekspartz';
However, since we're avoiding `UPDATE`, this will not work. Instead, let's normalize this mutation out of our database.[^1]
[^1]: I'm going to assume offline migrations for simplicity, but these migrations can be achieved in a zero-downtime environment, too. You would create both places for the data to reside, deploy a version of the application that reads from both, deploy a version that writes to both, run a backfill migration (like in the example), then deploy a version which only reads and writes the new place, and finally drop the old place. Fun!
CREATE TABLE Confirmations (
username VARCHAR(255) REFERENCES Users (username) NOT NULL
);
INSERT INTO Confirmations
SELECT username
FROM Users
WHERE confirmed = true;
ALTER TABLE Users
DROP COLUMN confirmed;
Now, confirming a user is an `INSERT` rather than an `UPDATE`:

INSERT INTO Confirmations VALUES ('kmarekspartz');
To read whether a user is confirmed, left join against Confirmations; a user is confirmed if a matching row exists:

SELECT Users.*, (Confirmations.username IS NOT NULL) AS confirmed
FROM Users
LEFT OUTER JOIN Confirmations
ON Users.username = Confirmations.username;
Applying this normalization to the rest of the schema would lead to a new schema:
CREATE TABLE Users (
username VARCHAR(255) PRIMARY KEY,
salt VARCHAR(255) NOT NULL,
hashed_password VARCHAR(255) NOT NULL
);
CREATE TABLE Confirmations (
username VARCHAR(255) REFERENCES Users (username) NOT NULL
);
CREATE TABLE Emails (
email VARCHAR(255) PRIMARY KEY
);
CREATE TABLE UserEmails (
username VARCHAR(255) REFERENCES Users (username) NOT NULL,
email VARCHAR(255) REFERENCES Emails (email) NOT NULL,
PRIMARY KEY (username, email)
);
CREATE TABLE Phones (
phone VARCHAR(255) PRIMARY KEY
);
CREATE TABLE UserPhones (
username VARCHAR(255) REFERENCES Users (username) NOT NULL,
phone VARCHAR(255) REFERENCES Phones (phone) NOT NULL,
PRIMARY KEY (username, phone)
);
CREATE TABLE Locations (
location VARCHAR(255) PRIMARY KEY
);
CREATE TABLE UserLocations (
user_location_id INTEGER PRIMARY KEY AUTOINCREMENT,
username VARCHAR(255) REFERENCES Users (username) NOT NULL,
location VARCHAR(255) REFERENCES Locations (location) NOT NULL
);
CREATE TABLE UserLocationDeletions (
user_location_id INTEGER REFERENCES UserLocations (user_location_id) NOT NULL
);
CREATE TABLE Statuses (
content VARCHAR(140) NOT NULL,
created_at DATE DEFAULT ( SYSDATE ) NOT NULL,
username VARCHAR(255) REFERENCES Users (username) NOT NULL,
PRIMARY KEY (content, created_at, username)
);
CREATE TABLE Follows (
follow_id INTEGER PRIMARY KEY AUTOINCREMENT,
follower_username VARCHAR(255) REFERENCES Users (username) NOT NULL,
followee_username VARCHAR(255) REFERENCES Users (username) NOT NULL
);
CREATE TABLE FollowDeletions (
follow_id INTEGER REFERENCES Follows (follow_id) NOT NULL
);
I've turned most properties into many-to-many relationships and added deletion tables for the join tables. This is because many-to-many relationships are easier to merge than one-to-one ones: with a many-to-many, you can use `UNION` as your merge strategy, while with a one-to-one there's no deterministic way to choose a winner. This will result in temporary inconsistencies, but if you have each user interact with a particular host or shard, you can minimize those inconsistencies. In a mobile environment, the user is interacting with their phone, and we can guarantee that their phone is in a consistent state at any given time. In a server environment, we can route that user's interactions to a particular host, failing over to an in-sync replica if that host is down.
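To make the `UNION` merge concrete, here's a sketch assuming two replicas each keep their own copy of a join table (ReplicaA_UserEmails and ReplicaB_UserEmails are hypothetical names, not part of the schema):

SELECT username, email FROM ReplicaA_UserEmails
UNION
SELECT username, email FROM ReplicaB_UserEmails;

Because `UNION` deduplicates identical rows, replaying the same insert on both replicas is harmless.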
One thing that doesn't work well as a many-to-many relationship is passwords. I left them as one-to-one, but passwords aren't needed anymore, particularly when an email address or a phone number is available. Instead of asking a user for a password, we can send them a link with a token to sign in. This is one-factor authentication, but it uses what is commonly the second factor in two-factor authentication. Removing passwords from this schema would make eventual consistency possible.
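For illustration only, a token-based sign-in could fit the same insert-only style. The table and column names below are assumptions, not part of the original schema:

CREATE TABLE SignInTokens (
token VARCHAR(255) PRIMARY KEY,
username VARCHAR(255) REFERENCES Users (username) NOT NULL,
created_at DATE DEFAULT ( SYSDATE ) NOT NULL
);

Each sign-in request appends a row, and the emailed or texted link carries the token.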
I like to call this method of normalization 'log normalization' but that gets confusing. It is too bad we have unique constraints on technical terms. If there's a better name for this, let me know!