How to Merge Two User Database Tables

You are probably reading this because you realise that maintaining two user database tables when they have a common purpose is quite restricting. It really is; your code becomes more complex, your users don’t understand why they have two accounts and so on.

You might also be thinking merging two user database tables is pretty trivial; well here is a short summary of what could go wrong:

  • A bug or logical oversight in your merge strategy means you mix up unique identifiers for users, meaning you have just irreversibly broken your data relationships (you did have backups, right?)
  • When merging the same user, you override their current settings for their old settings. This could be something like a password or even worse email, making password recovery impossible if the email is no longer functional.
  • You don’t handle the post-migration gracefully and your users give up and go elsewhere
  • Your migration is working perfectly, but a freak outage stops your migration script in its tracks! Can you pick up where you left off?

Now, you might be thinking that merging two user database tables together into a single user table can be the mother of all technical problems and you would be correct!

How did you find yourself in such a tough situation? Well it could happen for a number of reasons, and you’ll have to decide for yourself whether this guide is going to help you in your situation.

This guide is based on PHP and MySQL but the overarching concepts should be similar enough with other RDBMS and programming languages.

Here are some examples of what you might be facing:

Diagram 1: Two users tables connected to a single application
Diagram 2: Users A connected to environments A and B, Users B connected to environment C
Diagram 3: The holy grail – unified user table

In my case, the situation was near enough to Diagram 2. We have a total of 6 environments, 5 of which are connected to Users A and a one lone wolf environment connected to Users B. Each environment has a database with an identical schema which houses environment specific data and sits above the lower level user database.

To make things more interesting, the Users B table was established from a copy of the Users A table; and then in a moment best described as a mental fart, (in my defence it was 2007, a time when I barely knew what I was doing) I allowed the two user tables to grow independently.

Because each environment runs the same application and is related in purpose, we cross-promote. Most users registered in both user tables. Some users registered in either Users A or Users B but not both. Approximately 45,000 of 70,000 user accounts are actually the same people. What a mess!

This guide makes the following assumptions, your mileage may vary:

  1. Users in both tables have a unique, auto-incrementing integer as a primary key
  2. Users have a secondary unique identifier which in this case is a username column
  3. The username must be unique in the final user table
  4. Some user accounts are the same account holder with the same ID
  5. Some user accounts are the same account holder with a different ID
  6. Some user accounts are different account holders but their ID or username clash
  7. Some user accounts are unique to Users B
  8. Users A and Users B are roughly equivalent in the number of rows

So how do we deal with this mess? Well i’ve thought about this on and off for a really long time and I finally bit the bullet and made this work with a strategy I believe works and causes the minimum disruption possible.

Step 1 – Idiot Checks

This is a very big undertaking and things will go wrong (trust me, I know, I first tried this 5 years ago and botched it completely).

Make sure you are in a strong position to rollback any changes made to your application (ie. version control), and that you can restore a DB backup in the event of a crisis.

Don’t allow yourself to be rushed on this task. Sit on it for a while if you have to and ponder what could go wrong.

Backup your data. Did I say that yet?

Step 2 – Test Environment

You should already have a fully functional development environment for your application, but depending on how accurately this reflects your production environment, you may want to take things a step further.

It is certainly not a silly idea to get a copy of your production database(s) which are probably much larger than your dev database, and do a dry run update (or two dozen) in your development environment.

Step 3 – Alter User Tables

For the remainder of this guide, i’m going to refer to the user tables as legacy.users and master.users. As you might infer, they belong to different schemas; legacy and master.

legacy.users is the table that you want to deprecate. You will be extracting users from this table and appending them into master.users one way or another.

master.users is the table you want to keep. It is the table that your application will use going forward, exclusively.

Altering legacy.users will allow you to track the migrated status of a user on the legacy table (in case you break the migration into several units of work) and track the new ID they were assigned.

On the master.users table you will be tracking any clashing IDs for a user (more on this later), and the migration status for the user.

Step 4 – Put Every User In a Category

We are going to group each user into 1 of 4 categories because each category requires different logic.

Category 1 (SAME) – Same User, Same ID, Same Username
For every user in this category, you owe them a beer! Same user, same ID means that the scary part, updating all instances of a the user ID in all your tables is redundant.

Considerations: Which accounts are they using most actively? Perhaps it is legacy.users meaning that master.users may contain some out of date settings. Proceed with caution.

Category 2 (SOFT CLASH) – Same User, Clashing ID, Same Username
This user must really love your different environments, because they signed up for both of them independently! Oh… that means in all likelihood they got a different ID (otherwise they would be in Category 1).

Considerations: How can you match users to figure out if they are the same person? Is the matching criteria foolproof? Users tend to supply slightly differentiating data even when entering the same thing, like a postal address. In my case I used username and email OR username and first name + surname combined.

Category 3 (HARD CLASH) – Different User, Clashing ID, Same Username
What are the chances of two independent users choosing the exact same username? As you might have concluded when you try to register your favourite username on a popular service like Reddit, pretty good actually!

Considerations: Is this really a different user or did they just do something trivial and fail your soft clash test? You’ll probably find that in many cases, they do actually look like the same person and you might want to loosen up your soft clash criteria – but be warned – allowing a user to effectively hijack another users account is as terrible as it sounds.

Category 4 (UNIQUE) – Unique User, Clashing ID
You want to buy this user a beer almost as much as you want to buy a Category 1 user a beer, because unique usernames are on the trivial end of the scale when it comes to merging.

Considerations: Does this user already have a master.users account and do they want to merge the two accounts now that you have unified the (user land) universe? If you don’t have one already, you should consider a permanent utility that can merge two user accounts in your application – provided of course that the user can authenticate as both users or the merge is managed by someone with sufficient administrative privileges.

Step 5 – Plan of Attack

Now that we have identified all our user groups, it’s time to man (or woman) up and start manipulating our data.

As I previously alluded, the Category 1 users are your best friends. You might have dozens of tables with foreign keys referencing the user table, and you don’t have to change a single one for Category 1 users! All you need to consider is the potential for some columns to have different data, particularly if one account has been dormant.

For every other category, there is potential for a number of very serious conflicts.

  • The master.users account matching a soft clash user has an ID that is already in use in our legacy environment, by someone else! We can’t borrow the ID from master.users without reassigning all the other users data to the clashing user! The inverse is also true; a legacy.users ID is already spoken for in master.users so it can’t be repurposed.
  • Given the conflict above ruling out using the ID from the master.users table, we need a new user row. But the username clashes as well, and as we previously stated, the username must be unique. This is particularly messy for our hard clash users, but we are going to append a number to the end of their username.

The key takeaway is that you are going to be adding new user rows – some act as placeholders, others are designed to be permanent. 

Task 1: Migrate all legacy users to master.users with an ID well above the current maximum ID.
For example, I used 100,000 when the previous highest ID was around 50,000. Theoretically you only need to increment from the maximum user ID on legacy.users or master.users, whichever is greater – but your brain will appreciate you if you do round up to a nice even number, especially in case something goes wrong.

Task 2: For soft clash users, Set the master.users.clash_id on both the original row and the new row.
Provide a user initiated “account migration facility”, which will allow them to unify their accounts once and for all next time they login.

Task 3: For hard clash users their username has to change and this is non-negotiable.
(Unless you allow them to initiate a merge by authenticating as the other user). They obviously need to be informed of this change.

Task 4: For all category 2-4 users, set the legacy.users.new_id so that you can keep a reference of where they ended up.
Knowing this allows you to rollback references to legacy.users.new_id to This is perhaps slightly preferable to a full database restoration, but you do need to also clean master.users of any migrated accounts if you want to start over.

The following sample code is written in PHP which should give you some ideas on how your migration script is going to look. Note that the database queries are pseudo code to ease the explanation.

You’ll want to consider things like database transactions, sanitation and escaping which are outside the scope of this guide.

* sorry, you might see this addthis social plugin towards the bottom of this code block, obviously it is not supposed to be there (its supposed to be at the bottom of the post) but I can’t seem to get rid of it!

Step 6 – Handball Back to the User

From a technical standpoint, you should have achieved what you set out to achieve; all users are now on the same table. Hooray! Open the floodgates for new user registrations!

But your users are still in trouble. They may be suffering from the following conditions:

  • You changed their password and now it is potentially different to what they expected
  • Their account data is divided across two accounts (well thats not really new but you know what I mean)
  • Their username is different to what it was before

Obviously you need to put a lot of time and thought into making this as seamless as possible for users.

This is where master.users.migrated comes into play. On login attempts you can look for migrated = 0 and offer a migration to that user – migrated = -1 is reserved for users who reject the migration offer, and migrated = 1 is for users who have run the migration already.

This flow chart might help with the visualisation.

Diagram 4: User merge flowchart on login


Wrapping Up

I hope this guide goes a long way to helping you merge your user tables together; I know that personally I thought about how to address this without making a meal of it for several years, on and off.

Please share your strategies and challenges in the comments section and good luck!

PayPal IPN, CodeIgniter and Encoding – a recipe for disaster!


Having experienced a couple of banging-my-head-on-the-desk months trying to integrate PayPal IPN and CodeIgniter so that it doesn’t do stupid things, I can share with you a few pitfalls that us mere mortals might skim over in the awful world of PayPal documentation. (Although thumbs up to PayPal technical support, they have put me on the right track more than once!).

The source of my problems are international users and the funny non-english characters they selfishly include in their names and addresses, or assume that having two lines in their address is ok – well i’m here to tell you that it is not ok because dammit, we are trying to write programs here and we are lazy!

When it comes to PayPal IPN:

In other words if you put one little character out of place in your IPN response, PayPal will take its ball and go home

When a New Line is not a New Line

CodeIgniter is going to take any posted new lines (like the second line in a postal address) and standardise them, and there is a few unappealing hacks nothing you can do about it.

// it's my way or the highway!
if (strpos($str, "\r") !== FALSE)
    $str = str_replace(array("\r\n", "\r", "\r\n\n"), PHP_EOL, $str);

See that? In system/core/Input.php of CodeIgniter, any new lines in your POST, GET or COOKIE get standardised to your PHP installation, which more than likely is going to be “\n”. Without going into too much detail (because I barely know what i’m talking about), different OS’s have historically had different ways to express new lines and carriage returns, and you need to send back to PayPal EXACTLY what they sent you.

While this may serve some purpose to CodeIgniter internals, it’s a massive problem for PayPal,  which sends a “\r\n” but gets “\n” back. PayPal has a simple, brutal response to describe this heresy – INVALID.

You have a couple of options:

a) Instead of using $_POST or $this->input->post(), use:

 $data = file_get_contents('php://input');

This gets the raw post before CodeIgniter or any other part of your application can mess with it.

b) Reverse what CodeIgniter did – slightly hackish, but if you know PayPal sent you “\r\n” then make sure it gets it back

str_replace("\n", "\r\n", $v);

Do you speak my l@nguage©?

More specifically, do you use UTF-8 character encoding EVERYWHERE?

See, a bunch of people have names and addresses that contain things like accented characters. One option might be to tell them to type in an incorrect variation of their address or name, replacing the weird characters with something that is on my standard US keyboard. Another option is to store it correctly, serve it correctly and handle it correctly.

This comes in a few parts:

1. Tell PayPal about UTF-8
Put this in your PayPal web payment standard form – this apparently takes higher precedence than the obscure “PayPal button language encoding” option on the PayPal website:

<input type="hidden" name="charset" value="utf-8">

2. Tell PHP about UTF-8
PHP has an option in php.ini called default_charset, but tragically it’s default value was not UTF-8 until PHP 5.6. You can set this value with:

ini_set('default_charset', 'UTF-8');

Or update php.ini default_charset value with “UTF-8″.

The best way to test this is something like:

$str = 'Pauzé';

echo $str;

If it doesn’t output the same, you are probably using a non UTF-8 charset in use.

3. Tell your database about UTF-8
Although not strictly related to IPN, there is little point accepting UTF-8 characters in PHP if you do not also store these characters correctly in the database. Unfortunately MySQL has a stupid weird implementation of UTF-8 that is not corrected until v5.5.

As to whether the omitted characters are likely to exist in names and addresses, time will tell (i’m guessing not), however to be on the safe side you should run MySQL 5.5+ where possible.