Data Migration (also known as an ETL pipeline) is the process of transferring data from System A to System B. Many companies have this need: they acquire another company, or they offer a migration tool as a strategic way to get people onto their platform.
This could be a blogging platform providing tools for people to move over from another service, an e-commerce website offering a migration tool to make switching easier, a consultancy helping a client move away from a legacy system, or you helping a friend recover a website they accidentally deleted on a VPS.
Data Migration can be decomposed into a few steps:
- Get all source data
- Migrate the data to your system
- Post-migrate
Below I share experiences and tips from three real migrations I have completed.
Let’s call the side that owns the source data the Provider. The provider could hand you the data in various forms. They may give you a dump of all the databases (yes, a single website could have a few MySQL databases and a Postgres database).
The provider could build an API for you to migrate data over time; once you make sure everything works, they can shut down the old site. Your provider might also have the data on EC2 for you to scp hundreds of gigabytes from. Or a provider who knows nothing about technology might deliver their hard disks to you by DHL.
The worst case I have heard of is the provider completely losing the data, so you have to write a crawler that scrapes everything from the provider's site or a web archive. Once you have the data, you can start to model it so you can use it.
Once you have figured out the data, get it onto your local machine. It could be a giant SQL file (GBs to TBs) that you probably want to add some delays to when you import it into your production database, because you don’t want to take production down. A simple approach that worked for me was to add a sleep statement after each INSERT INTO statement. You could edit a 50 GB file with an editor like Hex Fiend, or the following Ruby program may help you:
delay = 0.5 # seconds to sleep between INSERT statements
# input_file is the path to the original dump, result_file the path to the rewritten one
File.open(result_file, "w") do |output_file|
  File.open(input_file) do |file|
    # Process the dump 2000 lines at a time so the whole file never sits in memory
    file.each_slice(2000) do |lines|
      output_file.write(
        lines.join.gsub(
          ");\nINSERT INTO",
          ");\nDO SLEEP(#{delay});\nINSERT INTO", # DO SLEEP(...) is MySQL syntax
        )
      )
    end
  end
end
Then you can model the data. I did it with Rails, so you have another ApplicationRecord base class that connects to the other database by using establish_connection:
module DataMigration
  DATABASE_URL = ENV.fetch("MIGRATION_DATABASE_URL")

  class ApplicationRecord < ActiveRecord::Base
    # Abstract base class: subclasses map to provider tables but share this connection
    self.abstract_class = true

    establish_connection(DATABASE_URL)
  end
end
MIGRATION_DATABASE_URL may look like mysql2://root@localhost/data_migration.
The provider’s user table could be totally different from Rails conventions. Say the table name is user with a primary key called user_id; you can model it like this:
module DataMigration
  class User < ApplicationRecord
    self.primary_key = "user_id"
    self.table_name = "user"
  end
end
and the user table may have many recipes through the foreign key user_recipe_id:
module DataMigration
  class User < ApplicationRecord
    self.primary_key = "user_id"
    self.table_name = "user"

    has_many :recipes, foreign_key: "user_recipe_id"
  end
end
You can use Active Record Associations to model any kind of database relationship.
Sometimes the provider data may not be in a table. For example, ingredients could be stored as HTML in the recipes table's recipe_ingredients field. Then you write a parser that turns them into ActiveRecord objects you can work with.
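For illustration, here is a minimal sketch of such a parser, assuming the ingredients are stored as a simple HTML list and Nokogiri is available; the class and field names are hypothetical:

require "nokogiri"

module DataMigration
  class IngredientParser
    # Turns the provider's HTML blob into attribute hashes for your own models
    def self.call(html)
      Nokogiri::HTML.fragment(html.to_s).css("li").map do |node|
        { name: node.text.strip }
      end
    end
  end
end

# Usage during migration (Ingredient is your own model):
# DataMigration::IngredientParser.call(provider_recipe.recipe_ingredients).each do |attrs|
#   recipe.ingredients.create!(attrs)
# end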
It’s highly suggested that you also write tests for this migration code, because you could have understood the data relationships wrong, and the tests document what you actually have. Writing tests is also a great way to get things done.
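As a sketch, a test for the models above might just assert that the provider tables and associations resolve against a known snapshot of provider data; the setup here is an assumption:

require "test_helper"

class DataMigrationUserTest < ActiveSupport::TestCase
  test "models the provider's user table and recipes association" do
    assert_equal "user",    DataMigration::User.table_name
    assert_equal "user_id", DataMigration::User.primary_key

    # Assumes a small, known snapshot of provider data is loaded for the test run
    user = DataMigration::User.first
    assert_not_nil user
    assert_nothing_raised { user.recipes.to_a }
  end
end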
For example, the provider database could contain user accounts with duplicate or invalid emails. Asking them to do some data cleanup is better than writing code to handle it yourself.
Have a contingency plan for when things go badly wrong.
Here are some guidelines I have concluded for writing migration code:
Wrap data creation in a transaction, because if something unfortunately goes wrong, it will roll back. You could also implement a mechanism that retries failures.
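A minimal sketch of that guideline, assuming a provider_user object from the models above and your own User/Recipe models:

ActiveRecord::Base.transaction do
  # If any write below raises, the whole user import rolls back together
  user = User.create!(email: provider_user.email)

  provider_user.recipes.each do |provider_recipe|
    user.recipes.create!(title: provider_recipe.title)
  end
end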
The migration code should be an idempotent operation, that is, an operation that produces the same result when run twice. Migrating a lot of data means hitting a lot of edge failure cases, so design your program so it can be run multiple times. When inserting or updating data from the provider into your platform, you are looking for an upsert.
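For example, keying each record on an identifier from the provider keeps the operation safe to re-run; the provider_uid column here is an assumption:

# Running this twice for the same provider user updates the existing row
# instead of creating a duplicate.
user = User.find_or_initialize_by(provider_uid: provider_user.user_id)
user.update!(
  email: provider_user.email,
  name:  provider_user.name,
)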
You can pick a few notable users from the provider and check manually that your migration code works for them, or much better, write test cases covering all their data to make sure everything still works after the migration.
For example, suppose the passwords in the provider's users table are not hashed with a secure algorithm. First figure out which algorithm was used (let's call it legacy auth). When a migrated user signs in, use legacy auth to verify the password; if it matches, re-store the password with your safe and secure hashing algorithm. This can be illustrated by this example:
class LegacyAuthentication
  def self.run(user, password)
    new(user, password).run
  end

  def initialize(user, password)
    @user = user
    @password = password
  end

  def run
    if password_matched?
      # Re-hash with the secure algorithm (has_secure_password / BCrypt happens here)
      @user.update(password: @password)
    else
      false
    end
  end

  private

  # Insecure MD5 authenticator used by the legacy system
  def legacy_authenticator
    MD5::Authenticator
  end

  def password_matched?
    # Hypothetical legacy API: checks the password against the user's stored MD5 hash
    legacy_authenticator.authenticate(@user, @password)
  end
end
Use the methods that raise exceptions (save!, create!, update! instead of their non-bang counterparts), because we want to surface every possible error before the real migration happens. Fail fast, and fix every error you can before the real migration.
Catch expected errors with a retry mechanism in place. For example, when you are doing lots of write operations against the database, you will likely encounter deadlocks or lock waits on tables with foreign keys. You should catch these exceptions and retry after a given interval.
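A sketch of such a retry wrapper (the retryable error classes are from recent Rails versions; the limits and interval are arbitrary):

RETRYABLE_ERRORS = [ActiveRecord::Deadlocked, ActiveRecord::LockWaitTimeout].freeze

def with_retries(limit: 3, interval: 5)
  attempts = 0
  begin
    yield
  rescue *RETRYABLE_ERRORS => error
    attempts += 1
    raise error if attempts > limit

    sleep interval # give the competing transactions time to finish
    retry
  end
end

# with_retries { import_user!(provider_user) }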
Log every unexpected error, so that you can do data verification after the migration and evaluate whether the migration worked.
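A sketch of that logging; import_user! and the MigrationFailure model are hypothetical names, and a plain log line alone also works:

begin
  import_user!(provider_user)
rescue StandardError => error
  # Record enough context to verify or re-run this record after the migration
  Rails.logger.error(
    "[data_migration] provider_user=#{provider_user.user_id} #{error.class}: #{error.message}"
  )
  MigrationFailure.create!(
    provider_id: provider_user.user_id,
    error_class: error.class.name,
    message:     error.message,
  )
end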
Run your migration code as background jobs in a designated, priority-based queue. For example, I ran the migration jobs in a queue called data_migration with low priority.
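With ActiveJob, that could look roughly like the following sketch; the job, the importer object, and the batch size are illustrative:

class DataMigration::ImportUsersJob < ApplicationJob
  queue_as :data_migration # dedicated low-priority queue

  def perform(provider_user_ids)
    provider_user_ids.each do |id|
      provider_user = DataMigration::User.find(id)
      UserImporter.run(provider_user) # hypothetical importer built from the guidelines above
    end
  end
end

# Enqueue in batches of, say, 1000 provider users per job:
# DataMigration::User.in_batches(of: 1000) do |batch|
#   DataMigration::ImportUsersJob.perform_later(batch.ids)
# end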
You should have a dashboard showing how many jobs failed and what kinds of exceptions were raised. The dashboard should also show you the status of the current migration and the CPU usage of your servers and database (keep every CPU core at around 75% or below), because you do not want to drain your servers and database and hurt the experience of your existing users.
Ask the provider if they could fix the data. Otherwise you have to do the hard work yourself.
Do a dry run against all the data to be migrated before the real migration.
Of course we want the migration to finish as fast and as early as possible. But make sure you are already 100% sure the migration code is correct and working before you optimize its performance. The things you can do are the standard ones:
- Optimize the SQL queries that select data from the provider database
- Preload associations
- Minimize the scope of your database transactions
- Skip unnecessary callbacks
- Turn off touch updates (no_touching)
- Process 500, 1000, or 2000 records in one background job
- Cache data in memory or in Redis
- Do important work first. For example, run the migration code for users who signed in recently.
- The migration is IO bound for your database. You could also scale up your database to allow more writes per second if possible. Many workers writing to your database may be worse than a few workers if your database can only handle so many write operations.
- Bulk insert and upsert (activerecord-import); see the sketch after this list
- You could also transform the source data into CSV and then use the incredibly fast LOAD DATA / COPY commands
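As a sketch of the bulk insert idea, Rails 6's insert_all (or the activerecord-import gem) writes many rows in a single statement; the column names here are assumptions:

rows = provider_recipes.map do |provider_recipe|
  {
    user_id:    user.id,
    title:      provider_recipe.title,
    created_at: provider_recipe.created_at,
    updated_at: Time.current,
  }
end

# One multi-row INSERT instead of one INSERT per recipe.
# insert_all skips validations and callbacks, which is usually fine for migration data.
Recipe.insert_all(rows)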
99% of the existing assets will not change in the 10 days before the migration. You can write scripts beforehand to upload all the photos, images, and assets to your CDN, with a way to identify them during the migration (for example, a hash of the photo URL from the provider). Then, when you migrate a user from the provider into your system, your migration code can check: do we already have this image on our CDN? If yes, use it; if not, upload it.
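A small sketch of that check, where the asset key is a hash of the provider's photo URL and AssetStore is a hypothetical wrapper around your CDN or object storage:

require "digest"
require "open-uri"

def ensure_asset!(provider_photo_url)
  key = Digest::SHA256.hexdigest(provider_photo_url)

  # Reuse the asset that was uploaded ahead of the migration; only fetch it when missing
  return key if AssetStore.exist?(key)

  AssetStore.upload(key, URI.open(provider_photo_url).read)
  key
end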
And remember, after every change to your migration code you have to run it again to make sure it still works, so the aforementioned tests are also important for verifying its correctness.
Another way to optimize performance is to only do the work when it is needed. For example, a user may have 500 bookmarks and 200 recipes. The recipes and the user itself are important to your website, but the bookmarks migration might cover a few billion records in total. You could instead migrate a user's 500 bookmarks when that user actually signs in, with a job in a high-priority queue.
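A sketch of that on-demand step; the bookmark models, provider_uid, and the bookmarks_migrated flag are assumptions:

class MigrateBookmarksJob < ApplicationJob
  queue_as :high_priority # jumps ahead of the bulk data_migration queue

  def perform(user)
    return if user.bookmarks_migrated?

    DataMigration::Bookmark.where(user_id: user.provider_uid).find_each do |provider_bookmark|
      user.bookmarks.find_or_create_by!(url: provider_bookmark.url)
    end

    user.update!(bookmarks_migrated: true)
  end
end

# Enqueue it from your sign-in handler:
# MigrateBookmarksJob.perform_later(current_user)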
You need an entry point to kick off the migration: a rake task, a web interface, or a place for your user to upload a data dump. Kick off the migration, watch your metrics dashboard, and wait.
After migrating the data to your system, you will need to handle traffic from the old site. Generate all sorts of URLs and set up redirect rules that send traffic from the old site to your platform.
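For instance, a redirect rule in Rails routes, assuming the old site's recipe URLs looked like /recipe/<old_id> and you keep a lookup table from old IDs to new paths (LegacyRedirect is an assumption):

# config/routes.rb
get "/recipe/:old_id", to: redirect { |params, _request|
  # Map the provider's old id to the new path; fall back to the home page
  LegacyRedirect.find_by(old_id: params[:old_id])&.new_path || "/"
}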
Rails provides sharp tools that helped me complete three migrations. Use small objects to keep the code easy to understand and maintain. Being on schedule is much better than having the fastest code, and data integrity is even more important than the schedule. Keep it simple!