Semantia Home » How we Migrated a 100 Million Row Big Data Database into the Cloud

How we Migrated a 100 Million Row Big Data Database into the Cloud

Migrating big data into the Cloud

Migrating a very large database into the cloud is not as straightforward as initial documentation might lead you to believe.

Several factors need to be weighed together because multiple competing constraints will appear to be working against you every step of the way.

We were recently asked to modify a Microsoft SSIS migration package to provide support for an existing SQL database as the alternative target.

Ordinarily, this would be a straightforward process. Establish connections to the source and destination databases and set the package running.

In our particular case, there were a few nuances we had to deal with.

Firstly, we had a requirement that for each row inserted into the target database, we also needed to record the ID of the new record in the source database.

Furthermore, the source database was located on premise in Melbourne Australia, whereas the target database was located in a physically separate datacentre in Europe.

When the source and target databases are located in physically separate datacentres, completely new sets of challenges emerge.

The Challenges with Migrating a 100 Million Row Database into the Cloud

Upon first investigation, there’s an obvious solution.

Simply iterate through each of the rows of source data, insert it into the target database, get the new record’s ID and write it back to the original database.

7,000 records took more than an hour.

With more than 100 million rows of source data in one of the 200 tables, you could appreciate that row-by-row processing was not really an option.

Unfortunately this requires row-by-row processing via the “For Each” container where the remote ID is available via the SCOPE_IDENTITY() function.

At approximately, two rows per second though, the entire process would take more than 2 years.

It was too long for our needs.

When we investigated the major bottlenecks to this approach, we found that there was a huge amount of time spent opening and closing database and network connections.

So, it became logical that we needed an alternative approach that reduced the processing overheads by:

  • Minimising the constant opening/closing of database connections
  • Increasing the volume of data sent with a single connection request

Bulk Up on Database Inserts

To reduce “chattiness” during data transfer, the script was modified so that it was able to chunk the source data (using checkpoints) into manageable blocks of 1,000 rows and perform bulk inserts one block at a time.

The target table was also remotely joined to source table using the unique key(s) in the join criteria to return the remote row identities. This was done using the Linked Server with appropriate security credentials.

Reducing Chattiness during big data transfer to cloud

This approach saw a massive reduction in processing time with those same 7,000 rows completed in 60 seconds.

Not too bad. Based on the sample size of 7,000 rows, this approach would reduce years into days. Some further tweaking could improve this even further.

Job done! (Or so we thought.)

Firing of Triggers during Bulk Database Uploads

The creation of the new records in the target database should have further created corresponding records in a child table on the same target database.

This was fine while executing the record-by-record inserts however, during a bulk upload, things work a little differently it seems.

The creation of child table rows was dependent on the firing of insert triggers.

During the creation of the single records, everything worked fine however. During the bulk inserts however, the child table was always found to be empty.

After some research, we learned that the fast load option using bulk inserts do not fire triggers by default.

Database bulk inserts do not fire triggers by default

So, on the one hand, row-by-row inserts would achieve the desired result but it would take a couple of years to complete.

On the other hand, a bulk insert could completed in a little under ten days but we’d be missing a key piece of information – the child table.

Essentially, I had to re-consider the row-by-row option again as it worked end to end.

There had to be a way of of achieving the desired outcome of row-by-row processing times (with firing of insert triggers) but in a way the achieve the same levels of performance observed by bulk uploads.

Information on the topic online seems rare and gratitude goes out to Stack Overflow and ladenedge for providing a glimpse into some advanced SSIS editor features and the bulk insert option through FIRE_TRIGGERS.

Perfect!

The revised migration approach is fast, flexible and most importantly, delivers the desired result.