I have been working recently with a client who is implementing a new ERP system based on SQL 2008, and converting data from a bunch of old Access, SQL 2000, and SQL 2005 databases.
Part of the conversion effort involves validation processes, which I've implemented as stored procedures in a SQL 2008 database. The idea of the validation is to compare the data in the old systems with the converted data in the new ERP system and flag any differences. The tables I created contain the old and new keys, plus the old and new attributes we want to compare. The stored procedures all perform the following general logic:
- retrieve converted data from the ERP system in SQL 2008 and insert it into a table specific to that kind of data (customer, invoice, inventory, etc.)
- update each record with the corresponding attributes in the old system where the keys "match" (or are comparable)
- insert any source records which didn't get converted for some reason
A report then displays all records where there is either:
- a source record with no converted record or
- a converted record with no source record or
- unequal attributes.
The business users (and auditors) then use these reports to validate the conversion effort.
The specific comparison tables and the stored procedures are in a dedicated SQL 2008 database on "server A" (for the sake of argument). As you can imagine, I need to pull data in from the source database(s) and the new ERP system database, which reside on different servers. In particular, the first update query above retrieves data from the destination database and inserts it into the local validation database. The second and third queries join the table in the local validation database to the original source database.
Initially, I wrote the stored procs to take advantage of SQL Server's "linked server" feature. Using a 4-part name ([server].[database].[owner].[name]) seemed attractive at first.
However, we quickly ran into 2 problems:
- The servers containing the source databases were located in a different data centre, and the network bandwidth between them was absolutely minuscule
- The way linked servers work is to issue an OpenRowset() command, which basically brings back the entire table (all rows and columns) to the local server and then processing it locally
Both problems were showstoppers. You could write SQL statements which joined tables in one database to tables in a database on the linked server, but they didn't perform.
We then thought "let's copy the tables locally". So we wrote very simple "INSERT INTO x SELECT * FROM y" SQL statements. These took almost as long to run.
We eventually solved the problem using SQL Server Integration Services (SSIS) packages. Actually, we used the SQL Server "Import / Export Data" feature, which basically builds an SSIS package and executes it on the fly. Doing that allows you to use the native drivers and connection providers for each database system, pulls the data out efficiently into memory, and then uses the destination driver and connection provider to load the data quickly into the destination. Using this approach, we reduced the data copy time from 15-20 minutes per table to about 30 seconds.
Interestingly, I got involved with another client who was using linked servers, but between SQL Server 2000, SQL Server 2005, and two flavours of Oracle and trying to do the same thing. Joins between the tables had been optimized in the past, and worked reasonably well, but a small driver update suddenly killed the performance in production, and they were left struggling to try to get queries to execute in a reasonable time. Unfortunately, there wasn't much we could do to save them — if you go against all best practices that both Microsoft and Oracle publish, you can't really complain when it comes back to bite you!