A farily common scenario: You maintain a table of data, we will say pricing data for a hosted online shopping cart. You offer the ability for sellers to upload pricing information (a CSV/XML file) on a daily basis. You want to allow the seller to 1) Remove prices for items not present in the file 2) Update prices for items already in the data table 3) Add new records that are not already in the pricing table.
Pretty straight forward but I have seen some poor solutions like:
- Reading each line of the file using the programming language of choice and issuing 2 SQL commands per line (exists check + delete/update/insert) (very very bad)
- Loading the entire file into a temp table and using a cursor to issue 2 SQL commands per line (very bad)
I think the problem stems from the fact that most developers don't know you can issue DELETE, INSERT, and UPDATE statements that include joins and exists checks.
For starters, I like to have a staging table with no constraints and usually all nvarchar columns (even for numeric/int columns). This allows me to pound the data file into the staging table using the BULK INSERT capability of SQL Server. There's no faster way to move a lot of data into the db.
Next you can run a series of 'validation' queries to ensure that required fields are present and to validate data types. I will usually copy invalid rows to an 'error' table, delete them from the staging table, and then pull them from the error table for a logging purposes.
Use the scripts below to setup our little demonstration on your SQL Server:
SET NOCOUNT ON
GO
CREATE TABLE Item
(
Id BIGINT IDENTITY PRIMARY KEY,
Number NVARCHAR(256) NOT NULL,
Price NUMERIC(9,2) NOT NULL,
Description NVARCHAR(256) NOT NULL
)
GO
CREATE TABLE Item_Staging
(
Number NVARCHAR(256),
Price NVARCHAR(256),
Description NVARCHAR(256)
)
GO
SET IDENTITY_INSERT Item ON
GO
INSERT INTO Item(Id, Number, Price, Description)
VALUES( 1, '00001', 1.00, 'Part One' )
INSERT INTO Item(Id, Number, Price, Description)
VALUES( 2, '00002', 2.00, 'Part Two' )
INSERT INTO Item(Id, Number, Price, Description)
VALUES( 3, '00003', 3.00, 'Part Three' )
GO
SET IDENTITY_INSERT Item OFF
GO
INSERT INTO Item_Staging(Number, Price, Description)
VALUES( '00001', 1.10, 'Part One' )
INSERT INTO Item_Staging(Number, Price, Description)
VALUES( '00002', 2.20, 'Part Two' )
INSERT INTO Item_Staging(Number, Price, Description)
VALUES( '00004', 4.40, 'Part Four' )
GO
SET NOCOUNT OFF
GO
This will create two tables, Item (our production table) and Item_Staging (...you get the point). It will also add some test data. For our purposes, assume we have already ran our validation routines on Item_Staging to remove any bad data.
Now we are ready to delete records from the Item table that are not present in the Item_Staging table:
DELETE Item
FROM Item LEFT JOIN Item_Staging ON Item.Number = Item_Staging.Number
WHERE Item_Staging.Number IS NULL
Pretty simple? We use the Number column to join the two tables and delete records based on the abscense of a Number in Item_Staging.
Next, let's update prices for all products in the Item table that also appear in Item_Staging (you could update any/all fields, we are just updating price for simplicity here)
UPDATE Item SET
Price=Item_Staging.Price
FROM Item INNER JOIN Item_Staging ON Item.Number = Item_Staging.Number
This time we can use an INNER JOIN to connect all matching records.
Finally, we add any records in Item_Staging that are not in Item:
INSERT INTO Item(Number, Price, Description)
SELECT S.Number, S.Price, S.Description
FROM Item_Staging S LEFT JOIN Item I ON S.Number = I.Number
WHERE I.Number IS NULL
Similar to the DELETE, we use a LEFT JOIN this time looking for recrods not present in the Item table.
Some additional pointers for you:
- I recommend dropping and re-creating any indexes on the production table prior to running the import. You should get a performance boost since the DBMS will only have to update indexes once after the processes have completed
- Wrap the process in a transaction (always recommended when performing more than one operation at a time)
The full script below should:
- DELETE Item 00003
- UPDATE prices for Items 00001 and 00002 to 1.1 and 2.2 respectively
- INSERT Item 00004
- DELETE all data from Item_Staging
SET NOCOUNT ON
GO
BEGIN TRAN
DELETE Item
FROM Item LEFT JOIN Item_Staging ON Item.Number = Item_Staging.Number
WHERE Item_Staging.Number IS NULL
UPDATE Item SET
Price=Item_Staging.Price
FROM Item INNER JOIN Item_Staging ON Item.Number = Item_Staging.Number
INSERT INTO Item(Number, Price, Description)
SELECT S.Number, S.Price, S.Description
FROM Item_Staging S LEFT JOIN Item I ON S.Number = I.Number
WHERE I.Number IS NULL
TRUNCATE TABLE Item_Staging
COMMIT TRAN
GO
SET NOCOUNT OFF
GO
Please feel free to share your bulk import and processing stories via the feedback section below.
posted @ Friday, December 22, 2006 7:11 PM