Delete Absent Rows/Update Existing Rows/Insert New Rows

A farily common scenario:  You maintain a table of data, we will say pricing data for a hosted online shopping cart.  You offer the ability for sellers to upload pricing information (a CSV/XML file) on a daily basis.  You want to allow the seller to 1) Remove prices for items not present in the file 2) Update prices for items already in the data table 3) Add new records that are not already in the pricing table.

Pretty straight forward but I have seen some poor solutions like:

  • Reading each line of the file using the programming language of choice and issuing 2 SQL commands per line (exists check + delete/update/insert) (very very bad)
  • Loading the entire file into a temp table and using a cursor to issue 2 SQL commands per line (very bad)

I think the problem stems from the fact that most developers don't know you can issue DELETE, INSERT, and UPDATE statements that include joins and exists checks.

For starters, I like to have a staging table with no constraints and usually all nvarchar columns (even for numeric/int columns).  This allows me to pound the data file into the staging table using the BULK INSERT capability of SQL Server.  There's no faster way to move a lot of data into the db.

Next you can run a series of 'validation' queries to ensure that required fields are present and to validate data types.  I will usually copy invalid rows to an 'error' table, delete them from the staging table, and then pull them from the error table for a logging purposes.

Use the scripts below to setup our little demonstration on your SQL Server:

SET NOCOUNT ON
GO
CREATE TABLE Item
(
  Id BIGINT IDENTITY PRIMARY KEY,
  Number NVARCHAR(256) NOT NULL,
  Price NUMERIC(9,2) NOT NULL,
  Description NVARCHAR(256) NOT NULL
)
GO
 
CREATE TABLE Item_Staging
(
  Number NVARCHAR(256),
  Price NVARCHAR(256),
  Description NVARCHAR(256)
)
GO
 
SET IDENTITY_INSERT Item ON
GO
INSERT INTO Item(Id, Number, Price, Description)
VALUES( 1, '00001', 1.00, 'Part One' )
INSERT INTO Item(Id, Number, Price, Description)
VALUES( 2, '00002', 2.00, 'Part Two' )
INSERT INTO Item(Id, Number, Price, Description)
VALUES( 3, '00003', 3.00, 'Part Three' )
GO
SET IDENTITY_INSERT Item OFF
GO
 
INSERT INTO Item_Staging(Number, Price, Description)
VALUES( '00001', 1.10, 'Part One' )
INSERT INTO Item_Staging(Number, Price, Description)
VALUES( '00002', 2.20, 'Part Two' )
INSERT INTO Item_Staging(Number, Price, Description)
VALUES( '00004', 4.40, 'Part Four' )
GO
SET NOCOUNT OFF
GO

This will create two tables, Item (our production table) and Item_Staging (...you get the point).  It will also add some test data.  For our purposes, assume we have already ran our validation routines on Item_Staging to remove any bad data. 

Now we are ready to delete records from the Item table that are not present in the Item_Staging table:

DELETE Item
FROM Item LEFT JOIN Item_Staging ON Item.Number = Item_Staging.Number
WHERE Item_Staging.Number IS NULL

Pretty simple?  We use the Number column to join the two tables and delete records based on the abscense of a Number in Item_Staging.

Next, let's update prices for all products in the Item table that also appear in Item_Staging (you could update any/all fields, we are just updating price for simplicity here)

UPDATE Item SET
  Price=Item_Staging.Price
FROM Item INNER JOIN Item_Staging ON Item.Number = Item_Staging.Number

 This time we can use an INNER JOIN to connect all matching records.

Finally, we add any records in Item_Staging that are not in Item:

INSERT INTO Item(Number, Price, Description)
SELECT S.Number, S.Price, S.Description
FROM Item_Staging S LEFT JOIN Item I ON S.Number = I.Number
WHERE I.Number IS NULL

 Similar to the DELETE, we use a LEFT JOIN this time looking for recrods not present in the Item table.

Some additional pointers for you:

  • I recommend dropping and re-creating any indexes on the production table prior to running the import.  You should get a performance boost since the DBMS will only have to update indexes once after the processes have completed
  • Wrap the process in a transaction (always recommended when performing more than one operation at a time)

The full script below should:

  • DELETE Item 00003
  • UPDATE prices for Items 00001 and 00002 to 1.1 and 2.2 respectively
  • INSERT Item 00004
  • DELETE all data from Item_Staging
SET NOCOUNT ON
GO
BEGIN TRAN
 
DELETE Item
FROM Item LEFT JOIN Item_Staging ON Item.Number = Item_Staging.Number
WHERE Item_Staging.Number IS NULL
 
UPDATE Item SET
  Price=Item_Staging.Price
FROM Item INNER JOIN Item_Staging ON Item.Number = Item_Staging.Number
 
INSERT INTO Item(Number, Price, Description)
SELECT S.Number, S.Price, S.Description
FROM Item_Staging S LEFT JOIN Item I ON S.Number = I.Number
WHERE I.Number IS NULL
 
TRUNCATE TABLE Item_Staging
 
COMMIT TRAN
GO
SET NOCOUNT OFF
GO

Please feel free to share your bulk import and processing stories via the feedback section below.

posted @ Friday, December 22, 2006 7:11 PM


Print

Comments on this entry:

# re: Delete Absent Rows/Update Existing Rows/Insert New Rows

Left by Willie Tilton at 6/20/2007 10:23 PM
Gravatar

My implementation:<br /><br />/*<br />Source Table &lt;maps to&gt; Destination Table<br />TtblStgFocusEnrollment | tblPatient<br />IPMS Generated | PatientID<br />IVRSiteNumber | SiteID<br />IVRPatientNumber | PatientCode<br />IVRSubjectInitials | MI<br />IVRSCreenNumber | PatExternalID<br />EmbedRandNo | RandNumber<br />IVRSubjectDOB | DOB<br />*/<br />USE CORE<br /><br />DECLARE @StudyID INT<br />SET @StudyID = 11<br /><br />SELECT COUNT(*) CountBefore FROM tblPatient<br />WHERE SiteID IN (SELECT SiteID FROM tblSite WHERE StudyID = @StudyID)<br /><br />SET NOCOUNT ON<br />GO<br /><br />BEGIN TRAN<br /><br />/* Patients that were imported that did not exist in IPMS (New Records) */<br />SELECT <br /> (SELECT SiteID FROM tblSite WHERE s.IVRSiteNumber = tblSite.SiteNo AND tblSite.StudyID = @StudyID) AS SiteID, <br /> s.IVRSiteNumber AS SiteNo,<br /> IVRPatientNumber AS PatientNo,<br /> IVRSubjectInitials AS Initials,<br /> IVRScreenNumber AS PatientExtID,<br /> IVRSubjectDOB AS DOB,<br /> EmbedRandNo AS RandNo<br />FROM TtblStgFocusEnrollment s <br /> LEFT JOIN tblPatient p <br /> ON s.EmbedRandNo = p.RandNumber<br />WHERE p.RandNumber IS NULL<br /><br />/* Records that will be updated */<br />SELECT <br /> (SELECT SiteID FROM tblSite WHERE s.IVRSiteNumber = tblSite.SiteNo AND tblSite.StudyID = @StudyID) AS SiteID, <br /> s.IVRSiteNumber AS SiteNo,<br /> IVRPatientNumber AS PatientNo,<br /> IVRSubjectInitials AS Initials,<br /> IVRScreenNumber AS PatientExtID,<br /> IVRSubjectDOB AS DOB,<br /> EmbedRandNo AS RandNo<br />FROM tblPatient p<br /> INNER JOIN TtblStgFocusEnrollment s<br /> ON p.RandNumber = s.EmbedRandNo<br /><br />/*<br />IVRSiteNumber | SiteID<br />IVRPatientNumber | PatientCode<br />IVRSubjectInitials | MI<br />IVRScreenNumber | PatExternalID<br />EmbedRandNo | RandNumber<br />IVRSubjectDOB | DOB<br />*/<br /><br />-- Update our existing records<br />UPDATE tblPatient SET<br /> SiteID = (SELECT SiteID FROM tblSite WHERE s.IVRSiteNumber = tblSite.SiteNo AND tblSite.StudyID = @StudyID),<br /> PatientCode = IVRPatientNumber,<br /> MI = IVRSubjectInitials,<br /> PatExternalID = IVRScreenNumber,<br /> DOB = IVRSubjectDOB<br />FROM tblPatient p<br /> INNER JOIN TtblStgFocusEnrollment s<br /> ON p.RandNumber = s.EmbedRandNo<br /><br />-- Insert new records<br />INSERT INTO tblPatient(SiteID, PatientCode, MI, PatExternalID, DOB, RandNumber)<br />SELECT <br /> SiteID = (SELECT SiteID FROM tblSite WHERE s.IVRSiteNumber = tblSite.SiteNo AND tblSite.StudyID = @StudyID), <br /> IVRPatientNumber,<br /> IVRSubjectInitials,<br /> IVRScreenNumber,<br /> IVRSubjectDOB,<br /> EmbedRandNo<br />FROM TtblStgFocusEnrollment s <br /> LEFT JOIN tblPatient p <br /> ON p.RandNumber = s.EmbedRandNo<br />WHERE p.RandNumber IS NULL<br /> <br />TRUNCATE TABLE TtblStgFocusEnrollment<br /><br />-- Add all PatientVisit record placeholders to newly added patients<br /><br />CREATE TABLE #Visits<br />(<br /> VisitId INT,<br /> VisitOrder INT<br />)<br /><br />INSERT INTO #Visits (VisitID, VisitOrder)<br />SELECT VisitID, VisitOrder FROM tblVisit<br />WHERE StudyID = @StudyID<br /><br />INSERT INTO tblPatientVisit (PatientID, VisitID, VisitFailed, PatientVisitType, VisitOrder)<br />SELECT p.PatientID, v.VisitID, 0 AS VisitFailed, 'Standard' AS PatientVisitType, v.VisitOrder FROM tblPatient p <br /> LEFT JOIN tblPatientVisit pv<br /> ON p.PatientID = pv.PatientID<br /> INNER JOIN tblSite s<br /> ON p.SiteID = s.SiteID<br /> CROSS JOIN #Visits v<br />WHERE StudyID = @StudyID<br />AND pv.PatientID IS NULL<br /><br />DROP TABLE #Visits<br /><br />-- Update existing PatientVisits with new patient information<br /><br />-- TODO...<br /><br />IF @@error &lt;&gt; 0<br />ROLLBACK TRAN<br /> <br />COMMIT TRAN<br /><br />GO<br />SET NOCOUNT OFF<br />GO<br /><br />SELECT COUNT(*) CountAfter FROM tblPatient<br />WHERE SiteID IN (SELECT SiteID FROM tblSite WHERE StudyID = @StudyID)<br />

Your comment:



 (will not be displayed)


 
 
 
Please add 3 and 6 and type the answer here:
 

Live Comment Preview:

 
«August»
SunMonTueWedThuFriSat
272829303112
3456789
10111213141516
17181920212223
24252627282930
31123456