Detecting Duplicate Rows                                                               Administration Tips




Detecting Duplicate Rows

In order to determine which rows duplicate each other in a table, you must first be
prepared to state a definition of "duplicate". For example, the rows "1, FRED, DEPT_10"
and '2,FRED,DEPT_20' sort of duplicate each other on the NAME column -though chances
are, this is not a real duplication, but merely a case of having too many Freds working in
different departments.

But if we saw these two rows: '1,FRED, DEPT_10', and '1,FRED,DEPT_20', we begin to
wonder if there is duplication, since two different physical Freds should not end up being
allocated the same Employee Number.

So, the first thing is to determine which COLUMNS in your table define uniqueness. In my
case, it's Employee_Number, but I don't care about duplications on something like 'NAME'.

Having done that, the quickest way to find duplicates is to stick a UNIQUE constraint on
the relevant columns, and attempt to enable that constraint.

First, the bare constraint definition:

ALTER TABLE BLAH ADD             (CONSTRAINT   EMPID_UQ UNIQUE DISABLED NOVALIDATE);


That adds the constraint -but it's in its 'disabled' or "switched off" state (it has to be -there
may already be duplicate records whose presence would have prevented the constraint
being defined in the first place if it hadn't been).

Now we create a special table to house the errors that arise if we attempt to enforce the
constraint (i.e., switch it fully on). Oracle supplies a script to create that table (it's called
utlexcpt.sql, and it's found in the ORACLE_HOME/rdbms/admin directory). But the script
is easy enough to replicate yourself:

CREATE TABLE EXCEPTIONS(ROW_ID ROWID,
OWNER VARCHAR2(30),
TABLE_NAME VARCHAR2(30),
CONSTRAINT VARCHAR2(30));


You'll note that this table happens to be called "exceptions". It could be called anything
we like (such as 'unique_errs'), but "exceptions" is Oracle's choice of name, so I guess we
shall live with it.

The next thing to do is try and enable the unique constraint, but specifying that any errors
arising from doing so should be written into this new exceptions table we've just created.



Copyright © Howard Rogers 2001                       10/17/2001                                 Page 1 of 2
Detecting Duplicate Rows                                                          Administration Tips


Try this:

ALTER TABLE BLAH ENABLE CONSTRAINT EMPID_UQ EXCEPTIONS INTO EXCEPTIONS;


The enabling of the constraint failed, but the trick is that the exceptions table now
contains details of what rows exist that are duplicates, and thus prevent the constraint
being fully enabled. All we need to do now is select * from exceptions, and we shall be
able to see the offending records:

ROW_ID              OWNER                         TABLE_NAME
------------------ ------------------------------ -----------------
AAAH2NAABAAAPCMAAA SCOTT                               BLAH
AAAH2NAABAAAPCMAAB SCOTT                               BLAH

...at which point, you'll realise that unless you are fluent in Base-64, you haven't got a
hope in Hell of ever interpreting those ROWIDs into something you can actually work with!

So now comes the clever bit: we use those ROWIDs as a hook by which to extract the data
out of the main table, in a format we can recognise and work with. That involves a
relatively simple bit of sub-select'ing, like this:

SELECT    *   FROM BLAH WHERE ROWID IN   (SELECT   ROW_ID FROM EXCEPTIONS);


And now we get a report that looks like this:

    EMPNO                  NAME
----------                 ----------
       1                   FRED
       1                   FREDERIKA

...and it becomes obvious where the problem is.

There's no automatic fix here. All we've done is use the EXCEPTIONS table to fish out a
reference to the offending rows. How they stop being offenders is up to you: you need to
do manual DML to knock the rows into shape. In this case, a simple SET EMPNO=2 WHERE
NAME='FREDERIKA'; will do the trick -but in the real world, you need to work out the
correct updates to perform (and investigate why the problem ever arose in the first place).

When the issue has been resolved, you simply need to truncate the exceptions table
(unless you feel you'll never need it again, in which case feel free to drop it. It's easy
enough to re-create, after all).




Copyright © Howard Rogers 2001               10/17/2001                                    Page 2 of 2

Duplicaterows

  • 1.
    Detecting Duplicate Rows Administration Tips Detecting Duplicate Rows In order to determine which rows duplicate each other in a table, you must first be prepared to state a definition of "duplicate". For example, the rows "1, FRED, DEPT_10" and '2,FRED,DEPT_20' sort of duplicate each other on the NAME column -though chances are, this is not a real duplication, but merely a case of having too many Freds working in different departments. But if we saw these two rows: '1,FRED, DEPT_10', and '1,FRED,DEPT_20', we begin to wonder if there is duplication, since two different physical Freds should not end up being allocated the same Employee Number. So, the first thing is to determine which COLUMNS in your table define uniqueness. In my case, it's Employee_Number, but I don't care about duplications on something like 'NAME'. Having done that, the quickest way to find duplicates is to stick a UNIQUE constraint on the relevant columns, and attempt to enable that constraint. First, the bare constraint definition: ALTER TABLE BLAH ADD (CONSTRAINT EMPID_UQ UNIQUE DISABLED NOVALIDATE); That adds the constraint -but it's in its 'disabled' or "switched off" state (it has to be -there may already be duplicate records whose presence would have prevented the constraint being defined in the first place if it hadn't been). Now we create a special table to house the errors that arise if we attempt to enforce the constraint (i.e., switch it fully on). Oracle supplies a script to create that table (it's called utlexcpt.sql, and it's found in the ORACLE_HOME/rdbms/admin directory). But the script is easy enough to replicate yourself: CREATE TABLE EXCEPTIONS(ROW_ID ROWID, OWNER VARCHAR2(30), TABLE_NAME VARCHAR2(30), CONSTRAINT VARCHAR2(30)); You'll note that this table happens to be called "exceptions". It could be called anything we like (such as 'unique_errs'), but "exceptions" is Oracle's choice of name, so I guess we shall live with it. The next thing to do is try and enable the unique constraint, but specifying that any errors arising from doing so should be written into this new exceptions table we've just created. Copyright © Howard Rogers 2001 10/17/2001 Page 1 of 2
  • 2.
    Detecting Duplicate Rows Administration Tips Try this: ALTER TABLE BLAH ENABLE CONSTRAINT EMPID_UQ EXCEPTIONS INTO EXCEPTIONS; The enabling of the constraint failed, but the trick is that the exceptions table now contains details of what rows exist that are duplicates, and thus prevent the constraint being fully enabled. All we need to do now is select * from exceptions, and we shall be able to see the offending records: ROW_ID OWNER TABLE_NAME ------------------ ------------------------------ ----------------- AAAH2NAABAAAPCMAAA SCOTT BLAH AAAH2NAABAAAPCMAAB SCOTT BLAH ...at which point, you'll realise that unless you are fluent in Base-64, you haven't got a hope in Hell of ever interpreting those ROWIDs into something you can actually work with! So now comes the clever bit: we use those ROWIDs as a hook by which to extract the data out of the main table, in a format we can recognise and work with. That involves a relatively simple bit of sub-select'ing, like this: SELECT * FROM BLAH WHERE ROWID IN (SELECT ROW_ID FROM EXCEPTIONS); And now we get a report that looks like this: EMPNO NAME ---------- ---------- 1 FRED 1 FREDERIKA ...and it becomes obvious where the problem is. There's no automatic fix here. All we've done is use the EXCEPTIONS table to fish out a reference to the offending rows. How they stop being offenders is up to you: you need to do manual DML to knock the rows into shape. In this case, a simple SET EMPNO=2 WHERE NAME='FREDERIKA'; will do the trick -but in the real world, you need to work out the correct updates to perform (and investigate why the problem ever arose in the first place). When the issue has been resolved, you simply need to truncate the exceptions table (unless you feel you'll never need it again, in which case feel free to drop it. It's easy enough to re-create, after all). Copyright © Howard Rogers 2001 10/17/2001 Page 2 of 2