Sharing names/address cleaning patterns for Patstat:
a metadata structure proposal




By Gianluca Tarasconi
www.rawpatentdata.blogspot.com
From chaos to order...

The main milestones of cleaning and standardizing PATSTAT
   persons (inventors and applicants), starting from the
   TLS206 table, can be summarized as follows:
   • RE-PARSING / RESTRUCTURING
   • CLEANING
   • STANDARDIZATION
   • DEDUPLICATION
Because the process is strictly sequential, the results of the last steps (address
standardization and deduplication) depend greatly on the quality of the first two
steps.

... and back to chaos...
• Different teams specialize in 'local' addresses [country-wise data
  cleaning]

• Standards (i.e. the order of toponym, street name, number) differ
  from country to country

• Enrichments / links to other data may need a special data structure


Ultimately, data parsing and cleaning will
produce very different results across different
work teams.

PERSON_ID 1430436: how would you clean this
(if you are not Russian?)

Name string as stored:
    ARMYANSKOE SPETSIALIZIROVANNOE
    PROEKTNO-IZYSKATELSKOE, HAÔÚHO-
    ôCCóE¯OBATEóÓCKOE ô KOHCTPÔKTOPCKOE
    OT¯EóEHôE BCECO½³HO×O
    ×OCÔ¯APCTBEHHO×O üPOEKTHO-
    ô³ÕCKATEóÓCKO×O ô HAÔÚHO-
    ôCCóE¯OBATEóÓCKO×O ôHCTôTÔTA
    ÜHEP×ETôÚECKôX CôCTEM ô ÜóEKTPôÚECKôX
    CETEö "ÜHEP×OCETÓüPOEKT"

Transliteration:
    ARMYANSKOE SPETSIALIZIROVANNOE
    PROEKTNO-IZYSKATELSKOE, NAUCHNO-
    ISSLEDOVATELSKOE I KONSTRUKTORSKOE
    OTDELENIE VSESOYUZNOGO
    GOSUDARSTVENNOGO PROEKTNO-
    IZYSKATELSKOGO I NAUCHNO-
    ISSLEDOVATELSKOGO INSTITUTA
    ENERGETICHESKIKH SYSTEM I ELEKTRICHESKIKH
    SETEI "ENERGOSVYAZPROEKT"

Metadata structure proposal (I)



 Data origin (206, 206ascii…)
    -> Re-parsing      -> Parsed data structure
    -> Cleaning        -> Parsed & clean data
    -> Standardization -> Standard data
    -> Deduplication   -> Standard & disambiguated data




 We expect that at some point the data coming from PATSTAT will be parsed into an
 intermediate data structure, in which the original strings are split into several
 fields according to the meaning of the information they contain. Right after that, the
 cleaning phase removes the noise, allowing other tools (e.g. a Google lookup) to
 standardize the tuples.

Metadata structure proposal (II)

   LAST_NAME               Surname / company name
   FIRST_NAME              First name (blank for companies)
   MIDDLE_NAME             Second, third, fourth names… (blank for companies)
   NAME_EXTENSION          Jr/Sr/academic title; type of business entity for companies
   ADDRESS                 Typically: toponym, name, number
   LOCALITY                City area (optional)
   ADDR_OTHER              Other details that are not toponyms (floor, building, but
                           also c/o company name) [should be data not relevant for
                           standardization]
   CITY                    Municipality name
   COUNTY                  Administrative level above municipality
   REGION                  Administrative level above county
   STATE                   Administrative level above region, for federal nations
   ZIP_CODE                Alphanumeric zip code

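As an illustration only, the intermediate structure above could be held in a small Python record type. The class name ParsedPerson and the lower-case attribute names are assumptions made for this sketch; the slides only fix the field names and their meaning.

```python
from dataclasses import dataclass, asdict


@dataclass
class ParsedPerson:
    """One re-parsed person record; every field defaults to blank."""
    last_name: str = ""       # surname / company name
    first_name: str = ""      # blank for companies
    middle_name: str = ""     # blank for companies
    name_extension: str = ""  # Jr/Sr/academic title; business entity type
    address: str = ""         # typically: toponym, name, number
    locality: str = ""        # city area (optional)
    addr_other: str = ""      # floor, building, c/o company name
    city: str = ""            # municipality name
    county: str = ""          # level above municipality
    region: str = ""          # level above county
    state: str = ""           # level above region, federal nations
    zip_code: str = ""        # alphanumeric zip code


# A company: only the surname slot and the entity type are filled.
company = ParsedPerson(last_name="TAPROGGE", name_extension="GMBH")
print(asdict(company))
```

Keeping every field as a plain string with a blank default mirrors the "blank for companies" convention and leaves standardization to later steps.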
Dimensions in data cleaning:
Define pre-parsing and data cleaning as:
• a projection (in the algebraic sense of the word),
  where:
• some operators transform some vectors (fields) into
  other vectors…
• … within the constraints of the (endogenous) conditions
  given by the data structure.
Projections may take place if some pre-conditions
  (patterns) are satisfied.

Dimensions in data cleaning: operators (I)
We consider only the MAP+CORRECT operator and treat the other
possible operators as particular cases of it.

    MAP+CORRECT            maps a vector into the correct domain, possibly
     transforming its elements (moves a string from one field to another,
     replacing it where correction is needed…)

For this operator we need two dimensions indicating where the operation
takes place:
    FIELD FROM               name of the field where the operation starts
    FIELD TO                 name of the field that is the final target of the operator (optional)

We also need to specify which string has to be found and what it must be replaced with:
    FIND                     string to be found
    REPLACE                  string replacing the string found

Dimensions in data cleaning: operators (II)
How to emulate other string operators with MAP+CORRECT (M+C):

   • MOVE (moves a string from one field to another) = M+C where REPLACE
     string = FIND string
   • REPLACE (changes a string inside a field) = M+C where FIELD FROM =
     FIELD TO
   • INSERT (inserts a string without removing other strings) = M+C where
     FIELD FROM = FIELD TO and REPLACE string = FIND + insert string
   • DELETE (removes a string without removing other strings) = M+C where
     FIELD FROM = FIELD TO and REPLACE string is empty

[NOTE: moving a string within a field is considered only when we need to shift it
to the trailing or leading position; this may take several steps to accomplish.]

Usage of operators: example
Description / last_name                  Field from   Field to         Find               Replace
TAPROGGE GESELLSCHAFT MBH
REPLACE                                  LAST_NAME    LAST_NAME        GESELLSCHAFT MBH   GMBH
TAPROGGE GMBH
MOVE                                     LAST_NAME    NAME_EXTENSION   GMBH               GMBH
TAPROGGE
MOVE+REPLACE (the two steps above in     LAST_NAME    NAME_EXTENSION   GESELLSCHAFT MBH   GMBH
1 step)

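The MAP+CORRECT operator and the TAPROGGE example can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the function name map_correct, the dict-based record, and the choice to append the REPLACE string to the end of the target field are all hypothetical. The company name is kept in last_name, following the field definitions of the metadata structure.

```python
def map_correct(record, field_from, field_to, find, replace):
    """Apply one MAP+CORRECT step to a copy of record (a dict of fields).

    FIND is taken out of field_from. REPLACE is written in its place when
    field_to equals field_from, otherwise it is appended to field_to.
    """
    out = dict(record)
    source = out.get(field_from, "")
    if find not in source:
        return out  # nothing to find: the record is returned unchanged
    if field_to == field_from:
        out[field_from] = source.replace(find, replace, 1).strip()
    else:
        out[field_from] = " ".join(source.replace(find, "", 1).split())
        target = out.get(field_to, "") + " " + replace
        out[field_to] = " ".join(target.split())
    return out


rec = {"last_name": "TAPROGGE GESELLSCHAFT MBH", "name_extension": ""}

# REPLACE: FIELD FROM equals FIELD TO
step_a = map_correct(rec, "last_name", "last_name", "GESELLSCHAFT MBH", "GMBH")
# MOVE: REPLACE string equals FIND string
step_b = map_correct(step_a, "last_name", "name_extension", "GMBH", "GMBH")
# MOVE+REPLACE in a single step
one_step = map_correct(rec, "last_name", "name_extension",
                       "GESELLSCHAFT MBH", "GMBH")
# DELETE: same field, empty REPLACE string
deleted = map_correct(step_a, "last_name", "last_name", " GMBH", "")

print(step_a, step_b, one_step, deleted, sep="\n")
```

Running the two-step route and the one-step route gives the same record, which is the point of treating MOVE and REPLACE as special cases of one operator.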
Endogenous conditions

The methods used to clean addresses may differ depending on pieces of
information contained in the data themselves. Typical cases are:

   APPLICATION AUTHORITY                  gives some 'address filling hints'
                                           and the charset

   COUNTRY CODE                           gives toponyms, administrative
                                           data, etc.

   YEAR FROM / TO (OPT.)                  some info may change with time
                                           (e.g. a change in country code)

   PATSTAT EDITION FROM/TO (OPT.)         some info can change between
                                           PATSTAT editions.

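A possible way to evaluate these endogenous conditions before running a pattern is sketched below. The function name applies and the dictionary keys (appln_auth, country, date_from, date_to) are assumptions for illustration; the "%" wildcard and the "empty means no exclusion" rule are taken from the field list in the appendix. The dates in the example are made up.

```python
from datetime import date


def applies(pattern, authority, country, filing_date):
    """Return True if the pattern's endogenous conditions admit the record."""
    def matches(wanted, actual):
        # "%" means the pattern is valid for every value
        return wanted == "%" or wanted == actual

    if not matches(pattern["appln_auth"], authority):
        return False
    if not matches(pattern["country"], country):
        return False
    date_from = pattern.get("date_from")  # empty (None) means no exclusion
    date_to = pattern.get("date_to")
    if date_from is not None and filing_date < date_from:
        return False
    if date_to is not None and filing_date > date_to:
        return False
    return True


# Hypothetical pattern: valid for EP filings from DE up to the end of 1990.
german_rule = {"appln_auth": "EP", "country": "DE",
               "date_from": None, "date_to": date(1990, 12, 31)}
generic_rule = {"appln_auth": "%", "country": "%"}

print(applies(german_rule, "EP", "DE", date(1985, 1, 1)))
print(applies(generic_rule, "US", "FR", date(2000, 1, 1)))
```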
Pre-conditions: match patterns

Ultimately, at string level, this is the core of our interchange format.
Our proposal is to use SQL REGEXP operator patterns as the default, with the
following parameters:

   LIKE                           pattern to be found (inclusion criterion)
   LIKE NOT [OPTIONAL]            pattern that must not be present (exclusion criterion)
   POSITION (begin / end)         start / end position within which the pattern may occur
   SQLSTANDARD                    states the standard used to write the
                                   patterns (SQL 'dialect', LIKE vs REGEXP…), to
                                   make translation easier

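The LIKE / LIKE NOT pair can be tried out with Python's re module, as in the sketch below. This is only an approximation: Python regular expressions are not identical to the MySQL REGEXP operator (case sensitivity and some syntax differ), and the function name precondition_holds is hypothetical. The patterns come from the example slides; the street addresses are made up.

```python
import re


def precondition_holds(value, like, like_not=None):
    """True when value matches LIKE and, if given, does not match LIKE NOT."""
    if re.search(like, value) is None:
        return False
    if like_not and re.search(like_not, value) is not None:
        return False
    return True


# Inclusion only: the entity type must close the string.
ends_with_entity = precondition_holds("TAPROGGE GESELLSCHAFT MBH",
                                      r"GESELLSCHAFT MBH$")
# Inclusion plus exclusion: one " - " is accepted, two are not.
single_range = precondition_holds("HAUPTSTRASSE 3 - 5",
                                  r"[0-9] - [0-9]", r" - .+ - ")
double_range = precondition_holds("HAUPTSTRASSE 3 - 5 - 7",
                                  r"[0-9] - [0-9]", r" - .+ - ")

print(ends_with_entity, single_range, double_range)
```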
Interchange data structure proposal: vectors (I)
   • We propose a field called OPERATIONKIND to store the
     origin and destination of the move operation.
   • It would be a multi-position indicator with one character for each field of the
     pattern group, indicating the field to be addressed.

COUNTRY  ADDRESS  LOCALITY  ADDR_OTHER  CITY  COUNTY  REGION  STATE  ZIP  NOWHERE
   A        B        C          D        E      F       G       H     I      0

LAST_NAME  FIRST_NAME  MIDDLE_NAME  NAME_EXTENSION
    0          1            2              3

Interchange data structure proposal: vectors (II)
For instance: BCEF
LIKE, LIKE NOT, FIND, REPLACE = BCEF would mean: if the LIKE pattern is
    in address and the LIKE NOT pattern is not in locality, find the FIND
    pattern in city and insert the REPLACE pattern in county.

An optional last character will be added for move operations
      (where the 1st and 4th characters are different): L or T, indicating
      whether the REPLACE pattern must be inserted in leading or trailing
      position in the target field.

For instance: BBBDT would mean that LIKE, LIKE NOT and FIND are in address, and
    the REPLACE string must be inserted at the end of the addr_other field.

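Decoding an OPERATIONKIND value is mechanical, as the following sketch suggests. The function name decode_operationkind is hypothetical, and the sketch only covers the address code table (A to I, plus 0 for NOWHERE): the name-field codes 0 to 3 are left out because the slides do not say how a decoder should tell the two code tables apart.

```python
ADDRESS_CODES = {
    "A": "COUNTRY", "B": "ADDRESS", "C": "LOCALITY", "D": "ADDR_OTHER",
    "E": "CITY", "F": "COUNTY", "G": "REGION", "H": "STATE", "I": "ZIP",
    "0": "NOWHERE",
}
ROLES = ("LIKE", "LIKE NOT", "FIND", "REPLACE")
PLACEMENT = {"L": "leading", "T": "trailing"}


def decode_operationkind(kind):
    """Split a 4 or 5 character code into (role -> field, placement)."""
    if len(kind) not in (4, 5):
        raise ValueError("expected 4 characters plus an optional L or T")
    fields = {role: ADDRESS_CODES[code] for role, code in zip(ROLES, kind[:4])}
    placement = PLACEMENT[kind[4]] if len(kind) == 5 else None
    return fields, placement


print(decode_operationkind("BCEF"))
print(decode_operationkind("BBBDT"))
```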
One example (as before):
ID                      1
OPERATIONKIND           0003
APPLICATION AUTHORITY   EP
COUNTRY CODE            DE
ALIKE                   GESELLSCHAFT MBH$
LIKE NOT
FIND                    GESELLSCHAFT MBH
FIND2                   GESELLSCHAFT MBH
REPLACE                 GMBH
POSFROM                 2000
POSTO                   2000
SQLSTANDARD             MYSQL50
DATE FROM
DATE TO
PATSTATFROM
PATSTATTO
DESCRIPTION             moves GESELLSCHAFT MBH from name to kind

Looking back at the previous example (TAPROGGE GESELLSCHAFT MBH, slide
"Usage of operators: example"): 0003 means that the LIKE, LIKE NOT and FIND
patterns are in LAST_NAME, and the REPLACE pattern goes in NAME_EXTENSION.

Open issues (I):
Finally, we have to consider some issues that are still pending.

• Define a standard address
Since cleaning patterns rely on backward logic, people sharing these data should
have a common target for data standardization. We propose using local post office
standards, but such standards may be unavailable or unsuitable.

• Automatic query generation
Users would benefit greatly from exchanging patterns if a query-generating tool
could create SQL files from the pattern table.

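To give an idea of what such a query-generating tool might look like, here is a minimal Python sketch that turns one pattern row into a MySQL-style UPDATE statement. Everything about it is an assumption: the function names, the dictionary keys, and the restriction to patterns whose LIKE, LIKE NOT, FIND and REPLACE all address the same column. The generated expression follows the LEFT / replacement / RIGHT construction shown in Appendix 3.

```python
def sql_literal(text):
    """Quote text as a MySQL string literal (backslashes and quotes escaped)."""
    return "'" + text.replace("\\", "\\\\").replace("'", "''") + "'"


def build_update(table, column, pattern):
    """Return one UPDATE statement for a single-column pattern row."""
    find = sql_literal(pattern["find"])
    repl = sql_literal(pattern["replace"])
    pos = f"INSTR({column}, {find})"
    new_value = (
        f"TRIM(CONCAT(LEFT({column}, {pos} - 1), {repl}, "
        f"RIGHT({column}, LENGTH({column}) - LENGTH({find}) - {pos} + 1)))"
    )
    conditions = [
        f"{pos} BETWEEN {int(pattern['posfrom'])} AND {int(pattern['posto'])}",
        f"{column} REGEXP {sql_literal(pattern['alike'])}",
    ]
    if pattern.get("likenot"):
        conditions.append(
            f"{column} NOT REGEXP {sql_literal(pattern['likenot'])}")
    return (f"UPDATE {table} SET {column} = {new_value} WHERE "
            + " AND ".join(conditions) + ";")


# Pattern 100 from the examples in Appendix 2.
row_100 = {"alike": "[0-9] BIS [0-9]", "likenot": " BIS .+ BIS ",
           "find": " BIS ", "replace": "-", "posfrom": 2, "posto": 9999}
sql = build_update("address", "address", row_100)
print(sql)
```

A real tool would also have to handle moves between fields, the endogenous conditions and other SQL dialects; the sketch only shows that the pattern table carries enough information to emit the statement.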
Open issues (II):

• High correlation & chronology
The quality and results of data cleaning may depend on the order in which the
steps are run (e.g. if I do not remove PO BOX numbers from addresses before
cleaning street numbers, I may get wrong results).

Above all, some patterns must be run repeatedly, and in some cases whole groups of
patterns must be run repeatedly (e.g. MOVE from address PO BOX, CITY, ZIP,
REMOVE COMMA: since I do not know the order of the elements in
ADDRESS, I should run the group of queries 4 times to be sure).

A partial solution may be to add fields indicating the ID of the previous query,
the ID of the following query and the number of repetitions.
It remains an open issue how to manage groups of repetitions and cleaning
patterns that need a 'loop until no match is found'.

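One way to picture the 'loop until no match is found' idea is a fixed-point loop over a group of rules, sketched here in Python. This does not settle the open issue of how to encode it in the interchange format; the function name run_until_stable, the (regex, replacement) rule shape and the max_rounds safety cap are assumptions, and the input string is made up.

```python
import re


def run_until_stable(value, rules, max_rounds=10):
    """Apply the whole group of rules, in order, until nothing changes.

    Returns the cleaned value and the number of rounds that were run,
    counting the final round in which no rule changed anything.
    """
    for rounds in range(1, max_rounds + 1):
        before = value
        for pattern, replacement in rules:
            value = re.sub(pattern, replacement, value)
        if value == before:
            return value, rounds
    raise RuntimeError("rules did not converge within max_rounds")


# Group with a single rule: collapse double commas.
comma_rules = [(r",,", ",")]
cleaned, rounds = run_until_stable("ROMA,,,, ITALY", comma_rules)
print(cleaned, rounds)
```

The cap guards against rule groups that keep rewriting each other's output, which a plain "repeat N times" field cannot detect.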
Open issues (III):

• Grants for collaboration on this project
A grant for collaborating on this part of the APE-INV project is open for visits to
Kites; information is on the APE-INV website, section "Grants" (call:
http://www.esf-ape-inv.eu/download/Draft%20call%20for%20visits%20_ESF-APE-INV%202012_4th_call.pdf)

Acknowledgements

   Thanks to: Francesco Lissoni for supervision, Lorenzo Peccati for
    suggestions, Bulat Sanditov for Russian translation.

Appendix: Interchange data structure proposal (I)

This is the list of the fields needed; where not indicated, the meaning of the field is
explained in the previous slides.
   APPLICATION AUTHORITY   2 char string   % may indicate valid for all
   COUNTRY CODE            2 char string   % may indicate any country
   DATE FROM               date            [optional] empty means no exclusion
   DATE TO                 date            [optional] empty means no exclusion
   PATSTATFROM             MMYYYY          [optional] empty means no exclusion
   PATSTATTO               MMYYYY          [optional] empty means no exclusion

Interchange data structure proposal (II)

Where not indicated, the meaning of the field is explained in the previous slides.

   ALIKE                 string    (not called LIKE because that may
                                   cause errors in some SQL dialects)
   LIKE NOT [OPTIONAL]   string
   FIND                  string
   FIND2                 string    for when a literal FIND does not work
                                   and we need a fixed length
   REPLACE               string
   POSFROM               integer   start of the position range where the string can be
   POSTO                 integer   end of the position range where the string can be
   SQLSTANDARD           string

Interchange data structure proposal (III)

Note: some combinations of POSFROM and POSTO may have particular meanings,
such as:

(1, 1) means start position; (9999, 9999) means trailing position;
(2, 9999) means everywhere but at the beginning.

Finally, a field containing a description of the operation is needed:

   DESCRIPTION                      text

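The POSFROM / POSTO conventions above could be interpreted as in the following Python sketch. The function name position_ok is hypothetical, and reading (9999, 9999) as "the value ends with the FIND string" is an assumption about what "trailing position" means. Positions are 1-based and refer to the first occurrence, like SQL INSTR. The PO BOX strings are made up.

```python
TRAILING = 9999


def position_ok(value, find, posfrom, posto):
    """Check whether FIND occurs in value within the allowed position range."""
    pos = value.find(find) + 1  # 1-based like SQL INSTR; 0 means not found
    if pos == 0:
        return False
    if (posfrom, posto) == (TRAILING, TRAILING):
        return value.endswith(find)  # trailing position
    return posfrom <= pos <= posto


print(position_ok("PO BOX 12 ROMA", "PO BOX", 1, 1))             # start
print(position_ok("WAGNER STRASSE 3 BIS 12", " BIS ", 2, 9999))  # not at start
print(position_ok("TAPROGGE GESELLSCHAFT MBH", "GESELLSCHAFT MBH",
                  TRAILING, TRAILING))                           # trailing
```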
Appendix2: Some examples
ID                     1                            2        99               100                106
OPERATIONKIND          EEED                         EEEE     BBBB             BBBB               BBBB
APPLICATION AUTHORITY  EP                           %        %                %                  %
COUNTRY CODE           US                           %        %                %                  %
LIKE                   PO BOX [0-9][0-9][0-9][0-9]  %,,%     '[0-9] - [0-9]'  '[0-9] BIS [0-9]'  '[0-9] A [0-9]'
LIKE NOT                                                     ' - .+ - '       ' BIS .+ BIS '     ' A .+ A '
FIND                   PO BOX                       ,,       '-'              ' BIS '            'A'
FIND2                  PO BOX ####
REPLACE                                             ,        '-'              '-'                '-'
POSFROM                1                            1        2                2                  2
POSTO                  1                            9999     9999             9999               9999
SQLSTANDARD            MYSQL50                      MYSQL50  MYSQL50          MYSQL50            MYSQL50
DATE FROM
DATE TO
PATSTATFROM
PATSTATTO

DESCRIPTION
   1:              moves PO BOX from city to addr_other
   2:              removes double comma in city
   99, 100, 106:   these are different formats, all aiming to set multiple street
                   numbers in address to the format #-#

Appendix 3: Deep into one pattern (I)

Let's see how the query would work for one of the examples (#100 in Appendix 2).

We assume an intermediate table called address, whose fields are
    structured according to the metadata structure proposal (see above).

Our patterns table is here called corrections.

We run it on a record with ADDRESS = "WAGNER STRASSE 3 BIS 12"

Appendix 3: Deep into one pattern (II):
"WAGNER STRASSE 3 BIS 12" vs " BIS "

update address a, corrections b
set a.address = trim(concat(
      -- the new address is the trimmed concatenation of:
      -- what was before the match ("WAGNER STRASSE 3")
      left(a.address, instr(a.address, b.find) - 1),
      -- the replacement ("-")
      b.replace,
      -- what was after the match ("12")
      right(a.address, length(a.address) - length(b.find)
                       - instr(a.address, b.find) + 1)))
where b.operationkind = 'BBBB'
  -- " BIS " starts at position 2 or later
  and instr(a.address, b.find) >= b.posfrom
  -- " BIS " starts at or before position 9999
  and instr(a.address, b.find) <= b.posto
  -- the address matches the regular expression '[0-9] BIS [0-9]'
  and a.address regexp b.alike
  -- the address does not match ' BIS .+ BIS ', i.e. " BIS " does not occur twice
  and a.address not regexp b.likenot
  -- no criteria on date or PATSTAT edition
  and b.datefrom is null and b.dateto is null
  and b.patstatfrom is null and b.patstatto is null;

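To check the string arithmetic of the query without a database, the same steps can be replayed in Python. This is a sketch for illustration, not part of the proposal: the function names and dictionary keys are assumptions, and Python's re module stands in for the MySQL REGEXP operator, which is not identical.

```python
import re


def instr(text, sub):
    """1-based position of sub in text, 0 when absent (like SQL INSTR)."""
    return text.find(sub) + 1


def apply_correction(address, corr):
    """Replay the UPDATE for one address and one row of corrections."""
    pos = instr(address, corr["find"])
    if not (corr["posfrom"] <= pos <= corr["posto"]):
        return address
    if re.search(corr["alike"], address) is None:
        return address
    if corr["likenot"] and re.search(corr["likenot"], address):
        return address
    before = address[:pos - 1]                    # LEFT(address, pos - 1)
    after = address[pos - 1 + len(corr["find"]):] # RIGHT(address, ...)
    return (before + corr["replace"] + after).strip()


# Pattern 100 from Appendix 2.
row_100 = {"alike": r"[0-9] BIS [0-9]", "likenot": r" BIS .+ BIS ",
           "find": " BIS ", "replace": "-", "posfrom": 2, "posto": 9999}

print(apply_correction("WAGNER STRASSE 3 BIS 12", row_100))
```

Here " BIS " starts at position 17, so the left part is the first 16 characters ("WAGNER STRASSE 3") and the right part is the last 23 - 5 - 17 + 1 = 2 characters ("12").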
