Data cleanup

Methods using simple tools in advanced ways.

Start with what you have
▪ Iterative changes
▪ Use the tool that works
▪ Have a plan to achieve the format you need

The Setonian
Transforming ~5,000 news stories.

This is what we have to get to

Rows:
One story per row
All fields enclosed by quotation marks
All fields separated by commas
This is harder than you think.

Columns:
id,title,text,status,date_added,date_start,author_id,school_id,category,tags,show_gallery,national,large_image_url,medium_image_url,small_image_url,score,subtitle,byline,external_id,main_homepage,homepage_thumbnail,main_category,category_thumbnail
"post_type","post_title","post_content","post_status","post_date","post_author","post_category","post_tags","post_thumbnail","post_excerpt","Byline"

Some similarities, some differences
▪ title = post_title
▪ text = post_content
▪ date_start = post_date
▪ author_id = post_author
▪ category = post_category
▪ tags = post_tags
▪ large_image_url = post_thumbnail
▪ subtitle = post_excerpt
▪ id & external_id
▪ date_added
▪ school_id
▪ show_gallery
▪ national
▪ medium_image_url & small_image_url ?
▪ score
▪ main_homepage & main_category
▪ homepage_thumbnail & main_category_thumbnail
▪ Byline

Start with some low-hanging fruit
What exactly is “&shy;”?
It’s a “soft hyphen.”
Do a find-and-replace in any text editor to
replace it with “” (nothing).
3,777 replacements made.

And more…
What exactly is “&nbsp;”?
It’s a “non-breaking space.”
Replace it with “ ” (just a space).

And while we’re on the subject…
Replace every double space with a single space.
But try it again: 4,448 replacements.
Then 1,043…
Then 456…
Then 76…
Then 32…
Then 16…
Then 2…
Finally … 0
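The repeated passes above can be folded into a single step. A minimal Python sketch, assuming the export is already loaded into a string; the function names `collapse_spaces` and `collapse_by_repetition` are made up for illustration:

```python
import re


def collapse_spaces(text: str) -> str:
    """Replace every run of two or more spaces with one space, in one pass."""
    return re.sub(r" {2,}", " ", text)


def collapse_by_repetition(text: str):
    """Mimic the manual approach: repeat the double-space replace until it finds nothing."""
    passes = 0
    while "  " in text:
        text = text.replace("  ", " ")
        passes += 1
    return text, passes


sample = "She  had   a     time"
print(collapse_spaces(sample))
print(collapse_by_repetition(sample))
```

Both routes end with the same text; the second one shows why the editor needed several rounds, since each pass only halves a long run of spaces.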

So what are all these &…; things anyway?
HTML character entities
Mathematical symbols
θ written as &theta; or &#952;
≈ written as &thickapprox; or &#8776;
Accented characters
ú written as &uacute; or &#250;
Special punctuation, curly quotes and apostrophes, dashes, actual ampersands
“ written as &ldquo; or &#8220;
Non-English characters
þ written as &thorn; or &#254;
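If you would rather have the real characters than the entities, Python's standard library can decode them. This is a small sketch rather than the method used in the talk; `clean_invisibles` is a hypothetical helper that also drops soft hyphens and turns non-breaking spaces into plain spaces, as in the earlier find-and-replace steps:

```python
import html

# Named and numeric forms of the same characters.
samples = ["&theta;", "&#952;", "&uacute;", "&#250;",
           "&ldquo;", "&#8220;", "&thorn;", "&#254;"]
decoded = [html.unescape(s) for s in samples]
print(decoded)


def clean_invisibles(text: str) -> str:
    """Decode entities, then remove soft hyphens and normalize non-breaking spaces."""
    text = html.unescape(text)
    return text.replace("\u00ad", "").replace("\u00a0", " ")


print(clean_invisibles("pre&shy;vention&nbsp;method"))
```

One caution: `html.unescape` also decodes `&amp;`, `&lt;` and `&gt;`, so on story text that is itself HTML it is safer to target only the entities you want gone.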

While we’re cleaning house
Tabs
Maybe a search-and-replace
Maybe Word’s special expansion
Tabs are useless in HTML because they get compressed into spaces
No matter what kind, multiple spaces are treated as one.
Replace tabs with spaces … another 54,000 instances.
Then repeat our double- to single-space search: another 6,000+.

Now it gets more interesting
Matching something that’s not always the same.

What about these blocks?
Font tags are evil
▪ <font face=""Times"">She had a time of 26.27.33.
</font></div><div style=""margin: 0in 0in 0pt"">
</div><div style=""margin: 0in 0in 0pt""><font
face=""Times"">Freshman Eloisa Parades finished in 147th
place with a time of 26.41.7.</font></div><div
style=""margin: 0in 0in 0pt""> </div><div style=""margin:
0in 0in 0pt""><font face=""Times"">Freshman Hughnique
Rolle finished in 157th place with a time of 27.16.3.
</font></div><div style=""margin: 0in 0in 0pt"">
</div><div style=""margin: 0in 0in 0pt""><font
face=""Times"">Junior Madison Wrest finished in 161st
place with a time of 27.29.66.</font></div><div
style=""margin: 0in 0in 0pt""> </div><div style=""margin:
0in 0in 0pt
▪ <font color=""#221e1f"" size=""7""><font
color=""#221e1f"" size=""7""><span _fck_bookmark=""1""
style=""display: none""> </span></font></font>
▪ These don’t bring anything to the party.
▪ But every one is different!
▪ Fonts and faces, colors, sizes and more.
▪ Also – look at the <span> and <div> tags

Regular expressions
Essentially, pattern matching
Using a special set of meta-characters and wild cards
Found in Python, R, Perl, PHP and some common (and
free) text editors, mostly in similar Perl-style or POSIX flavors
But not in Excel, sorry.
How do we find a <font…> tag when we don’t
know what’s in it?

Basic matches and anchors
▪ /any/
– Matches “any,” “many”, “anymore”
▪ /^any/
– Matches “any,” “anymore”
▪ /any$/
– Matches “many,” “any”
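The anchor examples above can be checked directly in Python. A small sketch; the helper name `matching` is invented for this example:

```python
import re

words = ["any", "many", "anymore"]


def matching(pattern, candidates):
    """Return the candidates in which the pattern is found somewhere."""
    return [w for w in candidates if re.search(pattern, w)]


anywhere = matching(r"any", words)    # no anchor: found inside every word
at_start = matching(r"^any", words)   # ^ pins the match to the start
at_end = matching(r"any$", words)     # $ pins the match to the end

print(anywhere, at_start, at_end)
```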

Optionals, multiples, wildcards
SPECIAL SYMBOLS:
? – zero or one instance (optional)
+ - one or more instances (greedy)
* - zero or more (greedy)
{0,2} – zero to two instances
▪ /an*y/
– Matches “any,” “canny”, “cay”, “annnnnnny”
▪ /an?y/
– Matches “many,” “may”
▪ /an+y/
– Matches “any,” “anymore” but not “may”
▪ /an{1,2}y/
– Matches “many,” “any”, “canny” but not “day”
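To see the quantifiers side by side, here is a short Python sketch that runs each pattern over the example words; the word list and the helper `matches` are assumptions for illustration:

```python
import re

words = ["any", "many", "may", "canny", "cay", "annnnnnny", "anymore", "day"]


def matches(pattern):
    """List the words that contain a match for the pattern."""
    return [w for w in words if re.search(pattern, w)]


star = matches(r"an*y")        # "a", any number of "n" (even none), then "y"
optional = matches(r"an?y")    # "a", at most one "n", then "y"
plus = matches(r"an+y")        # "a", at least one "n", then "y"
bounded = matches(r"an{1,2}y") # "a", one or two "n", then "y"

for name, found in [("an*y", star), ("an?y", optional),
                    ("an+y", plus), ("an{1,2}y", bounded)]:
    print(name, found)
```

Note that `an*y` matches every word in the list, because each one contains at least "ay".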

Character sets
SPECIAL SYMBOLS:
. - Any character except a newline
\ - Used to escape special characters
\. – A period
\\ - A backslash
[ - Start of a character set
( - Start of a capturing group
▪ /\d/
– Any digit
▪ /\w/
– Any word character, including digits and underscores
▪ /\s/
– Any whitespace character
▪ /\D/
– Any non-digit
▪ /\W/
– Any non-word character
▪ /\S/
– Any non-whitespace character
▪ /\n/
– New line
▪ /[any]/
– Matches either “a” or “n” or “y”
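A quick way to get a feel for these classes is to list what each one picks out of a short string. A minimal Python sketch; the sample line is invented:

```python
import re

line = "Block 506\tLot_1\n"

digits = re.findall(r"\d", line)        # every single digit
words = re.findall(r"\w+", line)        # runs of letters, digits and underscores
spaces = re.findall(r"\s", line)        # space, tab and newline all count
char_set = re.findall(r"[any]", "many") # any one of "a", "n" or "y"
periods = re.findall(r"\.", "26.41.7")  # an escaped dot matches only a period

print(digits, words, spaces, char_set, periods)
```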

Using Sublime Text’s RegEx Search

Matching a font tag
▪ <font color=""black"" face=""Tahoma"" size=""1""> is easy
▪ But it only finds 34 matches
▪ We have 680 font tags in this file

Build out a regular expression
Sublime Text shows you a preview as you add to it
This (and what it matches):
▪ <font -- 691 matches
▪ <font\scolor= -- 395 matches
▪ <font\s[a-z]+= -- 680 matches
But we can’t stop with that
▪ <font\s[a-z]+=[a-z]+> -- nothing
▪ <font\s[a-z]+=['"]+[a-z]+['"]+> -- 184 matches; ['"] is different from [‘”]
▪ <font\s[\w]+=['"]+[\w]+['"]+> -- 191 matches; \w expands the set to include digits and “_”

Pressing onwards
There are multiple attributes to find
This:
▪ <font(\s[\w]+=['"]+[\w]+['"]+)+> -- 311 matches
What are we missing?
<font color=""#0066cc"" style=""list-style-type: none;
list-style-position: initial; list-style-image: initial; margin-top: 0px;
margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px;
padding-right: 0px; padding-bottom: 0px; padding-left: 0px; "">
Pound signs, colons, semicolons and frankly who knows what else.

Sorry, this was just to show examples
There’s a far simpler way: the “negation” pattern.
This (and what it matches):
▪ [^"] -- anything BUT a double quote
▪ [^89] -- anything BUT an 8 or 9
What we’ve been looking for is
▪ <font[^>]+> -- but it only finds 680!?!
▪ <font> -- 11 matches
▪ <font[^>]*> -- finds all 691

Replace them with “” (nothing)
▪ And then, get rid of </font>
▪ Just check – yup, 691 of them.
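The difference between `+` and `*` after the negated set is easy to reproduce on a small sample. This is a Python sketch under the assumption that the text keeps the doubled quotes from the CSV export; the sample string and the name `strip_font_tags` are made up:

```python
import re

sample = ('<font face=""Times"">She had a time of 26.27.33.</font>'
          '<font color=""#221e1f"" size=""7""><font>x</font></font>')

# Count what each candidate pattern finds.
counts = {
    "<font": len(re.findall(r"<font", sample)),
    "<font[^>]+>": len(re.findall(r"<font[^>]+>", sample)),  # needs at least one attribute character
    "<font>": len(re.findall(r"<font>", sample)),            # bare tags only
    "<font[^>]*>": len(re.findall(r"<font[^>]*>", sample)),  # with or without attributes
}
print(counts)


def strip_font_tags(text: str) -> str:
    """Remove opening and closing font tags, keeping the text between them."""
    return re.sub(r"</?font[^>]*>", "", text)


print(strip_font_tags(sample))
```

The `/?` makes the slash optional, so one substitution handles both the opening tags and the `</font>` closers.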

Get rid of other junk tags
But it’s not so easy: <span> isn’t necessarily a junk tag
▪ <span -- 31,263 instances
▪ <span class="something"> -- 4,470; maybe keep those
▪ <span[^>]+(class=['"]+[^'"]+['"]+)[^>]+> -- replace with <span $1>?
▪ <span id="something"> -- fortunately doesn’t occur
▪ </span> -- remove in some cases

Get rid of the junk attributes
▪ Inline styles
– \sstyle=['"]+[^'"]+['"]+ -- 9,392 matches

Get rid of the junk attributes
▪ Proprietary attributes
– \sdata-scayt_word=['"]+[^'"]+['"]+ -- 23,643 matches

And so on
▪ data-scaytid – 23,646 matches
▪ …
Come to think of it, let’s just get rid of all the <span> tags.
But still, get rid of whatever other junk you find
▪ <o:p></o:p> -- 1,036 of these
▪ <p>\s+</p> (empty paragraphs) – 4,537 of these
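The same clean-up steps can be chained in a short script. A minimal Python sketch, assuming the doubled quotes of the CSV export are still in place; the sample string and the function name `strip_junk` are invented, and the attribute list only covers the three attributes named above:

```python
import re

# Junk attributes: a leading space, the attribute name, then a quoted value.
JUNK_ATTRS = re.compile(r"""\s(?:style|data-scayt_word|data-scaytid)=['"]+[^'"]+['"]+""")
SPAN_TAGS = re.compile(r"</?span[^>]*>")   # opening and closing span tags
EMPTY_PARAGRAPHS = re.compile(r"<p>\s+</p>")


def strip_junk(text: str) -> str:
    """Apply the removals in order: attributes, spans, Word tags, empty paragraphs."""
    text = JUNK_ATTRS.sub("", text)
    text = SPAN_TAGS.sub("", text)
    text = text.replace("<o:p></o:p>", "")
    return EMPTY_PARAGRAPHS.sub("", text)


sample = ('<div style=""margin: 0in 0in 0pt"">'
          '<span data-scayt_word=""Seton"" data-scaytid=""1"">Seton</span> Hall</div>'
          '<p> </p><o:p></o:p>')
print(strip_junk(sample))
```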

But just when you thought it was safe…

Dealing with line breaks
▪ The “\n” character is special
▪ It matches the line break at the end of a line: not a visible character, but the line feed
▪ Closely related to “\r”, the carriage return (Windows files end each line with “\r\n”)
▪ You can search for it.
– \n([^\d]) replaced with $1 yields 12,989 matches
▪ But we want to run it multiple times…10,000 more matches.
▪ We’ve gone from 16.8 MB to 14.4 MB!
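Why does the search need several runs? Presumably because every real row starts with a numeric id, so a line break that is not followed by a digit must sit inside a story; but each match also consumes the character after the break, so back-to-back breaks are only removed one per pass. A Python sketch of that loop, with an invented two-row sample and a hypothetical function name `join_broken_rows`:

```python
import re


def join_broken_rows(text: str):
    """Remove line breaks not followed by a digit, repeating until none are left."""
    # Normalize Windows and old Mac line endings to plain line feeds first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    passes = 0
    while True:
        text, replaced = re.subn(r"\n([^\d])", r"\1", text)
        if replaced == 0:
            break
        passes += 1
    return text, passes


sample = ('1,"Title","<p>First</p>\n\n<p>Second</p>"\n'
          '2,"Other","<p>Text</p>"\n')
cleaned, passes = join_broken_rows(sample)
print(passes)
print(cleaned)
```

A story line that happens to begin with a digit would be missed by this rule, so it is worth spot-checking the result.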

Are we there yet?
Let’s try opening it in Excel to see where we are.

How did we do?
Sort the spreadsheet by the last column…

Only one wrong one – that’s not so bad.

Fortunately it’s just the one
▪ We can delete all the junk Microsoft Word code
▪ Re-save it and try again
▪ If there were more, it would be easy enough to write regular
expressions to track them down.

But it still didn’t work.
Corrupted data

Here’s the line
… prevention metho",,,,,,,,,,,,,,,,,,,,s," """"The""",1,9/17/2009 0:00 …
Who knows where those commas came from.
The most efficient way is to just edit it manually.

Different formats
▪ The data export put the “tags” into curly braces:
▪ "{tag1},{tag2}"
▪ We need to get them out somehow.
▪ Let’s make an assumption:
– Replace ,"{ with ," -- 676 matches
– Replace },{ with , -- 1,060 matches
– Replace }", with ", -- 676 matches!
▪ That will work! Opening it in Excel, a quick Data::Sort shows good
results.
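Scripted, the three replacements look like this. A minimal Python sketch that makes the same assumption as the slide (braces only ever appear around tags); the sample row and the name `unwrap_tags` are invented:

```python
def unwrap_tags(csv_text: str) -> str:
    """Turn "{tag1},{tag2}" fields into "tag1,tag2" with three literal replacements."""
    return (csv_text
            .replace(',"{', ',"')    # opening brace at the start of the field
            .replace('},{', ',')     # separators between tags
            .replace('}",', '",'))   # closing brace at the end of the field


row = '"title","{sports},{track},{freshmen}","1"'
print(unwrap_tags(row))
```

Matching counts for the first and third replacement (676 each on the slide) are the sanity check that every opened field was also closed.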

Some other easy things
▪ The story ID won’t exist
▪ Change a number at the beginning of a line to the word “post”
▪ ^\d+
▪ Change the column headers
▪ Change the status from “1” to “publish”
▪ Get rid of the doubled quotation marks now.

So far so good. Now we can remove the
columns we don’t need.

And clear out the first column too.

Some other harder things
▪ Consolidate the thumbnail columns
▪ Large – Medium – Small
▪ You want the first one, the largest one, that has a value.
▪ Excel is kind of great for this
▪ Create a new column
▪ Paste in
=IF(ISNA(HLOOKUP("*",I3:K3,1,FALSE)),"",HLOOKUP("*",I3:K3,1,FALSE))
▪ Drag it down to the bottom…
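Outside Excel, the same "first column that has a value" rule is a few lines of code. A Python sketch, assuming each row is a dictionary keyed by the export's column names; the rows and the helper name `first_non_empty` are invented for illustration:

```python
def first_non_empty(*values: str) -> str:
    """Return the first value that is not empty or whitespace, else an empty string."""
    for value in values:
        if value and value.strip():
            return value
    return ""


rows = [
    {"large_image_url": "big.jpg", "medium_image_url": "mid.jpg", "small_image_url": ""},
    {"large_image_url": "", "medium_image_url": "", "small_image_url": "small.jpg"},
    {"large_image_url": "", "medium_image_url": "", "small_image_url": ""},
]

# Check the columns from largest to smallest, as on the slide.
thumbnails = [
    first_non_empty(r["large_image_url"], r["medium_image_url"], r["small_image_url"])
    for r in rows
]
print(thumbnails)
```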

Be sure to capture the values
▪ If you try to delete the three columns you’ll get a reference error
▪ Make a new column
▪ Copy the previous column
▪ Use the Paste::Values function
▪ Then delete the small, medium and large columns, plus your formula.
▪ There are a few more steps you could take, but not now…

Where Excel Falls Down
▪ Would you be surprised to learn that Excel doesn’t follow the
standards for CSV files?
▪ It gives you this:
– post,HRL alters housing fee after University reactions,"<p>The Department of
Housing and Residence Life …
▪ When what you need is this:
– "post","Poetry-in-the-Round features literary legend","<p>As a special
presentation by Seton Hall …
▪ Open Office Calc is the tool for this job.
– Open your saved file into it, and export it from there as a CSV
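If a scripting language is at hand, it can write the fully quoted file as well. This is an alternative to the Open Office Calc route above, not what the talk used: a minimal Python sketch with the standard `csv` module, writing to an in-memory buffer so it runs without touching any files. The sample rows are shortened from the slide.

```python
import csv
import io

rows = [
    ["post_type", "post_title", "post_content"],
    ["post", "Poetry-in-the-Round features literary legend",
     "<p>As a special presentation</p>"],
]

buffer = io.StringIO()
# QUOTE_ALL wraps every field in quotation marks, not only the ones that need it.
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerows(rows)

output = buffer.getvalue()
print(output)
```

To write a real file, open it with `newline=""` and pass the file object to `csv.writer` in place of the buffer.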

Montrose Park Historic District
Turning a book into SQL queries

What to capture?
▪ Street as the category
▪ Address as the “title”
▪ Block #
▪ Lot #
▪ Outbuildings – numbers and types
▪ Descriptions
▪ Styles?
▪ Images?

Some basic transformations
▪ Preparing for automatic reading
▪ For this step I used to like the Windows Store app “CodeWriter”
▪ It handled whitespaces, especially new lines, much better
▪ But it’s pretty unstable – save your work often!
▪ Newer versions of Sublime Text work well.

▪ \s*Key\s*\n\s*Outbuildings:\s*(.*) -> \n$1\n
▪ \s*Non-Contributing\s*\n\s*Outbuildings:\s*(.*) -> \n\n$1
▪ \s*Contributing\s*\n\s*Outbuildings:\s*(.*) -> \n$1\n\n
▪ \s*Block\s(\d*)\s*Lot\s(.*) -> \n$1\n$2
These change this:
470 Berkeley Avenue Block 506 Lot 1
Key
Outbuildings: 1 stylistically similar detached carriage house (C)
Into this:
470 Berkeley Avenue
506
1
1 stylistically similar detached carriage house (C)
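The "Key" and "Block/Lot" replacements can be tried on the sample entry in Python. A sketch only: Python writes the captured groups as `\1` and `\2` where Sublime Text uses `$1` and `$2`, and the variable names are made up.

```python
import re

entry = ("470 Berkeley Avenue Block 506 Lot 1\n"
         "Key\n"
         "Outbuildings: 1 stylistically similar detached carriage house (C)")

# Step 1: drop the "Key" line and the "Outbuildings:" label, keep the description.
step1 = re.sub(r"\s*Key\s*\n\s*Outbuildings:\s*(.*)", r"\n\1\n", entry)

# Step 2: move the block and lot numbers onto their own lines.
step2 = re.sub(r"\s*Block\s(\d*)\s*Lot\s(.*)", r"\n\1\n\2", step1)

print(step2)
```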

▪ (\d*\s([A-z]*\s[A-z]*)\sis.*) -> \n$1\n_ (watch that _!)
Transforms:
470 Berkeley Avenue is a 2 1/2 story, 5 bay, rectangular plan, brick, Neoclassical-influenced,
residential building. Constructed c. 1920, the slate-clad, side gambrel roofed house is articulated by
a colossal order, fluted Ionic column-supported full front porch with mutule-supported entablature
and balustrade above. Three round-arched, pilastered dormers with lancet upper sashes ornament
the slate roofline. The fenestration on the facade consists of 9/1 windows with brick lintels featuring
stone keystones and sills. The projecting enclosed portico features a segmentally arched brick
surround, with a leaded fanlight and matching sidelights. Above the portico entablature is a
wrought iron balcony. At one side of the house is a one story, set back sun porch, and at the back of
the house, is a cross gambrel wing. This Neoclassical house is located at the corner of Montrose and
Berkeley Avenues, in an estate setting.
To:
470 Berkeley Avenue is a 2 1/2 story, 5 bay, rectangular plan, brick, Neoclassical-influenced,
residential building. Constructed c. 1920, the slate-clad, side gambrel roofed house is articulated by
a colossal order, fluted Ionic column-supported full front porch with mutule-supported entablature
and balustrade above. Three round-arched, pilastered dormers with lancet upper sashes ornament
the slate roofline. The fenestration on the facade consists of 9/1 windows with brick lintels featuring
stone keystones and sills. The projecting enclosed portico features a segmentally arched brick
surround, with a leaded fanlight and matching sidelights. Above the portico entablature is a
wrought iron balcony. At one side of the house is a one story, set back sun porch, and at the back of
the house, is a cross gambrel wing. This Neoclassical house is located at the corner of Montrose and
Berkeley Avenues, in an estate setting.
Berkeley Avenue
post

▪ \n -> "\n"
– Adds quotation marks to beginning and end of each line
▪ \n -> ,
– This takes out all the line feeds, and makes a confusing mess
▪ ,"_","", -> ,""\n
– Makes sense of it again
▪ "post","","" -> "post",""
– Clears an empty field at the end of the line

A useful comma-quote delimited entry
▪ "","BERKELEY AVENUE","","470 Berkeley Avenue","506","1","1 stylistically
similar detached carriage house (C)","","470 Berkeley Avenue is a 2 1/2
story, 5 bay, rectangular plan, brick, Neoclassical-influenced, residential
building. Constructed c. 1920, the slate-clad, side gambrel roofed house is
articulated by a colossal order, fluted Ionic column-supported full front
porch with mutule-supported entablature and balustrade above. Three
round-arched, pilastered dormers with lancet upper sashes ornament the
slate roofline. The fenestration on the facade consists of 9/1 windows with
brick lintels featuring stone keystones and sills. The projecting enclosed
portico features a segmentally arched brick surround, with a leaded fanlight
and matching sidelights. Above the portico entablature is a wrought iron
balcony. At one side of the house is a one story, set back sun porch, and at
the back of the house, is a cross gambrel wing. This Neoclassical house is
located at the corner of Montrose and Berkeley Avenues, in an estate
setting. ","Berkeley Avenue","post",""

You could do something like this:
"(\d*)","Block\s(\d*)\nLot\s(\d*)\n([^:]*):\s*(.*)
-> a ready-to-run MySQL query ->
INSERT INTO wp_postmeta (post_id,meta_key,meta_value) VALUES ($1,'Block','$2');\n
INSERT INTO wp_postmeta (post_id,meta_key,meta_value) VALUES ($1,'Lot','$3');\n
INSERT INTO wp_postmeta (post_id,meta_key,meta_value) VALUES ($1,'$4','$5');\n
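A scripted variant of the same idea pulls the values out with the regular expression and hands them to the database as parameters, which sidesteps quoting problems when a description contains an apostrophe. The sketch below is an assumption-heavy stand-in: it uses Python's built-in `sqlite3` with an in-memory table in place of MySQL, a simplified three-column `wp_postmeta`, and an invented record with post id 42.

```python
import re
import sqlite3

PATTERN = re.compile(r'"(\d*)","Block\s(\d*)\nLot\s(\d*)\n([^:]*):\s*(.*)')

# Hypothetical record in the shape the pattern expects.
record = ('"42","Block 506\nLot 1\n'
          'Outbuildings: 1 stylistically similar detached carriage house (C)')


def meta_rows(text: str):
    """Build (post_id, meta_key, meta_value) tuples from each matching record."""
    rows = []
    for m in PATTERN.finditer(text):
        if not m.group(1):      # \d* can match nothing; skip records without an id
            continue
        post_id = int(m.group(1))
        rows.append((post_id, "Block", m.group(2)))
        rows.append((post_id, "Lot", m.group(3)))
        rows.append((post_id, m.group(4), m.group(5)))
    return rows


rows = meta_rows(record)

conn = sqlite3.connect(":memory:")  # stand-in for the real MySQL database
conn.execute("CREATE TABLE wp_postmeta (post_id INTEGER, meta_key TEXT, meta_value TEXT)")
conn.executemany(
    "INSERT INTO wp_postmeta (post_id, meta_key, meta_value) VALUES (?, ?, ?)", rows)
stored = conn.execute(
    "SELECT post_id, meta_key, meta_value FROM wp_postmeta ORDER BY rowid").fetchall()
print(stored)
```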

Extracting GeoCoordinates for Mapping
Using the Excel 2013-16 WEBSERVICE and FILTERXML functions
Column C:
=WEBSERVICE(CONCATENATE("http://nominatim.openstreetmap.org/search/?format=xml&q=",A2,",",B2))
Column D:
=FILTERXML(C2,"//place/@lat")
Column E:
=FILTERXML(C2,"//place/@lon")
address | Town | latitude | longitude
264 Walton Ave. | South Orange, NJ | 40.74305645 | -74.26746821
400 South Orange Ave. | South Orange, NJ | 40.746138 | -74.259232
191 Parker Ave. | Maplewood, NJ | 40.731559 | -74.2508235
6016 Morrow Dr. | Brook Park, OH | 41.39949379 | -81.79757472
Carrer del Duc, 4 2-1 | Barcelona, Catalonia 08002 | 41.3845007 | 2.1731101
2 Lafayette Street | Fairhaven, MA 02719 | 41.646911 | -70.912266

The query column (C) holds the raw XML response for each row. For the first row:
<?xml version="1.0" encoding="UTF-8" ?>
<searchresults timestamp='Thu, 12 May 16 14:21:13+0000'
attribution='Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright'
querystring='264 Walton Ave.,South Orange, NJ' polygon='false' exclude_place_ids='1618056838'
more_url='http://nominatim.openstreetmap.org/search.php?format=xml&amp;exclude_place_ids=1618056838&amp;q=264+Walton+Ave.%2CSouth+Orange%2C+NJ'>
<place place_id='1618056838' place_rank='30'
boundingbox="40.743006454546,40.743106454546,-74.267518212121,-74.267418212121"
lat='40.7430564545455' lon='-74.2674682121212'
display_name='264, Walton Avenue, Academy Heights, South Orange, Essex County, New Jersey, 07079, United States of America'
class='place' type='house' importance='0.401'/></searchresults>
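The two spreadsheet steps (build the query URL, then pull `lat` and `lon` out of the XML) translate directly to code. A Python sketch using only the standard library; it builds the URL and parses a trimmed copy of the response shown above, but deliberately does not send a request. The `https` address, the function names and the shortened sample are assumptions for illustration, and a real script would also need to respect the service's usage policy.

```python
import urllib.parse
import xml.etree.ElementTree as ET

SEARCH_URL = "https://nominatim.openstreetmap.org/search"  # assumed endpoint


def build_query_url(address: str, town: str) -> str:
    """Join address and town with a comma and URL-encode them, like CONCATENATE above."""
    query = urllib.parse.urlencode({"format": "xml", "q": f"{address},{town}"})
    return f"{SEARCH_URL}?{query}"


def extract_lat_lon(response_xml: str):
    """Return (lat, lon) of the first <place>, like the two FILTERXML calls."""
    place = ET.fromstring(response_xml).find("place")
    if place is None:
        return None  # the search found nothing
    return float(place.get("lat")), float(place.get("lon"))


# Trimmed copy of the first response; most attributes are left out.
sample_response = (
    "<searchresults querystring='264 Walton Ave.,South Orange, NJ'>"
    "<place place_id='1618056838' lat='40.7430564545455' "
    "lon='-74.2674682121212' class='place' type='house'/>"
    "</searchresults>"
)

print(build_query_url("264 Walton Ave.", "South Orange, NJ"))
print(extract_lat_lon(sample_response))
```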

For further study…
Matching addresses to images
▪ Folder: GlensideDr
▪ Files:
– Dykeman19Glenside.jpg
– Fenrich23Glenside.jpg
– Finlay8Glenside.jpg
– \w+(\d+)([^.]+)\.jpg
▪ Automatically generate an “image” field?
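As a starting point for that further study, here is a Python sketch that splits the file names into owner, house number and street. One thing to watch: a greedy `\w+` in front of the number swallows its leading digits, so this sketch uses `\D+` (anything but digits) for the name part. The pattern, the function name `parse_image_name` and the returned field names are all assumptions.

```python
import re

FILENAME = re.compile(r"(\D+)(\d+)([^.]+)\.jpg")

files = ["Dykeman19Glenside.jpg", "Fenrich23Glenside.jpg", "Finlay8Glenside.jpg"]


def parse_image_name(name: str):
    """Split a file name such as Dykeman19Glenside.jpg into its three parts."""
    m = FILENAME.fullmatch(name)
    if m is None:
        return None
    return {"owner": m.group(1), "number": m.group(2), "street": m.group(3)}


parsed = [parse_image_name(f) for f in files]
print(parsed)

# For comparison: with \w+ in front, only the last digit reaches the number group.
greedy_number = re.match(r"\w+(\d+)([^.]+)\.jpg", files[0]).group(1)
print(greedy_number)
```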
