Data Cleanup
Methods using simple tools in advanced ways.
Start with what you have
▪ Iterative changes
▪ Use the tool that works
▪ Have a plan to achieve the format you need
The Setonian
Transforming ~5,000 news stories.
This is what we started with.
This is what we have to get to
Rows:
One story per row
All fields enclosed by quotation marks
All fields separated by commas
This is harder than you think.
Columns:
id,title,text,status,date_added,date_start,author_id,school_id,category,tags,show_gallery,national,large_image_url,medium_image_url,small_image_url,score,subtitle,byline,external_id,main_homepage,homepage_thumbnail,main_category,category_thumbnail
"post_type","post_title","post_content","post_status","post_date","post_author","post_category","post_tags","post_thumbnail","post_excerpt","Byline"
Some similarities, some differences
▪ title = post_title
▪ text = post_content
▪ date_start = post_date
▪ author_id = post_author
▪ category = post_category
▪ tags = post_tags
▪ large_image_url = post_thumbnail
▪ subtitle = post_excerpt
▪ id & external_id
▪ date_added
▪ school_id
▪ show_gallery
▪ national
▪ medium_image_url & small_image_url ?
▪ score
▪ main_homepage & main_category
▪ homepage_thumbnail &
main_category_thumbnail
▪ Byline
Start with some low-hanging fruit
What exactly is “&shy;” ?
It’s a “soft hyphen.”
Do a find-and-replace with any text editor to
replace with “” (nothing).
3,777 replacements made.
And more…
What exactly is “&nbsp;” ?
It’s a “non-breaking space.”
Do a find-and-replace with any text editor to
replace it with “ ” (just a space).
28,177 replacements made.
And while we’re on the subject…
Do a find-and-replace with any text editor to
replace a double space with a single space.
15,064 replacements made.
But try it again: 4,448 replacements.
Then 1,043…
Then 456…
Then 76…
Then 32…
Then 16…
Then 2…
Finally …. 0
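Why so many passes? Each pass only halves a run of spaces, so a run of eight takes three passes. A minimal Python sketch of the repeated replacement next to a single-pass regular expression (the function names and the sample string are illustrative, not taken from the Setonian data):

```python
import re


def collapse_by_repeated_replace(text):
    """Replace double spaces with single spaces until none remain.

    Returns the cleaned text and the number of passes that made changes.
    """
    passes = 0
    while "  " in text:
        text = text.replace("  ", " ")
        passes += 1
    return text, passes


def collapse_with_regex(text):
    """Collapse every run of two or more spaces in one pass."""
    return re.sub(r" {2,}", " ", text)


sample = "one  two    three        four"  # runs of 2, 4 and 8 spaces
cleaned, passes = collapse_by_repeated_replace(sample)
print(repr(cleaned), passes)
print(repr(collapse_with_regex(sample)))
```

Both approaches end with the same text; the regular expression just gets there without the repetition.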
So what are all these &…; things anyway?
HTML character entities
Mathematical symbols
θ written as &theta; or &#952;
≈ written as &asymp; or &#8776;
Accented characters
ú written as &uacute; or &#250;
Special punctuation, curly quotes and apostrophes, dashes, actual ampersands
“ written as &ldquo; or &#8220;
Non-English characters
þ written as &thorn; or &#254;
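If a scripting language is at hand, the entity clean-up can be sketched with Python's standard `html.unescape`. This is an illustration under assumptions, not the method used in the slides: the helper name `clean_entities` is made up, and decoding every entity is only safe for text where `&lt;`, `&gt;` and `&amp;` are not needed as markup escapes.

```python
import html

# Named and numeric forms decode to the same character.
pairs = [
    ("&theta;", "&#952;"),
    ("&asymp;", "&#8776;"),
    ("&uacute;", "&#250;"),
    ("&ldquo;", "&#8220;"),
    ("&thorn;", "&#254;"),
]
decoded = [(html.unescape(named), html.unescape(numeric)) for named, numeric in pairs]


def clean_entities(text):
    """Decode entities, then drop soft hyphens and turn non-breaking spaces into spaces."""
    text = html.unescape(text)
    return text.replace("\u00ad", "").replace("\u00a0", " ")


print(decoded)
print(clean_entities("co&shy;operate&nbsp;now &amp; then"))
```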
While we’re cleaning house
Tabs
Maybe a search-and-replace
Maybe Word’s special expansion
Tabs are useless in HTML because they get compressed into spaces
No matter what kind, multiple spaces are treated as one.
Replace tabs with spaces … another 54,000 instances.
Then repeat our double- to single-space search: another 6,000+.
Now it gets more interesting
Matching something that’s not always the same.
What about these blocks?
Font tags are evil
▪ <font face=""Times"">She had a time of 26.27.33.
</font></div><div style=""margin: 0in 0in 0pt"">
</div><div style=""margin: 0in 0in 0pt""><font
face=""Times"">Freshman Eloisa Parades finished in 147th
place with a time of 26.41.7.</font></div><div
style=""margin: 0in 0in 0pt""> </div><div style=""margin:
0in 0in 0pt""><font face=""Times"">Freshman Hughnique
Rolle finished in 157th place with a time of 27.16.3.
</font></div><div style=""margin: 0in 0in 0pt"">
</div><div style=""margin: 0in 0in 0pt""><font
face=""Times"">Junior MadisonWrest finished in 161st
place with a time of 27.29.66.</font></div><div
style=""margin: 0in 0in 0pt""> </div><div style=""margin:
0in 0in 0pt
▪ <font color=""#221e1f"" size=""7""><font
color=""#221e1f"" size=""7""><span _fck_bookmark=""1""
style=""display: none""> </span></font></font>
▪ These don’t bring anything to
the party.
▪ But every one is different!
▪ Fonts and faces, colors, sizes
and more.
▪ Also – look at the <span> and
<div> tags
Regular expressions
Essentially, pattern matching
Using a special set of meta-characters and wild cards
Found in Python, R, Perl, PHP and some common (and
free) text editors, all using broadly similar syntax (POSIX or Perl-style)
But not in Excel, sorry.
How do we find a <font…> tag when we don’t
know what’s in it?
Basic matches and anchors
▪ /any/
– Matches “any,” “many”, “anymore”
▪ /^any/
– Matches “any,” “anymore”
▪ /any$/
– Matches “many,” “any”
Optionals, multiples, wildcards
SPECIAL SYMBOLS:
? – zero or one instance (optional)
+ – one or more instances (greedy)
* – zero or more instances (greedy)
{0,2} – zero to two instances
▪ /an*y/
– Matches “any,” “canny”, “cay”, “annnnnnny”
▪ /an?y/
– Matches “many,” “may”
▪ /an+y/
– Matches “any,” “anymore” but not “may”
▪ /an{1,2}y/
– Matches “many,” “any”, “canny” but not “day”
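The anchor and quantifier examples above can be checked with Python's `re` module. A small sketch; the helper `matching_words` and the word list are just for illustration:

```python
import re


def matching_words(pattern, words):
    """Return the words in which the pattern matches anywhere."""
    return [w for w in words if re.search(pattern, w)]


words = ["any", "many", "anymore", "canny", "cay", "may", "day", "annnnnnny"]

starts_with_any = matching_words(r"^any", words)
ends_with_any = matching_words(r"any$", words)
optional_n = matching_words(r"an?y", words)       # "ay" or "any"
one_or_more_n = matching_words(r"an+y", words)    # "any", "anny", ...
one_or_two_n = matching_words(r"an{1,2}y", words)

for label, result in [
    ("^any", starts_with_any),
    ("any$", ends_with_any),
    ("an?y", optional_n),
    ("an+y", one_or_more_n),
    ("an{1,2}y", one_or_two_n),
]:
    print(label, result)
```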
Character sets
SPECIAL SYMBOLS:
. - Any character except a newline
\ - used to escape special characters
\. – a period
\\ - a backslash
[ - start of a character set
( - start of a capturing group
▪ /\d/
– Any digit
▪ /\w/
– Any word character, including digits and underscores
▪ /\s/
– Any whitespace character
▪ /\D/
– Any non-digit
▪ /\W/
– Any non-word character
▪ /\S/
– Any non-whitespace character
▪ /\n/
– New line
▪ /[any]/
– Matches either “a” or “n” or “y”
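A quick way to get a feel for the character classes is to run them against a short string. The sketch below uses Python's `re.findall`; the sample strings are made up for illustration:

```python
import re

sample = "Lot 12, Block_506\tKey\n"

digits = re.findall(r"\d+", sample)           # runs of digits
words = re.findall(r"\w+", sample)            # runs of letters, digits, underscores
spaces = re.findall(r"\s", sample)            # each whitespace character
one_of = re.findall(r"[any]", "many")         # any single a, n or y
literal_dots = re.findall(r"\.", "26.27.33")  # escaped: only real periods
any_chars = re.findall(r".", "a.b")           # unescaped: every character

print(digits, words, spaces, one_of, literal_dots, any_chars)
```

Note the difference between the last two lines: the backslash is what turns the wildcard `.` back into an ordinary period.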
Using Sublime Text’s RegEx Search
Matching a font tag
▪ <font color=""black"" face=""Tahoma"" size=""1""> is easy
▪ But it only finds 34 matches
▪ We have 680 font tags in this file
Build out a regular expression
Sublime Text shows you a preview as you add to it
This:
▪ <font
– Matches: 691
▪ <font\scolor=
– Matches: 395
▪ <font\s[a-z]+=
– Matches: 680
But we can’t stop with that
▪ <font\s[a-z]+=[a-z]+>
– Matches: Nothing
▪ <font\s[a-z]+=['"]+[a-z]+['"]+>
– Matches: 184 - ['"] is different from [‘”]
▪ <font\s[\w]+=['"]+[\w]+['"]+>
– Matches: 191 – expands to include digits, “_”
Pressing onwards
There are multiple attributes to find
This:
▪ <font(\s[\w]+=['"]+[\w]+['"]+)+>
– Matches: 311
What are we missing?
<font color=""#0066cc"" style=""list-style-type: none; list-style-position: initial; list-style-image: initial; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; "">
Pound signs, colons, semicolons and frankly who knows what else.
Sorry, this was just to show examples
There’s a far simpler way: the “negation” pattern.
This:
▪ [^"]
– Matches anything BUT a double quote
▪ [^89]
– Matches anything BUT an 8 or 9
What we’ve been looking for is
▪ <font[^>]+>
– But it only finds 680!?!
▪ <font>
– 11 matches
▪ <font[^>]*>
– Finds all 691
Replace them with “” (nothing)
▪ And then, get rid of </font>
▪ Just check – yup, 691 of them.
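The same negated-set pattern works outside a text editor. A minimal Python sketch, assuming the markup is held in a string; the function name and the three-tag sample are illustrative, with quotation marks doubled as in the export:

```python
import re

FONT_OPEN = re.compile(r"<font[^>]*>")   # <font> with or without attributes
FONT_CLOSE = re.compile(r"</font>")


def strip_font_tags(markup):
    """Remove opening and closing font tags; report how many of each were removed."""
    without_open, n_open = FONT_OPEN.subn("", markup)
    cleaned, n_close = FONT_CLOSE.subn("", without_open)
    return cleaned, n_open, n_close


sample = (
    '<font face=""Times"">She had a time.</font>'
    "<font>Bare</font>"
    '<font color=""#0066cc"" style=""margin: 0px; "">X</font>'
)

cleaned, n_open, n_close = strip_font_tags(sample)
with_attributes_only = len(re.findall(r"<font[^>]+>", sample))  # misses the bare <font>
print(cleaned, n_open, n_close, with_attributes_only)
```

Comparing the two counts is the same sanity check as in the slides: the number of opening tags removed should equal the number of closing tags.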
Get rid of other junk tags
But it’s not so easy
<span> isn’t necessarily a junk tag
▪ <span
– 31,263 instances
▪ <span class="something">
– 4,470 – Maybe keep those
▪ <span[^>]+(class=['"]+[^'"]+['"]+)[^>]+>
– Replace with <span $1>?
▪ <span id="something">
– Fortunately doesn’t occur.
▪ </span>
– Remove in some cases
Get rid of the junk attributes
▪ Inline styles
– \sstyle=['"]+[^'"]+['"]+ -- 9,392 matches
Get rid of the junk attributes
▪ Proprietary tags
– \sdata-scayt_word=['"]+[^'"]+['"]+ -- 23,643 matches
And so on
▪ data-scaytid – 23,646 matches
▪ …
Come to think of it, let’s just get rid of all the <span> tags.
But still, get rid of whatever other junk you find
▪ <o:p></o:p> -- 1,036 of these
▪ <p>\s+</p> (empty paragraphs) – 4,537 of these
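Once the patterns are known, they can be applied in sequence from a script. A rough Python sketch, assuming the order below (attributes, then span tags, then empty elements); the list name, function name and sample markup are hypothetical:

```python
import re

# (pattern, replacement) pairs, applied in order.
JUNK_PATTERNS = [
    (r'''\s(?:style|data-scayt_word|data-scaytid)=['"]+[^'"]+['"]+''', ""),
    (r"</?span[^>]*>", ""),   # every opening or closing span tag
    (r"<o:p></o:p>", ""),     # empty Word paragraph markers
    (r"<p>\s+</p>", ""),      # paragraphs containing only whitespace
]


def clean_markup(text):
    """Apply each junk-removal pattern in turn."""
    for pattern, replacement in JUNK_PATTERNS:
        text = re.sub(pattern, replacement, text)
    return text


sample = (
    '<div style=""margin: 0in 0in 0pt"">'
    '<span data-scayt_word=""Setonian"" data-scaytid=""1"">Setonian</span>'
    "<o:p></o:p></div><p>  </p>"
)
result = clean_markup(sample)
print(result)
```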
But just when you thought it was safe…
Dealing with line breaks
▪ The “\n” character is special
▪ It matches the end of a line: not a visible character but the line break
▪ Closely related to “\r” (the carriage return); some files end lines with “\r\n”
▪ You can search for it.
– \n([^\d]) replaced with $1 yields 12,989 matches
▪ But we want to run it multiple times…10,000 more matches.
▪ We’ve gone from 16.8 MB to 14.4 MB!
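Repeated runs are needed because each match consumes the character after the line break, so the second of two consecutive breaks is skipped on the first pass. A Python sketch of that behaviour, alongside a lookahead variant that finishes in one pass (the lookahead is an alternative suggested here, not the pattern from the slides; the sample records are invented):

```python
import re


def join_until_stable(text):
    """Apply the slide's replacement repeatedly; return text and number of passes that changed it."""
    passes = 0
    while True:
        text, count = re.subn(r"\n([^\d])", r"\1", text)
        if count == 0:
            return text, passes
        passes += 1


def join_with_lookahead(text):
    """Remove line breaks not followed by a digit, without consuming the next character."""
    return re.sub(r"\n(?=\D)", "", text)


# Two records; the first is broken across lines, including a blank line.
sample = '101,"First story\nwraps here\n\nand here"\n102,"Second story"\n'

joined, passes = join_until_stable(sample)
print(repr(joined), passes)
print(repr(join_with_lookahead(sample)))
```

Both versions rely on the same assumption as the slides: every real record starts with a digit, and no continuation line does.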
Are we there yet?
Let’s try opening it in Excel to see where we are.
How did we do?
Sort the spreadsheet by the last column…
Only one wrong one – that’s not so bad.
Fortunately it’s just the one
▪ We can delete all the junk Microsoft Word code
▪ Re-save it and try again
▪ If there were more, it would be easy enough to write regular
expressions to track them down.
But it still didn’t work.
Corrupted data
Here’s the line
… prevention metho",,,,,,,,,,,,,,,,,,,,s," """"The""",1,9/17/2009 0:00 …
Who knows where those commas came from.
The most efficient way is to just edit it manually.
Different formats
▪ The data export put the “tags” into curly braces:
▪ "{tag1},{tag2}"
▪ We need to get them out somehow.
▪ Let’s make an assumption:
– Replace ,"{ with ," 676 matches
– Replace },{ with , 1060 matches
– Replace }", with ", 676 matches!
▪ That will work! Opening into Excel, a quick Data::Sort shows good
results.
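If the file is being processed by a script instead, the braces can be removed per field, which sidesteps the assumption about surrounding commas and quotes. A minimal Python sketch; `unwrap_tags` and the tag values are hypothetical:

```python
import re


def unwrap_tags(field):
    """Turn '{tag1},{tag2}' into 'tag1,tag2'; an empty field stays empty."""
    return ",".join(re.findall(r"\{([^{}]*)\}", field))


print(unwrap_tags("{news},{campus life}"))
print(repr(unwrap_tags("")))
```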
Some other easy things
▪ The story ID won’t exist
▪ Change a number at the beginning of a line to the word “post”
▪ ^\d+
▪ Change the column headers
▪ Change the status from “1” to “publish”
▪ Get rid of the doubled quotation marks now.
So far so good. Now we can remove the
columns we don’t need.
And clear out the first column too.
Some other harder things
▪ Consolidate the thumbnail columns
▪ Large – Medium – Small
▪ You want the first one, the largest one, that has a value.
▪ Excel is kind of great for this
▪ Create a new column
▪ Paste in
=IF(ISNA(HLOOKUP("*",I3:K3,1,FALSE)),"",HLOOKUP("*",I3:K3,1,FALSE))
▪ Drag it down to the bottom…
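The "first one that has a value" rule is easy to express outside a spreadsheet as well. A small Python sketch, assuming the three URLs are passed in large, medium, small order; the function name and file names are illustrative:

```python
def first_non_empty(*values):
    """Return the first value that is not empty or whitespace-only, else an empty string."""
    for value in values:
        if value and value.strip():
            return value
    return ""


# Arguments in priority order: large, medium, small.
print(first_non_empty("", "medium.jpg", "small.jpg"))
print(first_non_empty("large.jpg", "medium.jpg", ""))
print(repr(first_non_empty("", "  ", "")))
```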
Be sure to capture the values
▪ If you try to delete the three columns you’ll get a reference error
▪ Make a new column
▪ Copy the previous column
▪ Use the Paste::Values function
▪ Then delete the small, medium and large columns, plus your formula.
▪ There’s a few more steps you could take, but not now…
Where Excel Falls Down
▪ Would you be surprised to learn that Excel doesn’t follow the
standards for CSV files?
▪ It gives you this:
– post,HRL alters housing fee after University reactions,"<p>The Department of
Housing and Residence Life …
▪ When what you need is this:
– "post","Poetry-in-the-Round features literary legend","<p>As a special
presentation by Seton Hall …
▪ Open Office Calc is the tool for this job.
– Open your saved file into it, and export it from there as a CSV
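Another way to get every field quoted is a short script. The sketch below uses Python's standard `csv` module with `QUOTE_ALL`; it is an alternative to the Calc export, not what the slides used, and the rows are made-up examples:

```python
import csv
import io

rows = [
    ["post_type", "post_title", "post_content"],
    ["post", 'A "quoted" title', "<p>Body, with a comma</p>"],
]

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerows(rows)
output = buffer.getvalue()
print(output)

# Reading it back should give the original rows.
round_trip = list(csv.reader(io.StringIO(output)))
```

Embedded quotation marks come out doubled, which is the same convention seen in the original export.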
Montrose Park Historic
District
Turning a book into SQL queries
Starting point
What to capture?
▪ Street as the category
▪ Address as the “title”
▪ Block #
▪ Lot #
▪ Outbuildings – numbers and types
▪ Descriptions
▪ Styles?
▪ Images?
Some basic transformations
▪ Preparing for automatic reading
▪ For this step I used to like the Windows Store app “CodeWriter”
▪ It handled whitespaces, especially new lines, much better
▪ But it’s pretty unstable – save your work often!
▪ Newer versions of Sublime Text work well.
▪ \s*Key\s*\n\s*Outbuildings:\s*(.*) => \n$1\n
▪ \s*Non-Contributing\s*\n\s*Outbuildings:\s*(.*) ->
\n\n$1
▪ \s*Contributing\s*\n\s*Outbuildings:\s*(.*) -> \n$1\n\n
▪ \s*Block\s(\d*)\s*Lot\s(.*) -> \n$1\n$2
These change this:
470 Berkeley Avenue Block 506 Lot 1
Key
Outbuildings: 1 stylistically similar detached carriage house (C)
Into this:
470 Berkeley Avenue
506
1
1 stylistically similar detached carriage house (C)
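The same capture can be sketched in Python with named groups, which pulls out all the pieces in one step instead of several find-and-replace passes. The pattern below is a combined variant written for this illustration, not one of the patterns from the slides:

```python
import re

ENTRY = re.compile(
    r"^(?P<address>.+?)\s+Block\s(?P<block>\d+)\s+Lot\s(?P<lot>.+)\n"
    r"\s*(?P<status>Key|Contributing|Non-Contributing)\s*\n"
    r"\s*Outbuildings:\s*(?P<outbuildings>.*)$",
    re.MULTILINE,
)

entry = (
    "470 Berkeley Avenue Block 506 Lot 1\n"
    "Key\n"
    "Outbuildings: 1 stylistically similar detached carriage house (C)"
)

match = ENTRY.search(entry)
fields = match.groupdict() if match else {}
for name, value in fields.items():
    print(f"{name}: {value}")
```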
▪ (\d*\s([A-z]*\s[A-z]*)\sis.*) => \n$1\n_ watch that _!
Transforms:
470 Berkeley Avenue is a 2 1/2 story, 5 bay, rectangular plan, brick, Neoclassical-influenced,
residential building. Constructed c. 1920, the slate-clad, side gambrel roofed house is articulated by
a colossal order, fluted Ionic column-supported full front porch with mutule-supported entablature
and balustrade above. Three round-arched, pilastered dormers with lancet upper sashes ornament
the slate roofline. The fenestration on the facade consists of 9/1 windows with brick lintels featuring
stone keystones and sills. The projecting enclosed portico features a segmentally arched brick
surround, with a leaded fanlight and matching sidelights. Above the portico entablature is a
wrought iron balcony. At one side of the house is a one story, set back sun porch, and at the back of
the house, is a cross gambrel wing. This Neoclassical house is located at the corner of Montrose and
Berkeley Avenues, in an estate setting.
To:
470 Berkeley Avenue is a 2 1/2 story, 5 bay, rectangular plan, brick, Neoclassical-influenced,
residential building. Constructed c. 1920, the slate-clad, side gambrel roofed house is articulated by
a colossal order, fluted Ionic column-supported full front porch with mutule-supported entablature
and balustrade above. Three round-arched, pilastered dormers with lancet upper sashes ornament
the slate roofline. The fenestration on the facade consists of 9/1 windows with brick lintels featuring
stone keystones and sills. The projecting enclosed portico features a segmentally arched brick
surround, with a leaded fanlight and matching sidelights. Above the portico entablature is a
wrought iron balcony. At one side of the house is a one story, set back sun porch, and at the back of
the house, is a cross gambrel wing. This Neoclassical house is located at the corner of Montrose and
Berkeley Avenues, in an estate setting.
Berkeley Avenue
post
▪ \n -> "\n"
– Adds quotation marks to beginning and end of each line
▪ \n -> ,
– This takes out all the line feeds, and makes a confusing mess
▪ ,"_","", -> ,""\n
– Makes sense of it again
▪ "post","","" -> "post",""
– Clears an empty field at the end of the line
A useful comma-quote delimited entry
▪ "","BERKELEY AVENUE","","470 Berkeley Avenue","506","1","1 stylistically
similar detached carriage house (C)","","470 Berkeley Avenue is a 2 1/2
story, 5 bay, rectangular plan, brick, Neoclassical-influenced, residential
building. Constructed c. 1920, the slate-clad, side gambrel roofed house is
articulated by a colossal order, fluted Ionic column-supported full front
porch with mutule-supported entablature and balustrade above. Three
round-arched, pilastered dormers with lancet upper sashes ornament the
slate roofline. The fenestration on the facade consists of 9/1 windows with
brick lintels featuring stone keystones and sills. The projecting enclosed
portico features a segmentally arched brick surround, with a leaded fanlight
and matching sidelights. Above the portico entablature is a wrought iron
balcony. At one side of the house is a one story, set back sun porch, and at
the back of the house, is a cross gambrel wing. This Neoclassical house is
located at the corner of Montrose and Berkeley Avenues, in an estate
setting. ","Berkeley Avenue","post",""
You could do something like this:
"(\d*)","Block\s(\d*)\nLot\s(\d*)\n([^:]*):\s*(.*)
-> a ready-to-run MySQL query ->
INSERT INTO wp_postmeta (post_id,meta_key,meta_value) VALUES
($1,'Block','$2');\nINSERT INTO wp_postmeta
(post_id,meta_key,meta_value) VALUES ($1,'Lot','$3');\nINSERT INTO
wp_postmeta (post_id,meta_key,meta_value) VALUES ($1,'$4','$5');\n
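Pasting captured text straight into SQL breaks as soon as a value contains an apostrophe. If the inserts are run from a script, placeholders avoid that. The sketch below uses Python's built-in `sqlite3` purely as a stand-in for MySQL; the simplified table definition, the post ID 101 and the helper name are assumptions for illustration:

```python
import sqlite3


def postmeta_rows(post_id, block, lot, extra_key, extra_value):
    """Build the three (post_id, meta_key, meta_value) rows for one property."""
    return [
        (post_id, "Block", block),
        (post_id, "Lot", lot),
        (post_id, extra_key, extra_value),
    ]


conn = sqlite3.connect(":memory:")
# Simplified stand-in for the real wp_postmeta table.
conn.execute("CREATE TABLE wp_postmeta (post_id INTEGER, meta_key TEXT, meta_value TEXT)")

rows = postmeta_rows(
    101, "506", "1",
    "Outbuildings", "1 stylistically similar detached carriage house (C)",
)
conn.executemany(
    "INSERT INTO wp_postmeta (post_id, meta_key, meta_value) VALUES (?, ?, ?)",
    rows,
)

stored = conn.execute(
    "SELECT meta_key, meta_value FROM wp_postmeta WHERE post_id = ? ORDER BY rowid",
    (101,),
).fetchall()
print(stored)
```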
Extracting GeoCoordinates for Mapping
Using the Excel 2013-16 WEBSERVICE and FILTERXML functions
Column C:
=WEBSERVICE(CONCATENATE("http://nominatim.openstreetmap.org/search/?format=xml&q=",A2,",", B2))
Column D:
=FILTERXML(C2,"//place/@lat")
Column E:
=FILTERXML(C2,"//place/@lon")
address Town query latitude longitude
264 Walton Ave. South Orange, NJ
<?xml version="1.0" encoding="UTF-8" ?>
<searchresults timestamp='Thu, 12 May 16 14:21:13+0000'attribution='Data © OpenStreetMap contributors,
ODbL 1.0. http://www.openstreetmap.org/copyright' querystring='264 Walton Ave.,South Orange, NJ'
polygon='false' exclude_place_ids='1618056838'
more_url='http://nominatim.openstreetmap.org/search.php?format=xml&amp;exclude_place_ids=1618056838
&amp;q=264+Walton+Ave.%2CSouth+Orange%2C+NJ'>
<place place_id='1618056838' place_rank='30' boundingbox="40.743006454546,40.743106454546,-
74.267518212121,-74.267418212121"lat='40.7430564545455' lon='-74.2674682121212' display_name='264,
Walton Avenue, Academy Heights, South Orange, Essex County, New Jersey, 07079, United States of America'
class='place' type='house' importance='0.401'/></searchresults> 40.74305645 -74.26746821
400 South Orange Ave. South Orange, NJ
<?xml version="1.0" encoding="UTF-8" ?>
<searchresults timestamp='Thu, 12 May 16 14:09:51+0000'attribution='Data © OpenStreetMap contributors,
ODbL 1.0. http://www.openstreetmap.org/copyright' querystring='400 South Orange Ave.,South Orange, NJ'
polygon='false' exclude_place_ids='55320221,111676295,129751356,112011838'
more_url='http://nominatim.openstreetmap.org/search.php?format=xml&amp;exclude_place_ids=55320221,11
1676295,129751356,112011838&amp;q=400+South+Orange+Ave.%2CSouth+Orange%2C+NJ'>
<place place_id='55320221'osm_type='way' osm_id='11619078' place_rank='26'
boundingbox="40.7457257,40.7463348,-74.2598793,-74.2580144" lat='40.746138' lon='-74.259232'
display_name='South Orange Avenue, Academy Heights, South Orange, Essex County, New Jersey, 07079,
United States of America' class='highway' type='primary' importance='0.6'/></searchresults> 40.746138 -74.259232
191 Parker Ave. Maplewood, NJ
<?xml version="1.0" encoding="UTF-8" ?>
<searchresults timestamp='Thu, 12 May 16 14:09:51+0000'attribution='Data © OpenStreetMap contributors,
ODbL 1.0. http://www.openstreetmap.org/copyright' querystring='191 Parker Ave.,Maplewood, NJ'
polygon='false' exclude_place_ids='1618398991'
more_url='http://nominatim.openstreetmap.org/search.php?format=xml&amp;exclude_place_ids=1618398991
&amp;q=191+Parker+Ave.%2CMaplewood%2C+NJ'>
<place place_id='1618398991' place_rank='30' boundingbox="40.731509,40.731609,-74.2508735,-74.2507735"
lat='40.731559' lon='-74.2508235' display_name='191, Parker Avenue, Maplewood, Essex County, New Jersey,
07040, United States of America' class='place' type='house' importance='0.311'/></searchresults> 40.731559 -74.2508235
6016 Morrow Dr. Brook Park, OH
<?xml version="1.0" encoding="UTF-8" ?>
<searchresults timestamp='Thu, 12 May 16 14:09:51+0000'attribution='Data © OpenStreetMap contributors,
ODbL 1.0. http://www.openstreetmap.org/copyright' querystring='6016 Morrow Dr.,Brook Park, OH'
polygon='false' exclude_place_ids='1831991659'
more_url='http://nominatim.openstreetmap.org/search.php?format=xml&amp;exclude_place_ids=1831991659
&amp;q=6016+Morrow+Dr.%2CBrook+Park%2C+OH'>
<place place_id='1831991659' place_rank='30' boundingbox="41.39944379397,41.39954379397,-
81.797624723618,-81.797524723618" lat='41.3994937939698' lon='-81.7975747236181' display_name='6016,
Morrow Drive, Brook Park, Cuyahoga County, Ohio,44142, United States of America' class='place' type='house'
importance='0.501'/></searchresults> 41.39949379 -81.79757472
Carrer del Duc, 4 2-1 Barcelona, Catalonia 08002
<?xml version="1.0" encoding="UTF-8" ?>
<searchresults timestamp='Thu, 12 May 16 14:12:47+0000'attribution='Data © OpenStreetMap contributors,
ODbL 1.0. http://www.openstreetmap.org/copyright' querystring='Carrer del Duc, 4 2-1,Barcelona, Catalonia
08002'polygon='false' exclude_place_ids='50632377'
more_url='http://nominatim.openstreetmap.org/search.php?format=xml&amp;exclude_place_ids=50632377&a
mp;q=Carrer+del+Duc%2C+4+2-1%2CBarcelona%2C+Catalonia+08002'>
<place place_id='50632377' osm_type='way' osm_id='4747020' place_rank='26'
boundingbox="41.3838107,41.3848529,2.1728333,2.1736705" lat='41.3845007' lon='2.1731101'
display_name='Carrer del Duc, el Gòtic, Ciutat Vella, Barcelona, BCN, CAT, 08002,España' class='highway'
type='pedestrian' importance='0.51'/></searchresults> 41.3845007 2.1731101
2 Lafayette Street Fairhaven, MA 02719
<?xml version="1.0" encoding="UTF-8" ?>
<searchresults timestamp='Thu, 12 May 16 14:12:30+0000'attribution='Data © OpenStreetMap contributors,
ODbL 1.0. http://www.openstreetmap.org/copyright' querystring='2 Lafayette Street,Fairhaven, MA 02719'
polygon='false' exclude_place_ids='1215278014'
more_url='http://nominatim.openstreetmap.org/search.php?format=xml&amp;exclude_place_ids=1215278014
&amp;q=2+Lafayette+Street%2CFairhaven%2C+MA+02719'>
<place place_id='1215278014'place_rank='30' boundingbox="41.646861,41.646961,-70.912316,-70.912216"
lat='41.646911' lon='-70.912266' display_name='2, Lafayette Street, Fairhaven, Bristol County, Massachusetts,
02719, United States of America' class='place' type='house' importance='0.411'/></searchresults> 41.646911 -70.912266
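The two spreadsheet steps (build the query URL, then pull `lat` and `lon` out of the XML) can be sketched in Python with the standard library. No request is sent here: the XML is a trimmed copy of one response shown above, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "http://nominatim.openstreetmap.org/search/"  # URL as given in the slides


def build_query_url(address, town):
    """Build the search URL, percent-encoding the query instead of concatenating raw text."""
    return BASE_URL + "?" + urlencode({"format": "xml", "q": f"{address},{town}"})


def first_lat_lon(xml_text):
    """Return (lat, lon) of the first <place> element, or None if there is no result."""
    root = ET.fromstring(xml_text)
    place = root.find("place")
    if place is None:
        return None
    return float(place.get("lat")), float(place.get("lon"))


# Trimmed from the Maplewood response; most attributes omitted.
SAMPLE = """<searchresults querystring='191 Parker Ave.,Maplewood, NJ'>
<place place_id='1618398991' lat='40.731559' lon='-74.2508235' class='place' type='house'/>
</searchresults>"""

print(build_query_url("191 Parker Ave.", "Maplewood, NJ"))
print(first_lat_lon(SAMPLE))
print(first_lat_lon("<searchresults></searchresults>"))
```

Returning `None` for an empty result makes addresses that failed to geocode easy to spot afterwards.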
For further study…
Matching addresses to images
▪ Folder: GlensideDr
▪ Files:
– Dykeman19Glenside.jpg
– Fenrich23Glenside.jpg
– Finlay8Glenside.jpg
– \D+(\d+)([^.]+)\.jpg
▪ Automatically generate
an “image” field?
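One possible answer, sketched in Python: split each file name into owner, house number and street, then key the files by address. The group names and the `image_to_address` helper are hypothetical. The prefix is matched with `\D+` because a greedy `\w+` also consumes digits and would leave only the last digit for the number group.

```python
import re

FILENAME = re.compile(r"(?P<owner>\D+)(?P<number>\d+)(?P<street>[^.]+)\.jpg")


def image_to_address(filename):
    """Return 'number street' for a file name like Dykeman19Glenside.jpg, else None."""
    match = FILENAME.fullmatch(filename)
    if match is None:
        return None
    return f"{match.group('number')} {match.group('street')}"


files = ["Dykeman19Glenside.jpg", "Fenrich23Glenside.jpg", "Finlay8Glenside.jpg"]
images_by_address = {image_to_address(name): name for name in files}
print(images_by_address)

# The pitfall with a greedy \w+ prefix: only the final digit is captured.
greedy_number = re.match(r"\w+(\d+)", "Dykeman19Glenside").group(1)
print(greedy_number)
```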
Questions?

  • 8. Some similarities, some differences ▪ title = post_title ▪ text = post_content ▪ date_start = post_date ▪ author_id = post_author ▪ category = post_category ▪ tags = post_tags ▪ large_image_url = post_thumbnail ▪ subtitle = post_excerpt ▪ id & external_id ▪ date_ added ▪ school_id ▪ show_gallery ▪ national ▪ medium_image_url & small_image_url ? ▪ score ▪ main_homepage & main_category ▪ homepage_thumbnail & main_category_thumbnail ▪ Byline
  • 9. Start with some low-hanging fruit What exactly is “&shy;” ? It’s a “soft hyphen.” Do a find-and-replace with any text editor to replace with “” (nothing). 3,777 replacements made.
  • 10. And more… What exactly is “&nbsp;” ? It’s a “non-breaking space.” Do a find-and-replace with any text editor to replace with “ ” (just a space). 28,177 replacements made.
  • 11. And while we’re on the subject… Do a find-and-replace with any text editor to replace a double space with a single space. 15,064 replacements made. But try it again: 4,448 replacements. Then 1,043… Then 456… Then 76… Then 32… Then 16… Then 2… Finally …. 0
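The repeated find-and-replace on this slide can be sketched in Python. This is a minimal illustration, not the method used in the talk: the function names and the sample string are made up for the example. It shows why several passes are needed (each pass only halves a run of spaces) and how a single regular expression does the same job in one pass.

```python
import re

def collapse_spaces_by_repetition(text):
    """Repeat the editor-style double-space replace until nothing changes."""
    passes = 0
    while "  " in text:
        text = text.replace("  ", " ")
        passes += 1
    return text, passes

def collapse_spaces_once(text):
    """Collapse any run of two or more spaces in a single pass."""
    return re.sub(r" {2,}", " ", text)

# Hypothetical sample with runs of 2, 4 and 8 spaces.
sample = "one" + " " * 2 + "two" + " " * 4 + "three" + " " * 8 + "four"
repeated, passes = collapse_spaces_by_repetition(sample)
single = collapse_spaces_once(sample)
```

A run of eight spaces needs three passes of the plain replace, which matches the shrinking replacement counts on the slide.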
  • 12. So what are all these &…; things anyway? HTML character entities Mathematical symbols θ written as &theta; or &#952; ≈ written as &thickapprox; or &#8776; Accented characters ú written as &uacute; or &#250; Special punctuation, curly quotes and apostrophes, dashes, actual ampersands “ as &ldquo; or &#8220; Non-English characters þ written as &thorn; or &#254;
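For readers working outside a text editor, Python's standard library can decode these entities in bulk. The sketch below is an assumption-laden illustration (the sample string is invented): it decodes every entity with `html.unescape`, then applies the soft-hyphen and non-breaking-space clean-up from the earlier slides. Note that decoding also turns `&lt;` and `&amp;` into literal characters, which may not be wanted if the text is going back into HTML.

```python
import html

# Invented sample mixing named and numeric entities.
raw = "&theta; and &#952;, &uacute;, &ldquo;quoted&rdquo;, &thorn;, caf&eacute;&nbsp;au&shy;lait"

decoded = html.unescape(raw)

# Drop soft hyphens (U+00AD) and turn non-breaking spaces (U+00A0) into plain spaces.
cleaned = decoded.replace("\u00ad", "").replace("\u00a0", " ")
```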
  • 13. While we’re cleaning house Tabs Maybe a search-and-replace Maybe Word’s special expansion Tabs are useless in HTML because they get compressed into spaces No matter what kind, multiple spaces are treated as one. Replace tabs with spaces … another 54,000 instances. Then repeat our double- to single-space search: another 6,000+.
  • 14. Now it gets more interesting Matching something that’s not always the same.
  • 15. What about these blocks? Font tags are evil ▪ <font face=""Times"">She had a time of 26.27.33. </font></div><div style=""margin: 0in 0in 0pt""> </div><div style=""margin: 0in 0in 0pt""><font face=""Times"">Freshman Eloisa Parades finished in 147th place with a time of 26.41.7.</font></div><div style=""margin: 0in 0in 0pt""> </div><div style=""margin: 0in 0in 0pt""><font face=""Times"">Freshman Hughnique Rolle finished in 157th place with a time of 27.16.3. </font></div><div style=""margin: 0in 0in 0pt""> </div><div style=""margin: 0in 0in 0pt""><font face=""Times"">Junior MadisonWrest finished in 161st place with a time of 27.29.66.</font></div><div style=""margin: 0in 0in 0pt""> </div><div style=""margin: 0in 0in 0pt ▪ <font color=""#221e1f"" size=""7""><font color=""#221e1f"" size=""7""><span _fck_bookmark=""1"" style=""display: none""> </span></font></font> ▪ These don’t bring anything to the party. ▪ But every one is different! ▪ Fonts and faces, colors, sizes and more. ▪ Also – look at the <span> and <div> tags
  • 16. Regular expressions Essentially, pattern matching Using a special set of meta-characters and wild cards Found in Python, R, Perl, PHP and some common (and free) text editors, all using broadly similar syntax (Perl-style or POSIX) But not in Excel, sorry. How do we find a <font…> tag when we don’t know what’s in it?
  • 17. Basic matches and anchors ▪ /any/ – Matches “any,” “many”, “anymore” ▪ /^any/ – Matches “any,” “anymore” ▪ /any$/ – Matches “many,” “any”
  • 18. Optionals, multiples, wildcards SPECIAL SYMBOLS: ? – zero or one instance (may or may not be there) + - one or more instances (greedy) * - zero or more (greedy) {0,2} – zero to two instances ▪ /an*y/ – Matches “any,” “canny”, “cay”, “annnnnnny” ▪ /an?y/ – Matches “many,” “may” ▪ /an+y/ – Matches “any,” “anymore” but not “may” ▪ /an{1,2}y/ – Matches “many,” “any”, “canny” but not “day”
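The anchors and quantifiers from the last two slides can be checked quickly in Python. This is only a sketch: the word list and the `matches` helper are made up for the illustration, and `re.search` is used so that a pattern counts as matching when it is found anywhere in the word.

```python
import re

def matches(pattern, words):
    """Return the words in which the pattern is found anywhere."""
    return [w for w in words if re.search(pattern, w)]

words = ["any", "many", "anymore", "canny", "cay", "may", "day", "annnnnnny"]

anchored_start = matches(r"^any", words)    # must begin with "any"
anchored_end = matches(r"any$", words)      # must end with "any"
zero_or_more = matches(r"an*y", words)      # "a", any number of "n", then "y"
optional_n = matches(r"an?y", words)        # "a", at most one "n", then "y"
one_or_more = matches(r"an+y", words)       # "a", at least one "n", then "y"
one_or_two = matches(r"an{1,2}y", words)    # "a", one or two "n", then "y"
```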
  • 19. Character sets SPECIAL SYMBOLS: . - Any character except a newline \ - used to escape special characters \. – a period \\ - a backslash [ - start of a character set ( - start of a capturing group ▪ /\d/ – Any digit ▪ /\w/ – Any word character, including digits and underscores ▪ /\s/ – Any whitespace character ▪ /\D/ – Any non-digit ▪ /\W/ – Any non-word character ▪ /\S/ – Any non-whitespace character ▪ /\n/ – New line ▪ /[any]/ – Matches either “a” or “n” or “y”
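A short Python sketch of the character classes above, using invented sample strings. It shows the difference between an escaped period and the wildcard period, and what `\d`, `\w`, `\s` and a bracketed set pick out.

```python
import re

line = "Lot 12, Block_506\tkey"   # hypothetical sample containing a tab

digits = re.findall(r"\d+", line)       # runs of digits
words = re.findall(r"\w+", line)        # letters, digits and underscores
parts = re.split(r"\s+", line)          # split on spaces and tabs
in_set = re.findall(r"[any]", "many days")   # any single "a", "n" or "y"

time = "26.41.7"
literal_periods = re.findall(r"\.", time)   # escaped: only real periods
any_chars = re.findall(r".", time)          # unescaped: every character
```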
  • 20. Using Sublime Text’s RegEx Search
  • 21. Matching a font tag ▪ <font color=“”black”” face=“”Tahoma”” size=“”1””> is easy ▪ But it only finds 34 matches ▪ We have 680 font tags in this file
  • 22. Build out a regular expression Sublime Text shows you a preview as you add to it ▪ <font – 691 matches ▪ <font\scolor= – 395 matches ▪ <font\s[a-z]+= – 680 matches But we can’t stop with that ▪ <font\s[a-z]+=[a-z]+> – Nothing ▪ <font\s[a-z]+=['"]+[a-z]+['"]+> – 184 matches; ['"] is different from [‘”] ▪ <font\s[\w]+=['"]+[\w]+['"]+> – 191 matches; \w expands the set to include digits and “_”
  • 23. Pressing onwards There are multiple attributes to find This: ▪ <font(\s[\w]+=['"]+[\w]+['"]+)+> Matches: ▪ 311 What are we missing? <font color=""#0066cc"" style=""list-style-type: none; list-style-position: initial; list-style-image: initial; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; ""> Pound signs, colons, semicolons and frankly who knows what else.
  • 24. Sorry, this was just to show examples There’s a far simpler way: the “negation” pattern. ▪ [^"] – Anything BUT a double quote ▪ [^89] – Anything BUT an 8 or 9 What we’ve been looking for is ▪ <font[^>]+> – But it only finds 680!?! ▪ <font> – 11 matches ▪ <font[^>]*> – Finds all 691
  • 25. Replace them with “” (nothing) ▪ And then, get rid of </font> ▪ Just check – yup, 691 of them.
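The same removal can be sketched in Python with the negation pattern `<font[^>]*>`. The sample string and function name are invented, and the case-insensitive flag is an addition that the slides do not mention. `subn` returns the number of replacements, which gives the same sanity check as comparing the opening and closing counts in the editor.

```python
import re

FONT_OPEN = re.compile(r"<font[^>]*>", re.IGNORECASE)   # any opening font tag
FONT_CLOSE = re.compile(r"</font>", re.IGNORECASE)      # the closing tag

def strip_font_tags(html_text):
    """Remove opening and closing font tags, returning the text and both counts."""
    without_open, n_open = FONT_OPEN.subn("", html_text)
    without_both, n_close = FONT_CLOSE.subn("", without_open)
    return without_both, n_open, n_close

# Invented sample using the doubled quotes seen in the CSV export.
sample = (
    '<font face=""Times"">She had a time of 26.27.33. </font>'
    "<font>plain</font>"
    '<font color=""#221e1f"" size=""7""><span> </span></font>'
)
cleaned, opened, closed = strip_font_tags(sample)
```

If the two counts differ, some tag is unbalanced and worth inspecting by hand before moving on.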
  • 26. Get rid of other junk tags But it’s not so easy; <span> isn’t necessarily a junk tag ▪ <span – 31,263 instances ▪ <span class="something"> – 4,470; maybe keep those ▪ <span[^>]+(class=['"]+[^'"]+['"]+)[^>]+> – Replace with <span $1>? ▪ <span id="something"> – Fortunately doesn’t occur. ▪ </span> – Remove in some cases
  • 27. Get rid of the junk attributes ▪ Inline styles – \sstyle=['"]+[^'"]+['"]+ -- 9,392 matches
  • 28. Get rid of the junk attributes ▪ Proprietary attributes – \sdata-scayt_word=['"]+[^'"]+['"]+ -- 23,643 matches
  • 29. And so on ▪ data-scaytid – 23,646 matches ▪ … Come to think of it, let’s just get rid of all the <span> tags. But still, get rid of whatever other junk you find ▪ <o:p></o:p> -- 1,036 of these ▪ <p>\s+</p> (empty paragraphs) – 4,537 of these
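The junk-attribute and junk-tag removals from the last three slides can be chained in one pass over the text. The following is a sketch under assumptions: the sample markup is invented, the `data-scaytid` pattern is modelled on the `data-scayt_word` one, and the order of the patterns is simply the order of the slides.

```python
import re

# Each pattern is replaced with nothing, in this order.
JUNK_PATTERNS = [
    re.compile(r"""\sstyle=['"]+[^'"]+['"]+"""),            # inline styles
    re.compile(r"""\sdata-scayt_word=['"]+[^'"]+['"]+"""),  # spell-checker attribute
    re.compile(r"""\sdata-scaytid=['"]+[^'"]+['"]+"""),     # spell-checker attribute
    re.compile(r"</?span[^>]*>"),                           # all span tags
    re.compile(r"<o:p></o:p>"),                             # empty Word paragraphs
    re.compile(r"<p>\s+</p>"),                              # whitespace-only paragraphs
]

def strip_junk(text):
    """Apply every junk pattern in turn and return the cleaned text."""
    for pattern in JUNK_PATTERNS:
        text = pattern.sub("", text)
    return text

sample = (
    '<p style="margin: 0in 0in 0pt">Seton '
    '<span data-scayt_word="Hall" data-scaytid="3">Hall</span></p>'
    "<o:p></o:p><p>  </p>"
)
result = strip_junk(sample)
```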
  • 30. But just when you thought it was safe…
  • 31. Dealing with line breaks ▪ The “\n” character is special ▪ It matches the line break at the end of a line rather than a visible character ▪ Closely related to “\r”, the carriage return; Windows files end each line with “\r\n” ▪ You can search for it. – \n([^\d]) replaced with $1 yields 12,989 matches ▪ But we want to run it multiple times…10,000 more matches. ▪ We’ve gone from 16.8 MB to 14.4 MB!
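A Python sketch of the same line-joining step, assuming (as the pattern implies) that every real row starts with a numeric ID and that no continuation line starts with a digit. The sample rows and the function name are invented. The loop shows why the replace has to be repeated: a blank line is two breaks in a row, and one pass removes only the first of them.

```python
import re

def join_broken_rows(text):
    """Remove line breaks that are not followed by a digit, until none remain."""
    passes = 0
    while True:
        text, count = re.subn(r"\n([^\d])", r"\1", text)
        if count == 0:
            return text, passes
        passes += 1

# Invented sample: row 1 contains a blank line, row 3 a single break.
sample = '1,"First story \n\ncontinues here"\n2,"Second story"\n3,"Third \nstory"'
joined, passes = join_broken_rows(sample)
```

The replacement drops the break without adding a space, so words run together wherever the original text had no space before the break.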
  • 32. Are we there yet? Let’s try opening it in Excel to see where we are.
  • 33.
  • 34. How did we do? Sort the spreadsheet by the last column…
  • 35. Only one wrong one – that’s not so bad.
  • 36. Fortunately it’s just the one ▪ We can delete all the junk Microsoft Word code ▪ Re-save it and try again ▪ If there were more, it would be easy enough to write regular expressions to track them down.
  • 37. But it still didn’t work. Corrupted data
  • 38. Here’s the line … prevention metho",,,,,,,,,,,,,,,,,,,,s," """"The""",1,9/17/2009 0:00 … Who knows where those commas came from. The most efficient way is to just edit it manually.
  • 39. Different formats ▪ The data export put the “tags” into curly braces: ▪ "{tag1},{tag2}" ▪ We need to get them out somehow. ▪ Let’s make an assumption: – Replace ,"{ with ," 676 matches – Replace },{ with , 1060 matches – Replace }", with ", 676 matches! ▪ That will work! Opening into Excel, a quick Data::Sort shows good results.
  • 40. Some other easy things ▪ The story ID won’t exist ▪ Change a number at the beginning of a line to the word “post” ▪ ^\d+ ▪ Change the column headers ▪ Change the status from “1” to “publish” ▪ Get rid of the doubled quotation marks now.
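The three tag replacements are plain string substitutions, so they can be sketched without regular expressions. The sample row below is invented; the assumption, as on the slide, is that curly braces only ever appear around tags.

```python
# Invented sample row with two tags in the curly-brace export format.
row = '"Sports","{cross country},{freshmen}","1"'

cleaned = (
    row.replace(',"{', ',"')    # opening brace at the start of the field
       .replace("},{", ",")     # separator between two tags
       .replace('}",', '",')    # closing brace at the end of the field
)
```

Getting the same count for the first and third replacement, as the slide does, is a useful sign that every tag field was opened and closed consistently.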
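The ID replacement depends on `^` matching at the start of every line, which in Python requires the multi-line flag. A minimal sketch with invented rows and IDs:

```python
import re

# Invented sample: two rows, each starting with a numeric story ID.
rows = '4711,"HRL alters housing fee"\n4712,"Poetry-in-the-Round"'

# re.MULTILINE makes ^ match at the start of each line, not just the first.
relabelled = re.sub(r"^\d+", "post", rows, flags=re.MULTILINE)
```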
  • 41. So far so good. Now we can remove the columns we don’t need.
  • 42. And clear out the first column too.
  • 43. Some other harder things ▪ Consolidate the thumbnail columns ▪ Large – Medium – Small ▪ You want the first one, the largest one, that has a value. ▪ Excel is kind of great for this ▪ Create a new column ▪ Paste in =IF(ISNA(VLOOKUP("*",I3:K3,1,FALSE)),"",VLOOKUP("*",I3:K3,1,FALSE)) ▪ Drag it down to the bottom…
  • 44. Be sure to capture the values ▪ If you try to delete the three columns you’ll get a reference error ▪ Make a new column ▪ Copy the previous column ▪ Use the Paste::Values function ▪ Then delete the small, medium and large columns, plus your formula. ▪ There’s a few more steps you could take, but not now…
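Outside Excel, the thumbnail consolidation is a "first non-empty value" lookup. The sketch below is a hypothetical Python equivalent of what the slides describe, not a translation of the formula; the function name and URLs are made up.

```python
def first_thumbnail(large, medium, small):
    """Return the first non-empty URL, preferring the largest image."""
    for url in (large, medium, small):
        if url and url.strip():
            return url
    return ""

# Hypothetical rows in large, medium, small order.
picked_large = first_thumbnail("https://example.com/l.jpg", "https://example.com/m.jpg", "")
picked_medium = first_thumbnail("", "https://example.com/m.jpg", "https://example.com/s.jpg")
picked_none = first_thumbnail("", " ", "")
```

Because the result is a plain value rather than a formula, the three source columns can be dropped afterwards without the reference error mentioned above.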
  • 45. Where Excel Falls Down ▪ Would you be surprised to learn that Excel doesn’t follow the standards for CSV files? ▪ It gives you this: – post,HRL alters housing fee after University reactions,"<p>The Department of Housing and Residence Life … ▪ When what you need is this: – "post","Poetry-in-the-Round features literary legend","<p>As a special presentation by Seton Hall … ▪ Open Office Calc is the tool for this job. – Open your saved file into it, and export it from there as a CSV
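If a scripting language is available, the fully quoted format can also be produced with Python's `csv` module. This is offered as an alternative sketch, not as what the talk used; the rows are shortened, partly invented samples.

```python
import csv
import io

# Shortened sample rows; the class attribute is invented to show quote doubling.
rows = [
    ["post_type", "post_title", "post_content"],
    ["post", "Poetry-in-the-Round features literary legend",
     '<p class="lead">As a special presentation</p>'],
]

buffer = io.StringIO()
# QUOTE_ALL wraps every field in quotation marks, including ones that do not need it.
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerows(rows)
output = buffer.getvalue()
```

Quotation marks inside a field are doubled automatically, which is the same convention the original export used.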
  • 46. Montrose Park Historic District Turning a book into SQL queries
  • 48. What to capture? ▪ Street as the category ▪ Address as the “title” ▪ Block # ▪ Lot # ▪ Outbuildings – numbers and types ▪ Descriptions ▪ Styles? ▪ Images?
  • 49. Some basic transformations ▪ Preparing for automatic reading ▪ For this step I used to like the Windows Store app “CodeWriter” ▪ It handled whitespaces, especially new lines, much better ▪ But it’s pretty unstable – save your work often! ▪ Newer versions of Sublime Text work well.
  • 50. ▪ \s*Key\s*\n\s*Outbuildings:\s*(.*) => \n$1\n ▪ \s*Non-Contributing\s*\n\s*Outbuildings:\s*(.*) -> \n\n$1 ▪ \s*Contributing\s*\n\s*Outbuildings:\s*(.*) -> \n$1\n\n ▪ \s*Block\s(\d*)\s*Lot\s(.*) -> \n$1\n$2 These change this: 470 Berkeley Avenue Block 506 Lot 1 Key Outbuildings: 1 stylistically similar detached carriage house (C) Into this: 470 Berkeley Avenue 506 1 1 stylistically similar detached carriage house (C)
  • 51. ▪ (\d*\s([A-z]*\s[A-z]*)\sis.*) => \n$1\n_ watch that _! Transforms: 470 Berkeley Avenue is a 2 1/2 story, 5 bay, rectangular plan, brick, Neoclassical-influenced, residential building. Constructed c. 1920, the slate-clad, side gambrel roofed house is articulated by a colossal order, fluted Ionic column-supported full front porch with mutule-supported entablature and balustrade above. Three round-arched, pilastered dormers with lancet upper sashes ornament the slate roofline. The fenestration on the facade consists of 9/1 windows with brick lintels featuring stone keystones and sills. The projecting enclosed portico features a segmentally arched brick surround, with a leaded fanlight and matching sidelights. Above the portico entablature is a wrought iron balcony. At one side of the house is a one story, set back sun porch, and at the back of the house, is a cross gambrel wing. This Neoclassical house is located at the corner of Montrose and Berkeley Avenues, in an estate setting. To: 470 Berkeley Avenue is a 2 1/2 story, 5 bay, rectangular plan, brick, Neoclassical-influenced, residential building. Constructed c. 1920, the slate-clad, side gambrel roofed house is articulated by a colossal order, fluted Ionic column-supported full front porch with mutule-supported entablature and balustrade above. Three round-arched, pilastered dormers with lancet upper sashes ornament the slate roofline. The fenestration on the facade consists of 9/1 windows with brick lintels featuring stone keystones and sills. The projecting enclosed portico features a segmentally arched brick surround, with a leaded fanlight and matching sidelights. Above the portico entablature is a wrought iron balcony. At one side of the house is a one story, set back sun porch, and at the back of the house, is a cross gambrel wing. This Neoclassical house is located at the corner of Montrose and Berkeley Avenues, in an estate setting. Berkeley Avenue post
  • 52. ▪ \n -> "\n" – Adds quotation marks to beginning and end of each line ▪ \n -> , – This takes out all the line feeds, and makes a confusing mess ▪ ,"_","", -> ,""\n – Makes sense of it again ▪ "post","","" -> "post","" – Clears an empty field at the end of the line
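Two of these substitutions, written out with their backslashes, can be tried in Python. This is a sketch under one loud assumption: the line breaks in the sample entry are a guess at how the source text was laid out, since the slide transcript has lost them. Python writes the captured groups as `\1` and `\2` where Sublime Text uses `$1` and `$2`.

```python
import re

# Assumed layout of one entry; the real line breaks are not preserved in the transcript.
entry = (
    "470 Berkeley Avenue\n"
    "Block 506 Lot 1\n"
    "Key\n"
    "Outbuildings: 1 stylistically similar detached carriage house (C)"
)

# Drop the "Key" and "Outbuildings:" labels, keeping the description on its own line.
step1 = re.sub(r"\s*Key\s*\n\s*Outbuildings:\s*(.*)", r"\n\1\n", entry)

# Split "Block 506 Lot 1" into two bare values on separate lines.
step2 = re.sub(r"\s*Block\s(\d*)\s*Lot\s(.*)", r"\n\1\n\2", step1)
```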
  • 53. A useful comma-quote delimited entry ▪ "","BERKELEY AVENUE","","470 Berkeley Avenue","506","1","1 stylistically similar detached carriage house (C)","","470 Berkeley Avenue is a 2 1/2 story, 5 bay, rectangular plan, brick, Neoclassical-influenced, residential building. Constructed c. 1920, the slate-clad, side gambrel roofed house is articulated by a colossal order, fluted Ionic column-supported full front porch with mutule-supported entablature and balustrade above. Three round-arched, pilastered dormers with lancet upper sashes ornament the slate roofline. The fenestration on the facade consists of 9/1 windows with brick lintels featuring stone keystones and sills. The projecting enclosed portico features a segmentally arched brick surround, with a leaded fanlight and matching sidelights. Above the portico entablature is a wrought iron balcony. At one side of the house is a one story, set back sun porch, and at the back of the house, is a cross gambrel wing. This Neoclassical house is located at the corner of Montrose and Berkeley Avenues, in an estate setting. ","Berkeley Avenue","post",""
  • 54. You could do something like this: "(\d*)","Block\s(\d*)\nLot\s(\d*)\n([^:]*):\s*(.*) -> a ready-to-run MySQL query -> INSERT INTO wp_postmeta (post_id,meta_key,meta_value) VALUES ($1,'Block','$2');\nINSERT INTO wp_postmeta (post_id,meta_key,meta_value) VALUES ($1,'Lot','$3');\nINSERT INTO wp_postmeta (post_id,meta_key,meta_value) VALUES ($1,'$4','$5');\n
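The same idea can be sketched in Python, which makes it easier to escape apostrophes in the captured text before it lands in a query. Everything here beyond the pattern and the `wp_postmeta` column names is an assumption: the post ID 101, the sample record, and the helper names are invented, and the quote-doubling helper is a simplification. For real imports, parameterized queries through a database driver are the safer route.

```python
import re

PATTERN = re.compile(r'"(\d*)","Block\s(\d*)\nLot\s(\d*)\n([^:]*):\s*(.*)')

def sql_quote(value):
    """Wrap a value in single quotes, doubling any embedded single quote."""
    return "'" + value.replace("'", "''") + "'"

def to_inserts(record):
    """Turn one matched record into three INSERT statements, or none if it does not match."""
    m = PATTERN.search(record)
    if m is None or not m.group(1):
        return []
    post_id, block, lot, key, value = m.groups()
    pairs = [("Block", block), ("Lot", lot), (key, value)]
    return [
        "INSERT INTO wp_postmeta (post_id,meta_key,meta_value) "
        f"VALUES ({int(post_id)},{sql_quote(k)},{sql_quote(v)});"
        for k, v in pairs
    ]

# Invented record; 101 is a hypothetical post ID.
record = '"101","Block 506\nLot 1\nOutbuildings: 1 owner\'s carriage house (C)'
statements = to_inserts(record)
```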
  • 55. Extracting GeoCoordinates for Mapping Using the Excel 2013-16 WEBSERVICE and FILTERXML functions Column C: =WEBSERVICE(CONCATENATE("http://nominatim.openstreetmap.org/search/?format=xml&q=",A2,",", B2)) Column D: =FILTERXML(C2,"//place/@lat") Column E: =FILTERXML(C2,"//place/@lon") address Town query latitude longitude 264 Walton Ave. South Orange, NJ <?xml version="1.0" encoding="UTF-8" ?> <searchresults timestamp='Thu, 12 May 16 14:21:13+0000'attribution='Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright' querystring='264 Walton Ave.,South Orange, NJ' polygon='false' exclude_place_ids='1618056838' more_url='http://nominatim.openstreetmap.org/search.php?format=xml&amp;exclude_place_ids=1618056838&amp;q=264+Walton+Ave.%2CSouth+Orange%2C+NJ'> <place place_id='1618056838' place_rank='30' boundingbox="40.743006454546,40.743106454546,-74.267518212121,-74.267418212121"lat='40.7430564545455' lon='-74.2674682121212' display_name='264, Walton Avenue, Academy Heights, South Orange, Essex County, New Jersey, 07079, United States of America' class='place' type='house' importance='0.401'/></searchresults> 40.74305645 -74.26746821 400 South Orange Ave. South Orange, NJ <?xml version="1.0" encoding="UTF-8" ?> <searchresults timestamp='Thu, 12 May 16 14:09:51+0000'attribution='Data © OpenStreetMap contributors, ODbL 1.0. 
http://www.openstreetmap.org/copyright' querystring='400 South Orange Ave.,South Orange, NJ' polygon='false' exclude_place_ids='55320221,111676295,129751356,112011838' more_url='http://nominatim.openstreetmap.org/search.php?format=xml&amp;exclude_place_ids=55320221,11 1676295,129751356,112011838&amp;q=400+South+Orange+Ave.%2CSouth+Orange%2C+NJ'> <place place_id='55320221'osm_type='way' osm_id='11619078' place_rank='26' boundingbox="40.7457257,40.7463348,-74.2598793,-74.2580144" lat='40.746138' lon='-74.259232' display_name='South Orange Avenue, Academy Heights, South Orange, Essex County, New Jersey, 07079, United States of America' class='highway' type='primary' importance='0.6'/></searchresults> 40.746138 -74.259232 191 Parker Ave. Maplewood, NJ <?xml version="1.0" encoding="UTF-8" ?> <searchresults timestamp='Thu, 12 May 16 14:09:51+0000'attribution='Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright' querystring='191 Parker Ave.,Maplewood, NJ' polygon='false' exclude_place_ids='1618398991' more_url='http://nominatim.openstreetmap.org/search.php?format=xml&amp;exclude_place_ids=1618398991 &amp;q=191+Parker+Ave.%2CMaplewood%2C+NJ'> <place place_id='1618398991' place_rank='30' boundingbox="40.731509,40.731609,-74.2508735,-74.2507735" lat='40.731559' lon='-74.2508235' display_name='191, Parker Avenue, Maplewood, Essex County, New Jersey, 07040, United States of America' class='place' type='house' importance='0.311'/></searchresults> 40.731559 -74.2508235 6016 Morrow Dr. Brook Park, OH <?xml version="1.0" encoding="UTF-8" ?> <searchresults timestamp='Thu, 12 May 16 14:09:51+0000'attribution='Data © OpenStreetMap contributors, ODbL 1.0. 
http://www.openstreetmap.org/copyright' querystring='6016 Morrow Dr.,Brook Park, OH' polygon='false' exclude_place_ids='1831991659' more_url='http://nominatim.openstreetmap.org/search.php?format=xml&amp;exclude_place_ids=1831991659 &amp;q=6016+Morrow+Dr.%2CBrook+Park%2C+OH'> <place place_id='1831991659' place_rank='30' boundingbox="41.39944379397,41.39954379397,- 81.797624723618,-81.797524723618" lat='41.3994937939698' lon='-81.7975747236181' display_name='6016, Morrow Drive, Brook Park, Cuyahoga County, Ohio,44142, United States of America' class='place' type='house' importance='0.501'/></searchresults> 41.39949379 -81.79757472 Carrer del Duc, 4 2-1 Barcelona, Catalonia 08002 <?xml version="1.0" encoding="UTF-8" ?> <searchresults timestamp='Thu, 12 May 16 14:12:47+0000'attribution='Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright' querystring='Carrer del Duc, 4 2-1,Barcelona, Catalonia 08002'polygon='false' exclude_place_ids='50632377' more_url='http://nominatim.openstreetmap.org/search.php?format=xml&amp;exclude_place_ids=50632377&a mp;q=Carrer+del+Duc%2C+4+2-1%2CBarcelona%2C+Catalonia+08002'> <place place_id='50632377' osm_type='way' osm_id='4747020' place_rank='26' boundingbox="41.3838107,41.3848529,2.1728333,2.1736705" lat='41.3845007' lon='2.1731101' display_name='Carrer del Duc, el Gòtic, Ciutat Vella, Barcelona, BCN, CAT, 08002,España' class='highway' type='pedestrian' importance='0.51'/></searchresults> 41.3845007 2.1731101 2 Lafayette Street Fairhaven, MA 02719 <?xml version="1.0" encoding="UTF-8" ?> <searchresults timestamp='Thu, 12 May 16 14:12:30+0000'attribution='Data © OpenStreetMap contributors, ODbL 1.0. 
http://www.openstreetmap.org/copyright' querystring='2 Lafayette Street,Fairhaven, MA 02719' polygon='false' exclude_place_ids='1215278014' more_url='http://nominatim.openstreetmap.org/search.php?format=xml&amp;exclude_place_ids=1215278014 &amp;q=2+Lafayette+Street%2CFairhaven%2C+MA+02719'> <place place_id='1215278014'place_rank='30' boundingbox="41.646861,41.646961,-70.912316,-70.912216" lat='41.646911' lon='-70.912266' display_name='2, Lafayette Street, Fairhaven, Bristol County, Massachusetts, 02719, United States of America' class='place' type='house' importance='0.411'/></searchresults> 41.646911 -70.912266
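The extraction that FILTERXML performs with `//place/@lat` and `//place/@lon` can be sketched in Python with the standard library's XML parser. The response below is a trimmed-down copy of the first result above, kept offline on purpose; the function name is invented, and no request is made to the geocoding service.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the first response above (most attributes removed for brevity).
response = b"""<?xml version="1.0" encoding="UTF-8" ?>
<searchresults querystring='264 Walton Ave.,South Orange, NJ'>
<place place_id='1618056838' lat='40.7430564545455' lon='-74.2674682121212' class='place' type='house'/>
</searchresults>"""

def first_coordinates(xml_bytes):
    """Return (latitude, longitude) of the first place element, or None if there is none."""
    root = ET.fromstring(xml_bytes)
    place = root.find("place")
    if place is None:
        return None
    return float(place.get("lat")), float(place.get("lon"))

coords = first_coordinates(response)
empty = first_coordinates(b"<searchresults></searchresults>")
```

Handling the empty case matters because an address that cannot be geocoded comes back as a result set with no place elements.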
  • 56. For further study… Matching addresses to images ▪ Folder: GlensideDr ▪ Files: – Dykeman19Glenside.jpg – Fenrich23Glenside.jpg – Finlay8Glenside.jpg – \w+(\d+)([^.]+)\.jpg ▪ Automatically generate an “image” field?
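As a starting point for that further study, here is a hedged Python sketch of pulling the house number and street out of the file names. One thing to watch: with the pattern as written, the greedy `\w+` also swallows digits, so a two-digit number such as 19 is captured as 9. The variant below, which is a suggestion rather than the pattern from the talk, restricts the owner name to letters.

```python
import re

files = ["Dykeman19Glenside.jpg", "Fenrich23Glenside.jpg", "Finlay8Glenside.jpg"]

# Pattern as on the slide: \w+ is greedy and takes all but the last digit.
slide_pattern = re.compile(r"\w+(\d+)([^.]+)\.jpg")

# Suggested variant: letters only for the name, so the whole number is captured.
name_pattern = re.compile(r"([A-Za-z]+)(\d+)([^.]+)\.jpg")

slide_numbers = [slide_pattern.search(f).group(1) for f in files]
parsed = [name_pattern.fullmatch(f).groups() for f in files]
```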