The document describes methods for cleaning up and transforming unstructured data into a structured format like CSV. It discusses using regular expressions and find-and-replace functions to remove unnecessary HTML tags, consolidate columns, standardize formats and deal with line breaks. The goal is to take around 5,000 news stories from an unstructured format and prepare the data to be imported into a database for analysis.
9. Start with some low-hanging fruit
What exactly is “&shy;”?
It’s a “soft hyphen.”
Do a find-and-replace with any text editor to replace “&shy;” with “” (nothing).
3,777 replacements made.
10. And more…
What exactly is “&nbsp;”?
It’s a “non-breaking space.”
Do a find-and-replace with any text editor to replace “&nbsp;” with “ ” (just a space).
28,177 replacements made.
11. And while we’re on the subject…
Do a find-and-replace with any text editor to
replace a double space with a single space.
15,064 replacements made.
But try it again: 4,448 replacements.
Then 1,043…
Then 456…
Then 76…
Then 32…
Then 16…
Then 2…
Finally …. 0
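The repeated passes are needed because each pass only halves a long run of spaces. A minimal Python sketch of the same cleanup, assuming the text is already loaded into a string (the function names are made up for illustration): a regular expression can collapse each run in a single pass, and the loop mimics the manual method.

```python
import re


def collapse_spaces(text):
    """Replace every run of two or more spaces with one space, in one pass."""
    return re.sub(r" {2,}", " ", text)


def replace_until_clean(text):
    """Mimic the manual method: repeat the double-space replace until none remain."""
    passes = 0
    while "  " in text:
        text = text.replace("  ", " ")
        passes += 1
    return text, passes


sample = "a" + " " * 8 + "b"
print(collapse_spaces(sample))      # a b
print(replace_until_clean(sample))  # ('a b', 3)
```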
12. So what are all these &…; things anyway?
HTML character entities
Mathematical symbols
θ written as &theta; or &#952;
≈ written as &asymp; or &#8776;
Accented characters
ú written as &uacute; or &#250;
Special punctuation, curly quotes and apostrophes, dashes, actual ampersands
“ written as &ldquo; or &#8220;
Non-English characters
þ written as &thorn; or &#254;
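If you would rather decode the entities than hunt them down one at a time, Python's standard library can do it. This is a sketch, not the method used in the slides: `decode_entities` is a hypothetical helper that decodes named and numeric entities, then applies the two replacements from the earlier slides (non-breaking space to plain space, soft hyphen to nothing).

```python
import html


def decode_entities(text):
    """Decode HTML entities, then normalize non-breaking spaces and soft hyphens."""
    decoded = html.unescape(text)
    decoded = decoded.replace("\u00a0", " ")  # &nbsp; becomes an ordinary space
    decoded = decoded.replace("\u00ad", "")   # &shy; is dropped entirely
    return decoded


print(decode_entities("caf&eacute;&nbsp;au&shy;lait &amp; &#952;"))  # café aulait & θ
```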
13. While we’re cleaning house
Tabs
Maybe a search-and-replace
Maybe Word’s special expansion
Tabs are useless in HTML because they get compressed into spaces
No matter what kind, multiple spaces are treated as one.
Replace tabs with spaces … another 54,000 instances.
Then repeat our double- to single-space search: another 6,000+.
14. Now it gets more interesting
Matching something that’s not always the same.
15. What about these blocks?
Font tags are evil
▪ <font face=""Times"">She had a time of 26.27.33.
</font></div><div style=""margin: 0in 0in 0pt"">
</div><div style=""margin: 0in 0in 0pt""><font
face=""Times"">Freshman Eloisa Parades finished in 147th
place with a time of 26.41.7.</font></div><div
style=""margin: 0in 0in 0pt""> </div><div style=""margin:
0in 0in 0pt""><font face=""Times"">Freshman Hughnique
Rolle finished in 157th place with a time of 27.16.3.
</font></div><div style=""margin: 0in 0in 0pt"">
</div><div style=""margin: 0in 0in 0pt""><font
face=""Times"">Junior MadisonWrest finished in 161st
place with a time of 27.29.66.</font></div><div
style=""margin: 0in 0in 0pt""> </div><div style=""margin:
0in 0in 0pt
▪ <font color=""#221e1f"" size=""7""><font
color=""#221e1f"" size=""7""><span _fck_bookmark=""1""
style=""display: none""> </span></font></font>
▪ These don’t bring anything to the party.
▪ But every one is different!
▪ Fonts and faces, colors, sizes and more.
▪ Also – look at the <span> and <div> tags
16. Regular expressions
Essentially, pattern matching
Using a special set of meta-characters and wild cards
Found in Python, R, Perl, PHP and some common (and free) text editors, in closely related flavors based on the POSIX and Perl syntaxes
But not in Excel, sorry.
How do we find a <font…> tag when we don’t know what’s in it?
18. Optionals, multiples, wildcards
SPECIAL SYMBOLS:
? – zero or one instance (optional)
+ – one or more instances (greedy)
* – zero or more instances (greedy)
{0,2} – zero to two instances
▪ /an*y/
– Matches “any,” “canny”, “cay”, “annnnnnny”
▪ /an?y/
– Matches “many,” “may”
▪ /an+y/
– Matches “any,” “anymore” but not “may”
▪ /an{1,2}y/
– Matches “many,” “any”, “canny” but not “day”
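These examples are easy to check yourself. A small Python sketch, assuming "matches" means the pattern is found anywhere in the word (which is how a text editor's search behaves); `is_match` is just a helper name for this illustration.

```python
import re


def is_match(pattern, word):
    """True if the pattern is found anywhere inside the word."""
    return re.search(pattern, word) is not None


examples = [
    (r"an*y", ["any", "canny", "cay", "annnnnnny"]),
    (r"an?y", ["many", "may"]),
    (r"an+y", ["any", "anymore", "may"]),
    (r"an{1,2}y", ["many", "any", "canny", "day"]),
]

for pattern, words in examples:
    for word in words:
        print(pattern, word, is_match(pattern, word))
```

Only “may” with an+y and “day” with an{1,2}y come back False, because both patterns require at least one “n”.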
19. Character sets
SPECIAL SYMBOLS:
. – any character except a newline
\ – used to escape special characters
\. – a period
\\ – a backslash
[ – start of a character set
( – start of a capturing group
▪ /\d/
– Any digit
▪ /\w/
– Any word character, including digits and underscores
▪ /\s/
– Any whitespace character
▪ /\D/
– Any non-digit
▪ /\W/
– Any non-word character
▪ /\S/
– Any non-whitespace character
▪ /\n/
– New line
▪ /[any]/
– Matches either “a” or “n” or “y”
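A quick way to get a feel for these classes is to run them against short strings. The sketch below uses Python's `re` module; the sample strings are invented for the demonstration, apart from the race-time sentence borrowed from the font-tag slide.

```python
import re

line = "She had a time of 26.27.33."

digits = re.findall(r"\d+", line)              # runs of digits
words = re.findall(r"\w+", "story_id 42!")     # letters, digits and underscores
periods = re.findall(r"\.", line)              # escaped, so only literal periods
one_space = re.sub(r"\s+", " ", "a\t b\n c")   # any whitespace run becomes one space
letters = re.findall(r"[any]", "canny")        # a character set: a, n or y

print(digits)     # ['26', '27', '33']
print(words)      # ['story_id', '42']
print(periods)    # ['.', '.', '.']
print(one_space)  # a b c
print(letters)    # ['a', 'n', 'n', 'y']
```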
21. Matching a font tag
▪ <font color=""black"" face=""Tahoma"" size=""1""> is easy
▪ But it only finds 34 matches
▪ We have 680 font tags in this file
22. Build out a regular expression
Sublime Text shows you a preview of the matches as you add to the pattern.
▪ <font -- 691 matches
▪ <font\scolor= -- 395 matches
▪ <font\s[a-z]+= -- 680 matches
But we can’t stop with that
▪ <font\s[a-z]+=[a-z]+> -- nothing
▪ <font\s[a-z]+=['"]+[a-z]+['"]+> -- 184 matches (['"] is different from [‘”])
▪ <font\s[\w]+=['"]+[\w]+['"]+> -- 191 matches (\w expands the set to include digits and underscores)
23. Pressing onwards
There are multiple attributes to find
▪ <font(\s[\w]+=['"]+[\w]+['"]+)+> -- 311 matches
What are we missing? Tags like this one:
<font color=""#0066cc"" style=""list-style-type: none; list-style-
position: initial; list-style-image: initial; margin-top: 0px; margin-right:
0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-
right: 0px; padding-bottom: 0px; padding-left: 0px; "">
Pound signs, colons, semicolons and frankly who knows what else.
24. Sorry, this was just to show examples
There’s a far simpler way: the “negation” pattern.
▪ [^"] -- anything BUT a double quote
▪ [^89] -- anything BUT an 8 or 9
What we’ve been looking for is
▪ <font[^>]+> -- but it only finds 680!?!
▪ <font> -- 11 matches
▪ <font[^>]*> -- finds all 691
25. Replace them with “” (nothing)
▪ And then, get rid of </font>
▪ Just check – yup, 691 of them.
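Here is roughly what the negation pattern looks like outside a text editor. A sketch in Python, assuming the file content is in a string; `strip_font_tags` is a made-up name, and the sample is pieced together from the tags shown earlier (doubled quotes included, since that is how the export escapes them).

```python
import re

FONT_OPEN = re.compile(r"<font[^>]*>")  # "<font", anything that is not ">", then ">"


def strip_font_tags(text):
    """Remove opening and closing font tags; report how many of each were found."""
    text, opened = FONT_OPEN.subn("", text)
    closed = text.count("</font>")
    text = text.replace("</font>", "")
    return text, opened, closed


sample = (
    '<font color=""#0066cc"" style=""margin-top: 0px; padding-left: 0px; "">'
    "She had a time of 26.27.33.</font><font>x</font>"
)

print(strip_font_tags(sample))                 # ('She had a time of 26.27.33.x', 2, 2)
print(len(re.findall(r"<font[^>]+>", sample))) # 1, because + skips the bare <font>
```

Comparing the opened and closed counts is the same sanity check as the “yup, 691 of them” step.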
26. Get rid of other junk tags
<span> isn’t necessarily a junk tag, so it’s not so easy
▪ <span -- 31,263 instances
▪ <span class=“something”> -- 4,470; maybe keep those
▪ <span[^>]+(class=['"]+[^'"]+['"]+)[^>]+> -- replace with <span $1>?
▪ <span id=“something”> -- fortunately doesn’t occur
▪ </span> -- remove in some cases
27. Get rid of the junk attributes
▪ Inline styles
– \sstyle=['"]+[^'"]+['"]+ -- 9,392 matches
28. Get rid of the junk attributes
▪ Proprietary tags
– \sdata-scayt_word=['"]+[^'"]+['"]+ -- 23,643 matches
29. And so on
▪ data-scaytid – 23,646 matches
▪ …
Come to think of it, let’s just get rid of all the <span> tags.
But still, get rid of whatever other junk you find
▪ <o:p></o:p> -- 1,036 of these
▪ <p>\s+</p> (empty paragraphs) – 4,537 of these
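Strung together, the junk-removal steps might look like the sketch below. The style, scayt and empty-paragraph patterns come from the slides; the `</?span[^>]*>` pattern for dropping every span tag is an assumption (the slides don't show one), and the sample markup is invented.

```python
import re

STYLE_ATTR = re.compile(r"""\sstyle=['"]+[^'"]+['"]+""")
SCAYT_ATTR = re.compile(r"""\sdata-scayt_word=['"]+[^'"]+['"]+""")
SPAN_TAG = re.compile(r"</?span[^>]*>")  # assumed pattern: any opening or closing span
EMPTY_P = re.compile(r"<p>\s+</p>")


def remove_junk(text):
    """Apply the junk-removal steps in order and return the cleaned text."""
    text = STYLE_ATTR.sub("", text)
    text = SCAYT_ATTR.sub("", text)
    text = SPAN_TAG.sub("", text)
    text = text.replace("<o:p></o:p>", "")
    text = EMPTY_P.sub("", text)
    return text


sample = (
    '<p style=""margin: 0in""><span data-scayt_word=""teh"" data-scaytid=""1"">'
    "teh</span> story</p><p> </p><o:p></o:p>"
)
print(remove_junk(sample))  # <p>teh story</p>
```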
31. Dealing with line breaks
▪ The “\n” character is special
▪ It matches the line break at the end of a line rather than a visible character
▪ Closely related to “\r” (the carriage return); Windows files end each line with “\r\n”
▪ You can search for it.
– \n([^\d]) replaced with $1 yields 12,989 matches
▪ But we want to run it multiple times … 10,000 more matches.
▪ We’ve gone from 16.8 MB to 14.4 MB!
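The search has to be repeated because two line breaks in a row only lose one per pass. A sketch of the loop in Python, assuming (as the pattern implies) that every real record starts with a digit; the function name and the two-record sample are invented.

```python
import re

BREAK_INSIDE_RECORD = re.compile(r"\n([^\d])")  # a line break NOT followed by a digit


def join_broken_lines(text):
    """Repeat the replace until nothing matches; return the text and passes that changed it."""
    passes = 0
    while True:
        text, count = BREAK_INSIDE_RECORD.subn(r"\1", text)
        if count == 0:
            return text, passes
        passes += 1


sample = '1,"a\n\nb\nc"\n2,"d"\n'
print(join_broken_lines(sample))  # ('1,"abc"\n2,"d"\n', 2)
```

The break before “2” survives because the next character is a digit, which is what keeps the records on separate lines.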
32. Are we there yet?
Let’s try opening it in Excel to see where we are.
34. How did we do?
Sort the spreadsheet by the last column…
36. Fortunately it’s just the one
▪ We can delete all the junk Microsoft Word code
▪ Re-save it and try again
▪ If there were more, it would be easy enough to write regular expressions to track them down.
38. Here’s the line
… prevention metho",,,,,,,,,,,,,,,,,,,,s," """"The""",1,9/17/2009 0:00 …
Who knows where those commas came from.
The most efficient way is to just edit it manually.
39. Different formats
▪ The data export put the “tags” into curly braces:
▪ "{tag1},{tag2}"
▪ We need to get them out somehow.
▪ Let’s make an assumption:
– Replace ,"{ with ," (676 matches)
– Replace },{ with , (1,060 matches)
– Replace }", with ", (676 matches!)
▪ That will work! Opening into Excel, a quick Data::Sort shows good results.
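The three replacements are plain literal swaps, so no regular expression is needed. A sketch in Python, with an invented sample line; it relies on the same assumption as the slide, namely that braces only ever appear around tags.

```python
def unwrap_tags(line):
    """Turn "{tag1},{tag2}" into "tag1,tag2" using the three literal replacements."""
    return (
        line.replace(',"{', ',"')   # opening brace at the start of the field
            .replace("},{", ",")    # braces between tags
            .replace('}",', '",')   # closing brace at the end of the field
    )


print(unwrap_tags('5,"A title","{news},{sports}",1'))  # 5,"A title","news,sports",1
```

The matching counts for the first and third replacement (676 each) are the reassurance that every opening brace found its closing brace.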
40. Some other easy things
▪ The story ID won’t exist
▪ Change a number at the beginning of a line to the word “post”
▪ ^\d+
▪ Change the column headers
▪ Change the status from “1” to “publish”
▪ Get rid of the doubled quotation marks now.
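The story-ID swap is a one-liner once the “^” anchor is told to work on every line. A sketch in Python with an invented two-line sample; `ids_to_post` is a hypothetical name.

```python
import re


def ids_to_post(text):
    """Replace the number at the start of each line with the word "post"."""
    return re.sub(r"^\d+", "post", text, flags=re.MULTILINE)


print(ids_to_post('123,"A"\n45,"B 7"\n'))
```

Digits elsewhere in the line (the “7” here) are left alone because the pattern is anchored to the start of the line.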
41. So far so good. Now we can remove the columns we don’t need.
43. Some other harder things
▪ Consolidate the thumbnail columns
▪ Large – Medium – Small
▪ You want the first one, the largest one, that has a value.
▪ Excel is kind of great for this
▪ Create a new column
▪ Paste in
=IF(ISNA(VLOOKUP("*",I3:K3,1,FALSE)),"",VLOOKUP("*",I3:K3,1,FALSE))
▪ Drag it down to the bottom…
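If the data is already in a script rather than a spreadsheet, the same “first one that has a value” rule is a short loop. This is a Python sketch of the idea, not a translation of the Excel formula; the function name and file names are invented.

```python
def first_thumbnail(large, medium, small):
    """Return the first non-blank value, checking large, then medium, then small."""
    for value in (large, medium, small):
        if value and value.strip():
            return value
    return ""


print(first_thumbnail("", "medium.jpg", "small.jpg"))  # medium.jpg
print(first_thumbnail("", " ", ""))                    # prints an empty line
```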
44. Be sure to capture the values
▪ If you try to delete the three columns you’ll get a reference error
▪ Make a new column
▪ Copy the previous column
▪ Use the Paste::Values function
▪ Then delete the small, medium and large columns, plus your formula.
▪ There are a few more steps you could take, but not now…
45. Where Excel Falls Down
▪ Would you be surprised to learn that Excel doesn’t follow the standards for CSV files?
▪ It gives you this:
– post,HRL alters housing fee after University reactions,"<p>The Department of
Housing and Residence Life …
▪ When what you need is this:
– "post","Poetry-in-the-Round features literary legend","<p>As a special
presentation by Seton Hall …
▪ Open Office Calc is the tool for this job.
– Open your saved file into it, and export it from there as a CSV
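Another way to get every field quoted, if a script is an option, is Python's `csv` module with `QUOTE_ALL`. This is an alternative sketch, not the Open Office route described above; the function name and the sample row are invented.

```python
import csv
import io


def to_fully_quoted_csv(rows):
    """Write rows as CSV text with every field wrapped in double quotes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


print(to_fully_quoted_csv([["post", "A title", "<p>Body</p>"]]))
# "post","A title","<p>Body</p>"
```

Quotes inside a field are doubled automatically, which is the same escaping seen in the exported file.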
48. What to capture?
▪ Street as the category
▪ Address as the “title”
▪ Block #
▪ Lot #
▪ Outbuildings – numbers and types
▪ Descriptions
▪ Styles?
▪ Images?
49. Some basic transformations
▪ Preparing for automatic reading
▪ For this step I used to like the Windows Store app “CodeWriter”
▪ It handled whitespaces, especially new lines, much better
▪ But it’s pretty unstable – save your work often!
▪ Newer versions of Sublime Text work well.
50. ▪ \s*Key\s*\n\s*Outbuildings:\s*(.*) -> \n$1\n
▪ \s*Non-Contributing\s*\n\s*Outbuildings:\s*(.*) -> \n\n$1
▪ \s*Contributing\s*\n\s*Outbuildings:\s*(.*) -> \n$1\n\n
▪ \s*Block\s(\d*)\s*Lot\s(.*) -> \n$1\n$2
These change this:
470 Berkeley Avenue Block 506 Lot 1
Key
Outbuildings: 1 stylistically similar detached carriage house (C)
Into this:
470 Berkeley Avenue
506
1
1 stylistically similar detached carriage house (C)
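The four rules can be run in sequence from a script as well. A sketch in Python, using the patterns above with `$1` written as `\1` (Python's spelling); the order matters, since “Contributing” would otherwise also match inside “Non-Contributing”. The function name is invented.

```python
import re

# (pattern, replacement) pairs, in the order they should be applied
RULES = [
    (r"\s*Key\s*\n\s*Outbuildings:\s*(.*)", r"\n\1\n"),
    (r"\s*Non-Contributing\s*\n\s*Outbuildings:\s*(.*)", r"\n\n\1"),
    (r"\s*Contributing\s*\n\s*Outbuildings:\s*(.*)", r"\n\1\n\n"),
    (r"\s*Block\s(\d*)\s*Lot\s(.*)", r"\n\1\n\2"),
]


def one_field_per_line(text):
    """Apply each rule in turn so every captured value lands on its own line."""
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text


entry = (
    "470 Berkeley Avenue Block 506 Lot 1\n"
    "Key\n"
    "Outbuildings: 1 stylistically similar detached carriage house (C)"
)
print(one_field_per_line(entry))
```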
51. ▪ (\d*\s([A-z]*\s[A-z]*)\sis.*) -> \n$1\n_ (watch that _!)
Transforms:
470 Berkeley Avenue is a 2 1/2 story, 5 bay, rectangular plan, brick, Neoclassical-influenced,
residential building. Constructed c. 1920, the slate-clad, side gambrel roofed house is articulated by
a colossal order, fluted Ionic column-supported full front porch with mutule-supported entablature
and balustrade above. Three round-arched, pilastered dormers with lancet upper sashes ornament
the slate roofline. The fenestration on the facade consists of 9/1 windows with brick lintels featuring
stone keystones and sills. The projecting enclosed portico features a segmentally arched brick
surround, with a leaded fanlight and matching sidelights. Above the portico entablature is a
wrought iron balcony. At one side of the house is a one story, set back sun porch, and at the back of
the house, is a cross gambrel wing. This Neoclassical house is located at the corner of Montrose and
Berkeley Avenues, in an estate setting.
To:
470 Berkeley Avenue is a 2 1/2 story, 5 bay, rectangular plan, brick, Neoclassical-influenced,
residential building. Constructed c. 1920, the slate-clad, side gambrel roofed house is articulated by
a colossal order, fluted Ionic column-supported full front porch with mutule-supported entablature
and balustrade above. Three round-arched, pilastered dormers with lancet upper sashes ornament
the slate roofline. The fenestration on the facade consists of 9/1 windows with brick lintels featuring
stone keystones and sills. The projecting enclosed portico features a segmentally arched brick
surround, with a leaded fanlight and matching sidelights. Above the portico entablature is a
wrought iron balcony. At one side of the house is a one story, set back sun porch, and at the back of
the house, is a cross gambrel wing. This Neoclassical house is located at the corner of Montrose and
Berkeley Avenues, in an estate setting.
Berkeley Avenue
post
52. ▪ \n -> "\n"
– Adds quotation marks to the beginning and end of each line
▪ \n -> ,
– This takes out all the line feeds, and makes a confusing mess
▪ ,"_","", -> ,""\n
– Makes sense of it again
▪ "post","","" -> "post",""
– Clears an empty field at the end of the line
53. A useful comma-quote delimited entry
▪ "","BERKELEY AVENUE","","470 Berkeley Avenue","506","1","1 stylistically
similar detached carriage house (C)","","470 Berkeley Avenue is a 2 1/2
story, 5 bay, rectangular plan, brick, Neoclassical-influenced, residential
building. Constructed c. 1920, the slate-clad, side gambrel roofed house is
articulated by a colossal order, fluted Ionic column-supported full front
porch with mutule-supported entablature and balustrade above. Three
round-arched, pilastered dormers with lancet upper sashes ornament the
slate roofline. The fenestration on the facade consists of 9/1 windows with
brick lintels featuring stone keystones and sills. The projecting enclosed
portico features a segmentally arched brick surround, with a leaded fanlight
and matching sidelights. Above the portico entablature is a wrought iron
balcony. At one side of the house is a one story, set back sun porch, and at
the back of the house, is a cross gambrel wing. This Neoclassical house is
located at the corner of Montrose and Berkeley Avenues, in an estate
setting. ","Berkeley Avenue","post",""
54. You could do something like this:
"(\d*)","Block\s(\d*)\nLot\s(\d*)\n([^:]*):\s*(.*)
-> a ready-to-run MySQL query ->
INSERT INTO wp_postmeta (post_id,meta_key,meta_value) VALUES ($1,'Block','$2');\nINSERT INTO wp_postmeta (post_id,meta_key,meta_value) VALUES ($1,'Lot','$3');\nINSERT INTO wp_postmeta (post_id,meta_key,meta_value) VALUES ($1,'$4','$5');\n
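The same idea can be scripted, which also leaves room to escape apostrophes in the values. A rough Python sketch under several assumptions: the record layout in the sample is invented to fit the pattern, the ID is required to be at least one digit (`\d+` rather than `\d*`), and `sql_quote` is a hypothetical helper that only doubles single quotes. For anything beyond a one-off import, parameterized queries are the safer route.

```python
import re

RECORD = re.compile(r'"(\d+)","Block\s(\d*)\nLot\s(\d*)\n([^:]*):\s*(.*)')


def sql_quote(value):
    """Wrap a value in single quotes, doubling any single quotes inside it."""
    return "'" + value.replace("'", "''") + "'"


def postmeta_inserts(record):
    """Build one INSERT statement per captured field; empty list if nothing matches."""
    match = RECORD.search(record)
    if match is None:
        return []
    post_id, block, lot, key, value = match.groups()
    pairs = [("Block", block), ("Lot", lot), (key.strip(), value.strip())]
    return [
        "INSERT INTO wp_postmeta (post_id,meta_key,meta_value) "
        f"VALUES ({int(post_id)},{sql_quote(k)},{sql_quote(v)});"
        for k, v in pairs
    ]


sample = '"12","Block 506\nLot 1\nOutbuildings: 1 carriage house (C)'
for statement in postmeta_inserts(sample):
    print(statement)
```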