The Swiss Army knife of programming languages, Perl, has many blades, but you need to know how to use them if you want accurate results.

Lab samples are often labeled with complex terms that bear no resemblance to human language and follow no existing standard. The alphanumeric character combinations used in these labels make downstream data analysis tricky, because accurately counting complex patterns is not a trivial task. This .pdf describes one accurate solution to that problem.

Published in: Data & Analytics

Transcript

  • 1. By Mehis Pold. Email: mehisp@hotmail.com. 07/06/2014

$REGEX++ if $_ =~ m/^$Symbol$/;

In the spirit of sharing tools: for those Perl monks tired of soccer and fireworks!

Large data files pose an interesting challenge: accurately counting terms, which often reflect the rich imagination that goes into labeling lab samples. These labels often contain weird combinations of non-alphanumeric characters, and they come in all lengths, from a single character up to... you name it! If the labels are embedded in a large file containing multiple categories of data, counting complex terms directly in the original file produces inaccurate results, because short unique labels often form parts of longer labels.

I tried several different Perl syntaxes to accurately count complex terms containing non-alphanumeric characters. The following approach produced accurate results:

1. Always keep the terms you want a precise count for in a separate file, one term per line.
2. Always negate the terms of interest, because:
   a. Complex terms containing some non-alphanumeric characters won't be counted at all if un-negated.
3. Use SINGLE QUOTES instead of double quotes to EXCLUSIVELY define the terms of interest, because double-quoted terms containing certain combinations of non-alphanumeric characters won't be counted otherwise.
   a. For example, a regex containing parentheses, like S02-3565(2-3), won't be counted if double-quoted.
   b. Even worse, a regex like S02-3565(2-3 will make a pattern-matching Perl script exit if double-quoted.
4. Anchor both the start and the end of your term of interest, using the following Perl pattern-matching syntax to process the input:

   $REGEX++ if $_ =~ m/^$Symbol$/;

   Here ^ and $ denote the start and end of each line in the input file. In other words, make Perl understand that shorter patterns are not parts of longer, more complex patterns.

The above approach is not very fast. But if you really want to be accurate, it will take you there! I have posted the full script with additional explanations on GitHub. Please see: https://github.com/mpold/Exo-Me/blob/master/EXCLUSIVE%20PATTERN%20MATCHING
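The counting steps described in the transcript can be sketched roughly as follows. This is a minimal illustration, not the author's posted script. It assumes that "negating" a term means escaping its regex metacharacters, done here with Perl's built-in quotemeta. The sub name count_exact_terms and the sample labels and lines are hypothetical, and in-memory arrays stand in for the term file and the data file.

```perl
use strict;
use warnings;

# Count exact, whole-line occurrences of each term.
# $terms: array ref of literal labels; $lines: array ref of input lines.
sub count_exact_terms {
    my ($terms, $lines) = @_;
    my %count = map { $_ => 0 } @$terms;
    for my $term (@$terms) {
        # Assumption: "negating" a term means escaping regex metacharacters
        # such as ( ) and - so that the label is matched literally.
        my $symbol = quotemeta $term;
        for my $line (@$lines) {
            # ^ and $ anchor the match to the whole line, so a short
            # label is not counted inside a longer one.
            $count{$term}++ if $line =~ m/^$symbol$/;
        }
    }
    return \%count;
}

# Hypothetical sample data; single quotes keep the labels literal.
my @terms = ('S02-3565', 'S02-3565(2-3)', 'S02-3565(2-3');
my @lines = (
    'S02-3565',
    'S02-3565(2-3)',
    'S02-3565(2-3',
    'S02-3565(2-3)',
    'XS02-3565',
);

my $count = count_exact_terms(\@terms, \@lines);
printf "%-15s %d\n", $_, $count->{$_} for @terms;
```

With this sample data the counts are 1, 2, and 1. The short label S02-3565 is not counted inside the longer labels or inside XS02-3565. Without the escaping step, the unbalanced parenthesis in S02-3565(2-3 would make the regex fail to compile and the script would die. The sketch scans every line once per term, which is in line with the note that the approach is not very fast.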
