Peter Scott, author of the O'Reilly School of Technology's Perl Programming Certificate series, talks about how to deal with "legacy" Perl code - written by someone else, or maybe even yourself when you were younger and less wise.
Thank you for coming. Ask questions before break so I have time to research.
Code written by someone else. Or by you, long enough ago. Say, a couple of weeks. Why is Perl so susceptible?
Perl’s motto is also a curse. Perl is like English: if you have William F. Buckley Jr., you also have Homer Simpson.
A 100-line Perl script may not get the same attention to coding standards, documentation, or other methodology as a 1,000-line C program, even though it deserves it just as much.
Slow - how often is that really a problem? Odds are that if it’s too slow in Perl it’s going to be too slow in any language. Or, it was written using a poor algorithm to begin with, which is something that likewise you can do in any language.
Books like Perl Best Practices (PBP) help with this problem by telling you what not to do as much as what to do. Brian Kernighan: Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.
How good a programmer are/were they? How fluent in Perl?
Too many people skip this step and assume they know what they should do. Are original optimization goals still valid?
Think about testing now. Test with same tools as for Test First. Tests inline make code harder to read though.
Sounds obvious, but if tests aren’t fun to write, haven’t followed. Look for more Test:: modules on CPAN.
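A minimal sketch of what such a test file might look like with Test::More, which ships with Perl. The routine normalize_whitespace() is a hypothetical stand-in for whatever legacy code you inherited; the idea is to pin down what it does today before you change it.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical legacy routine; in real life this would live in the
# inherited code and be loaded with require or use.
sub normalize_whitespace {
    my ($text) = @_;
    $text =~ s/\s+/ /g;     # collapse every run of whitespace to one space
    $text =~ s/^ | $//g;    # trim a leading or trailing space
    return $text;
}

# Record the current behavior so later edits cannot change it unnoticed.
is( normalize_whitespace("  a   b \n"), 'a b', 'collapses runs and trims' );
is( normalize_whitespace(''),           '',    'empty string stays empty' );

done_testing();
```

Keeping tests in their own .t files, rather than inline, leaves the code itself easier to read.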
Could do same as before - save HTML to file and compare. This is better.
Explain how works.
Pretty to look at means easy on the eyes - not in the usual sense but easy to read. Not formatted with Acme::EyeDrops to look like Mona Lisa. Consistent layout is easy with editors - hit TAB in Emacs.
Common indentation style - none at all.
Not quite my style - braces in K&R style to use fewer lines.
Ok… here it is using my style. So what if it takes up more space. How many people prefer this style, anyway?
You probably want to know which parts of the code are being executed. You might want to know how fast those parts are being executed.
Analysis you do with eyeballs. Example app: 2.5 hrs/1GB -> 20min/12MB, code halved, using hashes. Regexes - except may be optimized for performance. So profile.
If you're working in a team, once you've settled on indentation style you ought to settle commenting style. Here you see the result of paying someone by the line. Note that
this is the only thing here that actually does anything. Dealing with the Documentation Hound is an insidious problem because of course many comments *are* useful and on the whole, people don't use enough of them. It’s like talking to a random group of people and saying, “Y’all need to eat more.” There might be a couple of anorexics in the audience, but for most people, that’s not the right message. But the point is that everything in the program, whether it's code or comments, should contribute towards getting the job done or understanding how it works, and anything that doesn't do that is getting in the way of the stuff that does. So some of these comments would be helpful, but the majority of them are just making it harder for you to read the code.
What can you do about it? Firstly, get rid of the ASCII art. If you want a gap, use a blank line. -click- Next, if the comments relate to how to call public methods or functions, then put them in documentation so anyone can see a proper interface document just by running perldoc. But if you're the kind of project that uses comments about the signature of a function, those are fine. Personally, I leave that information to POD. -click- The whole point of this pruning exercise is to expose the code itself. It was pretty obvious in this example that there was no point at all in having a function for adding one to a number since it would be shorter and clearer just to use ++ in line. Usually, you're not going to be able to inline a subroutine, but if you're taking over maintenance of a coding horror this is the first phase in a rewriting campaign and the next step is line by line editing that may shorten the code still further. Ultimately, you'd like every function and method to be no longer than one screen of your editor. Look at Brian Ingerson. His methods are so short that just the line "my $self = shift" accounted for 10% of their line counts, so he wrote a module to put it in automatically. -click- The comments that you really want are any that answer the question, "Why?" That is the whole point of all documentation: Why is this line of code so weird, why does this function have a bizarre name, why does my database not follow normalization rules, why am I using substr() and index() instead of a simpler regular expression, why, why, why? If you come back to a program you wrote six months ago without perfect documentation, I guarantee that the first question you ask when you look at it again is going to begin with "Why". Anything that answers that question needs to be preserved, enshrined, and embalmed.
I'm not saying you should get rid of all that information about who wrote the program and when, and what changes they made; just that it belongs in the right place. If the program is something you're using in your own environment then you can just use a source code control system. Everything from RCS to Subversion will keep track of all that stuff for you. That's where you'd expect to find that information if you were looking for it, so put it there. If you're making a distribution to send to someone else - such as CPAN - then you can stick that information in a README and a change log file. Okay, moving on to the next example…
Ok, next pop quiz: what does this do? (beat) Beep - time's up.
You just look at that and go, "Oh, of course, it replaces groups of consecutive non-digits with a single space."
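The slide's code is not reproduced in these notes, so here is only a sketch of the operation as described: two ways of replacing each group of consecutive non-digits with a single space. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = 'tel: 555-1234 x9';   # made-up sample input

# Substitution: each run of one or more non-digits becomes a single space.
(my $by_regex = $text) =~ s/\D+/ /g;

# Transliteration: c complements the 0-9 list, s squeezes each run of
# translated characters down to one.
(my $by_tr = $text) =~ tr/0-9/ /cs;

print "[$by_regex]\n[$by_tr]\n";   # both print [ 555 1234 9]
```

Either form is easier to trust than a character class with every digit typed out by hand.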
Okay, here we have something I copied here verbatim from where I saw it so you could see how the author daringly defied the normal rules of layout for the sake of conciseness, alignment, and general prettiness, and it certainly is pretty. Too bad it's so darned repetitive. Repetitive code is a coding horror that's like a leech sucking on your brain, because part of your brain is going, "okay, skip all these, they're all the same," and another, wiser part of your brain is going, "Wait a minute, I have to check to make sure they *are* all the same." So is this like the regex example - is one of the digits missing? No! But the more cautious among you probably wondered, "Gee, should it include zero in the list? Well, it's a month, and they're usually numbered starting at one, so…" Too much stuff for your brain to think about! You need it for more important things, like how to explain to your wife why you just bought that 50” TV.
So repetitive code is a giant wake-up call that there's a better way of doing something, and of course, here it is. Speaking of repetitive code,
here's another example that's all too common… again, we have someone who's just not impatient enough. Either they like typing, or they like using their editor macros, either way, you've got better things to do than stare at repetitive code making sure all of those things are really the same. Trust me, it's happening, even if you don't realize it - it's the subconscious programmer's brain at work, the same thing that tells you when it's time to eat. Give it something better to do.
This kind of task is exactly why here docs were invented. You just type what you want with the line breaks the way you want them and you don't have to worry about quoting delimiters. "Oh, but Peter, I don't like the text being up against the left margin instead of indented to make it look separate." Well then,
indent it, if that's what you want to do. It's just a regex away. There are umpteen ways of doing this, and if you want to indent the heredoc terminator as well, <click> you can do that too if you quote it right to begin with, or you can use one of a couple of nifty source code filters that do the job for you. The point is, don't be satisfied with repetitive code.
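One of the umpteen ways, sketched under the assumption that the text is indented by exactly four spaces. The sub name usage_text and the usage message are made up for illustration.

```perl
use strict;
use warnings;

sub usage_text {
    # The text is indented to match the code; the substitution strips
    # four leading spaces from every line before the text is used.
    (my $text = <<'END_USAGE') =~ s/^ {4}//gm;
    Usage: report [options] file
      -v  verbose output
      -h  show this help
END_USAGE
    return $text;
}

print usage_text();
```

Perl 5.26 and later also accept `<<~`, which removes the common indentation for you and lets the terminator be indented too.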
Especially something heinous like this, which always drives me crazy. Folks, when I'm reading a Perl program, I'm in Perl reading mode, and maybe sometimes in regex reading mode. I'm not in HTML reading mode, and I don't want to be. -click- I prefer not to look at the HTML at all, and there are a lot of ways to just get it out into a separate file or files. If you put it in a separate file then you can use an HTML editor on it - no HTML editor is going to be able to validate HTML that’s intertwingled with Perl like this. My favorite HTML editor is - someone else. Because inevitably you get all kinds of requests to make this font two points bigger, or make this background a little bit pinker, and it’s not my kind of thing. But there are people who get off on that, and I’m only too happy to let them edit HTML without having to look at my code. HTML and Perl are like dogs and cats; they just shouldn't breed together. It's bad enough when the HTML is in a heredoc, but when they're mixed together on the same line like this over and over, it's nuts. If you really want to put the HTML in the same file - say because you are the HTML editor and you don’t want anyone else touching your HTML - then use Inline::Files so you can have it in a nice clearly marked separate section where it can't escape and start molesting the Perl code.
Does this look familiar to anyone? It should, because it is pasted verbatim from the perldoc documentation for localtime. You see this all the time, but then inevitably <click> they only use a couple of the variables. And that part of your brain that acts like a little Perl interpreter is left wondering what those other variables are for and when they’re going to get used.
Just declare the ones you’re going to use.
And if you don’t like the numbers 4 and 3 here,
well you don't even have to do that either, if you don't mind using a module that's come with Perl since at least version 5.004. The anal-retentive part of me is forced to point out that there’s a small bug here in that if in between the first and the second call to localtime the system clock passes midnight at the end of the last day of the month, then the results will be inconsistent. If that should happen to bother you, then by all means use the previous example with the list slice and symbolic constants set to 4 and 3 so you know what they mean. Next…
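A sketch of both approaches, since the slides themselves are not in these notes. The constant names MDAY and MON are my own labels for list positions 3 and 4; the object interface comes from the Time::localtime module, loaded here with an empty import list so the built-in localtime stays available for the slice.

```perl
use strict;
use warnings;
use Time::localtime ();   # empty import list leaves the built-in localtime alone

# Hypothetical names for the list positions given in "perldoc -f localtime".
use constant { MDAY => 3, MON => 4 };

# One call, one list slice: both values come from the same instant.
my ($mon, $mday) = (localtime)[MON, MDAY];
printf "list slice: %02d/%02d\n", $mon + 1, $mday;   # month is zero-based

# Object interface: keep the object from a single call, then ask it twice.
my $now = Time::localtime::localtime();
printf "by name:    %02d/%02d\n", $now->mon + 1, $now->mday;
```

Holding on to the object from one call, instead of calling localtime twice, is one way around the midnight inconsistency mentioned above.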
How many people have seen a program that starts with a laundry list like this? How many of you liked it? This has already filled up the entire screen and we've hardly gotten to any executable code yet. Of course, the Documentation Hound cures this problem by declaring each one on a separate line with its own comment block preceding it saying what it's for, when it first appeared in the program, what values it's allowed to take, and other stuff that makes this ten times longer than it already is. So what do we do about it?
First and most important, variable declarations should go as late as possible. Remember, part of the reader's brain is going to be occupied with remembering every variable for as long as it's in scope, so push them to the innermost scope levels possible. The one exception to that is the variable that's used for specifying some global configuration setting, like the path to an important directory or the value of Avogadro's constant, or something. The main reason for putting those at the top of the code is so that a lesser programmer who comes in wanting to change one of those things in your program will find it quickly and not go spelunking through the rest of your code where they might break something. -click- You can help keep variables in the innermost scope possible by using the fact that 'my' can appear before a variable just about anywhere in Perl and in particular, you can declare a loop variable at the point you say it's a loop variable so that it'll go out of scope as soon as the loop finishes. -click- In fact, you can put 'my' in some pretty unlikely places. The second argument to the getopts() function exported by Getopt::Std is a reference to a hash to store the options specified by the user on the command line. You can even put the 'my' inside an enreferencing backslash there to save on having to declare it in a separate line. -click- This kind of coding horror is usually a symptom of another kind.
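A small sketch of both tricks. The option letters and the stand-in command line are assumptions made so the example runs by itself.

```perl
use strict;
use warnings;
use Getopt::Std;

# Stand-in command line (an assumption, so the sketch is self-contained).
local @ARGV = ('-v', '-o', 'report.txt', 'a.log', 'b.log');

# 'my' inside the reference-taking backslash declares %opt right where
# getopts() fills it in.
getopts('vo:', \my %opt) or die "usage: $0 [-v] [-o file] logs...\n";

my @seen;
for my $file (@ARGV) {   # $file exists only inside this loop
    push @seen, $file;
}

print "verbose\n" if $opt{v};
print "output goes to $opt{o}, inputs: @seen\n";
```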
It's not really possible to show an example of this on the screen so you'll have to use your imagination. -click- But I'm sure it's familiar to many of you. (How many people know what JCL is?) -click- This is the usual excuse for how the program got that way. -click- And this is the inevitable defense when you point out that you need a machete to hack through the code.
So what do you do? It's not easy. You have to identify areas where variables are used only over a short part of the program, and then see whether that means that that part can be taken out into a subroutine. -click- If you use Eclipse - I'm afraid I haven't yet - then you might want to use the Devel::Refactor module from CPAN, which can figure out what the subroutine for any given chunk of code should be. That doesn't necessarily mean it's going to be a good choice, mind you. -click- Because Monolithic Madness seldom uses strict, that means you're going to have to add it at some point if you want to stay sane. Or become sane, whichever applies to you. You can take a piecemeal approach to this if you want; first, turn on strictness for the whole program, but then embed the rest of the whole program in a naked block and immediately turn strictness off, which of course is going to be a giant no-op. But now you can start pulling code out of that no strictness block at the beginning and the end and fixing the code to be strict compliant as you go, and keep shrinking the size of the inner no strictness block.
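Here is roughly what the piecemeal approach looks like partway through, as a sketch. The variables $total and $count are invented; the point is the shape: strict on at the top, a shrinking `no strict` block around whatever has not been converted yet.

```perl
use strict;
use warnings;

# Already pulled out of the block and fixed: declared with 'my',
# so it compiles under strict.
my $total = 0;

{
    no strict;      # the shrinking island of unconverted legacy code
    no warnings;

    $count = 3;     # undeclared package variable, typical of such code
    $total += $count;
}

# Code after the block is strict-compliant again.
print "total: $total\n";   # prints "total: 3"
```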
I have taken some creative license with the spacing in these examples for the sake of some entertainment value. So this first example is code written by someone whose favorite language is - what?
Yep, C. This just prepends a root path to a bunch of relative paths. Of course, the C programmer is probably wondering at this point how Perl works without a malloc or realloc function.
Here's how a native Perl speaker would do that same task.
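The slide is not included in these notes; a native-speaker version might look something like this sketch, with an invented root directory and file list.

```perl
use strict;
use warnings;

my $root     = '/var/www';                                  # invented root path
my @relative = qw(index.html css/site.css img/logo.png);    # invented relative paths

# No counters, no buffer sizes: map builds the new list in one expression.
my @absolute = map { "$root/$_" } @relative;

print "$_\n" for @absolute;
```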
This next example is what might come from someone who was more used to… what?
Yep, FORTRAN. This prints a matrix of numbers to a file. I've tried to pick examples here that are sort of in keeping with the kinds of things those languages are generally used for. So a FORTRAN programmer when they want to print something might start looking for something called FORMAT and figure that's what they need to use for printing formatted values. And… yeessss, technically that's correct, but of course that's not how we'd really do that most of the time.
And here's the native Perl solution, or at least one way of doing it.
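One way it could go, sketched with made-up data. An in-memory filehandle stands in for the real output file so the example leaves nothing behind on disk, and the four-character column width is an arbitrary choice.

```perl
use strict;
use warnings;

my @matrix = ( [ 1, 2, 3 ], [ 4, 5, 6 ], [ 7, 8, 9 ] );   # made-up data

# An in-memory filehandle stands in for a real file in this sketch.
open my $out, '>', \my $report or die "Cannot open in-memory file: $!";

for my $row (@matrix) {
    # Build one "%4d" per column, so the row length is never hard-coded.
    my $format = ( '%4d' x @$row ) . "\n";
    printf {$out} $format, @$row;
}
close $out or die "Cannot close in-memory file: $!";

print $report;
```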
Okay, this should be a fairly easy one… this is Perl spoken by a what programmer?
That's right. This being a COBOL program, it just adds two numbers together, but because they're numbers representing money, that's okay.
And here's the native Perl solution. It's not really changed much because the original COBOL implies strict fixed-length formatted records, and we use unpack() for handling those in Perl. Of course, we're more used to writing applications that handle less structured input formats, or more modern structured input formats like XML.
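A sketch of the unpack() approach on an invented record layout: a 10-character item name followed by two 8-digit amounts in cents. The layout and field names are assumptions, not the ones from the slide.

```perl
use strict;
use warnings;

# Invented fixed-length record: item (10 chars), price (8 digits), tax (8 digits).
my $record = 'WIDGET    0001250000000350';

# 'A' fields are space-padded text; unpack strips the trailing spaces.
my ( $item, $price_cents, $tax_cents ) = unpack 'A10 A8 A8', $record;

# Keep money in whole cents and convert only when displaying.
my $total_cents = $price_cents + $tax_cents;

printf "%s: %.2f\n", $item, $total_cents / 100;   # prints "WIDGET: 128.50"
```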
Okay, this last example should be pretty easy, you've got the hang of this by now…
Right, BASIC. Okay, so this is the kind of program I used to write when I was learning BASIC: guess what number someone's thinking of, the hard way.
And here's a native Perl solution. The astute reader will note that since the original program never actually *did* anything with the right answer, in this version, it doesn't even stay in scope once we found it. CHANGE TO USE IO::Prompt
Okay, something a little more serious now. I'm going to assume that all of you have seen this code by now, because it's multiplying around the internet like a bad virus. -click- Of course, this is the code to parse CGI inputs - well, the beginning of the code, anyway, I couldn't bring myself to finish pasting it. This is just one instance of it I found somewhere - it wasn't hard, I just stuck out my foot and this tripped over it - and it's better than many I've seen, but still, isn't
*this* a lot easier to read? Not to mention the fact that it's come bundled with the Perl core since 5.004, and it gets the decoding right, unlike the previous code and virtually every other home-grown solution floating out there. It's just a lot harder than most people think to handle all the possibilities. Of course, it doesn't end with CGI.pm. There are still zillions of people parsing XML and HTML with simple regular expressions, people calling out to programs like 'date' and 'cal' instead of using a Date:: module, and people calling database programs like sqlplus instead of using DBI.
Mixed abstraction levels refers to code of different complexity in a method - it should all look the same level of abstraction. Time dependencies means the caller must call methods in a certain order - this pushes some of the responsibilities of the class onto the caller - create methods managing correct calling sequence
Restructuring is easier the less code you have. I recently compressed subroutine from 200 lines to 12. Ideally all blocks fit on screen (mine is 50 lines high). Code that’s not covered, write test for or delete (have got revision control).
Typical. Bug in this line, what is it? Quote missing after $zx4. Another bug. Says $x2 in the values line where the corresponding column name is x3. Tim wouldn’t settle - read DBI, discover placeholders
Still has a bug. $y2 and $y3 are reversed in the argument list. That's not the only bug. There's one too many question marks in the values line.
Maintaining associations between two sets is a hash. Now every time you need to insert values in a table, just call the subroutine.
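A sketch of what such a subroutine might look like. The name build_insert, the table, and the column names are all invented. It only builds the statement and the bind list, so it runs without a database; handing the result to DBI is shown in a comment.

```perl
use strict;
use warnings;

# Hypothetical helper: the hash keeps each column name next to its value,
# so the two lists cannot drift out of step.
sub build_insert {
    my ( $table, $row ) = @_;
    my @columns = sort keys %$row;    # sorted only to make the SQL predictable
    my $sql = sprintf 'INSERT INTO %s (%s) VALUES (%s)',
        $table,
        join( ', ', @columns ),
        join( ', ', ('?') x @columns );    # exactly one placeholder per column
    return ( $sql, @{$row}{@columns} );    # values in the same order as the columns
}

my ( $sql, @bind ) = build_insert( 'orders', { id => 7, item => 'widget', qty => 2 } );
print "$sql\n";    # INSERT INTO orders (id, item, qty) VALUES (?, ?, ?)

# With a connected DBI handle this would be run as:
#   $dbh->do( $sql, undef, @bind );
```

Placeholders cover the values only; the table and column names go into the SQL as-is, so they should come from your own code, never from user input.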
Nothing ruins reading like a lump of text. Lots of print statements -> here document -> separate file (config file). Usually the reason for the text is creating some web page etc. Candidates for templating; give the work to someone else.
People post code using symref unwittingly. Flamed ‘cos couldn't be using strict. strict subs stops calling sub without parens if no def yet, so define sub earlier, or declare stub, or put empty parens after call. strict vars requires lexicals or effort to use package var.
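The usual cure for an unwitting symbolic reference is a hash, sketched here with invented action names. Instead of calling a sub whose name is held in a string, look the name up in a table of code references.

```perl
use strict;
use warnings;

# Hypothetical actions. Legacy code without strict might instead have
# called &$action with $action holding the name of a sub.
my %handler = (
    start => sub { return 'starting' },
    stop  => sub { return 'stopping' },
);

sub dispatch {
    my ($action) = @_;
    my $code = $handler{$action}
        or return "unknown action '$action'";
    return $code->();
}

print dispatch($_), "\n" for qw(start stop restart);
```

An unexpected name now gives a tidy message instead of quietly calling whatever sub happens to match the string.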
use warnings is relatively new - so probably see -w. And that's fine. use warnings is lexically scoped, so… -W forces warnings everywhere, regardless of attempts to turn off. So use it, decide, remove, use warnings.
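A sketch of what lexical scoping buys you: one category of warning is switched off for a single block and nowhere else. The __WARN__ handler is only there to collect the messages so the effect is visible; the variable names are invented.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them, purely for demonstration.
my @messages;
local $SIG{__WARN__} = sub { push @messages, $_[0] };

my $missing;                      # deliberately left undefined
my $loud = 'value: ' . $missing;  # warns: use of uninitialized value

{
    no warnings 'uninitialized';      # off for this block only
    my $quiet = 'value: ' . $missing; # no warning here
}

print scalar(@messages), " warning(s) caught\n";   # 1 unless run with -W
```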
Still zillions of people parsing XML and HTML with regexen, calling out to 'date' and 'cal' and sqlplus. Some code may have been pasted from third-party modules before it was okay to trust CPAN.