Real World Haskell: Lecture 7

Published in: Education

  1. 1. Real World Haskell: Lecture 7 Bryan O’Sullivan 2009-12-09
  2. 2. Getting things done It’s great to dwell so much on purity, but we’d like to maybe use Haskell for practical programming some time. This leaves us concerned with talking to the outside world.
  3. 3. Word count

        import System.Environment (getArgs)
        import Control.Monad (forM_)

        countWords path = do
          content <- readFile path
          let numWords = length (words content)
          putStrLn (show numWords ++ " " ++ path)

        main = do
          args <- getArgs
          mapM_ countWords args
  4. 4. New notation! There was a lot to digest there. Let’s run through it all, from top to bottom.

        import System.Environment (getArgs)

     “Import only the thing named getArgs from System.Environment.” Without an explicit (comma-separated) list of names to import, everything that a module exports is imported into this one.
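As a rough illustration of the two import forms just described, here is a minimal sketch. The helper `shout` is hypothetical and exists only to exercise a name that arrives through the list-free import of Data.Char.

```haskell
-- Explicit import list: only getArgs and getProgName come into scope.
import System.Environment (getArgs, getProgName)
-- No import list: everything Data.Char exports comes into scope.
import Data.Char

-- Hypothetical helper; toUpper is available because of "import Data.Char".
shout :: String -> String
shout = map toUpper

main :: IO ()
main = do
  name <- getProgName
  args <- getArgs
  putStrLn (shout name ++ " got " ++ show (length args) ++ " argument(s)")
```

Explicit lists make it easier to see where a name comes from, which is probably why the word count program uses one.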
  5. 5. The do block Notice that this function’s body starts with the keyword do:

        countWords path = do
          ...

     That keyword introduces a series of actions. Each action is somewhat similar to a statement in C or Python.
  6. 6. Executing an action and using its result The first line of our function’s body:

        countWords path = do
          content <- readFile path

     This performs the action “readFile path”, and binds the result to the name “content”. The special notation “<-” makes it clear that we are executing an action, i.e. not applying a pure function.
  7. 7. Applying a pure function Inside a do block, the let keyword names the result of a pure expression. Unlike an ordinary let, the code that follows does not need to start with an in keyword.

        let numWords = length (words content)
        putStrLn (show numWords ++ " " ++ path)

     With both let and <-, the result is immutable as usual, and stays in scope until the end of the do block.
  8. 8. Executing an action This line executes an action, and ignores its return value:

        putStrLn (show numWords ++ " " ++ path)
  9. 9. Compare and contrast Wonder how different imperative programming in Haskell is from other languages?

        def count_words(path):
            content = open(path).read()
            num_words = len(content.split())
            print repr(num_words) + " " + path

        countWords path = do
          content <- readFile path
          let numWords = length (words content)
          putStrLn (show numWords ++ " " ++ path)
  10. 10. A few handy rules When you want to introduce a new name inside a do block: Use name <- action to perform an action and keep its result. Use let name = expression to evaluate a pure expression, and omit the in.
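The two rules can be seen side by side in a small sketch. The helper `describe` is hypothetical, invented here to give `let` a pure expression to name.

```haskell
import Data.Char (toUpper)

-- Hypothetical pure helper: no I/O happens here.
describe :: String -> String
describe line = show (length (words line)) ++ " word(s): " ++ map toUpper line

main :: IO ()
main = do
  line <- getLine              -- an action: run it and keep its result
  let summary = describe line  -- a pure expression: let, and no "in"
  putStrLn summary             -- an action whose result, (), is ignored
```

Writing `let line = getLine` would compile, but `line` would then name the unexecuted action itself, not the text it reads.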
  11. 11. More adventures with ghci If we load our source file into ghci, we get an interesting type signature:

        *Main> :type countWords
        countWords :: FilePath -> IO ()

     See the result type of IO ()? That means “this is an action that performs I/O, and which returns nothing useful when it’s done.”
  12. 12. Main In Haskell, the entry point to an executable is named main. You are shocked by this, I am sure.

        main = do
          args <- getArgs
          mapM_ countWords args

     Instead of main being passed its command line arguments as in C, it uses the getArgs action to retrieve them.
  13. 13. What’s this mapM business? The map function can only apply pure functions, so it has an equivalent named mapM that maps an impure action over a list of arguments and returns the list of results. The mapM function has a cousin, mapM_, that throws away the result of each action it performs. In other words, this is one way to perform a loop over a list in Haskell. “mapM_ countWords args” means “apply countWords to every element of args in turn, and throw away each result.”
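To make the difference between the two functions concrete, here is a minimal sketch. The action `announce` is hypothetical; it prints a word and returns its length, so that there is a result worth keeping or discarding.

```haskell
-- Hypothetical action: prints a word and returns its length.
announce :: String -> IO Int
announce w = do
  putStrLn w
  return (length w)

main :: IO ()
main = do
  lens <- mapM announce ["purity", "is", "nice"]  -- keeps the results
  print lens
  mapM_ announce ["and", "so", "on"]              -- discards the results
```

The first loop has type IO [Int] and the second IO (), which is the type we want for a loop that is run only for its effects.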
  14. 14. Compare and contrast II, electric boogaloo These don’t look as similar as their predecessors:

        def main():
            for name in sys.argv[1:]:
                count_words(name)

        main = do
          args <- getArgs
          mapM_ countWords args

     I wonder if we could change that.
  15. 15. Idiomatic word count in Python If we were writing “real” Python code, it would look more like this:

        def main():
            for path in sys.argv[1:]:
                c = open(path).read()
                print len(c.split()), path
  16. 16. Meet forM In the Control.Monad module, there are two functions named forM and forM_. They are nothing more than mapM and mapM_ with their arguments flipped. In other words, these are identical:

        mapM_ countWords args
        forM_ args countWords

     That seems a bit gratuitous. Why should we care?
  17. 17. Function application as an operator In our last lecture, we were introduced to function composition:

        f . g = \x -> f (g x)

     We can also write an operator that applies a function:

        f $ x = f x

     This operator has a very low precedence, so we can use it to get rid of parentheses. Sometimes this makes code easier to read:

        putStrLn (show numWords ++ " " ++ path)
        putStrLn $ show numWords ++ " " ++ path
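A short sketch of both operators at work. The name `wordCount` is hypothetical; it simply composes the same two Prelude functions the word count program uses.

```haskell
-- Hypothetical helper built with composition: (f . g) x = f (g x).
wordCount :: String -> Int
wordCount = length . words

main :: IO ()
main = do
  let text = "to be or not to be"
  putStrLn (show (wordCount text) ++ " words")   -- with parentheses
  putStrLn $ show (wordCount text) ++ " words"   -- same thing, using $
```

Because $ binds more loosely than ++, everything to its right is evaluated as a single argument to putStrLn, so both lines print the same text.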
  18. 18. Idiomatic word counting in Haskell See what’s different about this word counting?

        main = do
          args <- getArgs
          forM_ args $ \arg -> do
            content <- readFile arg
            let len = length (words content)
            putStrLn (show len ++ " " ++ arg)

     Doesn’t that use of forM_ look remarkably like a for loop in some other language? That’s because it is one.
  19. 19. The reason for the $ Notice that the body of the forM_ loop is an anonymous function of one argument. We put the $ in there so that we wouldn’t have to either wrap the entire function body in parentheses, or split it out and give it a name.
  20. 20. The good Here’s our original code, using the $ operator:

        forM_ args $ \arg -> do
          content <- readFile arg
          let len = length (words content)
          putStrLn (show len ++ " " ++ arg)
  21. 21. The bad If we omit the $, we could use parentheses:

        forM_ args (\arg -> do
          content <- readFile arg
          let len = length (words content)
          putStrLn (show len ++ " " ++ arg))
  22. 22. And the ugly Or we could give our loop body a name:

        let body arg = do
              content <- readFile arg
              let len = length (words content)
              putStrLn (show len ++ " " ++ arg)
        forM_ args body

     Giving such a trivial single-use function a name seems gratuitous. Nevertheless, it should be clear that all three pieces of code are identical in their operation.
  23. 23. Trying it out Let’s assume we’ve saved our source file as WC.hs, and give it a try:

        $ ghc --make WC
        [1 of 1] Compiling Main ( WC.hs, WC.o )
        Linking WC ...
        $ du -h ascii.txt
        58M ascii.txt
        $ time ./WC ascii.txt
        9873630 ascii.txt
        real 0m8.043s
  24. 24. Comparison shopping How does the performance of our WC program compare with the system’s built-in wc command?

        $ export LANG=C
        $ time wc -w ascii.txt
        9873630 ascii.txt
        real 0m0.447s

     Ouch! The C version is almost 18 times faster.
  25. 25. A second try Does it help if we recompile with optimisation?

        $ ghc -fforce-recomp -O --make WC
        $ time ./WC ascii.txt
        9873630 ascii.txt
        real 0m7.696s

     So that made our code about 4% faster. Ugh.
  26. 26. What’s going on here? Remember that in Haskell, a string is a list. And a list is represented as a linked list. This means that every character gets its own list element, and list elements are not allocated contiguously. When each element is itself a large data structure, the per-element list overhead is negligible, but for single characters, it’s a total killer. So what’s to be done? Enter the bytestring.
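The claim that a string is a list can be checked directly. In this minimal sketch, the name `greeting` is hypothetical; it spells out a string one cons cell at a time.

```haskell
-- In the Prelude, String is a synonym for [Char], so list syntax and
-- list functions work on strings directly.
greeting :: String
greeting = 'h' : 'e' : 'l' : 'l' : 'o' : []

main :: IO ()
main = do
  print (greeting == "hello")  -- the literal is the same list
  print (length greeting)      -- walks the list, one cell per character
  print (reverse greeting)     -- an ordinary list function
```

Each of those cons cells is a separate heap object, which is where the overhead described above comes from.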
  27. 27. The original code

        main = do
          args <- getArgs
          forM_ args $ \arg -> do
            content <- readFile arg
            let len = length (words content)
            putStrLn (show len ++ " " ++ arg)
  28. 28. The bytestring code A bytestring is a contiguously-allocated array of bytes. Because there’s no pointer-chasing overhead, this should be faster.

        import qualified Data.ByteString.Char8 as B
        import System.Environment (getArgs)
        import Control.Monad (forM_)

        main = do
          args <- getArgs
          forM_ args $ \arg -> do
            content <- B.readFile arg
            let len = length (B.words content)
            putStrLn (show len ++ " " ++ arg)

     Notice the import qualified: this allows us to write B instead of Data.ByteString.Char8 wherever we want to use a name imported from that module.
  29. 29. So is it faster? How does this code perform?

        $ time ./WC ascii.txt
        9873630 ascii.txt
        real 0m8.043s
        $ time ./WC-BS ascii.txt
        9873630 ascii.txt
        real 0m1.434s

     Not bad! We’re 6x faster than the String code, and now just 3x slower than the C code.
  30. 30. Seriously? Bytes for text? There is, of course, a snag to using bytestrings: they’re strings of bytes, not characters. This is the 21st century, and everyone should be using Unicode now, right? Our answer to this problem in Haskell is to use the text package, which provides the Data.Text module.
  31. 31. Unicode-aware word count

        import qualified Data.Text as T
        import Data.Text.Encoding (decodeUtf8)
        import qualified Data.ByteString.Char8 as B
        import System.Environment (getArgs)
        import Control.Monad (forM_)

        main = do
          args <- getArgs
          forM_ args $ \arg -> do
            bytes <- B.readFile arg
            let content = decodeUtf8 bytes
                len = length (T.words content)
            putStrLn (show len ++ " " ++ arg)
  32. 32. What happens here? Notice that we still use bytestrings to read the initial data in. Now, however, we use decodeUtf8 to turn the raw bytes from UTF-8 into the Unicode representation that Data.Text uses internally. We then use Data.Text’s words function to split the big string into a list of words.
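To see why bytes and characters are not interchangeable, here is a minimal sketch, assuming the bytestring and text packages are installed. The value `sample` is hypothetical: five Greek letters, written as code points so the source stays ASCII.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8, encodeUtf8)

-- Hypothetical sample: five Greek letters, given as decimal code points.
sample :: T.Text
sample = T.pack "\955\972\947\959\962"

main :: IO ()
main = do
  let bytes = encodeUtf8 sample          -- Text -> UTF-8 encoded bytes
  print (B.length bytes)                 -- counts bytes
  print (T.length (decodeUtf8 bytes))    -- counts characters
```

This prints 10 and then 5, because each of these letters takes two bytes in UTF-8. Note that decodeUtf8 throws an exception on input that is not valid UTF-8, so a more defensive program might prefer decodeUtf8' from the same module.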
  33. 33. Comparing Unicode performance For comparison, let’s first try a Unicode-aware word count in C, on a file containing 112.6 million characters of UTF-8-encoded Greek:

        $ du -h greek.txt
        196M greek.txt
        $ export LANG=en_US.UTF-8
        $ time wc -w greek.txt
        16917959 greek.txt
        real 0m8.306s
        $ time ./WC-T greek.txt
        16917959 greek.txt
        real 0m7.350s
  34. 34. What did we just see? Wow! Our tiny Haskell program is actually 13% faster than the system’s wc command! This suggests that if we choose the right representation, we can write real-world code that is both brief and highly efficient. This ought to be immensely cheering.
