Text manipulation
               with/without parsec
      October 11, 2011 Vancouver Haskell UnMeetup

                            Tatsuhiro Ujihisa




Tuesday, October 11, 2011
• Tatsuhiro Ujihisa
               • @ujm
               • HootSuite Media Inc.
               • Osaka, Japan
               • Vim: 14
               • Haskell: 5
Topics
               • text manipulation functions with/without parsec
               • parsec library
               • texts in Haskell
               • attoparsec library


Haskell for work
               • Something academic
               • Something mathematical
               • Web app
               • Better shell scripting
               • (Improve yourself)

Text manipulation
               • The concept of text
               • String is [Char]
                • lazy
                • Pattern matching
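
A minimal sketch of the three points above: a String is an ordinary list of Char, so it can be taken apart with list patterns and produced lazily. The names capitalize and firstFive are made up for this illustration and are not from the talk.

```haskell
import Data.Char (toUpper)

-- Pattern matching on a String works exactly like on any other list.
capitalize :: String -> String
capitalize ""     = ""
capitalize (c:cs) = toUpper c : cs

-- Laziness: only the first five characters of the infinite string are built.
firstFive :: String
firstFive = take 5 (cycle "ab")

main :: IO ()
main = do
  putStrLn (capitalize "haskell") -- Haskell
  putStrLn firstFive              -- ababa
```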


Example: split
               • Ruby/Python example
                • 'aaa<>bb<>c<><>d'.split('<>')
                            # => ['aaa', 'bb', 'c', '', 'd']
               • Vim script example
                • split('aaa<>bb<>c<><>d', '<>')


split in Haskell
               • split :: String -> String -> [String]
                • split "aaa<>bb<>c<><>d" "<>"
                            -- ["aaa", "bb", "c", "", "d"]
                • "aaa<>bb<>c<><>d" `split` "<>"



Design of split
               • split "aaa<>bb<>c<><>d" "<>"
               • "aaa" : split "bb<>c<><>d" "<>"
               • "aaa" : "bb" : split "c<><>d" "<>"
               • "aaa" : "bb" : "c" : split "<>d" "<>"
               • "aaa" : "bb" : "c" : "" : split "d" "<>"
               • "aaa" : "bb" : "c" : "" : "d" : split "" "<>"
               • "aaa" : "bb" : "c" : "" : "d" : []
Design of split
               • split "aaa<>bb<>c<><>d" "<>"
               • "aaa" : split "bb<>c<><>d" "<>"




Design of split
               • split "aaa<>bb<>c<><>d" "<>"
               • split' "aaa<>bb<>c<><>d" "<>" ""
               • split' "aa<>bb<>c<><>d" "<>" "a"
               • split' "a<>bb<>c<><>d" "<>" "aa"
               • split' "<>bb<>c<><>d" "<>" "aaa"
               • "aaa" : split "bb<>c<><>d" "<>"

       split :: String -> String -> [String]
       str `split` pat = split' str pat ""

       -- memo holds the current piece, accumulated in reverse order
       split' :: String -> String -> String -> [String]
       split' "" _ memo = [reverse memo]
       split' str@(c:rest) pat memo
         | a == pat  = reverse memo : (b `split` pat)
         | otherwise = split' rest pat (c : memo)
         where (a, b) = splitAt (length pat) str
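
For comparison, here is a sketch of the same recursion without the reversed accumulator, using stripPrefix from Data.List to test for the separator. The names splitOn' and breakOn are hypothetical and not from the talk, and the separator is assumed to be non-empty (an empty one would loop forever).

```haskell
import Data.List (stripPrefix)

-- Hypothetical variant of split; the separator comes first here.
splitOn' :: String -> String -> [String]
splitOn' pat = go
  where
    go str = case breakOn str of
      (piece, Nothing)   -> [piece]
      (piece, Just rest) -> piece : go rest

    -- Scan until the separator is a prefix of the remaining input.
    -- Returns the piece before it and, if found, the input after it.
    breakOn s
      | Just rest <- stripPrefix pat s = ("", Just rest)
    breakOn ""     = ("", Nothing)
    breakOn (c:cs) = let (piece, rest) = breakOn cs in (c : piece, rest)

main :: IO ()
main = print (splitOn' "<>" "aaa<>bb<>c<><>d")
-- ["aaa","bb","c","","d"]
```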



Another approach
               • Text.Parsec: v3
               • Text.ParserCombinators.Parsec: v2
               • Real World Haskell Parsec chapter
                • csv parser

Design of split
               • split "aaa<>bb<>c<><>d" "<>"
               • many of
                • any char, stopping before the string "<>"
               • separated by "<>" or terminated by the end of the string




       import qualified Text.Parsec as P
       import Text.Parsec.String (Parser)

       split :: String -> String -> [String]
       str `split` pat = case P.parse (split' (P.string pat)) "split" str of
         Right x  -> x
         Left err -> error (show err)  -- not reachable for a non-empty pat

       split' :: Parser a -> Parser [String]
       split' pat = P.anyChar `P.manyTill` (P.eof P.<|> (P.try (P.lookAhead pat) >> return ())) `P.sepBy` pat





       split' pat = P.anyChar `P.manyTill` (P.eof P.<|> (P.try (P.lookAhead pat) >> return ())) `P.sepBy` pat

               • P.anyChar: any char
               • P.manyTill (...): except at the end of the string or at the
                     separator pattern (checked without consuming text)




       import qualified Text.Parsec as P

       main :: IO ()
       main = do
         print $ abc1 "abc"   -- True
         print $ abc1 "abcd"  -- False
         print $ abc2 "abc"   -- True
         print $ abc2 "abcd"  -- False

       -- Plain equality check
       abc1 :: String -> Bool
       abc1 str = str == "abc"

       -- The same check as a parser: "abc" followed by the end of input
       abc2 :: String -> Bool
       abc2 str = case P.parse (P.string "abc" >> P.eof) "abc" str of
         Right _ -> True
         Left _  -> False



       import qualified Text.Parsec as P

       main :: IO ()
       main = do
         print $ parenthMatch1 "(a (b c))"  -- True
         print $ parenthMatch1 "(a (b c)"   -- False
         print $ parenthMatch1 ")(a (b c)"  -- False
         print $ parenthMatch2 "(a (b c))"  -- True
         print $ parenthMatch2 "(a (b c)"   -- False
         print $ parenthMatch2 ")(a (b c)"  -- False

       -- Without Parsec: track the nesting depth by hand
       parenthMatch1 :: String -> Bool
       parenthMatch1 str = f str 0
         where
           f "" 0 = True
           f "" _ = False
           f ('(':xs) n = f xs (n + 1)
           f (')':_) 0 = False
           f (')':xs) n = f xs (n - 1)
           f (_:xs) n = f xs n

       -- With Parsec: write the grammar down directly
       parenthMatch2 :: String -> Bool
       parenthMatch2 str =
         case P.parse (f >> P.eof) "parenthMatch" str of
           Right _ -> True
           Left _  -> False
         where
           f = P.many (P.noneOf "()" P.<|> g)
           g = do
             P.char '('
             f
             P.char ')'


Parsec API
               • anyChar
               • char 'a'
               • string "abc"
                     == string ['a', 'b', 'c']
                     == char 'a' >> char 'b' >> char 'c'
               • oneOf ['a', 'b', 'c']
               • noneOf "abc"
               • eof
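
A small sketch exercising the primitives listed above. The helper accepts is hypothetical (not part of Parsec): it runs a parser followed by eof and reports whether the whole input was accepted.

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- Hypothetical helper: True when the parser accepts the entire input.
accepts :: Parser a -> String -> Bool
accepts p str = case P.parse (p >> P.eof) "example" str of
  Right _ -> True
  Left _  -> False

main :: IO ()
main = do
  print (accepts P.anyChar "x")           -- True
  print (accepts (P.char 'a') "b")        -- False
  print (accepts (P.string "abc") "abc")  -- True
  print (accepts (P.oneOf "abc") "c")     -- True
  print (accepts (P.noneOf "abc") "c")    -- False
  print (accepts P.eof "")                -- True
```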
Parsec API (combinator)
               • >>, >>=, return, and fail
               • <|>
               • many p
               • p1 `manyTill` p2
               • p1 `sepBy` p2
               • p `chainl1` op (or chainl p op x, with default value x)
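
A sketch of a few of these combinators on small inputs: sepBy, chainl1 (the variant of chainl that needs no default value) and manyTill. The parser names number, numbers, difference and commentBody are made up for this example; many1, digit and try are standard Parsec functions not listed on the slide.

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- One or more digits, read as an Int.
number :: Parser Int
number = read <$> P.many1 P.digit

-- Comma-separated numbers.
numbers :: Parser [Int]
numbers = number `P.sepBy` P.char ','

-- Left-associative subtraction: "10-3-2" is read as (10-3)-2.
difference :: Parser Int
difference = number `P.chainl1` (P.char '-' >> return (-))

-- Everything up to a closing "*/", which is consumed but not returned.
commentBody :: Parser String
commentBody = P.anyChar `P.manyTill` P.try (P.string "*/")

main :: IO ()
main = do
  print (P.parse numbers "" "1,2,3")           -- Right [1,2,3]
  print (P.parse difference "" "10-3-2")       -- Right 5
  print (P.parse commentBody "" "a * b */ c")  -- Right "a * b "
```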
Parsec API (etc)
               • try
               • lookAhead p
               • notFollowedBy p
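
A sketch of what these three do, assuming Parsec's usual behaviour that string consumes input while it is still matching. All parser names here (keywordNoTry, keywordTry, peek, letKeyword) are hypothetical.

```haskell
import Data.Either (isRight)
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- Without try, string "let" consumes "le" before failing on "lemon",
-- so the second alternative is never attempted.
keywordNoTry :: Parser String
keywordNoTry = P.string "let" P.<|> P.string "lemon"

-- try rewinds the input on failure, so the alternative gets its chance.
keywordTry :: Parser String
keywordTry = P.try (P.string "let") P.<|> P.string "lemon"

-- lookAhead checks for "ab" but leaves it in the input.
peek :: Parser String
peek = P.lookAhead (P.string "ab") >> P.many P.anyChar

-- notFollowedBy rejects "let" when it is only the start of a longer word.
letKeyword :: Parser String
letKeyword = P.string "let" <* P.notFollowedBy P.alphaNum

main :: IO ()
main = do
  print (isRight (P.parse keywordNoTry "" "lemon"))  -- False
  print (P.parse keywordTry "" "lemon")              -- Right "lemon"
  print (P.parse peek "" "abc")                      -- Right "abc"
  print (isRight (P.parse letKeyword "" "letter"))   -- False
  print (P.parse letKeyword "" "let x")              -- Right "let"
```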



texts in Haskell



three types of text
               • String
               • ByteString
               • Text



String
               • [Char]
               • Char: a Unicode code point (not a UTF-8 byte)
               • "aaa" is String
               • List is lazy and slow


ByteString
               • import Data.ByteString
                • Base64
                • Char8
                • UTF8
                • Lazy (Char8, UTF8)
               • Fast; the default string type in the Snap web framework
ByteString (cont'd)
       {-# LANGUAGE OverloadedStrings #-}
       import Data.ByteString.Char8 ()
       import Data.ByteString (ByteString)

       main = print ("hello" :: ByteString)


               • OverloadedStrings with Char8
               • Give the type explicitly or use with ByteString functions

ByteString (cont'd)

       import Data.ByteString.UTF8 ()
       import qualified Data.ByteString as B
       import Codec.Binary.UTF8.String (encode)

       main = B.putStrLn (B.pack $ encode "       " :: B.ByteString)




Text
               • import Data.Text
               • import Data.Text.IO
               • always Unicode (encoding to bytes is explicit, via Data.Text.Encoding)
               • import Data.Text.Lazy
               • Fast

Text (cont'd)
       {-# LANGUAGE OverloadedStrings #-}
       import Data.Text (Text)
       import qualified Data.Text.IO as T

       main = T.putStrLn ("         " :: Text)



               • UTF-8 friendly
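
As a sketch of Text in practice: the text package already ships a splitOn that does the job of the split example from earlier in the talk, and Text lengths count characters rather than bytes. The binding names pieces, charCount and byteCount are made up here; the Japanese literal is only sample data.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Data.Text.splitOn takes the separator first.
pieces :: [Text]
pieces = T.splitOn "<>" "aaa<>bb<>c<><>d"

-- Three characters, but nine bytes once encoded as UTF-8.
charCount, byteCount :: Int
charCount = T.length "日本語"
byteCount = B.length (TE.encodeUtf8 "日本語")

main :: IO ()
main = do
  print pieces     -- ["aaa","bb","c","","d"]
  print charCount  -- 3
  print byteCount  -- 9
```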
Parsec supports
               • String
               • ByteString




Attoparsec supports
               • ByteString
               • Text




Attoparsec
               • cabal install attoparsec
                • attoparsec-text
                • attoparsec-enumerator
                • attoparsec-iteratee
                • attoparsec-text-enumerator

Attoparsec pros/cons
               • Pros
                • fast
                • text support
                • enumerator/iteratee
               • Cons
                • no lookAhead/notFollowedBy
Parsec and Attoparsec
       Parsec:

       import qualified Text.Parsec as P

       main = print $ abc "abc"

       abc str = case P.parse f "abc" str of
         Right _ -> True
         Left _  -> False
       f = P.string "abc"

       Attoparsec:

       {-# LANGUAGE OverloadedStrings #-}
       import qualified Data.Attoparsec.Text as P

       main = print $ abc "abc"

       abc str = case P.parseOnly f str of
         Right _ -> True
         Left _  -> False
       f = P.string "abc"




return ()



Practice
               • args "f(x, g())"
                     -- ["x", "g()"]
               • args "f(, aa(), bb(c))"
                     -- ["", "aa()", "bb(c)"]
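
One possible solution sketch with Parsec, not the speaker's: skip the function name, then read arguments separated by a comma and optional spaces, where each argument may contain balanced parentheses. The helper names call, arg, plain and nested are made up, and a parse failure is assumed to yield an empty list. Note that under this sketch args "f()" gives [""] rather than [].

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

args :: String -> [String]
args str = case P.parse call "args" str of
  Right xs -> xs
  Left _   -> []  -- assumption: malformed input yields no arguments

-- name '(' arg, arg, ... ')' end-of-input
call :: Parser [String]
call = do
  _  <- P.many (P.noneOf "(")
  _  <- P.char '('
  xs <- arg `P.sepBy` (P.char ',' >> P.spaces)
  _  <- P.char ')'
  P.eof
  return xs

-- An argument is a (possibly empty) run of plain text and nested groups.
arg :: Parser String
arg = concat <$> P.many (plain P.<|> nested)
  where
    plain  = P.many1 (P.noneOf "(),")
    -- A balanced group keeps its parentheses; commas inside are plain text.
    nested = do
      _     <- P.char '('
      inner <- concat <$> P.many (P.many1 (P.noneOf "()") P.<|> nested)
      _     <- P.char ')'
      return ("(" ++ inner ++ ")")

main :: IO ()
main = do
  print (args "f(x, g())")         -- ["x","g()"]
  print (args "f(, aa(), bb(c))")  -- ["","aa()","bb(c)"]
```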



