Text Manipulation with/without Parsec

At Vancouver Haskell UnMeetup on Oct 11, 2011

Published in: Technology

  1. Text manipulation with/without parsec
      October 11, 2011, Vancouver Haskell UnMeetup
      Tatsuhiro Ujihisa
      Tuesday, October 11, 2011
  2. About the speaker
      • Tatsuhiro Ujihisa
      • @ujm
      • HootSuite Media Inc.
      • Osaka, Japan
      • Vim: 14
      • Haskell: 5
  3. Topics
      • text manipulation functions with/without parsec
      • parsec library
      • texts in Haskell
      • attoparsec library
  4. Haskell for work
      • Something academic
      • Something mathematical
      • Web app
      • Better shell scripting
      • (Improve yourself)
  5. Text manipulation
      • The concept of text
      • String is [Char]
          • lazy
      • Pattern matching
  6. Example: split
      • Ruby/Python example
          • 'aaa<>bb<>c<><>d'.split('<>')
            ['aaa', 'bb', 'c', '', 'd']
      • Vim script example
          • split('aaa<>bb<>c<><>d', '<>')
  7. split in Haskell
      • split :: String -> String -> [String]
          • split "aaa<>bb<>c<><>d" "<>"
            ["aaa", "bb", "c", "", "d"]
          • "aaa<>bb<>c<><>d" `split` "<>"
  8. Design of split
      • split "aaa<>bb<>c<><>d" "<>"
      • "aaa" : split "bb<>c<><>d" "<>"
      • "aaa" : "bb" : split "c<><>d" "<>"
      • "aaa" : "bb" : "c" : split "<>d" "<>"
      • "aaa" : "bb" : "c" : "" : split "d" "<>"
      • "aaa" : "bb" : "c" : "" : "d" : split "" "<>"
      • "aaa" : "bb" : "c" : "" : "d" : []
  9. Design of split
      • split "aaa<>bb<>c<><>d" "<>"
      • "aaa" : split "bb<>c<><>d" "<>"
  10. Design of split
      • split "aaa<>bb<>c<><>d" "<>"
      • split "aaa<>bb<>c<><>d" "" "<>"
      • split "aa<>bb<>c<><>d" "a" "<>"
      • split "a<>bb<>c<><>d" "aa" "<>"
      • split "<>bb<>c<><>d" "aaa" "<>"
      • "aaa" : split "bb<>c<><>d" "<>"
  11. Design of split (contd)
      • split "aaa<>bb<>c<><>d" "<>"
      • split' "aaa<>bb<>c<><>d" "<>" ""
      • split' "aa<>bb<>c<><>d" "<>" "a"
      • split' "a<>bb<>c<><>d" "<>" "aa"
      • split' "<>bb<>c<><>d" "<>" "aaa"
      • "aaa" : split "bb<>c<><>d" "<>"

      -- Entry point: start with an empty accumulator.
      split :: String -> String -> [String]
      str `split` pat = split' str pat ""

      -- Helper. The slide calls it split as well; it is renamed split' here
      -- because two top-level definitions cannot share a name.
      -- memo holds the current field in reverse order.
      split' :: String -> String -> String -> [String]
      split' "" _ memo = [reverse memo]
      split' str pat memo =
        let (a, b) = splitAt (length pat) str
        in if a == pat
             then reverse memo : (b `split` pat)
             else split' (tail str) pat (head str : memo)
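The accumulator design on slides 10 and 11 can also be written with `isPrefixOf` from `Data.List`, which avoids `head` and `tail`. The following is a minimal sketch, not the speaker's code: the name `splitOn'` is hypothetical, and rejecting an empty separator is an added assumption (an empty separator would otherwise match forever).

```haskell
import Data.List (isPrefixOf)

-- | Split a string on every occurrence of a non-empty separator.
-- splitOn' is a hypothetical name, chosen so it does not clash with split.
splitOn' :: String -> String -> [String]
splitOn' str pat
  | null pat  = error "splitOn': empty separator"  -- assumption: reject it
  | otherwise = go str ""
  where
    -- acc holds the current field in reverse order
    go "" acc = [reverse acc]
    go s@(c:cs) acc
      | pat `isPrefixOf` s = reverse acc : go (drop (length pat) s) ""
      | otherwise          = go cs (c : acc)
```

With this definition, `splitOn' "aaa<>bb<>c<><>d" "<>"` evaluates to `["aaa","bb","c","","d"]`, the same result as the example on slide 7.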
  12. Another approach
      • Text.Parsec: v3
      • Text.ParserCombinators.Parsec: v2
      • Real World Haskell Parsec chapter
      • csv parser
  13. Design of split
      • split "aaa<>bb<>c<><>d" "<>"
      • many of
          • any char except for the string "<>"
          • separated by "<>" or the end of the string
  14. import qualified Text.Parsec as P

      split :: String -> String -> [String]
      str `split` pat = case P.parse (splitP (P.string pat)) "split" str of
        Right x -> x
        Left e  -> error (show e)  -- the slide leaves this case out

      -- The slide names this parser split as well; it is renamed splitP here
      -- because two top-level definitions cannot share a name.
      splitP :: P.Parsec String () String -> P.Parsec String () [String]
      splitP pat = P.anyChar `P.manyTill` (P.eof P.<|> (P.try (P.lookAhead pat) >> return ())) `P.sepBy` pat
  15. The same code as slide 14, with the parser annotated:

      splitP pat =
        P.anyChar                              -- any char
          `P.manyTill`
            ( P.eof                            -- except for the end of the string
              P.<|> (P.try (P.lookAhead pat)   -- or the pattern to separate
                       >> return ()) )         -- (without consuming text)
          `P.sepBy` pat
  16. import qualified Text.Parsec as P

      main :: IO ()
      main = do
        print $ abc1 "abc"  -- True
        print $ abc1 "abcd" -- False
        print $ abc2 "abc"  -- True
        print $ abc2 "abcd" -- False

      -- Plain comparison.
      abc1 :: String -> Bool
      abc1 str = str == "abc"

      -- The same check with Parsec: "abc" followed by end of input.
      abc2 :: String -> Bool
      abc2 str = case P.parse (P.string "abc" >> P.eof) "abc" str of
        Right _ -> True
        Left _  -> False
  17. import qualified Text.Parsec as P

      main :: IO ()
      main = do
        print $ parenthMatch1 "(a (b c))" -- True
        print $ parenthMatch1 "(a (b c)"  -- False
        print $ parenthMatch1 ")(a (b c)" -- False
        print $ parenthMatch2 "(a (b c))" -- True
        print $ parenthMatch2 "(a (b c)"  -- False
        print $ parenthMatch2 ")(a (b c)" -- False

      -- Without Parsec: count the currently open parentheses.
      parenthMatch1 :: String -> Bool
      parenthMatch1 str = f str 0
        where
          f "" 0 = True
          f "" _ = False
          f ('(':xs) n = f xs (n + 1)
          f (')':xs) 0 = False
          f (')':xs) n = f xs (n - 1)
          f (_:xs) n = f xs n

      -- With Parsec: f parses any run of non-parenthesis characters and
      -- balanced groups; g parses one balanced group.
      parenthMatch2 :: String -> Bool
      parenthMatch2 str =
        case P.parse (f >> P.eof) "parenthMatch" str of
          Right _ -> True
          Left _  -> False
        where
          f = P.many (P.noneOf "()" P.<|> g)
          g = do
            P.char '('
            f
            P.char ')'
  18. Parsec API
      • anyChar
      • char 'a'
      • string "abc"
        == string ['a', 'b', 'c']
        consumes the same input as char 'a' >> char 'b' >> char 'c'
      • oneOf ['a', 'b', 'c']
      • noneOf "abc"
      • eof
  19. Parsec API (combinator)
      • >>, >>=, return, and fail
      • <|>
      • many p
      • p1 `manyTill` p2
      • p1 `sepBy` p2
      • p1 `chainl` op
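To make the combinators on slide 19 concrete, here is a small sketch that uses `sepBy`, `chainl`, and `manyTill`. The parser names (`number`, `numberList`, `expr`, `comment`) and the inputs are illustrative assumptions, not part of the talk. Note that `chainl` takes a third argument, a default value returned when there is no operand at all.

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- | One or more digits, read as an Int.
number :: Parser Int
number = read <$> P.many1 P.digit

-- | Comma-separated numbers, e.g. "1,2,3".
numberList :: Parser [Int]
numberList = number `P.sepBy` P.char ','

-- | Left-associative sums and differences, e.g. "10-3+2".
-- The final 0 is the value chainl returns when no number is present.
expr :: Parser Int
expr = P.chainl number addOp 0
  where
    addOp = (P.char '+' >> return (+)) P.<|> (P.char '-' >> return (-))

-- | Body of a comment: everything between "<!--" and the closing "-->".
comment :: Parser String
comment = P.string "<!--" >> (P.anyChar `P.manyTill` P.try (P.string "-->"))
```

For example, `P.parse expr "" "10-3+2"` yields `Right 9`, because `chainl` groups the operators to the left: `(10 - 3) + 2`.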
  20. Parsec API (etc)
      • try
      • lookAhead p
      • notFollowedBy p
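The three functions on slide 20 all deal with input that a parser inspects but should not consume. The sketch below shows one typical use of each; the parser names and inputs are hypothetical examples rather than anything from the slides.

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- | try: "let" fails on the input "lex" only after reading "le".
-- Wrapping it in try rewinds the input so the alternative can still match.
keyword :: Parser String
keyword = P.try (P.string "let") P.<|> P.string "lex"

-- | notFollowedBy: accept "if" only when no identifier character follows,
-- so "if x" matches but "iffy" does not.
ifKeyword :: Parser String
ifKeyword = P.try (P.string "if" <* P.notFollowedBy P.alphaNum)

-- | lookAhead: peek at the next character without consuming it,
-- then read the whole remaining input (which still starts with it).
peekThenRest :: Parser (Char, String)
peekThenRest = do
  c    <- P.lookAhead P.anyChar
  rest <- P.many P.anyChar
  return (c, rest)
```

Running `P.parse peekThenRest "" "abc"` gives `Right ('a',"abc")`: the peeked character is still part of the remaining input.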
  21. texts in Haskell
  22. three types of text
      • String
      • ByteString
      • Text
  23. String
      • [Char]
      • Char: a Unicode character (a code point, independent of any encoding)
      • "aaa" is String
      • List is lazy and slow
  24. ByteString
      • import Data.ByteString
          • Base64
          • Char8
          • UTF8
          • Lazy (Char8, UTF8)
      • Fast. The default in Snap
  25. ByteString (contd)

      {-# LANGUAGE OverloadedStrings #-}
      import Data.ByteString.Char8 ()
      import Data.ByteString (ByteString)

      main = print ("hello" :: ByteString)

      • OverloadedStrings with Char8
      • Give the type explicitly or use with ByteString functions
  26. ByteString (contd)

      import Data.ByteString.UTF8 ()
      import qualified Data.ByteString as B
      import Codec.Binary.UTF8.String (encode)

      main = B.putStrLn (B.pack $ encode " " :: B.ByteString)
  27. Text
      • import Data.Text
      • import Data.Text.IO
      • always UTF8
      • import Data.Text.Lazy
      • Fast
  28. Text (contd)

      {-# LANGUAGE OverloadedStrings #-}
      import Data.Text (Text)
      import qualified Data.Text.IO as T

      main = T.putStrLn (" " :: Text)

      • UTF-8 friendly
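Since the talk distinguishes String, ByteString, and Text, a short sketch of converting between them may help. It assumes the `text` and `bytestring` packages and uses UTF-8 as the byte encoding; the helper names `toUtf8` and `fromUtf8` are hypothetical.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | String -> Text -> UTF-8 encoded bytes.
toUtf8 :: String -> B.ByteString
toUtf8 = TE.encodeUtf8 . T.pack

-- | UTF-8 encoded bytes -> Text -> String.
-- decodeUtf8 throws on invalid input; decodeUtf8' returns an Either instead.
fromUtf8 :: B.ByteString -> String
fromUtf8 = T.unpack . TE.decodeUtf8
```

The difference between characters and bytes shows up with non-ASCII input: for a three-character Japanese string, `T.length . T.pack` reports 3 characters, while `B.length . toUtf8` reports 9 bytes.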
  29. Parsec supports
      • String
      • ByteString
  30. Attoparsec supports
      • ByteString
      • Text
  31. Attoparsec
      • cabal install attoparsec
          • attoparsec-text
          • attoparsec-enumerator
          • attoparsec-iteratee
          • attoparsec-text-enumerator
  32. Attoparsec pros/cons
      • Pros
          • fast
          • text support
          • enumerator/iteratee
      • Cons
          • no lookAhead/notFollowedBy
  33. Parsec and Attoparsec

      Parsec:

      import qualified Text.Parsec as P

      main = print $ abc "abc"

      abc str = case P.parse f "abc" str of
        Right _ -> True
        Left _ -> False
      f = P.string "abc"

      Attoparsec:

      {-# LANGUAGE OverloadedStrings #-}
      import qualified Data.Attoparsec.Text as P

      main = print $ abc "abc"

      abc str = case P.parseOnly f str of
        Right _ -> True
        Left _ -> False
      f = P.string "abc"
  34. return ()
  35. Practice
      • args "f(x, g())"
        -- ["x", "g()"]
      • args "f(, aa(), bb(c))"
        -- ["", "aa()", "bb(c)"]
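One possible approach to the practice problem, sketched with Parsec. This is not the speaker's solution; the helper names `argument` and `call` are hypothetical, and the sketch makes two assumptions: a parse failure yields an empty list, and whitespace is skipped only at the start of each argument.

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- | Text of one argument: plain characters and balanced parenthesised groups.
argument :: Parser String
argument = concat <$> P.many (plain P.<|> group)
  where
    plain = P.many1 (P.noneOf "(),")
    -- a nested group is kept verbatim, including any commas inside it
    group = do
      _     <- P.char '('
      inner <- concat <$> P.many (P.many1 (P.noneOf "()") P.<|> group)
      _     <- P.char ')'
      return ("(" ++ inner ++ ")")

-- | A call such as "f(x, g())": a name, then comma-separated arguments.
call :: Parser [String]
call = do
  _  <- P.many1 (P.noneOf "(),")
  _  <- P.char '('
  xs <- (P.spaces >> argument) `P.sepBy` P.char ','
  _  <- P.char ')'
  P.eof
  return xs

-- | Assumption: return [] when the input is not a well-formed call.
args :: String -> [String]
args str = either (const []) id (P.parse call "args" str)
```

Because an argument may be empty, `args "f()"` returns `[""]` rather than `[]` in this sketch; whether that is the desired behavior is left open by the slide.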
