BBS Crawler for Taiwan: bsdconv + pyte + telnetlib, by Buganini @ PyHUG, Sep. 2012
Notes
  • Notice: .info() has been renamed to .counter()
  • UPDATE for the latest bsdconv:
    s/bsdconv_raw/pass/g
    s/bsdconv_stdout/bsdconv-stdout/g
  • Refer to http://www.slideshare.net/Buganini/bsdconv for more detail about bsdconv.
BBS crawler for Taiwan

  1. BBS Crawler for Taiwan: bsdconv + pyte + telnetlib, by Buganini @ PyHUG, Sep. 2012
  2. Obstacles
       ● Big5/UAO
       ● Segmented Big5
       ● Control Sequence
       ● Ambiguous Width
  3. Big5/UAO
       Gov.tw: BIG5-2003
       Windows: CP950
       Libiconv: BIG5(?), CP950, BIG5-HKSCS, BIG5-HKSCS:2004, BIG5-HKSCS:2001, BIG5-HKSCS:1999, BIG5-2003 (experimental)
       Mozilla: UAO 2.41
       BBS: UAO 2.50(?)
       etc.
       ref: http://moztw.org/docs/big5/
       UAO == Unicode At Once == Unicode 補完計畫 != Unicode
       UAO is extended Big5 (by using PUA), including Chinese (trad/sim/hk), Japanese, Cyrillic
       Ex: 喆 (95ED), 轮 (8879), Я (C854), か (C6F1)
  4. Segmented Big5
       Intact character:     \xAE\xE1
       Segmented character:  \xAE  \x1B[1;33m  \xE1
       [Figure: the same text as rendered by PCMAN and by a standard tool]
  5. Control Sequence / Ambiguous Width
       08 08 20 20   ← ← SP SP
       08 08 0a      ← ← ↓
       e2 97 8f      ●
  6. ● Big5/UAO
     ● Segmented Big5
     ● Control Sequence
     ● Ambiguous Width
  7. Obstacles: Not anymore…
       ● Big5/UAO
       ● Segmented Big5       Solved in bug5, using bsdconv
       ● Ambiguous Width
       ● Control Sequence     Solved, using pyte
       https://github.com/buganini/bug5
       https://github.com/buganini/bsdconv
       https://github.com/selectel/pyte
  8. bsdconv (1/4)
       import bsdconv
       bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")

       Input: \xAE\xE1\xAE\x1B[1;33m\xE1
              AE E1 AE 1B 5B 31 3B 33 33 6D E1

       After "ansi-control,byte":
              03AE 03E1 03AE 1B5B313B33336D 03E1
              Bsdconv Internal Prefix: 03: Byte, 1B: ANSI Control Sequence
       After "big5-defrag":
              03AE 03E1 03AE 03E1 1B5B313B33336D
       After "byte,ansi-control":
              AE E1 AE E1 1B 5B 31 3B 33 33 6D
       After "ansi-control|skip,big5":
              016851 016851 1B5B313B33336D
              #U+6851 == 桑
  9. bsdconv (2/4)
       import bsdconv
       bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")

       >>> c=bsdconv.Bsdconv("ansi-control,byte:bsdconv_stdout")
       >>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
       03AE 03E1 03AE 1B5B313B33336D ( FREE ) 03E1
       >>> c=bsdconv.Bsdconv("ansi-control,byte:big5-defrag:bsdconv_stdout")
       >>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
       03AE 03E1 03AE 03E1 1B5B313B33336D ( FREE )

       Bsdconv Internal Prefix:
       03: Byte
       1B: ANSI Control Sequence
  10. bsdconv (3/4)
        import bsdconv
        bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")

        >>> c=bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|pass:bsdconv_stdout")
        >>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
        AE E1 AE E1 1B5B313B33336D ( FREE SKIP )
        >>> c=bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:bsdconv_stdout")
        >>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
        016851 016851 1B5B313B33336D ( FREE )

        Bsdconv Internal Prefix:
        01: Unicode
        1B: ANSI Control Sequence
        #U+6851 == 桑
  11. bsdconv (4/4)
        import bsdconv
        bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")

        >>> c=bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")
        >>> s=c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
        >>> s
        '\xe6\xa1\x91\xe6\xa1\x91\x1b[1;33m'
        >>> s.decode("utf-8")
        u'\u6851\u6851\x1b[1;33m'
        #U+6851 == 桑
  12. pyte (1/2)
        pyte: Python Terminal Emulator

        import pyte
        stream = pyte.Stream()
        screen = pyte.Screen(80, 24)
        screen.mode.discard(pyte.modes.LNM)
        stream.attach(screen)
        seq=SEQUENCE_FROM_SERVER
        useq=c.conv(seq)
        stream.feed(useq.decode("utf-8"))
        RESULT_SCREEN="\n".join(screen.display).encode("utf-8")

        With pyte.modes.LNM: \r → CR+LF (CarriageReturn / LineFeed)
        Without pyte.modes.LNM: \r → CR
  13. pyte (2/2)
        #Ambiguous Width
        screens.py
        width_counter=bsdconv.Bsdconv("utf-8:width:null")
  14. telnetlib (1/3)
        What's wrong with read_until/expect?
        What telnetlib does:
            Server → telnetlib connection → telnetlib.read_until
        What I need:
            Server → telnetlib connection → bsdconv → telnetlib.read_until / Regular Expression
        Solutions:
            a) Implement bsdconv → telnetlib.read_until (current)
            b) Hack telnetlib (maybe cleaner)
            c) Other telnetlib implementation?
  15. telnetlib (2/3)
        #Deal with lagging/noop
        def term_comm(feed=None, wait=None):
            if feed is not None:
                conn.write(feed)
                if wait:
                    # Blocking read: wait until the server answers the action
                    s=conn.read_some()
                    s=conv.conv_chunk(s)
                    stream.feed(s.decode("utf-8"))
            if wait is not False:
                # Non-blocking read: pick up whatever else has arrived
                time.sleep(0.1)
                s=conn.read_very_eager()
                s=conv.conv_chunk(s)
                stream.feed(s.decode("utf-8"))
            ret="\n".join(screen.display).encode("utf-8")
            return ret

        Reading      Feed           No Feed
        Wait=None    Non-blocking   Non-blocking
        Wait=True    Blocking       Non-blocking (unused)
        Wait=False   No             No
  16. telnetlib (3/3)
        #Deal with lagging/noop
        Action with or without screen refresh:
            term_comm(Action A, False)
            term_comm(Action B, True)   #Action A+B cause screen refresh
        Action with screen refresh (important content):
            term_comm(Action, True)
        Action with screen refresh:
            term_comm(Action)
            Wait+Retry

        Reading      Feed           No Feed
        Wait=None    Non-blocking   Non-blocking
        Wait=True    Blocking       Non-blocking (unused)
        Wait=False   No             No
  17. - Demo -
  18. - End -
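
The big5-defrag step on slides 8 to 11 moves a control sequence out of the middle of a two-byte Big5 character so that an ordinary Big5 decoder can handle the result. The idea can be sketched in pure Python 3 as below. This is not bsdconv's implementation: the helper name `defrag_big5` is hypothetical, only CSI-style sequences (`ESC [ ... final byte`) are recognised, and any byte in 0x81-0xFE is assumed to be a lead byte.

```python
import re

# CSI sequence: ESC [, parameter bytes, intermediate bytes, one final byte.
CSI = re.compile(rb"\x1b\[[0-9;?]*[ -/]*[@-~]")


def defrag_big5(data: bytes) -> bytes:
    """Move CSI sequences found between a Big5 lead and trail byte to after the character."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b == 0x1B:
            # A control sequence outside a character is copied through unchanged.
            m = CSI.match(data, i)
            if m:
                out += m.group()
                i = m.end()
                continue
        if 0x81 <= b <= 0xFE:
            # Lead byte: collect any control sequences that precede the trail byte.
            j = i + 1
            pending = bytearray()
            while j < n:
                m = CSI.match(data, j)
                if not m:
                    break
                pending += m.group()
                j = m.end()
            if j < n:
                out += bytes((b, data[j]))  # rejoin lead and trail byte
                out += pending              # then emit the displaced sequences
                i = j + 1
            else:
                # Input ended before the trail byte arrived: leave it as it was.
                out.append(b)
                out += pending
                i = j
            continue
        out.append(b)
        i += 1
    return bytes(out)


fixed = defrag_big5(b"\xAE\xE1\xAE\x1B[1;33m\xE1")
```

With the sample input from slide 8, `fixed` holds the bytes `AE E1 AE E1` followed by the colour sequence, the same order as the "big5-defrag" stage shown there.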

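Slide 5 shows ● (e2 97 8f) being erased with two backspaces, which suggests the BBS treats it as two columns wide, and slide 13 counts widths with bsdconv's width codec. As a rough stand-in for that counter, the sketch below uses Python 3's standard `unicodedata` module. The function name `display_width` and the `ambiguous_wide` switch are hypothetical, and combining marks and control characters are not handled.

```python
import unicodedata


def display_width(text, ambiguous_wide=True):
    """Count terminal columns, optionally treating East Asian Ambiguous characters as wide."""
    wide = {"W", "F"}      # Wide and Fullwidth always take two columns
    if ambiguous_wide:
        wide.add("A")      # Ambiguous (e.g. U+25CF) takes two columns on a CJK terminal
    # Simplification: every other character counts as one column.
    return sum(2 if unicodedata.east_asian_width(ch) in wide else 1 for ch in text)


bbs_columns = display_width("\u25cf")                           # CJK-style rendering
western_columns = display_width("\u25cf", ambiguous_wide=False)  # narrow rendering
```

The two results differ for the same character, which is why a terminal emulator has to be told which convention the server assumes.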
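
Slide 14 asks for pattern matching on the converted stream rather than on raw bytes, because a network read can end in the middle of a two-byte character. A minimal Python 3 sketch of that "convert, then match" loop follows. It uses the standard incremental `big5` decoder as a stand-in for `conv.conv_chunk`, and a list of byte chunks as a stand-in for the telnet connection; `read_until_match`, `read_chunk`, and `max_reads` are hypothetical names.

```python
import codecs
import re


def read_until_match(read_chunk, pattern, encoding="big5", max_reads=100):
    """Decode chunks incrementally and stop as soon as `pattern` matches the text so far."""
    # The incremental decoder buffers a dangling lead byte until its trail byte arrives.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    regex = re.compile(pattern)
    text = ""
    for _ in range(max_reads):
        chunk = read_chunk()
        if not chunk:              # the stand-in connection has no more data
            break
        text += decoder.decode(chunk)
        match = regex.search(text)
        if match:
            return text, match
    return text, None


# Simulated server output: the first character is split across two reads.
chunks = iter([b"\xa4", b"\xa4\xa4", b"\xe5:"])
text, match = read_until_match(lambda: next(chunks, b""), "中文:")
```

Matching against the raw chunks instead would see a lone `\xa4` in the first read, which is not a complete character in any Big5 variant.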