BBS crawler for Taiwan
Published in: Technology
Notes
  • Notice: .info() has been renamed to .counter()
  • UPDATE: for the latest bsdconv, apply these substitutions to the conversion strings:
    s/bsdconv_raw/pass/g
    s/bsdconv_stdout/bsdconv-stdout/g
  • Refer to http://www.slideshare.net/Buganini/bsdconv for more detail about bsdconv.
Transcript

  • 1. BBS Crawler for Taiwan
    bsdconv + pyte + telnetlib
    by Buganini @ PyHUG, Sep. 2012
  • 2. Obstacles
    ● Big5/UAO
    ● Segmented Big5
    ● Control Sequence
    ● Ambiguous Width
  • 3. ● Big5/UAO ● Segmented Big5 ● Control Sequence ● Ambiguous Width
    Gov.tw: BIG5-2003
    Windows: CP950
    Libiconv: BIG5(?), CP950, BIG5-HKSCS, BIG5-HKSCS:2004, BIG5-HKSCS:2001, BIG5-HKSCS:1999, BIG5-2003 (experimental)
    Mozilla: UAO 2.41
    BBS: UAO 2.50(?)
    etc.
    ref: http://moztw.org/docs/big5/
    UAO == Unicode At Once == Unicode 補完計畫 ("Unicode completion project") != Unicode
    UAO is extended Big5 (by using PUA), including Chinese (trad/sim/hk), Japanese, Cyrillic
    Ex: 喆 (95ED), 轮 (8879), Я (C854), か (C6F1)
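The gap between standard Big5 and UAO can be seen with Python's built-in codecs. This is a minimal sketch in Python 3 (the slides use Python 2); `try_decode` is a hypothetical helper, and the example assumes the stock `big5` codec has no mapping for the extension area where UAO places 喆 (95ED).

```python
def try_decode(data, codec):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


# AE E1 is an ordinary Big5 character (the slides use it as U+6851).
plain = try_decode(b"\xae\xe1", "big5")

# 95 ED is a UAO extension; the standard codec is expected to refuse it.
uao_only = try_decode(b"\x95\xed", "big5")
```

A crawler that decodes BBS output with a standard Big5 table therefore needs a fallback (or a UAO-aware converter such as bsdconv) for these byte pairs.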
  • 4. ● Big5/UAO ● Segmented Big5 ● Control Sequence ● Ambiguous Width
    \xAE\xE1
    \xAE \x1B[1;33m \xE1
    PCMAN | Standard Tool
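The segmented case (a color sequence wedged between the two bytes of one character) can be repaired by moving the control sequence behind the character. The following is a pure-Python sketch of that idea, not bsdconv's implementation; `big5_defrag` is a hypothetical name, the lead-byte range 0x81..0xFE is an assumption chosen to cover UAO, and the regular expression only recognises simple `ESC [ ... letter` sequences.

```python
import re

# Simplified ANSI CSI matcher: ESC [ digits/semicolons, then a final letter.
CSI = re.compile(rb"\x1b\[[0-9;]*[A-Za-z]")


def big5_defrag(data):
    """Move control sequences that split a double-byte character after it."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        byte = data[i]
        if 0x81 <= byte <= 0xFE:  # assumed lead-byte range
            j = i + 1
            pending = bytearray()
            # Collect every control sequence sitting between lead and trail.
            while True:
                match = CSI.match(data, j)
                if not match:
                    break
                pending += match.group(0)
                j = match.end()
            if j < n:
                out += bytes([byte, data[j]]) + pending
                i = j + 1
            else:
                # Trail byte has not arrived yet: leave the tail untouched.
                out += data[i:]
                i = n
        else:
            match = CSI.match(data, i)
            if match:
                out += match.group(0)
                i = match.end()
            else:
                out.append(byte)
                i += 1
    return bytes(out)


fixed = big5_defrag(b"\xae\xe1\xae\x1b[1;33m\xe1")
```

With the input from this slide, the second character is reassembled and the color sequence follows it, which is the same ordering the bsdconv walkthrough arrives at.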
  • 5. ● Big5/UAO ● Segmented Big5 ● Control Sequence ● Ambiguous Width
    08 08 20 20    ← ← SP SP
    08 08 0a       ← ← ↓
    e2 97 8f       ●
  • 6. ● Big5/UAO ● Segmented Big5 ● Control Sequence ● Ambiguous Width
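Byte runs such as `08 08 20 20` only make sense once they are applied to a screen, which is why the raw stream cannot be read as plain text. Below is a toy renderer that handles just backspace, CR and LF; `render` is a hypothetical helper written for illustration, and a real terminal emulator such as pyte covers far more.

```python
def render(text, width=80, height=24):
    """Apply BS, CR and LF to a blank screen and return its lines."""
    rows = [[" "] * width for _ in range(height)]
    x = y = 0
    for ch in text:
        if ch == "\x08":      # backspace: cursor left, nothing erased
            x = max(0, x - 1)
        elif ch == "\r":      # carriage return: back to column 0
            x = 0
        elif ch == "\n":      # line feed: next row, same column
            y = min(height - 1, y + 1)
        elif x < width:       # printable: overwrite at the cursor
            rows[y][x] = ch
            x += 1
    return ["".join(row).rstrip() for row in rows]


# Two backspaces followed by two spaces wipe the last two characters.
erased = render("abcd\x08\x08  ")[0]
```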
  • 7. Obstacles? Not anymore…
    ● Big5/UAO, Segmented Big5, Ambiguous Width: solved in bug5, using bsdconv
    ● Control Sequence: solved, using pyte
    https://github.com/buganini/bug5
    https://github.com/buganini/bsdconv
    https://github.com/selectel/pyte
  • 8. bsdconv (1/4)
    import bsdconv
    bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")

    \xAE\xE1\xAE\x1B[1;33m\xE1
    ---------------------------------------------------------
    AE E1 AE 1B 5B 31 3B 33 33 6D E1
    ★ ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw ★
    03AE 03E1 03AE 1B5B313B33336D 03E1
    Bsdconv Internal Prefix: 03: Byte, 1B: ANSI Control Sequence
    ★ ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw ★
    03AE 03E1 03AE 03E1 1B5B313B33336D
    ★ ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw ★
    AE E1 AE E1 1B 5B 31 3B 33 33 6D
    ★ ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw ★
    016851 016851 1B5B313B33336D
    #U+6851 == 桑
  • 9. bsdconv (2/4)
    import bsdconv
    bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")

    >>> c=bsdconv.Bsdconv("ansi-control,byte:bsdconv_stdout")
    >>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
    03AE 03E1 03AE 1B5B313B33336D ( FREE ) 03E1
    >>> c=bsdconv.Bsdconv("ansi-control,byte:big5-defrag:bsdconv_stdout")
    >>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
    03AE 03E1 03AE 03E1 1B5B313B33336D ( FREE )

    Bsdconv Internal Prefix: 03: Byte, 1B: ANSI Control Sequence
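To make the prefixed notation concrete, here is a small pure-Python imitation of the first stage: plain bytes are tagged with `03`, and a whole control sequence is kept as one item starting with `1B`. `tokenize` is a hypothetical helper that only reproduces the notation shown on the slide; it says nothing about how bsdconv stores data internally, and the control-sequence pattern is a simplification.

```python
import re

# Simplified ANSI CSI matcher: ESC [ digits/semicolons, then a final letter.
CSI = re.compile(rb"\x1b\[[0-9;]*[A-Za-z]")


def tokenize(data):
    """Split a byte string into prefixed hex items, as printed on the slide."""
    items = []
    i = 0
    while i < len(data):
        match = CSI.match(data, i)
        if match:
            # A control sequence stays together; its own ESC is the 1B prefix.
            items.append(match.group(0).hex().upper())
            i = match.end()
        else:
            items.append("03{:02X}".format(data[i]))
            i += 1
    return items


items = tokenize(b"\xae\xe1\xae\x1b[1;33m\xe1")
```

For the sample input this yields the same five items as the first `bsdconv_stdout` dump, with the control sequence still sitting between the two halves of the second character.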
  • 10. bsdconv (3/4)
    import bsdconv
    bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")

    >>> c=bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|pass:bsdconv_stdout")
    >>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
    AE E1 AE E1 1B5B313B33336D ( FREE SKIP )
    >>> c=bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:bsdconv_stdout")
    >>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
    016851 016851 1B5B313B33336D ( FREE )

    Bsdconv Internal Prefix: 01: Unicode, 1B: ANSI Control Sequence
    #U+6851 == 桑
  • 11. bsdconv (4/4)
    import bsdconv
    bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")

    >>> c=bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")
    >>> s=c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
    >>> s
    '\xe6\xa1\x91\xe6\xa1\x91\x1b[1;33m'
    >>> s.decode("utf-8")
    u'\u6851\u6851\x1b[1;33m'
    #U+6851 == 桑
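The last stage, decoding the text while letting control sequences pass through untouched, can be approximated without bsdconv. This is a Python 3 sketch under stated assumptions: the input has already been defragmented, the stock `big5` codec stands in for a UAO-aware table, and `decode_keep_controls` is a hypothetical name.

```python
import re

# The capturing group makes re.split keep the control sequences in the result.
CSI_SPLIT = re.compile(rb"(\x1b\[[0-9;]*[A-Za-z])")


def decode_keep_controls(data, encoding="big5"):
    """Decode text runs with `encoding`; copy control sequences verbatim."""
    parts = []
    for index, chunk in enumerate(CSI_SPLIT.split(data)):
        if index % 2:
            # Odd positions are the captured control sequences (pure ASCII).
            parts.append(chunk.decode("ascii"))
        else:
            parts.append(chunk.decode(encoding, errors="replace"))
    return "".join(parts)


text = decode_keep_controls(b"\xae\xe1\xae\xe1\x1b[1;33m")
utf8 = text.encode("utf-8")
```

The resulting UTF-8 bytes match the value of `s` on this slide, so the string is ready to be fed to a terminal emulator.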
  • 12. pyte (1/2): Python Terminal Emulator
    import pyte
    stream = pyte.Stream()
    screen = pyte.Screen(80, 24)
    screen.mode.discard(pyte.modes.LNM)
    stream.attach(screen)
    seq=SEQUENCE_FROM_SERVER
    useq=c.conv(seq)
    stream.feed(useq.decode("utf-8"))
    RESULT_SCREEN="\n".join(screen.display).encode("utf-8")

    With pyte.modes.LNM: \r → CR+LF (CarriageReturn / LineFeed)
    Without pyte.modes.LNM: \r → CR
  • 13. pyte (2/2) #Ambiguous Width
    screens.py
    width_counter=bsdconv.Bsdconv("utf-8:width:null")
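The ambiguous-width problem can be illustrated with the standard library alone. The sketch below counts terminal columns using `unicodedata.east_asian_width`; `display_width` is a hypothetical helper, and treating ambiguous characters as two columns is an assumption about how a Big5 BBS client draws them, not a description of bsdconv's `width` codec.

```python
import unicodedata


def display_width(text, ambiguous_wide=True):
    """Count terminal columns, optionally treating ambiguous chars as wide."""
    columns = 0
    for ch in text:
        kind = unicodedata.east_asian_width(ch)
        if kind in ("W", "F"):
            columns += 2          # wide and fullwidth always take two cells
        elif kind == "A":
            columns += 2 if ambiguous_wide else 1
        else:
            columns += 1
    return columns


# U+25CF (the ● from the control-sequence slide) is classified as ambiguous.
bbs_width = display_width("\u25cf", ambiguous_wide=True)
western_width = display_width("\u25cf", ambiguous_wide=False)
```

A screen buffer that assumes one column for such characters drifts out of alignment with what the BBS intended, which is why the width has to be decided before characters are placed.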
  • 14. telnetlib (1/3)
    What's wrong with read_until/expect?
    What telnetlib does: Server → telnetlib connection → telnetlib.read_until
    What I need: Server → telnetlib connection → bsdconv → telnetlib.read_until / Regular Expression
    Solutions:
    a) Implement bsdconv → telnetlib.read_until (current)
    b) Hack telnetlib (maybe cleaner)
    c) Other telnetlib implementation?
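Option (a) boils down to matching against converted text instead of raw bytes. Here is a minimal Python 3 sketch of that loop; `read_until_match` and its `read_chunk` callback are hypothetical, and the standard incremental `big5` decoder stands in for bsdconv so that a character split across two reads is still decoded correctly.

```python
import codecs
import re


def read_until_match(read_chunk, pattern, encoding="big5", max_chunks=100):
    """Decode chunks as they arrive and stop once `pattern` matches the text."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    regex = re.compile(pattern)
    text = ""
    for _ in range(max_chunks):
        chunk = read_chunk()
        if not chunk:
            break                      # connection has nothing more to give
        text += decoder.decode(chunk)  # incomplete characters stay buffered
        match = regex.search(text)
        if match:
            return text, match
    return text, None


# Simulated connection: the first character arrives split across two reads.
chunks = iter([b"\xae", b"\xe1\xae\xe1"])
text, match = read_until_match(lambda: next(chunks, b""), "\u6851\u6851")
```

With a real connection, `read_chunk` could wrap a call such as `conn.read_some`, keeping the telnet layer unmodified.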
  • 15. telnetlib (2/3) #Deal with lagging/noop
    # Needs "import time" plus the conn, conv, stream and screen objects set up earlier.
    def term_comm(feed=None, wait=None):
        if feed!=None:
            conn.write(feed)
            if wait:
                # Blocking read: wait until the server answers the input.
                s=conn.read_some()
                s=conv.conv_chunk(s)
                stream.feed(s.decode("utf-8"))
        if wait!=False:
            # Non-blocking read: pick up whatever has arrived after a short pause.
            time.sleep(0.1)
            s=conn.read_very_eager()
            s=conv.conv_chunk(s)
            stream.feed(s.decode("utf-8"))
        ret="\n".join(screen.display).encode("utf-8")
        return ret

    Reading      Feed           No Feed
    Wait=None    Non-blocking   Non-blocking
    Wait=True    Blocking       Non-blocking (unused)
    Wait=False   No             No
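The feed/wait table can be checked against a stand-in connection. The sketch below is a variant of `term_comm` that takes its collaborators as arguments so it runs on its own; `FakeConn`, `on_data`, `settle` and `trace` are hypothetical names introduced for the illustration, and only the order of reads is modelled, not the conversion or the screen.

```python
import time


def term_comm(conn, on_data, feed=None, wait=None, settle=0.1):
    """Same branching as the slide, with conn and the data sink passed in."""
    if feed is not None:
        conn.write(feed)
        if wait:
            on_data(conn.read_some())        # blocking read after input
    if wait is not False:
        time.sleep(settle)
        on_data(conn.read_very_eager())      # non-blocking catch-up read


class FakeConn:
    """Records which connection methods were called, in order."""

    def __init__(self):
        self.calls = []

    def write(self, data):
        self.calls.append("write")

    def read_some(self):
        self.calls.append("read_some")
        return b""

    def read_very_eager(self):
        self.calls.append("read_very_eager")
        return b""


def trace(feed, wait):
    """Run one term_comm call against a FakeConn and return the call log."""
    conn = FakeConn()
    term_comm(conn, lambda data: None, feed=feed, wait=wait, settle=0)
    return conn.calls
```

For example, `trace(b"x", True)` shows a write followed by a blocking read and then the eager read, while `trace(b"x", False)` shows only the write.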
  • 16. telnetlib (3/3) #Deal with lagging/noop
    Action with or without screen refresh:
        term_comm(Action A, False)
        term_comm(Action B, True) #Action A+B cause screen refresh
    Action with screen refresh (important content):
        term_comm(Action, True)
    Action with screen refresh:
        term_comm(Action)
        Wait+Retry

    Reading      Feed           No Feed
    Wait=None    Non-blocking   Non-blocking
    Wait=True    Blocking       Non-blocking (unused)
    Wait=False   No             No
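The "Wait+Retry" step is not spelled out on the slide; one plausible reading is a polling loop that refreshes the screen until the expected text shows up. The sketch below follows that reading. `wait_for_screen` and its parameters are hypothetical, and `refresh` stands for something like a no-argument `term_comm()` call that returns the current screen text.

```python
import time


def wait_for_screen(refresh, expected, retries=10, delay=0.1):
    """Poll `refresh()` until `expected` appears; return the screen or None."""
    for _ in range(retries):
        screen_text = refresh()
        if expected in screen_text:
            return screen_text
        time.sleep(delay)             # give a lagging server time to catch up
    return None


# Simulated screens: the menu only appears on the third refresh.
screens = iter(["", "loading", "Main Menu"])
found = wait_for_screen(lambda: next(screens), "Main Menu", delay=0)
```

Returning `None` after the last retry lets the caller decide whether to resend the action or give up.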
  • 17. - Demo -
  • 18. - End -
