
Douban linguist

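  1. Douban-Linguist, by liluo
  2. About me
     • liluo@github
     • liluoliluo@douban
     • liluoliluo@twitter
  3. What is Linguist?
  4. What it looks like
     • GitHub's current version
     • GitHub's previous version (the one Douban currently uses)
  5. How Douban-linguist describes itself
  6. Github-linguist README
  7. What Linguist can do
     • Programming language detection
     • Syntax highlighting
     • Language statistics for a code repository
       • Ignores code in well-known third-party library directories when counting
       • Detects whether a file is generated
  8. How does Linguist detect languages?
  9. When the input path is a directory
     • Walk every file under the directory
     • Ignore directories whose names start with ".", binary files, generated files (such as JS generated from CoffeeScript), minified files (such as jquery.min.js), and common third-party libraries (such as bootstrap)
     • Analyze the remaining files and aggregate the results
     When the input path is a file
     • Look up the language by file extension (data comes from samples.json and languages.yml)
     • No match: return nothing (None)
     • Exactly one match: return it
     • Several matches: analyze the file content
  10. Analyzing content
     Algorithm: statistical classifier (earlier documentation called it a Bayesian classifier)
     • Use the Tokenizer to turn the content into tokens
     • Compare those tokens against the Tokens of every language matched by the extension, and pick the language with the highest probability
  11. A language's Tokens come from counting (training on) the files under samples/
  12. languages.yml & samples.json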
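  13. Douban-Linguist
  14. Because of Code
     May 2012, @huanghuang: "We need two libraries, grit and linguist."
  15. And then...
  16. Nothing. The trail went cold.
  17. Time flew by, spring came around, and it was January 2013
  18. Preparation
  19. The original plan for the dependencies (Ruby library -> Python replacement)
     • pygments.rb -> pygments (the former is a Ruby wrapper around the latter)
     • mime-types -> mimetypes (Python built-in)
     • escape_utils -> urllib (no trouble at all)
     • charlock_holmes -> ? (deal with it later)
  20. Getting started
     • git init
     • cp blabla
     • added blabla
     • unittest blabla
  21. Code has a requirement!!! Detect whether a file is generated
     • Mobile team: PR diffs should not show .pbxproj or .mobileprovision files
     • Frontend team: statistics should not count minified files or files generated from CoffeeScript
  22. Get this part done first so Code can use it
  23. Carrying on~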
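  24. Running into CharlockHolmes
  25. • Tried Chardet, but it can only detect encodings
     • Tried mimetypes.guess_type(file) to check whether a file is binary: unreliable!!!
     • Also tried the approach shown on the slide (code not included in this transcript)
  26. It seems to solve the problem? But it feels so awkward...
     If only there were a Python binding for ICU...
     But I can't write C extensions > . <
     Help wanted, +1 please
  27. @XTao arrives!!!
  28. Released the first version, v0.0.1
  29. Python's mimetypes behaves oddly
     A sorrow that never ends...
  30. So port one to Python
  31. Github custom lexers (pygments.rb)
  32. Write a Pygments plugin
  33. One day, performance turned out to be terrible!!!
     • More than 2~4 times slower than Github-linguist (I no longer remember the exact numbers)
     • Running the unit tests took about 20s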
  34. Bug hunt (1)
  35. Bug hunt (2)
  36. Bug hunt (3)
  37. Bug hunt (4)
  38. Bug hunt (5)
     • Discussed it with @xtao: the cause is Python regex performance
     • We need a high-performance Python version of StringScanner
  39. And so scanner was born
     "Like" isn't just talk.
     The regex engine is oniguruma (the same engine Ruby uses)
  40. The performance gain from Scanner
     • github-linguist compared with douban-linguist after adopting Scanner
     • Travis-ci runs before and after adopting Scanner
     Note: for the 22 test cases that were removed, see https://github.com/douban/linguist/blob/eba200742c9f7ebd433b7aa73774381b80ddb0fa/tests/test_strscan.py
  41. Thanks to the author of Scanner
     Praise the Code Team, @XTao!!!
  42. Released version v0.1.0
  43. Almost done, hang in there...
  44. Latest progress on Douban-linguist
     Waiting for Pygments to release a new version
  45. With the author of Github-linguist
     • Drinkup
     • Pull Request
  46. 2013 Drinkup@Beijing
     • Asked how Linguist interacts with GitHub:
       PUSH > HOOK > QUEUE > (PULL) > CALCULATE > CALLBACK
     • He asked me whether the Python version is faster than the Ruby one
     • Told him I had opened a pull request
  47. Pull Request (1)
     Merged the same evening as the Drinkup
  48. Pull Request (2)
  49. Pull Request (3)
  50. That's all.
  51. Related links
     • https://github.com/douban/linguist
     • https://github.com/douban/PyCharlockHolmes
     • https://github.com/liluo/mime
     • https://github.com/liluo/pygments-github-lexers
     • https://github.com/cuteio/scanner
     • https://github.com/github/linguist
  52. End.
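
The detection flow on slides 9 to 11 (extension lookup first, token-based classification only when several languages share an extension) can be sketched roughly as follows. This is a minimal illustration and not the douban-linguist code: the `EXTENSIONS` table and `SAMPLES` strings are made-up stand-ins for languages.yml, samples.json and the samples/ directory, the tokenizer is a simple regex, and the scoring is a plain naive-Bayes-style sum of log probabilities with add-one smoothing and uniform priors, which may differ from what Linguist actually computes.

```python
import math
import os
import re
from collections import Counter, defaultdict

# Hypothetical extension table (the real data comes from languages.yml / samples.json).
EXTENSIONS = {
    ".py": ["Python"],
    ".h": ["C", "C++"],
}

# Hypothetical training samples standing in for the files under samples/.
SAMPLES = {
    "C": ['#include <stdio.h>\nint main(void) { printf("hi"); return 0; }'],
    "C++": ["#include <iostream>\nclass Foo { public: void bar(); };\nstd::cout << 1;"],
}


def tokenize(text):
    """Split text into word runs and single punctuation characters."""
    return re.findall(r"\w+|[^\w\s]", text)


def train(samples):
    """Count how often each token appears in each language's samples."""
    counts = defaultdict(Counter)
    for language, texts in samples.items():
        for text in texts:
            counts[language].update(tokenize(text))
    return counts


def classify(text, candidates, counts):
    """Return the candidate language with the highest smoothed log probability."""
    vocabulary = {tok for lang in candidates for tok in counts[lang]}
    best_language, best_score = None, -math.inf
    for lang in candidates:
        total = sum(counts[lang].values())
        score = sum(
            math.log((counts[lang][tok] + 1) / (total + len(vocabulary)))
            for tok in tokenize(text)
        )
        if score > best_score:
            best_language, best_score = lang, score
    return best_language


def detect(filename, text, counts):
    """Extension lookup first; fall back to content analysis on ambiguity."""
    candidates = EXTENSIONS.get(os.path.splitext(filename)[1], [])
    if not candidates:
        return None           # no match
    if len(candidates) == 1:
        return candidates[0]  # exactly one match
    return classify(text, candidates, counts)  # several matches


counts = train(SAMPLES)
print(detect("setup.py", "", counts))
print(detect("foo.h", "class Foo { public: void bar(); };", counts))
print(detect("notes.unknown", "", counts))
```

With these toy samples, the `.h` file is routed to the classifier because two languages claim the extension, while `.py` is resolved by the lookup alone and the content is never read.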

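
Slides 38 and 39 mention needing a Python counterpart to Ruby's StringScanner. To show the idea only, here is a small sketch modeled on a few of Ruby's StringScanner methods (`scan`, `check`, `scan_until`, `getch`, `eos?`). It is an assumption-laden illustration: it uses the standard-library `re` module rather than oniguruma, it is not the API of the scanner package linked on slide 51, and the `tokens` helper is a hypothetical usage example.

```python
import re


class StringScanner:
    """A cursor over a string that consumes regex matches from the current position."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def eos(self):
        """True once the cursor has reached the end of the string."""
        return self.pos >= len(self.text)

    def check(self, pattern):
        """Return the match at the cursor without advancing, or None."""
        match = re.compile(pattern).match(self.text, self.pos)
        return match.group(0) if match else None

    def scan(self, pattern):
        """Return the match at the cursor and advance past it, or None."""
        match = re.compile(pattern).match(self.text, self.pos)
        if match is None:
            return None
        self.pos = match.end()
        return match.group(0)

    def scan_until(self, pattern):
        """Advance to the end of the next match; return everything consumed, or None."""
        match = re.compile(pattern).search(self.text, self.pos)
        if match is None:
            return None
        start, self.pos = self.pos, match.end()
        return self.text[start:self.pos]

    def getch(self):
        """Consume and return a single character, or None at end of string."""
        if self.eos():
            return None
        char = self.text[self.pos]
        self.pos += 1
        return char


def tokens(source):
    """Hypothetical usage: split source into tokens, skipping '#' comments."""
    scanner = StringScanner(source)
    result = []
    while not scanner.eos():
        if scanner.scan(r"\s+"):
            continue
        if scanner.scan(r"#"):
            scanner.scan_until(r"\n|\Z")  # drop the rest of the comment line
            continue
        result.append(scanner.scan(r"\w+") or scanner.getch())
    return result


print(tokens("x = 42 # answer\ny"))
```

A tokenizer written this way runs many small anchored matches in sequence, which is presumably why regex engine speed dominated the runtime in the slides.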