
Introduction to pygments


Pygments is a Python package to syntax-highlight over 300 programming languages and text formats. This talk gives an overview of using the pygmentize tool to render source code as HTML, RTF or LaTeX. It then explains the basics of lexers and tokens and shows how to use the pygments API to implement source code converters. Finally, a step-by-step live-coding session shows how to implement your own lexer.


  1. Introduction to pygments
     https://github.com/roskakori/talks/tree/master/pygraz/pygments
     Thomas Aglassinger
     http://www.roskakori.at
     @TAglassinger
  2. What is pygments?
     ● Generic syntax highlighter
     ● Suitable for use in code hosting, forums, wikis or other applications
     ● Supports 300+ programming languages and text formats
     ● Provides a simple API to write your own lexers
  3. Agenda
     ● Basic usage
     ● A glimpse at the API: lexers and tokens
     ● Use case: convert source code
     ● Use case: write your own lexer
  4. Basic usage
  5. Applications that use pygments
     ● Wikipedia
     ● Jupyter notebook
     ● Sphinx documentation builder
     ● Trac ticket tracker and wiki
     ● Bitbucket source code hosting
     ● Pygount source lines of code counter (shameless plug)
     ● And many others
  6. Try it online
  7. Try it online
  8. Use the command line
     ● pygmentize -f html -O full,style=emacs -o example.html example.sql
     ● Renders example.sql to example.html
     ● Without “-O full,style=emacs” you have to provide your own CSS
     ● Other formats: LaTeX, RTF, ANSI sequences

     -- Simple SQL example.
     select customer_number, first_name, surname, date_of_birth
     from customer
     where date_of_birth >= '1990-01-01'
       and rating <= 20
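The same rendering can be done from Python; a minimal sketch using pygments' highlight() function with the options from the command line above:

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import SqlLexer

sql = "select customer_number from customer where rating <= 20"
# full=True emits a standalone HTML document with embedded CSS,
# matching the "-O full,style=emacs" command line options.
html = highlight(sql, SqlLexer(), HtmlFormatter(full=True, style='emacs'))
print(html[:60])
```

Writing the result to example.html then gives the same output as the pygmentize call.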
  9. Choose a specific SQL dialect
     ● There are many SQL dialects
     ● Most use “.sql” as file suffix
     ● Use “-l <lexer>” to choose a specific lexer
     ● pygmentize -l tsql -f html -O full,style=emacs -o example.html transact.sql

     -- Simple Transact-SQL example.
     declare @date_of_birth date = '1990-01-01';
     select top 10 *
     from [customer]
     where [date_of_birth] = @date_of_birth
     order by [customer_number]
  10. A glimpse at the API: lexers and tokens
  11. What are lexers?
      ● Lexers split a text into a list of tokens
      ● Tokens are strings with an assigned meaning
      ● For example, Python source code might resolve to tokens like:
        – Comment: # Some comment
        – String: 'Hello\nworld!'
        – Keyword: while
        – Number: 1.23e-45
      ● Lexers only see single “words”, parsers see the whole syntax
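This splitting can be tried directly with pygments' PythonLexer (the source snippet here is illustrative):

```python
from pygments.lexers import PythonLexer

source = "while x:  # loop\n"
# get_tokens() yields (token type, token text) pairs.
for token_type, token_text in PythonLexer().get_tokens(source):
    print(token_type, repr(token_text))
# The keyword 'while' comes back as Token.Keyword and the
# comment '# loop' as Token.Comment.Single.
```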
  12. Split a source code into tokens
      Source code for example.sql:

      -- Simple SQL example.
      select customer_number, first_name, surname, date_of_birth
      from customer
      where date_of_birth >= '1990-01-01'
        and rating <= 20
  13. Tokens for example.sql
      (Token.Comment.Single, '-- Simple SQL example.\n')
      (Token.Keyword, 'select')
      (Token.Text, '\n ')
      (Token.Name, 'customer_number')
      (Token.Punctuation, ',')
      …
      (Token.Operator, '>')
      …
      (Token.Literal.String.Single, "'1990-01-01'")
      …
      (Token.Literal.Number.Integer, '20')
      …
  14. Source code to lex example.sql

      import pygments.lexers
      import pygments.token

      def print_tokens(source_path):
          # Read source code into string.
          with open(source_path, encoding='utf-8') as source_file:
              source_text = source_file.read()
          # Find a fitting lexer.
          lexer = pygments.lexers.guess_lexer_for_filename(
              source_path, source_text)
          # Print tokens from source code.
          for items in lexer.get_tokens(source_text):
              print(items)
  15. Source code to lex example.sql (annotated)

      lexer = pygments.lexers.guess_lexer_for_filename(  # Find lexer matching the source code.
          source_path, source_text)
      for items in lexer.get_tokens(source_text):  # Obtain token sequence.
          print(items)
  16. Tokens in pygments
      ● Tokens are tuples with 2 items:
        – Type, e.g. Token.Comment
        – Text, e.g. '# Some comment'
      ● Token types are defined in pygments.token
      ● Some token types have subtypes, e.g. Comment has Comment.Single, Comment.Multiline etc.
      ● In that case, use “in” instead of “==” to check if a token type matches, e.g.:
        if token_type in pygments.token.Comment:
            ...
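The difference between “==” and “in” can be seen in a few lines:

```python
import pygments.token

token_type = pygments.token.Comment.Single
# "==" only compares exact token types, while "in" also matches subtypes.
print(token_type == pygments.token.Comment)  # False
print(token_type in pygments.token.Comment)  # True
```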
  17. Convert source code
  18. Convert source code
      ● Why? To match coding guidelines!
      ● Example: “SQL keywords must be lower case” → faster to read
      ● Despite that, a lot of SQL code uses upper case for keywords.
      ● A legacy from the mainframe era, when text editors did not have syntax highlighting.

      SELECT CustomerNumber, FirstName, Surname
      FROM Customer
      WHERE DateOfBirth >= '1990-01-01'
  19. Convert source code

      SELECT CustomerNumber, FirstName, Surname
      FROM Customer
      WHERE DateOfBirth >= '1990-01-01'

      select CustomerNumber, FirstName, Surname
      from Customer
      where DateOfBirth >= '1990-01-01'
  20. Convert source code

      def lowify_sql_keywords(source_path, target_path):
          # Read source code into string.
          with open(source_path, encoding='utf-8') as source_file:
              source_text = source_file.read()
          # Find a fitting lexer.
          lexer = pygments.lexers.guess_lexer_for_filename(
              source_path, source_text)
          # Lex the source, convert keywords and write target file.
          with open(target_path, 'w', encoding='utf-8') as target_file:
              for token_type, token_text in lexer.get_tokens(source_text):
                  # Check for keywords and convert them to lower case.
                  if token_type == pygments.token.Keyword:
                      token_text = token_text.lower()
                  target_file.write(token_text)
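The same conversion can be tried without files; a sketch that lexes a string with pygments' SqlLexer (lowify_keywords is a hypothetical helper name, not part of the talk):

```python
import pygments.lexers
import pygments.token

def lowify_keywords(sql_text):
    """Return sql_text with all SQL keywords converted to lower case."""
    lexer = pygments.lexers.SqlLexer()
    parts = []
    for token_type, token_text in lexer.get_tokens(sql_text):
        # Use "in" so keyword subtypes are converted, too.
        if token_type in pygments.token.Keyword:
            token_text = token_text.lower()
        parts.append(token_text)
    return ''.join(parts)

print(lowify_keywords("SELECT Surname FROM Customer"))
```

SELECT and FROM are lowered while the names Surname and Customer stay untouched.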
  21. Write your own lexer
  22. Why write your own lexer?
      ● To support new languages
      ● To support obscure languages (mainframe FTW!)
      ● To support in-house domain-specific languages (DSLs)
  23. How to write your own lexer
      ● All the gory details: http://pygments.org/docs/lexerdevelopment/
      ● For most practical purposes, inherit from RegexLexer
      ● Basic knowledge of regular expressions required (“import re”)
  24. NanoSQL
      ● Small subset of SQL
      ● Comment: -- Some comment
      ● Keyword: select
      ● Integer number: 123
      ● String: 'Hello'; use '' to escape a quote
      ● Name: Customer
      ● Punctuation: .,;:
  25. External lexers with pygmentize
      Use -l and -x to load a lexer class from a file:
      pygmentize -f html -O full,style=emacs -l nanosqllexer.py:NanoSqlLexer -x -o example.html example.nsql
  26. Source code for NanoSQL lexer
      ● Live coding!
      ● Starting from a skeleton
      ● Gradually adding regular expressions to render more elements
  27. Skeleton for NanoSQL lexer

      from pygments.lexer import RegexLexer, words
      from pygments.token import Comment, Keyword, Name, Number, \
          String, Operator, Punctuation, Whitespace

      # Words to be treated as keywords.
      _NANOSQL_KEYWORDS = (
          'as', 'from', 'select', 'where',
      )

      class NanoSqlLexer(RegexLexer):
          name = 'NanoSQL'
          aliases = ['nanosql']  # Names recognized by pygmentize's -l option.
          filenames = ['*.nsql']  # Patterns recognized by get_lexer_for_filename().
          tokens = {
              'root': [
                  # TODO: Add rules.
              ],
          }
  28. Render unknown tokens as Error
      Same skeleton as on the previous slide: with no rules in 'root' yet, the RegexLexer renders every character as Error.

      class NanoSqlLexer(RegexLexer):
          name = 'NanoSQL'
          aliases = ['nanosql']
          filenames = ['*.nsql']
          tokens = {
              'root': [
                  # TODO: Add rules.
              ],
          }
  29. Detect comments

      class NanoSqlLexer(RegexLexer):
          name = 'NanoSQL'
          aliases = ['nanosql']
          filenames = ['*.nsql']
          tokens = {
              'root': [
                  (r'--.*?$', Comment),
              ],
          }
  30. Detect whitespace

      class NanoSqlLexer(RegexLexer):
          name = 'NanoSQL'
          aliases = ['nanosql']
          filenames = ['*.nsql']
          tokens = {
              'root': [
                  (r'\s+', Whitespace),
                  (r'--.*?$', Comment),
              ],
          }
  31. Detect names

      class NanoSqlLexer(RegexLexer):
          name = 'NanoSQL'
          aliases = ['nanosql']
          filenames = ['*.nsql']
          tokens = {
              'root': [
                  (r'\s+', Whitespace),
                  (r'--.*?$', Comment),
                  (r'\w+', Name),
              ],
          }

      \w = [a-zA-Z0-9_]
  32. Detect numbers

      class NanoSqlLexer(RegexLexer):
          name = 'NanoSQL'
          aliases = ['nanosql']
          filenames = ['*.nsql']
          tokens = {
              'root': [
                  (r'\s+', Whitespace),
                  (r'--.*?$', Comment),
                  (r'\d+', Number),  # Must check before \w.
                  (r'\w+', Name),
              ],
          }

      \d = [0-9]
  33. Detect keywords

      class NanoSqlLexer(RegexLexer):
          name = 'NanoSQL'
          aliases = ['nanosql']
          filenames = ['*.nsql']
          tokens = {
              'root': [
                  (words(_NANOSQL_KEYWORDS, suffix=r'\b'), Keyword),
                  (r'\s+', Whitespace),
                  (r'--.*?$', Comment),
                  (r'\d+', Number),
                  (r'\w+', Name),
              ],
          }
  34. Detect keywords (continued)
      Same code as on the previous slide.
      words() takes a list of strings and returns an optimized pattern for a regular expression that matches any of these strings.
      \b = end of word
  35. Detect punctuation and operators

      class NanoSqlLexer(RegexLexer):
          name = 'NanoSQL'
          aliases = ['nanosql']
          filenames = ['*.nsql']
          tokens = {
              'root': [
                  (words(_NANOSQL_KEYWORDS, suffix=r'\b'), Keyword),
                  (r'\s+', Whitespace),
                  (r'--.*?$', Comment),
                  (r'\d+', Number),
                  (r'\w+', Name),
                  (r'[.,;:]', Punctuation),
                  (r'[<>=/*+-]', Operator),
              ],
          }
  36. Detect strings – finished!

      class NanoSqlLexer(RegexLexer):
          name = 'NanoSQL'
          aliases = ['nanosql']
          filenames = ['*.nsql']
          tokens = {
              'root': [
                  (words(_NANOSQL_KEYWORDS, suffix=r'\b'), Keyword),
                  (r'\s+', Whitespace),
                  (r'--.*?$', Comment),
                  (r'\d+', Number),
                  (r'\w+', Name),
                  (r'[.,;:]', Punctuation),
                  (r'[<>=/*+-]', Operator),
                  ("'", String, 'string'),
              ],
              'string': [
                  ("''", String),
                  (r"[^']+", String),
                  ("'", String, '#pop'),
              ],
          }
  37. Detect strings – annotated

      'root': [
          ...
          ("'", String, 'string'),  # Change state to 'string'.
      ],
      'string': [
          ("''", String),    # Double single quote (escaped quote).
          (r"[^']+", String),  # Anything except a single quote.
          ("'", String, '#pop'),  # On single quote, terminate the string and
                                  # revert the lexer to the previous state ('root').
      ],
  38. Regex fetish note
      You can squeeze string tokens into a single regex rule without the need for a separate state:

      (r"'(''|[^'])*'", String),
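The combined rule can be checked with plain re, assuming the reconstructed pattern above:

```python
import re

# Matches a NanoSQL string: pairs of quotes escape, anything else
# except a single quote is literal text.
string_pattern = re.compile(r"'(''|[^'])*'")
match = string_pattern.match("'it''s a string'")
print(match.group(0))  # → 'it''s a string'
```

An unterminated string such as "'oops" yields no match at all, so the lexer would fall through to its error handling.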
  39. Conclusion
  40. Summary
      ● Pygments is a versatile Python package to syntax-highlight over 300 programming languages and text formats.
      ● Use pygmentize to create highlighted code as HTML, LaTeX or RTF.
      ● Utilize lexers to implement code converters and analyzers.
      ● Writing your own lexers is simple.
