A short description of Perly grammar processors leading up to Regexp::Grammars. Develops two R::G modules, one for single-line logfile entries, another for larger FASTA format entries in the NCBI "nr.gz" file. The second example shows how to derive one grammar from another by overriding tags in the base grammar.
2. Grammars are the guts of compilers
● Compilers convert text from one form to another.
– C compilers convert C source to CPU-specific assembly.
– Databases compile SQL into RDBMS op's.
● Grammars define structure, precedence, and valid inputs.
– Realistic ones are often recursive or context-sensitive.
– The complexity of defining grammars led to a variety of tools for the job.
– The standard format for a long time has been “BNF”, which is the input to YACC.
● Grammars are wasted on 'flat text'.
– If “split /\t/” does the job, skip grammars entirely.
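For flat, regular input, split really is enough. A minimal sketch, assuming a hypothetical tab-separated logfile format:

# Hypothetical flat log line: tab-separated fields, no nesting.
# When the input is this regular, split() is all the "grammar" needed.
my $line = "1367874132\tINFO\temerge --sync";

my ( $stamp, $level, $message ) = split /\t/, $line;

print "$level at $stamp: $message\n";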
3. The first Yet Another: YACC
● Yet Another Compiler Compiler.
– YACC takes in a standard-format grammar structure.
– It processes tokens and their values, organizing the results according to the grammar into a structure.
● Between the source and YACC is a tokenizer.
– This parses the input into the individual tokens defined by the grammar.
– It doesn't know about structure, only about breaking the text stream up into tokens.
4. Parsing is a pain in the lex
● The real pain is gluing the parser and tokenizer together.
– Tokenizers deal in the language of patterns.
– Grammars are defined in terms of structure.
● Passing data between them makes for most of the difficulty.
– One issue is the global yylex call, which makes having multiple parsers difficult.
– Context-sensitive grammars with multiple sub-grammars are painful.
5. The perly way
● Regexen, logic, glue... hmm... been there before.
– The first approach most of us try is lexing with regexen.
– Then add captures and if-blocks, or execute (?{code}) blocks inside of each regex.
● The problem is that the grammar is embedded in your code structure.
– You have to modify the code structure to change the grammar or its tokens.
– Hubris, maybe, but Truly Lazy it ain't.
– Avoiding that was the whole reason for developing standard grammars & their handlers in the first place.
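A minimal sketch of the lex-with-regexen approach described above, for a toy arithmetic input. Note how the token patterns and the handling logic are interleaved: changing the grammar means restructuring this code.

# Hand-rolled lexer: \G with /gc walks the string token by token.
my $input = '1 + 2 * 3';
my @tokens;

while ( $input =~ m{ \G \s* (?: (\d+) | ([-+*/]) ) }gcx )
{
    if    ( defined $1 ) { push @tokens, [ NUM => $1 ] }
    elsif ( defined $2 ) { push @tokens, [ OP  => $2 ] }
}

# @tokens now holds [NUM,1], [OP,'+'], [NUM,2], [OP,'*'], [NUM,3]
# -- and the "grammar" lives entirely in this loop's structure.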
6. Early Perl Grammar Modules
● These take in a YACC grammar and spit out compiler code.
● Intentionally looked like YACC:
– Able to recycle existing YACC grammar files.
– Benefit from using Perl as a built-in lexer.
– Perl-byacc & Parse::Yapp.
● Good: Recycles knowledge for YACC users.
● Bad: Still not lazy. The grammars are difficult to maintain, and you still have to plug in post-processing code to deal with the results.
8. The Swiss Army Chainsaw
● Parse::RecDescent extended the original BNF syntax, combining the tokens & handlers.
● Grammars are largely declarative, using OO Perl to do the heavy lifting.
– The OO interface allows multiple, context-sensitive parsers.
– Rules with Perl blocks allow the code to do anything.
– Results can be acquired from a hash, an array, or $1.
– Left- and right-associativity tags simplify messy situations.
9. Example P::RD
● This is part of an infix formula compiler I wrote.
● It compiles equations to a sequence of closures.

add_op  : '+' | '-' | '%'   { $item[ 1 ] }
mult_op : '*' | '/' | '^'   { $item[ 1 ] }

add     : <leftop: mult add_op mult>
          { compile_binop @{ $item[1] } }

mult    : <leftop: factor mult_op factor>
          { compile_binop @{ $item[1] } }
10. Just enough rope to shoot yourself...
● The biggest problem: P::RD is sloooooooow.
● The learning curve is perl-ish: shallow and long.
– Unless you really know what all of it does, you may not be able to figure out the pieces.
– Lots of really good docs that most people never read.
● Perly blocks also made it look too much like a job-dispatcher.
– People used it for a lot of things that are not compilers.
– Good & bad thing: it really is a compiler.
11. R.I.P. P::RD
● Supposed to be replaced with Parse::FastDescent.
– Damian dropped work on P::FD for Perl6.
– His goal was to replace the shortcomings of P::RD with something more complete, and quite a bit faster.
● The result is Perl6 Grammars.
– Declarative syntax extends matching with rules.
– Built into Perl6 as a structure, not an add-on.
– Much faster.
– Not available in Perl5.
12. Regexp::Grammars
● A Perl5 implementation derived from Perl6.
– Back-porting an idea, not the Perl6 syntax.
– Much better performance than P::RD.
● Extends the v5.10 recursive matching syntax, leveraging the regex engine.
– Most of the speed issues are with regex design, not the parser itself.
– Simplifies mixing code and matching.
– A single place to get the final results.
– Cleaner syntax with automatic whitespace handling.
13. Extending regexen
● “use Regexp::Grammars” turns on the added syntax.
– It is block-scoped (avoids collisions with existing code).
● You will probably want to add “xm” or “xs”:
– Extended syntax (x) avoids whitespace issues.
– Multi-line mode (m) simplifies line anchors for line-oriented parsing.
– Single-line mode (s) makes ignoring line-wrap whitespace largely automatic.
– I use “xm” with explicit “\n” or “\s” matches to span lines where necessary.
14. What you get
● The parser is simply a regex-ref.
– You can bless it or have multiple parsers for context grammars.
● Grammars can reference one another.
– Extending grammars via objects or modules is straightforward.
● Comfortable for incremental development or refactoring.
– The largely declarative syntax helps.
– OOP provides inheritance with overrides for rules.
15. Example: Creating a compiler
● Context can be a do-block, subroutine, or branch logic.
● “data” is the entry rule.
● All this does is read lines into an array with automatic whitespace handling.

my $compiler
= do
{
    use Regexp::Grammars;

    qr
    {
        <data>

        <rule: data > <[text]>+
        <rule: text > .+
    }xm
};
16. Results: %/
● The results of parsing are in a tree-hash named %/.
– Keys are the rule names that produced the results.
– Empty keys ('') hold input text (for errors or
debugging).
– Easy to handle with Data::Dumper.
● The hash has at least one key for the entry rule, plus one
empty key for the input data if context is being saved.
● For example, feeding two lines of a Gentoo emerge
log through the line grammar gives:
17. {
'' => '1367874132: Started emerge on: May 06, 2013
21:02:12
1367874132: *** emerge --jobs --autounmask-write --keep-
going --load-average=4.0 --complete-graph --with-bdeps=y
--deep talk',
data =>
{
'' => '1367874132: Started emerge on: May 06, 2013
21:02:12
1367874132: *** emerge --jobs --autounmask-write --keep-
going --load-average=4.0 --complete-graph --with-bdeps=y
--deep talk',
text =>
[
'1367874132: Started emerge on: May 06, 2013
21:02:12',
'
1367874132: *** emerge --jobs --autounmask-write --keep-
going --load-average=4.0 --complete-graph --with-bdeps=y
--deep talk'
]
Parsing a few lines of logfile
18. Getting rid of context
● The empty-keyed values are useful for
development or explicit error messages.
● They also get in the way and can cost a lot of
memory on large inputs.
● You can turn them on and off with <context:> and
<nocontext:> in the rules.
19. qr
{
<nocontext:> # turn off globally
<data>
<rule: data > <text>+ # oops, left off the []!
<rule: text > .+
}xm;
warn | Repeated subrule <text>+ will only capture its
final match
| (Did you mean <[text]>+ instead?)
|
{
data => {
text => '
1367874132: *** emerge --jobs --autounmask-write --keep-
going --load-average=4.0 --complete-graph --with-bdeps=y
--deep talk'
}
}
You usually want [] with +
20. {
data =>
{
text =>   # the [text] parses to an array of text
[
'1367874132: Started emerge on: May 06, 2013 21:02:12',
'
'1367874132: *** emerge --jobs --autounmask-write --...
],
...
qr
{
<nocontext:> # turn off globally
<data>
<rule: data > <[text]>+
<rule: text > (.+)
}xm;
An array[ref] of text
21. Breaking up lines
● Each log entry is prefixed with an entry id.
● Parsing the ref_id off the front adds:
<data>
<rule: data > <[line]>+
<rule: line > <ref_id> <[text]>
<token: ref_id > ^(\d+)
<rule: text > .+
line =>
[
{
ref_id => '1367874132',
text => ': Started emerge on: May 06, 2013 21:02:12'
},
…
]
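The ref_id split can be sanity-checked with a plain regex outside the grammar; a sketch of what ^(\d+) peels off the front of each entry (sample lines taken from the dump above):

```perl
#!/usr/bin/env perl
# Each emerge log line starts with a numeric id; everything after
# it is the entry text, ": " prefix and all.
use strict;
use warnings;

my @lines = (
    '1367874132: Started emerge on: May 06, 2013 21:02:12',
    '1367874133: *** terminating.',
);

for my $line (@lines) {
    my ($ref_id, $text) = $line =~ /^(\d+)(.+)$/ or next;
    print "$ref_id =>$text\n";
}
```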
22. Removing cruft: “ws”
● It would be nice to remove the leading “: ” from text lines.
● In this case the “whitespace” needs to include a
colon along with the spaces.
● Whitespace is defined by <ws: … >
<rule: line> <ws:[\s:]+> <ref_id> <text>
{
ref_id => '1367874132',
text => '*** emerge --jobs --autounmask-wr...
}
23. The '***' prefix means something
● It would be nice to know what type of line is being
processed.
● <prefix= regex > assigns the regex's capture to the
“prefix” tag:
<rule: line > <ws:[\s:]*> <ref_id> <entry>
<rule: entry >
<prefix=([*][*][*])> <text>
|
<prefix=([>][>][>])> <text>
|
<prefix=([=][=][=])> <text>
|
<prefix=([:][:][:])> <text>
|
<text>
24. {
entry => {
text => 'Started emerge on: May 06, 2013 21:02:12'
},
ref_id => '1367874132'
},
{
entry => {
prefix => '***',
text => 'emerge --jobs --autounmask-write...
},
ref_id => '1367874132'
},
{
entry => {
prefix => '>>>',
text => 'emerge (1 of 2) sys-apps/...
},
ref_id => '1367874256'
}
“entry” now contains optional prefix
25. Aliases can also assign tag results
● Aliases assign a
key to rule
results.
● The match from
“text” is aliased
to a named type
of log entry.
<rule: entry>
<prefix=([*][*][*])> <command=text>
|
<prefix=([>][>][>])> <stage=text>
|
<prefix=([=][=][=])> <status=text>
|
<prefix=([:][:][:])> <final=text>
|
<message=text>
27. Parsing without capturing
● At this point we don't really need the prefix strings
since the entries are labeled.
● A leading '.' tells R::G to parse but not store the
results in %/:
<rule: entry >
<.prefix=([*][*][*])> <command=text>
|
<.prefix=([>][>][>])> <stage=text>
|
<.prefix=([=][=][=])> <status=text>
|
<.prefix=([:][:][:])> <final=text>
|
<message=text>
29. The “entry” nesting gets in the way
● The named subrule is not hard to get rid of: just
move its syntax up one level:
<ws:[\s:]*> <ref_id>
(
<.prefix=([*][*][*])> <command=text>
|
<.prefix=([>][>][>])> <stage=text>
|
<.prefix=([=][=][=])> <status=text>
|
<.prefix=([:][:][:])> <final=text>
|
<message=text>
)
30. data => {
line => [
{
message => 'Started emerge on: May 06, 2013 21:02:12',
ref_id => '1367874132'
},
{
command => 'emerge --jobs --autounmask-write --keep-
going --load-average=4.0 --complete-graph --with-bdeps=y --deep
talk',
ref_id => '1367874132'
},
{
command => 'terminating.',
ref_id => '1367874133'
},
{
message => 'Started emerge on: May 06, 2013 21:02:17',
ref_id => '1367874137'
},
Result: array of “line” with ref_id & type
31. Funny names for things
● Maybe “command” and “status” aren't the best way
to distinguish the text.
● You can store an optional token followed by text:
<rule: entry > <ws:[\s:]*> <ref_id> <type>? <text>
<token: type>
(
[*][*][*]
|
[>][>][>]
|
[=][=][=]
|
[:][:][:]
)
32. Entries now have “text” and “type”
entry => [
{
ref_id => '1367874132',
text => 'Started emerge on: May 06, 2013 21:02:12'
},
{
ref_id => '1367874133',
text => 'terminating.',
type => '***'
},
{
ref_id => '1367874137',
text => 'Started emerge on: May 06, 2013 21:02:17'
},
{
ref_id => '1367874137',
text => 'emerge --jobs --autounmask-write --...
type => '***'
},
33. Prefix alternations look ugly
● Using a count works:
[*]{3} | [>]{3} | [:]{3} | [=]{3}
but isn't all that much more readable.
● Given the way these are used, use a character class:
[*>:=] {3}
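The class and the alternation are easy to compare in core Perl. One subtlety worth noting: the class also admits mixed runs like “*>=”, which the alternation rejects; for this log format that looseness is harmless:

```perl
#!/usr/bin/env perl
# Compare the character-class prefix with the alternation form.
use strict;
use warnings;

my $class = qr/ [*>:=] {3} /x;
my $alt   = qr/ [*]{3} | [>]{3} | [:]{3} | [=]{3} /x;

for my $tag ( '***', '>>>', '===', ':::' ) {
    die "$tag" unless $tag =~ /\A$class\z/ && $tag =~ /\A$alt\z/;
}
print '*>=' =~ /\A$class\z/ ? "class: mixed ok\n" : "class: no\n";
print '*>=' =~ /\A$alt\z/   ? "alt: mixed ok\n"   : "alt: no\n";
```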
34. qr
{
<nocontext:>
<data>
<rule: data > <[entry]>+
<rule: entry >
<ws:[\s:]*>
<ref_id> <prefix>? <text>
<token: ref_id > ^(\d+)
<token: prefix > [*>=:]{3}
<token: text > .+
}xm;
This is the skeleton parser:
● Doesn't take much:
– Declarative syntax.
– No Perl code at all!
● Easy to modify by
extending the
definition of “text”
for specific types of
messages.
35. Finishing the parser
● Given the different line types, it will be useful to
extract commands, switches, and outcomes from the
appropriate lines.
– Sub-rules can be defined for the different line types.
<rule: command> “emerge”
<.ws><[switch]>+
<token: switch> ([-][-]\S+)
● This is what makes the grammars useful: nested,
context-sensitive content.
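Outside the grammar, the switch token is an ordinary global match; a sketch of what <[switch]>+ would collect from one command line:

```perl
#!/usr/bin/env perl
# The "--switch" token as a plain global match: each switch is two
# dashes followed by non-whitespace.
use strict;
use warnings;

my $cmd = 'emerge --jobs --autounmask-write --load-average=4.0 --deep talk';
my @switches = $cmd =~ /([-][-]\S+)/g;
print "$_\n" for @switches;
```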
36. Inheriting & Extending Grammars
● <grammar: name> and <extends: name> allow a
building-block approach.
● Code can assemble the contents of a qr{} without
having to eval code or deal with messy quoted strings.
● This makes modular or context-sensitive grammars
relatively simple to compose.
– References can cross package or module boundaries.
– Easy to define a basic grammar in one place and reference
or extend it from multiple other parsers.
37. The Non-Redundant File
● NCBI's “nr.gz” file is a list of sequences and all of
the places they are known to appear.
● It is moderately large: 140+GB uncompressed.
● The file consists of a simple FASTA format with
headings separated by ctrl-A characters:
>Heading 1
[amino-acid sequence characters...]
>Heading 2
...
38. Example: A short nr.gz FASTA entry
● Headings are grouped by species, separated by ctrl-A
(“\cA”) characters.
– Each species has a set of source & identifier pairs
followed by a single description.
– The within-species separator is a pipe (“|”) with optional
whitespace.
– Species counts in some headers run into the thousands.
>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827
[Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI
RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1|
calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1|
hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQ...
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEK...
VQKLLNPDQ
39. First step: Parse FASTA
qr
{
<grammar: Parse::Fasta>
<nocontext:>
<rule: fasta > <.start> <head> <.ws> <[body]>+
<rule: head > .+ <.ws>
<rule: body > ( <[seq]> | <.comment> ) <.ws>
<token: start > ^ [>]
<token: comment > ^ [;] .+
<token: seq > ^ [\n\w\-]+
}xm;
● Instead of defining an entry rule, this just defines a
name “Parse::Fasta”.
– This cannot be used to generate results by itself.
– Accessible anywhere via Regexp::Grammars.
40. The output needs help, however.
● The “<seq>” token captures newlines that need to be
stripped out to get a single string.
● Munging these requires adding code to the parser using
Perl's regex code-block syntax: (?{...})
– Allows inserting almost-arbitrary code into the regex.
– “almost” because the code cannot include regexen.
seq =>
[ 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYD
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP
VQKLLNPDQ
'
]
41. Munging results: $MATCH
● The $MATCH and %MATCH can be assigned to alter
the results from the current or lower levels of the parse.
● In this case I take the “seq” match contents out of %/,
join them with nothing, and use “tr” to strip the
newlines.
– join + split won't work because split uses a regex.
<rule: body > ( <[seq]> | <.comment> ) <.ws>
(?{
$MATCH = join '' => @{ delete $MATCH{ seq } };
$MATCH =~ tr/\n//d;
})
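The cleanup itself is plain Perl; a core-only sketch of the join-then-tr step (the sample chunks are hypothetical):

```perl
#!/usr/bin/env perl
# Join the captured seq chunks and delete embedded newlines with
# tr///, which is not a regex and so is safe inside (?{ ... }).
use strict;
use warnings;

my @seq  = ( "MASTQNIVEE\nVQKMLDTYDT\n", "VQKLLNPDQ\n" );
my $body = join '' => @seq;
$body =~ tr/\n//d;
print "$body\n";   # MASTQNIVEEVQKMLDTYDTVQKLLNPDQ
```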
42. One more step: Remove the arrayref
● Now the body is a single string.
● No need for an arrayref to contain one string.
● Since the body has one entry, assign offset zero:
body =>
[
'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDK
DNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDT
KDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ'
],
<rule: fasta> <.start> <head> <.ws> <[body]>+
(?{
$MATCH{ body } = $MATCH{ body }[0];
})
43. Result: a generic FASTA parser.
{
fasta => [
{
body =>
'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDK
DNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDIT
KDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
head => 'gi|66816243|ref|XP_642131.1| hypothetical p
rotein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556
|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=C
AF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium
discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein
DDB_G0277827 [Dictyostelium discoideum AX4]
'
}
]
}
● The head and body are easily accessible.
● Next: parse the nr-specific header.
44. Deriving a grammar
● Existing grammars are “extended”.
● The derived grammars are capable of producing
results.
● In this case:
● References the grammar and extracts a list of fasta
entries.
<extends: Parse::Fasta>
<[fasta]>+
45. Splitting the head into identifiers
● Overriding fasta's “head” rule allows splitting out
identifiers for individual species.
● Catch: \cA is a separator, not a terminator.
– The tail item on the list doesn't have a \cA to anchor on.
– Using “.+ [\cA\n]” walks off the header onto the sequence.
– This is a common problem with separators & tokenizers.
– This can be handled with special tokens in the grammar,
but R::G provides a cleaner way.
46. First pass: Literal “tail” item
● This works but is ugly:
– Have two rules for the main list and tail.
– Alias the tail to get them all in one place.
<rule: head> <[ident]>+ <[ident=final]>
(?{
# remove the matched anchors
tr/\cA\n//d for @{ $MATCH{ ident } };
})
<token: ident > .+? \cA
<token: final > .+ \n
47. Breaking up the header
● The last header item is aliased to “ident”.
● Breaks up all of the entries:
head => {
ident => [
'gi|66816243|ref|XP_642131.1| hypothetical protein
DDB_G0277827 [Dictyostelium discoideum AX4]',
'gi|1705556|sp|P54670.1|CAF1_DICDI RecName:
Full=Calfumirin-1; Short=CAF-1',
'gi|793761|dbj|BAA06266.1| calfumirin-1
[Dictyostelium discoideum]',
'gi|60470106|gb|EAL68086.1| hypothetical protein
DDB_G0277827 [Dictyostelium discoideum AX4]'
]
}
48. Dealing with separators: '%' <sep>
● Separators happen often enough:
– 1, 2, 3 , 4 ,13, 91 # numbers by commas, spaces
– g-c-a-g-t-t-a-c-a # characters by dashes
– /usr/local/bin # basenames by dir markers
– /usr:/usr/local:bin # dir's separated by colons
that R::G has special syntax for dealing with them.
● Combining the item with '%' and a separator:
<rule: list> <[item]>+ % <separator> # one-or-more
<rule: list_zom> <[item]>* % <separator> # zero-or-more
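For comparison, the closest core-Perl idiom is splitting on the separator pattern after the fact; R::G's '%' does the same job declaratively inside the grammar:

```perl
#!/usr/bin/env perl
# Core-Perl analogue of '<[item]>+ % <separator>': split on the
# separator pattern and keep the items.
use strict;
use warnings;

my @nums = split /\s*,\s*/, '1, 2, 3 , 4 ,13, 91';
my @dirs = split /:/,       '/usr:/usr/local:bin';
print "@nums\n";   # 1 2 3 4 13 91
print "@dirs\n";   # /usr /usr/local bin
```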
49. Cleaner nr.gz header rule
● Separator syntax cleans things up:
– No more tail rule with an alias.
– No code block required to strip the separators and trailing
newline.
– Non-greedy match “.+?” avoids capturing separators.
qr
{
<nocontext:>
<extends: Parse::Fasta>
<[fasta]>+
<rule: head > <[ident]>+ % [\cA]
<token: ident > .+?
}xm
50. Nested “ident” tag is extraneous
● Simpler to replace the “head” with a list of
identifiers.
● Replace $MATCH from the “head” rule with the
nested identifier contents:
qr
{
<nocontext:>
<extends: Parse::Fasta>
<[fasta]>+
<rule: head > <[ident]>+ % [\cA]
(?{
$MATCH = delete $MATCH{ ident };
})
<token: ident > .+?
}xm
51. Result:
{
fasta => [
{
body => 'MASTQNIVEEVQKMLDT...NPDQ',
head => [
'gi|66816243|ref|XP_6...rt=CAF-1',
'gi|793761|dbj|BAA0626...oideum]',
'gi|60470106|gb|EAL68086...m discoideum AX4]'
]
}
]
}
● The fasta content is broken into the usual “body” plus
a “head” broken down on cA boundaries.
● Not bad for a dozen lines of grammar with a few
lines of code:
52. One more level of structure: idents.
● Species have <source > | <identifier> pairs followed
by a description.
● Add a separator clause “ % (?: \s* [|] \s* )”
– This can be parsed into a hash something like:
gi|66816243|ref|XP_642131.1|hypothetical ...
Becomes:
{
gi => '66816243',
ref => 'XP_642131.1',
desc => 'hypothetical...'
}
53. Munging the separated input
<fasta>
(?{
my $identz = delete $MATCH{ fasta }{ head }{ ident };
for( @$identz )
{
my $pairz = $_->{ taxa };
my $desc = pop @$pairz;
$_ = { @$pairz, desc => $desc }
}
$MATCH{ fasta }{ head } = $identz;
})
<rule: head > <[ident]>+ % [\cA]
<token: ident > <[taxa]>+ % (?: \s* [|] \s* )
<token: taxa > .+?
54. Result: head with sources, “desc”
{
fasta => {
body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKR...EDQN',
head => [
{
desc => '30S ribosomal protein S18 [Lactococ...
gi => '15674171',
ref => 'NP_268346.1'
},
{
desc => '30S ribosomal protein S18 [Lactoco...
gi => '116513137',
ref => 'YP_812044.1'
},
...
55. Balancing R::G with calling code
● The regex engine could process all of nr.gz.
– Catch: <[fasta]>+ returns about 250_000 keys and literally
millions of total identifiers in the heads.
– Better approach: <fasta> on single entries, but chunking input
on '>' removes it as a leading character.
– Making it optional with <.start>? fixes the problem:
local $/ = '>';
while( my $chunk = readline )
{
chomp;
length $chunk or do { --$.; next };
$chunk =~ $nr_gz;
# process single fasta record in %/
}
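The loop runs unchanged against an in-memory filehandle, which makes the record semantics easy to check (the FASTA sample is hypothetical):

```perl
#!/usr/bin/env perl
# Chunking on '>': each readline returns one FASTA record; chomp
# strips the trailing '>' and the first, empty chunk is skipped.
use strict;
use warnings;

my $fasta = ">head1\nSEQA\n>head2\nSEQB\n";
open my $fh, '<', \$fasta or die "open: $!";

local $/ = '>';
while ( my $chunk = readline $fh ) {
    chomp $chunk;
    length $chunk or next;        # leading '>' yields an empty chunk
    my ($head) = $chunk =~ /^(.+)$/m;
    print "record: $head\n";
}
```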
56. Fasta base grammar: 3 lines of code
qr
{
<grammar: Parse::Fasta>
<nocontext:>
<rule: fasta > <.start>? <head> <.ws> <[body]>+
(?{
$MATCH{ body } = $MATCH{ body }[0];
})
<rule: head > .+ <.ws>
<rule: body > ( <[seq]> | <.comment> ) <.ws>
(?{
$MATCH = join '' => @{ delete $MATCH{ seq } };
$MATCH =~ tr/\n//d;
})
<token: start > ^ [>]
<token: comment > ^ [;] .+
<token: seq > ^ ( [\n\w\-]+ )
}xm;
57. Extension to Fasta: 6 lines of code.
qr
{
<nocontext:>
<extends: Parse::Fasta>
<fasta>
(?{
my $identz = delete $MATCH{ fasta }{ head }{ ident };
for( @$identz )
{
my $pairz = $_->{ taxa };
my $desc = pop @$pairz;
$_ = { @$pairz, desc => $desc };
}
$MATCH{ fasta }{ head } = $identz;
})
<rule: head > <[ident]>+ % [\cA]
<rule: ident > <[taxa]>+ % (?: \s* [|] \s* )
<token: taxa > .+?
}xm
58. Result: Use grammars
● Most of the “real” work is done under the hood.
– Regexp::Grammars does the lexing, basic compilation.
– Code only needed for cleanups or re-arranging structs.
● Code can simplify your grammar.
– Too much code makes them hard to maintain.
– The trick is keeping the balance between simplicity in the
grammar and cleanup in the code.
● Either way, the result is going to be more
maintainable than hardwiring the grammar into code.
59. Aside: KwikFix for Perl v5.18
● v5.17 changed how the regex engine handles inline
code.
● Code that used to be eval-ed in the regex is now
compiled up front.
– This requires “use re 'eval'” and “no strict 'vars'”.
– One for the Perl code, the other for $MATCH and friends.
● The immediate fix for this is in the last few lines of
R::G::import, which push the pragmas into the caller:
require re;     re->import( 'eval' );
require strict; strict->unimport( 'vars' );
● Look up $^H in perlvars to see how it works.
60. Use Regexp::Grammars
● Unless you have old YACC BNF grammars to
convert, the newer facility for defining the
grammars is cleaner.
– Frankly, even if you do have old grammars...
● Regexp::Grammars avoids the performance pitfalls
of P::RD.
– It is worth taking the time to learn how to optimize NFA
regexen, however.
● Or, better yet, use Perl6 grammars, available today
at your local copy of Rakudo Perl6.
61. More info on Regexp::Grammars
● The POD is thorough and quite descriptive
[comfortable chair, enjoyable beverage suggested].
● The ./demo directory has a number of working – if
un-annotated – examples.
● “perldoc perlre” shows how recursive matching works
in v5.10+.
● PerlMonks has plenty of good postings.
● Perl Review article by brian d foy on recursive
matching in Perl 5.10.