parboiled2 -- A Macro-Based PEG Parser Generator for Scala 2.12+
=================================================================
.. contents:: Contents of this Document
parboiled2 is a Scala 2.12+ library enabling lightweight and easy-to-use, yet powerful, fast and elegant parsing of
arbitrary input text. It implements a macro-based parser generator for `Parsing Expression Grammars`_ (PEGs), which
runs at compile time and translates a grammar rule definition (written in an internal Scala DSL) into corresponding JVM
bytecode.
PEGs are an alternative to `Context-Free Grammars`_ (CFGs) for formally specifying syntax. They make a good replacement
for regular expressions and have some advantages over the "traditional" way of building parsers via CFGs (like not
needing a separate lexer/scanner phase).
parboiled2 is the successor of `parboiled 1.x`_, which provides a similar capability (for Scala as well as Java) but
does not actually generate a parser. Rather, `parboiled 1.x`_ interprets a rule tree structure (which is also created
via an internal DSL) against the input, which results in a much lower parsing performance.
For more info on how `parboiled 1.x`_ and parboiled2 compare see `parboiled2 vs. parboiled 1.x`_.
You might also be interested in reading about `parboiled2 vs. Scala Parser Combinators`_ and
`parboiled2 vs. Regular Expressions`_.
.. _PEG:
.. _Parsing Expression Grammars: https://en.wikipedia.org/wiki/Parsing_expression_grammar
.. _Context-Free Grammars: https://en.wikipedia.org/wiki/Context-free_grammar
.. _parboiled 1.x: http://parboiled.org
parboiled2 is based on `Parsing Expression Grammars`_, for effectively dealing with most real-world parsing needs.

The artifacts for parboiled2 live on `Maven Central`_ and can be tied into your SBT-based Scala project like this:
.. code:: Scala
    libraryDependencies += "org.parboiled" %% "parboiled" % "2.5.1"
The latest released version is 2.5.1. It is available for Scala 2.12, 2.13 and 3, Scala.js and Scala Native.
parboiled2 has no external dependencies. (It used to depend on shapeless_, but the few bits it was using have since been internalized.)
Once on your classpath you can use this single import to bring everything you need into scope:
.. code:: Scala
    import org.parboiled2._
.. _Maven Central: https://search.maven.org/search?q=g:org.parboiled
.. _shapeless: https://github.com/milessabin/shapeless
This is what a simple parboiled2 parser looks like:
.. code:: Scala
    import org.parboiled2._

    class Calculator(val input: ParserInput) extends Parser {
      def InputLine = rule { Expression ~ EOI }

      def Expression: Rule1[Int] = rule {
        Term ~ zeroOrMore(
          '+' ~ Term ~> ((_: Int) + _)
        | '-' ~ Term ~> ((_: Int) - _))
      }

      def Term = rule {
        Factor ~ zeroOrMore(
          '*' ~ Factor ~> ((_: Int) * _)
        | '/' ~ Factor ~> ((_: Int) / _))
      }

      def Factor = rule { Number | Parens }

      def Parens = rule { '(' ~ Expression ~ ')' }

      def Number = rule { capture(Digits) ~> (_.toInt) }

      def Digits = rule { oneOrMore(CharPredicate.Digit) }
    }

    new Calculator("1+1").InputLine.run() // evaluates to `scala.util.Success(2)`
This implements a parser for simple integer expressions like ``1+(2-3*4)/5`` and runs the actual calculation in-phase
with the parser. If you'd like to see it run and try it out yourself check out `Running the Examples`_.
A parboiled2 parser is a class deriving from ``org.parboiled2.Parser``, which defines one abstract member:
.. code:: Scala
    def input: ParserInput
holding the input for the parsing run. Usually it is best implemented as a ``val`` parameter in the constructor
(as shown in the Example_ above). As you can see from this design, you need to (re-)create a new parser instance for
every parsing run (parser instances are very lightweight).
The "productions" (or "rules") of your grammar are then defined as simple methods, which in most cases consist of a
single call to the rule
macro whose argument is a DSL expression
_ defining what input the rule is to match and
what actions_ to perform.
In order to run your parser against a given input you create a new instance and call ``run()`` on the top-level rule,
e.g.:
.. code:: Scala
    val parser = new MyParser(input)
    parser.topLevelRule.run() // by default returns a ``scala.util.Try``
For more info on what options you have with regard to accessing the results of a parsing run check out the section
on `Access to Parser Results`_.
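As a small illustrative sketch of the default ``Try``-based result handling (the parser and rule names below are invented for this example), a parsing run might be evaluated like this, with ``formatError`` turning a ``ParseError`` into a readable message:

.. code:: Scala

    import org.parboiled2._
    import scala.util.{Failure, Success}

    // Hypothetical parser used only for this illustration.
    class CsvLineParser(val input: ParserInput) extends Parser {
      def Line = rule { oneOrMore(Field).separatedBy(',') ~ EOI }
      def Field = rule { capture(oneOrMore(noneOf(","))) }
    }

    object RunExample extends App {
      val parser = new CsvLineParser("a,b,c")
      parser.Line.run() match {
        case Success(fields)        => println(s"Parsed fields: $fields")
        case Failure(e: ParseError) => println(parser.formatError(e)) // human-readable error report
        case Failure(e)             => println(s"Unexpected error: $e")
      }
    }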
.. _DSL expression: `The Rule DSL`_
.. _actions: `Parser Actions`_
PEG_ parsers are quite easy to understand as they work just like most people without a lot of background in parsing theory would build a parser "by hand": recursive-descent with backtracking. They have only one parsing phase (not two, like most parsers produced by traditional parser generators like ANTLR_), do not require any look-ahead and perform quite well in most real-world scenarios (although they can exhibit exponential runtime for certain pathological languages and inputs).
A PEG_ parser consists of a number of rules that logically form a "tree", with one "root" rule at the top calling zero or more lower-level rules, which can each call other rules and so on. Since rules can also call themselves or any of their parents, the rule "tree" is not really a tree but rather a potentially cyclic directed graph. In most cases, though, the tree structure dominates, which is why it's useful to think of it as a tree with potential cycles.
When a rule is executed against the current position in an input buffer it applies its specific matching logic to the input, which can either succeed or fail. In the success case the parser advances the input position (the cursor) and potentially executes the next rule. Otherwise, when the rule fails, the cursor is reset and the parser backtracks in search of another parsing alternative that might succeed.
For example consider this simple parboiled2 rule:
.. code:: Scala
    def foo = rule { 'a' ~ ('b' ~ 'c' | 'b' ~ 'd') }
When this rule is confronted with the input ``abd`` the parser matches the input in these steps:
1. ``foo`` starts executing, which calls its first sub-rule ``'a'``. The cursor is at position 0.
2. ``'a'`` is executed against input position 0, matches (succeeds) and the cursor is advanced to position 1.
3. ``'b' ~ 'c' | 'b' ~ 'd'`` starts executing, which calls its first sub-rule ``'b' ~ 'c'``.
4. ``'b' ~ 'c'`` starts executing, which calls its first sub-rule ``'b'``.
5. ``'b'`` is executed against input position 1, matches (succeeds) and the cursor is advanced to position 2.
6. ``'c'`` is executed against input position 2 and mismatches (fails).
7. ``'b' ~ 'c' | 'b' ~ 'd'`` notices that its first sub-rule has failed, resets the cursor to position 1 and
   calls its 2nd sub-rule ``'b' ~ 'd'``.
8. ``'b' ~ 'd'`` starts executing, which calls its first sub-rule ``'b'``.
9. ``'b'`` is executed against input position 1, matches and the cursor is advanced to position 2.
10. ``'d'`` is executed against input position 2, matches and the cursor is advanced to position 3.
11. ``'b' ~ 'd'`` completes successfully, as its last sub-rule has succeeded.
12. ``'b' ~ 'c' | 'b' ~ 'd'`` completes successfully, as one of its sub-rules has succeeded.
13. ``foo`` completes execution successfully, as its last sub-rule has succeeded.

The whole input "abd" was matched and the cursor is left at position 3 (after the last-matched character).

.. _ANTLR: https://www.antlr.org/
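To make the walk-through concrete, here is a minimal, self-contained sketch (the wrapper class name is invented for this illustration) that wires the ``foo`` rule into a parser and runs it against a few inputs:

.. code:: Scala

    import org.parboiled2._
    import scala.util.Try

    // Minimal parser wrapping the `foo` rule from the walk-through above.
    class FooParser(val input: ParserInput) extends Parser {
      def foo = rule { 'a' ~ ('b' ~ 'c' | 'b' ~ 'd') }
    }

    object FooDemo extends App {
      def check(s: String): Try[Unit] = new FooParser(s).foo.run()

      println(check("abd").isSuccess) // true  -- second alternative matches after backtracking
      println(check("abc").isSuccess) // true  -- first alternative matches directly
      println(check("abx").isSuccess) // false -- neither alternative matches
      // note: without an `~ EOI` at the end, any input remaining after position 3 is simply ignored
    }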
In order to work with parboiled2 effectively you should understand the core concepts behind its rule DSL, mainly the "Value Stack" and how parboiled2 encodes value stack operations in the Scala type system.
Apart from the input buffer and the cursor the parser manages another important structure: the "Value Stack".
The value stack is a simple stack construct that serves as temporary storage for your `Parser Actions`_. In many cases
it is used for constructing an AST_ during the parsing run but it can also be used for "in-phase" computations
(like in the Example_ above) or for any other purpose.
When a rule of a parboiled2 parser executes it performs any combination of the following three things:

- match input, i.e. advance the input cursor
- operate on the value stack, i.e. pop values off and/or push values onto the value stack
- perform side-effects

Matching input is done by calling `Basic Character Matching`_ rules, which do nothing but match input and advance
the cursor. Value stack operations (and other potential side-effects) are performed by `Parser Actions`_.
It is important to understand that rules in parboiled2 (i.e. the rule methods in your parser class) do not directly return some custom value as a method result. Instead, all their consuming and producing of values happens as side-effects on the value stack. The way a rule interacts with the value stack is encoded in the rule's type.
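Here is a small sketch of this principle (the class and rule names are invented for the example): the first rule pushes an ``Int`` onto the value stack via ``capture`` and an action, the second pops two of them off again and pushes their sum:

.. code:: Scala

    import org.parboiled2._

    // Illustration only: rule results live on the value stack, not in method return values.
    class StackDemo(val input: ParserInput) extends Parser {
      // Pushes one Int onto the value stack -> Rule1[Int]
      def Number: Rule1[Int] = rule { capture(oneOrMore(CharPredicate.Digit)) ~> (_.toInt) }

      // Pops two Ints and pushes their sum -> still a Rule1[Int] overall
      def Sum: Rule1[Int] = rule { Number ~ '+' ~ Number ~> ((a: Int, b: Int) => a + b) }

      def Line = rule { Sum ~ EOI }
    }

    // new StackDemo("20+22").Line.run() // evaluates to Success(42)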
This is the general definition of a parboiled2 rule:
.. code:: Scala
    class Rule[-I <: HList, +O <: HList]
This can look scary at first but is really quite simple. An ``HList`` is defined by shapeless_ and is essentially a type
of list whose element number and element types are statically known at compile time. The ``I`` type parameter on
``Rule`` encodes what values (the number and types) the rule pops off the value stack and the ``O`` type parameter
encodes what values (the number and types) the rule then pushes onto the value stack.
Luckily, in most cases, you won't have to work with these types directly as they can either be inferred or you can use one of these predefined aliases:
.. code:: Scala
    type Rule0 = RuleN[HNil]
    type Rule1[+T] = RuleN[T :: HNil]
    type Rule2[+A, +B] = RuleN[A :: B :: HNil]
    type RuleN[+L <: HList] = Rule[HNil, L]
    type PopRule[-L <: HList] = Rule[L, HNil]
Here is what these type aliases denote:
``Rule0``
    A rule that neither pops off nor pushes to the value stack, i.e. has no effect on the value stack whatsoever.
    All `Basic Character Matching`_ rules are of this type.

``Rule1[+T]``
    Pushes exactly one value of type ``T`` onto the value stack. After ``Rule0`` this is the second-most frequently
    used rule type.

``Rule2[+A, +B]``
    Pushes exactly two values of types ``A`` and ``B`` onto the value stack.

``RuleN[+L <: HList]``
    Pushes a number of values onto the value stack, which correspond to the given ``L <: HList`` type parameter.

``PopRule[-L <: HList]``
    Pops a number of values off the value stack (corresponding to the given ``L <: HList`` type parameter) and does
    not produce any new value itself.
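The following sketch (class and rule names invented for the illustration) shows what rules of the most common alias types typically look like in practice:

.. code:: Scala

    import org.parboiled2._

    // Illustration of the common rule type aliases.
    class AliasDemo(val input: ParserInput) extends Parser {
      // Rule0: matches input but leaves the value stack untouched
      def Ws: Rule0 = rule { zeroOrMore(' ') }

      // Rule1[String]: pushes the matched text onto the value stack
      def Word: Rule1[String] = rule { capture(oneOrMore(CharPredicate.Alpha)) }

      // Rule2[String, Int]: pushes two values
      def KeyValue: Rule2[String, Int] = rule {
        Word ~ ':' ~ capture(oneOrMore(CharPredicate.Digit)) ~> ((s: String) => s.toInt)
      }
    }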
The rule DSL makes sure that the rule types are properly assembled and carried through your rule structure as you
combine `Basic Character Matching`_ with `Rule Combinators and Modifiers`_ and `Parser Actions`_, so
as long as you don't write any logic that circumvents the value stack your parser will be completely type-safe and
the compiler will be able to catch you if you make mistakes by combining rules in an unsound way.
.. _AST: https://en.wikipedia.org/wiki/Abstract_syntax_tree
The following basic character matching rules are the only way to cause the parser to match actual input and
"make progress". They are the "atomic" elements of the rule DSL which are then used by the
`Rule Combinators and Modifiers`_ to form higher-level rules.
implicit def ch(c: Char): Rule0
    ``Char`` values can be directly used in the rule DSL and match themselves. There is one notable case where you will
    have to use the explicit ``ch`` wrapper: You cannot use the ``|`` operator directly on chars as it denotes the
    built-in Scala binary "or" operator defined on numeric types (``Char`` is an unsigned 16-bit integer).
    So rather than saying ``'a' | 'b'`` you will have to say ``ch('a') | 'b'``.
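Here is a brief sketch of the difference (class and rule names invented for the illustration):

.. code:: Scala

    import org.parboiled2._

    class CharDemo(val input: ParserInput) extends Parser {
      // OK: the explicit `ch` wrapper turns the first Char into a Rule0, so `|` is the rule alternative
      def Sign: Rule0 = rule { ch('+') | '-' }

      // Does NOT do what you want: '+' | '-' is evaluated as the numeric bitwise-or of two Chars
      // def BrokenSign: Rule0 = rule { '+' | '-' }
    }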
implicit def str(s: String): Rule0
    ``String`` values can be directly used in the rule DSL and match themselves.
implicit def predicate(p: CharPredicate): Rule0
    You can use ``org.parboiled2.CharPredicate`` values directly in the rule DSL. ``CharPredicate`` is an efficient
    implementation of character sets and already comes with a number of pre-defined character classes like
    ``CharPredicate.Digit`` or ``CharPredicate.LowerHexLetter``.
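As a small sketch (the identifier rules are invented for this example), custom predicates are typically built by combining the pre-defined ones with plain characters or strings:

.. code:: Scala

    import org.parboiled2._

    class IdentifierParser(val input: ParserInput) extends Parser {
      // Custom predicates can be built from strings and combined with the pre-defined ones.
      val IdentifierStart = CharPredicate.Alpha ++ "_"
      val IdentifierChar  = IdentifierStart ++ CharPredicate.Digit

      def Identifier: Rule1[String] = rule {
        capture(IdentifierStart ~ zeroOrMore(IdentifierChar))
      }
    }

    // new IdentifierParser("_foo42").Identifier.run() // evaluates to Success("_foo42")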
implicit def valueMap[T](m: Map[String, T]): R
    Values of type ``Map[String, T]`` can be directly used in the rule DSL and match any of the given map's keys and
    push the respective value upon a successful match. The resulting rule type depends on ``T``:

    =================== =========================================
    ``T``               ``R``
    =================== =========================================
    ``Unit``            ``Rule0``
    ``L <: HList``      ``RuleN[L]`` (pushes all values of ``L``)
    ``T`` (otherwise)   ``Rule1[T]`` (pushes only one value)
    =================== =========================================
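As a sketch of this feature (class and rule names invented for the illustration), a map from keywords to values can serve directly as a rule that both matches input and produces a result:

.. code:: Scala

    import org.parboiled2._

    class BoolParser(val input: ParserInput) extends Parser {
      private val boolMap = Map("true" -> true, "false" -> false)

      // The Map is lifted into a Rule1[Boolean]: it matches one of the keys and pushes the mapped value.
      def BoolLiteral: Rule1[Boolean] = rule { boolMap }

      def Line = rule { BoolLiteral ~ EOI }
    }

    // new BoolParser("true").Line.run()  // evaluates to Success(true)
    // new BoolParser("maybe").Line.run() // evaluates to a Failure carrying a ParseError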
def anyOf(chars: String): Rule0
    This constructs a ``Rule0`` which matches any of the given string's characters.

def noneOf(chars: String): Rule0
    This constructs a ``Rule0`` which matches any single character except the ones in the given string and except EOI.
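A short sketch (names invented for the illustration) of both helpers, e.g. for matching a quoted string:

.. code:: Scala

    import org.parboiled2._

    class QuotedStringParser(val input: ParserInput) extends Parser {
      // anyOf: matches exactly one of the listed characters
      def Quote: Rule0 = rule { anyOf("'\"") }

      // noneOf: matches any single character that is not a double quote (and not EOI)
      def QuotedString: Rule1[String] = rule {
        '"' ~ capture(zeroOrMore(noneOf("\""))) ~ '"'
      }
    }

    // new QuotedStringParser("\"hello\"").QuotedString.run() // evaluates to Success("hello")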
def ignoreCase(c: Char): Rule0
    Matches the given single character case-insensitively.
    Note: The given character must be specified in lower-case! This requirement is currently NOT enforced!

def ignoreCase(s: String): Rule0
    Matches the given string of characters case-insensitively.
    Note: The given string must be specified in all lower-case! This requirement is currently NOT enforced!

def ANY: Rule0
    Matches any character except EOI (end-of-input).

def EOI: Char
    The EOI (end-of-input) character, which is a virtual character that the parser "appends" after the last
    character of the actual input.

def MATCH: Rule0
    Matches no character (i.e. doesn't cause the parser to make any progress) but always succeeds. It's the "empty"
    rule that is mostly used as a neutral element in rule composition.

def MISMATCH[I <: HList, O <: HList]: Rule[I, O]
    A rule that always fails. Fits any rule signature.

def MISMATCH0: Rule0
    Same as ``MISMATCH`` but with a clearly defined type. Use it (rather than ``MISMATCH``) if the call site doesn't
    clearly "dictate" a certain rule type and using ``MISMATCH`` therefore gives you a compiler error.
Rules can be freely combined/modified with these operations:
a ~ b
    Two rules ``a`` and ``b`` can be combined with the ``~`` operator resulting in a rule that only matches if first
    ``a`` matches and then ``b`` matches. The computation of the resulting rule type is somewhat involved.
    Here is an illustration (using an abbreviated HList notation):

    ====================== ==================== =========================
    a                      b                    a ~ b
    ====================== ==================== =========================
    ``Rule[, A]``          ``Rule[, B]``        ``Rule[, A:B]``
    ``Rule[A:B:C, D:E:F]`` ``Rule[F, G:H]``     ``Rule[A:B:C, D:E:G:H]``
    ``Rule[A, B:C]``       ``Rule[D:B:C, E:F]`` ``Rule[D:A, E:F]``
    ``Rule[A, B:C]``       ``Rule[D:C, E:F]``   Illegal if ``D`` != ``B``
    ====================== ==================== =========================
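In practice you rarely have to spell these types out, since the compiler computes them for you. Here is a small sketch (names invented for the illustration) of how the value-stack types flow through a ``~`` sequence:

.. code:: Scala

    import org.parboiled2._

    class PairParser(val input: ParserInput) extends Parser {
      def Number: Rule1[Int] = rule { capture(oneOrMore(CharPredicate.Digit)) ~> (_.toInt) }

      // Rule1[Int] ~ Rule0 ~ Rule1[Int] ==> Rule2[Int, Int]
      def Pair: Rule2[Int, Int] = rule { Number ~ ',' ~ Number }

      // The two Ints are then popped by a 2-arg action, yielding a Rule1[(Int, Int)]
      def PairAsTuple: Rule1[(Int, Int)] = rule { Pair ~> ((a: Int, b: Int) => (a, b)) }
    }

    // new PairParser("3,4").PairAsTuple.run() // evaluates to Success((3, 4))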
a | b
    Two rules ``a`` and ``b`` can be combined with the ``|`` operator to form an "ordered choice" in PEG_ speak.
    The resulting rule tries to match ``a`` and, if that fails, resets the parser and tries ``b``.
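A short sketch (names invented for the illustration) of ordered choice, highlighting that alternatives are tried left to right and that their order therefore matters:

.. code:: Scala

    import org.parboiled2._

    // Illustration only; class and rule names are invented for this sketch.
    class ComparisonParser(val input: ParserInput) extends Parser {
      // Ordered choice: alternatives are tried left to right and the first match wins.
      // Order matters: with `"<" | "<="` the "<" branch would match first and the choice
      // is never re-entered, so an input like "1<=2" could not be parsed.
      def ComparisonOp: Rule1[String] = rule {
        capture("<=" | ">=" | "==" | "<" | ">")
      }

      def Digits: Rule0 = rule { oneOrMore(CharPredicate.Digit) }

      def Comparison: Rule1[String] = rule { Digits ~ ComparisonOp ~ Digits ~ EOI }
    }

    // new ComparisonParser("1<=2").Comparison.run() // evaluates to Success("<=")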