Developing your own parsers

Sybil parsers are callables that take a document and yield a sequence of regions. A region contains the character positions of the start and end of the example in the document’s text, along with a parsed version of the example and a callable evaluator. Parsers are free to access any documented attribute of the Document, although they will most likely only need to work with its text. The namespace attribute should not be modified.

The parsed version can take any form and only needs to be understood by the evaluator.

The evaluator is called with an example constructed from the document and the region. It should return a false value if the example is as expected; otherwise, it should either raise an exception or return a textual description of what was not as expected. Evaluators may also modify the document’s namespace or push and pop evaluators.
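
The return-value contract can be sketched without Sybil itself. In the minimal sketch below, evaluate_upper is a hypothetical evaluator and SimpleNamespace is only a stand-in for a real Example, so treat it as an illustration of the contract rather than of Sybil's own classes:

```python
from types import SimpleNamespace
from typing import Optional


def evaluate_upper(example) -> Optional[str]:
    """Hypothetical evaluator: parsed is a (text, expected) pair."""
    text, expected = example.parsed
    actual = text.upper()
    if actual != expected:
        # A non-empty string describes what was not as expected.
        return f'{actual!r} != {expected!r}'
    # None is a false value, meaning the example was as expected.
    return None


# SimpleNamespace stands in for sybil's Example, for illustration only.
passing = SimpleNamespace(parsed=('hi', 'HI'))
failing = SimpleNamespace(parsed=('hi', 'Hi'))

passing_result = evaluate_upper(passing)
failing_result = evaluate_upper(failing)
```

Raising an exception, as the assert statements in the worked example do, is the other way to report a failure.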

Example instances are used to wrap up all the attributes you’re likely to need when writing an evaluator and all documented attributes are fine to use. In particular, parsed is the parsed value provided by the parser when instantiating the Region and namespace is a reference to the document’s namespace. Evaluators are free to modify the namespace if they need to.
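
As a rough illustration of evaluators sharing state through the namespace, the sketch below uses two hypothetical evaluators, evaluate_assignment and evaluate_lookup, with SimpleNamespace objects standing in for real Example instances that share one namespace dictionary:

```python
from types import SimpleNamespace
from typing import Optional


def evaluate_assignment(example) -> None:
    """Hypothetical evaluator: publish a (name, value) pair in the namespace."""
    name, value = example.parsed
    example.namespace[name] = value


def evaluate_lookup(example) -> Optional[str]:
    """Hypothetical evaluator: check a name set by an earlier example."""
    name, expected = example.parsed
    actual = example.namespace.get(name)
    if actual != expected:
        return f'{name} is {actual!r}, expected {expected!r}'
    return None


# Both stand-in examples refer to the same dictionary, mimicking how
# examples from one document share the document's namespace.
namespace = {}
first = SimpleNamespace(parsed=('greeting', 'hi there'), namespace=namespace)
second = SimpleNamespace(parsed=('greeting', 'hi there'), namespace=namespace)

evaluate_assignment(first)
lookup_result = evaluate_lookup(second)
```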

If you need to write your own parser, you should consult the API Reference to see if suitable lexers already exist for the source language containing your examples.

Worked example

As an example, let’s look at a parser suitable for evaluating bash commands in a subprocess and checking the output is as expected:

.. code-block:: bash

   $ echo hi there
   hi there

Since this is a reStructuredText code block, the simplest thing we could do would be to use the existing support for other languages:

from subprocess import check_output
from sybil import Sybil
from sybil.parsers.rest import CodeBlockParser

def evaluate_bash_block(example):
    command, expected = example.parsed.strip().split('\n')
    assert command.startswith('$ ')
    command = command[2:].split()
    actual = check_output(command).strip().decode('ascii')
    assert actual == expected, repr(actual) + ' != ' + repr(expected)

bash_parser = CodeBlockParser(language='bash', evaluator=evaluate_bash_block)

sybil = Sybil(parsers=[bash_parser], pattern='*.rst')
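
The evaluator above splits the command on whitespace, so quoted arguments are broken apart. If your examples need quoting, one option is the standard library's shlex module. This is a small sketch only; split_command is a hypothetical helper, not part of Sybil:

```python
import shlex
from typing import List


def split_command(line: str) -> List[str]:
    """Hypothetical helper: turn a '$ ...' prompt line into an argument list."""
    if not line.startswith('$ '):
        raise ValueError(f'not a command line: {line!r}')
    # shlex.split follows POSIX shell quoting rules, so a quoted
    # argument stays in one piece.
    return shlex.split(line[2:])


naive = '$ echo "hi  there"'[2:].split()
quoted = split_command('$ echo "hi  there"')
```

Here naive keeps the quote characters and splits the argument in two, while quoted is a two-element list whose second item is the unquoted string.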

Another alternative would be to start with the lexer for ReST directives. Here, the parsed version consists of a tuple of the command to run and the expected output:

from subprocess import check_output
from typing import Iterable
from sybil import Sybil, Document, Region, Example
from sybil.parsers.rest.lexers import DirectiveLexer

def evaluate_bash_block(example: Example):
    command, expected = example.parsed
    actual = check_output(command).strip().decode('ascii')
    assert actual == expected, repr(actual) + ' != ' + repr(expected)

def parse_bash_blocks(document: Document) -> Iterable[Region]:
    lexer = DirectiveLexer(directive='code-block', arguments='bash')
    for lexed in lexer(document):
        command, output = lexed.lexemes['source'].strip().split('\n')
        assert command.startswith('$ ')
        parsed = command[2:].split(), output
        yield Region(lexed.start, lexed.end, parsed, evaluate_bash_block)

sybil = Sybil(parsers=[parse_bash_blocks], pattern='*.rst')
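
The parser above unpacks exactly two lines, so an example with several lines of output, or none, would raise a ValueError while parsing. Assuming the source has already been dedented, a sketch of a more tolerant split could use str.partition; split_example is a hypothetical helper name:

```python
from typing import List, Tuple


def split_example(source: str) -> Tuple[List[str], str]:
    """Hypothetical helper: split a block into (command, expected output)."""
    # partition splits at the first newline only, so everything after
    # the command line is kept together as the expected output.
    command, _, output = source.strip().partition('\n')
    if not command.startswith('$ '):
        raise ValueError(f'not a command line: {command!r}')
    return command[2:].split(), output


single = split_example('$ echo hi there\nhi there\n')
multi = split_example('$ seq 3\n1\n2\n3\n')
silent = split_example('$ true\n')
```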

Finally, the parser could be implemented from scratch, with the parsed version again consisting of a tuple of the command to run and the expected output:

from subprocess import check_output
import re
from sybil import Sybil, Region
from sybil.parsers.abstract.lexers import BlockLexer

BASHBLOCK_START = re.compile(r'^\.\.\s*code-block::\s*bash')
BASHBLOCK_END = r'(\n\Z|\n(?=\S))'

def evaluate_bash_block(example):
    command, expected = example.parsed
    actual = check_output(command).strip().decode('ascii')
    assert actual == expected, repr(actual) + ' != ' + repr(expected)

def parse_bash_blocks(document):
    lexer = BlockLexer(BASHBLOCK_START, BASHBLOCK_END)
    for region in lexer(document):
        command, output = region.lexemes['source'].strip().split('\n')
        assert command.startswith('$ ')
        region.parsed = command[2:].split(), output
        region.evaluator = evaluate_bash_block
        yield region

sybil = Sybil(parsers=[parse_bash_blocks], pattern='*.rst')
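
To see what the two patterns delimit, here is a sketch that applies them with the re module alone. It assumes the block source is whatever lies between the start match and the end match, dedented; find_first_block is a hypothetical helper and Sybil's lexer may differ in its details:

```python
import re
import textwrap
from typing import Optional

BASHBLOCK_START = re.compile(r'^\.\.\s*code-block::\s*bash')
BASHBLOCK_END = re.compile(r'(\n\Z|\n(?=\S))')


def find_first_block(text: str) -> Optional[str]:
    """Hypothetical helper: return the dedented source of the first block."""
    # Without re.MULTILINE, ^ only matches at the very start of the text,
    # so the samples below begin with the directive.
    start = BASHBLOCK_START.search(text)
    if start is None:
        return None
    # The block ends at a newline followed by unindented text,
    # or at the final newline of the text.
    end = BASHBLOCK_END.search(text, start.end())
    if end is None:
        return None
    return textwrap.dedent(text[start.end():end.start()]).strip()


followed_by_text = find_first_block(
    '.. code-block:: bash\n\n   $ echo hi there\n   hi there\n\nNext paragraph.\n'
)
at_end_of_text = find_first_block(
    '.. code-block:: bash\n\n   $ echo hi\n   hi\n'
)
```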

Of course, you should also write tests for your parser, showing it both succeeding and failing. Here are examples for the Bash parser implementation at the start of this section, making use of check_parser() to check a single example in a string against the supplied Parser:

from sybil.testing import check_parser
from testfixtures import ShouldAssert

def test_bash_success() -> None:
    check_parser(
        bash_parser,
        text="""
            .. code-block:: bash

                $ echo hi there
                hi there
        """,
    )

def test_bash_failure() -> None:
    with ShouldAssert("'this is wrong' != 'hi there'"):
        check_parser(
            bash_parser,
            text="""
                .. code-block:: bash

                    $ echo this is wrong
                    hi there
            """,
        )

Developing with Lexers

Sybil has a fairly rich selection of parsers and lexers, so even if your source format isn’t directly supported, supporting it may not take much work.

Take Docusaurus code blocks, which add parameters to Markdown fenced code blocks. Suppose we want to implement a parser which will execute Python code blocks in this format:

```python title="hello.py"
print("hello")
```

First, let’s implement a lexer that understands this extension to the Markdown format:

import re
from typing import Iterable

from sybil import Document, Region
from sybil.parsers.markdown.lexers import RawFencedCodeBlockLexer

class DocusaurusCodeBlockLexer(RawFencedCodeBlockLexer):

    def __init__(self) -> None:
        super().__init__(
            info_pattern=re.compile(
                # [ \t]+ rather than \s+, so a newline is never consumed as the separator
                r'^(?P<language>\w+)(?:[ \t]+(?P<params>.+))?$\n', re.MULTILINE
            ),
        )

    def __call__(self, document: Document) -> Iterable[Region]:
        for lexed in super().__call__(document):
            lexemes = lexed.lexemes
            raw_params = lexemes.pop('params', None)
            params = lexemes['params'] = {}
            if raw_params:
                for match in re.finditer(r'(?P<key>\w+)="(?P<value>[^"]*)"', raw_params):
                    params[match.group('key')] = match.group('value')
            yield lexed
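
The info-line handling can be tried out in isolation with the re module. The sketch below is a stand-alone approximation, not the lexer itself: INFO_LINE, PARAM and parse_info_line are names invented for the illustration. Note that the separator is `[ \t]+`: with `\s+`, the separator could match the newline and the first line of code would be captured as parameters.

```python
import re
from typing import Dict, Optional, Tuple

# Stand-in patterns for illustration; names are assumptions.
INFO_LINE = re.compile(
    r'^(?P<language>\w+)(?:[ \t]+(?P<params>.+))?$\n', re.MULTILINE
)
LOOSE_INFO_LINE = re.compile(
    r'^(?P<language>\w+)(?:\s+(?P<params>.+))?$\n', re.MULTILINE
)
PARAM = re.compile(r'(?P<key>\w+)="(?P<value>[^"]*)"')


def parse_info_line(text: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Hypothetical helper: return (language, params) for text after a fence."""
    match = INFO_LINE.match(text)
    if match is None:
        return None
    raw_params = match.group('params') or ''
    # Only key="value" pairs are collected; bare flags are ignored.
    params = {m.group('key'): m.group('value') for m in PARAM.finditer(raw_params)}
    return match.group('language'), params


with_params = parse_info_line('python title="hello.py" showLineNumbers\nprint("hello")\n')
without_params = parse_info_line('python\nprint("hello")\n')
# The looser separator swallows the first line of code as "params".
swallowed = LOOSE_INFO_LINE.match('python\nprint("hello")\n').group('params')
```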

We can write a unit test that verifies this lexer works as follows:

from sybil import Region
from sybil.testing import check_lexer

def test_docusaurus_lexing() -> None:
    regions = check_lexer(
        lexer=DocusaurusCodeBlockLexer(),
        source_text="""
            ```jsx title="/src/components/HelloCodeTitle.js"
            function HelloCodeTitle(props) {
              return <h1>Hello, {props.name}</h1>;
            }
            ```
        """,
        expected_text=(
            '            ```jsx title="/src/components/HelloCodeTitle.js"\n'
            '            function HelloCodeTitle(props) {\n'
            '              return <h1>Hello, {props.name}</h1>;\n'
            '            }\n            ```'
        ),
        expected_lexemes={
            'language': 'jsx',
            'params': {'title': '/src/components/HelloCodeTitle.js'},
            'source': (
                'function HelloCodeTitle(props) {\n'
                '  return <h1>Hello, {props.name}</h1>;\n}'
                '\n'
            ),
        }
    )

Once we’re confident that the lexer is working as required, we can use it with the existing AbstractCodeBlockParser as follows:

from sybil.evaluators.python import PythonEvaluator
from sybil.parsers.abstract.codeblock import AbstractCodeBlockParser

class DocusaurusCodeBlockParser(AbstractCodeBlockParser):
    def __init__(self) -> None:
        super().__init__(
            lexers=[DocusaurusCodeBlockLexer()],
            language='python',
            evaluator=PythonEvaluator(),
            language_lexeme_name='language',
        )

This can then be tested as follows:

from sybil.testing import check_parser

def test_docusaurus_parsing() -> None:
    document = check_parser(
        DocusaurusCodeBlockParser(),
        text="""
            ```python title="hello.py"
            x = 1
            ```
        """,
    )
    assert document.namespace['x'] == 1