Bug 280 - POWER spec parser needs to support fall-through cases
Summary: POWER spec parser needs to support fall-through cases
Status: CONFIRMED
Alias: None
Product: Libre-SOC's first SoC
Classification: Unclassified
Component: Source Code
Version: unspecified
Hardware: PC Linux
Importance: --- enhancement
Assignee: Luke Kenneth Casson Leighton
URL:
Depends on:
Blocks: 269
Reported: 2020-04-05 16:57 BST by Luke Kenneth Casson Leighton
Modified: 2020-04-06 13:45 BST (History)
2 users

See Also:
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:


Description Luke Kenneth Casson Leighton 2020-04-05 16:57:26 BST
this is very tricky to get working at the *lexer* level and still support
whitespace indentation.

switch (n)
    case(1): x <- 5
    case(2): # here
    case(3):
        x <- 3
    default:
        x <- 9
print (5)
Comment 1 Luke Kenneth Casson Leighton 2020-04-05 16:58:10 BST
current work-around: add a "comment" at the end (as in the example),
then set the parser looking for that, and ignore it.
Comment 2 Jacob Lifshay 2020-04-05 18:01:18 BST
(In reply to Luke Kenneth Casson Leighton from comment #0)
> this is very tricky to get working at the *lexer* level and still support
> whitespace indentation.
> 
> switch (n)
>     case(1): x <- 5
>     case(2): # here
>     case(3):
>         x <- 3
>     default:
>         x <- 9
> print (5)

How about switching the grammar to parse a case-sequence instead of a single case? That way multiple cases before a statement block would be handled correctly.

It might look like the following (doesn't handle empty lines):

switch := "switch" LPAREN expression RPAREN NL INDENT (case_sequence case_body)+ DEDENT

case_sequence := (case NL)* case (NL default)?
              | default

case := "case" LPAREN expression RPAREN COLON

default := "default" COLON

case_body := statement_with_nl
          | NL INDENT statement_with_nl+ DEDENT

statement_with_nl := lhs "<-" expression NL
          | ...
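For illustration, the case_sequence idea can be exercised with a tiny hand-rolled matcher. This is a sketch only: the token shapes (whole case labels as single strings, "NL" as an explicit marker) are simplifying assumptions for demonstration, not the real ply lexer's output.

```python
def parse_case_sequence(tokens, pos):
    """consume (case NL)* (case | default) starting at pos;
    returns (case_labels, new_pos) or None if no case matched"""
    labels = []
    while pos < len(tokens):
        tok = tokens[pos]
        if not (tok.startswith("case(") or tok == "default:"):
            break
        labels.append(tok)
        pos += 1
        # a bare NL after the label means another label may follow,
        # i.e. this case falls through to the next one
        if pos < len(tokens) and tokens[pos] == "NL":
            pos += 1
        else:
            break
    return (labels, pos) if labels else None

# two aligned cases followed by a statement body
print(parse_case_sequence(["case(2):", "NL", "case(3):", "x <- 3"], 0))
# → (['case(2):', 'case(3):'], 3)
```

with this shape, case(2) and case(3) are collected into one label list and share the single case_body that follows, which is the fall-through behaviour the grammar change is after.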
Comment 3 Luke Kenneth Casson Leighton 2020-04-05 21:37:57 BST
(In reply to Jacob Lifshay from comment #2)
> (In reply to Luke Kenneth Casson Leighton from comment #0)
> > this is very tricky to get working at the *lexer* level and still support
> > whitespace indentation.
> > 
> > switch (n)
> >     case(1): x <- 5
> >     case(2): # here
> >     case(3):
> >         x <- 3
> >     default:
> >         x <- 9
> > print (5)
> 
> How about switching the grammar to parse a case-sequence instead of a single
> case, that way multiple cases before a statement block would be correctly
> handled.

annoyingly, the more changes that are made to the grammar, the
less "like the spec" the grammar becomes, with the implication
that further manual intervention stages are required when it
comes to verifying against the 3.0B spec and, in future, against
3.0C and other releases.

i've made quite a few minor changes, some of them necessary (to support
syntax such as CR0[x] rather than CR0_subscript_x), some of them, well,
not laziness exactly, just "trying to get it working fast".

yet-another-preprocessor-stage - even if it is line-based rather than
ply-lexer-based, looking for 2 "case" statements (or case followed by
default) which are lined up, and inserting the ghost word "fallthrough"
would "do the trick"

this is ok:

   case (2): fallthrough
   case (3): x <- 3

the lexer happily recognises that... *sigh*

so it can be "faked" by recognising

spc spc spc case (2):
spc spc spc case (3):
spc spc spc default :

counting the number of spaces, recognising that there is a case after a case
(with no code) and "magically" putting in the ghost-word

horrible, but it "works" :)
Comment 4 Jacob Lifshay 2020-04-05 22:43:12 BST
(In reply to Luke Kenneth Casson Leighton from comment #3)
> (In reply to Jacob Lifshay from comment #2)
> > (In reply to Luke Kenneth Casson Leighton from comment #0)
> > > this is very tricky to get working at the *lexer* level and still support
> > > whitespace indentation.
> > > 
> > > switch (n)
> > >     case(1): x <- 5
> > >     case(2): # here
> > >     case(3):
> > >         x <- 3
> > >     default:
> > >         x <- 9
> > > print (5)
> > 
> > How about switching the grammar to parse a case-sequence instead of a single
> > case, that way multiple cases before a statement block would be correctly
> > handled.
> 
> annoyingly, the more changes that are made to the grammar, the
> less "like the spec" the grammar becomes, with the implication
> that further manual intervention stages are required when it
> comes to verifying against the 3.0B spec and, in future, against
> 3.0C and other releases.

Um, I don't think the pseudocode in the Power spec was ever supposed to be Python, so I have no issues with changing the grammar file to more accurately match the spec pdf even if the grammar doesn't match Python as closely.

> i've made quite a few minor changes, some of them necessary (to support
> syntax such as CR0[x] rather than CR0_subscript_x, some of them, well,
> not being lazy, just "trying to get it working fast"
> 
> yet-another-preprocessor-stage - even if it is line-based rather than
> ply-lexer-based, looking for 2 "case" statements (or case followed by
> default) which are lined up, and inserting the ghost word "fallthrough"
> would "do the trick"

I think having the grammar correctly reflect the actual syntax that is used is probably a better way to go than adding preprocessing kludges to add `fallthrough` everywhere.

The space-counting done by the lexer translates the spaces into the INDENT and DEDENT tokens.

The following algorithm should work to translate lines:

def lexer(lines):
    """ lines is a iterator that returns each line
        of text without the trailing newline """

    indent_depth_stack = [0]
    for line in lines:
        # assume we don't have to worry about tabs in string literals
        expanded_line = line.expandtabs()
        line = expanded_line.lstrip()
        # count indent depth
        depth = len(expanded_line) - len(line)
        if line == "" or line[0] == "#":
            # empty lines don't have to match depth
            # don't yield repeated NL tokens
            continue
        if depth > indent_depth_stack[-1]:
            yield INDENT
            indent_depth_stack.append(depth)
        else:
            while depth < indent_depth_stack[-1]:
                yield DEDENT
                indent_depth_stack.pop()
            if depth > indent_depth_stack[-1]:
                raise IndentDepthMismatch("indent depth doesn't match!")
        yield from tokenize(line)
        yield NL
    while indent_depth_stack[-1] != 0:
        yield DEDENT
        indent_depth_stack.pop()
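To make the sketch concrete, here is a self-contained, runnable rendering of the same algorithm. The placeholder tokenize() (plain whitespace split) and the INDENT/DEDENT/NL marker strings are assumptions for demonstration only; the real lexer would emit ply tokens.

```python
INDENT, DEDENT, NL = "INDENT", "DEDENT", "NL"

class IndentDepthMismatch(Exception):
    pass

def tokenize(line):
    # stand-in for the real tokenizer: just split on whitespace
    return line.split()

def lexer(lines):
    indent_depth_stack = [0]
    for line in lines:
        expanded_line = line.expandtabs()
        line = expanded_line.lstrip()
        depth = len(expanded_line) - len(line)  # count indent depth
        if line == "" or line[0] == "#":
            continue  # blank/comment lines: no NL, no depth check
        if depth > indent_depth_stack[-1]:
            yield INDENT
            indent_depth_stack.append(depth)
        else:
            while depth < indent_depth_stack[-1]:
                yield DEDENT
                indent_depth_stack.pop()
            if depth > indent_depth_stack[-1]:
                raise IndentDepthMismatch("indent depth doesn't match!")
        yield from tokenize(line)
        yield NL
    while indent_depth_stack[-1] != 0:  # close any still-open blocks
        yield DEDENT
        indent_depth_stack.pop()

print(list(lexer(["if True:", "    x <- 3", "y <- 9"])))
# → ['if', 'True:', 'NL', 'INDENT', 'x', '<-', '3', 'NL',
#    'DEDENT', 'y', '<-', '9', 'NL']
```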
Comment 5 Luke Kenneth Casson Leighton 2020-04-05 23:12:26 BST
(In reply to Jacob Lifshay from comment #4)

> Um, I don't think the pseudocode in the Power spec was ever supposed to be
> Python,

it's not, however it turns out to be whitespace-indented, probably through expediency and a desire for compact clarity.

i *began* with GardenSnake.py and rapidly morphed it to match the PDF pseudocode grammar.

it *outputs* python AST because that was what GardenSnake did, and it's worked well.

however python lacks switch/case about which i always wondered "wth?" and it makes sense now: case fall-through seriously messes with an already complex three-pass process.

> so I have no issues with changing the grammar file to more
> accurately match the spec pdf even if the grammar doesn't match Python as
> closely.

the "hard" way is to try and mess with one of the existing stages.

the pseudo syntax does this:

do while test
    foo

by recognising the keyword "while" i managed to insert an extra token ":" when the newline occurs.

this trick means that the tokens *look* like they were this:

do while test:
    foo

and of course that now looks exactly like python "while" and it was easy to add.

i therefore have absolutely no problem messing with the token stream to "get the desired outcome" :)
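the do-while trick can be approximated with a line-based pre-pass like the one below. this is a sketch only: the real implementation inserts the COLON token inside the ply lexer when the newline occurs, and the helper name here is hypothetical.

```python
def add_colon_to_do_while(code):
    """append ":" to any "do while <test>" line so the downstream
    python-like grammar sees the familiar "while test:" form"""
    res = []
    for line in code.split("\n"):
        stripped = line.lstrip()
        if stripped.startswith("do while") and not stripped.endswith(":"):
            line = line + ":"
        res.append(line)
    return "\n".join(res)

print(add_colon_to_do_while("do while test\n    foo"))
# do while test:
#     foo
```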


> > i've made quite a few minor changes, some of them necessary (to support
> > syntax such as CR0[x] rather than CR0_subscript_x, some of them, well,
> > not being lazy, just "trying to get it working fast"
> > 
> > yet-another-preprocessor-stage - even if it is line-based rather than
> > ply-lexer-based, looking for 2 "case" statements (or case followed by
> > default) which are lined up, and inserting the ghost word "fallthrough"
> > would "do the trick"
> 
> I think having the grammar correctly reflect the actual syntax that is used
> is probably a better way to go than adding preprossing cludges to add
> `fallthrough` everywhere.

i'd put it as an actual keyword, which is silently added by the lexer and silently removed by the yacc parser, if it made any difference.

however by matching against NAME it makes life simpler and easier.

i think what is happening is that

case(5): somename

is matching against "small single line suite".

by then detecting `len(statements) == 1 and statements[0] == "fallthrough"`, that does the trick.

i know - it's awful.  
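the parser-side check can be sketched as follows; the function name and the statements-as-strings representation are illustrative assumptions, not the actual parser's data structures.

```python
def is_fallthrough(statements):
    # a one-statement "small single line suite" whose sole statement
    # is the ghost word marks this case as falling through to the next
    return len(statements) == 1 and statements[0] == "fallthrough"

print(is_fallthrough(["fallthrough"]))  # True
print(is_fallthrough(["x <- 3"]))       # False
```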

> The space-counting done by the lexer translates the spaces into the INDENT
> and DEDENT tokens.

yes.  the person who wrote GardenSnake.py copied the algorithm from the CPython interpreter and transliterated it into a python ply lexer.

i had a lovely time getting it to recognise whitespace, properly.  changed the actual lexer context to trick it into swapping from whitespace-aware to non-aware :)

 
> The following algorithm should work to translate lines:
> 
> def lexer(lines):
>     """ lines is a iterator that returns each line
>         of text without the trailing newline """

unfortunately, stripping the newlines removes vital information, particularly when discerning small statements after a colon from "suites".

a suite is defined as the following tokens

NEWLINE INDENT stmts DEDENT

where the algorithm you wrote (minus NEWLINE stripping) identifies indent and dedent

oh... er, 1 sec...

>         yield from tokenize(line)
>         yield NL

ah interesting.  so the NL *isn't* removed?
Comment 6 Jacob Lifshay 2020-04-05 23:32:32 BST
(In reply to Luke Kenneth Casson Leighton from comment #5)
> (In reply to Jacob Lifshay from comment #4)
> > The following algorithm should work to translate lines:
> > 
> > def lexer(lines):
> >     """ lines is a iterator that returns each line
> >         of text without the trailing newline """
> 
> unfortunately, stripping the newlines removes vital information,
> particularly when discerning small statements after a colon from "suites".
> 
> a suite is defined as the following tokens
> 
> NEWLINE INDENT stmts DEDENT
> 
> where the algorithm you wrote (minus NEWLINE stripping) identifies indent
> and dedent
> 
> oh... er, 1 sec...
> 
> >         yield from tokenize(line)
> >         yield NL
> 
> ah interesting.  so the NL *isn't* removed?

No, NL tokens are retained; however, duplicate NL tokens coming from empty lines/comment lines are stripped. The idea is that it should never produce two NL tokens right after each other.

However, the way I wrote it you can't do the following:

if True:
    # just a comment here, no actual statements;

    # python has a `pass` statement for this case
next_statement

since it will produce "if" "True" COLON NL next_statement
rather than "if" "True" COLON NL INDENT NL DEDENT next_statement

However, the following will work:

if True:
              # empty line with comment

    pass
next_statement

it will produce "if" "True" COLON NL INDENT "pass" NL DEDENT next_statement
Comment 7 Luke Kenneth Casson Leighton 2020-04-06 13:33:22 BST
blegh.  does the job.

+    """add annoying "silent keyword" (fallthrough)
+
+    this tricks the parser into taking the (silent) case statement
+    as a "small expression".  it can then be spotted and used to indicate
+    "fall through" to the next case (in the parser)
+
+    bugs: any function that starts with the letters "case" or "default"
+    will be detected erroneously.  fixing that involves doing a token
+    lexer which spots the fact that "case" and "default" are words,
+    separating them from space, colon, bracket etc.
+    """
+    res = []
+    prev_spc_count = None
+    for l in code.split("\n"):
+        spc_count = count_spaces(l)
+        nwhite = l[spc_count:]
+        if nwhite.startswith("case") or nwhite.startswith("default"):
+            #print ("case/default", nwhite, spc_count, prev_spc_count)
+            if (prev_spc_count is not None and
+                prev_spc_count == spc_count and
+                (res[-1].endswith(":") or res[-1].endswith(": fallthrough"))):
+                res[-1] += " fallthrough" # add to previous line
+            prev_spc_count = spc_count
+        else:
+            #print ("notstarts", spc_count, nwhite)
+            prev_spc_count = None
+        res.append(l)
+    return '\n'.join(res)
+
Comment 8 Luke Kenneth Casson Leighton 2020-04-06 13:45:11 BST
@@ -152,6 +154,8 @@ def annoying_case_hack_filter(code):
     for l in code.split("\n"):
         spc_count = count_spaces(l)
         nwhite = l[spc_count:]
+        if len(nwhite) == 0: # skip blank lines
+            continue
         if nwhite.startswith("case") or nwhite.startswith("default"):
             #print ("case/default", nwhite, spc_count, prev_spc_count)
             if (prev_spc_count is not None and
@@ -473,9 +477,11 @@ switch (n)

that was enough to get it to skip blank lines.  it "Does The Job".

yes it's not "clean / nice" however it's quick and it works.
moving on to the next task, kinda important.
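putting comment 7's filter and the blank-line fix from comment 8 together, here is a self-contained sketch. count_spaces() is an assumed helper (the real one lives alongside the pseudocode parser); note that, as in the patch, blank lines are dropped from the output by the `continue`.

```python
def count_spaces(line):
    # assumed helper: number of leading spaces on a line
    return len(line) - len(line.lstrip(" "))

def annoying_case_hack_filter(code):
    """append the ghost word "fallthrough" to a case/default line that
    is immediately followed by another case/default at the same depth"""
    res = []
    prev_spc_count = None
    for l in code.split("\n"):
        spc_count = count_spaces(l)
        nwhite = l[spc_count:]
        if len(nwhite) == 0:  # skip blank lines (dropped from output)
            continue
        if nwhite.startswith("case") or nwhite.startswith("default"):
            if (prev_spc_count is not None and
                    prev_spc_count == spc_count and
                    (res[-1].endswith(":") or
                     res[-1].endswith(": fallthrough"))):
                res[-1] += " fallthrough"  # add to previous line
            prev_spc_count = spc_count
        else:
            prev_spc_count = None
        res.append(l)
    return '\n'.join(res)

print(annoying_case_hack_filter(
    "switch (n)\n"
    "    case(1): x <- 5\n"
    "    case(2):\n"
    "\n"
    "    case(3):\n"
    "        x <- 3\n"
    "    default:\n"
    "        x <- 9"))
```

running it on the switch example from the description turns the empty `case(2):` into `case(2): fallthrough`, while `case(1)` (which has a statement) and `default` (which follows a body line) are left alone.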