this is very tricky to get working at the *lexer* level and still support
whitespace indentation.

    switch (n)
        case(1): x <- 5
        case(2): # here
        case(3):
            x <- 3
        default:
            x <- 9
    print (5)
current work-around: add a "comment" at the end of the case line (as in
the example), then have the parser look for that comment and ignore it.
(In reply to Luke Kenneth Casson Leighton from comment #0)
> this is very tricky to get working at the *lexer* level and still
> support whitespace indentation.
>
>     switch (n)
>         case(1): x <- 5
>         case(2): # here
>         case(3):
>             x <- 3
>         default:
>             x <- 9
>     print (5)

How about switching the grammar to parse a case-sequence instead of a
single case? that way multiple cases before a statement block would be
correctly handled.

It might look like the following (doesn't handle empty lines):

    switch := "switch" LPAREN expression RPAREN NL
              INDENT (case_sequence case_body)+ DEDENT
    case_sequence := (case NL)* case (NL default)? | default
    case := "case" LPAREN expression RPAREN COLON
    default := "default" COLON
    case_body := statement_with_nl
               | NL INDENT statement_with_nl+ DEDENT
    statement_with_nl := lhs "<-" expression NL
                       | ...
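As a rough, self-contained illustration of the case-sequence idea (the
plain-string tokens and the parse_case_sequence helper here are invented
for this sketch, not the project's real ply grammar):

    def parse_case_sequence(toks, i):
        """collect one or more case/default labels that share the
        following statement body; returns (labels, new_index)"""
        labels = []
        while i < len(toks) and toks[i] in ("case", "default"):
            if toks[i] == "case":
                # expects: case ( expr ) :
                assert toks[i+1] == "(" and toks[i+3] == ")" \
                       and toks[i+4] == ":"
                labels.append(("case", toks[i+2]))
                i += 5
            else:
                # expects: default :
                assert toks[i+1] == ":"
                labels.append(("default", None))
                i += 2
            if i < len(toks) and toks[i] == "NL":
                i += 1          # label stands on its own line
        return labels, i

    toks = ["case", "(", "2", ")", ":", "NL",
            "case", "(", "3", ")", ":", "NL",
            "x", "<-", "3", "NL"]
    print(parse_case_sequence(toks, 0))
    # ([('case', '2'), ('case', '3')], 12) - both labels share the body

so consecutive empty cases simply accumulate as labels, and fall-through
needs no special token at all.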
(In reply to Jacob Lifshay from comment #2)
> (In reply to Luke Kenneth Casson Leighton from comment #0)
> > this is very tricky to get working at the *lexer* level and still
> > support whitespace indentation.
> >
> >     switch (n)
> >         case(1): x <- 5
> >         case(2): # here
> >         case(3):
> >             x <- 3
> >         default:
> >             x <- 9
> >     print (5)
>
> How about switching the grammar to parse a case-sequence instead of a
> single case? that way multiple cases before a statement block would be
> correctly handled.

annoyingly, the more changes that are made to the grammar, the less
"like the spec" the grammar becomes, with the implication that further
manual intervention stages are required when it comes to verifying
against the 3.0B spec and, in future, against 3.0C and other releases.

i've made quite a few minor changes already: some of them necessary (to
support syntax such as CR0[x] rather than CR0_subscript_x), some of
them, well, not laziness exactly, just "trying to get it working fast".

yet-another-preprocessor-stage - even if it is line-based rather than
ply-lexer-based - looking for 2 "case" statements (or a case followed by
default) which are lined up, and inserting the ghost word "fallthrough",
would "do the trick".

this is ok:

    case (2): fallthrough
    case (3):
        x <- 3

the lexer happily recognises that... *sigh*

so it can be "faked" by recognising

    spc spc spc case (2):
    spc spc spc case (3):
    spc spc spc default :

counting the number of spaces, recognising that there is a case after a
case (with no code), and "magically" putting in the ghost-word.

horrible, but it "works" :)
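to make that concrete, the intent is that the line-filter would turn
input like this (spacing illustrative):

    switch (n)
        case (2):
        case (3):
            x <- 3

into this, before the lexer ever sees it:

    switch (n)
        case (2): fallthrough
        case (3):
            x <- 3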
(In reply to Luke Kenneth Casson Leighton from comment #3)
> annoyingly, the more changes that are made to the grammar, the less
> "like the spec" the grammar becomes, with the implication that further
> manual intervention stages are required when it comes to verifying
> against the 3.0B spec and, in future, against 3.0C and other releases.

Um, I don't think the pseudocode in the Power spec was ever supposed to
be Python, so I have no issues with changing the grammar file to more
accurately match the spec pdf even if the grammar doesn't match Python
as closely.

> i've made quite a few minor changes already: some of them necessary
> (to support syntax such as CR0[x] rather than CR0_subscript_x), some
> of them, well, not laziness exactly, just "trying to get it working
> fast".
>
> yet-another-preprocessor-stage - even if it is line-based rather than
> ply-lexer-based - looking for 2 "case" statements (or a case followed
> by default) which are lined up, and inserting the ghost word
> "fallthrough", would "do the trick".

I think having the grammar correctly reflect the actual syntax that is
used is probably a better way to go than adding preprocessing kludges
to add `fallthrough` everywhere.

The space-counting done by the lexer translates the spaces into the
INDENT and DEDENT tokens.  The following algorithm should work to
translate lines:

    def lexer(lines):
        """ lines is an iterator that returns each line
            of text without the trailing newline """
        indent_depth_stack = [0]
        for line in lines:
            # assume we don't have to worry about tabs in string literals
            expanded_line = line.expandtabs()
            line = expanded_line.lstrip()
            # count indent depth
            depth = len(expanded_line) - len(line)
            if line == "" or line[0] == "#":
                # empty lines don't have to match depth
                # don't yield repeated NL tokens
                continue
            if depth > indent_depth_stack[-1]:
                yield INDENT
                indent_depth_stack.append(depth)
            else:
                while depth < indent_depth_stack[-1]:
                    yield DEDENT
                    indent_depth_stack.pop()
                if depth > indent_depth_stack[-1]:
                    raise IndentDepthMismatch("indent depth doesn't match!")
            yield from tokenize(line)
            yield NL
        while indent_depth_stack[-1] != 0:
            yield DEDENT
            indent_depth_stack.pop()
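To exercise the sketch, stand-in token sentinels and a crude
whitespace-splitting tokenize are enough (these stand-ins are
assumptions for illustration; the real lexer yields ply token objects):

    INDENT, DEDENT, NL = "INDENT", "DEDENT", "NL"   # stand-in sentinels

    class IndentDepthMismatch(Exception):
        pass

    def tokenize(line):
        # crude stand-in: the real code would run the ply lexer here
        return line.split()

    src = ["switch (n)",
           "    case(1): x <- 5",
           "    default:",
           "        x <- 9",
           "print (5)"]
    print(list(lexer(src)))
    # ['switch', '(n)', 'NL', 'INDENT', 'case(1):', 'x', '<-', '5', 'NL',
    #  'default:', 'NL', 'INDENT', 'x', '<-', '9', 'NL',
    #  'DEDENT', 'DEDENT', 'print', '(5)', 'NL']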
(In reply to Jacob Lifshay from comment #4)
> Um, I don't think the pseudocode in the Power spec was ever supposed
> to be Python,

it's not, however it turns out to be a whitespace-indented one, probably
through expediency and a desire for compact clarity.

i *began* with GardenSnake.py and rapidly morphed it to match the PDF
pseudocode grammar.  it *outputs* python AST because that was what
GardenSnake did, and it's worked well.

however python lacks switch/case, about which i always wondered "wth?",
and it makes sense now: case fall-through seriously messes with an
already complex three-pass process.

> so I have no issues with changing the grammar file to more accurately
> match the spec pdf even if the grammar doesn't match Python as
> closely.

the "hard" way is to try and mess with one of the existing stages.

the pseudo syntax does this:

    do while test
        foo

by recognising the keyword "while" i managed to insert an extra token
":" when the newline occurs.  this trick means that the tokens *look*
like they were this:

    do while test:
        foo

and of course that now looks exactly like python "while", so it was easy
to add.  i therefore have absolutely no problem messing with the token
stream to "get the desired outcome" :)

> > i've made quite a few minor changes already: some of them necessary
> > (to support syntax such as CR0[x] rather than CR0_subscript_x), some
> > of them, well, not laziness exactly, just "trying to get it working
> > fast".
> >
> > yet-another-preprocessor-stage - even if it is line-based rather
> > than ply-lexer-based - looking for 2 "case" statements (or a case
> > followed by default) which are lined up, and inserting the ghost
> > word "fallthrough", would "do the trick".
>
> I think having the grammar correctly reflect the actual syntax that is
> used is probably a better way to go than adding preprocessing kludges
> to add `fallthrough` everywhere.

i'd put it in as an actual keyword, silently added by the lexer and
silently removed by the yacc parser, if it made any difference.  however
matching against NAME makes life simpler and easier.

i think what is happening is that

    case(5): somename

is matching against "small single line suite".  by then detecting
'if len(statements) == 1 and statements[0] == "fallthrough"' that does
the trick.  i know - it's awful.

> The space-counting done by the lexer translates the spaces into the
> INDENT and DEDENT tokens.

yes.  the person who wrote GardenSnake.py copied the algorithm from the
python c interpreter and transliterated it to python ply lexer.

i had a lovely time getting it to recognise whitespace, properly.
changed the actual lexer context to trick it into swapping from
whitespace-aware to non-aware :)

> The following algorithm should work to translate lines:
>
>     def lexer(lines):
>         """ lines is an iterator that returns each line
>             of text without the trailing newline """

unfortunately, stripping the newlines removes vital information,
particularly when discerning small statements after a colon from
"suites".

a suite is defined as the following tokens:

    NEWLINE INDENT stmts DEDENT

where the algorithm you wrote (minus NEWLINE stripping) identifies
indent and dedent

oh... er, 1 sec...

>         yield from tokenize(line)
>         yield NL

ah interesting.  so the NL *isn't* removed?
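a minimal stand-alone sketch of that "do while" token trick (the
(type, value) pair representation and the function name are assumptions
for illustration - the real code does this inside the ply lexer):

    def insert_while_colon(tokens):
        # watch for a logical line containing the keyword "while" and
        # inject a COLON token just before its NEWLINE, so the grammar
        # downstream sees python-style "do while test:" syntax
        saw_while = False
        for typ, val in tokens:
            if typ == "NAME" and val == "while":
                saw_while = True
            if typ == "NEWLINE" and saw_while:
                yield ("COLON", ":")
                saw_while = False
            yield (typ, val)

    toks = [("NAME", "do"), ("NAME", "while"), ("NAME", "test"),
            ("NEWLINE", "\n"), ("NAME", "foo"), ("NEWLINE", "\n")]
    print([t for t, v in insert_while_colon(toks)])
    # ['NAME', 'NAME', 'NAME', 'COLON', 'NEWLINE', 'NAME', 'NEWLINE']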
(In reply to Luke Kenneth Casson Leighton from comment #5)
> (In reply to Jacob Lifshay from comment #4)
> > The following algorithm should work to translate lines:
> >
> >     def lexer(lines):
> >         """ lines is an iterator that returns each line
> >             of text without the trailing newline """
>
> unfortunately, stripping the newlines removes vital information,
> particularly when discerning small statements after a colon from
> "suites".
>
> a suite is defined as the following tokens:
>
>     NEWLINE INDENT stmts DEDENT
>
> where the algorithm you wrote (minus NEWLINE stripping) identifies
> indent and dedent
>
> oh... er, 1 sec...
>
> >         yield from tokenize(line)
> >         yield NL
>
> ah interesting.  so the NL *isn't* removed?

No, NL tokens are retained, however duplicate NL tokens coming from
empty lines/comment lines are stripped.  The idea is that it should
never produce two NL tokens right after each other.

However, the way I wrote it you can't do the following:

    if True:
        # just a comment here, no actual statements;
        # python has a `pass` statement for this case
    next_statement

since it will produce

    "if" "True" COLON NL next_statement

rather than

    "if" "True" COLON NL INDENT NL DEDENT next_statement

However, the following will work:

    if True:
        # empty line with comment
        pass
    next_statement

it will produce

    "if" "True" COLON NL INDENT "pass" NL DEDENT next_statement
blegh.  does the job.

+    """add annoying "silent keyword" (fallthrough)
+
+    this tricks the parser into taking the (silent) case statement
+    as a "small expression".  it can then be spotted and used to
+    indicate "fall through" to the next case (in the parser)
+
+    bugs: any function that starts with the letters "case" or "default"
+    will be detected erroneously.  fixing that involves doing a token
+    lexer which spots the fact that "case" and "default" are words,
+    separating them from space, colon, bracket etc.
+    """
+    res = []
+    prev_spc_count = None
+    for l in code.split("\n"):
+        spc_count = count_spaces(l)
+        nwhite = l[spc_count:]
+        if nwhite.startswith("case") or nwhite.startswith("default"):
+            #print ("case/default", nwhite, spc_count, prev_spc_count)
+            if (prev_spc_count is not None and
+                prev_spc_count == spc_count and
+                (res[-1].endswith(":") or
+                 res[-1].endswith(": fallthrough"))):
+                res[-1] += " fallthrough" # add to previous line
+            prev_spc_count = spc_count
+        else:
+            #print ("notstarts", spc_count, nwhite)
+            prev_spc_count = None
+        res.append(l)
+    return '\n'.join(res)
+
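to try the filter standalone, count_spaces (not part of the diff above,
so this definition is an assumption) can be as simple as:

    def count_spaces(l):
        # hypothetical helper: number of leading spaces on the line
        return len(l) - len(l.lstrip(" "))

    code = """switch (n)
        case (2):
        case (3):
            x <- 3"""
    print(annoying_case_hack_filter(code))
    # the first case comes back as "    case (2): fallthrough"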
@@ -152,6 +154,8 @@ def annoying_case_hack_filter(code):
     for l in code.split("\n"):
         spc_count = count_spaces(l)
         nwhite = l[spc_count:]
+        if len(nwhite) == 0: # skip blank lines
+            continue
         if nwhite.startswith("case") or nwhite.startswith("default"):
             #print ("case/default", nwhite, spc_count, prev_spc_count)
             if (prev_spc_count is not None and
@@ -473,9 +477,11 @@ switch (n)

that was enough to get it to skip blank lines.  it "Does The Job".  yes
it's not "clean / nice" however it's quick and it works.  moving on to
the next task, kinda important.
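a quick check of the patched filter (count_spaces as sketched in the
previous comment): a blank line between two aligned cases no longer
resets the tracking, so the ghost-word still goes in - note the blank
line itself is also dropped from the output, since the continue skips
the append:

    code = "switch (n)\n    case (2):\n\n    case (3):\n        x <- 3"
    print(annoying_case_hack_filter(code))
    # switch (n)
    #     case (2): fallthrough
    #     case (3):
    #         x <- 3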