* DONE: GRev class in nmutil https://git.libre-soc.org/?p=nmutil.git;a=commitdiff;h=51e5a621621c12bb34b44d6bb8a29f1f668a111e * DONE: comments in grev.py * DONE: add NGI POINTER funding notice * DONE: unit tests for GRev in nmutil * DONE: formal tests for GRev in nmutil note, to be covered by bug #840 * DONE: add to openpower-isa * DONE: add to csv * DONE: add pseudo-code https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isa/bitmanip.mdwn;h=3f73b0b894583e8cfb2b871988c5b593f6412382;hb=HEAD * DONE: add to BitManipTestCase https://git.libre-soc.org/?p=openpower-isa.git;a=commit;h=b55f37ee14b3253087f0a743eb795b53e59f25d2 * DONE: add to soc * DONE: add formal proof https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=66620a91bac731f6745fa33a7fe122b0aa7d3485 note, to be covered by bug #840 * TODO: create formal correctness bugreport under bug #840 to cover tasks done here * TODO: remove from openpower-isa (leave in nmutil, all good there)
commit 7aed55e3538fb39327bf4946843292d128d29abc (HEAD -> master, origin/master, origin/HEAD) Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net> Date: Sat Dec 11 15:29:09 2021 +0000 add some comments (locations for comments to be added) i took a quick look, one thing: i think you'll probably find this works: # start with input as first layer step_i = self.input for i in range(self.log2_width): step_o = Signal(self.width, name=f"step{i}") chunk_size = 1 << i # TODO comment that this is creating the mux-swapper with m.If(self.chunk_sizes[i]): # swap path for j in range(self.width): # TODO explain what this XOR does m.d.comb += step_o[j].eq(step_i[j ^ chunk_size]) with m.Else(): # straight path m.d.comb += step_o.eq(step_i) # output becomes input for next phase step_i = step_o # TODO comment that the last "step" is the output m.d.comb += self.output.eq(step_o)
(In reply to Luke Kenneth Casson Leighton from comment #1) > commit 7aed55e3538fb39327bf4946843292d128d29abc (HEAD -> master, > origin/master, origin/HEAD) > Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net> > Date: Sat Dec 11 15:29:09 2021 +0000 > > add some comments (locations for comments to be added) > > i took a quick look, one thing: i think you'll probably find this works: > > > # start with input as first layer > step_i = self.input > > for i in range(self.log2_width): > step_o = Signal(self.width, name=f"step{i}") I specifically have the intermediate signals exposed because I want to have the tests be able to access them to assert they have the correct values.
(In reply to Jacob Lifshay from comment #2) > I specifically have the intermediate signals exposed because I want to have > the tests be able to access them to assert they have the correct values. then why was that not added as a comment, explaining that? nothing "unusual" like that can be left out of the comments. you need to think - ask yourself the question (every time): "how will someone else view this decision that i am making? will they understand it? come to think of it, in 3-6 months time will *i* understand it?"
this makes the need for a sub-function redundant: for i in range(self._log2_width): step_o = Signal(self.width, name="step_%d" % i) _steps.append(step_o) ... ... step_i = step_o # for next loop, to create the combinatorial chain then at the end self._steps can be set to _steps. optionally, a constructor parameter could be added "testing_mode=False" which does not add that. or, another technique that i've seen used - and this is for Formal Correctness Proofs only: def elaborate(self, platform): if platform == 'formal': # do some Asserts. it's a small amount of code, here, so isn't a Big Deal. if it was a much larger module, exposing a considerably larger suite of "test" signals, i'd start to get concerned.
I cleaned up the grev implementation a bunch and added docs, unit tests, and formal proofs. Too bad Python 3.7 doesn't have Rust's `windows` iterator, it'd be really useful here. I added `pairwise` cuz we're not running Python 3.10 yet...I want it anyway! https://doc.rust-lang.org/std/primitive.slice.html#method.windows
lkcl, can you add the funding notice? src/nmutil/grev.py:4 # TODO(lkcl): add NGI POINTER funding notice, idk what we need for that...
https://git.libre-soc.org/?p=nmutil.git;a=commitdiff;h=c63d1f2b483573fae75af6688c07409b84efa7e4 i did a style rewrite, and updated the comments. it's definitely a butterfly network - you can see by looking at the wikipedia page and comparing against the xbitmanip draft. it's definitely not an ror, that's the function before in the RVV xbitmanip document (the previous page contains a different diagram, for ror, not butterfly) https://en.wikipedia.org/wiki/Butterfly_network#/media/File:Butterfly_Network.jpg https://ftp.libre-soc.org/2021-12-17_11-17.png * i removed tee and pairwise: "adding yet more code" means "more things to read" returned the code to the style that you replaced, which uses a list for an accumulator, inside elaborate. this is perfectly fine to do because the dut is instantiated (elaborate called) long before the unit test begins to refer to the actual contents this has the advantage of removing the intermediary signals from the constructor, such that when someone reads the constructor to find out what Signals they should be connecting up, they don't go "urrr what the heck should i do with these" * module docstrings != class docstrings. module docstrings describe the general concept, class docstrings typically tell you "how to use the class" the two are radically different * saying "see this other stuff somewhere else" entirely defeats the object of the exercise of comments! comments are for "immediate thoughts necessary to understand, right now, what the hell is going on, right now". referring *elsewhere* causes massive confusion, frustration, and irritation for the reader. for example: self._steps is not obvious at all that it's an accumulator of everything (input, intermediary, output)... *therefore say so*, then comment each addition and, oh look, "the last layer is also the output", okaaay, that's quite obvious. bottom line is that you should *not* assume - at all - that the reader is capable of understanding the HDL (the code). AT ALL. someone who has NO UNDERSTANDING of python should be able to read the english language "words" and go, "hmmmm, y'know: i can just about understand this" also, any "tricks" - things that break standard conventions - DEFINITELY need to be commented. * i replaced the "If" with an *actual* Mux, because it replaces 4 lines with 1. * the FSFE provides guidance on how to do proper copyright notices. Copyright (C) year, year, year, year fullname contactinfo it is NOT ok to put yearstart-yearend, and you MUST NOT claim copyright in a year that you did not ACTUALLY make any additions. there are plenty of examples in the code to follow here.
(In reply to Luke Kenneth Casson Leighton from comment #7) > https://git.libre-soc.org/?p=nmutil.git;a=commitdiff; > h=c63d1f2b483573fae75af6688c07409b84efa7e4 > > i did a style rewrite, and updated the comments. it's definitely a > butterfly network - you can see by looking at the wikipedia page > and comparing against the xbitmanip draft. It's similar but not identical because wikipedia's example does grev from MSB to LSB (swap 4, swap 2, swap 1) but the code I wrote goes the other way (swap 1, swap 2, swap 4). > it's definitely not > an ror I never said it's an ror, I said it's *similar* to an ror, which is true. it uses a very similar sequence of 2:1 muxes. > * i removed tee and pairwise: "adding yet more code" means "more things to > read" I documented exactly what pairwise does...the exact same as python's standard library -- it returns successive pairs of the input items. imho it's waay easier to understand how I had written it. > returned the code to the style that you replaced, which uses a list for > an accumulator, inside elaborate. > > this is perfectly fine to do because the dut is instantiated (elaborate > called) long before the unit test begins to refer to the actual contents no it's not, you *obviously* didn't run the test (cuz you broke it). Also, having new members that show up only after calling something other than the constructor is poor style and quite confusing to use imho, hence why I replaced it. > > this has the advantage of removing the intermediary signals from the > constructor, such that when someone reads the constructor to find out > what Signals they should be connecting up, they don't go "urrr what the > heck should i do with these" then, simply, they just read the docstring for _steps and see that it says it's an internal signal only exposed for unit tests, and they then can easily ignore it. that should have been *also totally obvious* because its name starts with an underline anyway. > * the FSFE provides guidance on how to do proper copyright notices. > > Copyright (C) year, year, year, year fullname contactinfo we had previously agreed just referring to Notices.txt was sufficient, cuz you can put the copyright notice in there and only have to update the copyright years in one place. > > it is NOT ok to put yearstart-yearend, and you MUST NOT claim > copyright in a year that you did not ACTUALLY make any additions. > > there are plenty of examples in the code to follow here.
(In reply to Jacob Lifshay from comment #6) > lkcl, can you add the funding notice? > > src/nmutil/grev.py:4 > # TODO(lkcl): add NGI POINTER funding notice, idk what we need for that... sorted, # Funded by NLnet Assure Programme 2021-02-052, https://nlnet.nl/assure part # of Horizon 2020 EU Programme 957073. https://git.libre-soc.org/?p=nmutil.git;a=commitdiff;h=20db17aafdbe539d8afbe4e466d215e93a18817e
(In reply to Jacob Lifshay from comment #8) > It's similar but not identical because wikipedia's example does grev from > MSB to LSB (swap 4, swap 2, swap 1) but the code I wrote goes the other way > (swap 1, swap 2, swap 4). yes, a full butterfly isn't described on the wikipedia page: a full butterfly you go 1 2 4 8 16 8 4 2 1 (or 16 8 4 2 1 2 4 8 16, bizarrely it doesn't matter which) and ta-daa you have a full (100%) non-blocking routing network as long as you have a permutation [no attempts to route 2 pieces of data to the same destination]. this was my 3rd year project back in 1991. i added an option to allow 1 2 4 8 rather than 8 4 2 1 because of that. hmmm.... is that worth actually putting in as a bit-option into the instruction? because the process of inverting just those few bits (esp with SVP64) is going to be a pain in the ass. looking at the opcode list, though, it's already quite crushed for space https://libre-soc.org/openpower/sv/bitmanip/ although, removing ternlog entirely (leaving ternlogv in) would free up a heck of a lot of opcode space. btw did you see grev-or (gorc)? that got me thinking that perhaps an operator as an option to GRev would be cool. and also mean that GOrc is basically "import operator; GOrc = Grev(width, operator._or)" i have no idea if a nor or nand version of Grev would even be useful but i like the concept and can see it may have potential. > I never said it's an ror, I said it's *similar* to an ror, which is true. it > uses a very similar sequence of 2:1 muxes. i got rather confused by the xbitmanip PDF which happens to have the ROR diagram on the exact same page as the grev c-pseudocode, and thought "errrr" for about half an hour > > * i removed tee and pairwise: "adding yet more code" means "more things to > > read" > > I documented exactly what pairwise does...the exact same as python's > standard library -- it returns successive pairs of the input items. imho > it's waay easier to understand how I had written it. mhrhhhm... well, that's what the code-comments are for: to explain "not quite obvious" things [in english, not actual-code]. > no it's not, you *obviously* didn't run the test (cuz you broke it). yes, i'm catching up, i missed several messages (don't know how) and didn't see that a unit test now existed. sorted now. > Also, > having new members that show up only after calling something other than the > constructor is poor style and quite confusing to use imho, hence why I > replaced it. vs the confusion / expectation when people look at the constructor and go "oh, i must connect this array of Signals up, what the heck to, i have no idea", and given that it's *only* for the unit test that those are accessible, i'd say that the "public usage" wins over any kind of "convenience" for us as writers of a public API. elaborate() exists precisely because it was found to be problematic to expect a class to construct everything-it-needs within its constructor, so leveraging that is fine. > then, simply, they just read the docstring for _steps and see that it says > it's an internal signal only exposed for unit tests, and they then can > easily ignore it. that should have been *also totally obvious* because its > name starts with an underline anyway. again: we can't assume that. it's quite common to access member variables with underscores, despite being an [unenforceable] convention that attempts to dictate otherwise. and inherited classes often *require* access to them. in this case, it's definitely private. > we had previously agreed just referring to Notices.txt was sufficient, cuz > you can put the copyright notice in there and only have to update the > copyright years in one place. yyeah then i saw the FSFE's video at one of the NGI POINTER mini-conferences and was reminded that that's insufficient. also, from a legal perspective it's technically incorrect. changes to one file are required *only* to add the current year: updating copyright years in a global file makes a legally-dubious statement, "we claim *ALL* files were modified in this year" also, we are getting people taking our code and trying to claim both credit and copyright, which they can do by cherry-picking and ignoring the bland "see this other file" at the top. a copyright notice and creditation in every single file makes it abundantly clear, and if they remove those then that's a clear copyright violation. (which is the whole purpose of the exercise of having them)
I redid grev, adding a bunch of comments and some nice diagrams. https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/grev.py;h=35b45657eb5895028a88d7927454fbe2398cc27a;hb=HEAD
ascii art's great, sphinx plugin can turn it into an image. irony: you can have too many comments, they become a distraction esp. when the function is 2 lines and comments 10-12 spelling: "it's" is not a relative pronoun. it's short for the two words, "it" and "is". also "cuz" is fine in IRC but not documentation. inter function can go (1 liner made from these) https://stackoverflow.com/questions/339007/how-to-pad-zeroes-to-a-string https://stackoverflow.com/questions/16060899/alphabet-range-in-python string.ascii_lowercase[:-length].rjust(fixedlen, "0") negative ranges start from the opposite end. works with start as well e.g. list[-5:7] """docstrings convention end closer on a newline, lined up """ also can you put the rtlil creator back in, they are really useful
(In reply to Luke Kenneth Casson Leighton from comment #12) > also can you put the rtlil creator back in, they are really useful they're there, i added automatic rtlil writing to the do_sim function, so rtlil is written by every unit test. apparently that wasn't obvious from the "run this specific unit test, then run yosys" paraphrased comments: > # useful to see what is going on: > # python3 src/nmutil/test/test_grev.py > # yosys <<<"read_ilang sim_test_out/__main__.TestGrev.test_small/0.il; proc; clean -purge; show top"
(In reply to Luke Kenneth Casson Leighton from comment #12) > ascii art's great, sphinx plugin can turn it into an image. > irony: you can have too many comments, they become > a distraction esp. when the function is 2 lines and comments 10-12 > > spelling: "it's" is not a relative pronoun. it's short for the two words, > "it" and "is". ah yeah, annoying english spelling. I totally didn't notice when I wrote it, even though I usually catch that. > also "cuz" is fine in IRC but not documentation. > > inter function can go (1 liner made from these) > https://stackoverflow.com/questions/339007/how-to-pad-zeroes-to-a-string > https://stackoverflow.com/questions/16060899/alphabet-range-in-python > > string.ascii_lowercase[:-length].rjust(fixedlen, "0") that only works for the msb_first is False case. when msb_first is True, it produces names like 0buvwx00, which the above code doesn't do.
(In reply to Jacob Lifshay from comment #14) > > string.ascii_lowercase[:-length].rjust(fixedlen, "0") > > that only works for the msb_first is False case. when msb_first is True, it > produces names like 0buvwx00, which the above code doesn't do. this should solve that: s = reversed(string.ascii_lowercase[:-fixedlen]) if msb else string.ascii_lowercase return s.rjust(....) (In reply to Jacob Lifshay from comment #13) > (In reply to Luke Kenneth Casson Leighton from comment #12) > > also can you put the rtlil creator back in, they are really useful > > they're there, i added automatic rtlil writing to the do_sim function, so > rtlil is written by every unit test. yes, i know, and that is not the point. i add rtlil outputting to every module i write and both i and other people find it useful. we are not interested in wasting time hunting for a unit test, another file, or in wasting time waiting for that test to run, or wasting even more time hunting through a bunch of unnecessary crud looking for something that is unclear if it is even related to the module. please put it back.
oo, oo, i have an idea. how about using a pair of LUT4s instead of a pair of MUXes? this would cover gorc and a boatload of other instructions including all types of bitwise treereduce. it would look like: gternlut4(RT, RA, leftlut4, rightlut4) for i in range log2wid step = steps[i] for j in width: if j ^ xorstepthing: lut = leftlut4 else rightlut4 bit = lut4fn(step[j], step[j^xorstepthing], bit4) outstep[j] = bit so there are a *pair* of lut4s, one replacing the left mux, one replacing the right. when using this fn, setting the right lut4 to OR its inputs and the left one to pass its input through would create a treewise bitlevel OR reduction.
wark. has to be lut3 (8 bit table) because the step bit has to be included gternlut3(RT, RA, leftlut8, rightlut8) for i in range log2wid step = steps[i] for j in width: if j ^ xorstepthing: lut = leftlut8 else rightlut8 bit = lut3fn(step[j], step[j^xorstepthing], RA[i], lut) outstep[j] = bit this could end up as a huge number of gates, and, also, makes an immediate version impractical unless leftlut8 and rightlut8 are constructed from each other (restricting the options) leftlut7 = immed rightlut8 = immed[0:4] + ~immed[4:8] better i feel would be to use RB leftlut8 = RB[0:8] rightlut8 = RB[8:16] and RC for the selector. actually probably RC for lut pair and RB for the selector-immediate because the convention is to put any immediates down the RB data path.
i did a quick check (python3 nmutil/lut.py; yosys; synth; ltp) and this gives: * 2048 cells for a 64-bit LUT3 * longest path 7 that means that a 6-layer 64-bit g-tern-log is going to be absolutely massive: 13,000 cells (which are average... what... 5 gates each?) and a path length of almost 50. that will almost certainly have to be at least a 3-stage pipeline. the question is, then: is it worth it? for the sheer awesome flexibility it has - the number of options that can be covered - i'd say hell yes, if it wasn't for the fact that it's 65 *thousand* gates - four times larger than a 64-bit multiplier and three times larger than an IEEE754 FP multiply. which then begs the question: how can it be justified? what algorithms could it cover that would justify its inclusion? and: would a 32-bit variant be sufficient? a 32-bit variant would cut the gate count to around.... 20,000?
(In reply to Luke Kenneth Casson Leighton from comment #18) > the question is, then: is it worth it? for the sheer awesome flexibility > it has - the number of options that can be covered - i'd say hell yes, > if it wasn't for the fact that it's 65 *thousand* gates - four times > larger than a 64-bit multiplier and three times larger than an IEEE754 FP > multiply. I think that's an interesting idea, however I think that should be left for next time -- We're kinda out-of-time and I just want to be done with the bitmanip instructions...I'm going to work on the rest of the bitmanip instructions in approximately the order I think they will be most useful to least useful, hence why I started on ternlog and grev. If/when we run out of time, that way we will have the most useful ones all finished -- and the more questionable/incompletely-designed instructions left for last.
(In reply to Jacob Lifshay from comment #19) > I think that's an interesting idea, however I think that should be left for > next time -- We're kinda out-of-time and I just want to be done with the > bitmanip instructions... totally sensible. with gorc being a variant of grev i was thinking ahead. go for it.
added grev[w][i] to csvs
adding pseudo-code...I decided grevw should be defined to leave the MSB word cleared -- specifically zero-extending because that allows the compiler to skip zero-extension instructions that could otherwise be needed. I picked zero-extension rather than sign-extension because that doesn't require copying the sign bit, potentially saving wiring.
(In reply to Jacob Lifshay from comment #22) > skip zero-extension instructions that could otherwise be needed. I picked > zero-extension rather than sign-extension because that doesn't require > copying the sign bit, potentially saving wiring. sensible. signed concept is only meaningful on arithmetic, this is a Logical op. none of the Logical ops do arithmetic.
Added grev to BitManipTestCase, also had to spend a bunch of time fixing bugs in the pseudo-code and working around bugs in the simulator...(I don't want to spend the time right now to solve #765). I added a log2 helper function .. I figured that's probably fine because I designed it to assert if the input isn't an integer power of 2, so its function is basically always totally obvious imho. It's needed so I can properly index into a field where I need the lsb log2(XLEN) or log2(XLEN/2) bits, and OpenPower *loves* MSB0, making it extra annoying: just look at this mess: grevw's pseudo-code: result <- [0] * (XLEN / 2) a <- (RA)[XLEN/2:XLEN-1] b <- EXTZ64(RB) do i = 0 to XLEN / 2 - 1 idx <- b[64-log2(XLEN/2):63] ^ i result[i] <- a[idx] RT <- ([0] * (XLEN / 2)) || result we'll want to do that slicing RB too for the shift/rotate instructions cuz that makes the behavior more consistent (always shifting by RB_OR_IMM % XLEN). when I tried using % in grevw's pseudo-code, the LHS was 5 bits wide, and XLEN/2 was 32, which overflows 5 bits, causing it to wrap around to RB_OR_IMM % 0 which obviously raises a ZeroDivisionError.
(In reply to Jacob Lifshay from comment #24) > Added grev to BitManipTestCase, also had to spend a bunch of time fixing > bugs in the pseudo-code and working around bugs in the simulator... you've accidentally requested RC in the CSV file for this instruction, which inherently and automatically requests OE as well. that's a bug in the CSV file, not the simulator. > (I don't > want to spend the time right now to solve #765). that should be closed as INVALID. > I added a log2 helper function .. I figured that's probably fine it isn't. anything that's added requires a formal discussion and to go through an OpenPOWER ISA Working Group Request For Change (RFC) process. > because I > designed it to assert if the input isn't an integer power of 2, so its > function is basically always totally obvious imho. "obvious" is not a criteria that will satisfy the OPF ISA WG. they have a defined set of procedures and processes that have to be gone through, including but not limited to providing the text that is to go into the Power ISA Specification. > It's needed so I can > properly index into a field where I need the lsb log2(XLEN) or log2(XLEN/2) > bits, in this particular case, because it is log2 of a constant that is already pre-defined, there is an alternative: define a constant XLENLOG2 and XLEN2LOG2. this should go into the bugreport discussing and formally proposing adding log2 as an alternative, so that if there is a decision on it the reasons are fully available and can be placed directly into the RFC. > and OpenPower *loves* MSB0, making it extra annoying: > > just look at this mess: > grevw's pseudo-code: > result <- [0] * (XLEN / 2) > a <- (RA)[XLEN/2:XLEN-1] > b <- EXTZ64(RB) > do i = 0 to XLEN / 2 - 1 > idx <- b[64-log2(XLEN/2):63] ^ i > result[i] <- a[idx] > RT <- ([0] * (XLEN / 2)) || result yyep. looks fairly annoyingly normal. can i suggest instead a <- (RA)[XLEN/2:XLEN-1] b <- (RB)[64-log2(XLEN/2):XLEN-1] and then: idx <- b ^ i
(In reply to Luke Kenneth Casson Leighton from comment #25) > (In reply to Jacob Lifshay from comment #24) > > Added grev to BitManipTestCase, also had to spend a bunch of time fixing > > bugs in the pseudo-code and working around bugs in the simulator... > > you've accidentally requested RC in the CSV file for this instruction, > which inherently and automatically requests OE as well. that's a bug > in the CSV file, not the simulator. no, it's a bug in the simulator that there isn't a way to request RC without OE unless you add GREV to a magic list of exceptions -- those exceptions should be translated to an OE field in the CSV (they only cover instructions without OE where the XO-form OE bit is 1; instructions like nand without OE but with the XO-form OE bit set to 0 aren't included in the list of exceptions because mis-interpreting the XO-form OE bit works even though it's not an XO-form instruction. > > (I don't > > want to spend the time right now to solve #765). > > that should be closed as INVALID. nope, it is a valid bug, as explained above. > > > I added a log2 helper function .. I figured that's probably fine > > it isn't. anything that's added requires a formal discussion and > to go through an OpenPOWER ISA Working Group Request For Change (RFC) > process. well, because it's already in the spec pdf in v3.1 in dbcz's pseudo-code -- it's fine. > > and OpenPower *loves* MSB0, making it extra annoying: > > > > just look at this mess: > > grevw's pseudo-code: > > result <- [0] * (XLEN / 2) > > a <- (RA)[XLEN/2:XLEN-1] > > b <- EXTZ64(RB) > > do i = 0 to XLEN / 2 - 1 > > idx <- b[64-log2(XLEN/2):63] ^ i > > result[i] <- a[idx] > > RT <- ([0] * (XLEN / 2)) || result > > yyep. looks fairly annoyingly normal. can i suggest instead > > a <- (RA)[XLEN/2:XLEN-1] > b <- (RB)[64-log2(XLEN/2):XLEN-1] that doesn't work when XLEN != 64: if XLEN = 8, then length((RB)) = 8, and the slice expands to: (RB)[62:7] which is obviously out-of-range.
(In reply to Jacob Lifshay from comment #26) > no, it's a bug in the simulator that there isn't a way to request RC without > OE this really is not a bug, it's a [annoying] design decision. totally undocumented and "emergent", but not something that is unexpected. if there are going to be multiple Rc=1 but not OE=1 new Logical instructions then yes i think thst justifies sorting out. to keep compatibility i suggest adding a new RC option, called say RCONLY which will need to be added to DecodeRC and to ISACaller, and possibly the pipelines common_input.py and several other places. decode_regspec_map will need inspecting too. all pipeline, compunit and several issuer unit tests will need to be run to make absolutely sure no damage is done. it's not a "minor job" in other words. if you're going to tackle it for goodness sake don't push/commit anything without running a massive set of unit tests. use a branch if you like, it's justifiable
some notes/thoughts: merging gror and grev plus any-other-combo i think might be possible to do without going completely insane on the gate budget, by using LUT2s. with m.If(self.chunk_sizes[i]): comb += lut2.eq(swaplut2) with m.Else(): comb += lut2.eq(noswaplut2) for j in range(self.width): m.d.comb += step_o[j].eq(lut2(step_i[j], step_i[j ^ chunk_size], lut2) the instruction would be of the form grevtern(RT, RA, steps, swaplut2[0..3], noswaplut2[0..3]) only one set of LUT2s needed per layer, and it has applications beyond grev even when the steps is set to zero. still needs thought
QTY 64of LUT2s: Number of cells: 640 $_ANDNOT_ 192 $_AND_ 64 $_MUX_ 64 $_ORNOT_ 128 $_OR_ 192 six layers of those is only 3840 cells, which is not bad. it's certainly one hell of a lot less than a LUT3: Number of cells: 2048 $_ANDNOT_ 576 $_AND_ 192 $_MUX_ 64 $_NAND_ 192 $_NOT_ 64 $_ORNOT_ 128 $_OR_ 832 six layers --> 12,288 cells which given that's excluding the grev-muxes is far too much - brings it to over 20,000 cells. a LUT2 would be workable and cover 256 possible instructions rather than forcing us to do grevi and grorc as separate instructions. might actually save gates.
or: grevtern(RT, RA, steps, swaplut2[0..3]) with m.If(self.chunk_sizes[i]): for j in range(self.width): m.d.comb += step_o[j].eq(lut2(step_i[j], step_i[j ^ chunk_size], swaplut2) with m.Else(): m.d.comb += step_o.eq((step_i) which reduces the number of options to sane. provides 16 instructions in one.
(In reply to Luke Kenneth Casson Leighton from comment #30) > which reduces the number of options to sane. provides 16 instructions in > one. including, ha! if allowing (RA|0) it would allow, by setting swaplut2=0b1111 (or other), for certain combinations of 64-bit constants to be set without having to use the 6-instruction ORI/ORIS/shift sequence. possible constants: * 0b01010101 * 0b10101010 * 0b00001111 * 0b11110000 etc. etc. simply by setting the right step and swaplut2 values.
added grev to soc, and also added formal proof https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=66620a91bac731f6745fa33a7fe122b0aa7d3485
fantastic. do you have an indication of the gate count (and LUT4 resources, synth_ecp5)?
(In reply to Luke Kenneth Casson Leighton from comment #33) > fantastic. do you have an indication of the gate count (and LUT4 resources, > synth_ecp5)? I haven't checked, but I expect the additional resources required to be on the order of 500 2-in muxes (384 (6 layers * 64 bits) for the grev itself, the rest for misc routing in/out)
(In reply to Jacob Lifshay from comment #34) > (In reply to Luke Kenneth Casson Leighton from comment #33) > > fantastic. do you have an indication of the gate count (and LUT4 resources, > > synth_ecp5)? > > I haven't checked, but I expect the additional resources required to be on > the order of 500 2-in muxes (384 (6 layers * 64 bits) for the grev itself, > the rest for misc routing in/out) pffh, that's not bad at all. should be fine to look at a LUT2-based version which will give 16 instructions and dozens of immediates in a 32-bit op
(In reply to Luke Kenneth Casson Leighton from comment #35) > (In reply to Jacob Lifshay from comment #34) > > (In reply to Luke Kenneth Casson Leighton from comment #33) > > > fantastic. do you have an indication of the gate count (and LUT4 resources, > > > synth_ecp5)? > > > > I haven't checked, but I expect the additional resources required to be on > > the order of 500 2-in muxes (384 (6 layers * 64 bits) for the grev itself, > > the rest for misc routing in/out) > > pffh, that's not bad at all. should be fine to look at a LUT2-based version > which will give 16 instructions and dozens of immediates in a 32-bit op imho a lut2 version is enough additional latency that, if we're not worried about that, we should just merge with the shift/rotate logic, since imho it should be similar latency cost and will save a bunch of area.
(In reply to Jacob Lifshay from comment #36) > imho a lut2 version is enough additional latency that, if we're not worried > about that, we should just merge with the shift/rotate logic, since imho it > should be similar latency cost and will save a bunch of area. i made it clear already that i do not want shift/rot touched at all in any way shape or form right now. it was complex as hell to write and took several weeks to get right due to the options added by masks. please consider it 100% off-limits. no further discussion on modification to shift_rot right now please.
to give some idea of the significance, here: in the VERSA_ECP5 builds i am doing right now, pipe1 of shiftrot came up on the critical path, barely making 49.5 mhz. the only possible way to reduce that without major redesign is to cut the alu input stage into its own separate pipeline. any additional gates in rotate would only force a 2 stage pipeline and i do not consider it wise to spend time focussing on a near total redesign of the shiftrot function unit, there are much higher priorities
https://libre-soc.org/openpower/sv/grevlut2x2.jpg a separate left-right LUT2 table gives a lot more options both for creating constants (RA=0) as well as in more general circumstsances.
(In reply to Luke Kenneth Casson Leighton from comment #39) > https://libre-soc.org/openpower/sv/grevlut2x2.jpg > > a separate left-right LUT2 table gives a lot more options > both for creating constants (RA=0) as well as in more general > circumstsances. in order to actually have a grev, that lut2 version would need a different lookup table for each row of the butterfly network...that's a prohibitive number of immediates (2*4*6 = 24 bits) for a lut2butterfly instruction. also, we'd still want grev as a separate instruction. The other issue is the same as in the shift/rotate circuit: a lut2 has more delay than a mux, making grev slower than 1 clock cycle (assuming the shift/rotate barely fits in 1 cycle)
(In reply to Jacob Lifshay from comment #40) > (In reply to Luke Kenneth Casson Leighton from comment #39) > > https://libre-soc.org/openpower/sv/grevlut2x2.jpg > > > > a separate left-right LUT2 table gives a lot more options > > both for creating constants (RA=0) as well as in more general > > circumstsances. > > in order to actually have a grev, that lut2 version would need a different > lookup table for each row of the butterfly network... look again at the pseudocode. the option to disable the lut2 on a per-row basis is still there: https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/bitmanip.mdwn;hb=HEAD#l332 > that's a prohibitive > number of immediates (2*4*6 = 24 bits) for a lut2butterfly instruction. > also, we'd still want grev as a separate instruction. it becomes grevlut(immediate=0b10101010) or something like that. and gorc becomes grevlut(immediate=0b11101110) or something like that gxorc would be grevlut(immediate=0b01100110) or something > The other issue is the same as in the shift/rotate circuit: a lut2 has more > delay than a mux, making grev slower than 1 clock cycle (assuming the > shift/rotate barely fits in 1 cycle) i deliberately didn't use LUT3 for exactly this reason, that would almost certainly be over budget yes it's worth checking the longest-path in yosys after doing "synth" and/or "synth_ecp5".
I think we should use primary opcode 9 for the new second opcode for bitmanip instructions...but only for experimentation since it's marked as reserved because stuff used to be there. primary opcode 0 should't be used since it's reserved for implementation-dependent use and the `attn` opcode.
(In reply to Jacob Lifshay from comment #42) > I think we should use primary opcode 9 for the new second opcode for > bitmanip instructions...but only for experimentation since it's marked as > reserved because stuff used to be there. primary opcode 0 should't be used > since it's reserved for implementation-dependent use and the `attn` opcode. the official sandbox is EXT022. the only reason for moving ternlogi and grevlogi to a separate opcode is because they're completely nuts on bit-allocation, and were putting design pressure on the bitmanip space which is pretty heavily loaded. ternlogi has to move from EXT05 if grevlogi is to be on EXT09, they're allocated as a pair.
(In reply to Luke Kenneth Casson Leighton from comment #43) > (In reply to Jacob Lifshay from comment #42) > > I think we should use primary opcode 9 for the new second opcode for > > bitmanip instructions...but only for experimentation since it's marked as > > reserved because stuff used to be there. primary opcode 0 should't be used > > since it's reserved for implementation-dependent use and the `attn` opcode. > > the official sandbox is EXT022. the only reason for moving ternlogi > and grevlogi to a separate opcode is because they're completely nuts > on bit-allocation, and were putting design pressure on the bitmanip space > which is pretty heavily loaded. > > ternlogi has to move from EXT05 if grevlogi is to be on EXT09, they're > allocated as a pair. nope, grevlogi would still be in opcode 5. everything else (that're conveniently not implemented or not completed (grev[w][i][.]) yet) will move to opcode 9. opcode 22 is currently used by setvl and friends and iirc you didn't want to move them until you're back in working-on-SV mode, so I picked other opcodes meanwhile.
(In reply to Jacob Lifshay from comment #44) > opcode 22 is currently used by setvl and friends and iirc you didn't want to > move them until you're back in working-on-SV mode, so I picked other opcodes > meanwhile. there should - unless i made a mistake - be room for setvl in opcode 22.
(In reply to Luke Kenneth Casson Leighton from comment #45) > (In reply to Jacob Lifshay from comment #44) > > > opcode 22 is currently used by setvl and friends and iirc you didn't want to > > move them until you're back in working-on-SV mode, so I picked other opcodes > > meanwhile. > > there should - unless i made a mistake - be room for setvl in opcode 22. arse, i made a mistake. https://libre-soc.org/openpower/sv/bitmanip/ ok (not matching what's in sv/trans/svp64.py but correctable) NN --11 110 Rc setvl not ok because there's not enough space. or, there might be, i have to double-check the encoding. NN RA RB RC 10 --10 110 Rc rsvd
(In reply to Jacob Lifshay from comment #44) > opcode 22 is currently used by setvl and friends and iirc you didn't want to > move them until you're back in working-on-SV mode, so I picked other opcodes > meanwhile. nuts. there's not enough space, and i noted a mistake in the layout. https://libre-soc.org/openpower/sv/bitmanip/ these overlap: NN RT RA RB 1 itype 0101 110 Rc xperm NN RA RB RC 0 itype 0101 110 Rc av minmax NN RA RB RC 1 00 0101 110 Rc av abss NN RA RB RC 1 01 0101 110 Rc av absu NN RA RB 1 10 0101 110 Rc av avgadd those last 3 have to move. yep, EXT09 it is, EXT22 remains for setvl/svshape/svremap/svstep - i'll sort out the bitmanip opcode allocation table.
hang on, nope, i got it, i spotted extra encoding space NN RS RA RB RC 11 011 rsvd NN RA RB 1001 110 Rc rsvd NN RA RB RC --10 110 Rc rsvd NN --11 110 Rc setvl those will do for setvl, svstep, svshape and svremap i will work out the others
lkcl asked me to calculate gate counts for grevlut: just counting the grev network and luts: latency of a single mux-lut combination: 1 mux through data input and 2 muxes through select input. gate count of a single mux-lut combination: 4 muxes. latency of full network: 6 muxes through data input and 12 muxes through select input plus wire delay -- will almost certainly need 2 clock cycles. this is 3x a grev/gorc combined network. also 3x a rotator network. gate count of full network: 4 * 6 * 64 = 1536 muxes a grev/gorc combined network can be made by taking a grev network and writing it in terms of the 4-input and-or-invert cells that the muxes would likely be anyway, converting every other layer to or-and-invert, and then enabling both and-gate inputs simultaneously instead of having them be S and ~S. I'll make a diagram.
(In reply to Jacob Lifshay from comment #49) > latency of full network: > 6 muxes through data input and 12 muxes through select input plus wire delay > -- will almost certainly need 2 clock cycles. perfectly fine. i've made almost everything min 3 stage ALUs because of combinatorial paths in MultiCompALU. > this is 3x a grev/gorc > combined network. also 3x a rotator network. > > gate count of full network: > 4 * 6 * 64 = 1536 muxes call it 5 gates per mux? 8000 gates. less than a multiplier by 30%. for 256 instructions (which is what 2x 4-entry luts gives) that's pretty damn good. > a grev/gorc combined network can be made by taking a grev network and > writing it in terms of the 4-input and-or-invert cells that the muxes would > likely be anyway, converting every other layer to or-and-invert, and then > enabling both and-gate inputs simultaneously instead of having them be S and > ~S. which covers only 2 out of the possible 256 instructions of grevlut. actually 2 out of 512 because i added an invert-input bit as well. for less than 0.5% of instructions it's not worth the effort and grev/gorc is nowhere near capable of generating the types of immediates that grevluti can. grev and gorc i don't think are worth spending any more time on. frees up 8x X-Form instructions by removing them.
(In reply to Luke Kenneth Casson Leighton from comment #50) > (In reply to Jacob Lifshay from comment #49) > > > latency of full network: > > 6 muxes through data input and 12 muxes through select input plus wire delay > > -- will almost certainly need 2 clock cycles. > > perfectly fine. i've made almost everything min 3 stage ALUs because > of combinatorial paths in MultiCompALU. imho having everything be 3-stage ALUs is fine for a simple processor, but it's terrible if you actually want high performance which is what we're aiming for. The latency (assuming a mux is an and-or-invert gate with 1.5 gate latency and an inverter on select with 1 gate latency) is 6 * 1.5 + 12 * 2.5 = 39 gates!!! this is unacceptably slow for a grev imho. > > > this is 3x a grev/gorc > > combined network. also 3x a rotator network. > > > > gate count of full network: > > 4 * 6 * 64 = 1536 muxes > > call it 5 gates per mux? 8000 gates. less than a multiplier by 30%. > for 256 instructions (which is what 2x 4-entry luts gives) > that's pretty damn good. gate count isn't an issue. latency is. > > > a grev/gorc combined network can be made by taking a grev network and > > writing it in terms of the 4-input and-or-invert cells that the muxes would > > likely be anyway, converting every other layer to or-and-invert, and then > > enabling both and-gate inputs simultaneously instead of having them be S and > > ~S. > > which covers only 2 out of the possible 256 instructions of grevlut. it actually covers more than that, because that's just how you can make a faster circuit (probably what the riscv designers were thinking of when they added gorc) that implements both grev and gorc and other instructions too. imho we should have an instruction based on that circuit instead of grevlut with its 3x latency -- it would still have the control signals based off of the immediate like grevlut, but would be single-cycle even on a 5ghz cpu, unlike grevlut. > actually 2 out of 512 because i added an invert-input bit as well. adding a layer of xor gates is trivially easy to add to the grev/gorc circuit and doesn't add much latency -- imho having invert or not shouldn't be a reason to choose grevlut over the grev/gorc/etc. circuit.
The grev/gorc/etc. circuit (1/3 latency of grevlut and nearly just as flexible): https://libre-soc.org/openpower/sv/bitmanip/grev_gorc_design/
(In reply to Jacob Lifshay from comment #52) > The grev/gorc/etc. circuit (1/3 latency of grevlut and nearly just as > flexible): > https://libre-soc.org/openpower/sv/bitmanip/grev_gorc_design/ oo! oo! i'm so excited: a small modification to put in a *pair* of LUTs (one for left, one for right) and i think we have the exact same functionality.... perhaps not exactly the same. i'll try to code it up in a 2nd version of grevlut.py
nope, it's not the same. drat. what it's missing is the inversion (XOR-effect) of the input bits within each row. that's what makes grevlut special. here's what would work (roughly): https://ftp.libre-soc.org/2022-05-17_12-17.png * double the size of the AND-OR-INV (AOI) * bring in an *inversion* of the A[]-straight and A[]-cross-over bits per row * put the *inverted* copy of A[]-straight and A[]-crossover through the *second* set of AOIs * arrange for sh[] to be a pass-thru which puts A[]-straight untouched into the output. (this is the bit i got wrong in the diagram, if 4 AND gates are used then the output is always zero. what's needed is that the imm is converted to be... 0b0001 which indicates "please put A[] through untouched". i think) it would, i believe, in pseudocode, be this: AOI_inputs = A[i] | A[i^cross]<<1 | ~A[i]<<2 | ~A[i^cross]<<3 AOI = (AOI_inputs & (sh[j] ? imm : 0b0001)) != 0b0000 output[i] = AOI it introduces one single NOT-inverter on each layer, it combines the MUXes into one... at each point 6x64 there are: * 2x NOTs - 2 gates * 4x ANDs - 4 gates * 1x 4-input OR-gate - 4? gates? which is only 10 gates total. let's say i got it wrong and it's 20 gates. 20 * 6 * 64 = 7680 gates which is still damn good. what about the gate-delay chain: * gate-delay-chain is 3 gates (NOT, AND, 4-input-OR) * that's per layer, 6 layers * total 18 gates instead of 12 gates.
https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=527efabfbe08f5ce2ea81b4bf26d8b194eb109b5 gives the principle: it's nothing like the same as the MUX version of grevlut which has me concerned.
ok so the key difference as to why i designed grevlut the way it is, is to bring in *two* bits (A[i] and A[i^crossover]) into the binary LUT. the cascade effect of the application of multiple LUTs results in e.g. ANDing or NANDing of multiple bits, which is how the magic constants 0x11111111 0x10101010 0x10001000 0x10000000 emerge, and it turns out that 0x77777777 0x70707070 etc and many more can be created as well. the AOI trick *cannot do that* because the bits A[i] and A[i^crossover] are treated as *independent* thus unless otherwise radically altered the AOI trick is not going to be useful. one way to reduce the number of MUXes on the grevlut path is to substitute "identity" immediates: imm_left = sh[j] ? imm[0:3] : 0b0101 imm_right = sh[j] ? imm[4:7] : 0b1100 i think that does the trick... it'll be pretty close. this knocks out one MUX per layer in the shortest critical path.
(In reply to Luke Kenneth Casson Leighton from comment #56) > ok so the key difference as to why i designed grevlut the way it is, > is to bring in *two* bits (A[i] and A[i^crossover]) into the binary LUT. > > the cascade effect of the application of multiple LUTs results in > e.g. ANDing or NANDing of multiple bits, which is how the magic > constants 0x11111111 0x10101010 0x10001000 0x10000000 emerge, > and it turns out that 0x77777777 0x70707070 etc and many more > can be created as well. neat! > the AOI trick *cannot do that* because the bits A[i] and A[i^crossover] > are treated as *independent* that's not true, see: https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=26558305c986c30f6f786e77769c566c6104f085 This has the exact same aoi matrix that I proposed, just the inputs from sh are changed. it still has latency 6x aoi gates plus misc stuff on sh and input/output xor layers. it *can* generate 0x1000100010001 because the aoi matrix has the input/outputs invertable, giving you and-ing via de-morgan inversion of the or-gates inside the aoi matrix. I added in the thing where the aoi gates taking input from the left have different imm bits than the input from the right in the grev butterfly wires. (sorry, I can't think of a good way to describe it textually.)
(In reply to Jacob Lifshay from comment #57) > > and it turns out that 0x77777777 0x70707070 etc and many more > > can be created as well. > > neat! yeah, unexpected. it sort-of happened, but is kinda important to justify adding such an expensive operation (1/3 of a Major opcode) especially as "just" a bitmanip operation. > > the AOI trick *cannot do that* because the bits A[i] and A[i^crossover] > > are treated as *independent* > > that's not true, see: > https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=26558305c986c30f6f786e77769c566c6104f085 ah great, i must have got the implementation wrong ok that's good - it's still not getting as diverse a range of constants. * imm 0b1101110 will give f0f0f0f0... c0c0c0c0 80808080 88008800 808000008080000 880000 * imm 0b100110 will give one single bit set, based on sh (which is nice) but i see a 2nd imm doing exactly the same thing * imm 0b100111 is quite interesting, gives 20202020, 2222000022222 but there's no way to create 0xbbbb, 0x7777, 0x6666 etc which grevlut somehow manages to do. also there are a hell of a lot of zeros, 0xfffffffs, and so on, all wasted, which tells me it's not "mixing" enough. > This has the exact same aoi matrix that I proposed, just the inputs from sh are > changed. it still has latency 6x aoi gates plus misc stuff on sh and > input/output xor layers. > > it *can* generate 0x1000100010001 because the aoi matrix has the input/outputs > invertable, giving you and-ing via de-morgan inversion of the or-gates inside > the aoi matrix. > > I added in the thing where the aoi gates taking input from the left have > different imm bits than the input from the right in the grev butterfly wires. > (sorry, I can't think of a good way to describe it textually.) yeah we're in new territory here. i used left and right as well, i know what you mean.
as if grevlut was not fun enough already i figured a 3-in register variant would allow multiple left-right 4+4 LUTs in RC, one for each row in the swapping-lookup-crossover-mixing-thing. i can't explain why, it just felt right, to have the means to flip individual rows, i think this will turn out to be a really powerful instruction. i did however just realise that the sh parameter is totally redundant because if you can provide individual LUTs per row then it is a straightforward matter to set an 8-bit setting within RC for that row which does absolutely nothing. hilariously a grevluti might help with setting up RC with the appropriate bits, hmm. some of them. thoughts appreciated.
the "identity" pattern which makes A (left side) go straight through is 0b1010, and B side is 0b1100. i think. which would be an 8bit of 0b11001010 or 0xca. honestly it would be easier (a lot easier) to arrange the inputs to the LUT pairs if that was 0xaa! "identity" (no change at all) would just be 0xaaaaaaaaaaaaaaa (6x 8-bit 0xaa) which can be arranged by swapping a and b on the right side. all needs checking.
annoying as it is, let's consider this done but qualified as R&D "lessons". there is not enough EXT022 opcode space for so many instructions, when grevlut and grevluti are so compelling. i have not yet been even able to fit *any* of the GFP or GF2^N operations in, yet. i got clmul in, and gf2, none of the 4-operand GFs. adding grevlut needs to be under a separate bugreport with its own budget.
(In reply to Luke Kenneth Casson Leighton from comment #61) > annoying as it is, let's consider this done but qualified as R&D "lessons". ok, sounds good. this totals to 557 lines so (using the method from bug #1025) that comes out to EUR 1500 measuring as of: https://git.libre-soc.org/?p=openpower-isa.git;a=commit;h=b55f37ee14b3253087f0a743eb795b53e59f25d2 https://git.libre-soc.org/?p=nmutil.git;a=commit;h=7a3ab7ecce390b2003ff61befca59cd4ad6b7428 measuring nmutil.git: src/nmutil/grev.py 270 lines src/nmutil/test/test_grev.py 86 lines measuring openpower-isa.git: openpower/isa/bitmanip.mdwn about 80 lines openpower/isatables/minor_5.csv 4 lines src/openpower/decoder/isa/caller.py about 4 lines src/openpower/decoder/power_decoder2.py 1 line src/openpower/decoder/power_enums.py 4 lines src/openpower/sv/trans/svp64.py about 32 lines src/openpower/test/bitmanip/bitmanip_cases.py about 50 lines src/openpower/decoder/helpers.py about 10 lines (log2 needed by grev tests) openpower/isatables/fields.text about 12 lines (add XB form for grev) src/openpower/decoder/formal/proof_decoder2.py 2 lines (add XBI for grev) src/openpower/decoder/power_decoder2.py 2 lines (add XBI for grev)
(In reply to Jacob Lifshay from comment #62) > src/openpower/decoder/power_decoder2.py 2 lines (add XBI for grev) you also now need to remove it. i did say that it is not going in. please remove all of these unauthorized instructions from the CSV files, pipelines and enums (OP_GREV). unfortunately you went too far down the implementation path. budget remains the same, the work itself and the lesson learned was useful. leave in nmutil as it is an independent and valuable library.
(In reply to Luke Kenneth Casson Leighton from comment #63) > please remove all of these unauthorized instructions from > the CSV files, pipelines and enums (OP_GREV). I think "unauthorized" is misleading, at the time, we all agreed to add them, you included. I'll remove them later, except I think we should leave the unit tests, since they'll be useful for grevlut when that gets implemented.
(In reply to Jacob Lifshay from comment #64) > I'll remove them later, except I think we should leave the unit tests, since > they'll be useful for grevlut when that gets implemented. I removed grev from openpower-isa.git and soc.git, leaving the tests for future use with grevlut. I also fixed soc.git's CI, so we can actually test that it wasn't broken by removing grev. https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=8fc152e6e1710bfe32e2eea9dee0ec1f415ce719 https://git.libre-soc.org/?p=soc.git;a=shortlog;h=5aa0ab1da7681952e5e6424800c0f80b84826695