755 – add grev instruction (OP_GREV)

Bug 755 - add grev instruction (OP_GREV)

Summary: add grev instruction (OP_GREV)

Status:	RESOLVED FIXED

Alias:	None

Product:	Libre-SOC's second ASIC
Classification:	Unclassified
Component:	source code (show other bugs)
Version:	unspecified
Hardware:	PC Linux

Importance:	--- enhancement
Assignee:	Jacob Lifshay

URL:	https://libre-soc.org/openpower/sv/bi...

Depends on:	757
Blocks:	741 771
	Show dependency tree / graph

Reported:	2021-12-10 23:46 GMT by Jacob Lifshay
Modified:	2023-11-14 01:36 GMT (History)
CC List:	3 users (show)

See Also:	765 852 840
NLnet milestone:	NLnet.2021.02A.052.CryptoRouter
total budget (EUR) for completion of task and all subtasks:	1500
budget (EUR) for this task, excluding subtasks' budget:	1500
parent task for budget allocation:	741
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:	lkcl = { amount = 500, submitted = 2023-09-29, paid = 2023-10-04 } [jacob] amount = 1000 submitted = 2023-10-15 paid = 2023-11-10

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Jacob Lifshay 2021-12-10 23:46:07 GMT

* DONE: GRev class in nmutil
https://git.libre-soc.org/?p=nmutil.git;a=commitdiff;h=51e5a621621c12bb34b44d6bb8a29f1f668a111e
* DONE: comments in grev.py
* DONE: add NGI POINTER funding notice
* DONE: unit tests for GRev in nmutil
* DONE: formal tests for GRev in nmutil
  note, to be covered by bug #840
* DONE: add to openpower-isa
  * DONE: add to csv
  * DONE: add pseudo-code
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=openpower/isa/bitmanip.mdwn;h=3f73b0b894583e8cfb2b871988c5b593f6412382;hb=HEAD
  * DONE: add to BitManipTestCase
https://git.libre-soc.org/?p=openpower-isa.git;a=commit;h=b55f37ee14b3253087f0a743eb795b53e59f25d2
* DONE: add to soc
  * DONE: add formal proof
https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=66620a91bac731f6745fa33a7fe122b0aa7d3485
  note, to be covered by bug #840
* TODO: create formal correctness bugreport under bug #840
  to cover tasks done here
* TODO: remove from openpower-isa (leave in nmutil, all good there)

Comment 1 Luke Kenneth Casson Leighton 2021-12-11 15:34:56 GMT

commit 7aed55e3538fb39327bf4946843292d128d29abc (HEAD -> master, origin/master, origin/HEAD)
Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date:   Sat Dec 11 15:29:09 2021 +0000

    add some comments (locations for comments to be added)

i took a quick look, one thing: i think you'll probably find this works:


        # start with input as first layer
        step_i = self.input

        for i in range(self.log2_width):
            step_o = Signal(self.width, name=f"step{i}")
            chunk_size = 1 << i
            # TODO comment that this is creating the mux-swapper
            with m.If(self.chunk_sizes[i]):
                # swap path
                for j in range(self.width):
                    # TODO explain what this XOR does
                    m.d.comb += step_o[j].eq(step_i[j ^ chunk_size])
            with m.Else():
                # straight path
                m.d.comb += step_o.eq(step_i)
            # output becomes input for next phase
            step_i = step_o
        # TODO comment that the last "step" is the output
        m.d.comb += self.output.eq(step_o)

Comment 2 Jacob Lifshay 2021-12-12 02:19:17 GMT

(In reply to Luke Kenneth Casson Leighton from comment #1)
> commit 7aed55e3538fb39327bf4946843292d128d29abc (HEAD -> master,
> origin/master, origin/HEAD)
> Author: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
> Date:   Sat Dec 11 15:29:09 2021 +0000
> 
>     add some comments (locations for comments to be added)
> 
> i took a quick look, one thing: i think you'll probably find this works:
> 
> 
>         # start with input as first layer
>         step_i = self.input
> 
>         for i in range(self.log2_width):
>             step_o = Signal(self.width, name=f"step{i}")

I specifically have the intermediate signals exposed because I want to have the tests be able to access them to assert they have the correct values.

Comment 3 Luke Kenneth Casson Leighton 2021-12-12 11:01:03 GMT

(In reply to Jacob Lifshay from comment #2)

> I specifically have the intermediate signals exposed because I want to have
> the tests be able to access them to assert they have the correct values.

then why was that not added as a comment, explaining that?
nothing "unusual" like that can be left out of the comments.

you need to think - ask yourself the question (every time):
"how will someone else view this decision that i am making?
will they understand it?  come to think of it, in 3-6 months
time will *i* understand it?"

Comment 4 Luke Kenneth Casson Leighton 2021-12-12 14:46:50 GMT

this makes the need for a sub-function redundant:

        for i in range(self._log2_width):
            step_o = Signal(self.width, name="step_%d" % i)
            _steps.append(step_o)
            ...
            ...
            step_i = step_o # for next loop, to create the combinatorial chain

then at the end self._steps can be set to _steps.

optionally, a constructor parameter could be added "testing_mode=False"
which does not add that.

or, another technique that i've seen used - and this is for Formal Correctness
Proofs only:

def elaborate(self, platform):

    if platform == 'formal':
        # do some Asserts.

it's a small amount of code, here, so isn't a Big Deal.  if it was a much
larger module, exposing a considerably larger suite of "test" signals,
i'd start to get concerned.

Comment 5 Jacob Lifshay 2021-12-17 03:21:05 GMT

I cleaned up the grev implementation a bunch and added docs, unit tests, and formal proofs.

Too bad Python 3.7 doesn't have Rust's `windows` iterator, it'd be really useful here. I added `pairwise` cuz we're not running Python 3.10 yet...I want it anyway!

https://doc.rust-lang.org/std/primitive.slice.html#method.windows

Comment 6 Jacob Lifshay 2021-12-17 03:29:46 GMT

lkcl, can you add the funding notice?

src/nmutil/grev.py:4
# TODO(lkcl): add NGI POINTER funding notice, idk what we need for that...

Comment 7 Luke Kenneth Casson Leighton 2021-12-17 12:47:35 GMT

https://git.libre-soc.org/?p=nmutil.git;a=commitdiff;h=c63d1f2b483573fae75af6688c07409b84efa7e4

i did a style rewrite, and updated the comments.  it's definitely a
butterfly network - you can see by looking at the wikipedia page
and comparing against the xbitmanip draft.  it's definitely not
an ror, that's the function before in the RVV xbitmanip document
(the previous page contains a different diagram, for ror, not
butterfly)

https://en.wikipedia.org/wiki/Butterfly_network#/media/File:Butterfly_Network.jpg
https://ftp.libre-soc.org/2021-12-17_11-17.png

* i removed tee and pairwise: "adding yet more code" means "more things to read"
  returned the code to the style that you replaced, which uses a list for
  an accumulator, inside elaborate.

  this is perfectly fine to do because the dut is instantiated (elaborate
  called) long before the unit test begins to refer to the actual contents

  this has the advantage of removing the intermediary signals from the
  constructor, such that when someone reads the constructor to find out
  what Signals they should be connecting up, they don't go "urrr what the
  heck should i do with these"

* module docstrings != class docstrings.  module docstrings describe the
  general concept, class docstrings typically tell you "how to use the class"

  the two are radically different

* saying "see this other stuff somewhere else" entirely defeats the object
  of the exercise of comments!

  comments are for "immediate thoughts necessary to understand, right now,
  what the hell is going on, right now".  referring *elsewhere* causes
  massive confusion, frustration, and irritation for the reader.

  for example: self._steps is not obvious at all that it's an accumulator
  of everything (input, intermediary, output)... *therefore say so*,
  then comment each addition and, oh look, "the last layer is also the
  output", okaaay, that's quite obvious.

  bottom line is that you should *not* assume - at all - that the reader
  is capable of understanding the HDL (the code).  AT ALL.
  someone who has NO UNDERSTANDING of python should be able to read
  the english language "words" and go, "hmmmm, y'know: i can just about
  understand this"

  also, any "tricks" - things that break standard conventions - DEFINITELY
  need to be commented.

* i replaced the "If" with an *actual* Mux, because it replaces 4 lines
  with 1.

* the FSFE provides guidance on how to do proper copyright notices.

  Copyright (C) year, year, year, year fullname contactinfo

  it is NOT ok to put yearstart-yearend, and you MUST NOT claim
  copyright in a year that you did not ACTUALLY make any additions.

  there are plenty of examples in the code to follow here.

Comment 8 Jacob Lifshay 2021-12-17 16:56:07 GMT

(In reply to Luke Kenneth Casson Leighton from comment #7)
> https://git.libre-soc.org/?p=nmutil.git;a=commitdiff;
> h=c63d1f2b483573fae75af6688c07409b84efa7e4
> 
> i did a style rewrite, and updated the comments.  it's definitely a
> butterfly network - you can see by looking at the wikipedia page
> and comparing against the xbitmanip draft.
It's similar but not identical because wikipedia's example does grev from MSB to LSB (swap 4, swap 2, swap 1) but the code I wrote goes the other way (swap 1, swap 2, swap 4).
>  it's definitely not
> an ror
I never said it's an ror, I said it's *similar* to an ror, which is true. it uses a very similar sequence of 2:1 muxes.

> * i removed tee and pairwise: "adding yet more code" means "more things to
> read"

I documented exactly what pairwise does...the exact same as python's standard library -- it returns successive pairs of the input items. imho it's waay easier to understand how I had written it.

>   returned the code to the style that you replaced, which uses a list for
>   an accumulator, inside elaborate.
> 
>   this is perfectly fine to do because the dut is instantiated (elaborate
>   called) long before the unit test begins to refer to the actual contents

no it's not, you *obviously* didn't run the test (cuz you broke it). Also, having new members that show up only after calling something other than the constructor is poor style and quite confusing to use imho, hence why I replaced it.
> 
>   this has the advantage of removing the intermediary signals from the
>   constructor, such that when someone reads the constructor to find out
>   what Signals they should be connecting up, they don't go "urrr what the
>   heck should i do with these"

then, simply, they just read the docstring for _steps and see that it says it's an internal signal only exposed for unit tests, and they then can easily ignore it. that should have been *also totally obvious* because its name starts with an underline anyway.


> * the FSFE provides guidance on how to do proper copyright notices.
> 
>   Copyright (C) year, year, year, year fullname contactinfo

we had previously agreed just referring to Notices.txt was sufficient, cuz you can put the copyright notice in there and only have to update the copyright years in one place.
> 
>   it is NOT ok to put yearstart-yearend, and you MUST NOT claim
>   copyright in a year that you did not ACTUALLY make any additions.
> 
>   there are plenty of examples in the code to follow here.

Comment 9 Luke Kenneth Casson Leighton 2021-12-17 21:56:15 GMT

(In reply to Jacob Lifshay from comment #6)
> lkcl, can you add the funding notice?
> 
> src/nmutil/grev.py:4
> # TODO(lkcl): add NGI POINTER funding notice, idk what we need for that...

sorted,

# Funded by NLnet Assure Programme 2021-02-052, https://nlnet.nl/assure part
# of Horizon 2020 EU Programme 957073.

https://git.libre-soc.org/?p=nmutil.git;a=commitdiff;h=20db17aafdbe539d8afbe4e466d215e93a18817e

Comment 10 Luke Kenneth Casson Leighton 2021-12-17 23:05:04 GMT

(In reply to Jacob Lifshay from comment #8)

> It's similar but not identical because wikipedia's example does grev from
> MSB to LSB (swap 4, swap 2, swap 1) but the code I wrote goes the other way
> (swap 1, swap 2, swap 4).

yes, a full butterfly isn't described on the wikipedia page: a full butterfly
you go 1 2 4 8 16 8 4 2 1 (or 16 8 4 2 1 2 4 8 16, bizarrely it doesn't
matter which) and ta-daa you have a full (100%) non-blocking routing network
as long as you have a permutation [no attempts to route 2 pieces of data
to the same destination]. this was my 3rd year project back in 1991.

i added an option to allow 1 2 4 8 rather than 8 4 2 1 because of that.

hmmm.... is that worth actually putting in as a bit-option into the
instruction? because the process of inverting just those few bits
(esp with SVP64) is going to be a pain in the ass.  looking at
the opcode list, though, it's already quite crushed for space
https://libre-soc.org/openpower/sv/bitmanip/

although, removing ternlog entirely (leaving ternlogv in) would
free up a heck of a lot of opcode space.

btw did you see grev-or (gorc)? that got me thinking that perhaps
an operator as an option to GRev would be cool.  and also mean that
GOrc is basically "import operator; GOrc = Grev(width, operator._or)"

i have no idea if a nor or nand version of Grev would even be useful
but i like the concept and can see it may have potential.

> I never said it's an ror, I said it's *similar* to an ror, which is true. it
> uses a very similar sequence of 2:1 muxes.

i got rather confused by the xbitmanip PDF which happens to have the
ROR diagram on the exact same page as the grev c-pseudocode, and thought
"errrr" for about half an hour

> > * i removed tee and pairwise: "adding yet more code" means "more things to
> > read"
> 
> I documented exactly what pairwise does...the exact same as python's
> standard library -- it returns successive pairs of the input items. imho
> it's waay easier to understand how I had written it.

mhrhhhm... well, that's what the code-comments are for: to explain "not quite
obvious" things [in english, not actual-code].

> no it's not, you *obviously* didn't run the test (cuz you broke it). 

yes, i'm catching up, i missed several messages (don't know how) and
didn't see that a unit test now existed.  sorted now.

> Also,
> having new members that show up only after calling something other than the
> constructor is poor style and quite confusing to use imho, hence why I
> replaced it.

vs the confusion / expectation when people look at the constructor and
go "oh, i must connect this array of Signals up, what the heck to, i have
no idea", and given that it's *only* for the unit test that those are
accessible, i'd say that the "public usage" wins over any kind of
"convenience" for us as writers of a public API.

elaborate() exists precisely because it was found to be problematic
to expect a class to construct everything-it-needs within its constructor,
so leveraging that is fine.

> then, simply, they just read the docstring for _steps and see that it says
> it's an internal signal only exposed for unit tests, and they then can
> easily ignore it. that should have been *also totally obvious* because its
> name starts with an underline anyway.

again: we can't assume that.  it's quite common to access member variables
with underscores, despite being an [unenforceable] convention that attempts
to dictate otherwise.  and inherited classes often *require* access to them.
in this case, it's definitely private.

> we had previously agreed just referring to Notices.txt was sufficient, cuz
> you can put the copyright notice in there and only have to update the
> copyright years in one place.

yyeah then i saw the FSFE's video at one of the NGI POINTER mini-conferences
and was reminded that that's insufficient.  also, from a legal perspective
it's technically incorrect.  changes to one file are required *only* to
add the current year: updating copyright years in a global file makes a
legally-dubious statement, "we claim *ALL* files were modified in this
year"

also, we are getting people taking our code and trying to claim both credit
and copyright, which they can do by cherry-picking and ignoring the bland
"see this other file" at the top.

a copyright notice and creditation in every single file makes it abundantly
clear, and if they remove those then that's a clear copyright violation.
(which is the whole purpose of the exercise of having them)

Comment 11 Jacob Lifshay 2021-12-23 04:49:28 GMT

I redid grev, adding a bunch of comments and some nice diagrams.

https://git.libre-soc.org/?p=nmutil.git;a=blob;f=src/nmutil/grev.py;h=35b45657eb5895028a88d7927454fbe2398cc27a;hb=HEAD

Comment 12 Luke Kenneth Casson Leighton 2021-12-23 06:20:22 GMT

ascii art's great, sphinx plugin can turn it into an image.
  irony: you can have too many comments, they become
a distraction esp. when the function is 2 lines and comments 10-12

spelling: "it's" is not a relative pronoun. it's short for the two words, "it" and "is".  also "cuz" is fine in IRC but not documentation.

inter function can go (1 liner made from these)
https://stackoverflow.com/questions/339007/how-to-pad-zeroes-to-a-string
https://stackoverflow.com/questions/16060899/alphabet-range-in-python

string.ascii_lowercase[:-length].rjust(fixedlen, "0")

negative ranges start from the opposite end.  works with
start as well e.g. list[-5:7]

"""docstrings convention
end closer on a newline, lined up
"""

also can you put the rtlil creator back in, they are really useful

Comment 13 Jacob Lifshay 2021-12-23 06:58:01 GMT

(In reply to Luke Kenneth Casson Leighton from comment #12)
> also can you put the rtlil creator back in, they are really useful

they're there, i added automatic rtlil writing to the do_sim function, so rtlil is written by every unit test. apparently that wasn't obvious from the "run this specific unit test, then run yosys" paraphrased comments:
> # useful to see what is going on:
> # python3 src/nmutil/test/test_grev.py
> # yosys <<<"read_ilang sim_test_out/__main__.TestGrev.test_small/0.il; proc; clean -purge; show top"

Comment 14 Jacob Lifshay 2021-12-23 07:06:26 GMT

(In reply to Luke Kenneth Casson Leighton from comment #12)
> ascii art's great, sphinx plugin can turn it into an image.
>   irony: you can have too many comments, they become
> a distraction esp. when the function is 2 lines and comments 10-12
> 
> spelling: "it's" is not a relative pronoun. it's short for the two words,
> "it" and "is".

ah yeah, annoying english spelling. I totally didn't notice when I wrote it, even though I usually catch that.

>  also "cuz" is fine in IRC but not documentation.
> 
> inter function can go (1 liner made from these)
> https://stackoverflow.com/questions/339007/how-to-pad-zeroes-to-a-string
> https://stackoverflow.com/questions/16060899/alphabet-range-in-python
> 
> string.ascii_lowercase[:-length].rjust(fixedlen, "0")

that only works for the msb_first is False case. when msb_first is True, it produces names like 0buvwx00, which the above code doesn't do.

Comment 15 Luke Kenneth Casson Leighton 2021-12-23 20:55:14 GMT

(In reply to Jacob Lifshay from comment #14)

> > string.ascii_lowercase[:-length].rjust(fixedlen, "0")
> 
> that only works for the msb_first is False case. when msb_first is True, it
> produces names like 0buvwx00, which the above code doesn't do.

this should solve that:
s = reversed(string.ascii_lowercase[:-fixedlen]) if msb else string.ascii_lowercase
return s.rjust(....)

(In reply to Jacob Lifshay from comment #13)
> (In reply to Luke Kenneth Casson Leighton from comment #12)
> > also can you put the rtlil creator back in, they are really useful
> 
> they're there, i added automatic rtlil writing to the do_sim function, so
> rtlil is written by every unit test.

yes, i know, and that is not the point.  i add rtlil outputting to
every module i write and both i and other people find it useful.
we are not interested in wasting time hunting for a unit test,
another file, or in wasting time waiting for that test to run, or
wasting even more time hunting through a bunch of unnecessary crud
looking for something that is unclear if it is even related to
the module.

please put it back.

Comment 16 Luke Kenneth Casson Leighton 2021-12-25 03:14:08 GMT

oo, oo, i have an idea. how about using a pair of LUT4s instead of
a pair of MUXes? this would cover gorc and a boatload of other
instructions including all types of bitwise treereduce.

it would look like:

gternlut4(RT, RA, leftlut4, rightlut4)
   for i in range log2wid
      step = steps[i]
      for j in width:
          if j ^ xorstepthing: lut = leftlut4 else rightlut4
          bit = lut4fn(step[j], step[j^xorstepthing], bit4)
          outstep[j] = bit

so there are a *pair* of lut4s, one replacing the left mux,
one replacing the right.

when using this fn, setting the right lut4 to OR its inputs
and the left one to pass its input through would create a
treewise bitlevel OR reduction.

Comment 17 Luke Kenneth Casson Leighton 2021-12-25 14:13:52 GMT

wark.  has to be lut3 (8 bit table) because the step bit has
to be included

gternlut3(RT, RA, leftlut8, rightlut8)
   for i in range log2wid
      step = steps[i]
      for j in width:
          if j ^ xorstepthing: lut = leftlut8 else rightlut8
          bit = lut3fn(step[j], step[j^xorstepthing], RA[i], lut)
          outstep[j] = bit

this could end up as a huge number of gates, and, also, makes an
immediate version impractical unless leftlut8 and rightlut8 are
constructed from each other (restricting the options)

    leftlut7 = immed
    rightlut8 = immed[0:4] + ~immed[4:8]

better i feel would be to use RB

    leftlut8 = RB[0:8]
    rightlut8 = RB[8:16]

and RC for the selector.  actually probably RC for lut pair and
RB for the selector-immediate because the convention is to put
any immediates down the RB data path.

Comment 18 Luke Kenneth Casson Leighton 2021-12-27 22:10:57 GMT

i did a quick check (python3 nmutil/lut.py; yosys; synth; ltp) and this gives:

* 2048 cells for a 64-bit LUT3
* longest path 7

that means that a 6-layer 64-bit g-tern-log is going to be absolutely
massive: 13,000 cells (which are average... what... 5 gates each?)
and a path length of almost 50.  that will almost certainly have to
be at least a 3-stage pipeline.

the question is, then: is it worth it?  for the sheer awesome flexibility
it has - the number of options that can be covered - i'd say hell yes,
if it wasn't for the fact that it's 65 *thousand* gates - four times
larger than a 64-bit multiplier and three times larger than an IEEE754 FP
multiply.

which then begs the question: how can it be justified? what algorithms
could it cover that would justify its inclusion?

and: would a 32-bit variant be sufficient? a 32-bit variant would cut
the gate count to around.... 20,000?

Comment 19 Jacob Lifshay 2022-01-06 02:56:21 GMT

(In reply to Luke Kenneth Casson Leighton from comment #18)
> the question is, then: is it worth it?  for the sheer awesome flexibility
> it has - the number of options that can be covered - i'd say hell yes,
> if it wasn't for the fact that it's 65 *thousand* gates - four times
> larger than a 64-bit multiplier and three times larger than an IEEE754 FP
> multiply.

I think that's an interesting idea, however I think that should be left for next time -- We're kinda out-of-time and I just want to be done with the bitmanip instructions...I'm going to work on the rest of the bitmanip instructions in approximately the order I think they will be most useful to least useful, hence why I started on ternlog and grev. If/when we run out of time, that way we will have the most useful ones all finished -- and the more questionable/incompletely-designed instructions left for last.

Comment 20 Luke Kenneth Casson Leighton 2022-01-06 03:01:34 GMT

(In reply to Jacob Lifshay from comment #19)

> I think that's an interesting idea, however I think that should be left for
> next time -- We're kinda out-of-time and I just want to be done with the
> bitmanip instructions...

totally sensible.  with gorc being a variant of grev i was thinking ahead.
go for it.

Comment 21 Jacob Lifshay 2022-01-06 05:01:21 GMT

added grev[w][i] to csvs

Comment 22 Jacob Lifshay 2022-01-14 22:53:36 GMT

adding pseudo-code...I decided grevw should be defined to leave the MSB word cleared -- specifically zero-extending because that allows the compiler to skip zero-extension instructions that could otherwise be needed. I picked zero-extension rather than sign-extension because that doesn't require copying the sign bit, potentially saving wiring.

Comment 23 Luke Kenneth Casson Leighton 2022-01-14 23:45:18 GMT

(In reply to Jacob Lifshay from comment #22)

> skip zero-extension instructions that could otherwise be needed. I picked
> zero-extension rather than sign-extension because that doesn't require
> copying the sign bit, potentially saving wiring.

sensible.  signed concept is only meaningful on arithmetic, this is
a Logical op.  none of the Logical ops do arithmetic.

Comment 24 Jacob Lifshay 2022-01-18 05:10:22 GMT

Added grev to BitManipTestCase, also had to spend a bunch of time fixing bugs in the pseudo-code and working around bugs in the simulator...(I don't want to spend the time right now to solve #765).

I added a log2 helper function .. I figured that's probably fine because I designed it to assert if the input isn't an integer power of 2, so its function is basically always totally obvious imho. It's needed so I can properly index into a field where I need the lsb log2(XLEN) or log2(XLEN/2) bits, and OpenPower *loves* MSB0, making it extra annoying:

just look at this mess:
grevw's pseudo-code:
    result <- [0] * (XLEN / 2)
    a <- (RA)[XLEN/2:XLEN-1]
    b <- EXTZ64(RB)
    do i = 0 to XLEN / 2 - 1
      idx <- b[64-log2(XLEN/2):63] ^ i
      result[i] <- a[idx]
    RT <- ([0] * (XLEN / 2)) || result

we'll want to do that slicing RB too for the shift/rotate instructions cuz that makes the behavior more consistent (always shifting by RB_OR_IMM % XLEN).

when I tried using % in grevw's pseudo-code, the LHS was 5 bits wide, and XLEN/2 was 32, which overflows 5 bits, causing it to wrap around to RB_OR_IMM % 0 which obviously raises a ZeroDivisionError.

Comment 25 Luke Kenneth Casson Leighton 2022-01-18 12:36:04 GMT

(In reply to Jacob Lifshay from comment #24)
> Added grev to BitManipTestCase, also had to spend a bunch of time fixing
> bugs in the pseudo-code and working around bugs in the simulator...

you've accidentally requested RC in the CSV file for this instruction,
which inherently and automatically requests OE as well.  that's a bug
in the CSV file, not the simulator.

> (I don't
> want to spend the time right now to solve #765).

that should be closed as INVALID.

> I added a log2 helper function .. I figured that's probably fine 

it isn't.  anything that's added requires a formal discussion and
to go through an OpenPOWER ISA Working Group Request For Change (RFC)
process.

> because I
> designed it to assert if the input isn't an integer power of 2, so its
> function is basically always totally obvious imho. 

"obvious" is not a criteria that will satisfy the OPF ISA WG. they have
a defined set of procedures and processes that have to be gone through,
including but not limited to providing the text that is to go into the
Power ISA Specification.

> It's needed so I can
> properly index into a field where I need the lsb log2(XLEN) or log2(XLEN/2)
> bits,

in this particular case, because it is log2 of a constant that is already
pre-defined, there is an alternative: define a constant XLENLOG2 and
XLEN2LOG2.

this should go into the bugreport discussing and formally proposing
adding log2 as an alternative, so that if there is a decision on it
the reasons are fully available and can be placed directly into the RFC.

> and OpenPower *loves* MSB0, making it extra annoying:
> 
> just look at this mess:
> grevw's pseudo-code:
>     result <- [0] * (XLEN / 2)
>     a <- (RA)[XLEN/2:XLEN-1]
>     b <- EXTZ64(RB)
>     do i = 0 to XLEN / 2 - 1
>       idx <- b[64-log2(XLEN/2):63] ^ i
>       result[i] <- a[idx]
>     RT <- ([0] * (XLEN / 2)) || result

yyep. looks fairly annoyingly normal. can i suggest instead

     a <- (RA)[XLEN/2:XLEN-1]
     b <- (RB)[64-log2(XLEN/2):XLEN-1]

and then:

       idx <- b ^ i

Comment 26 Jacob Lifshay 2022-01-18 18:35:53 GMT

(In reply to Luke Kenneth Casson Leighton from comment #25)
> (In reply to Jacob Lifshay from comment #24)
> > Added grev to BitManipTestCase, also had to spend a bunch of time fixing
> > bugs in the pseudo-code and working around bugs in the simulator...
> 
> you've accidentally requested RC in the CSV file for this instruction,
> which inherently and automatically requests OE as well.  that's a bug
> in the CSV file, not the simulator.

no, it's a bug in the simulator that there isn't a way to request RC without OE unless you add GREV to a magic list of exceptions -- those exceptions should be translated to an OE field in the CSV (they only cover instructions without OE where the XO-form OE bit is 1; instructions like nand without OE but with the XO-form OE bit set to 0 aren't included in the list of exceptions because mis-interpreting the XO-form OE bit works even though it's not an XO-form instruction.

> > (I don't
> > want to spend the time right now to solve #765).
> 
> that should be closed as INVALID.

nope, it is a valid bug, as explained above.
> 
> > I added a log2 helper function .. I figured that's probably fine 
> 
> it isn't.  anything that's added requires a formal discussion and
> to go through an OpenPOWER ISA Working Group Request For Change (RFC)
> process.

well, because it's already in the spec pdf in v3.1 in dbcz's pseudo-code -- it's fine.

> > and OpenPower *loves* MSB0, making it extra annoying:
> > 
> > just look at this mess:
> > grevw's pseudo-code:
> >     result <- [0] * (XLEN / 2)
> >     a <- (RA)[XLEN/2:XLEN-1]
> >     b <- EXTZ64(RB)
> >     do i = 0 to XLEN / 2 - 1
> >       idx <- b[64-log2(XLEN/2):63] ^ i
> >       result[i] <- a[idx]
> >     RT <- ([0] * (XLEN / 2)) || result
> 
> yyep. looks fairly annoyingly normal. can i suggest instead
> 
>      a <- (RA)[XLEN/2:XLEN-1]
>      b <- (RB)[64-log2(XLEN/2):XLEN-1]

that doesn't work when XLEN != 64:
if XLEN = 8, then length((RB)) = 8, and the slice expands to: (RB)[62:7] which is obviously out-of-range.

Comment 27 Luke Kenneth Casson Leighton 2022-01-18 20:37:32 GMT

(In reply to Jacob Lifshay from comment #26)

> no, it's a bug in the simulator that there isn't a way to request RC without
> OE

this really is not a bug, it's a [annoying] design decision. totally 
undocumented and "emergent", but not something that is unexpected.

if there are going to be multiple Rc=1 but not OE=1 new Logical
instructions then yes i think thst justifies sorting out.

to keep compatibility i suggest adding a new RC option, called say
RCONLY which will need to be added to DecodeRC and to ISACaller,
and possibly the pipelines common_input.py and several other places.
decode_regspec_map will need inspecting too.

all pipeline, compunit and several issuer unit tests will need to
be run to make absolutely sure no damage is done.

it's not a "minor job" in other words.  if you're going to tackle it
for goodness sake don't push/commit anything without running a
massive set of unit tests.  use a branch if you like, it's justifiable

Comment 28 Luke Kenneth Casson Leighton 2022-02-13 16:53:31 GMT

some notes/thoughts: merging gror and grev plus any-other-combo i think might
be possible to do without going completely insane on the gate budget, by using
LUT2s.

    with m.If(self.chunk_sizes[i]):
        comb += lut2.eq(swaplut2)
    with m.Else():
        comb += lut2.eq(noswaplut2)
    for j in range(self.width):
        m.d.comb += step_o[j].eq(lut2(step_i[j], step_i[j ^ chunk_size], lut2)

the instruction would be of the form

    grevtern(RT, RA, steps, swaplut2[0..3], noswaplut2[0..3])

only one set of LUT2s needed per layer, and it has applications
beyond grev even when the steps is set to zero.

still needs thought

Comment 29 Luke Kenneth Casson Leighton 2022-02-13 17:50:14 GMT

QTY 64of LUT2s:

   Number of cells:                640
     $_ANDNOT_                     192
     $_AND_                         64
     $_MUX_                         64
     $_ORNOT_                      128
     $_OR_                         192

six layers of those is only 3840 cells, which is not bad.  it's certainly
one hell of a lot less than a LUT3:

   Number of cells:               2048
     $_ANDNOT_                     576
     $_AND_                        192
     $_MUX_                         64
     $_NAND_                       192
     $_NOT_                         64
     $_ORNOT_                      128
     $_OR_                         832

six layers --> 12,288 cells which given that's excluding the grev-muxes is
far too much - brings it to over 20,000 cells.

a LUT2 would be workable and cover 256 possible instructions rather than
forcing us to do grevi and grorc as separate instructions. might actually
save gates.

Comment 30 Luke Kenneth Casson Leighton 2022-02-13 17:54:19 GMT

or:

    grevtern(RT, RA, steps, swaplut2[0..3])

    with m.If(self.chunk_sizes[i]):
        for j in range(self.width):
            m.d.comb += step_o[j].eq(lut2(step_i[j], step_i[j ^ chunk_size],
                                     swaplut2)
    with m.Else():
            m.d.comb += step_o.eq((step_i)

which reduces the number of options to sane.  provides 16 instructions in one.

Comment 31 Luke Kenneth Casson Leighton 2022-02-13 18:01:02 GMT

(In reply to Luke Kenneth Casson Leighton from comment #30)

> which reduces the number of options to sane.  provides 16 instructions in
> one.

including, ha! if allowing (RA|0) it would allow, by setting swaplut2=0b1111
(or other), for certain combinations of 64-bit constants to be set without
having to use the 6-instruction ORI/ORIS/shift sequence. possible
constants:

* 0b01010101
* 0b10101010
* 0b00001111
* 0b11110000

etc. etc. simply by setting the right step and swaplut2 values.

Comment 32 Jacob Lifshay 2022-02-18 06:04:29 GMT

added grev to soc, and also added formal proof

https://git.libre-soc.org/?p=soc.git;a=commitdiff;h=66620a91bac731f6745fa33a7fe122b0aa7d3485

Comment 33 Luke Kenneth Casson Leighton 2022-02-18 16:28:42 GMT

fantastic. do you have an indication of the gate count (and LUT4 resources,
synth_ecp5)?

Comment 34 Jacob Lifshay 2022-02-18 17:01:00 GMT

(In reply to Luke Kenneth Casson Leighton from comment #33)
> fantastic. do you have an indication of the gate count (and LUT4 resources,
> synth_ecp5)?

I haven't checked, but I expect the additional resources required to be on the order of 500 2-in muxes (384 (6 layers * 64 bits) for the grev itself, the rest for misc routing in/out)

Comment 35 Luke Kenneth Casson Leighton 2022-02-18 21:28:48 GMT

(In reply to Jacob Lifshay from comment #34)
> (In reply to Luke Kenneth Casson Leighton from comment #33)
> > fantastic. do you have an indication of the gate count (and LUT4 resources,
> > synth_ecp5)?
> 
> I haven't checked, but I expect the additional resources required to be on
> the order of 500 2-in muxes (384 (6 layers * 64 bits) for the grev itself,
> the rest for misc routing in/out)

pffh, that's not bad at all. should be fine to look at a LUT2-based version
which will give 16 instructions and dozens of immediates in a 32-bit op

Comment 36 Jacob Lifshay 2022-02-18 21:40:18 GMT

(In reply to Luke Kenneth Casson Leighton from comment #35)
> (In reply to Jacob Lifshay from comment #34)
> > (In reply to Luke Kenneth Casson Leighton from comment #33)
> > > fantastic. do you have an indication of the gate count (and LUT4 resources,
> > > synth_ecp5)?
> > 
> > I haven't checked, but I expect the additional resources required to be on
> > the order of 500 2-in muxes (384 (6 layers * 64 bits) for the grev itself,
> > the rest for misc routing in/out)
> 
> pffh, that's not bad at all. should be fine to look at a LUT2-based version
> which will give 16 instructions and dozens of immediates in a 32-bit op

imho a lut2 version is enough additional latency that, if we're not worried about that, we should just merge with the shift/rotate logic, since imho it should be similar latency cost and will save a bunch of area.

Comment 37 Luke Kenneth Casson Leighton 2022-02-19 22:46:21 GMT

(In reply to Jacob Lifshay from comment #36)

> imho a lut2 version is enough additional latency that, if we're not worried
> about that, we should just merge with the shift/rotate logic, since imho it
> should be similar latency cost and will save a bunch of area.

i made it clear already that i do not want shift/rot touched at all in
any way shape or form right now.

it was complex as hell to write and took several weeks to get right due
to the options added by masks. please consider it 100% off-limits.

no further discussion on modification to shift_rot right now please.

Comment 38 Luke Kenneth Casson Leighton 2022-02-20 19:17:23 GMT

to give some idea of the significance, here: in the VERSA_ECP5 builds
i am doing right now, pipe1 of shiftrot came up on the critical path,
barely making 49.5 mhz.

the only possible way to reduce that without major redesign is to
cut the alu input stage into its own separate pipeline.

any additional gates in rotate would only force a 2 stage pipeline
and i do not consider it wise to spend time focussing on a near total
redesign of the shiftrot function unit, there are much higher priorities

Comment 39 Luke Kenneth Casson Leighton 2022-03-07 16:44:22 GMT

https://libre-soc.org/openpower/sv/grevlut2x2.jpg

a separate left-right LUT2 table gives a lot more options
both for creating constants (RA=0) as well as in more general
circumstsances.

Comment 40 Jacob Lifshay 2022-03-07 20:23:57 GMT

(In reply to Luke Kenneth Casson Leighton from comment #39)
> https://libre-soc.org/openpower/sv/grevlut2x2.jpg
> 
> a separate left-right LUT2 table gives a lot more options
> both for creating constants (RA=0) as well as in more general
> circumstsances.

in order to actually have a grev, that lut2 version would need a different lookup table for each row of the butterfly network...that's a prohibitive number of immediates (2*4*6 = 24 bits) for a lut2butterfly instruction. also, we'd still want grev as a separate instruction.

The other issue is the same as in the shift/rotate circuit: a lut2 has more delay than a mux, making grev slower than 1 clock cycle (assuming the shift/rotate barely fits in 1 cycle)

Comment 41 Luke Kenneth Casson Leighton 2022-03-07 22:44:12 GMT

(In reply to Jacob Lifshay from comment #40)
> (In reply to Luke Kenneth Casson Leighton from comment #39)
> > https://libre-soc.org/openpower/sv/grevlut2x2.jpg
> > 
> > a separate left-right LUT2 table gives a lot more options
> > both for creating constants (RA=0) as well as in more general
> > circumstsances.
> 
> in order to actually have a grev, that lut2 version would need a different
> lookup table for each row of the butterfly network...

look again at the pseudocode. the option to disable the lut2 on a
per-row basis is still there:

https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/bitmanip.mdwn;hb=HEAD#l332


> that's a prohibitive
> number of immediates (2*4*6 = 24 bits) for a lut2butterfly instruction.
> also, we'd still want grev as a separate instruction.

it becomes grevlut(immediate=0b10101010) or something like that.

and gorc becomes grevlut(immediate=0b11101110) or something like that

gxorc would be grevlut(immediate=0b01100110) or something


> The other issue is the same as in the shift/rotate circuit: a lut2 has more
> delay than a mux, making grev slower than 1 clock cycle (assuming the
> shift/rotate barely fits in 1 cycle)

i deliberately didn't use LUT3 for exactly this reason, that would
almost certainly be over budget

yes it's worth checking the longest-path in yosys after doing "synth"
and/or "synth_ecp5".

Comment 42 Jacob Lifshay 2022-05-03 09:51:13 BST

I think we should use primary opcode 9 for the new second opcode for bitmanip instructions...but only for experimentation since it's marked as reserved because stuff used to be there. primary opcode 0 should't be used since it's reserved for implementation-dependent use and the `attn` opcode.

Comment 43 Luke Kenneth Casson Leighton 2022-05-03 10:51:06 BST

(In reply to Jacob Lifshay from comment #42)
> I think we should use primary opcode 9 for the new second opcode for
> bitmanip instructions...but only for experimentation since it's marked as
> reserved because stuff used to be there. primary opcode 0 should't be used
> since it's reserved for implementation-dependent use and the `attn` opcode.

the official sandbox is EXT022.  the only reason for moving ternlogi
and grevlogi to a separate opcode is because they're completely nuts
on bit-allocation, and were putting design pressure on the bitmanip space
which is pretty heavily loaded.

ternlogi has to move from EXT05 if grevlogi is to be on EXT09, they're
allocated as a pair.

Comment 44 Jacob Lifshay 2022-05-03 10:59:00 BST

(In reply to Luke Kenneth Casson Leighton from comment #43)
> (In reply to Jacob Lifshay from comment #42)
> > I think we should use primary opcode 9 for the new second opcode for
> > bitmanip instructions...but only for experimentation since it's marked as
> > reserved because stuff used to be there. primary opcode 0 should't be used
> > since it's reserved for implementation-dependent use and the `attn` opcode.
> 
> the official sandbox is EXT022.  the only reason for moving ternlogi
> and grevlogi to a separate opcode is because they're completely nuts
> on bit-allocation, and were putting design pressure on the bitmanip space
> which is pretty heavily loaded.
> 
> ternlogi has to move from EXT05 if grevlogi is to be on EXT09, they're
> allocated as a pair.

nope, grevlogi would still be in opcode 5. everything else (that're conveniently not implemented or not completed (grev[w][i][.]) yet) will move to opcode 9.

opcode 22 is currently used by setvl and friends and iirc you didn't want to move them until you're back in working-on-SV mode, so I picked other opcodes meanwhile.

Comment 45 Luke Kenneth Casson Leighton 2022-05-03 11:14:07 BST

(In reply to Jacob Lifshay from comment #44)

> opcode 22 is currently used by setvl and friends and iirc you didn't want to
> move them until you're back in working-on-SV mode, so I picked other opcodes
> meanwhile.

there should - unless i made a mistake - be room for setvl in opcode 22.

Comment 46 Luke Kenneth Casson Leighton 2022-05-03 14:46:38 BST

(In reply to Luke Kenneth Casson Leighton from comment #45)
> (In reply to Jacob Lifshay from comment #44)
> 
> > opcode 22 is currently used by setvl and friends and iirc you didn't want to
> > move them until you're back in working-on-SV mode, so I picked other opcodes
> > meanwhile.
> 
> there should - unless i made a mistake - be room for setvl in opcode 22.

arse, i made a mistake.
https://libre-soc.org/openpower/sv/bitmanip/

ok (not matching what's in sv/trans/svp64.py but correctable)
NN						--11 110	Rc	setvl

not ok because there's not enough space.  or, there might be, i have
to double-check the encoding.
NN	RA	RB	RC		10	--10 110	Rc	rsvd

Comment 47 Luke Kenneth Casson Leighton 2022-05-03 15:26:54 BST

(In reply to Jacob Lifshay from comment #44)

> opcode 22 is currently used by setvl and friends and iirc you didn't want to
> move them until you're back in working-on-SV mode, so I picked other opcodes
> meanwhile.

nuts.  there's not enough space, and i noted a mistake in the layout.
https://libre-soc.org/openpower/sv/bitmanip/

these overlap:

NN	RT	RA	RB	1	itype	0101 110	Rc	xperm
NN	RA	RB	RC	0	itype	0101 110	Rc	av minmax
NN	RA	RB	RC	1	00	0101 110	Rc	av abss
NN	RA	RB	RC	1	01	0101 110	Rc	av absu
NN	RA	RB		1	10	0101 110	Rc	av avgadd

those last 3 have to move.

yep, EXT09 it is, EXT22 remains for setvl/svshape/svremap/svstep -
i'll sort out the bitmanip opcode allocation table.

Comment 48 Luke Kenneth Casson Leighton 2022-05-03 15:54:56 BST

hang on, nope, i got it, i spotted extra encoding space

NN	RS	RA	RB	RC	11 011		rsvd
NN	RA	RB			1001 110	Rc	rsvd
NN	RA	RB	RC		--10 110	Rc	rsvd
NN					--11 110	Rc	setvl

those will do for setvl, svstep, svshape and svremap

i will work out the others

Comment 49 Jacob Lifshay 2022-05-17 00:29:50 BST

lkcl asked me to calculate gate counts for grevlut:
just counting the grev network and luts:

latency of a single mux-lut combination:
1 mux through data input and 2 muxes through select input.

gate count of a single mux-lut combination:
4 muxes.


latency of full network:
6 muxes through data input and 12 muxes through select input plus wire delay -- will almost certainly need 2 clock cycles. this is 3x a grev/gorc combined network. also 3x a rotator network.

gate count of full network:
4 * 6 * 64 = 1536 muxes

a grev/gorc combined network can be made by taking a grev network and writing it in terms of the 4-input and-or-invert cells that the muxes would likely be anyway, converting every other layer to or-and-invert, and then enabling both and-gate inputs simultaneously instead of having them be S and ~S.

I'll make a diagram.

Comment 50 Luke Kenneth Casson Leighton 2022-05-17 00:52:07 BST

(In reply to Jacob Lifshay from comment #49)

> latency of full network:
> 6 muxes through data input and 12 muxes through select input plus wire delay
> -- will almost certainly need 2 clock cycles.

perfectly fine. i've made almost everything min 3 stage ALUs because
of combinatorial paths in MultiCompALU.

> this is 3x a grev/gorc
> combined network. also 3x a rotator network.
> 
> gate count of full network:
> 4 * 6 * 64 = 1536 muxes

call it 5 gates per mux? 8000 gates. less than a multiplier by 30%.
for 256 instructions (which is what 2x 4-entry luts gives)
that's pretty damn good.  

> a grev/gorc combined network can be made by taking a grev network and
> writing it in terms of the 4-input and-or-invert cells that the muxes would
> likely be anyway, converting every other layer to or-and-invert, and then
> enabling both and-gate inputs simultaneously instead of having them be S and
> ~S.

which covers only 2 out of the possible 256 instructions of grevlut.
actually 2 out of 512 because i added an invert-input bit as well.

for less than 0.5% of instructions it's not worth the effort and
grev/gorc is nowhere near capable of generating the types of
immediates that grevluti can.

grev and gorc i don't think are worth spending any more time on.
frees up 8x X-Form instructions by removing them.

Comment 51 Jacob Lifshay 2022-05-17 01:03:43 BST

(In reply to Luke Kenneth Casson Leighton from comment #50)
> (In reply to Jacob Lifshay from comment #49)
> 
> > latency of full network:
> > 6 muxes through data input and 12 muxes through select input plus wire delay
> > -- will almost certainly need 2 clock cycles.
> 
> perfectly fine. i've made almost everything min 3 stage ALUs because
> of combinatorial paths in MultiCompALU.

imho having everything be 3-stage ALUs is fine for a simple processor, but it's terrible if you actually want high performance which is what we're aiming for.

The latency (assuming a mux is an and-or-invert gate with 1.5 gate latency and an inverter on select with 1 gate latency) is 6 * 1.5 + 12 * 2.5 = 39 gates!!! this is unacceptably slow for a grev imho.


> 
> > this is 3x a grev/gorc
> > combined network. also 3x a rotator network.
> > 
> > gate count of full network:
> > 4 * 6 * 64 = 1536 muxes
> 
> call it 5 gates per mux? 8000 gates. less than a multiplier by 30%.
> for 256 instructions (which is what 2x 4-entry luts gives)
> that's pretty damn good.

gate count isn't an issue. latency is.
> 
> > a grev/gorc combined network can be made by taking a grev network and
> > writing it in terms of the 4-input and-or-invert cells that the muxes would
> > likely be anyway, converting every other layer to or-and-invert, and then
> > enabling both and-gate inputs simultaneously instead of having them be S and
> > ~S.
> 
> which covers only 2 out of the possible 256 instructions of grevlut.

it actually covers more than that, because that's just how you can make a faster circuit (probably what the riscv designers were thinking of when they added gorc) that implements both grev and gorc and other instructions too. imho we should have an instruction based on that circuit instead of grevlut with its 3x latency -- it would still have the control signals based off of the immediate like grevlut, but would be single-cycle even on a 5ghz cpu, unlike grevlut.

> actually 2 out of 512 because i added an invert-input bit as well.

adding a layer of xor gates is trivially easy to add to the grev/gorc circuit and doesn't add much latency -- imho having invert or not shouldn't be a reason to choose grevlut over the grev/gorc/etc. circuit.

Comment 52 Jacob Lifshay 2022-05-17 03:45:07 BST

The grev/gorc/etc. circuit (1/3 latency of grevlut and nearly just as flexible):
https://libre-soc.org/openpower/sv/bitmanip/grev_gorc_design/

Comment 53 Luke Kenneth Casson Leighton 2022-05-17 11:46:12 BST

(In reply to Jacob Lifshay from comment #52)
> The grev/gorc/etc. circuit (1/3 latency of grevlut and nearly just as
> flexible):
> https://libre-soc.org/openpower/sv/bitmanip/grev_gorc_design/

oo! oo! i'm so excited: a small modification to put in a *pair* of
LUTs (one for left, one for right) and i think we have the exact
same functionality.... perhaps not exactly the same.  i'll try
to code it up in a 2nd version of grevlut.py

Comment 54 Luke Kenneth Casson Leighton 2022-05-17 12:42:37 BST

nope, it's not the same. drat.  what it's missing is the inversion (XOR-effect)
of the input bits within each row.  that's what makes grevlut special.

here's what would work (roughly):

https://ftp.libre-soc.org/2022-05-17_12-17.png

* double the size of the AND-OR-INV (AOI)
* bring in an *inversion* of the A[]-straight and A[]-cross-over bits per row
* put the *inverted* copy of A[]-straight and A[]-crossover
  through the *second* set of AOIs
* arrange for sh[] to be a pass-thru which puts A[]-straight untouched
  into the output.

(this is the bit i got wrong in the diagram, if 4 AND gates are used then
 the output is always zero.  what's needed is that the imm is converted
 to be... 0b0001 which indicates "please put A[] through untouched". i think)

it would, i believe, in pseudocode, be this:

    AOI_inputs = A[i] | A[i^cross]<<1 | ~A[i]<<2 | ~A[i^cross]<<3
    AOI = (AOI_inputs & (sh[j] ? imm : 0b0001)) != 0b0000
    output[i] = AOI

it introduces one single NOT-inverter on each layer, it combines the
MUXes into one...

at each point 6x64 there are:

* 2x NOTs              - 2 gates
* 4x ANDs              - 4 gates
* 1x 4-input OR-gate   - 4? gates?

which is only 10 gates total. let's say i got it wrong and it's 20 gates.
20 * 6 * 64 = 7680 gates which is still damn good.

what about the gate-delay chain:

* gate-delay-chain is 3 gates (NOT, AND, 4-input-OR)
* that's per layer, 6 layers
* total 18 gates instead of 12 gates.

Comment 55 Luke Kenneth Casson Leighton 2022-05-17 12:58:38 BST

https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=527efabfbe08f5ce2ea81b4bf26d8b194eb109b5

gives the principle: it's nothing like the same as the MUX version
of grevlut which has me concerned.

Comment 56 Luke Kenneth Casson Leighton 2022-05-17 15:30:17 BST

ok so the key difference as to why i designed grevlut the way it is,
is to bring in *two* bits (A[i] and A[i^crossover]) into the binary LUT.

the cascade effect of the application of multiple LUTs results in
e.g. ANDing or NANDing of multiple bits, which is how the magic
constants 0x11111111 0x10101010 0x10001000 0x10000000 emerge,
and it turns out that 0x77777777 0x70707070 etc and many more
can be created as well.

the AOI trick *cannot do that* because the bits A[i] and A[i^crossover]
are treated as *independent*

thus unless otherwise radically altered the AOI trick is not going to
be useful.


one way to reduce the number of MUXes on the grevlut path
is to substitute "identity" immediates:

    imm_left  = sh[j] ? imm[0:3] : 0b0101
    imm_right = sh[j] ? imm[4:7] : 0b1100

i think that does the trick... it'll be pretty close.  this knocks out
one MUX per layer in the shortest critical path.

Comment 57 Jacob Lifshay 2022-05-18 06:35:04 BST

(In reply to Luke Kenneth Casson Leighton from comment #56)
> ok so the key difference as to why i designed grevlut the way it is,
> is to bring in *two* bits (A[i] and A[i^crossover]) into the binary LUT.
> 
> the cascade effect of the application of multiple LUTs results in
> e.g. ANDing or NANDing of multiple bits, which is how the magic
> constants 0x11111111 0x10101010 0x10001000 0x10000000 emerge,
> and it turns out that 0x77777777 0x70707070 etc and many more
> can be created as well.

neat!

> the AOI trick *cannot do that* because the bits A[i] and A[i^crossover]
> are treated as *independent*

that's not true, see:
https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=26558305c986c30f6f786e77769c566c6104f085

This has the exact same aoi matrix that I proposed, just the inputs from sh are changed. it still has latency 6x aoi gates plus misc stuff on sh and input/output xor layers.

it *can* generate 0x1000100010001 because the aoi matrix has the input/outputs invertable, giving you and-ing via de-morgan inversion of the or-gates inside the aoi matrix.

I added in the thing where the aoi gates taking input from the left have different imm bits than the input from the right in the grev butterfly wires. (sorry, I can't think of a good way to describe it textually.)

Comment 58 Luke Kenneth Casson Leighton 2022-05-18 12:02:38 BST

(In reply to Jacob Lifshay from comment #57)

> > and it turns out that 0x77777777 0x70707070 etc and many more
> > can be created as well.
>
> neat!

yeah, unexpected. it sort-of happened, but is kinda important to justify
adding such an expensive operation (1/3 of a Major opcode) especially
as "just" a bitmanip operation.

> > the AOI trick *cannot do that* because the bits A[i] and A[i^crossover]
> > are treated as *independent*
>
> that's not true, see:
> https://git.libre-soc.org/?p=libreriscv.git;a=commitdiff;h=26558305c986c30f6f786e77769c566c6104f085

ah great, i must have got the implementation wrong 

ok that's good - it's still not getting as diverse a range of constants.
* imm 0b1101110 will give f0f0f0f0... c0c0c0c0 80808080 88008800
  808000008080000 880000
* imm 0b100110 will give one single bit set, based on sh (which is nice)
  but i see a 2nd imm doing exactly the same thing
* imm 0b100111 is quite interesting, gives 20202020, 2222000022222

but there's no way to create 0xbbbb, 0x7777, 0x6666 etc which grevlut
somehow manages to do.

also there are a hell of a lot of zeros, 0xfffffffs, and so on, all wasted,
which tells me it's not "mixing" enough.

> This has the exact same aoi matrix that I proposed, just the inputs from sh are
> changed. it still has latency 6x aoi gates plus misc stuff on sh and
> input/output xor layers.
>
> it *can* generate 0x1000100010001 because the aoi matrix has the input/outputs
> invertable, giving you and-ing via de-morgan inversion of the or-gates inside
> the aoi matrix.
>
> I added in the thing where the aoi gates taking input from the left have
> different imm bits than the input from the right in the grev butterfly wires.
> (sorry, I can't think of a good way to describe it textually.)

yeah we're in new territory here. i used left and right as well,
i know what you mean.

Comment 59 Luke Kenneth Casson Leighton 2022-05-20 23:05:49 BST

as if grevlut was not fun enough already i figured a 3-in register variant
would allow multiple left-right 4+4 LUTs in RC, one for each row
in the swapping-lookup-crossover-mixing-thing.

i can't explain why, it just felt right, to have the means to
flip individual rows, i think this will turn out to be a really
powerful instruction.

i did however just realise that the sh parameter is totally redundant
because if you can provide individual LUTs per row then it is a straightforward
matter to set an 8-bit setting within RC for that row which does absolutely
nothing.  hilariously a grevluti might help with setting up RC with the
appropriate bits, hmm.  some of them.

thoughts appreciated.

Comment 60 Luke Kenneth Casson Leighton 2022-05-20 23:54:44 BST

the "identity" pattern which makes A (left side) go straight through
is 0b1010, and B side is 0b1100. i think.  which would be an 8bit
of 0b11001010 or 0xca.

honestly it would be easier (a lot easier) to arrange the inputs to
the LUT pairs if that was 0xaa! "identity" (no change at all) would
just be 0xaaaaaaaaaaaaaaa (6x 8-bit 0xaa)

which can be arranged by swapping a and b on the right side.  all needs
checking.

Comment 61 Luke Kenneth Casson Leighton 2022-06-14 17:37:59 BST

annoying as it is, let's consider this done but qualified as R&D "lessons".
there is not enough EXT022 opcode space for so many instructions, when grevlut
and grevluti are so compelling. i have not yet been even able to fit *any*
of the GFP or GF2^N operations in, yet. i got clmul in, and gf2, none of
the 4-operand GFs. adding grevlut needs to be under a separate bugreport
with its own budget.

Comment 62 Jacob Lifshay 2023-09-07 03:25:17 BST

(In reply to Luke Kenneth Casson Leighton from comment #61)
> annoying as it is, let's consider this done but qualified as R&D "lessons".

ok, sounds good.

this totals to 557 lines so (using the method from bug #1025) that comes out to EUR 1500 

measuring as of:
https://git.libre-soc.org/?p=openpower-isa.git;a=commit;h=b55f37ee14b3253087f0a743eb795b53e59f25d2

https://git.libre-soc.org/?p=nmutil.git;a=commit;h=7a3ab7ecce390b2003ff61befca59cd4ad6b7428

measuring nmutil.git:
src/nmutil/grev.py  270 lines
src/nmutil/test/test_grev.py  86 lines

measuring openpower-isa.git:
openpower/isa/bitmanip.mdwn  about 80 lines
openpower/isatables/minor_5.csv  4 lines
src/openpower/decoder/isa/caller.py  about 4 lines
src/openpower/decoder/power_decoder2.py  1 line
src/openpower/decoder/power_enums.py  4 lines
src/openpower/sv/trans/svp64.py  about 32 lines
src/openpower/test/bitmanip/bitmanip_cases.py  about 50 lines
src/openpower/decoder/helpers.py  about 10 lines  (log2 needed by grev tests)
openpower/isatables/fields.text  about 12 lines  (add XB form for grev)
src/openpower/decoder/formal/proof_decoder2.py  2 lines  (add XBI for grev)
src/openpower/decoder/power_decoder2.py  2 lines  (add XBI for grev)

Comment 63 Luke Kenneth Casson Leighton 2023-09-08 06:52:55 BST

(In reply to Jacob Lifshay from comment #62)

> src/openpower/decoder/power_decoder2.py  2 lines  (add XBI for grev)

you also now need to remove it.

i did say that it is not going in.

please remove all of these unauthorized instructions from
the CSV files, pipelines and enums (OP_GREV).

unfortunately you went too far down the implementation path.
budget remains the same, the work itself and the lesson learned
was useful. leave in nmutil as it is an independent and
valuable library.

Comment 64 Jacob Lifshay 2023-09-08 07:00:25 BST

(In reply to Luke Kenneth Casson Leighton from comment #63)
> please remove all of these unauthorized instructions from
> the CSV files, pipelines and enums (OP_GREV).

I think "unauthorized" is misleading, at the time, we all agreed to add them, you included.

I'll remove them later, except I think we should leave the unit tests, since they'll be useful for grevlut when that gets implemented.

Comment 65 Jacob Lifshay 2023-09-12 02:22:43 BST

(In reply to Jacob Lifshay from comment #64)
> I'll remove them later, except I think we should leave the unit tests, since
> they'll be useful for grevlut when that gets implemented.

I removed grev from openpower-isa.git and soc.git, leaving the tests for future use with grevlut.

I also fixed soc.git's CI, so we can actually test that it wasn't broken by removing grev.

https://git.libre-soc.org/?p=openpower-isa.git;a=shortlog;h=8fc152e6e1710bfe32e2eea9dee0ec1f415ce719

https://git.libre-soc.org/?p=soc.git;a=shortlog;h=5aa0ab1da7681952e5e6424800c0f80b84826695