794 – UTF8 validation

Summary: UTF8 validation

Status:	RESOLVED FIXED

Alias:	None

Product:	Libre-SOC's first SoC
Classification:	Unclassified
Component:	Specification
Version:	unspecified
Hardware:	Other Linux

Importance:	--- enhancement
Assignee:	Jacob Lifshay

URL:

Depends on:	922
Blocks:	213

Reported:	2022-03-30 11:16 BST by Luke Kenneth Casson Leighton
Modified:	2022-09-25 17:07 BST
CC List:	2 users

See Also:	910 911 254 922
NLnet milestone:	NLNet.2019.10.042.Vulkan
total budget (EUR) for completion of task and all subtasks:	2500
budget (EUR) for this task, excluding subtasks' budget:	2500
parent task for budget allocation:	254
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
lkcl = { amount = 500, submitted = 2022-09-15, paid = 2022-09-16 }
[jacob]
amount = 2000
submitted = 2022-09-16
paid = 2022-09-24


Description Luke Kenneth Casson Leighton 2022-03-30 11:16:29 BST

Algorithms we want to demo:
* DONE: UTF-8 validation
  https://git.libre-soc.org/?p=openpower-isa.git;a=commit;h=7217fe80d54a5dab33566e6d8fff949b84ce433e

Links:
https://www.json.org/JSON_checker/utf8_decode.c

https://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html

Comment 1 Jacob Lifshay 2022-03-30 15:17:05 BST

additional useful links:
converting utf-8 <-> utf-16 (useful for JS and Java)
https://web.archive.org/web/20210625032530/https://researcher.watson.ibm.com/researcher/files/jp-INOUEHRS/IPSJPRO2008_SIMDdecoding.pdf

validating UTF-8 (useful for JSON decoding and many many other things)
https://github.com/rusticstuff/simdutf8

it's very common to only care if you have correct utf-8 and where the first error is rather than needing to decode the unicode codepoints -- the unicode codepoints aren't that much more useful than the bytes for many purposes -- parsing (e.g. JSON) is nearly always faster on just the utf-8 bytes rather than having to decode to utf-32 first.

Comment 2 Jacob Lifshay 2022-03-30 15:25:17 BST

(In reply to Jacob Lifshay from comment #1)
> additional useful links:

also:
https://github.com/simd-lite/simd-json

Comment 3 Luke Kenneth Casson Leighton 2022-03-30 15:27:01 BST

ironically whilst everyone else is desperately trying to smash
their heads against a SIMD wall we need to track down simple
scalar versions of algorithms because the insistence "But SIMD
Makes It Fast" makes it astoundingly difficult to comprehend.

it took 3 weeks for example to track down easy-to-read DCT
source code.

assessing this one therefore needs a similar (the usual) strategy:

1) work out the hotspots (good to hear UTF8 to UTF16 is common
   for example)
2) find *readable* non-assembler non-optimised non-parallelised
   reference implementations.

Comment 4 Luke Kenneth Casson Leighton 2022-03-30 15:35:04 BST

(In reply to Jacob Lifshay from comment #1)
> additional useful links:
> converting utf-8 <-> utf-16 (useful for JS and Java)
> https://web.archive.org/web/20210625032530/https://researcher.watson.ibm.com/
> researcher/files/jp-INOUEHRS/IPSJPRO2008_SIMDdecoding.pdf

this is useful to know about only insofar as it shows that utf-8 <-> utf-16
conversion is desirable (which is great). trying to understand what on earth
the SIMD assembly is doing, not so much.

> validating UTF-8 (useful for JSON decoding and many many other things)
> https://github.com/rusticstuff/simdutf8

this is about as bad as it gets:
https://github.com/rusticstuff/simdutf8/blob/main/src/implementation/x86/avx2.rs

hopelessly unreadable, the level of "optimisations" is so deeply embedded
within that code that it is worse than useless!

with SVP64 being based in the abstract on "Multi-Issue parallelisation
of Scalar operations by dropping hardware for-loops around them", time and
time again it has been shown that progress is made by starting from a
*scalar* proof-of-concept, never from someone's heavily optimised SIMD
Assembler.

Comment 5 Luke Kenneth Casson Leighton 2022-03-30 15:42:24 BST

(In reply to Jacob Lifshay from comment #2)

> also:
> https://github.com/simd-lite/simd-json

https://github.com/simd-lite/simd-json/blob/main/src/neon/deser.rs

i can't even begin to comprehend what that is doing :)

whereas the c code from comment #0 is both dead simple,
well documented, and serial in nature.

it is the serial nature which makes mapping it straight
to SVP64 so easy, the comments are a bonus.

pretty ironic, huh? you'd think "oh yeah it's fast with NEON
therefore it MUST contain useful inspiration", right?  turns
out this instinct is dead wrong, every single time. sigh.

Comment 6 Jacob Lifshay 2022-03-30 16:16:30 BST

additional links:
(WTF-8 is UTF-8 but modified to also represent unpaired surrogates, like in ill-formed UTF-16. this is useful for Windows File Names, Java/JS Strings, etc.)
https://simonsapin.github.io/wtf-8/

https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf
Table 3-7 (modified to put a star next to where the original used bold text)
Well-Formed UTF-8 Byte Sequences
Code Points        First Byte Second Byte Third Byte Fourth Byte
U+0000..U+007F     00..7F
U+0080..U+07FF     C2..DF     80..BF
U+0800..U+0FFF     E0         *A0..BF     80..BF
U+1000..U+CFFF     E1..EC     80..BF      80..BF
U+D000..U+D7FF     ED         80..*9F     80..BF
U+E000..U+FFFF     EE..EF     80..BF      80..BF
U+10000..U+3FFFF   F0         *90..BF     80..BF     80..BF
U+40000..U+FFFFF   F1..F3     80..BF      80..BF     80..BF
U+100000..U+10FFFF F4         80..*8F     80..BF     80..BF
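
Table 3-7 translates almost line for line into a scalar, table-driven check. The sketch below is illustrative only (it is not the openpower-isa code); `TABLE_3_7` and `first_error` are hypothetical names. It reports the offset of the first ill-formed sequence, which answers the "is it valid, and where is the first error" question from comment #1 without producing any code points.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]

# Table 3-7 rows: (first-byte range, ranges allowed for each following byte)
TABLE_3_7: List[Tuple[Range, List[Range]]] = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEC), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xED, 0xED), [(0x80, 0x9F), (0x80, 0xBF)]),
    ((0xEE, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),
]


def first_error(data: bytes) -> Optional[int]:
    """Offset of the first ill-formed sequence, or None if data is valid."""
    i = 0
    while i < len(data):
        for (lo, hi), rest in TABLE_3_7:
            if lo <= data[i] <= hi:
                break
        else:
            return i  # no row starts with this byte (80..C1, F5..FF)
        for k, (lo, hi) in enumerate(rest, start=1):
            if i + k >= len(data) or not lo <= data[i + k] <= hi:
                return i  # truncated, or a following byte is out of range
        i += 1 + len(rest)
    return None


print(first_error("aé€😀".encode("utf-8")))  # None
print(first_error(b"ab\xed\xa0\x80"))        # 2 (a surrogate, U+D800)
```

Because every row pins down the exact range of each following byte, overlong forms (C0, C1, E0 80..9F, F0 80..8F), surrogates (ED A0..BF) and values above U+10FFFF (F4 90..BF, F5..FF) are all rejected without any separate check.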

Comment 7 Jacob Lifshay 2022-03-30 16:22:19 BST

(In reply to Luke Kenneth Casson Leighton from comment #4)
> this is about as bad as it gets:
> https://github.com/rusticstuff/simdutf8/blob/main/src/implementation/x86/
> avx2.rs
> 
> hopelessly unreadable,

that's cuz you're reading the isa abstraction layer, not the core algorithm. the algorithm is here:
https://github.com/rusticstuff/simdutf8/blob/main/src/implementation/algorithm.rs

Comment 8 Jacob Lifshay 2022-03-30 16:32:47 BST

(In reply to Jacob Lifshay from comment #7)
> that's cuz you're reading the isa abstraction layer, not the core algorithm.
> the algorithm is here:
> https://github.com/rusticstuff/simdutf8/blob/main/src/implementation/
> algorithm.rs

the papers describing the algorithms:

https://github.com/simdjson/simdjson#about-simdjson
links from above link:
* enjoy reading our paper
https://arxiv.org/abs/1902.08318
* Parsing Gigabytes of JSON per Second
https://arxiv.org/abs/1902.08318
* Validating UTF-8 In Less Than One Instruction Per Byte
https://arxiv.org/abs/2010.03090
* blog post providing some background and context
https://branchfree.org/2019/02/25/paper-parsing-gigabytes-of-json-per-second/
* simdjson at QCon San Francisco 2019
http://www.youtube.com/watch?v=wlvKAT7SZIQ

Comment 9 Luke Kenneth Casson Leighton 2022-03-30 17:25:21 BST

still unintelligible at an algorithmic level due to this:

                      idx += SIMD_CHUNK_SIZE

no explanations at all:

                let byte_1_low = prev1.and(SimdU8Value::splat(0x0F)).lookup_16(
                    CARRY | OVERLONG_3 | OVERLONG_2 | OVERLONG_4,
                    CARRY | OVERLONG_2,

and there's zero code comments.

the very attempt to include lookup tables and to perform SIMD-ification
is precisely what makes this code 100% hostile.  the lack of comments just
buries what is already dead another 6ft under :)

i have some ideas floating around but until appropriate *scalar* non-optimised
*simple* implementations are found i cannot nail those ideas down.

i expect finding such implementations to be just as hard as for DCT because
"why would you bother, like, y'know, that's so slow ya wasting time, man"

i need to understand the *principle* behind utf8, and when doing REMAP
it needs the *REMAP* system to perform the looping, not the "concept
called packed SIMD where you throw SIMD_CHUNK_SIZEs of data at a wall
and hope for the best".

anything that uses Packed SIMD catastrophically interferes with REMAP, and
with Data-Dependent FailFirst and Predicate-Result Modes.

just like how strncpy in RVV with fail-first LDST is only 13 assembler
instructions but when using Power ISA Packed SIMD it requires 240.

Comment 10 Luke Kenneth Casson Leighton 2022-03-30 17:34:10 BST

(In reply to Jacob Lifshay from comment #8)

> * blog post providing some background and context
> https://branchfree.org/2019/02/25/paper-parsing-gigabytes-of-json-per-second/

ok, that's about parsing JSON, not about parsing UTF8.
although, parsing of {Insert Graph-based Data Format}
is part of what Extra-V was designed for.

Comment 11 Luke Kenneth Casson Leighton 2022-04-01 12:18:23 BST

there is a hardware design concept i would like to consider here, it is
an advancement of the Eth Zurich Snitch core
https://arxiv.org/pdf/2002.10143

specifically the idea of putting an intercept into register usage which
instead connects to a synchronous FIFO.

reading or writing the FIFO would be wired to an advancement of svstep.

if that is also connected to Memory LDST just like in Snitch, and also to
Data Dependent failfirst and REMAP, then there is the possibility to cover
strange algorithms like UTF8 and JSON parsing.

i had a think, i see the value of identifying starting points and end points,
creating a DOM from a sequential stream, that is BIG.

could even be used for Message Passing between processors or processes.
must look at design of OpenCAPI properly.

Comment 12 Luke Kenneth Casson Leighton 2022-08-22 19:48:37 BST

found one that is obvious and simple to understand.

https://codereview.stackexchange.com/questions/159814/utf-8-validation/159832#159832

class Solution(object):
    def validUtf8(self, data):
        """
        Check that a sequence of byte values follows the UTF-8 encoding
        rules.  Does not check for canonicalization (i.e. overlong encodings
        are acceptable).

        >>> s = Solution()
        >>> s.validUtf8([197, 130, 1])
        True
        >>> s.validUtf8([235, 140, 4])
        False
        """
        data = iter(data)
        for leading_byte in data:
            leading_ones = self._count_leading_ones(leading_byte)
            if leading_ones in [1, 7, 8]:
                return False        # Illegal leading byte
            for _ in range(leading_ones - 1):
                trailing_byte = next(data, None)
                if trailing_byte is None or trailing_byte >> 6 != 0b10:
                    return False    # Missing or illegal trailing byte
        return True

    @staticmethod
    def _count_leading_ones(byte):
        for i in range(8):
            if byte >> (7 - i) == 0b11111111 >> (7 - i) & ~1:
                return i
        return 8


*now* it is obvious that validation starts by counting the number of
leading 1s in the first byte, then checking that the top 2 bits of each
of the following bytes in that sequence are 0b10.

this simplicity is utterly destroyed by efforts made by optimised SIMD.
attempting to even understand the validation algorithm from looking at
optimised SIMD is not only wasting time it risks making mistakes.

SimpleV is such a different paradigm we literally have to go back to
scalar unoptimised implementations.

this algorithm is quite fascinating: one byte contains a count of the
number of following bytes (leading 1s minus one) that need to be checked
for a match with 0b10------.  needs some thought.
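
As a side note on the counting step: counting leading ones is the same as counting leading zeros of the inverted byte, so a count-leading-zeros primitive is enough. A minimal Python sketch; `clz8`, `clo8` and `lead_counts` are hypothetical helper names:

```python
def clz8(x: int) -> int:
    """Count leading zeros of an 8-bit value."""
    return 8 - (x & 0xFF).bit_length()


def clo8(b: int) -> int:
    """Count leading ones of a byte: the leading zeros of its complement."""
    return clz8(~b & 0xFF)


def lead_counts(data: bytes) -> list:
    """One count per byte: 0 = ASCII, 1 = continuation (0b10xxxxxx),
    2..4 = first byte of a 2..4-byte sequence; 5..8 never occur."""
    return [clo8(b) for b in data]


print(lead_counts("aé€😀".encode("utf-8")))
# [0, 2, 1, 3, 1, 1, 4, 1, 1, 1]
```

The count alone is not sufficient: C0, C1 and F5..F7 have in-range counts but never occur in well-formed UTF-8, which is where the range checks from comment #6 come in.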

Comment 13 Jacob Lifshay 2022-08-22 20:21:20 BST

(In reply to Luke Kenneth Casson Leighton from comment #12)
>         Check that a sequence of byte values follows the UTF-8 encoding
>         rules.  Does not check for canonicalization (i.e. overlong encodings
>         are acceptable).

canonicalization and surrogate encodings need to be checked, otherwise you can have security flaws such as smuggling '/' characters through an http server by encoding them as 0xC0 0xAF rather than 0x2F, which then allows you to access stuff outside the /var/www/html directory, e.g. by accessing
https://example.com/..%C0%AF..%C0%AF..%C0%AFetc/passwd
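
To make the attack concrete: a decoder that only checks bit patterns (as the comment #12 code explicitly does) turns 0xC0 0xAF into '/'. The `lax_decode` function below is a hypothetical, deliberately broken decoder written only for this demonstration; the comparison uses Python's built-in strict UTF-8 codec, which rejects overlong forms.

```python
from urllib.parse import unquote_to_bytes, urlsplit


def lax_decode(data: bytes) -> str:
    """Deliberately broken decoder: accepts 0xxxxxxx and 110xxxxx 10xxxxxx
    by bit pattern alone and never rejects overlong forms."""
    out, i = [], 0
    while i < len(data):
        b0 = data[i]
        if b0 < 0x80:
            out.append(chr(b0))
            i += 1
        elif b0 >> 5 == 0b110 and i + 1 < len(data) and data[i + 1] >> 6 == 0b10:
            out.append(chr(((b0 & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError(f"unsupported byte at offset {i}")
    return "".join(out)


path = urlsplit("https://example.com/..%C0%AF..%C0%AF..%C0%AFetc/passwd").path
raw = unquote_to_bytes(path)
print(lax_decode(raw))  # /../../../etc/passwd

try:
    raw.decode("utf-8")  # strict codec: 0xC0 can never start a sequence
except UnicodeDecodeError as e:
    print("rejected at offset", e.start)  # rejected at offset 3
```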

Comment 14 Luke Kenneth Casson Leighton 2022-08-22 21:47:36 BST

ok i think i have a strategy.  firstly, note that leading 1s == QTY(1)
is the pattern 0b10------, which is also the "invalid" pattern for a first byte.

secondly, there is a cntleadones scalar instruction in v3.1.
a vector of the 1s_count can be created.

thirdly, some sv.cmpi on that tells us where utf8 starts and ends,
where the end points may go into the next instruction as a mask

fourthly, a sv.addi/satu/ew=8/m=eq *RT,*RA,0xff where RT is 1 greater than RA
will perform a cascading non-rollover subtract of 1 from each element.
anything that started as a count of 2 3 4 5 or 6 will count down
*overwriting* the next register, but due to unsigned saturate it will
not wrap back to 0xff 0xfe etc.

furthermore due to the predicate mask the cascade *only* starts and
continues from non-terminating points.  it may be necessary to shift
the mask down by one as you want the subtract-cascade to stop at
the character *before* the beginning of the next utf8 sequence.

if there are only zeros at these last characters, then the expected
length is equal to the observed length.
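
A scalar Python model of the saturating countdown may help pin the idea down. It is a sketch under stated assumptions, not SVP64 code: elements are processed strictly in order (so each overwrite feeds the next element, which is what the cascade relies on), the predicate is "this byte is a continuation byte", a virtual ASCII element is prepended, and counts above 4 are rejected per RFC 3629 (the 5- and 6-byte forms mentioned above are no longer valid UTF-8). In this model a completed sequence counts down to 1 at its final byte rather than to 0; the exact end convention depends on how the mask is shifted. Overlong and surrogate checks are not performed.

```python
def clo8(b: int) -> int:
    """Count leading ones of a byte (0..8)."""
    return 8 - (~b & 0xFF).bit_length()


def countdown_validate(data: bytes) -> bool:
    """Structure-only check modelled on the saturating countdown.
    Overlong forms and surrogates are NOT rejected."""
    # element 0 is a virtual ASCII byte: "no sequence currently open"
    counts = [0] + [clo8(b) for b in data]
    if any(c > 4 for c in counts):
        return False  # RFC 3629 limits sequences to 4 bytes
    vec = list(counts)
    # the cascade: processed strictly in element order, every continuation
    # byte (count == 1, the predicate) is overwritten with the previous
    # element minus one, saturating at 0 instead of wrapping to 0xff
    for i in range(1, len(vec)):
        if counts[i] == 1:
            vec[i] = max(vec[i - 1] - 1, 0)
    for i in range(1, len(vec)):
        if counts[i] == 1 and vec[i] == 0:
            return False  # continuation byte with no open sequence
        next_is_cont = i + 1 < len(vec) and counts[i + 1] == 1
        if not next_is_cont and vec[i] > 1:
            return False  # sequence ended before its announced length
    return True


print(countdown_validate("aé€😀".encode("utf-8")))  # True
print(countdown_validate(b"\xe2\x82"))               # False
print(countdown_validate(b"\xc0\xaf"))               # True (overlong, see comment #13)
```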

setting VL to 64 would get you about...  maybe 14-18 instructions per
64 bytes?

the only thing about those cascading subtracts is, they could create
some horrendous hazard dependencies.  therefore another potential
way to do it would be to have a loop-unrolled sequence of sv.addi
operations, bouncing back and forth between two pairs of registers.
maybe three with a shift-incremented mask?

another potential way would be to use bmask, to analyse the start and
end points.

Comment 15 Jacob Lifshay 2022-08-23 10:28:37 BST

I have a different strategy that I think will work well. I started adding it as a test case in openpower-isa.git, but ran out of time today: I started writing a super simple svp64 emulator because I thought I'd need elwidth overrides (which iirc the simulator doesn't yet support), but it turns out I don't, so I converted it to a TestAccumulatorBase test case.

basic idea:
load the current chunk of bytes into regs 64-95, one byte per register, zero padded to 32 regs (zero padding is safe because nul is always a 1-byte utf-8 char). put the previous iteration's chunk in regs 32-63.

now match the regs against the valid utf-8 patterns (see comment #6), accessing previous bytes by using regs 63-94, 62-93, 61-92, etc.

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=f640d6b5c0ca5ae72d70cdaa95cda4f7e68e7e60
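
The register layout described above can be modelled in Python as a 64-entry window (previous chunk, then current chunk), where "the byte k positions back" is just a slice shifted by k, mirroring reads from regs 63-94, 62-93, 61-92. This is a hypothetical model for illustration (`windows`, `back` and `validate_windowed` are made-up names), with the per-byte checks taken from Table 3-7 in comment #6; it assumes one extra all-zero chunk is processed at the end so that a sequence truncated by end of input is caught.

```python
from typing import Iterator, List

CHUNK = 32  # bytes per iteration, one byte per register


def windows(data: bytes) -> Iterator[List[int]]:
    """Yield 64-entry windows: previous chunk (like regs 32-63) followed by
    the current chunk (like regs 64-95), zero padded. Stepping to
    len(data) + 1 adds a trailing all-zero chunk when len(data) is a
    multiple of CHUNK, so truncation at end of input is always seen."""
    prev = [0] * CHUNK
    for start in range(0, len(data) + 1, CHUNK):
        cur = list(data[start:start + CHUNK])
        cur += [0] * (CHUNK - len(cur))
        yield prev + cur
        prev = cur


def back(window: List[int], k: int) -> List[int]:
    """The current chunk as seen k bytes back (k=1 is like regs 63-94)."""
    return window[CHUNK - k:2 * CHUNK - k]


def validate_windowed(data: bytes) -> bool:
    for w in windows(data):
        for b, a1, a2, a3 in zip(back(w, 0), back(w, 1), back(w, 2), back(w, 3)):
            if b in (0xC0, 0xC1) or b >= 0xF5:
                return False  # never appears in well-formed UTF-8
            # a continuation byte is required iff a lead byte 1, 2 or 3
            # positions back still expects one
            need_cont = a1 >= 0xC0 or a2 >= 0xE0 or a3 >= 0xF0
            if need_cont != (0x80 <= b <= 0xBF):
                return False  # missing, stray or surplus continuation byte
            # second-byte restrictions from Table 3-7 (comment #6)
            if ((a1 == 0xE0 and b < 0xA0) or (a1 == 0xED and b > 0x9F)
                    or (a1 == 0xF0 and b < 0x90) or (a1 == 0xF4 and b > 0x8F)):
                return False
    return True


print(validate_windowed(b"a" * 31 + "€".encode("utf-8")))  # True
```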

Comment 16 Jacob Lifshay 2022-08-23 10:29:25 BST

(In reply to Jacob Lifshay from comment #15)
> now match the regs against the valid utf-8 patterns (see comment #6),
> accessing previous bytes by using regs 63-94, 62-93, 61-92, etc.

using sv.cmprb should help a bunch.

Comment 17 Luke Kenneth Casson Leighton 2022-08-23 11:06:46 BST

(In reply to Jacob Lifshay from comment #15)
> I have a different strategy that I think will work well, I started adding it
> as a test case in openpower-isa.git, but ran out of time today because I
> started writing a super simple svp64 emulator

for goodness' sake don't waste time doing that.

> since I thought I'd need
> elwidth overrides (which iirc the simulator doesn't yet support), 

you can help add it!  any time spent working on ISACaller no matter how
small is infinitely more useful than any amount of time spent on duplication
of effort.

> now match the regs against the valid utf-8 patterns (see comment #6),
> accessing previous bytes by using regs 63-94, 62-93, 61-92, etc.

i'll be fascinated to see how that goes (using sv.cmprb, offset by one
each time).

estimating the clocks/byte is challenging as it will depend fundamentally
on the micro-architecture.

there *may* be yet *another* way - to use Vertical-First Mode and
rely on a Multi-issue Engine.  due to the branches though i don't
think it would be beneficial.

Comment 18 Jacob Lifshay 2022-08-23 23:06:51 BST

(In reply to Luke Kenneth Casson Leighton from comment #9)
> i have some ideas floating around but until appropriate *scalar*
> non-optimised
> *simple* implementations are found i cannot nail those ideas down.

a good simple scalar algorithm is Algorithm 1 in:
https://arxiv.org/pdf/2010.03090.pdf

Comment 19 Luke Kenneth Casson Leighton 2022-08-24 00:15:00 BST

the branch-range one?
yyeah... i wonder, if it would work to do a sequence of cmps (and cmprbs),
every one of the tests in each case statement, then transfer them into
INTs (crweirds), do a 1-bit 2-bit 3-bit and 4-bit shift on them, then
use ANDs ORs and BMASKs to perform a parallel bitlevel version of that
switch statement? (no branches at all)

what would be insane is that by doing sv.ANDs, sv.ors and sv.bmasks
you could, with say 8-way multi-issue, be doing the equivalent of
64x8 switch statements all simultaneously.
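
The branch-free idea can be sketched with Python integers standing in for packed CR/predicate bits: bit i of each mask is the result of one comparison on byte i, and everything after that is shifts, ANDs and ORs over whole masks. This is a model only, not SVP64 assembly; all names are hypothetical. Since a sequence has at most 3 continuation bytes, shifts of 1 to 3 positions suffice here.

```python
from typing import Callable


def mask(data: bytes, pred: Callable[[int], bool]) -> int:
    """Bit i is set iff pred(data[i]): a stand-in for a vector compare
    whose CR bits have been packed into one integer."""
    m = 0
    for i, b in enumerate(data):
        if pred(b):
            m |= 1 << i
    return m


def valid_utf8_bitwise(data: bytes) -> bool:
    """Once every byte is classified, only whole-mask shifts, ANDs and ORs
    are used: there is no per-byte switch statement."""
    cont = mask(data, lambda b: 0x80 <= b <= 0xBF)
    lead2 = mask(data, lambda b: 0xC2 <= b <= 0xDF)
    lead3 = mask(data, lambda b: 0xE0 <= b <= 0xEF)
    lead4 = mask(data, lambda b: 0xF0 <= b <= 0xF4)
    never = mask(data, lambda b: b in (0xC0, 0xC1) or b >= 0xF5)
    lt_a0 = mask(data, lambda b: b < 0xA0)
    lt_90 = mask(data, lambda b: b < 0x90)
    e0 = mask(data, lambda b: b == 0xE0)
    ed = mask(data, lambda b: b == 0xED)
    f0 = mask(data, lambda b: b == 0xF0)
    f4 = mask(data, lambda b: b == 0xF4)
    # positions that must hold a continuation byte
    need = ((lead2 << 1) | (lead3 << 1) | (lead3 << 2)
            | (lead4 << 1) | (lead4 << 2) | (lead4 << 3))
    # Table 3-7 restrictions on the byte following E0, ED, F0 and F4
    bad_second = (((e0 << 1) & lt_a0) | ((ed << 1) & ~lt_a0)
                  | ((f0 << 1) & lt_90) | ((f4 << 1) & ~lt_90))
    return never == 0 and need == cont and bad_second == 0


print(valid_utf8_bitwise("aé€😀".encode("utf-8")))  # True
print(valid_utf8_bitwise(b"\xc0\xaf"))               # False
```

A truncated sequence at the end of the input shows up as `need` having bits beyond the last byte, which `cont` can never match; overlapping sequences show up the same way, because a lead byte sits where an earlier lead byte needed a continuation.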

Comment 20 Luke Kenneth Casson Leighton 2022-08-24 00:16:24 BST

(In reply to Luke Kenneth Casson Leighton from comment #19)
> the branch-range one?
> yyeah... i wonder, if it would work to do a sequence of cmps (and cmprbs),

[cmps cmprbs and countleading1s].

Comment 21 Jacob Lifshay 2022-08-24 01:52:52 BST

thinking about it, it would be very useful to have a quick way to do what
risc-v v's vslideup/vslidedown do:
https://github.com/riscv/riscv-v-spec/blob/b6368b3c44d775f8eb01c7ce0ad017db19944aa7/v-spec.adoc#163-vector-slide-instructions

it can be done using remap, but takes several instructions to be set up. imho we should use one of the svshape reserved combinations for this. it should not set vl and mvl as part of svshape.

the example code I've been writing works around that by expanding each byte to fill a whole 64-bit register -- pretty wasteful.

Comment 22 Luke Kenneth Casson Leighton 2022-08-24 02:48:27 BST

(In reply to Jacob Lifshay from comment #21)
> thinking about it, it would be very useful to have a quick way to do what
> risc-v v's vslideup/vslidedown do:
> https://github.com/riscv/riscv-v-spec/blob/
> b6368b3c44d775f8eb01c7ce0ad017db19944aa7/v-spec.adoc#163-vector-slide-
> instructions

if register aligned a simple sv.ori rt, rt+1, 0 does
that. /mrr inverts the loop order.

except if nonaligned then REMAP offset is needed.
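
Assuming the sequential element-order semantics that the cascade in comment #14 also relies on, an overlapping vector move behaves like a forward loop copy: moving from rt+1 into rt (slide down) works as-is, while the opposite direction needs the element order reversed. The model below is hypothetical; `sv_ori_model` and its `reverse` flag (standing in for /mrr) are illustrative names.

```python
def sv_ori_model(regs, dst, src, vl, reverse=False):
    """Element-by-element model of a vector move (sv.ori with UI=0):
    element i copies regs[src + i] into regs[dst + i]. reverse=True walks
    from the last element to the first, standing in for /mrr."""
    order = range(vl - 1, -1, -1) if reverse else range(vl)
    for i in order:
        regs[dst + i] = regs[src + i]
    return regs


# slide down by one (dst = rt, src = rt+1): forward order is already correct
print(sv_ori_model([10, 11, 12, 13, 14], 0, 1, 4))
# [11, 12, 13, 14, 14]
# slide up by one (dst = rt+1, src = rt): forward order smears element 0 ...
print(sv_ori_model([10, 11, 12, 13, 14], 1, 0, 4))
# [10, 10, 10, 10, 10]
# ... while reversed order gives the intended shift
print(sv_ori_model([10, 11, 12, 13, 14], 1, 0, 4, reverse=True))
# [10, 10, 11, 12, 13]
```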

> it can be done using remap, but takes several instructions to be set up.

two. that's hardly "several", is it.  and if in a loop
and there are no other uses, the svshape can be set once,
outside, and the svremap on-demand as needed. leave SVSHAPE0-3
alone, activate them when needed, not setting the "persist"
bit. then the remaps apply to the next instruction only
and switch off again... *without* changing SVSHAPE0-3 though.

> imho we should use one of the svshape reserved combinations for this. it
> should not set vl and mvl as part of svshape.

svshape currently exists to absolutely minimise the instruction count
for 3 uses: matrix, dct, fft. anything else is a welcome bonus.

> the example code I've been writing works around that by expanding each byte
> to fill a whole 64-bit register -- pretty wasteful.

First approximation, good enough, then work out what can
be done better.

for example by doing 64-bit svshape (sv.svshape) an extra
24 bits magically becomes available. i have no problem
at all in some of those bits expanding the options that
had to be limited or missed entirely for the 32 bit svshape,
such as the offset.

matrix mode is perfectly capable of being set to 1D which
when combined with offset gives the desired result here.
5+1 bits are also enough to set a small range of remap
options as well (see svindex for how that can be done)

Comment 23 Jacob Lifshay 2022-08-24 06:19:15 BST

(In reply to Luke Kenneth Casson Leighton from comment #22)
> (In reply to Jacob Lifshay from comment #21)
> > thinking about it, it would be very useful to have a quick way to do what
> > risc-v v's vslideup/vslidedown do:
> > https://github.com/riscv/riscv-v-spec/blob/
> > b6368b3c44d775f8eb01c7ce0ad017db19944aa7/v-spec.adoc#163-vector-slide-
> > instructions
> 
> if register aligned a simple sv.ori rt, rt+1, 0 does
> that. /mrr inverts the loop order.
> 
> except if nonaligned then REMAP offset is needed.
> 
> > it can be done using remap, but takes several instructions to be set up.
> 
> two. that's hardly "several", is it.

you forgot the setvl again since svshape put junk in it...

I didn't see how to get the svshape instruction to set offset...

Also, the algorithm constantly needs to switch between several offsets, making a dedicated mode desirable.

Comment 24 Jacob Lifshay 2022-08-24 06:22:19 BST

I wrote out the full algorithm, but was stymied trying to get `sv.andi.` to assemble:

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=e347fb846bba92dbec07b33f08e185daad9df68b

  File "/home/jacob/projects/libreriscv/openpower-isa/src/openpower/test/algorithms/svp64_utf_8_validation.py", line 234, in run_case
    lst = list(isa)
  File "/home/jacob/projects/libreriscv/openpower-isa/src/openpower/sv/trans/svp64.py", line 617, in __iter__
    yield from self.trans
  File "/home/jacob/projects/libreriscv/openpower-isa/src/openpower/sv/trans/svp64.py", line 1358, in translate
    yield from self.translate_one(insn)
  File "/home/jacob/projects/libreriscv/openpower-isa/src/openpower/sv/trans/svp64.py", line 667, in translate_one
    raise Exception("opcode %s of '%s' not supported" %
Exception: opcode andi of 'sv.andi. *80, *47, 15' not supported

I'll debug that more later.

Comment 25 Jacob Lifshay 2022-08-24 12:47:08 BST

fixed assembling `sv.andi.`:

https://git.libre-soc.org/?p=openpower-isa.git;a=commit;h=6a79227deb29927ad71115ab99d9ff054173bd84

rewrote a lot of the utf-8 validation code to work around simulator quirks/unimplemented-stuff; the utf-8 validation code is still not working yet. now, to figure out if that's due to flaws in my code, or flaws in the svp64 implementation...

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=1e445b5efce833d158950c5084d8ee1dce0be0f8

I added code so you can use self.subTest(...) with TestAccumulatorBase, as well as adding src_loc_at so you can specify which function in the backtrace is the one you care about.

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=b64cafd74bd05c6d5cf42ffb224f3227395bc796

Comment 26 Luke Kenneth Casson Leighton 2022-08-24 13:02:33 BST

(In reply to Jacob Lifshay from comment #23)

> I didn't see how to get the svshape instruction to set offset...

it can't.  although i may have worked out a way to do it, by using these
SVRM modes https://libre-soc.org/openpower/sv/remap/

    0b1000	reserved
    0b1001	reserved

it would mean sacrificing the third dimension (when setting offset), i.e.
only being able to do 1D or 2D REMAP, because

    svshape SVxd,SVyd,SVzd,SVRM,vf

* SVxd would be interpreted as the offset
* SVyd as an rmm (see svindex instruction)
* SVzd as-is (the dimension)

so by sort-of combining what's already been done in svindex with svshape
it *should* be possible.

> Also, the algorithm constantly needs to switch between several offsets,
> making a dedicated mode desirable.

interesting.  ok so that also means having the "nonpersist" mode is
also a priority, and being able to set up several SVSHAPEs simultaneously.
ok this is all doable.

(In reply to Jacob Lifshay from comment #24)

>     raise Exception("opcode %s of '%s' not supported" %
> Exception: opcode andi of 'sv.andi. *80, *47, 15' not supported

oink.

--- a/src/openpower/sv/trans/svp64.py
+++ b/src/openpower/sv/trans/svp64.py
@@ -1535,6 +1535,9 @@ if __name__ == '__main__':
         'fmvis 5,64',
         'fmvis 5,32768',
     ]
+    lst = [
+        'sv.and. *80, *80, 1',
+    ]
     isa = SVP64Asm(lst, macros=macros)
     log("list", list(isa))
     asm_process()

sv.and is detected/supported but sv.andi is not.  moo?
i bet that's just entirely missing from the RM*.csv files
i.e. missing entirely from sv_analysis.py as a recognised
pattern.

../openpower/isatables/RM-2P-1S1D.csv:andi.,NORMAL,,2P,EXTRA3,
                        d:RA;d:CR0,s:RS,0,0,RS,0,0,RA,0,CR0,0

oink.  noo, it's there - that's even weirder.

leave it with me.

Comment 27 Luke Kenneth Casson Leighton 2022-08-24 13:22:32 BST

(In reply to Jacob Lifshay from comment #24)

> Exception: opcode andi of 'sv.andi. *80, *47, 15' not supported

sorted.

Comment 28 Luke Kenneth Casson Leighton 2022-08-24 13:53:05 BST

https://libre-soc.org/openpower/sv/remap/discussion/

drat.  i think that's going to need a new instruction.
svoffset or svshape2 or something.  it's almost the
same but not close enough.  in HDL it can be covered
by svshape but it is sufficiently different to likely
need a new instruction.

sigh

Comment 29 Luke Kenneth Casson Leighton 2022-08-24 14:43:14 BST

        # set bit 0x80 (TwoContinuations) if input is >= 0xF0
        f"sv.subi/satu *80, *45, {0xF0 - 0x80}",

saturation isn't implemented yet, use minu/maxu with a
constant scalar RB=0.
https://libre-soc.org/openpower/isa/av/

hm, thought just occurred to me, would (RB|0) be useful
in mins/maxs?
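
One way to get the saturating-subtract result without a saturation mode is the identity sat(x - c) = maxu(x, c) - c: after the max, the subtraction cannot wrap. This is offered as an alternative sketch, not necessarily the minu/maxu sequence meant above; the function names are hypothetical. The check also confirms the comment in the quoted code: with c = 0xF0 - 0x80 = 0x70, bit 0x80 of the result is set exactly when the input byte is >= 0xF0.

```python
def sat_sub_u8(x: int, c: int) -> int:
    """Reference: unsigned saturating byte subtract (clamps at 0, no wrap)."""
    return max(x - c, 0)


def sat_sub_via_maxu(x: int, c: int) -> int:
    """maxu then an ordinary 8-bit subtract; maxu(x, c) >= c, so the
    subtraction never wraps and the & 0xFF never changes the value."""
    return (max(x, c) - c) & 0xFF


c = 0xF0 - 0x80  # the constant from the quoted sv.subi/satu line
print(all(sat_sub_via_maxu(x, c) == sat_sub_u8(x, c) for x in range(256)))  # True
print([x for x in range(256) if sat_sub_u8(x, c) & 0x80] == list(range(0xF0, 0x100)))  # True
```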

Comment 30 Luke Kenneth Casson Leighton 2022-08-24 18:15:40 BST

ok so whilst svshape2 doesn't yet exist you can use svindex:

    setvl
    svstep
    sv.addi
    svindex
    blah

so you set the length, then get svstep to output the indices
into an array, then add one to them, then use them.

once the array of offsets is set up as long as you don't
overwrite them obviously they are reusable, it only takes
one instruction (svindex) to activate them.

there is a persistent mode for svindex and a nonpersistent.
you almost certainly want the nonpersistent one, for which
the rmm argument is a bitmask which specifies whether,
in lsb to msb order, RA RB RC RT EA/2nd-outputreg is to
be REMAPped.

so if you want only RB of sv.add *RT,*RA,*RB to be REMAPped
set rmm=0b00010.  RT and RA, set rmm=0b01001

this same rmm field will be in the svshape2 instruction as
well.
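
The rmm encoding and the offset-index trick can be modelled in a few lines of Python. This is a sketch based only on the description above: the bit order RA, RB, RC, RT, EA/2nd-outputreg (lsb first) is taken from it, "EA" is a shorthand key for the last operand, and all function names are hypothetical.

```python
# rmm bit positions as described above, least significant bit first;
# "EA" is shorthand for the EA/2nd-outputreg operand
RMM_BIT = {"RA": 0, "RB": 1, "RC": 2, "RT": 3, "EA": 4}


def rmm_for(*operands: str) -> int:
    """Build the 5-bit rmm mask naming the operands to be REMAPped."""
    value = 0
    for op in operands:
        value |= 1 << RMM_BIT[op]
    return value


def offset_indices(vl: int, offset: int) -> list:
    """setvl + svstep + sv.addi: svstep's 0..VL-1, each plus the offset."""
    return [i + offset for i in range(vl)]


def remapped_reads(regs: list, base: int, indices: list) -> list:
    """A REMAPped vector operand: element i reads regs[base + indices[i]]."""
    return [regs[base + j] for j in indices]


print(format(rmm_for("RB"), "#07b"))        # 0b00010
print(format(rmm_for("RT", "RA"), "#07b"))  # 0b01001
print(remapped_reads(list(range(100, 110)), 0, offset_indices(4, 1)))
# [101, 102, 103, 104]
```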

Comment 31 Jacob Lifshay 2022-08-24 22:52:03 BST

utf-8 <-> utf-16
https://woboq.com/blog/utf-8-processing-using-simd.html

Comment 32 Jacob Lifshay 2022-08-26 10:04:36 BST

did more work on utf-8 validation using svp64. I got it to run successfully for validating the empty string! It's still broken for " " though...

I'm having to fix the simulator a bunch: setvl. didn't correctly output CR0.

I changed log() to support more granular enable/disable:
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=774ed2fd6547e7dc7ebea89e6f522b4c21792108

I added support to the svp64 assembler for comments.

I also added the original instruction as comments on the `.long`s generated by the svp64 assembler -- it makes debugging much easier to see:
.long 0x580005B6 # setvl 0, 0, 3, 0, 1, 1
rather than:
.long 0x580005B6

Comment 33 Jacob Lifshay 2022-08-26 10:08:52 BST

As an example of log filtering:
> SILENCELOG='!instr_in_outs' python src/openpower/decoder/isa/test_caller_svp64_utf_8_validation.py

outputs:
<LogKind.Default: 'default'> silenced
<LogKind.InstrInOuts: 'instr_in_outs'> active
<LogKind.SkipCase: 'skip_case'> silenced
running test:  case_empty {'data': b'', 'expected': 1}
<snip>
0x003C: 58A40FB7 .long 0x58A40FB7 # setvl. 5, 4, 8, 0, 1, 1
read reg r4: 0x0
write reg CR: 0x20000000
write reg SVSTATE: 0x1000000000000000
write reg CTR: 0x0

0x0040: 418200BC bc 12, 2, final_check # beq final_check
write reg CTR: 0x0
write reg CR: 0x20000000
write reg LR: 0x10000000

0x00FC: 580001B6 .long 0x580001B6 # setvl 0, 0, 1, 0, 1, 1
read reg r0: 0x15CEE3293AA9BFBE
write reg CR: 0x20000000
write reg SVSTATE: 0x204000000000000
write reg CTR: 0x0

0x0100: 05400100 .long 0x05400100 # sv.cmpli 0, 1, 45, 240
read reg r45: 0x0
write reg CR: 0x80000000

0x0108: 4080FFEC bc 4, 0, fail # bge fail
write reg CTR: 0x0
write reg CR: 0x80000000
write reg LR: 0x10000000
<snip>

Comment 34 Luke Kenneth Casson Leighton 2022-08-26 10:44:41 BST

        else if _RA != 0         then
            if (RA) >u 0b1111111 then VL <- 0b1111111
            else VL <- (RA)[57:63]

i have no idea why when i added exactly this a few days ago
it is not already committed, duh
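
For readers not used to Power ISA bit numbering: (RA)[57:63] uses MSB0 numbering, where bit 0 is the most significant bit, so on a 64-bit register it selects the 7 least significant bits. A small Python model of the quoted VL clamp, treating RA as an unsigned 64-bit value (function names are hypothetical):

```python
def msb0_bits(value: int, start: int, end: int, width: int = 64) -> int:
    """Extract bits start..end (inclusive) in MSB0 numbering, where bit 0
    is the most significant bit of a `width`-bit register."""
    length = end - start + 1
    return (value >> (width - 1 - end)) & ((1 << length) - 1)


def vl_from_ra(ra: int) -> int:
    """The clamp from the quoted pseudocode: VL saturates at 0b1111111."""
    if ra > 0b1111111:
        return 0b1111111
    return msb0_bits(ra, 57, 63)


print(vl_from_ra(5), vl_from_ra(200))  # 5 127
```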

+        if Rc = 1 then
+            if step = 0 then c <- 0b001
+            else c <- 0b010
+            CR[32:35] <- c || XER[SO]

this should already be done and does not look correct
according to spec

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/caller.py;h=f3d9d8085115bc0c053116707b37a2cba5e40d6b;hb=HEAD#l1665

ah hang on yes check_step_increment is not called on 32bit
scalar ops. that may need fixing esp. for "svstep."

Comment 35 Luke Kenneth Casson Leighton 2022-08-26 10:48:17 BST

        # sv.andi. is buggy,

sorted. if not please do add a unit test and i will take
a look.

Comment 36 Jacob Lifshay 2022-08-29 08:48:27 BST

I got UTF-8 validation to work!
I had to do a bunch of instruction substitution to work around limitations/bugs in the instruction simulator, most of the limitations/bugs are documented in the comments on the function that generates the assembly:
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/test/algorithms/svp64_utf_8_validation.py;h=f040f1e6f114927bfcae0714d9d76bb737e0c42c;hb=918f3eadf7118a6ecd0e2eb6caaaed9da6936299#l135

I also implemented a better memory dump for logging, like hexdump -C:
https://git.libre-soc.org/?p=openpower-isa.git;a=commit;h=661ae80360644dfe3a9e7f3610d534cc3a7e545f

I also implemented support for svp64 prefixed instructions that have a libre-soc-custom suffix, e.g. sv.maxu:
https://git.libre-soc.org/?p=openpower-isa.git;a=commit;h=0e80cab3b809d432354ca05464e95dc53db11b64

I also added support to Expected for when tests don't care what so, ov, and ca get set to, those can just be set to None:
https://git.libre-soc.org/?p=openpower-isa.git;a=commit;h=07f5f22461d5eda844141b2ffd33e021d2b43ffb

Comment 37 Luke Kenneth Casson Leighton 2022-08-29 10:13:40 BST

(In reply to Jacob Lifshay from comment #36)
> I got UTF-8 validation to work!

frickin-A!

> I had to do a bunch of instruction substitution to work around
> limitations/bugs in the instruction simulator, most of the limitations/bugs
> are documented in the comments on the function that generates the assembly:

see comment #35 

and you can use "svstep." instead of this:

        f"sv.addi *{cur_bytes + 1}, *{cur_bytes}, 1",  # create indexes

and this:

        # clear cur bytes, so bytes beyond end end up being zeros
        f"setvl 0, 0, {vec_sz}, 0, 1, 1",  # set VL to vec_sz

is what data-dependent fail-first is for (although it still needs
implementing): it will auto-truncate VL at the terminating zero.  you need
to set the "/vli" option to include the failing terminating zero.
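
A simplified Python model of the VL truncation described above. It is hypothetical: real data-dependent fail-first tests a condition-register bit produced by the instruction, which this sketch collapses into "the element is zero"; `vli` mirrors the /vli option by keeping the failing element.

```python
def ddffirst_vl(elements, vl, vli=False):
    """Truncate VL at the first element whose test fails (here: it is
    zero, like a terminating NUL). With vli=True that element is kept."""
    for i in range(vl):
        if elements[i] == 0:
            return i + 1 if vli else i
    return vl


data = [0x68, 0x69, 0x00, 0x41]  # "hi", NUL, then junk past the end
print(ddffirst_vl(data, 4))            # 2
print(ddffirst_vl(data, 4, vli=True))  # 3
```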

> 
> I also implemented a better memory dump for logging, like hexdump -C:
> https://git.libre-soc.org/?p=openpower-isa.git;a=commit;
> h=661ae80360644dfe3a9e7f3610d534cc3a7e545f

brilliant

> I also implemented support for svp64 prefixed instructions that have a
> libre-soc-custom suffix, e.g. sv.maxu:
> https://git.libre-soc.org/?p=openpower-isa.git;a=commit;
> h=0e80cab3b809d432354ca05464e95dc53db11b64

mmm... this may have damaged detection of "sv.fmadds."
please check that.

+        if not v30b_op.endswith('.'):
+            v30b_op += rc
         # argh, sv.fmadds etc. need to be done manually
         if v30b_op == 'ffmadds':

> I also added support to Expected for when tests don't care what so, ov, and
> ca get set to, those can just be set to None:
> https://git.libre-soc.org/?p=openpower-isa.git;a=commit;
> h=07f5f22461d5eda844141b2ffd33e021d2b43ffb

excellent.

Comment 38 Jacob Lifshay 2022-08-29 10:40:25 BST

(In reply to Luke Kenneth Casson Leighton from comment #37)
> (In reply to Jacob Lifshay from comment #36)
> > I also implemented support for svp64 prefixed instructions that have a
> > libre-soc-custom suffix, e.g. sv.maxu:
> > https://git.libre-soc.org/?p=openpower-isa.git;a=commit;
> > h=0e80cab3b809d432354ca05464e95dc53db11b64
> 
> mmm... this may have damaged detection of "sv.fmadds."
> please check that.

it passed all tests in the openpower-isa repo on my computer, so I assumed that means I didn't break anything.

i'll work on moving all those sv.* special cases to CUSTOM_INSNS and add the appropriate sv.*. mnemonics tomorrow, that should be a good cleanup.

since fmadds. specifically is a v3.0b op, sending it straight to gas should work fine, no special case should be needed.

Comment 39 Luke Kenneth Casson Leighton 2022-08-29 10:49:42 BST

(In reply to Jacob Lifshay from comment #38)

> i'll work on moving all those sv.* special cases to CUSTOM_INSNS and add the
> appropriate sv.*. mnemonics tomorrow, that should be a good cleanup.

in the process, do *not* append "." onto v30b_op until *after* all
processing has been completed. that is what caused the failure. 

or, ensure that CUSTOM_INSNS has the required match-patterns:
both "ffmadds" *and* "ffmadds.", "maxu" *and* "maxu."
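
The ordering problem can be reproduced with a toy lookup. Everything here is a hypothetical stand-in for the assembler (the real CUSTOM_INSNS contents and the way `rc` is parsed are not shown in this thread): the buggy version appends the Rc suffix before matching, the fixed version matches on the bare mnemonic and appends the suffix afterwards.

```python
# Hypothetical stand-in for the assembler's special-case table.
CUSTOM_INSNS = {"ffmadds", "maxu"}


def lookup_buggy(v30b_op: str, rc: str) -> bool:
    if not v30b_op.endswith('.'):
        v30b_op += rc  # suffix appended first...
    return v30b_op in CUSTOM_INSNS  # ...so "ffmadds." misses "ffmadds"


def lookup_fixed(v30b_op: str, rc: str):
    is_custom = v30b_op.rstrip('.') in CUSTOM_INSNS  # match the bare mnemonic
    if not v30b_op.endswith('.'):
        v30b_op += rc  # append the Rc suffix only after matching
    return is_custom, v30b_op


print(lookup_buggy("ffmadds", "."))  # False
print(lookup_fixed("ffmadds", "."))  # (True, 'ffmadds.')
```

The other option mentioned above, listing both "ffmadds" and "ffmadds." in CUSTOM_INSNS, would make the buggy ordering harmless as well.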

> since fmadds. specifically is a v3.0b op, sending it straight to gas should
> work fine, no special case should be needed.

"ffmadds." not "fmadds."

Comment 40 Jacob Lifshay 2022-08-29 10:53:05 BST

(In reply to Luke Kenneth Casson Leighton from comment #39)
> or, ensure that CUSTOM_INSNS has the required match-patterns:
> both "ffmadds" *and* "ffmadds.", "maxu" *and* "maxu."

that's what I said I'd do:
> and add the
> appropriate sv.*. mnemonics tomorrow


> > since fmadds. specifically is a v3.0b op, sending it straight to gas should
> > work fine, no special case should be needed.
> 
> "ffmadds." not "fmadds."

k, i responded about fmadds. since you said fmadds. in comment #37

Comment 41 Jacob Lifshay 2022-09-01 09:08:11 BST

Pushed the fixed cherry-picked code to master. CI passes:

https://salsa.debian.org/Kazan-team/mirrors/openpower-isa/-/commit/7217fe80d54a5dab33566e6d8fff949b84ce433e/pipelines

Comment 42 Jacob Lifshay 2022-09-04 15:11:55 BST

uuh, actually just utf-8 verification is done, neither utf-8 <-> utf-16 algorithm is done, so payment should be on a subtask rather than this bug directly

Comment 43 Luke Kenneth Casson Leighton 2022-09-07 13:37:06 BST

(In reply to Jacob Lifshay from comment #42)
> uuh, actually just utf-8 verification is done, neither utf-8 <-> utf-16
> algorithm is done, so payment should be on a subtask rather than this bug
> directly

moving them to a separate bugreport for future work is fine.
moving the entire budget to a separate task is not.