tplaten | in openpower/test/runner.py I fix the broken names so that all needed signals show up in gtkwave | 15:19 |
---|---|---|
tplaten | I start with wishbone_runner | 15:24 |
*** A_Dragon is now known as AAAAA_DRAGON | 15:53 | |
lkcl | tplaten, that's really helpful. | 16:13 |
ghostmansd | In svp64.py: assert src_zero == 0, "dest-zero not allowed in failfirst mode" | 17:12 |
ghostmansd | On the first glance it seems there should be dst_zero, amirite? | 17:13 |
ghostmansd | All stuff below deals with dst_zero as well | 17:14 |
ghostmansd | OTOH... It seems that dst_zero should be non-zero, e.g. `(dst_zero << SVP64MODE.DZ) # predicate dst-zeroing` | 17:15 |
ghostmansd | (I'm currently teaching binutils to parse ff/pr sections) | 17:15 |
tplaten | done https://git.libre-soc.org/?p=openpower-isa.git;a=summary | 17:44 |
*** tplaten <tplaten!~isengaara@55d46f66.access.ecotel.net> has left #libre-soc | 18:37 | |
*** kylel1 is now known as kylel | 19:12 | |
lkcl | ghostmansd-pc, sometimes there's just not enough bits to fit things | 19:48 |
lkcl | but there's nothing to stop the assembly syntax from trying to specify stuff that just doesn't have space | 19:49 |
octavius | Good evening. While sorting my bookmark collection, discovered this interesting question about sorted/unsorted array summation time-difference: https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array | 20:43 |
lkcl | eevnin octavius | 21:54 |
lkcl | whyyy.... https://en.cppreference.com/w/cpp/algorithm/partition | 21:59 |
lkcl | c++ is now illegible thanks to massive overuse of "standard" templates. | 21:59 |
octavius | XD | 22:00 |
octavius | What does this do? Arrange the elements to improve branch prediction? | 22:00 |
octavius | The name makes me think of memory pages, or cache boundaries | 22:01 |
octavius | Ah wait, I got it | 22:01 |
lkcl | great, you can explain it to me. i don't get it | 22:02 |
* lkcl going slightly dizzy dealing with dcache.py and icache.py | 22:03 | |
octavius | So in the specific example of the stackoverflow, the partition() function would re-arrange the list, such that all elements greater than 128 would come first, the one's less than 128 would come second. | 22:03 |
octavius | This would speed up execution, because the branch predictor will have a nice easy pattern to follow | 22:04 |
lkcl | urrr and then... yuk | 22:04 |
octavius | As the code adds the element to the sum if it's greater than 128 | 22:04 |
lkcl | if predication was used (available) that would be false | 22:05 |
octavius | what's predication? | 22:06 |
lkcl | even scalar predication (ARM). funnily enough that's exactly what's on the wikipedia page about predicates / condition-codes | 22:06 |
lkcl | "making an operation optional based on a bit" | 22:06 |
octavius | Ah, but you'd need to make a comparison between more than 1 bit, right? | 22:07 |
lkcl | instruction 1: predicate = data[c] >= 128 | 22:07 |
lkcl | instruction 2: optional(predicate){sum += data[c]) | 22:07 |
octavius | But surely the predicate will still take time to process? | 22:09 |
lkcl | count the number of instructions compared to having a branch. | 22:10 |
octavius | You need to check if a element is greater than 128, so you need to have the element and 128 loaded somewher | 22:10 |
lkcl | count the number of instructions compared to having a branch. | 22:10 |
octavius | Sure | 22:10 |
octavius | two instructions | 22:11 |
lkcl | no: three. | 22:11 |
octavius | what's the 3rd one? | 22:11 |
lkcl | instruction 1: predicate (CR0) = data[c] >= 128 | 22:11 |
lkcl | instruction 2: bc CR0, instruction 4 | 22:11 |
lkcl | instruction 3: sum += data[c] | 22:12 |
lkcl | instruction 4: loop.... | 22:12 |
lkcl | when predicates used: ==> 2 instructions | 22:13 |
lkcl | when branches used: ==> 3 instructions | 22:13 |
octavius | Ok, but I don't really understand the mechanism behind the predicate. It sounds like a complex instruction (combination of several operations) | 22:16 |
lkcl | nope. just 1 bit. | 22:16 |
lkcl | if bit set, instruction executed. | 22:16 |
lkcl | if bit clear, instruction not executed | 22:16 |
octavius | What's the bit? | 22:16 |
lkcl | dead | 22:16 |
lkcl | simple | 22:16 |
lkcl | in a register | 22:16 |
octavius | to determine if a number is bigger than 128, you need to check more than one bit though? | 22:17 |
octavius | 129 for example | 22:17 |
lkcl | and the result of that comparison is stored iiiiin.... a register of size "1 bit" | 22:17 |
lkcl | if result of comparison is true, bit equals 1 | 22:18 |
lkcl | if result of comparison is false, bit equals 0 | 22:18 |
lkcl | also | 22:18 |
lkcl | dead | 22:18 |
lkcl | simpe | 22:18 |
lkcl | ple | 22:18 |
octavius | However before the comparison is made, the 128 constant has to be loaded somewhere? (which takes an instruction) | 22:18 |
lkcl | not if the ISA has compare-against-immediate | 22:18 |
octavius | Ah ok, now it makes sense | 22:19 |
lkcl | and if it doesn't then the number of instructions in predicate goes up by one | 22:19 |
octavius | Just an ISA feature | 22:19 |
lkcl | and the number of instructions in branch goes up by one | 22:19 |
lkcl | now you are comparing | 22:19 |
lkcl | when predicates used ==> 3 instructions | 22:19 |
octavius | So you still gain regardless | 22:19 |
lkcl | when branches used ==> 4 instructions | 22:19 |
lkcl | of course | 22:19 |
lkcl | it's real simple | 22:19 |
lkcl | and ARM and many other ISAs have had predication for decades | 22:20 |
lkcl | Vector ISAs have had predication for over 50 years | 22:20 |
lkcl | there, they have multiple bits because it is multiple elements | 22:21 |
lkcl | one bit per element in the operation. | 22:21 |
lkcl | also dead simple. | 22:21 |
lkcl | out-of-order speculative engines can jam/smash a ton of predicate-calculations-plus-predicated-operations into their Reservation Stations | 22:22 |
lkcl | then once the predicate bit is calculated (by the compare) | 22:22 |
lkcl | simply cancel the instruction if that predicate bit is zero | 22:22 |
lkcl | which then frees up the Function Unit for re-use by other incoming instructions | 22:23 |
lkcl | where branch speculation does pretty much exactly the same thing... | 22:24 |
octavius | That's beautiful | 22:24 |
lkcl | ... but by jamming 50% more instructions into the Reservation Stations | 22:24 |
lkcl | some CPUs will actually put *both* the paths into the RSes! | 22:24 |
lkcl | and cancel one of them | 22:25 |
octavius | Would that be considered a "high-performance" mode, where more power is consumed for a faster outcome? | 22:25 |
lkcl | the penalty: you have *twice* as many instructions in-flight in the RSes compared to using predication, if you put both branch-success and branch-fail into RSes | 22:26 |
lkcl | yes, but it's insane. | 22:26 |
octavius | Insane because it's a waste most of the time? | 22:26 |
lkcl | well, branch prediction is supposed to "help" | 22:27 |
lkcl | but in the case of randomised data, it's never going to help, is it? | 22:27 |
octavius | Yeah, worse than a coin flip | 22:27 |
octavius | I'll be off, will leave you with one more fun blog I'm sure you'll relate (software shipping with O(n^2) algos :) ) Good night | 22:47 |
lkcl | :) | 22:47 |
octavius | https://randomascii.wordpress.com/2021/02/16/arranging-invisible-icons-in-quadratic-time/ | 22:47 |
Generated by irclog2html.py 2.17.1 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!