Reimplement php_round_helper() using modf() #12220

Merged
merged 6 commits into from
Sep 19, 2023

Conversation

TimWolla
Member

This change makes the implementation much easier to understand, by explicitly handling the various cases.

It fixes rounding for 0.49999999999999994, because the value is no longer exposed to the loss of precision caused by adding / subtracting 0.5 before turning the result into an integral float. Instead, the fractional parts are compared explicitly.

see GH-12143 (this fixes one of the reported cases)
Closes GH-12159 which was an alternative attempt to fix the rounding issue for 0.49999999999999994

/cc @jorgsowa

Comment on lines 102 to 106
if (fractional >= 0.5) {
	return integral + copysign(1.0, integral);
}

return integral;
Member Author

clang is able to compile this to branchless code, as per https://godbolt.org/z/5cb8W1416, which is likely a good thing.

@SakiTakamachi
Member

SakiTakamachi commented Sep 16, 2023

@TimWolla
lgtm!

Since math.h may have an issue with the FPU control word, I built your branch on alpine 32bit and ran all tests.
All tests passed as expected, so modf and fmod can be used with confidence.

Member

@Girgias Girgias left a comment

Minor nit but LGTM

@bukka
Member

bukka commented Sep 17, 2023

Are you able to do some perf tests to see if there's any difference?

This makes the code even clearer to understand and also improves the assembly,
allowing the compiler to use an actual jump table for the switch cases.
@TimWolla TimWolla requested a review from Girgias September 17, 2023 12:53
@TimWolla
Member Author

Are you able to do some perf tests to see if there's any difference?

Using

#include <math.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>

#ifndef PHP_ROUND_HALF_UP
#define PHP_ROUND_HALF_UP        0x01    /* Arithmetic rounding, up == away from zero */
#endif

#ifndef PHP_ROUND_HALF_DOWN
#define PHP_ROUND_HALF_DOWN      0x02    /* Down == towards zero */
#endif

#ifndef PHP_ROUND_HALF_EVEN
#define PHP_ROUND_HALF_EVEN      0x03    /* Banker's rounding */
#endif

#ifndef PHP_ROUND_HALF_ODD
#define PHP_ROUND_HALF_ODD       0x04
#endif

double old(double value, int mode) {
	double tmp_value;

	if (value >= 0.0) {
		tmp_value = floor(value + 0.5);
		if ((mode == PHP_ROUND_HALF_DOWN && value == (-0.5 + tmp_value)) ||
			(mode == PHP_ROUND_HALF_EVEN && value == (0.5 + 2 * floor(tmp_value/2.0))) ||
			(mode == PHP_ROUND_HALF_ODD  && value == (0.5 + 2 * floor(tmp_value/2.0) - 1.0)))
		{
			tmp_value = tmp_value - 1.0;
		}
	} else {
		tmp_value = ceil(value - 0.5);
		if ((mode == PHP_ROUND_HALF_DOWN && value == (0.5 + tmp_value)) ||
			(mode == PHP_ROUND_HALF_EVEN && value == (-0.5 + 2 * ceil(tmp_value/2.0))) ||
			(mode == PHP_ROUND_HALF_ODD  && value == (-0.5 + 2 * ceil(tmp_value/2.0) + 1.0)))
		{
			tmp_value = tmp_value + 1.0;
		}
	}

	return tmp_value;
}

double new_jorg(double value, int mode) {
	double tmp_value;

	if (value >= 0.0) {
		tmp_value = floor(value + 0.5);
		if ((mode == PHP_ROUND_HALF_DOWN && value == (-0.5 + tmp_value)) ||
			(mode == PHP_ROUND_HALF_EVEN && value == (0.5 + 2 * floor(tmp_value/2.0))) ||
			(mode == PHP_ROUND_HALF_ODD  && value == (0.5 + 2 * floor(tmp_value/2.0) - 1.0)) ||
			value < (-0.5 + tmp_value))
		{
			tmp_value = tmp_value - 1.0;
		}
	} else {
		tmp_value = ceil(value - 0.5);
		if ((mode == PHP_ROUND_HALF_DOWN && value == (0.5 + tmp_value)) ||
			(mode == PHP_ROUND_HALF_EVEN && value == (-0.5 + 2 * ceil(tmp_value/2.0))) ||
			(mode == PHP_ROUND_HALF_ODD  && value == (-0.5 + 2 * ceil(tmp_value/2.0) + 1.0)) ||
			value > (0.5 + tmp_value))
		{
			tmp_value = tmp_value + 1.0;
		}
	}

	return tmp_value;
}

double new_tim(double value, int mode) {
	double integral, fractional;

	fractional = fabs(modf(value, &integral));

	switch (mode) {
		case PHP_ROUND_HALF_UP:
			if (fractional >= 0.5) {
				return integral + copysign(1.0, integral);
			}

			return integral;

		case PHP_ROUND_HALF_DOWN:
			if (fractional > 0.5) {
				return integral + copysign(1.0, integral);
			}

			return integral;

		case PHP_ROUND_HALF_EVEN:
			if (fractional > 0.5) {
				return integral + copysign(1.0, integral);
			}

			if (fractional == 0.5) {
				bool even = !fmod(integral, 2.0);

				if (!even) {
					return integral + copysign(1.0, integral);
				}
			}

			return integral;
		case PHP_ROUND_HALF_ODD:
			if (fractional > 0.5) {
				return integral + copysign(1.0, integral);
			}

			if (fractional == 0.5) {
				bool even = !fmod(integral, 2.0);

				if (even) {
					return integral + copysign(1.0, integral);
				}
			}

			return integral;
	}

	__builtin_unreachable();
}

static inline uint64_t rotl(const uint64_t x, int k) {
	return (x << k) | (x >> (64 - k));
}


static uint64_t s[4];

uint64_t next(void) {
	const uint64_t result = s[0] + s[3];

	const uint64_t t = s[1] << 17;

	s[2] ^= s[0];
	s[3] ^= s[1];
	s[1] ^= s[2];
	s[0] ^= s[3];

	s[2] ^= t;

	s[3] = rotl(s[3], 45);

	return result;
}


int
main() {
	s[0] = 0xbe0abf86eeacfd2d;
	s[1] = 0x5212a180ba6c1136;
	s[2] = 0xbb1f87b46572ab77;
	s[3] = 0xab0fde1b8ab187da;

	const double step_size = 1.0 / (1ULL << 53);

	double sum = 0;
	for (size_t i = 0; i < 100000000; i++) {
		sum += FUNC(step_size * (next() >> 11), PHP_ROUND_HALF_UP);
	}

	printf("%.17g\n", sum);
}

with old being the unfixed implementation that returns incorrect results for 0.49999999999999994; new_jorg being the one in #12159, which is correct for 0.49999999999999994 but might or might not be correct for other numbers (I find this hard to confirm based on the code); and new_tim being this PR.


$ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ gcc test.c -O2 -lm -DFUNC=old -o old
$ gcc test.c -O2 -lm -DFUNC=new_jorg -o new_jorg
$ gcc test.c -O2 -lm -DFUNC=new_tim -o new_tim
$ perf stat ./old
50004176

 Performance counter stats for './old':

            752.10 msec task-clock                #    0.997 CPUs utilized          
                10      context-switches          #    0.013 K/sec                  
                 1      cpu-migrations            #    0.001 K/sec                  
                82      page-faults               #    0.109 K/sec                  
     2,171,047,145      cycles                    #    2.887 GHz                      (82.99%)
       537,622,125      stalled-cycles-frontend   #   24.76% frontend cycles idle     (83.22%)
        32,217,694      stalled-cycles-backend    #    1.48% backend cycles idle      (67.04%)
     5,088,010,500      instructions              #    2.34  insn per cycle         
                                                  #    0.11  stalled cycles per insn  (83.52%)
       798,136,913      branches                  # 1061.210 M/sec                    (83.58%)
            19,982      branch-misses             #    0.00% of all branches          (83.17%)

       0.754016025 seconds time elapsed

       0.752905000 seconds user
       0.000000000 seconds sys


$ perf stat ./new_jorg
50004176

 Performance counter stats for './new_jorg':

            886.31 msec task-clock                #    0.999 CPUs utilized          
                 9      context-switches          #    0.010 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                83      page-faults               #    0.094 K/sec                  
     2,577,178,202      cycles                    #    2.908 GHz                      (83.31%)
       776,626,140      stalled-cycles-frontend   #   30.13% frontend cycles idle     (83.30%)
        60,990,337      stalled-cycles-backend    #    2.37% backend cycles idle      (66.62%)
     5,596,624,715      instructions              #    2.17  insn per cycle         
                                                  #    0.14  stalled cycles per insn  (83.30%)
       898,010,830      branches                  # 1013.199 M/sec                    (83.30%)
            21,415      branch-misses             #    0.00% of all branches          (83.47%)

       0.887352489 seconds time elapsed

       0.887204000 seconds user
       0.000000000 seconds sys


$ perf stat ./new_tim
50004176

 Performance counter stats for './new_tim':

          1,823.03 msec task-clock                #    0.920 CPUs utilized          
               321      context-switches          #    0.176 K/sec                  
                 5      cpu-migrations            #    0.003 K/sec                  
                82      page-faults               #    0.045 K/sec                  
     5,156,887,099      cycles                    #    2.829 GHz                      (83.61%)
     1,876,988,149      stalled-cycles-frontend   #   36.40% frontend cycles idle     (82.95%)
     1,176,785,107      stalled-cycles-backend    #   22.82% backend cycles idle      (66.75%)
     6,702,998,765      instructions              #    1.30  insn per cycle         
                                                  #    0.28  stalled cycles per insn  (83.44%)
     1,354,930,863      branches                  #  743.228 M/sec                    (83.19%)
        68,833,162      branch-misses             #    5.08% of all branches          (83.50%)

       1.981621115 seconds time elapsed

       1.817790000 seconds user
       0.008025000 seconds sys


$ clang --version
clang version 10.0.0-4ubuntu1 
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
$ clang test.c -O2 -lm -DFUNC=old -o old
$ clang test.c -O2 -lm -DFUNC=new_jorg -o new_jorg
$ clang test.c -O2 -lm -DFUNC=new_tim -o new_tim
$ perf stat ./old
50004176

 Performance counter stats for './old':

            566.44 msec task-clock                #    0.998 CPUs utilized          
                 6      context-switches          #    0.011 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                82      page-faults               #    0.145 K/sec                  
     1,633,264,912      cycles                    #    2.883 GHz                      (83.06%)
       739,693,452      stalled-cycles-frontend   #   45.29% frontend cycles idle     (83.06%)
       193,368,052      stalled-cycles-backend    #   11.84% backend cycles idle      (66.84%)
     2,881,930,833      instructions              #    1.76  insn per cycle         
                                                  #    0.26  stalled cycles per insn  (83.76%)
       399,686,622      branches                  #  705.613 M/sec                    (83.76%)
         2,536,755      branch-misses             #    0.63% of all branches          (83.29%)

       0.567402531 seconds time elapsed

       0.563262000 seconds user
       0.003994000 seconds sys


$ perf stat ./new_jorg
50004176

 Performance counter stats for './new_jorg':

            525.11 msec task-clock                #    0.999 CPUs utilized          
                 2      context-switches          #    0.004 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                82      page-faults               #    0.156 K/sec                  
     1,512,602,871      cycles                    #    2.881 GHz                      (83.24%)
       381,235,706      stalled-cycles-frontend   #   25.20% frontend cycles idle     (83.24%)
        71,231,523      stalled-cycles-backend    #    4.71% backend cycles idle      (66.49%)
     3,725,342,139      instructions              #    2.46  insn per cycle         
                                                  #    0.10  stalled cycles per insn  (83.24%)
       399,930,124      branches                  #  761.609 M/sec                    (83.50%)
         3,444,782      branch-misses             #    0.86% of all branches          (83.53%)

       0.525536758 seconds time elapsed

       0.525558000 seconds user
       0.000000000 seconds sys


$ perf stat ./new_tim
50004176

 Performance counter stats for './new_tim':

            738.76 msec task-clock                #    0.999 CPUs utilized          
                 2      context-switches          #    0.003 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                80      page-faults               #    0.108 K/sec                  
     2,157,405,725      cycles                    #    2.920 GHz                      (83.22%)
       531,329,337      stalled-cycles-frontend   #   24.63% frontend cycles idle     (83.22%)
        46,384,645      stalled-cycles-backend    #    2.15% backend cycles idle      (66.45%)
     5,888,314,921      instructions              #    2.73  insn per cycle         
                                                  #    0.09  stalled cycles per insn  (83.23%)
       700,593,865      branches                  #  948.340 M/sec                    (83.76%)
           648,639      branch-misses             #    0.09% of all branches          (83.36%)

       0.739445586 seconds time elapsed

       0.739488000 seconds user
       0.000000000 seconds sys

Within a Debian Sid container:

root@619b16aece80:/pwd# gcc --version
gcc (Debian 13.2.0-4) 13.2.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

root@619b16aece80:/pwd# gcc test.c -O2 -lm -DFUNC=old -o old              
root@619b16aece80:/pwd# gcc test.c -O2 -lm -DFUNC=new_jorg -o new_jorg         
root@619b16aece80:/pwd# gcc test.c -O2 -lm -DFUNC=new_tim -o new_tim 
root@619b16aece80:/pwd# perf stat ./old
50004176

 Performance counter stats for './old':

            455.64 msec task-clock                       #    0.993 CPUs utilized             
                38      context-switches                 #   83.398 /sec                      
                 0      cpu-migrations                   #    0.000 /sec                      
                68      page-faults                      #  149.239 /sec                      
        1320845630      cycles                           #    2.899 GHz                         (83.35%)
         430678611      stalled-cycles-frontend          #   32.61% frontend cycles idle        (83.39%)
          22960185      stalled-cycles-backend           #    1.74% backend cycles idle         (66.02%)
        2998435405      instructions                     #    2.27  insn per cycle            
                                                  #    0.14  stalled cycles per insn     (82.64%)
         298614002      branches                         #  655.367 M/sec                       (83.55%)
             14700      branch-misses                    #    0.00% of all branches             (83.69%)

       0.459069293 seconds time elapsed

       0.456727000 seconds user
       0.000000000 seconds sys


root@619b16aece80:/pwd# perf stat ./new_jorg
50004176

 Performance counter stats for './new_jorg':

            563.68 msec task-clock                       #    0.999 CPUs utilized             
                 5      context-switches                 #    8.870 /sec                      
                 0      cpu-migrations                   #    0.000 /sec                      
                68      page-faults                      #  120.635 /sec                      
        1630325168      cycles                           #    2.892 GHz                         (82.97%)
         582552196      stalled-cycles-frontend          #   35.73% frontend cycles idle        (82.98%)
          36504394      stalled-cycles-backend           #    2.24% backend cycles idle         (67.24%)
        3486180165      instructions                     #    2.14  insn per cycle            
                                                  #    0.17  stalled cycles per insn     (83.69%)
         398755894      branches                         #  707.412 M/sec                       (83.68%)
             14245      branch-misses                    #    0.00% of all branches             (83.13%)

       0.564458368 seconds time elapsed

       0.564450000 seconds user
       0.000000000 seconds sys


root@619b16aece80:/pwd# perf stat ./new_tim 
50004176

 Performance counter stats for './new_tim':

           1107.17 msec task-clock                       #    0.999 CPUs utilized             
                30      context-switches                 #   27.096 /sec                      
                 0      cpu-migrations                   #    0.000 /sec                      
                65      page-faults                      #   58.708 /sec                      
        3243212689      cycles                           #    2.929 GHz                         (82.68%)
         780360162      stalled-cycles-frontend          #   24.06% frontend cycles idle        (83.39%)
         596152111      stalled-cycles-backend           #   18.38% backend cycles idle         (66.78%)
        4705937442      instructions                     #    1.45  insn per cycle            
                                                  #    0.17  stalled cycles per insn     (83.38%)
         698895653      branches                         #  631.244 M/sec                       (83.69%)
          53285445      branch-misses                    #    7.62% of all branches             (83.45%)

       1.108475996 seconds time elapsed

       1.107896000 seconds user
       0.000000000 seconds sys

root@619b16aece80:/pwd# clang --version
Debian clang version 16.0.6 (15)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
root@619b16aece80:/pwd# clang test.c -O2 -lm -DFUNC=old -o old              
root@619b16aece80:/pwd# clang test.c -O2 -lm -DFUNC=new_jorg -o new_jorg
root@619b16aece80:/pwd# clang test.c -O2 -lm -DFUNC=new_tim -o new_tim
root@619b16aece80:/pwd# perf stat ./old
50004176

 Performance counter stats for './old':

            541.62 msec task-clock                       #    0.998 CPUs utilized             
                12      context-switches                 #   22.156 /sec                      
                 0      cpu-migrations                   #    0.000 /sec                      
                66      page-faults                      #  121.857 /sec                      
        1586209091      cycles                           #    2.929 GHz                         (83.07%)
         764279600      stalled-cycles-frontend          #   48.18% frontend cycles idle        (83.07%)
         125170248      stalled-cycles-backend           #    7.89% backend cycles idle         (66.94%)
        2797149771      instructions                     #    1.76  insn per cycle            
                                                  #    0.27  stalled cycles per insn     (83.76%)
         401951167      branches                         #  742.130 M/sec                       (83.75%)
           1424522      branch-misses                    #    0.35% of all branches             (83.17%)

       0.542643639 seconds time elapsed

       0.542337000 seconds user
       0.000000000 seconds sys


root@619b16aece80:/pwd# perf stat ./new_jorg
50004176

 Performance counter stats for './new_jorg':

            462.69 msec task-clock                       #    0.998 CPUs utilized             
                10      context-switches                 #   21.613 /sec                      
                 0      cpu-migrations                   #    0.000 /sec                      
                68      page-faults                      #  146.967 /sec                      
        1355162354      cycles                           #    2.929 GHz                         (82.81%)
         285941528      stalled-cycles-frontend          #   21.10% frontend cycles idle        (83.58%)
          23171761      stalled-cycles-backend           #    1.71% backend cycles idle         (67.17%)
        3597104663      instructions                     #    2.65  insn per cycle            
                                                  #    0.08  stalled cycles per insn     (83.58%)
         399722203      branches                         #  863.909 M/sec                       (83.58%)
            722537      branch-misses                    #    0.18% of all branches             (82.86%)

       0.463469216 seconds time elapsed

       0.463478000 seconds user
       0.000000000 seconds sys


root@619b16aece80:/pwd# perf stat ./new_tim 
50004176

 Performance counter stats for './new_tim':

            816.11 msec task-clock                       #    0.999 CPUs utilized             
                 7      context-switches                 #    8.577 /sec                      
                 0      cpu-migrations                   #    0.000 /sec                      
                67      page-faults                      #   82.097 /sec                      
        2389408522      cycles                           #    2.928 GHz                         (83.34%)
         762095490      stalled-cycles-frontend          #   31.89% frontend cycles idle        (83.34%)
          42274754      stalled-cycles-backend           #    1.77% backend cycles idle         (66.68%)
        5986786590      instructions                     #    2.51  insn per cycle            
                                                  #    0.13  stalled cycles per insn     (83.34%)
         598058805      branches                         #  732.816 M/sec                       (83.34%)
            405579      branch-misses                    #    0.07% of all branches             (83.30%)

       0.816982484 seconds time elapsed

       0.816957000 seconds user
       0.000000000 seconds sys

@SakiTakamachi

This comment was marked as resolved.

@TimWolla
Member Author

This was on an Intel(R) Core(TM) i5-2430M. The implementation in this PR is indeed the slowest of all of them, with the majority of the time spent in modf. I must say I'm surprised by the clang results of Jorg's implementation being faster than the current broken version.

@TimWolla
Member Author

Differences become much smaller when main() is changed to generate numbers in [-1.0, 1.0) instead of [0.0, 1.0):

int
main() {
	s[0] = 0xbe0abf86eeacfd2d;
	s[1] = 0x5212a180ba6c1136;
	s[2] = 0xbb1f87b46572ab77;
	s[3] = 0xab0fde1b8ab187da;

	const double step_size = 2.0 / (1ULL << 53);

	double sum = 0;
	for (size_t i = 0; i < 100000000; i++) {
		sum += FUNC(step_size * (next() >> 11) - 1.0, PHP_ROUND_HALF_UP);
	}

	printf("%.17g\n", sum);
}
$ gcc test.c -O2 -lm -DFUNC=old -o old
$ gcc test.c -O2 -lm -DFUNC=new_jorg -o new_jorg
$ gcc test.c -O2 -lm -DFUNC=new_tim -o new_tim
$ perf stat ./old
-2015

 Performance counter stats for './old':

          1,245.34 msec task-clock                #    0.999 CPUs utilized          
                34      context-switches          #    0.027 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                82      page-faults               #    0.066 K/sec                  
     3,642,416,216      cycles                    #    2.925 GHz                      (83.31%)
       699,375,366      stalled-cycles-frontend   #   19.20% frontend cycles idle     (83.31%)
       698,958,651      stalled-cycles-backend    #   19.19% backend cycles idle      (66.63%)
     5,198,550,125      instructions              #    1.43  insn per cycle         
                                                  #    0.13  stalled cycles per insn  (83.31%)
       799,344,996      branches                  #  641.871 M/sec                    (83.39%)
        49,988,133      branch-misses             #    6.25% of all branches          (83.36%)

       1.246728160 seconds time elapsed

       1.245879000 seconds user
       0.000000000 seconds sys


$ perf stat ./new_jorg
-2015

 Performance counter stats for './new_jorg':

          1,262.75 msec task-clock                #    0.999 CPUs utilized          
                 6      context-switches          #    0.005 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                82      page-faults               #    0.065 K/sec                  
     3,708,544,127      cycles                    #    2.937 GHz                      (83.21%)
       825,428,656      stalled-cycles-frontend   #   22.26% frontend cycles idle     (83.22%)
       640,776,981      stalled-cycles-backend    #   17.28% backend cycles idle      (66.72%)
     5,606,240,058      instructions              #    1.51  insn per cycle         
                                                  #    0.15  stalled cycles per insn  (83.51%)
       899,311,752      branches                  #  712.185 M/sec                    (83.53%)
        50,339,991      branch-misses             #    5.60% of all branches          (83.32%)

       1.263618565 seconds time elapsed

       1.263560000 seconds user
       0.000000000 seconds sys


$ perf stat ./new_tim
-2015

 Performance counter stats for './new_tim':

          1,529.17 msec task-clock                #    0.999 CPUs utilized          
                 1      context-switches          #    0.001 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                83      page-faults               #    0.054 K/sec                  
     4,464,461,977      cycles                    #    2.920 GHz                      (83.26%)
     1,161,457,124      stalled-cycles-frontend   #   26.02% frontend cycles idle     (83.26%)
       664,504,433      stalled-cycles-backend    #   14.88% backend cycles idle      (66.58%)
     6,808,168,189      instructions              #    1.52  insn per cycle         
                                                  #    0.17  stalled cycles per insn  (83.32%)
     1,351,505,702      branches                  #  883.817 M/sec                    (83.52%)
        52,372,812      branch-misses             #    3.88% of all branches          (83.38%)

       1.530165468 seconds time elapsed

       1.530177000 seconds user
       0.000000000 seconds sys


$ clang test.c -O2 -lm -DFUNC=old -o old
$ clang test.c -O2 -lm -DFUNC=new_jorg -o new_jorg
$ clang test.c -O2 -lm -DFUNC=new_tim -o new_tim
$ perf stat ./old
-2015

 Performance counter stats for './old':

            648.25 msec task-clock                #    0.999 CPUs utilized          
                 4      context-switches          #    0.006 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                82      page-faults               #    0.126 K/sec                  
     1,891,215,722      cycles                    #    2.917 GHz                      (83.34%)
       387,663,388      stalled-cycles-frontend   #   20.50% frontend cycles idle     (83.34%)
        50,827,266      stalled-cycles-backend    #    2.69% backend cycles idle      (66.69%)
     4,182,372,632      instructions              #    2.21  insn per cycle         
                                                  #    0.09  stalled cycles per insn  (83.34%)
       698,291,409      branches                  # 1077.192 M/sec                    (83.35%)
           786,488      branch-misses             #    0.11% of all branches          (83.28%)

       0.649210206 seconds time elapsed

       0.649047000 seconds user
       0.000000000 seconds sys


$ perf stat ./new_jorg
-2015

 Performance counter stats for './new_jorg':

          1,080.69 msec task-clock                #    0.999 CPUs utilized          
                 4      context-switches          #    0.004 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                84      page-faults               #    0.078 K/sec                  
     3,175,315,925      cycles                    #    2.938 GHz                      (83.34%)
       763,401,980      stalled-cycles-frontend   #   24.04% frontend cycles idle     (83.35%)
       638,205,951      stalled-cycles-backend    #   20.10% backend cycles idle      (66.69%)
     3,556,084,096      instructions              #    1.12  insn per cycle         
                                                  #    0.21  stalled cycles per insn  (83.34%)
       600,679,904      branches                  #  555.832 M/sec                    (83.35%)
        51,872,522      branch-misses             #    8.64% of all branches          (83.27%)

       1.081628502 seconds time elapsed

       1.081635000 seconds user
       0.000000000 seconds sys


$ perf stat ./new_tim
-2015

 Performance counter stats for './new_tim':

            784.68 msec task-clock                #    0.998 CPUs utilized          
                18      context-switches          #    0.023 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                83      page-faults               #    0.106 K/sec                  
     2,292,702,762      cycles                    #    2.922 GHz                      (83.19%)
       618,770,400      stalled-cycles-frontend   #   26.99% frontend cycles idle     (83.19%)
        49,778,731      stalled-cycles-backend    #    2.17% backend cycles idle      (66.43%)
     5,999,290,793      instructions              #    2.62  insn per cycle         
                                                  #    0.10  stalled cycles per insn  (83.24%)
       697,748,667      branches                  #  889.216 M/sec                    (83.70%)
           536,689      branch-misses             #    0.08% of all branches          (83.49%)

       0.786275333 seconds time elapsed

       0.781597000 seconds user
       0.004008000 seconds sys


Then all versions are competitive with gcc, and Jorg's version is much slower with clang because the branch misses skyrocket.

Member

@Girgias Girgias left a comment

For me this looks good, but I'll let @bukka comment again.

@TimWolla
Member Author

With main being:

int
main() {
	s[0] = 0xbe0abf86eeacfd2d;
	s[1] = 0x5212a180ba6c1136;
	s[2] = 0xbb1f87b46572ab77;
	s[3] = 0xab0fde1b8ab187da;

	const double step_size = 2.0 / (1ULL << 53);

	double sum = 0;
	for (size_t i = 0; i < 100000000; i++) {
		sum += FUNC(step_size * (next() >> 11) - 1.0, PHP_ROUND_HALF_EVEN);
	}

	printf("%.17g\n", sum);
}

(i.e. inputs in [-1.0, 1.0) with rounding half to even, a.k.a. “banker's rounding”) my version beats both the old version and Jorg's version:

$ gcc test.c -O2 -lm -DFUNC=old -o old
$ gcc test.c -O2 -lm -DFUNC=new_tim -o new_tim
$ gcc test.c -O2 -lm -DFUNC=new_jorg -o new_jorg
$ perf stat ./old
-2015

 Performance counter stats for './old':

          1,667.57 msec task-clock                #    0.999 CPUs utilized          
                37      context-switches          #    0.022 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                80      page-faults               #    0.048 K/sec                  
     4,855,980,455      cycles                    #    2.912 GHz                      (83.22%)
     1,865,601,989      stalled-cycles-frontend   #   38.42% frontend cycles idle     (83.23%)
       567,348,548      stalled-cycles-backend    #   11.68% backend cycles idle      (66.75%)
     7,204,940,510      instructions              #    1.48  insn per cycle         
                                                  #    0.26  stalled cycles per insn  (83.47%)
     1,001,937,636      branches                  #  600.839 M/sec                    (83.46%)
        50,116,987      branch-misses             #    5.00% of all branches          (83.35%)

       1.669524323 seconds time elapsed

       1.668623000 seconds user
       0.000000000 seconds sys


$ perf stat ./new_jorg
-2015

 Performance counter stats for './new_jorg':

          1,738.67 msec task-clock                #    0.988 CPUs utilized          
               303      context-switches          #    0.174 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                82      page-faults               #    0.047 K/sec                  
     5,084,530,195      cycles                    #    2.924 GHz                      (83.16%)
     1,972,234,079      stalled-cycles-frontend   #   38.79% frontend cycles idle     (83.38%)
       532,719,936      stalled-cycles-backend    #   10.48% backend cycles idle      (66.80%)
     7,699,914,640      instructions              #    1.51  insn per cycle         
                                                  #    0.26  stalled cycles per insn  (83.63%)
     1,151,052,259      branches                  #  662.029 M/sec                    (83.43%)
        50,208,400      branch-misses             #    4.36% of all branches          (83.24%)

       1.760317642 seconds time elapsed

       1.726577000 seconds user
       0.016023000 seconds sys


$ perf stat ./new_tim
-2015

 Performance counter stats for './new_tim':

          1,602.60 msec task-clock                #    0.999 CPUs utilized          
                10      context-switches          #    0.006 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                81      page-faults               #    0.051 K/sec                  
     4,675,258,290      cycles                    #    2.917 GHz                      (83.28%)
     1,358,777,298      stalled-cycles-frontend   #   29.06% frontend cycles idle     (83.28%)
       773,307,762      stalled-cycles-backend    #   16.54% backend cycles idle      (66.60%)
     6,712,431,080      instructions              #    1.44  insn per cycle         
                                                  #    0.20  stalled cycles per insn  (83.32%)
     1,249,192,424      branches                  #  779.477 M/sec                    (83.45%)
        53,331,830      branch-misses             #    4.27% of all branches          (83.39%)

       1.604168764 seconds time elapsed

       1.603302000 seconds user
       0.000000000 seconds sys

$ clang test.c -O2 -lm -DFUNC=old -o old
$ clang test.c -O2 -lm -DFUNC=new_jorg -o new_jorg
$ clang test.c -O2 -lm -DFUNC=new_tim -o new_tim
$ perf stat ./old
-2015

 Performance counter stats for './old':

          1,380.73 msec task-clock                #    0.999 CPUs utilized          
                 5      context-switches          #    0.004 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                84      page-faults               #    0.061 K/sec                  
     3,995,933,536      cycles                    #    2.894 GHz                      (83.20%)
     1,394,148,218      stalled-cycles-frontend   #   34.89% frontend cycles idle     (83.20%)
       862,300,694      stalled-cycles-backend    #   21.58% backend cycles idle      (66.81%)
     4,489,172,806      instructions              #    1.12  insn per cycle         
                                                  #    0.31  stalled cycles per insn  (83.49%)
       900,959,583      branches                  #  652.525 M/sec                    (83.49%)
        53,362,571      branch-misses             #    5.92% of all branches          (83.30%)

       1.381509128 seconds time elapsed

       1.381467000 seconds user
       0.000000000 seconds sys


$ perf stat ./new_jorg
-2015

 Performance counter stats for './new_jorg':

          1,592.40 msec task-clock                #    0.995 CPUs utilized          
                50      context-switches          #    0.031 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                82      page-faults               #    0.051 K/sec                  
     4,622,058,007      cycles                    #    2.903 GHz                      (83.28%)
     1,458,687,467      stalled-cycles-frontend   #   31.56% frontend cycles idle     (83.18%)
       851,065,229      stalled-cycles-backend    #   18.41% backend cycles idle      (66.83%)
     5,955,613,420      instructions              #    1.29  insn per cycle         
                                                  #    0.24  stalled cycles per insn  (83.45%)
     1,053,075,802      branches                  #  661.313 M/sec                    (83.44%)
        56,444,889      branch-misses             #    5.36% of all branches          (83.27%)

       1.600838210 seconds time elapsed

       1.593224000 seconds user
       0.000000000 seconds sys


$ perf stat ./new_tim
-2015

 Performance counter stats for './new_tim':

          1,298.35 msec task-clock                #    0.997 CPUs utilized          
                53      context-switches          #    0.041 K/sec                  
                 1      cpu-migrations            #    0.001 K/sec                  
                83      page-faults               #    0.064 K/sec                  
     3,773,061,211      cycles                    #    2.906 GHz                      (83.21%)
       989,454,803      stalled-cycles-frontend   #   26.22% frontend cycles idle     (83.39%)
       784,657,116      stalled-cycles-backend    #   20.80% backend cycles idle      (66.75%)
     5,598,290,346      instructions              #    1.48  insn per cycle         
                                                  #    0.18  stalled cycles per insn  (83.37%)
       901,733,975      branches                  #  694.524 M/sec                    (83.43%)
        54,266,355      branch-misses             #    6.02% of all branches          (83.21%)

       1.302192036 seconds time elapsed

       1.299178000 seconds user
       0.000000000 seconds sys
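
The input mapping used in `main()` above can be checked in isolation. The following is a minimal sketch, not part of the benchmark: the names `map_to_input_range` and `check_input_range` are hypothetical, and the raw 64-bit value stands in for whatever `next()` returns. It keeps the top 53 bits of the raw value, which fit a double's significand exactly, and scales them by 2^-52 before shifting by -1.0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring `step_size * (next() >> 11) - 1.0`.
 * The shift keeps the top 53 bits, so the conversion to double is exact. */
static double map_to_input_range(uint64_t raw)
{
	const double step_size = 2.0 / (1ULL << 53); /* 2^-52 */

	return step_size * (double)(raw >> 11) - 1.0;
}

/* Hypothetical self-check: the mapping covers [-1.0, 1.0). */
static void check_input_range(void)
{
	/* Smallest raw value maps to the inclusive lower bound. */
	assert(map_to_input_range(0) == -1.0);
	/* A raw value with only the top bit set lands exactly in the middle. */
	assert(map_to_input_range(1ULL << 63) == 0.0);
	/* Largest raw value stays strictly below the exclusive upper bound. */
	assert(map_to_input_range(UINT64_MAX) < 1.0);
}
```

Under these assumptions the benchmark feeds both negative and positive values to the rounding function, so both signs of the tie-breaking logic are exercised.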

@TimWolla

My conclusion from these tests is that the performance heavily depends on the compiler used, the input distribution, and the rounding mode (with my version benefiting from the switch() for the alternative rounding modes).

I'd argue that the modf-based version is the most readable one and thus the one that is easiest to verify to be correct. Together with the performance results that are unclear at best, I'd say it's the winner here.

@Girgias

Girgias commented Sep 17, 2023

For reference, LLVM seems to have recently merged a faster version of fmod which may explain the compiler differences: https://reviews.llvm.org/D127046?id=436463

@SakiTakamachi

@TimWolla's implementation appears to be numerically stable.

As for the cases where it is slower, isn't that within an acceptable range given the gain in accuracy?

@TimWolla

> For reference, LLVM seems to have recently merged a faster version of fmod which may explain the compiler differences: https://reviews.llvm.org/D127046?id=436463

@Girgias fmod (calculate the remainder for floating points, used for even/odd only) != modf (split into integral and fractional part) 😄

Also, the clang version I tested with for the non-[0.0, 1.0) distributions is clang 10 from Ubuntu 20.04, so it's a little dated anyway.

@bukka bukka left a comment

As discussed privately, I agree that it's better to have more consistent results. The code is certainly more readable now. Also, thanks for adding really good explanatory comments.

@TimWolla TimWolla merged commit 9652889 into php:master Sep 19, 2023
@TimWolla TimWolla deleted the round-modf branch September 19, 2023 16:05
@TimWolla

Now merged, thank you.

TimWolla added a commit to TimWolla/php-src that referenced this pull request Sep 23, 2023
Since phpGH-12220 the implementation of `php_round_helper()`, which performs
rounding to an integral value, is easy to verify for correctness up to the
floating point precision.

If rounding to 0 places is desired, i.e. the userland `round()` function is
called with `$precision = 0`, we bypass all logic for the decimal point
adjustment and instead directly call `php_round_helper()`.

This change fixes the remaining two cases of phpGH-12143 and likely guarantees
correct rounding for all possible inputs and `$precision = 0`.