Rewrite parts of stdlib where optval could be used #524
Does anyone have a good understanding of the likely performance implications of using a function call, rather than the 2-line idiom
x = default
if(present(x_in)) x = x_in
While we would hope there was no difference, I would not be shocked if there was (just because function or subroutine calls can have overhead). So I would be cautious about these changes. Apologies for raising this question without any evidence whatsoever. But I wonder if anyone has experience with the issue? |
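To make the comparison concrete, here is a minimal sketch of the two styles side by side. The helper `optval_logical_sketch` is a hypothetical stand-in written for this illustration, not the stdlib source (stdlib exposes a generic `optval` through `stdlib_optval`); the module, function, and argument names are likewise assumptions.

```fortran
module optval_sketch_mod
    implicit none
    private
    public :: optval_logical_sketch, scale_with_function, scale_with_present
contains

    ! Hypothetical stand-in for an optval-style helper (not the stdlib code):
    ! returns x when it is present, otherwise the supplied default.
    pure function optval_logical_sketch(x, default) result(y)
        logical, intent(in), optional :: x
        logical, intent(in) :: default
        logical :: y
        if (present(x)) then
            y = x
        else
            y = default
        end if
    end function optval_logical_sketch

    ! Style 1: resolve the optional argument through a function call.
    ! An absent optional dummy may be passed on to another optional dummy.
    pure function scale_with_function(v, double_it) result(r)
        real, intent(in) :: v
        logical, intent(in), optional :: double_it
        real :: r
        logical :: do_double
        do_double = optval_logical_sketch(double_it, .true.)
        r = merge(2.0*v, v, do_double)
    end function scale_with_function

    ! Style 2: the two-line idiom with an explicit default and present().
    pure function scale_with_present(v, double_it) result(r)
        real, intent(in) :: v
        logical, intent(in), optional :: double_it
        real :: r
        logical :: do_double
        do_double = .true.  ! default
        if (present(double_it)) do_double = double_it
        r = merge(2.0*v, v, do_double)
    end function scale_with_present

end module optval_sketch_mod

program demo_optval_sketch
    use optval_sketch_mod, only: scale_with_function, scale_with_present
    implicit none
    ! Both styles should agree with and without the optional argument.
    print *, scale_with_function(3.0), scale_with_present(3.0)
    print *, scale_with_function(3.0, .false.), scale_with_present(3.0, .false.)
end program demo_optval_sketch
```

Both styles compute the same result; the open question in this thread is only whether the extra call in style 1 costs anything once the compiler has optimized the code.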
I thought there was always a small performance hit with things like present() because it's checked at runtime, not compile time? |
I recall seeing Steve Lionel comment at one stage that But this is all speculation. However, putting this style of coding all throughout stdlib will (IMO) be seen as encouraging this style of programming. So it would be nice to be confident that it doesn't lead to performance problems. Or, if there are known cases where it's an issue, then we could just warn about them in the Again, sorry I don't have the answers here. |
Ok so I've made a test here that might help with the discussion. It suggests there can be performance issues (factor of 2 difference), assuming there's nothing too wrong with how I've written this example. (Disclaimer: written quickly, so there could be problems!) I've made a simple subroutine that either squares or cubes a number, depending on the value of an optional argument, using both optval and if(present(...)):
!
! square_or_cube_mod.f90
!
module square_or_cube_mod
    !
    ! This module contains two versions of a subroutine that square or cube a number
    ! One version uses optval, one uses if(present(...))
    !
    use stdlib_optval, only: optval
    implicit none
contains
    subroutine square_or_cube_with_optval(x, use_square)
        real, intent(inout) :: x
        logical, optional, intent(in) :: use_square
        logical :: local_use_square
        local_use_square = optval(use_square, .true.)
        if(local_use_square) then
            x = x*x
        else
            x = x*x*x
        end if
    end subroutine
    subroutine square_or_cube_with_present(x, use_square)
        real, intent(inout) :: x
        logical, optional, intent(in) :: use_square
        logical :: local_use_square
        local_use_square = .true. ! default
        if(present(use_square)) local_use_square = use_square
        if(local_use_square) then
            x = x*x
        else
            x = x*x*x
        end if
    end subroutine
end module
Here is the program that times each version:
!
! run_test.f90
!
program speed_test_optval_vs_present
    use square_or_cube_mod, only: square_or_cube_with_present, square_or_cube_with_optval
    implicit none
    integer, parameter :: N = 1e+07, reps = 10
    real, allocatable :: x(:), x_orig(:)
    integer :: i, j, t0, t1, t_present, t_optval
    allocate(x(N), x_orig(N))
    call random_number(x_orig)
    t_present = 0
    t_optval = 0
    do j = 1, reps
        ! Case with present
        x = x_orig
        call system_clock(t0)
        do i = 1, N
            call square_or_cube_with_present(x(i), .true.)
        end do
        call system_clock(t1)
        t_present = t_present + (t1 - t0)
        ! Case with optval
        x = x_orig
        call system_clock(t0)
        do i = 1, N
            call square_or_cube_with_optval(x(i), .true.)
        end do
        call system_clock(t1)
        t_optval = t_optval + (t1 - t0)
    end do
    print*, 'Time ratio (present / optval) :', t_present * 1.0 / t_optval
end program
Here is how I compiled it
When running this I get a factor-of-2 difference in favour of the version with
Let me stress that I'm not trying to say people should never use |
Could you repeat the test with the additional flag |
Done -- there was no change. [Edit -- this was a mistake, see thread below -- actually the performance advantage of of Here is the modified build script:
Compiled and ran with
|
Are you sure the calculation is doing anything? The compiler might skip the whole loop, because you are not using |
Good point. Also, while checking this, I realised there was an error in my use of the When I do it correctly the advantages of I get the same relative performance when adding some code to try to prevent optimizations, as you suggested. For simplicity I only show that version here: The new code is:
program speed_test_optval_vs_present
    use square_or_cube_mod, only: square_or_cube_with_present, square_or_cube_with_optval
    implicit none
    integer, parameter :: N = 1e+07, reps = 10
    real, allocatable :: x(:), x_orig(:)
    real :: random_val
    integer :: i, j, t0, t1, t_present, t_optval, random_index
    logical :: all_passed = .true.
    allocate(x(N), x_orig(N))
    call random_number(x_orig)
    t_present = 0
    t_optval = 0
    do j = 1, reps
        ! Make a random index
        call random_number(random_val)
        random_index = max(1, min(int(random_val * N), N))
        ! Case with present
        x = x_orig
        call system_clock(t0)
        do i = 1, N
            call square_or_cube_with_present(x(i), .true.)
        end do
        call system_clock(t1)
        t_present = t_present + (t1 - t0)
        ! Use the value of x in some way.
        ! Compare this with the subsequent call, and ensure they are the same.
        random_val = x(random_index)
        ! Case with optval
        x = x_orig
        call system_clock(t0)
        do i = 1, N
            call square_or_cube_with_optval(x(i), .true.)
        end do
        call system_clock(t1)
        t_optval = t_optval + (t1 - t0)
        ! Comparison with previous (to try to inhibit compiler optimization)
        if(x(random_index) /= random_val) all_passed = .false.
    end do
    print*, 'Time ratio (present / optval) :', t_present * 1.0 / t_optval
    print*, merge('PASS', 'FAIL', all_passed)
end program
And here is the corrected build script
Compiled and ran with
|
Using the first code provided by @gareth-nx and adding a When using the new code provided by @gareth-nx, the ratio is almost always 1.0000 when using the options Removing '-flto -fPIC' from the compilation of @gareth-nx could you check if you compile stdlib with the options From my understanding, |
Unless the dummy argument has the |
@jvdp1 I am quite ignorant of CMake and so on, but tried to change the compilation of I'm not sure if that's the right thing to do. Probably not -- now I'm getting even worse performance for
|
Some additions to the tests:
gfortran -O3 -flto -fPIC -fopt-info -I ~/stdlib/build/src/mod_files/ modtest.f90 newtest.f90 ~/stdlib/build/src/libfortran_stdlib.a
And the ratio:
and the ratio:
With these outputs, it is clear that using |
I am also quite ignorant of CMake. @awvwgk could probably help for a better use of CMake |
@jvdp1 @awvwgk @Carltoffel So I followed the approach of @jvdp1 and hacked the CMakelists.txt file in stdlib to remove a number of compilation options and enforce others:
Now, I am getting relative speeds very close to 1, as reported by @jvdp1 . So it seems there are a bunch of interacting issues here:
This might be tricky to deal with in the general case? For instance I recall having some codes that I could not compile with "lto" (I think they used pre-compiled netcdf or similar). |
@awvwgk Are the options mentioned by @gareth-nx always used in the Release version of CMake? Or are there additional options I didn't find? |
I did a test with the
Time measurement same as above, but printing the plain number. 1e7 iterations and 100 measurements per routine
Edit: testing |
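A measurement that prints plain times rather than a ratio could be sketched as below. This is only an illustration under assumptions: it queries `count_rate` with 64-bit `system_clock` arguments to convert ticks to seconds, and the program name and loop body are placeholders, not the code used for the numbers reported in this thread.

```fortran
program timing_sketch
    use, intrinsic :: iso_fortran_env, only: int64, real64
    implicit none
    integer, parameter :: n = 10**7
    integer(int64) :: t0, t1, rate
    real, allocatable :: x(:)
    real(real64) :: elapsed
    integer :: i

    allocate(x(n))
    call random_number(x)

    ! Query the clock resolution once (ticks per second).
    call system_clock(count_rate=rate)

    call system_clock(t0)
    do i = 1, n
        x(i) = x(i)*x(i)   ! placeholder workload
    end do
    call system_clock(t1)

    ! Convert the tick difference to seconds.
    elapsed = real(t1 - t0, real64) / real(rate, real64)

    print *, 'elapsed seconds:', elapsed
    ! Printing a value derived from x gives the compiler a reason to keep the loop.
    print *, 'checksum:', sum(x)
end program timing_sketch
```

Using 64-bit counters typically gives a finer clock resolution than default integers, which matters when individual runs are short.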
As long as |
There is no need to modify the CMake build files. I think the most straight-forward way to test this (with CMake) is using stdlib-cmake-example to build stdlib together with your test code. You can just overwrite the stubs in the
Note that the important step here is telling CMake that you have custom Fortran compile arguments by setting the |
We actually build a no-config version by default, neither release nor debug build is used yet. Except for setting I personally find this setup suboptimal but I haven't heard any complaints from others about this status, therefore I didn't send a patch for this yet. But let's continue this discussion in a separate issue. |
@Carltoffel -- thanks for those results. If it's not too hard, could you please show how they change if instead of I understand quite a few codes won't use |
Here you are.
Edit: while looking at these numbers one could argue that |
Is that correct? I got the idea that link-time optimization would require these flags to be passed at the linking stage (which I presume means, when the user is compiling their executable). Or is that not true? |
Great -- so there is really no sensitivity to optimization flags, so long as we at least have |
Very interesting. Although it seems that you have already found a (probably) better solution, I'd also like to throw in the option of using the Compiler Explorer for snippets of code (if you want to check the assembly code generated by different compilers with different options). |
Thank you for all this analysis, it's very useful. I understand that comparisons here intentionally minimize the workload in the body of the subroutine, i.e. the square or cube of a scalar. If we do choose to use |
I totally agree with that. E.g., I have no problem with using |
One idea I haven't seen in this discussion, which may violate everyone's notion of clean Fortran code, is to #include the optval function instead of placing it in a separate compilation unit. In a separate unit, LTO is necessary for the compiler to inline it. But if it gets #included as a private module function, the compiler is free to inline it as it sees fit. Another idea is for the compiler that creates the optval module to include the implementation of this "small function" in the module binary file, and not just the interface definition. This behavior would act as a "precompiled header" and still allow the desired inlining. But that requires modifying a compiler, of course. |
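The net effect of the #include idea can be sketched by giving the calling module its own private copy of the helper, so that caller and helper live in the same compilation unit. Everything here is hypothetical: `local_optval` and the module name are made up for illustration, and in the #include variant the helper body would come from an included file rather than being written out in place. Whether the compiler actually inlines the call still depends on the compiler and its flags.

```fortran
module square_or_cube_local_mod
    implicit none
    private
    public :: square_or_cube_local
contains

    ! Hypothetical private helper living in the same compilation unit as its caller.
    ! With the #include approach, this body would be pulled in from a shared file.
    pure function local_optval(x, default) result(y)
        logical, intent(in), optional :: x
        logical, intent(in) :: default
        logical :: y
        if (present(x)) then
            y = x
        else
            y = default
        end if
    end function local_optval

    ! Squares x by default, cubes it when use_square is .false.
    subroutine square_or_cube_local(x, use_square)
        real, intent(inout) :: x
        logical, optional, intent(in) :: use_square
        if (local_optval(use_square, .true.)) then
            x = x*x
        else
            x = x*x*x
        end if
    end subroutine square_or_cube_local

end module square_or_cube_local_mod

program demo_local_optval
    use square_or_cube_local_mod, only: square_or_cube_local
    implicit none
    real :: a, b
    a = 2.0
    b = 2.0
    call square_or_cube_local(a)            ! default path: square
    call square_or_cube_local(b, .false.)   ! explicit path: cube
    print *, a, b
end program demo_local_optval
```

The trade-off is duplication: every module that wants this behavior carries its own copy of the helper, which is what the #include file would manage.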
I like But for this particular setting, for cases where performance is an issue, it's likely simpler to just use |
While exploring the stdlib I discovered the optval function and, shortly after, a routine which could have made use of it but doesn't. Should we search for such optional arguments and replace the if-else blocks with the optval function?
If there are no other plans which would interfere with this, I think this should be done because it shortens the code, which makes it easier to understand.
If I find time to build the stdlib I could do the job and open a PR. Please tell me if I have to keep something in mind, otherwise I will just try it.