Skip to content

Rewrite parts of stdlib where optval could be used #524

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Carltoffel opened this issue Sep 17, 2021 · 28 comments · Fixed by #528
Closed

Rewrite parts of stdlib where optval could be used #524

Carltoffel opened this issue Sep 17, 2021 · 28 comments · Fixed by #528
Labels
refactoring Internal change for better maintainablility

Comments

@Carltoffel
Copy link
Member

While exploring the stdlib I discovered the optval function and shortly after a routine which could have made use of it but it doesn't. Should we search for such optional arguments and replace the if-else-blocks with the optval-function?
If there are no other plans which would interfere with this, I think this should be done because it shortens the code which makes the code easier to understand.

If I find time to build the stdlib I could do the job and open a PR. Please tell me if I have to keep something in mind, otherwise I will just try it.

@gareth-nx
Copy link
Contributor

Does anyone have a good understanding of the likely performance implications of using a function-call, rather than a 2-line

x = default
if(present(x_in)) x = x_in

While we would hope there was no difference, I would not be shocked if there was (just because function or subroutine calls can have overhead). So I would be cautious about these changes.

Apologies to raise this question without any evidence whatsoever. But I wonder if anyone has experience on the issue?

@leonfoks
Copy link

I thought there was always a small performance hit with things like present() because it's checked at runtime, not compile time?

@gareth-nx
Copy link
Contributor

I recall seeing Steve Lionel comment at one stage that present was cheap -- whereas I could imagine that optval is more of an unknown for the compiler.

But this is all speculation.

However, putting this style of coding all throughout stdlib will (IMO) be seen as encouraging this style of programming. So it would be nice to be confident that it doesn't lead to performance problems. Or, if there are known cases where it's an issue, then we could just warn about them in the optval documentation (but perhaps not use it everywhere in stdlib).

Again, sorry I don't have the answers here.

@gareth-nx
Copy link
Contributor

Ok so I've made a test here, that might help with the discussion.

It suggests there can be performance issues (factor of 2 difference), assuming there's nothing too wrong with how I've written this example. (Disclaimer: Written quickly -- there could be problems !!).

I've made a simple subroutine that either squares or cubes a number, depending on the value of an optional argument, using both optval, and present.

!
! square_or_cube_mod.f90
!
module square_or_cube_mod
    !
    ! This module contains two versions of a subroutine that square or cube a number
    ! One version uses optval, one uses if(present(...))
    !
    use stdlib_optval, only: optval
    implicit none

contains

    subroutine square_or_cube_with_optval(x, use_square)
        real, intent(inout) :: x
        logical, optional, intent(in) :: use_square
        logical :: local_use_square

        local_use_square = optval(use_square, .true.)

        if(local_use_square) then
            x = x*x
        else
            x = x*x*x
        end if

    end subroutine

    subroutine square_or_cube_with_present(x, use_square)
        real, intent(inout) :: x
        logical, optional, intent(in) :: use_square
        logical :: local_use_square

        local_use_square = .true. ! default
        if(present(use_square)) local_use_square = use_square

        if(local_use_square) then
            x = x*x
        else
            x = x*x*x
        end if

    end subroutine

end module

Here is the program that times each version

!
! run_test.f90
!
program speed_test_optval_vs_present
    use square_or_cube_mod, only: square_or_cube_with_present, square_or_cube_with_optval
    implicit none
    
    integer, parameter :: N = 1e+07, reps = 10
    real, allocatable :: x(:), x_orig(:)
    integer :: i, j, t0, t1, t_present, t_optval

    allocate(x(N), x_orig(N))

    call random_number(x_orig)

    t_present = 0
    t_optval = 0

    do j = 1, reps

        ! Case with present
        x = x_orig
        call system_clock(t0)
        do i = 1, N
            call square_or_cube_with_present(x(i), .true.)
        end do
        call system_clock(t1)
        t_present = t_present + (t1 - t0)

        ! Case with optval
        x = x_orig
        call system_clock(t0)
        do i = 1, N
            call square_or_cube_with_optval(x(i), .true.)
        end do
        call system_clock(t1)
        t_optval = t_optval + (t1 - t0)

    end do

    print*, 'Time ratio (present / optval) :',  t_present * 1.0 / t_optval

end program

Here is how I compiled it

# I define stdlib variables STDLIB_CFLAGS and STDLIB_LIBS
# source ~/setup_fortran_stdlib.sh
    
# Compile
gfortran -O3 -cpp -c $STDLIB_CFLAGS square_or_cube_mod.f90
gfortran -O3 -c $STDLIB_CFLAGS run_test.f90
# Link
gfortran -O3 -o run_test square_or_cube_mod.o run_test.o $STDLIB_LIBS

When running this I get a factor-of-2 difference in favour of the version with present

./run_test 
 Time ratio (present / optval) :  0.467796624    

Let me stress that I'm not trying to say people should never use optval, even if the above results hold. Obviously in this context the subroutine is not doing lots of work, so there's more opportunity for issues. But such things do happen, and IMO are worth considering before promoting overly uniform usage of one or other programming style.

@Carltoffel
Copy link
Member Author

Could you repeat the test with the additional flag -flto, please?

@gareth-nx
Copy link
Contributor

gareth-nx commented Sep 18, 2021

Done -- there was no change. [Edit -- this was a mistake, see thread below -- actually the performance advantage of of present was increased]

Here is the modified build script:

# I define stdlib variables STDLIB_CFLAGS and STDLIB_LIBS
# source ~/setup_fortran_stdlib.sh

# Compile
gfortran -O3 -cpp -c -flto $STDLIB_CFLAGS square_or_cube_mod.f90
gfortran -O3 -c -flto $STDLIB_CFLAGS run_test.f90
# Link
gfortran -O3 -o -flto run_test square_or_cube_mod.o run_test.o $STDLIB_LIBS 

Compiled and ran with

$ source build.sh 
$ ./run_test 
 Time ratio (present / optval) :  0.458193988    

@Carltoffel
Copy link
Member Author

Are you sure the calculation is doing anything? The compiler might skip the whole loop, because you are not using x in any way. You could print (parts of) x to verify both routines are doing the same.

@gareth-nx
Copy link
Contributor

Good point. Also, while checking this, I realised there was an error in my use of the -flto option.

When I do it correctly the advantages of present are even greater (about 4 or 5 times faster). I guess this is because lto cannot see inside the pre-compiled stdlib library (?).

I get the same relative performance when adding some code to try to prevent optimizations, as you suggested. For simplicity I only show that version here:

The new code is:

program speed_test_optval_vs_present
    use square_or_cube_mod, only: square_or_cube_with_present, square_or_cube_with_optval
    implicit none

    integer, parameter :: N = 1e+07, reps = 10
    real, allocatable :: x(:), x_orig(:)
    real :: random_val
    integer :: i, j, t0, t1, t_present, t_optval, random_index
    logical :: all_passed = .true.

    allocate(x(N), x_orig(N))

    call random_number(x_orig)

    t_present = 0
    t_optval = 0

    do j = 1, reps

        ! Make a random index
        call random_number(random_val)
        random_index = max(1, min(int(random_val * N), N))

        ! Case with present
        x = x_orig
        call system_clock(t0)
        do i = 1, N
            call square_or_cube_with_present(x(i), .true.)
        end do
        call system_clock(t1)
        t_present = t_present + (t1 - t0)

        ! Use the value of x in some way.
        ! Compare this with the subsequent call, and ensure they are the same.
        random_val = x(random_index)

        ! Case with optval
        x = x_orig
        call system_clock(t0)
        do i = 1, N
            call square_or_cube_with_optval(x(i), .true.)
        end do
        call system_clock(t1)
        t_optval = t_optval + (t1 - t0)

        ! Comparison with previous (to try to inhibit compiler optimization)
        if(x(random_index) /= random_val) all_passed = .false.

    end do

    print*, 'Time ratio (present / optval) :',  t_present * 1.0 / t_optval
    print*, merge('PASS', 'FAIL', all_passed)

end program

And here is the corrected build script

# I define stdlib variables STDLIB_CFLAGS and STDLIB_LIBS
# source ~/setup_fortran_stdlib.sh

# Compile
gfortran -O3 -cpp -c -flto $STDLIB_CFLAGS square_or_cube_mod.f90
gfortran -O3 -c -flto $STDLIB_CFLAGS run_test.f90
# Link
gfortran -O3 -flto -o run_test square_or_cube_mod.o run_test.o $STDLIB_LIBS 

Compiled and ran with

$ source build.sh 
$ ./run_test 
 Time ratio (present / optval) :  0.209183678    
 PASS

@jvdp1
Copy link
Member

jvdp1 commented Sep 18, 2021

Are you sure the calculation is doing anything? The compiler might skip the whole loop, because you are not using x in any way. You could print (parts of) x to verify both routines are doing the same.

Using the first code provided by @gareth-nx and adding a print*,x(i) after each loop, I got a ratio around 1 when compiling with -flto -fPIC BOTH stdlib AND the test code.

When using the new code provided by @gareth-nx, the ratio is almost always 1.0000 when using the options flto -fPIC for both stdlib and @gareth-nx code.

Removing '-flto -fPIC' from the compilation of stdlib resulted in ratios similar to the ones obtained by @gareth-nx

@gareth-nx could you check if you compile stdlib with the options -lto and/or -fPIC too?

From my understanding, present() just checks if the address of the arg is "0" or not.

@awvwgk
Copy link
Member

awvwgk commented Sep 18, 2021

From my understanding, present() just checks if the address of the arg is "0" or not.

Unless the dummy argument has the value attribute, in this case it is most likely an additional descriptor passed which is checked.

@gareth-nx
Copy link
Contributor

@jvdp1 I am quite ignorant of CMake and so on, but tried to change the compilation of stdlib by adding the line add_compile_options(-flto -fPIC) to CMakeLists.txt in stdlib, near where a bunch of similar options are defined.

I'm not sure if that's the right thing to do. Probably not -- now I'm getting even worse performance for optval.

$ ./run_test 
Time ratio (present / optval) :  0.152777776    
PASS

@jvdp1
Copy link
Member

jvdp1 commented Sep 18, 2021

Some additions to the tests:

  • stdlib is compiled with the options -O3 or -O3 -lto -fPIC.

  • Compilation of the test code as follows (always):

gfortran -O3 -flto -fPIC  -fopt-info -I ~/stdlib/build/src/mod_files/ modtest.f90 newtest.f90 ~/stdlib/build/src/libfortran_stdlib.a 
  • Output of gfortran when stdlib is compiled without -flto -fPIC (so only -O3):
modtest.f90:29:42: optimized: Semantic equality hit:__square_or_cube_mod_MOD_square_or_cube_with_present.part.0/3->__square_or_cube_mod_MOD_square_or_cube_with_optval.part.0/4
modtest.f90:29:42: optimized: Assembler symbol names:__square_or_cube_mod_MOD_square_or_cube_with_present.part.0/3->__square_or_cube_mod_MOD_square_or_cube_with_optval.part.0/4
optimized:  Inlined __square_or_cube_mod_MOD_square_or_cube_with_present.part.0.constprop/44 into square_or_cube_with_present.constprop/42 which now has time 7.100000 and size 12, net change
 of +0.
optimized:  Inlined __square_or_cube_mod_MOD_square_or_cube_with_present.part.0.constprop/43 into square_or_cube_with_optval.constprop/41 which now has time 20.000000 and size 15, net change
 of -7.
optimized:  Inlined square_or_cube_with_present.constprop/42 into speed_test_optval_vs_present/8 which now has time 25375.085144 and size 222, net change of -7 (cross module).
optimized:  Inlined square_or_cube_with_optval.constprop/41 into speed_test_optval_vs_present/8 which now has time 30816.329254 and size 230, net change of -7 (cross module).
newtest.f90:38:18: optimized: loop vectorized using 16 byte vectors
newtest.f90:38:18: optimized:  loop versioned for vectorization because of possible aliasing
newtest.f90:27:14: optimized: loop vectorized using 16 byte vectors
newtest.f90:25:18: optimized: loop vectorized using 16 byte vectors
newtest.f90:25:18: optimized:  loop versioned for vectorization because of possible aliasing
newtest.f90:38:18: optimized: loop with 2 iterations completely unrolled (header execution count 4541736)
newtest.f90:25:18: optimized: loop with 2 iterations completely unrolled (header execution count 4541735)
newtest.f90:6:40: optimized: basic block part vectorized using 16 byte vectors
newtest.f90:11:29: optimized: basic block part vectorized using 16 byte vectors
newtest.f90:11:29: optimized: basic block part vectorized using 16 byte vectors
newtest.f90:51:74: optimized: basic block part vectorized using 16 byte vectors
newtest.f90:52:45: optimized: basic block part vectorized using 16 byte vectors

And the ratio:

 Time ratio (present / optval) :  0.333333343    
 PASS
  • Output of gfortran when stdlib is compiled with -O3 -flto -fPIC
modtest.f90:29:42: optimized: Semantic equality hit:__square_or_cube_mod_MOD_square_or_cube_with_present.part.0/3->__square_or_cube_mod_MOD_square_or_cube_with_optval.part.0/4
modtest.f90:29:42: optimized: Assembler symbol names:__square_or_cube_mod_MOD_square_or_cube_with_present.part.0/3->__square_or_cube_mod_MOD_square_or_cube_with_optval.part.0/4
optimized:  Inlined __square_or_cube_mod_MOD_square_or_cube_with_present.part.0.constprop/63 into square_or_cube_with_present.constprop/61 which now has time 7.100000 and size 12, net change
 of +0.
optimized:  Inlined __square_or_cube_mod_MOD_square_or_cube_with_present.part.0.constprop/62 into square_or_cube_with_optval.constprop/59 which now has time 20.000000 and size 15, net change
 of -7.
optimized:  Inlined optval_ll1.constprop/60 into square_or_cube_with_optval.constprop/59 which now has time 7.350000 and size 12, net change of -8 (cross module).
optimized:  Inlined square_or_cube_with_present.constprop/61 into speed_test_optval_vs_present/8 which now has time 25375.085144 and size 222, net change of -7 (cross module).
optimized:  Inlined square_or_cube_with_optval.constprop/59 into speed_test_optval_vs_present/8 which now has time 17955.206909 and size 227, net change of -7 (cross module).
newtest.f90:40:14: optimized: loop vectorized using 16 byte vectors
newtest.f90:38:18: optimized: loop vectorized using 16 byte vectors
newtest.f90:38:18: optimized:  loop versioned for vectorization because of possible aliasing
newtest.f90:27:14: optimized: loop vectorized using 16 byte vectors
newtest.f90:25:18: optimized: loop vectorized using 16 byte vectors
newtest.f90:25:18: optimized:  loop versioned for vectorization because of possible aliasing
newtest.f90:38:18: optimized: loop with 2 iterations completely unrolled (header execution count 4541736)
newtest.f90:25:18: optimized: loop with 2 iterations completely unrolled (header execution count 4541735)
newtest.f90:6:40: optimized: basic block part vectorized using 16 byte vectors
newtest.f90:11:29: optimized: basic block part vectorized using 16 byte vectors
newtest.f90:11:29: optimized: basic block part vectorized using 16 byte vectors
newtest.f90:51:74: optimized: basic block part vectorized using 16 byte vectors
newtest.f90:52:45: optimized: basic block part vectorized using 16 byte vectors

and the ratio:

Time ratio (present / optval) :  0.980000019    
 PASS

With these outputs, it is clear that using -flto -fPIC when compiling stdlib allows the inlining of optval() in the test code, which result in an expected ratio of ~1.

@jvdp1
Copy link
Member

jvdp1 commented Sep 18, 2021

@jvdp1 I am quite ignorant of CMake and so on, but tried to change the compilation of stdlib by adding the line add_compile_options(-flto -fPIC) to CMakeLists.txt in stdlib, near where a bunch of similar options are defined.

I am also quite ignorant of CMake. @awvwgk could probably help for a better use of CMake
I did similary as you, except that I commented all other (debug) options, because that may cancel the benefit of -flto -fPIC

@gareth-nx
Copy link
Contributor

@jvdp1 @awvwgk @Carltoffel So I followed the approach of @jvdp1 and hacked the CMakelists.txt file in stdlib to remove a number of compilation options and enforce others:

  #add_compile_options(-fimplicit-none)
  #add_compile_options(-ffree-line-length-132)
  #add_compile_options(-Wall)
  #add_compile_options(-Wextra)
  #add_compile_options(-Wimplicit-procedure)
  #add_compile_options(-Wconversion-extra)
  add_compile_options(-O3 -flto -fPIC)

Now, I am getting relative speeds very close to 1, as reported by @jvdp1 .

So it seems there are a bunch of interacting issues here:

  • How stdlib is built
  • Compiler options used by codes that use stdlib

This might be tricky to deal with in the general case? For instance I recall having some codes that I could not compile with "lto" (I think they used pre-compiled netcdf or similar).

@jvdp1
Copy link
Member

jvdp1 commented Sep 18, 2021

@awvwgk Are these options mentioned by @gareth-nx always use in the Release version of CMake? Or are there additional otpions I didn't find?

@Carltoffel
Copy link
Member Author

Carltoffel commented Sep 18, 2021

I did a test with the fpm version of stdlib. My test calculates x = x * x - 1 with four subroutines, but only if the flag is .true.:

  • noif uses no if, it is just the plain formular
  • nonopt has an non-optional flag.
  • present uses if(present(flag)) then if(flag)...
  • optval uses if(optval(flag, .true.))

Time measurement same as above, but printing the plain number. 1e7 iterations and 100 measurements per routine
The result:

        | no flags |  -Ofast  | -Ofast -flto
--------------------------------------------
noif    | 33016252 | 18468768 |   18796420
nonopt  | 32831456 | 18578640 |   18646635
present | 33101287 | 18620096 |   18679841 
optval  | 42808225 | 30449917 |   18674615

Edit: testing fpm project: https://github.com/Carltoffel/optval-testing

@Carltoffel
Copy link
Member Author

This might be tricky to deal with in the general case? For instance I recall having some codes that I could not compile with "lto" (I think they used pre-compiled netcdf or similar).

As long as stdlib can be compiled using these flags, I don't see any issues, because we are talking about the performance of optval inside stdlib itself.

@awvwgk
Copy link
Member

awvwgk commented Sep 18, 2021

There is no need to modify the CMake build files. I think the most straight-forward way to test this (with CMake) is using stdlib-cmake-example to build stdlib together with your test code.

You can just overwrite the stubs in the app and src directories with your test code and setup a build with

❯ export FFLAGS="-O3 -flto -fPIC"
❯ cmake -B _build -G Ninja
❯ cmake --build _build
❯ ./_build/app/stdlib-example

Note that the important step here is telling CMake that you have custom Fortran compile arguments by setting the FFLAGS environment variable. Those compile arguments will be applied globally for all Fortran source code in this CMake build (including stdlib).

@awvwgk
Copy link
Member

awvwgk commented Sep 18, 2021

Are these options mentioned by @gareth-nx always use in the Release version of CMake? Or are there additional otpions I didn't find?

We actually build a no-config version by default, neither release nor debug build is used yet. Except for setting -DCMAKE_BUILD_TYPE to Release, DebWithRelInfo or Debug or globally overwriting the compile arguments with the FFLAGS environment variable there is currently no customization possible. Full debug warnings are also always enabled.

I personally find this setup suboptimal but I haven't heard any complaints from others about this status, therefore I didn't send a patch for this yet.

But let's continue this discussion in a separate issue.

@gareth-nx
Copy link
Contributor

@Carltoffel -- thanks for those results.

If it's not too hard, could you please show how they change if instead of Ofast one uses O2 and O3?

I understand quite a few codes won't use -Ofast, but suspect that most codes can use one of the others.

@Carltoffel
Copy link
Member Author

Carltoffel commented Sep 18, 2021

If it's not too hard, could you please show how they change if instead of Ofast one uses O2 and O3?

Here you are.

         |  -flto   | -O3 -flto | -O2 -flto | -O1 -flto |
---------------------------------------------------------
 noif    | 33172161 |  18792568 |  18726294 |  18845316 |
 nonopt  | 32707464 |  18722671 |  18772107 |  18782929 |
 present | 32956918 |  18800152 |  18888857 |  18689901 |
 optval  | 42851931 |  18753826 |  18717697 |  18714853 |

Edit: while looking at these numbers one could argue that optval can be faster than not only present but also faster than no if at all. Negligibly, but still interesting...

@gareth-nx
Copy link
Contributor

As long as stdlib can be compiled using these flags, I don't see any issues, because we are talking about the performance of optval inside stdlib itself.

Is that correct? I got the idea that link-time optimization would require these flags to be passed at the linking stage (which I presume means, when the user is compiling their executable). Or is that not true?

@gareth-nx
Copy link
Contributor

gareth-nx commented Sep 18, 2021

Here you are.

         |  -flto   | -O3 -flto | -O2 -flto | -O1 -flto |
---------------------------------------------------------
 noif    | 33172161 |  18792568 |  18726294 |  18845316 |
 nonopt  | 32707464 |  18722671 |  18772107 |  18782929 |
 present | 32956918 |  18800152 |  18888857 |  18689901 |
 optval  | 42851931 |  18753826 |  18717697 |  18714853 |

Great -- so there is really no sensitivity to optimization flags, so long as we at least have O1. It's largely about the link-time-optimization.

@awvwgk awvwgk added the refactoring Internal change for better maintainablility label Sep 18, 2021
@epagone
Copy link

epagone commented Sep 18, 2021

Very interesting, Although it seems like that you have already found a (probably) better solution, I'd like to throw-in also the option to use the Compiler Explorer for snippets of code (if you want to check assembly code generated by different compilers with different options).

@milancurcic
Copy link
Member

Thank you for all this analysis, it's very useful. I understand that comparisons here intentionally minimize the workload in the body of the subroutine, i.e. the square or cube of a scalar. If we do choose to use optval instead of present like in #528, it should be only in procedures that are expected to be computationally heavy, e.g. an operation over a large array, so that any potential cost of convenience functions like optval can be considered negligible.

@jvdp1
Copy link
Member

jvdp1 commented Sep 18, 2021

it should be only in procedures that are expected to be computationally heavy, e.g. an operation over a large array, so that any potential cost of convenience functions like optval can be considered negligible.

I totally agree with that. E.g., I have no problems of using optval in stdlib_logger, or as used in #528 for stdlib_bitsets.

@dalon-work
Copy link

dalon-work commented Sep 21, 2021

One idea I haven't seen in this discussion, which may violate everyone's notion of clean fortran code, is to #include the optval function, instead of placing it in a separate compilation unit. In a separate unit, LTO is necessary for the compiler to inline it. But if it gets #included as a private module function, the compiler is free to inline it as it sees fit.

Another idea is for the compiler that creates the optval module to include the implementation of this "small function" in the module binary file, and not just the interface definition. This behavior would act as a "precompiled header", and still allow the desired inlining. But that requires modifying a compiler, of course.

@gareth-nx
Copy link
Contributor

One idea I haven't seen in this discussion, which may violate everyone's notion of clean fortran code, is to #include

I like #include in certain circumstances -- as you say it can help the compiler with efficiency.

But for this particular setting, for cases where performance is an issue, it's likely simpler to just use present in a conventional way. So instead of x = optval(x_in, default), we can write x = default; if(present(x_in)) x = x_in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refactoring Internal change for better maintainablility
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants