Rayon experiment with matrix multiplication #190

Closed · wants to merge 3 commits

Conversation

@bluss (Member) commented Apr 17, 2016

No description provided.

@kernelmachine

Haha, we're working on the same thing: https://github.com/pegasos1/rsmat/blob/master/src/dot/dot.rs. What sort of performance were you able to get? With the current implementation, on a 100 x 100 array I get ~400k ns (+/- ~120k ns) with rayon, and ~360k ns (+/- 12k ns) with simple_parallel.

@bluss (Member Author) commented Apr 18, 2016

@pegasos1 I can only compare with you if there's a relative measure -- for example, your parallelized version vs. ndarray's regular matrixmultiply in a single thread.

It looks like we have made the same choices in strategy. I think you should really use the matrixmultiply algorithm and not a naive matrix multiplication implementation; the difference should be pretty huge.
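
For concreteness, calling matrixmultiply's low-level kernel directly looks roughly like this (a minimal sketch; the sizes are illustrative, and sgemm computes C = alpha * A * B + beta * C given explicit row/column strides):

    let (m, k, n) = (100, 100, 100);
    let a = vec![1.0_f32; m * k];
    let b = vec![1.0_f32; k * n];
    let mut c = vec![0.0_f32; m * n];

    // Row-major buffers: row stride = row length, column stride = 1.
    unsafe {
        matrixmultiply::sgemm(
            m, k, n,
            1.0,
            a.as_ptr(), k as isize, 1,
            b.as_ptr(), n as isize, 1,
            0.0,
            c.as_mut_ptr(), n as isize, 1,
        );
    }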

@bluss (Member Author) commented Apr 18, 2016

Here are my results from that code, comparing the single-threaded matrixmultiply-based version (master) vs the rayon + matrixmultiply-based version (this PR):

name                   onethread.log ns/iter  rayon1.log ns/iter    diff ns/iter   diff %
mat_mul_f32::m064      27,690                 44,536                      16,846   60.84%
mat_mul_f32::m127      188,974                160,620                    -28,354  -15.00%
mat_mul_f32::mix10000  14,242,672             10,600,079              -3,642,593  -25.58%

As you can see, the wins are a bit marginal.

Edit: the m064 measurement was possibly wrong due to CPU frequency scaling effects. With commit 3 in the PR it's back to this, at least:

mat_mul_f32::m064      27,690                 27,736                          46   0.17%

@bluss (Member Author) commented Apr 18, 2016

But it hasn't been tuned at all.

@bluss (Member Author) commented Apr 18, 2016

The table is created using https://github.com/BurntSushi/cargo-benchcmp

@bluss closed this Apr 18, 2016
@bluss reopened this Apr 18, 2016
@bluss (Member Author) commented Apr 18, 2016

oops.

@bluss force-pushed the rayon-experiment branch from 8f3b2b9 to 2d41bdd on April 18, 2016 11:56
@kernelmachine

Cool, I just read your blog post about that. I'll try that. Good to know we had the same strategy.

One big determinant of performance was block size: obviously, a smaller block size meant more time was spent splitting the matrix than performing the computation. Interestingly, simple_parallel and rayon consistently converged in performance when the block size was about n/2. How did you determine the block size?

@kernelmachine

Which I guess is SPLIT in your case.
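
The recursive splitting strategy under discussion looks roughly like this (a hypothetical sketch, not the PR's code; the naive base case stands in for the real matrixmultiply kernel, and SPLIT is the block-size threshold):

    /// Stop splitting once the output has at most SPLIT rows.
    const SPLIT: usize = 64;

    /// c = a * b for row-major a (m x k), b (k x n), c (m x n):
    /// recursively halve the rows of a and c, multiplying the two
    /// halves in parallel via rayon's work-stealing join.
    fn par_matmul(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
        if m <= SPLIT {
            for i in 0..m {
                for j in 0..n {
                    let mut acc = 0.0;
                    for l in 0..k {
                        acc += a[i * k + l] * b[l * n + j];
                    }
                    c[i * n + j] = acc;
                }
            }
        } else {
            let mid = m / 2;
            let (a_top, a_bot) = a.split_at(mid * k);
            let (c_top, c_bot) = c.split_at_mut(mid * n);
            rayon::join(
                || par_matmul(a_top, b, c_top, mid, k, n),
                || par_matmul(a_bot, b, c_bot, m - mid, k, n),
            );
        }
    }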

@kernelmachine

Also, thanks for the link, that's useful.

@bluss (Member Author) commented Apr 18, 2016

I haven't tuned the block size.

@bluss (Member Author) commented Apr 18, 2016

A difference is that rayon uses a thread pool and simple_parallel does not.

@kernelmachine

Yep, that explains its generally worse performance.

@bluss (Member Author) commented Apr 18, 2016

I'm a bit surprised by that.

@kernelmachine commented Apr 21, 2016

Got some substantial improvements by using matrixmultiply and doing some optimizations. Here are the current benchmarks multiplying 128 x 100 and 100 x 128 matrices, block size = 64, on my machine:

test dot::test::tests::bench_dot_single_thread ... bench: 347,286 ns/iter (+/- 10,260)
test dot::test::tests::bench_dot_rayon ... bench: 223,132 ns/iter (+/- 59,974)
test dot::test::tests::bench_dot_simple_parallel ... bench: 259,094 ns/iter (+/- 49,819)
test dot::test_rblas::bench_dot_openblas ... bench: 201,170 ns/iter (+/- 18,666)

https://github.com/pegasos1/rsmat/blob/master/src/dot/dot.rs

@kernelmachine commented Apr 21, 2016

Haha. For funsies, I ran the same benchmarks on a c4.2xlarge EC2 instance (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/c4-instances.html#c4-instances-hardware) and came up with the following. Of note: simple_parallel performs better than rayon here. And, oh my god, OpenBLAS.

test dot::test::tests::bench_dot ... bench: 323,867 ns/iter (+/- 4,005)
test dot::test::tests::bench_dot_rayon ... bench: 239,585 ns/iter (+/- 15,530)
test dot::test::tests::bench_dot_simple_parallel ... bench: 175,125 ns/iter (+/- 4,108)
test dot::test_rblas::bench_dot_openblas ... bench: 91,242 ns/iter (+/- 686)

@bluss (Member Author) commented Apr 22, 2016

Wow cool. Did you use -C target-cpu=native (or =avx2)? matrixmultiply would benefit from that.

@bluss (Member Author) commented Apr 22, 2016

By the way, there's now a public function for matrix multiplication into a preexisting output array:

https://bluss.github.io/rust-ndarray/master/ndarray/linalg/fn.general_mat_mul.html
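
For reference, a minimal usage sketch (general_mat_mul computes c = alpha * a * b + beta * c in place; the shapes here are illustrative):

    use ndarray::prelude::*;
    use ndarray::linalg::general_mat_mul;

    let a: Array2<f32> = Array::ones((128, 100));
    let b: Array2<f32> = Array::ones((100, 128));
    let mut c: Array2<f32> = Array::zeros((128, 128));

    // Writes into c without allocating a new output array.
    general_mat_mul(1.0, &a, &b, 0.0, &mut c);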

@kernelmachine

Hm, marginal improvements with target-cpu=native. Variance goes down:

test bench_dot ... bench: 348,404 ns/iter (+/- 16,236)
test bench_dot_rayon ... bench: 219,044 ns/iter (+/- 41,312)
test bench_dot_simple_parallel ... bench: 256,304 ns/iter (+/- 4,923)

@kernelmachine

Cool! It would be so cool to make a parallel view, analogous to rayon's parallel iterator (par_iter), such that any relevant operation applied to the par_view gets automatically rayon'ed.

@bluss (Member Author) commented Apr 23, 2016

Maybe wrapper types.

I've been contemplating that as a way to make the higher-order functions customizable. With specialization we could do that crazy stuff automatically (special-casing Send/Sync operations), but until then we can do it manually.

I mean something like this:

array.mapv(|x| x + 1.);   // singlethreaded, because we don't know if the closure is thread safe
array.mapv(Par(|x| x + 1.));  // parallel
array.mapv(Par(|x| x + 1.).threads(2));  // parallel with options

Using a trait that is, so to speak, a superset of the closure traits enables the interface to be very ergonomic. You could use the same approach to opt in to parallel matrix multiply (but that should probably be automatic, or opt-out?).
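
A minimal sketch of how such a wrapper type could work (hypothetical names throughout -- Par, MapElems, and map_elems are not ndarray API -- shown on a plain slice for brevity):

    use rayon::prelude::*;

    /// Hypothetical marker wrapping a closure that is safe to run in parallel.
    struct Par<F>(F);

    /// Hypothetical trait unifying plain closures (sequential) and
    /// Par-wrapped closures (parallel) behind one mapv-like entry point.
    trait MapElems<T> {
        fn map_elems(self, elems: &mut [T]);
    }

    /// Bare closures run sequentially: we can't assume they're thread safe.
    impl<T, F> MapElems<T> for F
    where
        F: Fn(&T) -> T,
    {
        fn map_elems(self, elems: &mut [T]) {
            for e in elems.iter_mut() {
                *e = self(e);
            }
        }
    }

    /// Par-wrapped closures must be Sync and run on rayon's thread pool.
    impl<T, F> MapElems<T> for Par<F>
    where
        T: Send,
        F: Fn(&T) -> T + Sync,
    {
        fn map_elems(self, elems: &mut [T]) {
            let f = self.0;
            elems.par_iter_mut().for_each(|e| *e = f(e));
        }
    }

    fn main() {
        let mut data = vec![1.0_f32; 1_000_000];
        (|x: &f32| x + 1.0).map_elems(&mut data);    // single-threaded
        Par(|x: &f32| x + 1.0).map_elems(&mut data); // parallel
    }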

@kernelmachine

What's the difference between

array.mapv(Par(|x| x + 1))

and

array.par_view().mapv(|x| x + 1)

Is the latter version not more extensible to different types of array operations?

@bluss (Member Author) commented Apr 25, 2016

Maybe; it seems to me that introducing par_view requires implementing every array view operation on it up front for it to be reasonable.

I think ndarray can implement the parallel iterator traits on its iterators, but I don't see the need for par_view.
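
As an illustration of that direction, later rayon versions can already bridge a serial iterator into the thread pool (a sketch, assuming ndarray's iterators are Send; par_bridge carries more overhead than a native ParallelIterator implementation would):

    use ndarray::prelude::*;
    use rayon::iter::{ParallelBridge, ParallelIterator};

    let a: Array2<f64> = Array::ones((1000, 1000));

    // Feed ndarray's serial element iterator to rayon's thread pool.
    let sum: f64 = a.iter().copied().par_bridge().sum();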

@kernelmachine

Yeah true


@bluss closed this Dec 14, 2016
@bluss changed the title from Rayon experiment to Rayon experiment with matrix multiplication on Dec 14, 2016
@lloydmeta (Contributor)

Sorry to revive an old thread, but is parallel matrix multiplication available? I couldn't find it in the current API docs, but maybe I'm looking in the wrong place. I'm using BLAS, if that helps.

Thanks.

@bluss (Member Author) commented Aug 13, 2017

Hi, it's not available natively, only by using a BLAS implementation that does it transparently, like OpenBLAS.

There's a library in Rust called matrixmultiply_mt, but I haven't looked at it.

@lloydmeta (Contributor)

@bluss I see, thanks for letting me know.

I'm using OpenBLAS per the instructions you provided, and I can see that my program is running on 5 threads, but it doesn't look like it's using more than 1 core: the CPU% is only at 100%, whereas usually when I use something like Rayon, it goes to 350%-400%. Are there any flags I need to set when compiling or running the program?

@lloydmeta (Contributor)

I just sampled the process and realised it was almost exclusively using matrixmultiply::gemm::sgemm instead of cblas_sgemm. Strange...

@bluss (Member Author) commented Aug 13, 2017

Ok, that could happen if the code considers the memory layout not to be BLAS compatible. I'm not 100% sure we forward to BLAS in all possible cases. What's the layout of the arrays being multiplied?

@bluss (Member Author) commented Aug 13, 2017

Layout as in the .shape() and .strides() information, or as printed with shape and strides by the Debug formatter.

@lloydmeta (Contributor) commented Aug 13, 2017

Thanks by the way for going through this with me.

I'm using .dot() in a couple of places:

(1)
m1: shape [1, 200],    stride [0, 1]
m2: shape [200, 7351], stride [7351, 1]

(2)
m1: shape [200, 7351], stride [7351, 1]
m2: shape [7351, 1],   stride [1, 7351]

(3)
m1: shape [7351, 1],   stride [1, 7351]
m2: shape [1, 200],    stride [0, 1]

The 1 x 200 matrices with stride [0, 1] were made by broadcasting Array1s into Array2s, but I've also tried not broadcasting and using Array2::from_shape_vec to get stride [200, 1], and it still isn't using OpenBLAS...
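
For context, broadcasting is what produces that zero stride -- a small sketch:

    use ndarray::prelude::*;

    let v: Array1<f64> = Array::zeros(200);
    // Broadcasting repeats v along the new leading axis without copying,
    // so that axis gets stride 0.
    let row = v.broadcast((1, 200)).unwrap();
    assert_eq!(row.shape(), &[1, 200]);
    assert_eq!(row.strides(), &[0, 1]);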

@bluss (Member Author) commented Aug 13, 2017

Thanks -- I will look at this case. As an aside from this bug, I see it's a multiplication with a column vector, so the gemv or mat × vec multiply method will handle this better. Neither BLAS nor ndarray picks the algorithm automatically here, though I could imagine one would want ndarray's .dot() to do that; it's that kind of generic interface, isn't it?

I don't know offhand if gemv is multithreaded in BLAS; it's a much simpler problem, so maybe not. Both BLAS and ndarray's matrixmultiply will do too much work when using their gemm operations, because they need to pad the input to a multiple of the kernel size. In effect, multiplying with 1 column has almost the same cost as multiplying with 4 columns. They are all about the big inputs 😄
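
For the matrix × column-vector case, the gemv entry point would look like this (a sketch of the (2) case above with the column treated as a 1-D vector; general_mat_vec_mul computes y = alpha * a · x + beta * y):

    use ndarray::prelude::*;
    use ndarray::linalg::general_mat_vec_mul;

    let a: Array2<f32> = Array::ones((200, 7351));
    let x: Array1<f32> = Array::ones(7351);
    let mut y: Array1<f32> = Array::zeros(200);

    // y = a · x, written into y in place.
    general_mat_vec_mul(1.0, &a, &x, 0.0, &mut y);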

@bluss (Member Author) commented Aug 13, 2017

I focused on the column vector problem since it's the only one that should be passed to BLAS (it's not broadcast), but maybe that one isn't the problem.

@bluss (Member Author) commented Aug 13, 2017

Ok, case (2) I can't reproduce; it will use BLAS.

Inputs are with (shape, stride, layout):

([200, 7351], [7351, 1], C (0x1))
([7351, 1], [1, 7351], C (0x1))

Cases (1) and (3) should also be able to use BLAS -- they are broadcast, but only across the axis where they have 1 element, so it doesn't matter.

@lloydmeta (Contributor) commented Aug 14, 2017

@bluss Thanks for taking a look. I dug a bit deeper and found these lines in impl_alg.rs, which are used as a guard before using cblas:

    // Require the inner (column) stride to be exactly 1, i.e. the array
    // must be contiguous along its last axis; otherwise skip the BLAS path.
    let s1 = a.strides()[1];
    if s1 != 1 {
        return false;
    }

From what I can tell, in my (2) case:

m1: shape [200, 7351], stride [7351, 1]
m2: shape [7351, 1],   stride [1, 7351]

both_f would be false (it's not the case that stride element 0 == 1 for both), the lhs (m1) axes don't get swapped (not a square matrix, so m != a), and the rhs (m2) axes don't get swapped (stride element 0 is not == 1).

This means that when the rhs (m2) is passed to blas_row_major_2d, the 2nd element of the strides() array is not equal to 1, so false is returned and we don't get to use cblas :(

The same thing happens to lhs (m1) in the (3) case I wrote about earlier.

The (1) case works though (I just confirmed).

@lloydmeta (Contributor)

I think I finally figured out what was going on. The column (N x 1) matrices that are causing problems were made by transposing (1 x N) row matrices via t(), and as a result have strides of [1, N] instead of the [1, 1] they would have if I had created the column matrices from scratch.

I managed to "fix" the problem by making the transposed column matrices via Array::from_shape_vec, passing the desired column shape and the vector data, but it would be great to know if there is an easier way to do this. Perhaps array.t().into_shape()?

@lloydmeta (Contributor)

Just confirmed that doing array.t().into_shape(new_dims) fixes the problem, but I think it would be great if we could have a transpose method that transposes and reshapes without calling into_shape, because that requires further unwrapping.
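
For reference, the confirmed workaround as a sketch (shapes illustrative, ndarray API of the time):

    use ndarray::prelude::*;

    let row: Array2<f64> = Array::zeros((1, 200)); // strides [200, 1]
    // t() only swaps the strides, giving shape [200, 1] with strides [1, 200],
    // which fails the BLAS row-major guard quoted above.
    // into_shape re-derives contiguous strides; it returns a Result, hence unwrap.
    let col = row.t().into_shape((200, 1)).unwrap(); // strides [1, 1]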

@bluss (Member Author) commented Aug 15, 2017

(I might not have time to answer every day or even every week, don't worry.)

It's a bug if it's not using the intended code path. There are a few cases and it's not covering them all; it needs to ignore the stride of an axis that has length 1, so the workaround with into_shape should not be needed. This is issue #340.

What are your thoughts on picking the “gemm” vs “gemv” algorithm automatically for matrix × column? This is issue #341.

There are two competing concerns with this question:

  • The user knows their data and picks the operation they know they need
  • The library knows the memory layout cases and picks the best algorithm.

I think it would be natural to have a.dot(b) adaptive but the more raw methods general_mat_mul and general_mat_vec_mul to always use their “gemm” and “gemv” algorithms respectively.
