Rayon experiment with matrix multiplication #190
Conversation
haha, we're working on the same thing: https://github.com/pegasos1/rsmat/blob/master/src/dot/dot.rs. What sort of performance were you able to get? With the current implementation, on a 100 x 100 array I have ~400k ns (+/- ~120k ns) with rayon, and ~360k ns (+/- 12k ns) with simple_parallel. |
@pegasos1 I can only compare with you if there's a relative measure. For example your parallelized version vs ndarray's regular matrixmultiply in a single thread. It looks like we have made the same choices in strategy. I think you should really use the matrixmultiply algorithm and not a naive matrix multiplication implementation, the difference should be pretty huge. |
Here are my results from that code, comparing the single-threaded matrixmultiply-based version (master) vs the rayon + matrixmultiply-based version (this PR).
As you can see, the wins are a bit marginal. Edit: the m064 measurement was possibly wrong due to CPU scaling effects. It's back at this with commit 3 in the PR at least. |
But it's not been tuned at all. |
The table is created using https://github.com/BurntSushi/cargo-benchcmp |
oops. |
Cool, I just read your blog post about that. I'll try that. Good to know we had the same strategy. One big marker of performance was block size: obviously, a smaller block size meant more time was spent splitting the matrix than performing the computation. Interestingly, simple_parallel and rayon consistently converged in performance when the block size was about n/2. How did you determine the block size? |
which i guess is SPLIT in your case |
also thanks for the link, that's useful |
I haven't tuned the block size. |
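For reference, the block-splitting strategy being discussed looks roughly like this in plain Rust. This is an illustrative sketch, not the PR's code: `SPLIT`, `matmul_block`, and the `Vec<Vec<f64>>` layout are all invented for the example, and the two recursive calls are where `rayon::join` (over halves obtained with `split_at_mut`) would go in the parallel version.

```rust
// Untuned split threshold, as noted above; block sizes below this run the
// base-case kernel instead of splitting further.
const SPLIT: usize = 64;

// C[i][j] += sum_k A[i][k] * B[k][j], restricted to the given row range of C/A.
fn matmul_block(
    a: &[Vec<f64>],
    b: &[Vec<f64>],
    c: &mut [Vec<f64>],
    rows: std::ops::Range<usize>,
) {
    if rows.len() > SPLIT {
        let mid = rows.start + rows.len() / 2;
        // In the parallel version, these two calls would be handed to
        // rayon::join, after splitting `c` with split_at_mut.
        matmul_block(a, b, c, rows.start..mid);
        matmul_block(a, b, c, mid..rows.end);
        return;
    }
    // Naive base-case kernel (the real code delegates to matrixmultiply).
    let k_dim = b.len();
    let n = b[0].len();
    for i in rows {
        for k in 0..k_dim {
            let aik = a[i][k];
            for j in 0..n {
                c[i][j] += aik * b[k][j];
            }
        }
    }
}
```

The larger `SPLIT` is, the fewer recursion (and task-spawning) steps are paid per multiply, which is the block-size trade-off described above.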
A difference is that rayon uses a thread pool and simple_parallel does not. |
yep, that explains its generally worse performance. |
I'm a bit surprised by that. |
Got some substantial improvements by using matrixmultiply and doing some optimizations. Here are the current benchmarks multiplying 128 x 100 and 100 x 128 matrices, blocksize = 64, on my machine.
test dot::test::tests::bench_dot_single_thread ... bench: 347,286 ns/iter (+/- 10,260)
https://github.com/pegasos1/rsmat/blob/master/src/dot/dot.rs |
haha. For funsies, I ran the same benchmarks on a c4.2xlarge EC2 instance (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/c4-instances.html#c4-instances-hardware), and came up with the following. Of note is that simple_parallel performs better than rayon here. And, oh my god, openblas.
test dot::test::tests::bench_dot ... bench: 323,867 ns/iter (+/- 4,005) |
Wow cool. Did you use -C target-cpu=native (or =avx2)? matrixmultiply would benefit from that. |
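For anyone trying to reproduce this, one way to pass the flag is via RUSTFLAGS (the exact invocation is an assumption about the setup, not something stated in this thread):

```shell
# Build and benchmark with the host CPU's full instruction set enabled
# (e.g. AVX2/FMA), which matrixmultiply's kernels can exploit:
RUSTFLAGS="-C target-cpu=native" cargo bench

# Or enable a specific feature instead of "native":
RUSTFLAGS="-C target-feature=+avx2" cargo bench
```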
By the way, there's a public function for matrix multiplication into a place now https://bluss.github.io/rust-ndarray/master/ndarray/linalg/fn.general_mat_mul.html |
Hm, marginal improvements with target-cpu=native; the variance goes down.
test bench_dot ... bench: 348,404 ns/iter (+/- 16,236) |
Cool! It would be so cool to make a parallel view, analogous to rayon's parallel iterator (par_iter), such that any relevant operation applied to the par_view gets automatically rayon'ed. |
Maybe wrapper types. I've been contemplating that to make the higher order functions customizable. With specialization we can do that crazy stuff automatically (special-casing Send/Sync operations), but before that we can do it manually. I mean something like this:
array.mapv(|x| x + 1.); // single-threaded, because we don't know if the closure is thread safe
array.mapv(Par(|x| x + 1.)); // parallel
array.mapv(Par(|x| x + 1.).threads(2)); // parallel with options
Using a trait that is a superset of the closure traits, so to speak, enables the interface to be very ergonomic. You could use the same to opt in to parallel matrix multiply (but it should probably be automatic or opt-out?). |
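One way to realize the wrapper-type idea on stable Rust is sketched below. Everything here (`Seq`, `Par`, `MapKernel`, `mapv_inplace`) is invented for illustration and is not ndarray's API; `std::thread::scope` stands in for rayon's thread pool, and the specialization mentioned above is what would let a bare closure take the sequential path without an explicit `Seq` wrapper.

```rust
use std::thread;

// Hypothetical wrapper types: Seq marks a single-threaded map, Par a
// parallel one. A common trait lets one mapv-style entry point dispatch
// on the wrapper.
struct Seq<F>(F);
struct Par<F>(F);

trait MapKernel<T> {
    fn run(&self, data: &mut [T]);
}

// Sequential: no Send/Sync demanded of the closure.
impl<T: Copy, F: Fn(T) -> T> MapKernel<T> for Seq<F> {
    fn run(&self, data: &mut [T]) {
        for x in data.iter_mut() {
            *x = (self.0)(*x);
        }
    }
}

// Parallel: the closure must be Sync, and the work is fanned out over
// scoped threads (std::thread::scope, stable since Rust 1.63). A real
// implementation would use rayon's pool instead of spawning here.
impl<T: Copy + Send, F: Fn(T) -> T + Sync> MapKernel<T> for Par<F> {
    fn run(&self, data: &mut [T]) {
        let chunk = (data.len() / 2).max(1); // two chunks, for the sketch
        thread::scope(|s| {
            for part in data.chunks_mut(chunk) {
                s.spawn(move || {
                    for x in part.iter_mut() {
                        *x = (self.0)(*x);
                    }
                });
            }
        });
    }
}

// Stand-in for an array method like mapv: the caller picks the strategy
// purely by the wrapper type.
fn mapv_inplace<T>(data: &mut [T], kernel: impl MapKernel<T>) {
    kernel.run(data);
}
```

The `.threads(2)` option from the example above would just be a builder method on `Par` that is consulted when choosing chunk sizes.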
What's the difference between
array.mapv(Par(|x| x + 1))
and
array.par_view().mapv(|x| x + 1)
Is the latter version not more extensible to different types of array operations? |
Maybe. It seems to me that par_view requires implementing every array view operation on it up front to make it reasonable to introduce. I think ndarray can implement the parallel iterator traits on its iterators, but I don't see the need for par_view. |
Yeah, true. |
Sorry to revive an old thread, but is parallel matrix multiplication available? I couldn't find it in the current API docs but maybe I'm looking in the wrong place. I'm using blas if that helps. Thanks. |
Hi, it's not available natively, only using a BLAS impl that uses it transparently like OpenBLAS. There's a library in Rust called matrixmultiply_mt, but I haven't looked at it. |
@bluss I see, thanks for letting me know. I'm using OpenBLAS following the instructions you provided, and I can see that my program is running on 5 threads, but it doesn't look like it's using more than 1 core: the CPU% is only at 100%, whereas usually when I use something like Rayon it goes to 350%-400%. Are there any flags I need to set when compiling or running the program? |
I just sampled the process and realised it was almost exclusively using matrixmultiply::gemm::sgemm instead of cblas_sgemm. Strange... |
Ok, that could be if the code considers the memory layout to not be blas compatible. I'm not 100% sure we forward to blas in all possible cases. What's the layout of the arrays being multiplied? |
Layout as ".shape() and .strides()" information, or as printed with shape and stride by Debug formatter. |
Thanks by the way for going through this with me. I'm using
The 1x200 matrices with stride [0, 1] were made by broadcasting Array1s into Array2s, but I've also tried not broadcasting and using Array2::from_shape_vec to get stride [200, 1], and it still isn't using OpenBLAS... |
Thanks -- I will look at this case. Aside from this bug, I see it's a multiplication with a column vector, so the gemv or mat_vec multiply method will handle this better. Neither blas nor ndarray picks the algorithm automatically here, though I could imagine one would want that.

I don't know offhand if gemv is multithreaded in blas; it's a much simpler problem, so maybe not. Both blas and ndarray's matrixmultiply will do too much work when using their gemm operations, because they need to pad the input to a multiple of the kernel size. In effect, multiplying with 1 column has almost the same cost as multiplying with 4 columns. They are all about the big inputs 😄 |
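To make the gemv point concrete: multiplying an M x N matrix by an N-element column is just one dot product per row, so no blocked gemm kernel (and none of its padding) is involved. A minimal sketch in plain Rust (not ndarray's API; `gemv` and the `Vec<Vec<f64>>` layout are invented for the example):

```rust
// y = A * x, with A given as a list of rows: each output element is a
// single dot product, which is all the work this shape actually needs.
fn gemv(a_rows: &[Vec<f64>], x: &[f64]) -> Vec<f64> {
    a_rows
        .iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum())
        .collect()
}
```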
I focused on the column vector problem since it's the only one that should be passed to blas (since it's not broadcasted), but maybe it's not that one that is the problem. |
Ok, (2) I can't reproduce, it will use BLAS. Inputs are with (shape, stride, layout):
(1) and (3) should also be able to use BLAS -- they are broadcasted, but only across the axis where they have 1 element, so it doesn't matter. |
@bluss Thanks for taking a look. I dug a bit deeper and found these lines in impl_alg.rs, which are used as a guard before using cblas here:
let s1 = a.strides()[1];
if s1 != 1 {
    return false;
}
From what I can tell, in my (2) case this means that when rhs (m2) is passed to this guard, the check fails. The same thing happens to lhs (m1) in the (3) case I wrote about earlier. The (1) case works, though (I just confirmed). |
I think I finally figured out what was going on. Those column (N x 1) matrices that are causing problems were made by transposing (1 x N) row matrices. I managed to "fix" the problem by making the transposed column matrices via into_shape instead. |
Just confirmed that doing so works. |
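The stride bookkeeping behind this can be sketched in a few lines of plain Rust (not ndarray's actual code; all function names here are invented): a row-major (1, N) array has strides (N, 1), a transpose just swaps shape and strides to give shape (N, 1) with strides (1, N), and a guard that insists on strides[1] == 1 then rejects it, even though the axis with stride N has only one element, so the layout is still BLAS-compatible.

```rust
// Row-major strides for a 2-D shape: step `cols` per row, 1 per column.
fn row_major_strides(shape: (usize, usize)) -> (usize, usize) {
    (shape.1, 1)
}

// Transposing a strided view swaps shape and strides; no data moves.
fn transpose(
    shape: (usize, usize),
    strides: (usize, usize),
) -> ((usize, usize), (usize, usize)) {
    ((shape.1, shape.0), (strides.1, strides.0))
}

// Mirrors the `if s1 != 1 { return false }` guard quoted above.
fn naive_blas_guard(strides: (usize, usize)) -> bool {
    strides.1 == 1
}
```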
(I might not have time to answer every day or even every week, don't worry.)

It's a bug if it's not using the intended code path. There are a few cases and it's not covering them all; it needs to ignore the stride of the axis that is only length 1, so the workaround with into_shape should not be needed. This is issue #340.

What's your thought about picking the “gemm” vs “gemv” algorithm automatically for matrix × column? This is issue #341. There are two competing concerns with this question:
I think it would be natural to have |