Rayon experiment with matrix multiplication #190

Closed · wants to merge 3 commits

Conversation

@bluss (Member) commented Apr 17, 2016

No description provided.

@kernelmachine

Haha, we're working on the same thing: https://github.com/pegasos1/rsmat/blob/master/src/dot/dot.rs. What sort of performance were you able to get? With the current implementation, on a 100 x 100 array I get ~400k ns (+/- ~120k ns) with rayon, and ~360k ns (+/- 12k ns) with simple_parallel.

@bluss (Member Author) commented Apr 18, 2016

@pegasos1 I can only compare with you if there's a relative measure -- for example, your parallelized version vs. ndarray's regular matrixmultiply in a single thread.

It looks like we have made the same choices in strategy. I think you should really use the matrixmultiply algorithm and not a naive matrix multiplication implementation; the difference should be pretty huge.
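
For concreteness, calling matrixmultiply's low-level kernel directly looks roughly like this (a minimal sketch; the sizes are illustrative, and sgemm computes C = alpha * A * B + beta * C given explicit row/column strides):

    let (m, k, n) = (100, 100, 100);
    let a = vec![1.0_f32; m * k];
    let b = vec![1.0_f32; k * n];
    let mut c = vec![0.0_f32; m * n];

    // Row-major buffers: row stride = row length, column stride = 1.
    unsafe {
        matrixmultiply::sgemm(
            m, k, n,
            1.0,
            a.as_ptr(), k as isize, 1,
            b.as_ptr(), n as isize, 1,
            0.0,
            c.as_mut_ptr(), n as isize, 1,
        );
    }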

@bluss (Member Author) commented Apr 18, 2016

Here are my results from that code, comparing the single-threaded matrixmultiply-based version (master) vs the rayon + matrixmultiply-based version (this PR):

name                   onethread.log ns/iter  rayon1.log ns/iter    diff ns/iter   diff %
mat_mul_f32::m064      27,690                 44,536                      16,846   60.84%
mat_mul_f32::m127      188,974                160,620                    -28,354  -15.00%
mat_mul_f32::mix10000  14,242,672             10,600,079              -3,642,593  -25.58%

As you can see, the wins are a bit marginal.

Edit: the m064 measurement was possibly wrong due to CPU frequency scaling effects. With commit 3 in the PR it's back to this, at least:

mat_mul_f32::m064      27,690                 27,736                          46   0.17%

@bluss (Member Author) commented Apr 18, 2016

But it hasn't been tuned at all.

@bluss (Member Author) commented Apr 18, 2016

The table is created using https://github.com/BurntSushi/cargo-benchcmp

@bluss closed this Apr 18, 2016
@bluss reopened this Apr 18, 2016
@bluss (Member Author) commented Apr 18, 2016

oops.

@bluss force-pushed the rayon-experiment branch from 8f3b2b9 to 2d41bdd on April 18, 2016 11:56
@kernelmachine

Cool, I just read your blog post about that. I'll try that. Good to know we had the same strategy.

One big determinant of performance was block size: obviously, a smaller block size meant more time was spent splitting the matrix than performing the computation. Interestingly, simple_parallel and rayon consistently converged in performance when the block size was about n/2. How did you determine the block size?

@kernelmachine

Which I guess is SPLIT in your case.
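
The recursive splitting strategy under discussion looks roughly like this (a hypothetical sketch, not the PR's code; the naive base case stands in for the real matrixmultiply kernel, and SPLIT is the block-size threshold):

    /// Stop splitting once the output has at most SPLIT rows.
    const SPLIT: usize = 64;

    /// c = a * b for row-major a (m x k), b (k x n), c (m x n):
    /// recursively halve the rows of a and c, multiplying the two
    /// halves in parallel via rayon's work-stealing join.
    fn par_matmul(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
        if m <= SPLIT {
            for i in 0..m {
                for j in 0..n {
                    let mut acc = 0.0;
                    for l in 0..k {
                        acc += a[i * k + l] * b[l * n + j];
                    }
                    c[i * n + j] = acc;
                }
            }
        } else {
            let mid = m / 2;
            let (a_top, a_bot) = a.split_at(mid * k);
            let (c_top, c_bot) = c.split_at_mut(mid * n);
            rayon::join(
                || par_matmul(a_top, b, c_top, mid, k, n),
                || par_matmul(a_bot, b, c_bot, m - mid, k, n),
            );
        }
    }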

@kernelmachine

Also, thanks for the link, that's useful.

@bluss (Member Author) commented Apr 18, 2016

I haven't tuned the block size.

@bluss (Member Author) commented Apr 18, 2016

A difference is that rayon uses a thread pool and simple_parallel does not.

@kernelmachine

Yep, that explains its generally worse performance.

@bluss (Member Author) commented Apr 18, 2016

I'm a bit surprised by that.

@kernelmachine commented Apr 21, 2016

Got some substantial improvements by using matrixmultiply and doing some optimizations. Here are the current benchmarks multiplying 128 x 100 and 100 x 128 matrices, block size = 64, on my machine:

test dot::test::tests::bench_dot_single_thread ... bench: 347,286 ns/iter (+/- 10,260)
test dot::test::tests::bench_dot_rayon ... bench: 223,132 ns/iter (+/- 59,974)
test dot::test::tests::bench_dot_simple_parallel ... bench: 259,094 ns/iter (+/- 49,819)
test dot::test_rblas::bench_dot_openblas ... bench: 201,170 ns/iter (+/- 18,666)

https://github.com/pegasos1/rsmat/blob/master/src/dot/dot.rs

@kernelmachine commented Apr 21, 2016

Haha. For funsies, I ran the same benchmarks on a c4.2xlarge EC2 instance (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/c4-instances.html#c4-instances-hardware) and came up with the following. Of note: simple_parallel performs better than rayon here. And, oh my god, OpenBLAS.

test dot::test::tests::bench_dot ... bench: 323,867 ns/iter (+/- 4,005)
test dot::test::tests::bench_dot_rayon ... bench: 239,585 ns/iter (+/- 15,530)
test dot::test::tests::bench_dot_simple_parallel ... bench: 175,125 ns/iter (+/- 4,108)
test dot::test_rblas::bench_dot_openblas ... bench: 91,242 ns/iter (+/- 686)

@bluss (Member Author) commented Apr 22, 2016

Wow cool. Did you use -C target-cpu=native (or =avx2)? matrixmultiply would benefit from that.

@bluss (Member Author) commented Apr 22, 2016

By the way, there's now a public function for matrix multiplication into a preexisting output array:

https://bluss.github.io/rust-ndarray/master/ndarray/linalg/fn.general_mat_mul.html
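
For reference, a minimal usage sketch (general_mat_mul computes c = alpha * a * b + beta * c in place; the shapes here are illustrative):

    use ndarray::prelude::*;
    use ndarray::linalg::general_mat_mul;

    let a: Array2<f32> = Array::ones((128, 100));
    let b: Array2<f32> = Array::ones((100, 128));
    let mut c: Array2<f32> = Array::zeros((128, 128));

    // Writes into c without allocating a new output array.
    general_mat_mul(1.0, &a, &b, 0.0, &mut c);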

@kernelmachine

Hm, marginal improvements with target-cpu=native. Variance goes down:

test bench_dot ... bench: 348,404 ns/iter (+/- 16,236)
test bench_dot_rayon ... bench: 219,044 ns/iter (+/- 41,312)
test bench_dot_simple_parallel ... bench: 256,304 ns/iter (+/- 4,923)

@kernelmachine

Cool! It would be so cool to make a parallel view, analogous to rayon's parallel iterator (par_iter), such that any relevant operation applied to the par_view gets automatically rayon'ed.

@bluss (Member Author) commented Apr 23, 2016

Maybe wrapper types.

I've been contemplating that as a way to make the higher-order functions customizable. With specialization we could do that crazy stuff automatically (special-casing Send/Sync operations), but until then we can do it manually.

I mean something like this:

array.mapv(|x| x + 1.);   // singlethreaded, because we don't know if the closure is thread safe
array.mapv(Par(|x| x + 1.));  // parallel
array.mapv(Par(|x| x + 1.).threads(2));  // parallel with options

Using a trait that is, so to speak, a superset of the closure traits enables the interface to be very ergonomic. You could use the same approach to opt in to parallel matrix multiply (but that should probably be automatic, or opt-out?).
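
A minimal sketch of how such a wrapper type could work (hypothetical names throughout -- Par, MapElems, and map_elems are not ndarray API -- shown on a plain slice for brevity):

    use rayon::prelude::*;

    /// Hypothetical marker wrapping a closure that is safe to run in parallel.
    struct Par<F>(F);

    /// Hypothetical trait unifying plain closures (sequential) and
    /// Par-wrapped closures (parallel) behind one mapv-like entry point.
    trait MapElems<T> {
        fn map_elems(self, elems: &mut [T]);
    }

    /// Bare closures run sequentially: we can't assume they're thread safe.
    impl<T, F> MapElems<T> for F
    where
        F: Fn(&T) -> T,
    {
        fn map_elems(self, elems: &mut [T]) {
            for e in elems.iter_mut() {
                *e = self(e);
            }
        }
    }

    /// Par-wrapped closures must be Sync and run on rayon's thread pool.
    impl<T, F> MapElems<T> for Par<F>
    where
        T: Send,
        F: Fn(&T) -> T + Sync,
    {
        fn map_elems(self, elems: &mut [T]) {
            let f = self.0;
            elems.par_iter_mut().for_each(|e| *e = f(e));
        }
    }

    fn main() {
        let mut data = vec![1.0_f32; 1_000_000];
        (|x: &f32| x + 1.0).map_elems(&mut data);    // single-threaded
        Par(|x: &f32| x + 1.0).map_elems(&mut data); // parallel
    }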

@kernelmachine

What's the difference between

array.mapv(Par(|x| x + 1))

and

array.par_view().mapv(|x| x + 1)

Is the latter version not more extensible to different types of array operations?

@bluss (Member Author) commented Apr 25, 2016

Maybe; it seems to me that introducing par_view requires implementing every array view operation on it up front for it to be reasonable.

I think ndarray can implement the parallel iterator traits on its iterators, but I don't see the need for par_view.
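
As an illustration of that direction, later rayon versions can already bridge a serial iterator into the thread pool (a sketch, assuming ndarray's iterators are Send; par_bridge carries more overhead than a native ParallelIterator implementation would):

    use ndarray::prelude::*;
    use rayon::iter::{ParallelBridge, ParallelIterator};

    let a: Array2<f64> = Array::ones((1000, 1000));

    // Feed ndarray's serial element iterator to rayon's thread pool.
    let sum: f64 = a.iter().copied().par_bridge().sum();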

@kernelmachine

Yeah true


@bluss closed this Dec 14, 2016
@bluss changed the title from Rayon experiment to Rayon experiment with matrix multiplication on Dec 14, 2016
@lloydmeta (Contributor)

Sorry to revive an old thread, but is parallel matrix multiplication available? I couldn't find it in the current API docs, but maybe I'm looking in the wrong place. I'm using BLAS, if that helps.

Thanks.

@bluss (Member Author) commented Aug 13, 2017

Hi, it's not available natively, only by using a BLAS implementation that does it transparently, like OpenBLAS.

There's a library in Rust called matrixmultiply_mt, but I haven't looked at it.

@lloydmeta (Contributor)

@bluss I see, thanks for letting me know.

I'm using OpenBLAS per the instructions you provided, and I can see that my program is running on 5 threads, but it doesn't look like it's using more than 1 core: the CPU% is only at 100%, whereas usually when I use something like Rayon, it goes to 350%-400%. Are there any flags I need to set when compiling or running the program?

@lloydmeta (Contributor)

I just sampled the process and realised it was almost exclusively using matrixmultiply::gemm::sgemm instead of cblas_sgemm. Strange...

@bluss (Member Author) commented Aug 13, 2017

Ok, that could happen if the code considers the memory layout not to be BLAS compatible. I'm not 100% sure we forward to BLAS in all possible cases. What's the layout of the arrays being multiplied?

@bluss (Member Author) commented Aug 13, 2017

Layout as in the .shape() and .strides() information, or as printed with shape and strides by the Debug formatter.

@lloydmeta (Contributor) commented Aug 13, 2017

Thanks by the way for going through this with me.

I'm using .dot() in a couple of places:

(1)
m1: shape [1, 200],    stride [0, 1]
m2: shape [200, 7351], stride [7351, 1]

(2)
m1: shape [200, 7351], stride [7351, 1]
m2: shape [7351, 1],   stride [1, 7351]

(3)
m1: shape [7351, 1],   stride [1, 7351]
m2: shape [1, 200],    stride [0, 1]

The 1 x 200 matrices with stride [0, 1] were made by broadcasting Array1s into Array2s, but I've also tried not broadcasting and using Array2::from_shape_vec to get stride [200, 1], and it still isn't using OpenBLAS...
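
For context, broadcasting is what produces that zero stride -- a small sketch:

    use ndarray::prelude::*;

    let v: Array1<f64> = Array::zeros(200);
    // Broadcasting repeats v along the new leading axis without copying,
    // so that axis gets stride 0.
    let row = v.broadcast((1, 200)).unwrap();
    assert_eq!(row.shape(), &[1, 200]);
    assert_eq!(row.strides(), &[0, 1]);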

@bluss (Member Author) commented Aug 13, 2017

Thanks -- I will look at this case. As an aside from this bug, I see it's a multiplication with a column vector, so the gemv or mat × vec multiply method will handle this better. Neither BLAS nor ndarray picks the algorithm automatically here, though I could imagine one would want ndarray's .dot() to do that; it's that kind of generic interface, isn't it?

I don't know offhand if gemv is multithreaded in BLAS; it's a much simpler problem, so maybe not. Both BLAS and ndarray's matrixmultiply will do too much work when using their gemm operations, because they need to pad the input to a multiple of the kernel size. In effect, multiplying with 1 column has almost the same cost as multiplying with 4 columns. They are all about the big inputs 😄
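
For the matrix × column-vector case, the gemv entry point would look like this (a sketch of the (2) case above with the column treated as a 1-D vector; general_mat_vec_mul computes y = alpha * a · x + beta * y):

    use ndarray::prelude::*;
    use ndarray::linalg::general_mat_vec_mul;

    let a: Array2<f32> = Array::ones((200, 7351));
    let x: Array1<f32> = Array::ones(7351);
    let mut y: Array1<f32> = Array::zeros(200);

    // y = a · x, written into y in place.
    general_mat_vec_mul(1.0, &a, &x, 0.0, &mut y);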

@bluss (Member Author) commented Aug 13, 2017

I focused on the column vector problem since it's the only one that should be passed to BLAS (it's not broadcast), but maybe that one isn't the problem.

@bluss (Member Author) commented Aug 13, 2017

Ok, case (2) I can't reproduce; it will use BLAS.

Inputs are with (shape, stride, layout):

([200, 7351], [7351, 1], C (0x1))
([7351, 1], [1, 7351], C (0x1))

Cases (1) and (3) should also be able to use BLAS -- they are broadcast, but only across the axis where they have 1 element, so it doesn't matter.

@lloydmeta (Contributor) commented Aug 14, 2017

@bluss Thanks for taking a look. I dug a bit deeper and found these lines in impl_alg.rs, which are used as a guard before using cblas:

    // Require the inner (column) stride to be exactly 1, i.e. the array
    // must be contiguous along its last axis; otherwise skip the BLAS path.
    let s1 = a.strides()[1];
    if s1 != 1 {
        return false;
    }

From what I can tell, in my (2) case:

m1: shape [200, 7351], stride [7351, 1]
m2: shape [7351, 1],   stride [1, 7351]

both_f would be false (it's not the case that stride element 0 == 1 for both), the lhs (m1) axes don't get swapped (not a square matrix, so m != a), and the rhs (m2) axes don't get swapped (stride element 0 is not == 1).

This means that when the rhs (m2) is passed to blas_row_major_2d, the 2nd element of the strides() array is not equal to 1, so false is returned and we don't get to use cblas :(

The same thing happens to lhs (m1) in the (3) case I wrote about earlier.

The (1) case works though (I just confirmed).

@lloydmeta (Contributor)

I think I finally figured out what was going on. The column (N x 1) matrices that are causing problems were made by transposing (1 x N) row matrices via t(), and as a result have strides of [1, N] instead of the [1, 1] they would have if I had created the column matrices from scratch.

I managed to "fix" the problem by making the transposed column matrices via Array::from_shape_vec, passing the desired column shape and the vector data, but it would be great to know if there is an easier way to do this. Perhaps array.t().into_shape()?

@lloydmeta (Contributor)

Just confirmed that doing array.t().into_shape(new_dims) fixes the problem, but I think it would be great if we could have a transpose method that transposes and reshapes without calling into_shape, because that requires further unwrapping.
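
For reference, the confirmed workaround as a sketch (shapes illustrative, ndarray API of the time):

    use ndarray::prelude::*;

    let row: Array2<f64> = Array::zeros((1, 200)); // strides [200, 1]
    // t() only swaps the strides, giving shape [200, 1] with strides [1, 200],
    // which fails the BLAS row-major guard quoted above.
    // into_shape re-derives contiguous strides; it returns a Result, hence unwrap.
    let col = row.t().into_shape((200, 1)).unwrap(); // strides [1, 1]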

@bluss (Member Author) commented Aug 15, 2017

(I might not have time to answer every day or even every week, don't worry.)

It's a bug if it's not using the intended code path. There are a few cases and it's not covering them all; it needs to ignore the stride of an axis that has length 1, so the workaround with into_shape should not be needed. This is issue #340.

What are your thoughts on picking the “gemm” vs “gemv” algorithm automatically for matrix × column? This is issue #341.

There are two competing concerns with this question:

  • The user knows their data and picks the operation they know they need
  • The library knows the memory layout cases and picks the best algorithm.

I think it would be natural to have a.dot(b) adaptive but the more raw methods general_mat_mul and general_mat_vec_mul to always use their “gemm” and “gemv” algorithms respectively.
