
Optimize CSR and sorted conversions #11


Merged
merged 3 commits on May 4, 2017

Conversation

@mrocklin (Contributor) commented May 1, 2017

This started by adding the optimized version of tocsr from #9 (cc @jakevdp).

Then, to avoid calling sorting routines excessively, we started tracking sorted= metadata.

Then, to accelerate sorting and to consolidate the lexsort and reshape code, we based all sorting code on a single function called linear_loc, which gives the linear location of any index in a C-ordered array. This replaces np.lexsort with np.argsort and performs a quick is-sorted check beforehand. It can dramatically speed up some workflows, but it limits our arrays to 2**64 elements (zero or nonzero), so, for example, this library can no longer represent an array of shape (1e6, 1e6, 1e6, 1e6). This was already the case in reshape. I'm personally OK with this for the time being; I think we can probably remove this limitation in the future (at a cost) if necessary.
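A rough sketch of the idea (hypothetical code; the library's actual implementation may differ):

```python
import numpy as np

def linear_loc(coords, shape):
    # Linear (C-order) position of each column of coords in an array of
    # the given shape; essentially np.ravel_multi_index, accumulated in
    # uint64 so results fit as long as the array has < 2**64 elements.
    out = np.zeros(coords.shape[1], dtype=np.uint64)
    stride = np.uint64(1)
    for i in range(len(shape) - 1, -1, -1):
        out += coords[i].astype(np.uint64) * stride
        stride *= np.uint64(shape[i])
    return out

coords = np.array([[0, 1, 0],
                   [2, 0, 1]])        # 2-d coordinates of three nonzeros
linear = linear_loc(coords, (2, 3))   # array([2, 3, 1], dtype=uint64)

# A single argsort over the linear locations replaces np.lexsort, and a
# cheap is-sorted check beforehand skips the sort when it isn't needed.
if not (linear[:-1] <= linear[1:]).all():
    order = np.argsort(linear)
    coords = coords[:, order]
```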

There is still work to be done to optimize matvecs and matmuls, but this is a decent start.

@mrocklin mrocklin force-pushed the tocsr branch 2 times, most recently from e854fb6 to d4801f4 on May 1, 2017 23:56
mrocklin added 2 commits May 2, 2017 08:51
pull out linear_loc to separate function
This stores the last few objects returned from reshape and transpose
calls.  This allows efficiencies from in-place operations like
`sum_duplicates` and `sort_indices` to persist in iterative workflows.

Modern NumPy programmers are accustomed to operations like .transpose()
being cheap and aren't accustomed to having to pay sorting costs after
many computations.  These assumptions are no longer true in sparse by
default.  However, by caching recent transpose and reshape objects we
can reuse their in-place modifications.  This greatly accelerates common
machine learning workloads.
@mrocklin (Contributor, Author) commented May 2, 2017

I have extended this to cache the last few results of reshape and transpose. This makes things run quite fast for me in practice. Copying commit message here for convenience:

This stores the last few objects returned from reshape and transpose
calls. This allows efficiencies from in-place operations like
sum_duplicates and sort_indices to persist in iterative workflows.

Modern NumPy programmers are accustomed to operations like .transpose()
being cheap and aren't accustomed to having to pay sorting costs after
many computations. These assumptions are no longer true in sparse by
default. However, by caching recent transpose and reshape objects we
can reuse their in-place modifications. This greatly accelerates common
machine learning workloads.
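A minimal sketch of that scheme (hypothetical names, CacheMixin and _memoized), keyed on the operation and its arguments:

```python
from collections import deque

class CacheMixin:
    # Remember the last few (key, result) pairs so that repeated
    # .transpose()/.reshape() calls return the same object, letting
    # in-place work such as sum_duplicates() and sort_indices()
    # persist across calls.

    def __init__(self, cache_size=3):
        # deque(maxlen=N) silently evicts the oldest entry when full
        self._cache = deque(maxlen=cache_size)

    def _memoized(self, key, compute):
        for k, v in self._cache:
            if k == key:
                return v              # hand back the earlier object
        value = compute()
        self._cache.append((key, value))
        return value

# The key encodes the operation and its arguments.
ops = CacheMixin()
a = ops._memoized(('transpose', (1, 0)), lambda: [[1, 3], [2, 4]])
b = ops._memoized(('transpose', (1, 0)), lambda: [[0, 0], [0, 0]])
assert a is b                         # the second call reused the cached object
```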

@mrocklin (Contributor, Author) commented May 2, 2017

Merging shortly if no objections

@jakevdp (Contributor) commented May 2, 2017

Looks good. A couple of things:

  • the CSC conversion could be made more efficient by sorting the indices as CSC directly, rather than doing CSR first

  • for the caching, does this mean that sparse arrays are effectively read-only? If you modify them in place, cached results will be incorrect, yes?

  • on that subject, we should be careful about CSC/CSR conversions that might use views of data matrices, as you might inadvertently change the sparse matrix when changing the CSR generated from it

@mrocklin (Contributor, Author) commented May 2, 2017

All good points.

  1. I'm tempted to leave csc as-is until I (or someone else) actively needs a fast csc. I do expect this to happen given my current application set, but it hasn't happened yet.
  2. For caching this gets tricky. Currently I think that sparse arrays are practically read-only; we haven't added any top-level in-place modification API. However, it is reasonable to expect that this will be added in the future. We can easily invalidate whenever the parent is changed, but it would require more bookkeeping to invalidate whenever a child is changed.
  3. Regarding views, how does NumPy handle this? If I change a view in NumPy then I expect the underlying array to change as well (see the demonstration below). I guess the situation here is more complex because it's not immediately clear to the user what is and is not a view.

For the caching issue I'm tempted to implement the bookkeeping necessary for invalidation.
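For reference, this is the standard NumPy behavior point 3 refers to:

```python
import numpy as np

a = np.arange(6)
v = a.reshape(2, 3)   # a view: shares a's underlying buffer
v[0, 0] = 99
print(a[0])           # 99 -- writing through the view mutates the base array
```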

@jakevdp (Contributor) commented May 2, 2017

Makes sense.

Regarding the bookkeeping for caching: we could detect changes in the arrays by making them properties, but that still wouldn't protect against users modifying the contents of these arrays either directly or via views in derived CSR matrices (for example). One way we might address that is by setting the writeable flag of the arrays to False.
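This relies on standard NumPy behavior: locking an array's buffer makes any later write raise an error, and views derived from a locked array are read-only as well, which also covers the derived-CSR case:

```python
import numpy as np

coords = np.array([[0, 1], [2, 0]])
coords.flags.writeable = False    # lock the buffer

try:
    coords[0, 0] = 5
except ValueError as exc:
    print(exc)                    # "assignment destination is read-only"
```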

@mrocklin (Contributor, Author) commented May 3, 2017

Alternatively, we could have the user explicitly opt in to caching. This may be simpler all around.

@jakevdp (Contributor) commented May 3, 2017

> Alternatively, we could have the user explicitly opt in to caching. This may be simpler all around.

+1, good solution for right now

@mrocklin (Contributor, Author) commented May 3, 2017

Planning to merge this soon if there are no further comments

@jakevdp (Contributor) commented May 3, 2017

Looks good. One thing came to mind, though: I wonder if it would be possible to use functools.lru_cache instead of maintaining the cache manually?

@mrocklin (Contributor, Author) commented May 4, 2017

I would love to use a standard LRU data structure. However, functools.lru_cache is strictly a decorator; it doesn't let you easily inspect the cache's contents, which I've found inconvenient in the past. I'm inclined to use the simple approach with a deque here until it proves cumbersome. If that fails, I have my own LRU data structures that I can bring in if necessary.
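For comparison, functools.lru_cache exposes only aggregate statistics; the cached values themselves can't be inspected or individually evicted:

```python
from functools import lru_cache

@lru_cache(maxsize=3)
def double(x):
    return x * 2

double(1)
double(2)
print(double.cache_info())   # CacheInfo(hits=0, misses=2, maxsize=3, currsize=2)
# cache_clear() drops everything at once; there is no public API to
# iterate over or invalidate individual entries.
```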

@jakevdp (Contributor) commented May 4, 2017

Makes sense!
