Skip to content

ENH: Rolling rank #43338

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Sep 14, 2021
Merged

ENH: Rolling rank #43338

merged 18 commits into from
Sep 14, 2021

Conversation

gsiano
Copy link
Contributor

@gsiano gsiano commented Aug 31, 2021

xref #9481

gsiano added 5 commits August 31, 2021 18:47
rolling rank and percentile rank using skiplist
rolling rank and percentile rank using skiplist
remove unused variables, fix indentation, add comment to roll_rank()
don't fill output when nobs < minp
@pep8speaks
Copy link

pep8speaks commented Aug 31, 2021

Hello @gsiano! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-09-13 18:29:13 UTC

address lint warnings
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add this to the asvs as well.

@jreback
Copy link
Contributor

jreback commented Aug 31, 2021

cc @mroeschke

@jreback
Copy link
Contributor

jreback commented Aug 31, 2021

will this close #9481 ? (can pull some examples fromt here)

@jreback jreback added Enhancement Window rolling, ewma, expanding labels Aug 31, 2021
@jreback
Copy link
Contributor

jreback commented Aug 31, 2021

as a followup (can create an issue), need tests / impl for groupby.rolling.rank (it should just work though)

@jreback jreback changed the title Rolling rank ENH: Rolling rank Aug 31, 2021
Copy link
Member

@mzeitlin11 mzeitlin11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pr @gsiano! Some comments on the cython code

raise MemoryError on skiplist_insert failure, destroy skiplist before
re-init, and other minor changes
@mroeschke
Copy link
Member

What happens currently when there are ties?

DataFrame.rank currently has a method argument to control that behavior https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rank.html#pandas.DataFrame.rank

implement min, max, and average rank methods
@gsiano
Copy link
Contributor Author

gsiano commented Sep 1, 2021

What happens currently when there are ties?

DataFrame.rank currently has a method argument to control that behavior https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rank.html#pandas.DataFrame.rank

I added the min, max, and average rank methods. I still have to add the dense and first methods, and I'll look into adding the other flags from DataFrame.rank

@gsiano
Copy link
Contributor Author

gsiano commented Sep 1, 2021

I added the min, max, and average rank methods. I still have to add the dense and first methods, and I'll look into adding the other flags from DataFrame.rank

the first method doesn't seem relevant for rolling rank, and dense probably deserves a separate pr

@mzeitlin11
Copy link
Member

I added the min, max, and average rank methods. I still have to add the dense and first methods, and I'll look into adding the other flags from DataFrame.rank

the first method doesn't seem relevant for rolling rank, and dense probably deserves a separate pr

+1 on pushing dense to a separate pr. ascending would be a nice flag to have ... I wonder if something like just adding the negative of the value to the skiplist would handle flipping the order

@gsiano
Copy link
Contributor Author

gsiano commented Sep 1, 2021

+1 on pushing dense to a separate pr. ascending would be a nice flag to have ... I wonder if something like just adding the negative of the value to the skiplist would handle flipping the order

That sounds like it will work. I'll try adding it.

added the `ascending` flag, various cleanups, expanded tests and asv benchmark
remove unimplemented rank types
@jreback jreback added this to the 1.4 milestone Sep 2, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, can you add

added tests for `Expanding`. added doc strings and whatsnew note
@gsiano
Copy link
Contributor Author

gsiano commented Sep 4, 2021

looks good, can you add

I added a test for expanding and updated the docs and whatsnew file. Is there a way to render the docs locally so I can see how it looks?

@mroeschke
Copy link
Member

Is there a way to render the docs locally so I can see how it looks?

https://pandas.pydata.org/pandas-docs/stable/development/contributing_documentation.html#how-to-build-the-pandas-documentation

@pytest.mark.parametrize("ascending", [True, False])
@pytest.mark.parametrize("test_data", ["default", "duplicates", "nans"])
def test_rank(window, method, pct, ascending, test_data):
length = 1000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Could we test with a smaller sample (e.g. 20) and adjust window accordingly?

@pytest.mark.parametrize("ascending", [True, False])
@pytest.mark.parametrize("test_data", ["default", "duplicates", "nans"])
def test_rank(window, method, pct, ascending, test_data):
length = 1000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

Copy link
Member

@mzeitlin11 mzeitlin11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comments, otherwise LGTM!

int64_t nobs = 0, win
float64_t val
skiplist_t *skiplist
float64_t[::1] output = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
float64_t[::1] output = None
float64_t[::1] output

NBD since doesn't affect correctness, but I find this clearer since None initialization usually used only when there's a path where the variable might not end up initialized. Also generates a bit less code :)

else:
rank = NaN
if nobs >= minp:
output[i] = <float64_t>(rank) / nobs if percentile else rank
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the cast here necessary?

)
def rank(
self,
method: str = "average",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be consistent this could be the same Literal you defined in aggregations.pyi? Maybe that could be aliased in pandas/_typing.py, so that then if we do something like add dense then we only need to change typing in one place.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah these should be consisteent across the rank methods (could be a followup to fix this)

elif test_data == "duplicates":
ser = Series(data=np.random.choice(3, length))
elif test_data == "nans":
ser = Series(data=np.random.choice([1.0, 0.25, 0.75, np.nan], length))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you throw an inf and -inf in here? Not really a special case I guess with how it's implemented in rolling_rank, but it is special cased in other rank algos and were some issues before with inf and nan, so covering here would be good

elif test_data == "duplicates":
ser = Series(data=np.random.choice(3, length))
elif test_data == "nans":
ser = Series(data=np.random.choice([1.0, 0.25, 0.75, np.nan], length))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above about inf


.. ipython:: python

>>> s = pd.Series([1, 4, 2, 3, 5, 3])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just put the commands; they will render during the doc-build (e.g. L100 w/o the '>>>' and not below)

5 2.0
dtype: float64

>>> s.expanding().rank()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could add a comment; this works for expanding as well (or kill the expanding, up 2 you)

@@ -1139,6 +1141,120 @@ def roll_quantile(const float64_t[:] values, ndarray[int64_t] start,
return output


cdef enum RankType:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use the enums defined in pandas/_libs/algos.pyx? e.g. the TIEBREAK_AVERAGE. may need to move the to algox.pyd to import properly.

)
def rank(
self,
method: str = "average",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah these should be consisteent across the rank methods (could be a followup to fix this)

@@ -1409,6 +1409,22 @@ def quantile(self, quantile: float, interpolation: str = "linear", **kwargs):

return self._apply(window_func, name="quantile", **kwargs)

def rank(
self,
method: str = "average",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment here

addressing comments
add rolling rank tiebreakers dict
fix docs warnings - remove '>>>' from code block
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. small comment to consider for followup.

cc @mzeitlin11 over to you

@@ -1139,6 +1143,122 @@ def roll_quantile(const float64_t[:] values, ndarray[int64_t] start,
return output


rolling_rank_tiebreakers = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possible to unify these with the same in algos.pyx?

@jreback
Copy link
Contributor

jreback commented Sep 14, 2021

oh see you approved, ok then.

@jreback jreback merged commit 9f90bd4 into pandas-dev:master Sep 14, 2021
@jreback
Copy link
Contributor

jreback commented Sep 14, 2021

thanks @gsiano very nice! pls create an issue for for the requested followups (PR as well if you can .....) keep em coming!

@gsiano gsiano mentioned this pull request Sep 15, 2021
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Window rolling, ewma, expanding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement high performance rolling_rank
5 participants