Add row filtering operator #5900

Closed
elyase opened this issue Jan 10, 2014 · 9 comments
Labels
API Design Ideas Long-Term Enhancement Discussions Indexing Related to indexing on series/frames, not to indexes themselves
Comments

elyase commented Jan 10, 2014

This would allow chaining operations like:

(pd.read_csv('imdb.txt')
   .sort(columns='year')
   .filter(lambda x: x['year'] > 1990)   # <--- this is missing in pandas
   .to_csv('filtered.csv'))

For current alternatives see:

http://stackoverflow.com/questions/11869910/pandas-filter-rows-of-dataframe-with-operator-chaining

Contributor

jreback commented Jan 10, 2014

Does this not work?

df = pd.read_csv('imdb.txt').sort(columns='year')
df[df['year']>1990].to_csv('filtered.csv')

Author

elyase commented Jan 10, 2014

Sure that works, but the creation of the unnecessary intermediate variable df interrupts the functional flow that is so nice to have in pandas. Is there something I don't see against this addition?

Contributor

jreback commented Jan 10, 2014

There was a whole discussion about this in #2460, IIRC.

The problem with using the filter function is that it filters an index (and is not what you are doing).
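To illustrate the point about `filter`, here is a small sketch showing that the existing `DataFrame.filter` selects by axis labels rather than by the values in the rows. The sample frame and its labels are made up for illustration and do not come from this thread.

```python
import pandas as pd

# Hypothetical sample data standing in for 'imdb.txt'.
df = pd.DataFrame(
    {"title": ["A", "B", "C"], "year": [1985, 1995, 2005]},
    index=["row_a", "row_b", "row_c"],
)

# filter() matches labels: here, column labels containing "ye".
by_column_label = df.filter(like="ye", axis=1)

# With axis=0 it matches row labels, still not the values stored in the rows.
by_row_label = df.filter(items=["row_b", "row_c"], axis=0)

print(by_column_label)
print(by_row_label)
```

Neither call looks at the contents of the `year` column, which is why reusing the name `filter` for value-based row selection would be ambiguous.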

However, we could potentially do something like this:

pd.read_csv('imdb.txt')
  .sort(columns='year')
  .[lambda x: x['year']>1990]
  .to_csv('filtered.csv')

or

(pd.read_csv('imdb.txt')
   .sort(columns='year')
   .loc[lambda x: x['year'] > 1990]
   .to_csv('filtered.csv'))

Or we could make the first argument of filter accept a callable and then use the axis keyword to modify the resulting selector.

So making __getitem__ and the indexers (iloc/loc/ix) accept a callable that returns a boolean indexer is not too hard.
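A minimal sketch of the callable-indexer idea, assuming a pandas version in which `.loc` accepts a callable and `sort_values` is available (this thread predates both, and uses the older `.sort(columns=...)` spelling). The in-memory frame is made-up sample data in place of `imdb.txt`, and the final `.to_csv(...)` step is left out so the example has no file dependencies.

```python
import pandas as pd

# Hypothetical sample data standing in for 'imdb.txt'.
movies = pd.DataFrame(
    {"title": ["B", "A", "C"], "year": [2001, 1985, 1995]}
)

filtered = (
    movies
    .sort_values("year")
    # The callable receives the sorted frame and returns a boolean mask,
    # so no intermediate variable is needed to refer to it.
    .loc[lambda frame: frame["year"] > 1990]
)

print(filtered)
```

The callable is evaluated against the object being indexed, which is what lets the filter sit in the middle of a method chain.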

Member

cpcloud commented Jan 12, 2014

Couldn't you use query as well? IMO lambdas in loc et al. are a bit of feature creep.

Member

cpcloud commented Jan 12, 2014

Hm, never mind, you would need the local.

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Feb 15, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 1, 2015
@naught101

Might it be possible with patsy to make a filter method that uses a formula string?

(pd.read_csv('imdb.txt')
   .sort(columns='year')
   .filter('year > 1990')
   .to_csv('filtered.csv'))

Member

shoyer commented Sep 16, 2015

@naught101 Using strings to filter dataframes is already possible. The method is query, e.g.,
pd.DataFrame({'x': [1, 2, 3, 4, 5]}).query('x > 3')
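As a sketch of how `query` fits the chaining use case from the original post, the example below filters in the middle of a chain. It assumes a pandas version with `sort_values` (the thread uses the older `.sort(columns=...)`), and the frame and the `cutoff` variable are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for 'imdb.txt'.
movies = pd.DataFrame(
    {"title": ["B", "A", "C"], "year": [2001, 1985, 1995]}
)

# A literal threshold inside the query string.
recent = movies.sort_values("year").query("year > 1990")

# A Python variable from the calling scope, referenced with the @ prefix.
cutoff = 1990
recent_from_variable = movies.sort_values("year").query("year > @cutoff")

print(recent)
```

Because the expression is a string evaluated against the frame's columns, it needs no reference to an intermediate variable holding the frame.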

Contributor

jreback commented Sep 16, 2015

I suppose .query could take a lambda to provide this in-line type of chaining

@jreback jreback modified the milestones: 0.18.0, Next Major Release Jan 31, 2016
Contributor

jreback commented Jan 31, 2016

dupe of #11485 (which has more examples)

@jreback jreback closed this as completed Jan 31, 2016