Skip to content

HTML Parsing Cleanup #5395

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

cancan101
Copy link
Contributor

I added new (public class) Flavor which users can subclass. Currently the expectation is to pass in the class (not an instance of the class)

CC @cpcloud @jtratner @jreback



def test_custom_html_parser1(self):
class _LiberalLxmlFrameParser(_LxmlFrameParser, Flavor):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a place where we should probably use a mock object that just returns something appropriate from the methods that need to be implemented and then check afterwards that they were each called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@jtratner
Copy link
Contributor

jtratner commented Nov 1, 2013

can't add strict=False, that's pretty much the same as supporting a lxml-liberal parser. better to factor out the parser as you suggested.

@cancan101
Copy link
Contributor Author

@jtratner No more strict

@cancan101
Copy link
Contributor Author

It would be great to get this in for v13. If that is feasible and what I have looks reasonable let me know and I will write up docs, etc.

@cancan101
Copy link
Contributor Author

@jtratner Anything else you want me to change before rebase?

@jtratner
Copy link
Contributor

jtratner commented Nov 2, 2013

I'll take a look after you rebase and make the changes.

@cancan101
Copy link
Contributor Author

I am rebasing and adding release notes now. I think I made the requested changes.

@cancan101
Copy link
Contributor Author

@jtratner @cpcloud What about this?

@cancan101
Copy link
Contributor Author

Bump?

@cancan101
Copy link
Contributor Author

Bump
@cpcloud @jtratner @jreback

@cancan101
Copy link
Contributor Author

Any way to get this in for v13?

@jtratner
Copy link
Contributor

jtratner commented Nov 8, 2013

I will try to take a look over the weekend, but I'm hesitant to make
changes to this tricky-ish part of the code right before a release.

@@ -665,10 +682,14 @@ def _print_as_set(s):
def _validate_flavor(flavor):
if flavor is None:
flavor = 'lxml', 'bs4'
elif isinstance(flavor, string_types):
elif (isinstance(flavor, string_types))\
or (type(flavor) is type and Flavor in flavor.__bases__):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you just do isinstance(flavor, type) here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean instead of "type(flavor) is type“ or instead of the entire
predicate?

If the former, I suppose. I'm not sure what subclassing type entails.
On Nov 10, 2013 2:20 PM, "Phillip Cloud" [email protected] wrote:

In pandas/io/html.py:

@@ -665,10 +682,14 @@ def _print_as_set(s):
def _validate_flavor(flavor):
if flavor is None:
flavor = 'lxml', 'bs4'

  • elif isinstance(flavor, string_types):
  • elif (isinstance(flavor, string_types))\
  •    or (type(flavor) is type and Flavor in flavor.**bases**):
    

Can't you just do isinstance(flavor, type) here?


Reply to this email directly or view it on GitHubhttps://github.com//pull/5395/files#r7549704
.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. I'll change to use isinstance. If we want to allow old styles
classes, would want:

isinstance(object, (type, types.ClassType))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of things

isinstance(flavor, type) is incorrect, my bad. But so is checking for type(flavor) is type since in Python 2 not all classes are instances of type they are their own classobj type.

No need to worry about what subclassing type entails, just do this:

issubclass(flavor, Flavor)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I just posted about old styles classes.

Does issubclass work recursively?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also can you break the type checking out into its own function like:

def _string_or_flavor(flav):
    return isinstance(flav, string_types) or issubclass(flav, Flavor)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? issubclass(Flavor, Flavor) will return True. The bases will be checked in a similar way to isinstance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. issubclass should work, but I do need to check that flav is a class. Otherwise, you get:

TypeError: issubclass() arg 1 must be a class

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah right...you can use np.issubclass_ which will return False instead of throwing that error

string for
    the parsing flavor. This allows user written HTML parsers.
@cancan101
Copy link
Contributor Author

@cpcloud Changes pushed.

@cancan101
Copy link
Contributor Author

@cpcloud Any other issues you see?

@gliptak
Copy link
Contributor

gliptak commented Dec 14, 2013

Is this pull request planned to be merged? Thanks

@cpcloud
Copy link
Member

cpcloud commented Dec 14, 2013

Yes, probably in the next month or so...I just need to find a bit of time to dot T's and cross I's :)

@cancan101
Copy link
Contributor Author

Any updates on this?

@cancan101
Copy link
Contributor Author

I hope that this PR is still an acceptable addition to pandas even given the other discussions we have had regarding feature creep.

I feel this represents only a minor change to pandas itself but allows quite a lot of user extensibility.

@cpcloud
Copy link
Member

cpcloud commented Jan 24, 2014

I would say remove the flavor stuff and it's okay by me (which just leaves the IO change). @y-p?

@cancan101
Copy link
Contributor Author

The flavor addition is was makes this PR useful.
On Jan 24, 2014 12:12 PM, "Phillip Cloud" [email protected] wrote:

I would say remove the flavor stuff and it's okay by me (which just leaves
the IO change). @y-p https://github.com/y-p?


Reply to this email directly or view it on GitHubhttps://github.com//pull/5395#issuecomment-33256543
.

@cancan101
Copy link
Contributor Author

As I said, I don't think the flavor addition is feature creep. It just
empowers the user.
On Jan 24, 2014 12:20 PM, "Alex Rothberg" [email protected] wrote:

The flavor addition is was makes this PR useful.
On Jan 24, 2014 12:12 PM, "Phillip Cloud" [email protected]
wrote:

I would say remove the flavor stuff and it's okay by me (which just
leaves the IO change). @y-p https://github.com/y-p?


Reply to this email directly or view it on GitHubhttps://github.com//pull/5395#issuecomment-33256543
.

@ghost
Copy link

ghost commented Jan 24, 2014

So would a shark with a freakin' laser, but that's out of scope too

@cancan101 , we undertsand. You have lots of good ideas on how read_html could
be made more featureful. But, we want to keep read_html small and are fairly happy with the
scope it currently has. I again invite you to start your own project that takes these
ideas and builds a useful, focused tool to serve users that need more powerful HTML/pandas
capabilities.

@ghost
Copy link

ghost commented Jan 24, 2014

We'll be more then glad to include it on the new "pandas ecosystem" section of the docs.

@jreback
Copy link
Contributor

jreback commented Feb 16, 2014

@cpcloud what are we doing with this?

@cpcloud
Copy link
Member

cpcloud commented Feb 16, 2014

closing in favor of not having sharks with laser beams

@cpcloud cpcloud closed this Feb 16, 2014
@jreback
Copy link
Contributor

jreback commented Feb 16, 2014

@cancan101 as noted above....we REALLY appreciate all the effort you have made towards making pandas better. This would be a nice separate project. Clearly if we need hooks to sub-class (the parsing engine wrapper) would be in favor of that.

@cpcloud
Copy link
Member

cpcloud commented Feb 16, 2014

Indeed. @cancan101 You have a lot of good ideas about making a more full featured version of this .... keep your ideas coming.

@cancan101
Copy link
Contributor Author

Since the new html parsing library does not yet exist, it would be great to have some means of (ie place to record) collecting feature requests and ideas beforehand . This applies to not just html parsing but also to other feature requests for which their addition to pandas would be feature creep.

In this specific case, I would want to avoid duplicating existing code so at some point it will make sense to see what hooks would be needed allow a user to tie in an external library.

@jreback
Copy link
Contributor

jreback commented Feb 16, 2014

I believe the wiki is public, so you could create an Enhancements wanted page?

@gliptak
Copy link
Contributor

gliptak commented Feb 17, 2014

Could (some of) this functionality be considered for a pandas-data(?) subproject (allowing for additional library dependencies)?

@jreback
Copy link
Contributor

jreback commented Feb 17, 2014

sure

@gliptak stepping up to manage????

@gliptak
Copy link
Contributor

gliptak commented Dec 16, 2015

Some folks brave decided to carve it out. pydata/pandas-datareader#148

@cancan101
Copy link
Contributor Author

@gliptak I looked over the project that you referenced, but it doesn't seem to have the HTML parsing factored out.

@gliptak
Copy link
Contributor

gliptak commented Feb 18, 2016

@cancan101 In #5404 I proposed to replace bs4/lxml/html5lib with http://phantomjs.org/ (I completed a limited local refactoring, no patch was uploaded). In #5404 it was stated that HTML scraping is not a goal for https://github.com/pydata/pandas With the separation of https://github.com/pydata/pandas-datareader maybe DataReader could have its HTML scraping strengthened.

As per pydata/pandas-datareader#148 http://phantomjs.org/ discontinued Python support so now http://jeanphix.me/Ghost.py/ would be a better choice.

There are several other issues opened around this like pydata/pandas-datareader#171 pydata/pandas-datareader#148

@jreback
Copy link
Contributor

jreback commented Feb 18, 2016

@gliptak the way to go about this is to create a package where you expose an interface. once stable, pandas could offload this type of parsing to your package (which would then become an optional dep), your package could have whatever deps you want. of course cross-platform is best! good luck.

@gliptak
Copy link
Contributor

gliptak commented Feb 22, 2016

These files reference lxml|html5lib|bs4:

./pandas/io/tests/test_data.py
./pandas/io/tests/test_excel.py
./pandas/io/tests/test_html.py
./pandas/io/data.py
./pandas/io/html.py

data.py is deprecated to pandas-datareader. Is html.py also planned to be moved?

Thanks

@jreback
Copy link
Contributor

jreback commented Feb 23, 2016

no, unless/until it is spun off to
pandas-htmlreader (or could be another name)
where we could deprecate and just use the new library

@gliptak
Copy link
Contributor

gliptak commented Mar 13, 2016

@jreback I separated pandas-htmlreader as

https://github.com/gliptak/pandas-htmlreader

The build works for PYTHON=3.5 PANDAS=0.17.1:

https://travis-ci.org/gliptak/pandas-htmlreader/jobs/115686777

Please review and let me know if there is an interest to move forward with this. Thanks

@jreback
Copy link
Contributor

jreback commented Mar 13, 2016

@gliptak looks interesting. So the key would be a version that is a complete drop-in replacement for pandas (IOW doesn't require ANYTHING beyond what the current version does). You can't depend on requests for example.

That's the initial version (say 1.0), but then you can feel free to branch off in (1.1) or whatever and do whatever you'd like, keeping in mind that a compat API is good.

@gliptak
Copy link
Contributor

gliptak commented Mar 13, 2016

I removed the requests dependencies (they came over from pandas-datareader):

https://github.com/gliptak/pandas-htmlreader
https://travis-ci.org/gliptak/pandas-htmlreader

lxml|html5lib|bs4 dependencies are not indicated ...

(PS Instead of starting from an initial commit, I could overlay this on pandas source tree to keep additional history)

@jreback
Copy link
Contributor

jreback commented Mar 13, 2016

no that's fine the key is to have the same exact API as current

@gliptak
Copy link
Contributor

gliptak commented Mar 18, 2016

@jreback Any directions on how to move forward with this? Should I open a new issue? Thanks

@jreback
Copy link
Contributor

jreback commented Mar 20, 2016

yes, you can open a new issue. To be clear this will have to be a drop-in replacement.

@jreback
Copy link
Contributor

jreback commented Mar 20, 2016

best if you can make your replacement completely work (in your namespace), then we can migrate the namespace to pydata/pandas-htmlreader. you then publish a 1.0. which can then be incorporated in pandas at some point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design IO HTML read_html, to_html, Styler.apply, Styler.applymap
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow flavor argument to read_html to be list/instance of _HtmlFrameParser
5 participants