-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
HTML Parsing Cleanup #5395
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HTML Parsing Cleanup #5395
Conversation
|
||
|
||
def test_custom_html_parser1(self): | ||
class _LiberalLxmlFrameParser(_LxmlFrameParser, Flavor): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a place where we should probably use a mock object that just returns something appropriate from the methods that need to be implemented and then check afterwards that they were each called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
can't add strict=False, that's pretty much the same as supporting a lxml-liberal parser. better to factor out the parser as you suggested. |
@jtratner No more strict |
It would be great to get this in for v13. If that is feasible and what I have looks reasonable let me know and I will write up docs, etc. |
@jtratner Anything else you want me to change before rebase? |
I'll take a look after you rebase and make the changes. |
I am rebasing and adding release notes now. I think I made the requested changes. |
Bump? |
Any way to get this in for v13? |
I will try to take a look over the weekend, but I'm hesitant to make |
@@ -665,10 +682,14 @@ def _print_as_set(s): | |||
def _validate_flavor(flavor): | |||
if flavor is None: | |||
flavor = 'lxml', 'bs4' | |||
elif isinstance(flavor, string_types): | |||
elif (isinstance(flavor, string_types))\ | |||
or (type(flavor) is type and Flavor in flavor.__bases__): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't you just do isinstance(flavor, type)
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean instead of "type(flavor) is type
“ or instead of the entire
predicate?
If the former, I suppose. I'm not sure what subclassing type entails.
On Nov 10, 2013 2:20 PM, "Phillip Cloud" [email protected] wrote:
In pandas/io/html.py:
@@ -665,10 +682,14 @@ def _print_as_set(s):
def _validate_flavor(flavor):
if flavor is None:
flavor = 'lxml', 'bs4'
- elif isinstance(flavor, string_types):
- elif (isinstance(flavor, string_types))\
or (type(flavor) is type and Flavor in flavor.**bases**):
Can't you just do isinstance(flavor, type) here?
—
Reply to this email directly or view it on GitHubhttps://github.com//pull/5395/files#r7549704
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. I'll change to use isinstance. If we want to allow old styles
classes, would want:
isinstance(object, (type, types.ClassType))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of things
isinstance(flavor, type)
is incorrect, my bad. But so is checking for type(flavor) is type
since in Python 2 not all classes are instances of type
they are their own classobj
type.
No need to worry about what subclassing type
entails, just do this:
issubclass(flavor, Flavor)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I just posted about old styles classes.
Does issubclass
work recursively?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also can you break the type checking out into its own function like:
def _string_or_flavor(flav):
return isinstance(flav, string_types) or issubclass(flav, Flavor)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean? issubclass(Flavor, Flavor)
will return True
. The bases will be checked in a similar way to isinstance
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. issubclass
should work, but I do need to check that flav
is a class. Otherwise, you get:
TypeError: issubclass() arg 1 must be a class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah right...you can use np.issubclass_
which will return False
instead of throwing that error
string for the parsing flavor. This allows user written HTML parsers.
@cpcloud Changes pushed. |
@cpcloud Any other issues you see? |
Is this pull request planned to be merged? Thanks |
Yes, probably in the next month or so...I just need to find a bit of time to dot T's and cross I's :) |
Any updates on this? |
I hope that this PR is still an acceptable addition to pandas even given the other discussions we have had regarding feature creep. I feel this represents only a minor change to pandas itself but allows quite a lot of user extensibility. |
I would say remove the flavor stuff and it's okay by me (which just leaves the IO change). @y-p? |
The flavor addition is was makes this PR useful.
|
As I said, I don't think the flavor addition is feature creep. It just
|
@cancan101 , we undertsand. You have lots of good ideas on how |
We'll be more then glad to include it on the new "pandas ecosystem" section of the docs. |
@cpcloud what are we doing with this? |
closing in favor of not having sharks with laser beams |
@cancan101 as noted above....we REALLY appreciate all the effort you have made towards making pandas better. This would be a nice separate project. Clearly if we need hooks to sub-class (the parsing engine wrapper) would be in favor of that. |
Indeed. @cancan101 You have a lot of good ideas about making a more full featured version of this .... keep your ideas coming. |
Since the new html parsing library does not yet exist, it would be great to have some means of (ie place to record) collecting feature requests and ideas beforehand . This applies to not just html parsing but also to other feature requests for which their addition to pandas would be feature creep. In this specific case, I would want to avoid duplicating existing code so at some point it will make sense to see what hooks would be needed allow a user to tie in an external library. |
I believe the wiki is public, so you could create an Enhancements wanted page? |
Could (some of) this functionality be considered for a pandas-data(?) subproject (allowing for additional library dependencies)? |
sure @gliptak stepping up to manage???? |
Some folks brave decided to carve it out. pydata/pandas-datareader#148 |
@gliptak I looked over the project that you referenced, but it doesn't seem to have the HTML parsing factored out. |
@cancan101 In #5404 I proposed to replace bs4/lxml/html5lib with http://phantomjs.org/ (I completed a limited local refactoring, no patch was uploaded). In #5404 it was stated that HTML scraping is not a goal for https://github.com/pydata/pandas With the separation of https://github.com/pydata/pandas-datareader maybe DataReader could have its HTML scraping strengthened. As per pydata/pandas-datareader#148 http://phantomjs.org/ discontinued Python support so now http://jeanphix.me/Ghost.py/ would be a better choice. There are several other issues opened around this like pydata/pandas-datareader#171 pydata/pandas-datareader#148 |
@gliptak the way to go about this is to create a package where you expose an interface. once stable, pandas could offload this type of parsing to your package (which would then become an optional dep), your package could have whatever deps you want. of course cross-platform is best! good luck. |
These files reference
Thanks |
no, unless/until it is spun off to |
@jreback I separated pandas-htmlreader as https://github.com/gliptak/pandas-htmlreader The build works for https://travis-ci.org/gliptak/pandas-htmlreader/jobs/115686777 Please review and let me know if there is an interest to move forward with this. Thanks |
@gliptak looks interesting. So the key would be a version that is a complete drop-in replacement for pandas (IOW doesn't require ANYTHING beyond what the current version does). You can't depend on That's the initial version (say 1.0), but then you can feel free to branch off in (1.1) or whatever and do whatever you'd like, keeping in mind that a compat API is good. |
I removed the https://github.com/gliptak/pandas-htmlreader
(PS Instead of starting from an initial commit, I could overlay this on |
no that's fine the key is to have the same exact API as current |
@jreback Any directions on how to move forward with this? Should I open a new issue? Thanks |
yes, you can open a new issue. To be clear this will have to be a drop-in replacement. |
best if you can make your replacement completely work (in your namespace), then we can migrate the namespace to |
Flavor
to HTML parse (closes Allow flavor argument to read_html to be list/instance of _HtmlFrameParser #4594 and Make _LxmlFrameParser more extensible #5130). Instead of ENH: Added lxml-liberal html parsing flavor #5131I added new (public class)
Flavor
which users can subclass. Currently the expectation is to pass in the class (not an instance of the class)CC @cpcloud @jtratner @jreback