HTML Parsing Cleanup #5395

cancan101 · 2013-10-31T02:43:58Z

Can pass a Flavor to the HTML parser (closes #4594, "Allow flavor argument to read_html to be list/instance of _HtmlFrameParser", and #5130, "Make _LxmlFrameParser more extensible"). Proposed instead of #5131, "ENH: Added lxml-liberal html parsing flavor".
Cleanup API for HTML parsers

I added a new public class, Flavor, which users can subclass. Currently the expectation is to pass in the class (not an instance of the class).

CC @cpcloud @jtratner @jreback

jtratner · 2013-11-01T00:19:36Z

pandas/io/tests/test_html.py

-
+
+    def test_custom_html_parser1(self):
+        class _LiberalLxmlFrameParser(_LxmlFrameParser, Flavor):


This is a place where we should probably use a mock object that just returns something appropriate from the methods that need to be implemented and then check afterwards that they were each called.
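
A minimal sketch of what such a mock-based test could look like, using the standard library's unittest.mock. The Flavor class, its hook names (build_doc, parse_tables, parse_to_rows) and the run_parser driver are hypothetical stand-ins for illustration; the parser interface in this PR may well differ.

```python
from unittest import mock


class Flavor:
    """Hypothetical stand-in for the Flavor base class proposed in this PR."""

    def __init__(self, io):
        self.io = io

    # Hypothetical hook names; the real parser interface may differ.
    def build_doc(self):
        raise NotImplementedError

    def parse_tables(self, doc):
        raise NotImplementedError

    def parse_to_rows(self):
        # Template method: calls each hook a subclass must implement.
        doc = self.build_doc()
        return self.parse_tables(doc)


def run_parser(flavor_cls, io):
    """Toy driver: takes the class (not an instance), as the PR describes."""
    return flavor_cls(io).parse_to_rows()


def check_hooks_called():
    class FakeFlavor(Flavor):
        pass

    # Replace each hook with a mock that returns something appropriate.
    with mock.patch.object(FakeFlavor, "build_doc", return_value="<doc>") as build_doc:
        with mock.patch.object(
            FakeFlavor, "parse_tables", return_value=[["a", "b"]]
        ) as parse_tables:
            result = run_parser(FakeFlavor, "<table></table>")

    # Afterwards, verify that every hook was called exactly once.
    build_doc.assert_called_once_with()
    parse_tables.assert_called_once_with("<doc>")
    return result
```

The point of the mock is that the test checks the contract (which hooks get called, and with what) rather than the behaviour of any particular HTML library.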

jtratner · 2013-11-01T00:20:33Z

can't add strict=False, that's pretty much the same as supporting a lxml-liberal parser. better to factor out the parser as you suggested.

cancan101 · 2013-11-01T04:10:04Z

@jtratner No more strict

cancan101 · 2013-11-01T04:11:24Z

It would be great to get this in for v13. If that is feasible and what I have looks reasonable let me know and I will write up docs, etc.

cancan101 · 2013-11-02T00:57:17Z

@jtratner Anything else you want me to change before rebase?

jtratner · 2013-11-02T00:58:27Z

I'll take a look after you rebase and make the changes.

cancan101 · 2013-11-02T00:59:26Z

I am rebasing and adding release notes now. I think I made the requested changes.

cancan101 · 2013-11-02T21:02:11Z

@jtratner @cpcloud What about this?

cancan101 · 2013-11-05T01:43:45Z

Bump?

cancan101 · 2013-11-07T16:32:07Z

Bump
@cpcloud @jtratner @jreback

cancan101 · 2013-11-08T22:31:16Z

Any way to get this in for v13?

jtratner · 2013-11-08T23:01:31Z

I will try to take a look over the weekend, but I'm hesitant to make
changes to this tricky-ish part of the code right before a release.

cpcloud · 2013-11-10T19:20:33Z

pandas/io/html.py

@@ -665,10 +682,14 @@ def _print_as_set(s):
 def _validate_flavor(flavor):
    if flavor is None:
        flavor = 'lxml', 'bs4'
-    elif isinstance(flavor, string_types):
+    elif (isinstance(flavor, string_types))\
+        or (type(flavor) is type and Flavor in flavor.__bases__):


Can't you just do isinstance(flavor, type) here?

Do you mean instead of "type(flavor) is type" or instead of the entire
predicate?

If the former, I suppose. I'm not sure what subclassing type entails.

Okay. I'll change it to use isinstance. If we want to allow old-style
classes, we would want:

isinstance(object, (type, types.ClassType))

Couple of things

isinstance(flavor, type) is incorrect, my bad. But so is checking for type(flavor) is type, since in Python 2 not all classes are instances of type; old-style classes have their own classobj type.

No need to worry about what subclassing type entails, just do this:

issubclass(flavor, Flavor)

Yeah, I just posted about old-style classes.

Does issubclass work recursively?

Also can you break the type checking out into its own function like:

def _string_or_flavor(flav): return isinstance(flav, string_types) or issubclass(flav, Flavor)

What do you mean? issubclass(Flavor, Flavor) will return True. The bases will be checked in a similar way to isinstance.

Okay. issubclass should work, but I do need to check that flav is a class. Otherwise, you get:

TypeError: issubclass() arg 1 must be a class

ah right...you can use np.issubclass_ which will return False instead of throwing that error
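
For reference, a minimal sketch of the check discussed above, written for Python 3 with only the standard library. The Flavor class here is a hypothetical stand-in, str replaces string_types, and inspect.isclass is swapped in for np.issubclass_ as the guard against non-class arguments; old-style classes are not a concern on Python 3.

```python
import inspect


class Flavor:
    """Hypothetical stand-in for the Flavor base class proposed in this PR."""


def is_string_or_flavor(flav):
    """Return True for a flavor name (str) or for any subclass of Flavor."""
    if isinstance(flav, str):
        return True
    # issubclass() raises TypeError when its first argument is not a class,
    # so check that first. issubclass() itself walks the whole inheritance
    # chain, so indirect subclasses are accepted too.
    return inspect.isclass(flav) and issubclass(flav, Flavor)


class Base(Flavor):
    pass


class Derived(Base):
    pass
```

With these definitions, is_string_or_flavor accepts "lxml", Flavor itself, and Derived (an indirect subclass), while an instance such as Flavor() or an unrelated class such as int is rejected without raising.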


cancan101 · 2013-11-10T20:44:17Z

@cpcloud Changes pushed.

cancan101 · 2013-11-13T00:52:39Z

@cpcloud Any other issues you see?

gliptak · 2013-12-14T21:10:30Z

Is this pull request planned to be merged? Thanks

cpcloud · 2013-12-14T21:11:39Z

Yes, probably in the next month or so...I just need to find a bit of time to dot T's and cross I's :)

cancan101 · 2014-01-07T04:00:12Z

Any updates on this?

cancan101 · 2014-01-24T19:33:31Z

I hope that this PR is still an acceptable addition to pandas even given the other discussions we have had regarding feature creep.

I feel this represents only a minor change to pandas itself but allows quite a lot of user extensibility.

cpcloud · 2014-01-24T20:12:25Z

I would say remove the flavor stuff and it's okay by me (which just leaves the IO change). @y-p?

cancan101 · 2014-01-24T20:20:50Z

The flavor addition is what makes this PR useful.

cancan101 · 2014-01-24T20:21:47Z

As I said, I don't think the flavor addition is feature creep. It just
empowers the user.

ghost · 2014-01-24T20:28:53Z

~~So would a shark with a freakin' laser, but that's out of scope too~~

@cancan101, we understand. You have lots of good ideas on how read_html could
be made more featureful. But, we want to keep read_html small and are fairly happy with the
scope it currently has. I again invite you to start your own project that takes these
ideas and builds a useful, focused tool to serve users that need more powerful HTML/pandas
capabilities.

ghost · 2014-01-24T20:32:24Z

We'll be more than glad to include it on the new "pandas ecosystem" section of the docs.

jreback · 2014-02-16T22:01:49Z

@cpcloud what are we doing with this?

cpcloud · 2014-02-16T22:04:20Z

closing in favor of not having sharks with laser beams

jreback · 2014-02-16T22:06:10Z

@cancan101 as noted above....we REALLY appreciate all the effort you have made towards making pandas better. This would be a nice separate project. Clearly, if we need hooks to subclass (the parsing engine wrapper), we would be in favor of that.

cpcloud · 2014-02-16T22:09:10Z

Indeed. @cancan101 You have a lot of good ideas about making a more full featured version of this .... keep your ideas coming.

cancan101 · 2014-02-16T22:48:22Z

Since the new HTML parsing library does not yet exist, it would be great to have some place to record feature requests and ideas beforehand. This applies not just to HTML parsing but also to other feature requests whose addition to pandas would be feature creep.

In this specific case, I would want to avoid duplicating existing code, so at some point it will make sense to see what hooks would be needed to allow a user to tie in an external library.

jreback · 2014-02-16T22:51:10Z

I believe the wiki is public, so you could create an Enhancements wanted page?

gliptak · 2014-02-17T17:42:12Z

Could (some of) this functionality be considered for a pandas-data(?) subproject (allowing for additional library dependencies)?

jreback · 2014-02-17T18:38:31Z

sure

@gliptak stepping up to manage????

gliptak · 2015-12-16T02:26:10Z

Some brave folks decided to carve it out. pydata/pandas-datareader#148

cancan101 · 2016-02-18T06:50:52Z

@gliptak I looked over the project that you referenced, but it doesn't seem to have the HTML parsing factored out.

gliptak · 2016-02-18T13:33:30Z

@cancan101 In #5404 I proposed replacing bs4/lxml/html5lib with http://phantomjs.org/ (I completed a limited local refactoring; no patch was uploaded). In #5404 it was stated that HTML scraping is not a goal for https://github.com/pydata/pandas. With the separation of https://github.com/pydata/pandas-datareader, maybe DataReader could have its HTML scraping strengthened.

As per pydata/pandas-datareader#148, http://phantomjs.org/ discontinued Python support, so now http://jeanphix.me/Ghost.py/ would be a better choice.

There are several other issues opened around this like pydata/pandas-datareader#171 pydata/pandas-datareader#148

jreback · 2016-02-18T14:37:50Z

@gliptak the way to go about this is to create a package where you expose an interface. once stable, pandas could offload this type of parsing to your package (which would then become an optional dep), your package could have whatever deps you want. of course cross-platform is best! good luck.

gliptak · 2016-02-22T23:58:37Z

These files reference lxml|html5lib|bs4:

./pandas/io/tests/test_data.py
./pandas/io/tests/test_excel.py
./pandas/io/tests/test_html.py
./pandas/io/data.py
./pandas/io/html.py

data.py is deprecated to pandas-datareader. Is html.py also planned to be moved?

Thanks

jreback · 2016-02-23T00:12:47Z

no, unless/until it is spun off to
pandas-htmlreader (or could be another name)
where we could deprecate and just use the new library

gliptak · 2016-03-13T16:28:49Z

@jreback I separated pandas-htmlreader as

https://github.com/gliptak/pandas-htmlreader

The build works for PYTHON=3.5 PANDAS=0.17.1:

https://travis-ci.org/gliptak/pandas-htmlreader/jobs/115686777

Please review and let me know if there is an interest to move forward with this. Thanks

jreback · 2016-03-13T16:42:38Z

@gliptak looks interesting. So the key would be a version that is a complete drop-in replacement for pandas (IOW doesn't require ANYTHING beyond what the current version does). You can't depend on requests for example.

That's the initial version (say 1.0), but then you can feel free to branch off in (1.1) or whatever and do whatever you'd like, keeping in mind that a compat API is good.

gliptak · 2016-03-13T17:48:41Z

I removed the requests dependencies (they came over from pandas-datareader):

https://github.com/gliptak/pandas-htmlreader
https://travis-ci.org/gliptak/pandas-htmlreader

lxml|html5lib|bs4 dependencies are not indicated ...

(PS Instead of starting from an initial commit, I could overlay this on pandas source tree to keep additional history)

jreback · 2016-03-13T18:05:52Z

no that's fine the key is to have the same exact API as current

gliptak · 2016-03-18T14:52:03Z

@jreback Any directions on how to move forward with this? Should I open a new issue? Thanks

jreback · 2016-03-20T15:24:19Z

yes, you can open a new issue. To be clear this will have to be a drop-in replacement.

jreback · 2016-03-20T15:25:36Z

best if you can make your replacement completely work (in your namespace), then we can migrate the namespace to pydata/pandas-htmlreader. you then publish a 1.0. which can then be incorporated in pandas at some point.

gliptak mentioned this pull request Oct 31, 2013

html parsing with phantomjs? #5404

Closed

jtratner reviewed Nov 1, 2013

cancan101 mentioned this pull request Nov 9, 2013

io.html.read_html support XPath expressions for table selection #5416

Closed

cpcloud reviewed Nov 10, 2013

ENH: read_html can accept a subclass of Flavor rather than a string for the parsing flavor. This allows user written HTML parsers.

829ef04

jreback added HTML labels Feb 16, 2014

cpcloud closed this Feb 16, 2014


