
ENH: read_html has no timeout #6029


Closed

MichaelWS opened this issue Jan 21, 2014 · 16 comments
Labels
Enhancement · IO Data (IO issues that don't fit into a more specific label)

Comments

@MichaelWS
Contributor

Is there any reason why a user cannot pass a timeout to read_html? I have a bunch of scripts that parse tables on the web, and I noticed this while having an issue with one site.

If I add a PR for a timeout, how would I test it?

@cpcloud
Member

cpcloud commented Jan 22, 2014

Did you have a design in mind?

For example, you could give a warning on timeout or raise an exception. You'd then do something like

import pandas.util.testing as tm
from pandas import read_html

# TimeoutError stands in for whatever exception the new code would raise
with tm.assertRaisesRegexp(TimeoutError, "read_html timed out"):
    dfs = read_html('http://timing-out.com')

And be sure to mark your test as @slow so that it doesn't get run on the fast(er) CI builds.

@cpcloud
Member

cpcloud commented Jan 22, 2014

No particular reason this isn't there. One reason this could be happening is that some HTML is malformed enough that the backend parser gets stuck in a cycle (e.g., child nodes become their own parents, so calling some sort of next_child method returns the node itself) and just keeps iterating until the cows come home. I'll try to find the other issue where I discovered this.
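
A minimal sketch of that failure mode (the node API here is hypothetical, not pandas internals):

def walk(node):
    # If malformed markup makes a node reachable from itself, a naive
    # traversal loops forever; tracking visited nodes turns that into
    # an error instead.
    seen = set()
    while node is not None:
        if id(node) in seen:
            raise ValueError("cycle detected while walking the HTML tree")
        seen.add(id(node))
        yield node
        node = node.next_child()  # hypothetical traversal method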

@cpcloud
Member

cpcloud commented Jan 22, 2014

xref: #4786

@MichaelWS
Contributor Author

Interesting. I did not consider it to be malformed. My problem was that I ran the identical code multiple times, so I concluded it was a timeout issue. The code parsed everything fine on another attempt.

(I had to parse the old SEC data myself because it was so poorly formed. Luckily, it's now XML going forward.)

@MichaelWS
Contributor Author

My design would just pass the timeout through to urlopen and raise the exception.
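
A minimal sketch of that design, assuming a hypothetical _fetch helper (the real pandas code paths differ):

import urllib2  # urllib.request on Python 3

def _fetch(url, timeout=None):
    # Pass the user's timeout straight through to urlopen; a slow host
    # then raises socket.timeout instead of hanging indefinitely.
    if timeout is not None:
        return urllib2.urlopen(url, timeout=timeout).read()
    return urllib2.urlopen(url).read()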

@cpcloud
Member

cpcloud commented Jan 22, 2014

@MichaelWS In that case, the cycle issue probably doesn't apply. If you're able to parse it sometimes and not others, the site you're calling could be rate-limiting the number of requests you're allowed to make per unit time.

@cpcloud
Member

cpcloud commented Jan 22, 2014

@MichaelWS Okay. Should be a simple enough change.

@cpcloud
Member

cpcloud commented Jan 22, 2014

> Luckily, it's now XML going forward

Famous last words :)

@cancan101
Contributor

What are people's thoughts on moving the HTML parsing code from urllib/urllib2 to Requests (http://docs.python-requests.org/en/latest/)?

Requests supports timeouts (http://docs.python-requests.org/en/latest/user/quickstart/?highlight=timeout#timeouts).

In the past (when not using requests), I have used socket.settimeout (http://docs.python.org/2/library/socket.html#socket.socket.settimeout).
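
Both workarounds in a minimal sketch (the URL is a placeholder):

import socket
import requests
import pandas as pd

# Option 1: let requests enforce the timeout, then hand pandas the HTML.
r = requests.get('http://example.com/tables.html', timeout=5)
dfs = pd.read_html(r.content)

# Option 2: set a process-wide default socket timeout; this affects every
# socket opened afterwards, including the one read_html opens internally.
socket.setdefaulttimeout(5)
dfs = pd.read_html('http://example.com/tables.html')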

@cpcloud
Member

cpcloud commented Jan 22, 2014

@cancan101 requests is awesome. However, that's yet another dependency for HTML parsing. There are already too many. And making it optional would only serve to make someone's hair a bit more gray.

@cancan101
Contributor

@cpcloud At some point it might make sense to spin the HTML parsing out of pandas into its own project. There have been some features filed by me and others that have been (reasonably) noted as beyond the scope of pandas.

Doing this would also simplify the dependencies of pandas itself.

@cancan101
Contributor

Here is another example of the infinite loop in parsing: #4770 (comment)

@ghost

ghost commented Jan 22, 2014

@cancan101, you are highly encouraged to go ahead and start such a project. It probably won't become a new dependency, but it will be a useful tool with its own focus that integrates well with pandas. We love those.

I was going to +1 the timeout arg suggestion (urllib2.urlopen supports a timeout argument, btw), but after a little more thought, I'm not so sure.

First, some code:

import requests
import pandas as pd

r = requests.get('http://en.wikipedia.org/wiki/List_of_Treme_episodes')
dfs = pd.read_html(r.content)  # read_html returns a list of DataFrames

Works just fine. Think about that: you can use your powerful tool of choice to handle the network details and then hand pandas the data. Isn't that just perfect? Let's say timeouts seem like a reasonable thing to add (they kind of do). What about HTTP basic auth? POST requests? Custom HTTP headers? Requests does a lot of things; should we expose them all through the method signature?

If the pandas plotting/mpl wrapper approach taught us anything, it's that you can end up in a bad way if you keep trying to expose more and more of the underlying functionality through your interface. (*)

read_html is about parsing HTML into DataFrames. It's not a Requests/urllib wrapper, and it's not a bs4 replacement. If you need the power of requests, use it; pandas will cooperate. If you need the power of xpath expressions, use those; pandas will accept the data you give it.
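
The same division of labor works with xpath; a sketch with a hypothetical URL and selector:

import requests
from lxml import html
import pandas as pd

# Select exactly the table you want yourself, then hand pandas the markup.
page = html.fromstring(requests.get('http://example.com/stats').content)
table = page.xpath('//table[@id="results"]')[0]
df = pd.read_html(html.tostring(table, encoding='unicode'))[0]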

This is a bad pattern we seem to fall into time and time again. I suggest we start being really acerbic when confronted with the syllogism: "You rely on dep X. Dep X can do Y. Therefore you should do Y." I think the user should just use X directly in many (not all; balance!) of those cases.

(*) If you're a design patterns freak, this is the Facade pattern. Simplification is the whole point.

@jreback
Contributor

jreback commented Jan 22, 2014

👍 on starting a fork of the parsers; as @y-p points out,

there are probably many, many variants which people would like to parse; pandas handles the most common, and will accept your data in any event.

@MichaelWS
Contributor Author

OK, y-p's point makes sense. I will use requests for my site and parse with pandas. That's why I posted the issue instead of simply putting up a PR.

I am not sure about a separate parsing library. I think it is more likely that people would contribute if it remained part of pandas.

@ghost ghost mentioned this issue Jan 24, 2014
@ghost

ghost commented Jan 27, 2014

Thank you, I think that's the right way to go.

@ghost ghost closed this as completed Jan 27, 2014