ENH: read_html has no timeout #6029
Is there any reason why a user cannot pass in a timeout to read_html? I have a bunch of scripts that parse some tables on the web, and I was having an issue with a site when I noticed it.

If I add a PR for timeout, how would I test it?
Did you have a design in mind? For example, you could give a warning on timeout or raise an exception. You'd then do something like

```python
with tm.assertRaisesRegexp(TimeoutError, "read_html timed out"):
    dfs = read_html('timing-out.com')
```

And be sure to mark your test as a network test.
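A fuller sketch of what that test might look like, assuming the `pandas.util.testing` utilities of that era (`assertRaisesRegexp` and the `network` decorator); the `timeout` kwarg, the exception message, and the URL are all hypothetical here:

```python
import pandas.util.testing as tm
from pandas import read_html
from pandas.util.testing import network

@network
def test_read_html_timeout():
    # Hypothetical: assumes read_html grows a `timeout` kwarg and raises
    # a TimeoutError whose message matches the pattern below.
    with tm.assertRaisesRegexp(TimeoutError, "read_html timed out"):
        read_html('http://timing-out.example.com', timeout=0.001)
```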
No particular reason this isn't there. One reason this could be happening is that some HTML is malformed enough that the backend parser gets stuck in a cycle (e.g., child nodes become their own parents, so calling some sort of recursive traversal never terminates).
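A toy illustration of that failure mode (hypothetical, not pandas internals): a node that lists itself among its own children defeats any naive recursive walk.

```python
# A node that contains itself as a child: a naive recursive walk of this
# "tree" revisits the same node forever (until the recursion limit).
node = {'tag': 'table', 'children': []}
node['children'].append(node)  # cycle: the node is its own child

def walk(n):
    for child in n['children']:
        walk(child)

# walk(node)  # commented out: recurses until RecursionError is raised
```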
xref: #4786
Interesting. I did not consider it to be malformed. My problem was that I ran the identical code multiple times, so I concluded it was a timeout issue. The code literally parsed everything fine on another attempt. (I had to parse the old SEC data myself because it was so poorly formed. Luckily, it's now XML going forward.)
My design would just pass the timeout through to urlopen and raise the exception.
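A minimal sketch of that design, assuming the Python 2-era urllib2 that pandas used at the time; the helper name and plumbing are hypothetical:

```python
import urllib2

def _fetch_url(url, timeout=None):
    # Hypothetical helper: forward an optional timeout to urlopen and let
    # urllib2.URLError / socket.timeout propagate to the caller.
    if timeout is not None:
        return urllib2.urlopen(url, timeout=timeout).read()
    return urllib2.urlopen(url).read()
```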
@MichaelWS In that case, the cycle issue probably doesn't apply. If you're able to parse it sometimes and not others, the API you're calling into could be placing a limit on the number of requests you're allowed to make per unit time.

@MichaelWS Okay. Should be a simple enough change.

Famous last words :)
What are people's thoughts on moving the html parsing code from using urllib2 to using requests?

In the past (when not using requests), I have used …
@cancan101
@cpcloud At some point it might make sense to spin the HTML parsing out of pandas into its own project. There have been some features filed by myself and others that have been (reasonably so) noted as beyond the scope of pandas. Doing this would simplify the dependencies of pandas itself.
Here is another example of the infinite loop in parsing: #4770 (comment)
@cancan101, you are highly encouraged to go ahead and start such a project. I was going to +1 the timeout arg suggestion (urllib2.urlopen supports a timeout argument), but reconsidered. First, some code:
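(The snippet itself didn't survive extraction; a plausible reconstruction, assuming requests and a placeholder URL:)

```python
import pandas as pd
import requests

# Fetch the page yourself, with whatever timeout/auth/retry handling you
# need, then hand the HTML string to read_html for parsing only.
html = requests.get('http://example.com/tables.html', timeout=5).text
dfs = pd.read_html(html)
```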
Works just fine. Think about that. You can use your powerful tool of choice to handle the network details and hand the result to pandas.

If the pandas plotting/mpl wrapper approach taught us anything, it is that you can end up in a bad place by wrapping ever more of another library's API.

read_html is about parsing HTML into dataframes. It's not a Requests/urllib wrapper, it's not a bs4 replacement. This is a bad pattern we seem to fall into time and time again. I suggest we start being really acerbic when such suggestions come up.

(*) If you're a design patterns freak, this is Facade. Simplified is the whole point.
👍 on starting a fork of the parsers; as @y-p points out, there are probably many, many variants which people would like to parse; pandas handles the most common, and will accept your data in any event.
OK, y-p's point makes sense. I will use requests for my site and parse with pandas. That's why I posted the issue instead of simply putting up a PR. I am not sure about a separate parsing library; I think it is more likely that people would contribute if it remained part of pandas.
Thank you, I think that's the right way to go. |