
ENH: read_html has no timeout #6029


Closed

MichaelWS opened this issue Jan 21, 2014 · 16 comments
Labels
Enhancement · IO Data (IO issues that don't fit into a more specific label)

Comments

@MichaelWS
Contributor

Is there any reason why a user cannot pass a timeout to read_html? I have a bunch of scripts that parse tables on the web, and I noticed this while having an issue with one site.

If I add a PR for a timeout, how would I test it?

@cpcloud
Member

cpcloud commented Jan 22, 2014

Did you have a design in mind?

For example, you could give a warning on timeout or raise an exception. You'd then do something like

import pandas.util.testing as tm
from pandas import read_html

# TimeoutError stands in for whatever exception the new code would raise
with tm.assertRaisesRegexp(TimeoutError, "read_html timed out"):
    dfs = read_html('http://timing-out.com')

And be sure to mark your test as @slow so that it doesn't get run on the fast(er) CI builds.

@cpcloud
Member

cpcloud commented Jan 22, 2014

No particular reason this isn't there. One reason this could be happening is that some HTML is malformed enough that the backend parser gets stuck in a cycle (e.g., child nodes become their own parents, so calling some sort of next_child method returns the node itself) and just keeps iterating until the cows come home. I'll try to find the other issue where I discovered this.
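
A minimal sketch of that failure mode (the node API here is hypothetical, not pandas internals):

def walk(node):
    # If malformed markup makes a node reachable from itself, a naive
    # traversal loops forever; tracking visited nodes turns that into
    # an error instead.
    seen = set()
    while node is not None:
        if id(node) in seen:
            raise ValueError("cycle detected while walking the HTML tree")
        seen.add(id(node))
        yield node
        node = node.next_child()  # hypothetical traversal method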

@cpcloud
Member

cpcloud commented Jan 22, 2014

xref: #4786

@MichaelWS
Contributor Author

Interesting. I did not consider it to be malformed. My problem was that I ran the identical code multiple times, so I concluded it was a timeout issue. The code parsed everything fine on another attempt.

(I had to parse the old SEC data myself because it was so poorly formed. Luckily, it's now XML going forward.)

@MichaelWS
Contributor Author

My design would just pass the timeout through to urlopen and raise the exception.
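
A minimal sketch of that design, assuming a hypothetical _fetch helper (the real pandas code paths differ):

import urllib2  # urllib.request on Python 3

def _fetch(url, timeout=None):
    # Pass the user's timeout straight through to urlopen; a slow host
    # then raises socket.timeout instead of hanging indefinitely.
    if timeout is not None:
        return urllib2.urlopen(url, timeout=timeout).read()
    return urllib2.urlopen(url).read()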

@cpcloud
Member

cpcloud commented Jan 22, 2014

@MichaelWS In that case, the cycle issue probably doesn't apply. If you're able to parse it sometimes and not others, the site you're calling could be rate-limiting the number of requests you're allowed to make per unit time.

@cpcloud
Member

cpcloud commented Jan 22, 2014

@MichaelWS Okay. Should be a simple enough change.

@cpcloud
Member

cpcloud commented Jan 22, 2014

> Luckily, it's now XML going forward

Famous last words :)

@cancan101
Contributor

What are people's thoughts on moving the HTML parsing code from urllib/urllib2 to Requests (http://docs.python-requests.org/en/latest/)?

Requests supports timeouts (http://docs.python-requests.org/en/latest/user/quickstart/?highlight=timeout#timeouts).

In the past (when not using requests), I have used socket.settimeout (http://docs.python.org/2/library/socket.html#socket.socket.settimeout).
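
Both workarounds in a minimal sketch (the URL is a placeholder):

import socket
import requests
import pandas as pd

# Option 1: let requests enforce the timeout, then hand pandas the HTML.
r = requests.get('http://example.com/tables.html', timeout=5)
dfs = pd.read_html(r.content)

# Option 2: set a process-wide default socket timeout; this affects every
# socket opened afterwards, including the one read_html opens internally.
socket.setdefaulttimeout(5)
dfs = pd.read_html('http://example.com/tables.html')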

@cpcloud
Member

cpcloud commented Jan 22, 2014

@cancan101 requests is awesome. However, that's yet another dependency for HTML parsing. There are already too many. And making it optional would only serve to make someone's hair a bit more gray.

@cancan101
Contributor

@cpcloud At some point it might make sense to spin the HTML parsing out of pandas into its own project. There have been some features filed by me and others that have been (reasonably) noted as beyond the scope of pandas.

Doing this would also simplify the dependencies of pandas itself.

@cancan101
Contributor

Here is another example of the infinite loop in parsing: #4770 (comment)

@ghost

ghost commented Jan 22, 2014

@cancan101, you are highly encouraged to go ahead and start such a project. It probably won't become a new dependency, but it will be a useful tool with its own focus that integrates well with pandas. We love those.

I was going to +1 the timeout arg suggestion (urllib2.urlopen supports a timeout argument, btw), but after a little more thought, I'm not so sure.

First, some code:

import requests
import pandas as pd

r = requests.get('http://en.wikipedia.org/wiki/List_of_Treme_episodes')
dfs = pd.read_html(r.content)  # read_html returns a list of DataFrames

Works just fine. Think about that: you can use your powerful tool of choice to handle the network details and then hand pandas the data. Isn't that just perfect? Let's say timeouts seem like a reasonable thing to add (they kind of do). What about HTTP basic auth? POST requests? Custom HTTP headers? Requests does a lot of things; should we expose them all through the method signature?

If the pandas plotting/mpl wrapper approach taught us anything, it's that you can end up in a bad way if you keep trying to expose more and more of the underlying functionality through your interface. (*)

read_html is about parsing HTML into DataFrames. It's not a Requests/urllib wrapper, and it's not a bs4 replacement. If you need the power of requests, use it; pandas will cooperate. If you need the power of xpath expressions, use those; pandas will accept the data you give it.
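
The same division of labor works with xpath; a sketch with a hypothetical URL and selector:

import requests
from lxml import html
import pandas as pd

# Select exactly the table you want yourself, then hand pandas the markup.
page = html.fromstring(requests.get('http://example.com/stats').content)
table = page.xpath('//table[@id="results"]')[0]
df = pd.read_html(html.tostring(table, encoding='unicode'))[0]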

This is a bad pattern we seem to fall into time and time again. I suggest we start being really acerbic when confronted with the syllogism: "You rely on dep X. Dep X can do Y. Therefore you should do Y." I think the user should just use X directly in many (not all; balance!) of those cases.

(*) If you're a design patterns freak, this is the Facade pattern. Simplification is the whole point.

@jreback
Contributor

jreback commented Jan 22, 2014

👍 on starting a fork of the parsers; as @y-p points out,

there are probably many, many variants which people would like to parse; pandas handles the most common, and will accept your data in any event.

@MichaelWS
Contributor Author

OK, y-p's point makes sense. I will use requests for my site and parse with pandas. That's why I posted the issue instead of simply putting up a PR.

I am not sure about a separate parsing library. I think it is more likely that people would contribute if it remained part of pandas.

@ghost ghost mentioned this issue Jan 24, 2014
@ghost

ghost commented Jan 27, 2014

Thank you, I think that's the right way to go.

@ghost ghost closed this as completed Jan 27, 2014