ENH: handle multiple tbody in read_html() #20690

Closed
jstray opened this issue Apr 13, 2018 · 6 comments · Fixed by #20891
Labels: Enhancement; IO HTML (read_html, to_html, Styler.apply, Styler.applymap)

jstray commented Apr 13, 2018

Code Sample

import pandas

url = 'https://www.wunderground.com/history/airport/KORD/2018/3/21/CustomHistory.html?dayend=10&monthend=4&yearend=2018'
df = pandas.read_html(url)[1]

Problem description

Expected: table of weather information at page bottom.

Actual:
0.21.1 - first two rows
0.22.0 - exception
current master - first two rows

For reference, Google Sheets IMPORTHTML loads this table correctly.

Member

WillAyd commented Apr 13, 2018

It looks like the second table has multiple tbody tags but the parser only looks at the first:

res = self._parse_tr(tbody[0])

PRs welcome
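
To illustrate the point above, here is a minimal sketch of the difference between reading only the first tbody and reading all of them. It uses the standard library's html.parser rather than the pandas parser, and the sample HTML and the class name TbodyRowCollector are made up for the example.

```python
from html.parser import HTMLParser

# Hypothetical sample: one table whose rows are split across two <tbody> elements.
SAMPLE = """
<table>
  <thead><tr><th>day</th><th>temp</th></tr></thead>
  <tbody><tr><td>1</td><td>40</td></tr><tr><td>2</td><td>42</td></tr></tbody>
  <tbody><tr><td>3</td><td>45</td></tr></tbody>
</table>
"""


class TbodyRowCollector(HTMLParser):
    """Collect cell text per row, grouped by the <tbody> the row sits in."""

    def __init__(self):
        super().__init__()
        self.bodies = []        # one list of rows per <tbody>
        self._in_tbody = False  # True while inside a <tbody>
        self._row = None        # row currently being filled, if any
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self._in_tbody = True
            self.bodies.append([])
        elif tag == "tr" and self._in_tbody:
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tbody":
            self._in_tbody = False
        elif tag == "tr" and self._row is not None:
            self.bodies[-1].append(self._row)
            self._row = None
        elif tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())


collector = TbodyRowCollector()
collector.feed(SAMPLE)

# Reading only the first <tbody> drops every row in later ones.
first_only = collector.bodies[0]
# Flattening over all <tbody> elements keeps every row.
all_rows = [row for body in collector.bodies for row in body]
```

With this sample, first_only holds two rows while all_rows holds three, which matches the "first two rows" symptom reported above.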

Author

jstray commented Apr 14, 2018

Thanks. I'll probably have to fix this.

gfyoung added the IO HTML (read_html, to_html, Styler.apply, Styler.applymap) label Apr 15, 2018
Member

gfyoung commented Apr 15, 2018

We could always enhance read_html to behave like read_excel where we can read in multiple tbody elements. That doesn't seem too unreasonable for the moment.

Author

jstray commented Apr 15, 2018

How is read_excel similar to this? There’s no tbody in that case.

Member

WillAyd commented Apr 15, 2018

@gfyoung are you just referring to the ability to return multiple tables? read_html technically already does that as it returns a list instead of just a DataFrame, but chime in if I misunderstand.

@jstray one design consideration to think about - should multiple tbody elements return a MultiIndexed DataFrame? I suppose having those multiple tbody tags in the first place is indicative of different groupings within the table, so maybe it makes sense for each of them to be a unique value within the first level of a MultiIndex?

Not saying that needs to happen at the outset, as certainly being able to just parse them would be an improvement over what we have today. Just throwing that out there as food for thought as you try a patch.
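
As a rough sketch of the MultiIndex idea, assuming the rows of each tbody have already been parsed into lists: the column names and values below are invented, and this is only one possible shape for the result, not what read_html does.

```python
import pandas as pd

# Hypothetical data: each inner list holds the rows of one <tbody>.
columns = ["day", "temp"]
bodies = [
    [[1, 40], [2, 42]],  # rows from the first <tbody>
    [[3, 45]],           # rows from the second <tbody>
]

# Build one frame per <tbody>. The `keys` argument of concat adds an outer
# index level recording which <tbody> each row came from.
frames = [pd.DataFrame(rows, columns=columns) for rows in bodies]
df = pd.concat(frames, keys=list(range(len(frames))), names=["tbody", "row"])
```

Here df.xs(1, level="tbody") would select only the rows of the second tbody, while df.reset_index(drop=True) would give back the flat table.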

Member

gfyoung commented Apr 16, 2018

@WillAyd : Sorry, misspoke there. Please ignore that comment. 😄

chris-b1 changed the title from "read_html() failing on certain pages" to "ENH: handle multiple tbody in read_html()" Apr 16, 2018
chris-b1 added this to the Next Major Release milestone Apr 16, 2018
adamhooper added a commit to adamhooper/pandas that referenced this issue May 1, 2018
jreback changed the milestone from Next Major Release to 0.23.0 May 1, 2018
TomAugspurger pushed a commit that referenced this issue May 1, 2018
* Read from multiple <tbody> within a <table>

refs #20690

5 participants