ENH: handle multiple tbody in read_html() #20690

jstray · 2018-04-13T20:37:23Z

Code Sample

import pandas

url = 'https://www.wunderground.com/history/airport/KORD/2018/3/21/CustomHistory.html?dayend=10&monthend=4&yearend=2018'
df = pandas.read_html(url)[1]

Problem description

Expected: table of weather information at page bottom.

Actual:
0.21.1 - first two rows
0.22.0 - exception
current master - first two rows

For reference, Google Sheets IMPORTHTML loads this table correctly.

WillAyd · 2018-04-13T23:42:48Z

It looks like the second table has multiple tbody tags but the parser only looks at the first:

pandas/pandas/io/html.py

Line 394 in ad5affb

res = self._parse_tr(tbody[0])

PRs welcome
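To illustrate the behaviour described above, here is a minimal sketch using only the standard library rather than the actual pandas parser. The table markup, the `rows_of` helper, and the variable names are all hypothetical; the point is only the difference between reading `tbody[0]` and reading every `<tbody>` in the table.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed table with two <tbody> sections.
HTML = """
<table>
  <thead><tr><th>day</th><th>temp</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>40</td></tr>
    <tr><td>2</td><td>42</td></tr>
  </tbody>
  <tbody>
    <tr><td>3</td><td>45</td></tr>
  </tbody>
</table>
""".strip()


def rows_of(tbody):
    """Return the cell text of one <tbody>, row by row (illustrative helper)."""
    return [[(cell.text or "").strip() for cell in tr] for tr in tbody.findall("tr")]


table = ET.fromstring(HTML)
tbodies = table.findall("tbody")

# Looking only at the first <tbody> drops the third data row.
first_only = rows_of(tbodies[0])

# Walking every <tbody> keeps all of the rows.
all_rows = [row for tbody in tbodies for row in rows_of(tbody)]

print(len(first_only), "rows from tbody[0];", len(all_rows), "rows from all tbody elements")
```

With this input, the first approach yields two rows and the second yields three, which matches the "first two rows" symptom reported in the issue.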

jstray · 2018-04-14T00:20:42Z

Thanks. I'll probably have to fix this.

gfyoung · 2018-04-15T20:37:46Z

We could always enhance read_html to behave like read_excel where we can read in multiple tbody elements. That doesn't seem too unreasonable for the moment.

jstray · 2018-04-15T22:18:19Z

How is read_excel similar to this? There’s no tbody in that case.

WillAyd · 2018-04-15T22:52:07Z

@gfyoung are you just referring to the ability to return multiple tables? read_html technically already does that as it returns a list instead of just a DataFrame, but chime in if I misunderstand.

@jstray one design consideration to think about - should multiple tbody elements return a MultiIndexed DataFrame? I suppose having those multiple tbody tags in the first place is indicative of different groupings within the table, so maybe it makes sense for each of them to be a unique value within the first level of a MultiIndex?

Not saying that needs to happen at the outset, as certainly being able to just parse them would be an improvement over what we have today. Just throwing that out there as food for thought as you try a patch.
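A rough sketch of what the MultiIndex idea could look like, assuming the rows of each `<tbody>` have already been parsed into lists. This is not how read_html behaves; the group labels (`tbody_0`, `tbody_1`), column names, and data are invented for illustration.

```python
import pandas as pd

# Hypothetical parsed rows, one list of rows per <tbody>; labels are made up.
groups = {
    "tbody_0": [[1, 40], [2, 42]],
    "tbody_1": [[3, 45]],
}

# One small DataFrame per <tbody>.
frames = {
    name: pd.DataFrame(rows, columns=["day", "temp"])
    for name, rows in groups.items()
}

# Passing a dict to concat uses its keys as the outer index level,
# so each <tbody> becomes one value in the first level of a MultiIndex.
combined = pd.concat(frames, names=["tbody", "row"])

print(combined)
```

Selecting `combined.loc["tbody_1"]` would then return only the rows that came from the second `<tbody>`, while dropping the outer level would recover a flat table.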

gfyoung · 2018-04-16T05:57:50Z

@WillAyd : Sorry, misspoke there. Please ignore that comment. 😄

gfyoung added the IO HTML read_html, to_html, Styler.apply, Styler.applymap label Apr 15, 2018

gfyoung added the Enhancement label Apr 15, 2018

chris-b1 changed the title ~~read_html() failing on certain pages~~ ENH: handle multiple tbody in read_html() Apr 16, 2018

chris-b1 added Difficulty Intermediate labels Apr 16, 2018

chris-b1 added this to the Next Major Release milestone Apr 16, 2018

adamhooper added a commit to adamhooper/pandas that referenced this issue May 1, 2018

Read from multiple <tbody> within a <table>

edfcc73

refs pandas-dev#20690

adamhooper added a commit to adamhooper/pandas that referenced this issue May 1, 2018

Read from multiple <tbody> within a <table>

fcadfe8

refs pandas-dev#20690

adamhooper added a commit to adamhooper/pandas that referenced this issue May 1, 2018

Read from multiple <tbody> within a <table>

54d47e4

refs pandas-dev#20690

adamhooper mentioned this issue May 1, 2018

Read from multiple <tbody> within a <table> #20891

Merged

4 tasks

jreback modified the milestones: Next Major Release, 0.23.0 May 1, 2018

TomAugspurger closed this as completed in #20891 May 1, 2018

TomAugspurger pushed a commit that referenced this issue May 1, 2018

Read from multiple <tbody> within a <table> (#20891)

926f241

* Read from multiple <tbody> within a <table> refs #20690
