Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Webcmdlets should parse the <html><head><meta charset="foo"> attribute for the correct encoding if not in http header #3267

Closed
SteveL-MSFT opened this issue Mar 6, 2017 · 27 comments · Fixed by #4692 or #18748
Labels
Committee-Reviewed PS-Committee has reviewed this and made a decision Issue-Enhancement the issue is more of a feature request than a bug Resolution-Fixed The issue is fixed. WG-Cmdlets-Utility cmdlets in the Microsoft.PowerShell.Utility module
Milestone

Comments

@SteveL-MSFT
Copy link
Member

Some websites do not populate the charset property of the content-type header so characters aren't rendered correctly. Suggestion is to expose a -charset parameter, however the user still needs to know the expected charset. Advanced users today can do the encoding translation in script. utf-8 probably works in most cases, so not entirely sure how useful this will be to expect the user to know ahead of time the correct charset.

See discussion from #3126 for details on how this came about

@SteveL-MSFT SteveL-MSFT added WG-Cmdlets-Utility cmdlets in the Microsoft.PowerShell.Utility module Issue-Enhancement the issue is more of a feature request than a bug Up-for-Grabs Up-for-grabs issues are not high priorities, and may be opportunities for external contributors labels Mar 6, 2017
@iSazonov
Copy link
Collaborator

iSazonov commented Mar 6, 2017

I believe we should discuss here a default encoding for Web cmdlets.
Currently we use RFC 2616 "ISO-8859-1"
In fact, browsers for a long time use Windows-1252.
And HTML5 default is UTF-8.

Perhaps we should also use Windows 1252.

@SteveL-MSFT SteveL-MSFT added the Review - Committee The PR/Issue needs a review from the PowerShell Committee label Mar 7, 2017
@iSazonov
Copy link
Collaborator

iSazonov commented Mar 7, 2017

I looked Web cmdlets and found that we use ContentType parameter to encode a request (Content-Type header can contains charset value). If we don't specify a charset in ContenType we use default charset ISO-8859-1.
For decoding a response we use a charset from ContentType of the response. If a server return no ContentType we use the same default charset ISO-8859-1.

So we should treat -CharSet as -ResponseCharSet.

After some thought, I believe that using Windows 1252 as default is obsolete and we should aim at HTML5 and UTF-8 as defaults. https://w3techs.com/technologies/details/ml-html5/all/all

@SteveL-MSFT SteveL-MSFT added Committee-Reviewed PS-Committee has reviewed this and made a decision and removed Review - Committee The PR/Issue needs a review from the PowerShell Committee labels Mar 9, 2017
@SteveL-MSFT
Copy link
Member Author

@PowerShell/powershell-committee reviewed this and agree this is an issue for customers. proposal is to parse the HTTP header first, if charset is in content-type, we use that. otherwise if content-type is html, we parse <meta charset="X"> for the charset attribute and use that

@SteveL-MSFT SteveL-MSFT changed the title Expose -CharSet parameter to webcmdlets Webcmdlets should parse the <html><head><meta charset="foo"> attribute for the correct encoding if not in http header Mar 9, 2017
@iSazonov
Copy link
Collaborator

iSazonov commented Mar 9, 2017

In Windows Powershell we use Internet Explorer to parse HTML. What portable parser we can use in PowerShell Core?

And if HTML don't contain <meta charset="foo"> ? What defaults we should use for fallback?

@SteveL-MSFT
Copy link
Member Author

@iSazonov the proposal is that we don't rely on any browser for the html parsing (if complete parsing is needed, I still think it would make more sense in a convertfrom-html cmdlet). I think "best effort" for this is sufficient (perhaps even just a regular expression) to cover the majority of cases and we wouldn't worry about malformed html.

@iSazonov
Copy link
Collaborator

iSazonov commented Mar 9, 2017

@SteveL-MSFT Original Windows web cmdlet returns ParsedHtml : mshtml.HTMLDocumentClass - we want lost the functionality?

@SteveL-MSFT
Copy link
Member Author

@iSazonov .ParshedHtml relied on Internet Explorer. I don't think we can have a dependency on any particular web browser in the webcmdlets.

To answer your other question I missed, if <meta charset> doesn't exist and charset isn't specified in the content-type HTTP header, then we do what we do today which is assume ISO-8859-1

@iSazonov
Copy link
Collaborator

iSazonov commented Mar 9, 2017

@SteveL-MSFT

.ParshedHtml relied on Internet Explorer

If we use any ported library for HTML parsing we will solve this Issue, get ParshedHtml ported, as well as get a base for ConvertFrom-HTML.

I wonder about ISO-8859-1. Why we do not want to accept the new standard HTML5?

@SteveL-MSFT
Copy link
Member Author

@iSazonov my understanding is that HTTP1.1 still defaults to iso-8859-1, if the content-type is text/html, we should just follow the HTML 5 rules for determining content type ideally just using one of the OSS HTML parsing libraries you already found

@iSazonov
Copy link
Collaborator

iSazonov commented Mar 9, 2017

@SteveL-MSFT It seems the doc is very old. New is http://www.w3.org/TR/html5/syntax.html#the-input-byte-stream It don't mention iso-8859-1 at all.

@iSazonov
Copy link
Collaborator

Currently CoreFX already use UTF8 as default.

@SteveL-MSFT
Copy link
Member Author

@iSazonov that's html5, HTTP 1.1 defaults to ISO-8859-1 if charset is not specified. See 3.4.1 in https://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html

@iSazonov
Copy link
Collaborator

These standards are too muddled 😕 From https://tools.ietf.org/html/rfc7231:

The default charset of ISO-8859-1 for text media types has been
removed; the default is now whatever the media type definition says.
Likewise, special treatment of ISO-8859-1 has been removed from the
Accept-Charset header field.

In any case we trust CoreFX. Yes?

@SteveL-MSFT
Copy link
Member Author

Ideally, we should just leave this to corefx.

@iSazonov
Copy link
Collaborator

This already is in CoreFX so we can.
Only is this in CoreFX version we use currently or we blocked until NetStandard 2.0?

@SteveL-MSFT SteveL-MSFT added Waiting - DotNetCore waiting on a fix/change in .NET and removed Up-for-Grabs Up-for-grabs issues are not high priorities, and may be opportunities for external contributors labels Mar 13, 2017
@SteveL-MSFT
Copy link
Member Author

Marking as waiting on NetStandard20. Once we move to latest CoreClr, we can verify if this is still an issue.

@iSazonov
Copy link
Collaborator

@SteveL-MSFT Can you initiate internal conclusion about using HtmlAgilityPack or
AngleSharp? What one we can more trust?
Then I would try to replace IE on one of these parsers for PowerShell Core (and leave IE for FullCLR).

@SteveL-MSFT SteveL-MSFT added Review - Committee The PR/Issue needs a review from the PowerShell Committee and removed Committee-Reviewed PS-Committee has reviewed this and made a decision labels Mar 15, 2017
@dantraMSFT dantraMSFT assigned dantraMSFT and unassigned dantraMSFT Aug 28, 2017
@SteveL-MSFT SteveL-MSFT reopened this Sep 17, 2017
@SteveL-MSFT SteveL-MSFT removed the Resolution-Fixed The issue is fixed. label Sep 17, 2017
@dantraMSFT
Copy link
Contributor

I'm not seeing the problem with 'http://weibo.com'. Invoke-RestMethod is detecting the encoding correctly. Can you be more specific?

For tv.sohu.com and ip138.com, I found a bug in Invoke-RestMethod. It is calling WriteVerbose with the encoding indicating the encoding name or header name but Encoding.EncodingName is throwing. I'll need to change this and update the tests.

@iSazonov
Copy link
Collaborator

We already do this in private static void EncodingRegisterProvider()

@SteveL-MSFT SteveL-MSFT modified the milestones: 6.0.0-HighPriority, 6.1.0 Nov 3, 2017
@SteveL-MSFT
Copy link
Member Author

Any further work on this I'm deferring to 6.1.0

@MSAdministrator
Copy link

Submitted RFC for the creation of ConvertFrom-Html here: PowerShell/PowerShell-RFC#137

@iSazonov
Copy link
Collaborator

We get new HttpClient with .Net 3+ so I remove the label.

@iSazonov iSazonov removed the Waiting - DotNetCore waiting on a fix/change in .NET label Jul 16, 2020
@ghost ghost added the In-PR Indicates that a PR is out for the issue label Dec 8, 2022
Windows PowerShell compatibility (.NET Standard 2.0) automation moved this from Need to validate to Done Jan 21, 2023
@ghost ghost added Resolution-Fixed The issue is fixed. and removed In-PR Indicates that a PR is out for the issue labels Jan 21, 2023
@ghost
Copy link

ghost commented Mar 14, 2023

🎉This issue was addressed in #18748, which has now been successfully released as v7.4.0-preview.2.:tada:

Handy links:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Committee-Reviewed PS-Committee has reviewed this and made a decision Issue-Enhancement the issue is more of a feature request than a bug Resolution-Fixed The issue is fixed. WG-Cmdlets-Utility cmdlets in the Microsoft.PowerShell.Utility module
7 participants
@joeyaiello @JamesWTruher @MSAdministrator @SteveL-MSFT @dantraMSFT @iSazonov and others