Webcmdlets should parse the <html><head><meta charset="foo"> attribute for the correct encoding if not in http header #3267
Comments
I believe we should discuss a default encoding for the Web cmdlets here. Perhaps we should also use Windows-1252. |
I looked at the Web cmdlets and found that we use So we should treat After some thought, I believe that using Windows-1252 as the default is obsolete and we should aim at HTML5 and UTF-8 as defaults. https://w3techs.com/technologies/details/ml-html5/all/all |
@PowerShell/powershell-committee reviewed this and agrees this is an issue for customers. The proposal is to parse the HTTP header first: if charset is in content-type, we use that. Otherwise, if content-type is html, we parse |
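The lookup order proposed above (charset from the Content-Type header first, then, for HTML, a charset declared in the markup as the issue title suggests) could be sketched roughly as follows. This is a Python illustration only, not the cmdlets' implementation: the function name `detect_charset`, the regular expression, the 1024-byte scan window, and the UTF-8 fallback are all assumptions made for the example.

```python
import re
from email.message import Message

# Hypothetical pattern: finds charset=... inside a <meta> tag, covering both
# <meta charset="x"> and <meta http-equiv="Content-Type" content="...; charset=x">.
_META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def detect_charset(content_type, body, default="utf-8"):
    """Return the charset name to use when decoding `body` (bytes)."""
    msg = Message()
    if content_type:
        msg["Content-Type"] = content_type

    # 1. Prefer the charset parameter of the Content-Type header.
    header_charset = msg.get_param("charset")
    if isinstance(header_charset, str) and header_charset:
        return header_charset.lower()

    # 2. For HTML only, look for a <meta> charset declaration near the start.
    if msg.get_content_type() == "text/html":
        match = _META_CHARSET.search(body[:1024])
        if match:
            return match.group(1).decode("ascii").lower()

    # 3. Otherwise fall back to the default (an assumption; the thread debates
    #    ISO-8859-1 versus UTF-8 for this case).
    return default
```

The default is a parameter because the right fallback is exactly what the rest of this thread is debating.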
In Windows PowerShell we use Internet Explorer to parse HTML. What portable parser can we use in PowerShell Core? And if HTML doesn't contain |
@iSazonov the proposal is that we don't rely on any browser for the html parsing (if complete parsing is needed, I still think it would make more sense in a |
@SteveL-MSFT Original Windows web cmdlet returns |
@iSazonov To answer your other question I missed, if |
If we use any ported library for HTML parsing we will solve this Issue, get I wonder about |
@iSazonov my understanding is that HTTP1.1 still defaults to |
@SteveL-MSFT It seems the doc is very old. The new one is http://www.w3.org/TR/html5/syntax.html#the-input-byte-stream It doesn't mention |
CoreFX currently already uses UTF-8 as the default. |
@iSazonov that's HTML5; HTTP 1.1 defaults to ISO-8859-1 if charset is not specified. See 3.4.1 in https://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html |
These standards are too muddled 😕 From https://tools.ietf.org/html/rfc7231:
In any case we trust CoreFX. Yes? |
Ideally, we should just leave this to corefx. |
This already is in CoreFX so we can. |
Marking as waiting on NetStandard20. Once we move to latest CoreClr, we can verify if this is still an issue. |
@SteveL-MSFT Can you initiate internal conclusion about using HtmlAgilityPack or |
I'm not seeing the problem with 'http://weibo.com'. Invoke-RestMethod is detecting the encoding correctly. Can you be more specific? For tv.sohu.com and ip138.com, I found a bug in Invoke-RestMethod. It calls WriteVerbose to report the encoding name or header name, but Encoding.EncodingName throws. I'll need to change this and update the tests. |
We already do this in private static void EncodingRegisterProvider() |
Any further work on this I'm deferring to 6.1.0 |
Submitted RFC for the creation of |
We get a new HttpClient with .NET 3+, so I am removing the label. |
🎉 This issue was addressed in #18748, which has now been successfully released as |
Some websites do not populate the charset property of the content-type header, so characters aren't rendered correctly. The suggestion is to expose a -charset parameter; however, the user still needs to know the expected charset. Advanced users today can do the encoding translation in script. utf-8 probably works in most cases, so I'm not entirely sure how useful this will be if it expects the user to know the correct charset ahead of time.
See discussion from #3126 for details on how this came about.
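The script-side encoding translation mentioned above can be illustrated with a short Python sketch (not PowerShell, and not anything the cmdlets ship). It assumes the text was decoded with a lossless single-byte charset such as ISO-8859-1 and that the user already knows the real charset; the helper name `redecode` and both default charset names are hypothetical choices for the example.

```python
def redecode(text, assumed="iso-8859-1", actual="gb2312"):
    """Undo a wrong decode: recover the raw bytes, then decode them correctly.

    This only works when `assumed` maps every byte to a character, as
    ISO-8859-1 does, so that the original bytes can be recovered exactly.
    """
    raw = text.encode(assumed)   # back to the bytes the server sent
    return raw.decode(actual)    # decode with the charset the page really uses


# Simulate a page whose bytes are GB2312 but were decoded as ISO-8859-1.
original = "中文"
garbled = original.encode("gb2312").decode("iso-8859-1")
fixed = redecode(garbled)
```

As the issue notes, this still requires knowing the correct charset ahead of time, which is the main limitation of a user-supplied charset parameter as well.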