-
Notifications
You must be signed in to change notification settings - Fork 7.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WebCmdlets get encoding from BOM #19379
base: master
Are you sure you want to change the base?
Conversation
stream.ReadExactly(buffer, 0, 4); | ||
} | ||
|
||
EncodingHelper.TryDetectEncodingFromBom(buffer, out encoding, out int preambleLength); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be surprised if this is not already done in .Net. It might be worth looking at HttpClient and other related code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read the code in .Net and it doesn't match perfectly with our needs (it doesn't consider UTF32-BE, it checks for the BOM after checking the charset)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it checks for the BOM after checking the charset
Maybe it is more right behavior? Could you please point the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gets encoding from charset:
https://github.com/dotnet/runtime/blob/bd6cbe3642f51d70839912a6a666e5de747ad581/src/libraries/System.Net.Http/src/System/Net/Http/HttpContent.cs#L194-L219
Gets encoding from bom:
https://github.com/dotnet/runtime/blob/bd6cbe3642f51d70839912a6a666e5de747ad581/src/libraries/System.Net.Http/src/System/Net/Http/HttpContent.cs#L221-L234
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code is 8 years old. So we should follow the logic too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I followed what was discussed here: #11547 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing the history. It is correct.
Can we use StreamReader with auto encoding detection?
https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/IO/StreamReader.cs,109
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iSazonov I tried using StreamReader with auto encoding detection, what do you think?
internal static string DecodeStream(Stream stream, string characterSet, out Encoding encoding)
{
bool isDefaultEncoding = false;
StreamReader reader = new(stream, detectEncodingFromByteOrderMarks: true);
// reader.CurrentEncoding defaults to UTF8
encoding = reader.CurrentEncoding;
if (encoding == Encoding.UTF8)
{
isDefaultEncoding = !TryGetEncodingFromCharset(characterSet, out encoding);
}
if (isDefaultEncoding)
{
reader = new(stream, encoding);
// We only look within the first 1k characters as the meta element and
// the xml declaration are at the start of the document
int bufferLength = (int)Math.Min(reader.BaseStream.Length, 1024);
char[] buffer = new char[bufferLength];
reader.ReadBlock(buffer, 0, bufferLength);
stream.Seek(0, SeekOrigin.Begin);
string substring = new(buffer);
// Check for a charset attribute on the meta element to override the default
Match match = s_metaRegex.Match(substring);
// Check for a encoding attribute on the xml declaration to override the default
if (!match.Success)
{
match = s_xmlRegex.Match(substring);
}
if (match.Success)
{
characterSet = match.Groups["charset"].Value;
if (TryGetEncodingFromCharset(characterSet, out Encoding localEncoding))
{
encoding = localEncoding;
}
}
}
reader.Dispose();
return new StreamReader(stream, encoding).ReadToEnd();
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we can use https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/IO/StreamReader.cs,119
and pass encoding based on characterSet. This will aromatically use BOM if present.
src/Microsoft.PowerShell.Commands.Utility/commands/utility/WebCmdlet/StreamHelper.cs
Outdated
Show resolved
Hide resolved
|
||
stream.Seek(preambleLength, SeekOrigin.Begin); | ||
string content = StreamToString(stream, encoding); | ||
encoding = reader.CurrentEncoding; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CurrentEncoding is updated only after we start reading.
But should we process isDefaultEncoding block below? I guess if we had to process XML this would mean that the original request is fundamentally flawed. I doubt it's worth complicating the code for the sake of it.
@mklement0 What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iSazonov, I'm afraid I don't understand the question (I'm not familiar with the code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mklement0 I ask about p.3 from #11547 (comment)
otherwise, for XML and HTML, respect the encoding specified in the XML declaration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mklement0 Friendly ping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iSazonov, sorry, I forgot to respond: I'm still unclear on the details, but what I suggested in #11547 (comment) meant looking no further once a BOM is found. Only if there's none, look at charset
. Only if there's none, look at the XML declaration / HTML <meta>
element (the latter seems to be absent from the snippet above - is HTML handled elsewhere?).
This hinges on explicitly knowing if a BOM is present.
It is true that the above means that there can be inconsistencies that will be tolerated (e.g., a UTF-8 BOM, but an XML declaration specifying a different encoding), but the above precedence makes it clear which information "wins".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iSazonov, good point about breaking changes.
Note that I am not deeply immersed in this, so do tell me if I'm getting something wrong:
Looking at the old code (
PowerShell/src/Microsoft.PowerShell.Commands.Utility/commands/utility/WebCmdlet/StreamHelper.cs
Line 389 in 420c29d
internal static string DecodeStream(Stream stream, string characterSet, out Encoding encoding, CancellationToken cancellationToken) |
charset
attribute - currently already takes precedence, short-circuiting further investigation.
So we need to stick with that to avoid a breaking change, but do note that it contradicts the relevant RFC, as noted in #11547 (comment) (which recommends prioritizing in-band information).
Rethinking my over-specification statement: perhaps the right thing to do, if both out-of-band encoding information and a BOM are present, is to remove the BOM if the indicated and BOM-implied encoding is the same.
In the absence of out-of-band information, the question is then how the handle the in-band information - BOM and/or encoding
/ charset
attribute in <?xml>
declaration / <meta>
HTML element.
The current code searches only for the latter, and does so incorrectly, as discussed (no support for 2+-byte Unicode encodings).
To fix this:
- For XML-based in-band encoding information, we can delegate to the .NET APIs based on raw byte streams.
- For the HTML
<meta>
case, we still need to dour own text-based matching, but it requires something like the.Replace("\0")
approach mentioned above.
This too avoids a breaking change, though (like now) it requires the relatively expensive parsing of the first 1K / letting a .NET method call potentially fail (for potential XML content).
Giving precedence to manual BOM detection would perform better at least with non-XML content, but the only way this would not amount to a breaking change is if the presence of a BOM currently breaks things, which is the case, right?
I'm not sure how much performance is a concern here, however.
Manual BOM detection is still necessary, at least in the absence of out-of-band encoding information and possibly also XML-declaration / HTML <meta>
in-band information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.Net does encoding detection for XML https://source.dot.net/#System.Private.Xml/System/Xml/Core/XmlTextReaderImpl.cs,2895
So we have no need to do the same but worse.
Since first we try Feeds/Atom as XML we can simplify the code and first try XML reading then fallback to Json.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iSazonov we only try Feeds/Atom as XML in Invoke-RestMethod
not in Invoke-WebRequest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not prevent my suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't, but from my understanding it only applies to Invoke-RestMethod
, we could work on your suggestion in another PR
@iSazonov any insight on the many test failures after the last commit? |
I restarted failed CIs. Let's wait result. |
@CarloToso I guess changes in EncodingController.cs cause CI fails. |
@iSazonov it seems the errors were caused by |
This pull request has been automatically marked as Review Needed because it has been there has not been any activity for 7 days. |
What is the status of this PR?? |
@doctordns The code is complete and awaiting further review |
{ | ||
StringBuilder result = new(capacity: ChunkSize); | ||
Decoder decoder = encoding.GetDecoder(); | ||
bool isDefaultEncoding = !TryGetEncodingFromCharset(characterSet, out Encoding encoding); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The precedence of encoding detection is unclear to me.
By looking at the code, when characterSet
is specified, isDefaultEncoding
will be false
, and
- meta element will not be checked
- encoding detected from BOM will take precedence, then the encoding resolved from
characterSet
when characterSet
is not specified, isDefaultEncoding
will be true
, and
- meta element will be checked
- character set from the meta element will take precedence, then the encoding detected from BOM, and then the default encoding
As is shown, the encoding detection is inconsistent -- if the characterSet
specified in header is lower precedence than BOM, then why the characterSet
specified in meta element is higher than BOM?
Can you please summarize the precedence you use and the reason for that (e.g. to avoid breaking change? RFC defined? If they are contradicted with each other, then maybe we should make a breaking change to adhere to RFC?) The summarization needs to be put in the code as comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please summarize the precedence you use and the reason for that (e.g. to avoid breaking change? RFC defined? If they are contradicted with each other, then maybe we should make a breaking change to adhere to RFC?) The summarization needs to be put in the code as comment.
@CarloToso I see that you've made changes to the encoding detection. But can you please summarize the precedence and the reason for using that precedence?
@CarloToso The CIs are failing due to a compilation error. Can you please fix it? |
There were some extensive changes in #19558, @stevenebutler could you help me once more? |
Hi @CarloToso - If it's still an issue I may be able to have a look once I'm on vacation in a week or two. |
Added extension methods to StreamReader that will add a timeout to each stream read if a timeout property is set in IWR/IRM.
Hi @CarloToso - I have made a PR on your fork with fixes for this PR |
Make encoding changes handle network stall timeouts
This is needed to stop web methods from deadlocking when windows forms are loaded from within the PowerShell process.
This PR has Quantification details
Why proper sizing of changes matters
Optimal pull request sizes drive a better predictable PR flow as they strike a
What can I do to optimize my changes
How to interpret the change counts in git diff output
Was this comment helpful? 👍 :ok_hand: :thumbsdown: (Email) |
This pull request has been automatically marked as Review Needed because it has been there has not been any activity for 7 days. |
PR Summary
PR Context
using: https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/IO/StreamReader.cs
fixes #11547
PR Checklist
.h
,.cpp
,.cs
,.ps1
and.psm1
files have the correct copyright headerWIP:
or[ WIP ]
to the beginning of the title (theWIP
bot will keep its status check atPending
while the prefix is present) and remove the prefix when the PR is ready.(which runs in a different PS Host).