Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python: Modelling of the Standard Library #16840

Merged

Conversation

yoff
Copy link
Contributor

@yoff yoff commented Jun 25, 2024

These are the models gathered so far during experiments not extracting the standard library by default.
A good part of the models are generated by special automation. Where possible, I have made manual adjustments to make them align with their documentation and not be overly specific.
Commit-by-commit review should be beneficiel.

yoff added 6 commits June 25, 2024 14:40
Two of the generated summaries have been excluded:
 - ["re", "Member[split]", "Argument[0,pattern:]", "ReturnValue", "taint"]
   From the documentation, it is not clear why pattern should figure in the return value, as that is the part denoting split point and thus all those instances are filtered out.
   From the implementation
     Spit function: https://github.com/python/cpython/blob/3.12/Lib/re/__init__.py#L199
     _compile function being called by split: https://github.com/python/cpython/blob/3.12/Lib/re/__init__.py#L280
   We see that in case the pattern is already a compiled `Pattern`, it is returned directly from _compile and could thus be part of the return value from split. This is probably not possible to arrange for an attacker, and so an FP in practice.

 - ["urllib2", "Member[unquote]", "Argument[0,string:]", "ReturnValue", "taint"]
   urllib2 seems to be only in Python2 (e.g. https://docs.python.org/2.7/library/urllib2.html) and I cannot locate the function unquote.
@yoff yoff requested a review from a team as a code owner June 25, 2024 23:06
@yoff
Copy link
Contributor Author

yoff commented Jun 26, 2024

Sorry, I somehow pushed to the wrong branch. The force-push should just reset the world without any change..

@yoff yoff mentioned this pull request Jun 26, 2024
@yoff yoff added the no-change-note-required This PR does not need a change note label Jun 26, 2024
@yoff
Copy link
Contributor Author

yoff commented Jun 26, 2024

I have set no change note. I suppose a change note will be added when this branch is merged into main.

Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks OK to me, but it would have been nice to have some tests highlighting why flow through some of the more non-obvious methods are good. Value-flow for copy.copy makes 100% sense, but things like taint-flow for subprocess.Popen simply didn't make sense to me 🤷


I assume bc55117 was caused by some code using keywords which we didn't model previously?

Comment on lines +284 to +290
this =
DD::selfTracker(subclassRef()
.getAValueReachableFromSource()
.asExpr()
.(ClassExpr)
.getInnerScope())
or
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add a test indicating we can now do this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good point...this is one bit I am somewhat unsure of, do we want this? Also, it should trigger a performance check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you added this in 77a0087 -- I think a DCA performance check for this whole PR is in place yes 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have set off a standard nightly/nightly DCA run. I did not include any extra options, so it should still extract the standard library; thus we should see a slight slowdown from the new modelling and we can gauge that it is not huge. (If we did stop extraction, that speedup would drown out what we are looking for.)

yoff and others added 4 commits June 28, 2024 14:39
Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
It is not clear from the code how this could happen and
I do not remember the path I saw, perhaps it was unreasonable.
It would be nice to simplify to a single sequence content type..
@yoff yoff requested a review from RasmusWL June 28, 2024 13:25
MaD row numbers in provenance column
Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few unresolved comments, otherwise LGTM 👍

@yoff
Copy link
Contributor Author

yoff commented Jul 23, 2024

DCA run looks fine.

@yoff yoff requested a review from RasmusWL July 23, 2024 13:08
Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 Thanks for the updates 😊

@sidshank sidshank merged commit 8864ea3 into github:yoff-python-stop-extracting-std-lib Aug 12, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-change-note-required This PR does not need a change note Python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants