An RL-based adversarial attack on a RoBERTa toxicity classifier.
The attack uses the REINFORCE algorithm in a black-box setting. It takes about 60k queries to the model to converge and reaches a success rate of 0.89 on the Jigsaw benchmark.
The algorithm generates an injection which, when inserted into the text, "detoxifies" it in the eyes of the classifier, even if the text contains offensive vocabulary.
The proposed algorithm is suitable for any deep text classifier.
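A rough sketch of the REINFORCE loop is shown below, assuming the injection is a fixed-length sequence of tokens prepended to the input text. The vocabulary, hyperparameters, and the `query_toxicity` helper are illustrative placeholders, not the notebook's actual implementation; in the notebook the RoBERTa classifier plays the role of the black box.

```python
# Sketch of a REINFORCE-style black-box attack (illustrative, not the notebook code).
import torch

VOCAB = ["love", "peace", "thanks", "friend", "kind", "happy", "gentle", "calm"]  # toy injection vocabulary
INJECTION_LEN = 4      # number of injected tokens (assumption)
N_STEPS = 1000         # the full attack needs ~60k queries to converge

# Learnable logits over the vocabulary, one row per injection position.
logits = torch.zeros(INJECTION_LEN, len(VOCAB), requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=0.1)

def query_toxicity(text: str) -> float:
    """Black-box toxicity score in [0, 1]; replace with the real classifier."""
    raise NotImplementedError

def attack(text: str) -> str:
    baseline = 0.0
    for _ in range(N_STEPS):
        dist = torch.distributions.Categorical(logits=logits)
        tokens = dist.sample()                                   # sample an injection
        injection = " ".join(VOCAB[int(i)] for i in tokens)
        reward = 1.0 - query_toxicity(injection + " " + text)    # lower toxicity = higher reward
        baseline = 0.9 * baseline + 0.1 * reward                 # moving-average baseline
        loss = -(reward - baseline) * dist.log_prob(tokens).sum()  # REINFORCE objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    best = logits.argmax(dim=-1)                                 # greedy injection after training
    return " ".join(VOCAB[int(i)] for i in best)
```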
- python 3.7
- torch 1.10
- transformers 4.6.1
Run all cells in the Attack_on_RoBerta.ipynb notebook.