research-article

Open access

Assessing The Factual Accuracy of Generated Text

Authors:

Mohammad Saleh

KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Pages 166 - 175

https://doi.org/10.1145/3292500.3330955

Published: 25 July 2019

Abstract

We propose a model-based metric to estimate the factual accuracy of generated text that is complementary to typical scoring schemes like ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy). We introduce and release a new large-scale dataset based on Wikipedia and Wikidata to train relation classifiers and end-to-end fact extraction models. The end-to-end models are shown to be able to extract complete sets of facts from datasets with full pages of text. We then analyse multiple models that estimate factual accuracy on a Wikipedia text summarization task, and show their efficacy compared to ROUGE and other model-free variants by conducting a human evaluation study.
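
To make the idea of a fact-based score concrete, the sketch below compares two sets of extracted facts. It is an illustration only, not the metric defined in the paper: it assumes facts are (subject, relation, object) triples, it assumes a fact-extraction model has already produced them, and the function name `fact_precision` and the toy entities are hypothetical.

```python
from typing import Iterable, Set, Tuple

# Assumed representation: a fact is a (subject, relation, object) triple.
Fact = Tuple[str, str, str]


def fact_precision(generated: Iterable[Fact], reference: Iterable[Fact]) -> float:
    """Fraction of facts in the generated text that also appear in the reference.

    Hypothetical helper for illustration; the paper's own scoring may differ.
    """
    gen: Set[Fact] = set(generated)
    ref: Set[Fact] = set(reference)
    if not gen:
        # No facts were extracted from the generated text, so nothing is supported.
        return 0.0
    return len(gen & ref) / len(gen)


# Toy data standing in for the output of a fact-extraction model.
reference_facts = [
    ("Alice", "born in", "Springfield"),
    ("Alice", "occupation", "engineer"),
]
generated_facts = [
    ("Alice", "born in", "Springfield"),  # supported by the reference
    ("Alice", "born in", "Shelbyville"),  # contradicts the reference
]

score = fact_precision(generated_facts, reference_facts)
print(score)
```

Unlike n-gram overlap, a score of this kind penalises a fluent summary that states a wrong birthplace even when most of its words match the reference.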

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Structured outputs
    2. Machine learning approaches
      1. Neural networks

Published In

KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

July 2019

3305 pages

ISBN:9781450362016

DOI:10.1145/3292500

General Chairs: Ankur Teredesai (KenSci), Vipin Kumar (University of Minnesota)

Program Chairs: Ying Li (EV Analysis Corporation), Rómer Rosales (LinkedIn), Evimaria Terzi (Boston University), George Karypis (University of Minnesota)

Copyright © 2019 Owner/Author.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD '19: The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 4 - 8, 2019

Anchorage, AK, USA

Acceptance Rates

KDD '19 Paper Acceptance Rate 110 of 1,200 submissions, 9%

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
