Automatic Extractive Text Summarization using Multiple Linguistic Features

Online AM: 08 April 2024

Abstract

Automatic text summarization (ATS) uses natural language processing (NLP) to produce summaries of distinct categories of information. For low-resource languages such as Hindi, applications of these techniques remain limited. This study proposes a method for automatically generating summaries of Hindi documents using an extractive technique. The approach retrieves pertinent sentences from the source documents by combining multiple linguistic features with machine learning (ML) based on maximum likelihood estimation (MLE) and maximum entropy (ME). The input documents are pre-processed by removing Hindi stop words and by stemming. We obtain 15 linguistic feature scores from each document to identify the phrases with high scores for summary generation. We evaluate the proposed text summarizer on BBC News articles, CNN News, DUC 2004, the Hindi Text Short Summarization Corpus, the Indian Language News Text Summarization Corpus, and Wikipedia articles. The Hindi Text Short Summarization Corpus and the Indian Language News Text Summarization Corpus are in Hindi, whereas the BBC News, CNN News, and DUC 2004 datasets were translated into Hindi using the Google, Microsoft Bing, and Systran translators. Summarization results are reported for both Hindi and English to compare performance on a low-resource and a high-resource language. The evaluation, based on multiple ROUGE metrics along with precision, recall, and F-measure, shows better performance for the proposed method. We compare the proposed method with supervised and unsupervised ML methods, including support vector machine (SVM), Naive Bayes (NB), decision tree (DT), latent semantic analysis (LSA), latent Dirichlet allocation (LDA), and K-means clustering, and find that it outperforms them.
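
As an illustration of the evaluation measures named in the abstract, the following is a minimal sketch of how ROUGE-N precision, recall, and F-measure can be computed from n-gram overlap between a candidate summary and a reference summary. It is not the authors' evaluation code: the function names `ngrams` and `rouge_n`, the whitespace tokenization, and the example sentences are assumptions made for illustration, and a real evaluation would apply the paper's own pre-processing (for example, Hindi stop-word removal and stemming) before counting.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, F-measure) for ROUGE-N.

    Assumption: tokens are obtained by whitespace splitting, with no
    stemming or stop-word removal.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)

    # Clipped overlap: each n-gram counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand & ref).values())
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    candidate = "the cat sat on the mat"
    reference = "the cat lay on the mat"
    print(rouge_n(candidate, reference, n=1))
    print(rouge_n(candidate, reference, n=2))
```

Because the function only splits on whitespace, it can be applied unchanged to Devanagari text, although scores would then reflect surface word forms rather than stems.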

          Published In

          ACM Transactions on Asian and Low-Resource Language Information Processing, Just Accepted
          EISSN: 2375-4702
          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          Online AM: 08 April 2024
          Accepted: 01 April 2024
          Revised: 14 March 2024
          Received: 30 November 2023

          Author Tags

          1. Language modeling
          2. machine learning
          3. linguistic features
          4. ROUGE
          5. extractive summarization

          Qualifiers

          • Research-article
