This research introduces DARES, a dataset for assessing the readability of Arabic text in Saudi school materials. DARES comprises 13,335 instances from textbooks used in 2021 and contains two subtasks: (a) coarse-grained readability assessment, where the text is classified into educational levels such as primary and secondary, and (b) fine-grained readability assessment, where the text is classified into individual grades. We fine-tuned five transformer models that support Arabic and found that CAMeLBERTmix performed the best in all input settings. Evaluation results showed high performance for the coarse-grained readability assessment task, achieving a weighted F1 score of 0.91 and a macro F1 score of 0.79. The fine-grained task achieved a weighted F1 score of 0.68 and a macro F1 score of 0.55. These findings demonstrate the potential of our approach for advancing Arabic text readability assessment in education, with implications for future innovations in the field.
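The gap between the weighted and macro F1 scores reported above is typical of imbalanced label distributions. The following is a minimal sketch of how the two averages are computed, assuming the usual one-vs-rest definitions; the toy labels and the function names are illustrative and are not taken from DARES.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by the number of true instances per class."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)

# Hypothetical, deliberately imbalanced toy data (not from the dataset).
y_true = ["primary"] * 8 + ["secondary"] * 2
y_pred = ["primary"] * 8 + ["primary", "secondary"]

print(round(macro_f1(y_true, y_pred), 3))
print(round(weighted_f1(y_true, y_pred), 3))
```

Because the weighted average is dominated by the frequent class, it stays high even when a rare class is handled poorly, whereas the macro average exposes that weakness.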
Definition modelling (DM) is the task of automatically generating a dictionary definition of a specific word. Computational systems that are capable of DM can have numerous applications benefiting a wide range of audiences. As DM is considered a supervised natural language generation problem, these systems require large annotated datasets to train the machine learning (ML) models. Several DM datasets have been released for English and other high-resource languages. While Portuguese is considered a mid/high-resource language in most natural language processing tasks and is spoken by more than 200 million native speakers, there is no DM dataset available for Portuguese. In this research, we fill this gap by introducing DORE, the first dataset for Definition MOdelling for PoRtuguEse, containing more than 100,000 definitions. We also evaluate several deep learning based DM models on DORE and report the results. The dataset and the findings of this paper will facilitate research and study of Portuguese in wider contexts.
Authorship attribution aims to identify the author of an anonymous text. The task becomes even more worthwhile when it comes to literary works. For example, pen names were commonly used by female authors in the 19th century, resulting in some literary works being incorrectly attributed or claimed. With this motivation, we collated a dataset of late 19th century novels in English. Due to the imbalance in the dataset and the unavailability of enough data per author, we employed the GAN-BERT model along with data sampling strategies to fine-tune a transformer-based model for authorship attribution. Unlike earlier studies on the GAN-BERT model, we conducted transfer learning on comparatively smaller author subsets to train more focused author-specific models, yielding accuracy and F1 scores above 0.88. Furthermore, we observed that increasing the sample size has a negative impact on the model’s performance. Our research mainly contributes to the ongoing authorship attribution research using the GAN-BERT architecture, especially in attributing disputed novelists in the late 19th century.
Relationship extraction from unstructured data remains one of the most challenging tasks in the field of Natural Language Processing (NLP). The complexity of relationship extraction arises from the need to comprehend the underlying semantics, syntactic structures, and contextual dependencies within the text. Unstructured data poses challenges through diverse linguistic patterns, implicit relationships, and contextual nuances, which complicate accurate relationship identification and extraction. The emergence of Large Language Models (LLMs), such as GPT (Generative Pre-trained Transformer), has marked a significant advancement in the field of NLP. In this work, we assess and evaluate the effectiveness of LLMs in relationship extraction from Holocaust testimonies within the historical domain. By delving into this domain-specific context, we aim to gain deeper insights into the performance and capabilities of LLMs in accurately capturing and extracting relationships within the Holocaust domain by developing a novel knowledge graph to visualise the relationships of the Holocaust. To the best of our knowledge, there is no existing study which discusses relationship extraction in Holocaust testimonies. The majority of current approaches for Information Extraction (IE) in historic documents are either manual or OCR based. Moreover, in this study, we found that Subject-Object-Verb extraction using GPT3-based relations produced more meaningful results than Semantic Role Labeling-based triple extraction.
Text classification is an area of research which has been studied over the years in Natural Language Processing (NLP). Adapting NLP to multiple domains has introduced many new challenges for text classification, and one of them is long document classification. While state-of-the-art transformer models provide excellent results in text classification, most of them have limitations on the maximum length of the input sequence. The majority of the transformer models are limited to 512 tokens, and therefore, they struggle with long document classification problems. In this research, we explore employing Model Fusing for long document classification while comparing the results with the well-known BERT and Longformer architectures.
Multiword Terms (MWTs) are domain-specific Multiword Expressions (MWEs) where two or more lexemes converge to form a new unit of meaning. The task of processing MWTs is crucial in many Natural Language Processing (NLP) applications, including Machine Translation (MT) and terminology extraction. However, the automatic detection of those terms is a difficult task and more research is still required to give more insightful and useful results in this field. In this study, we seek to fill this gap using state-of-the-art transformer models. We evaluate both BERT-like discriminative transformer models and generative pre-trained transformer (GPT) models on this task, and we show that discriminative models perform better than current GPT models at identifying multiword terms among flower and plant names in English and Spanish. The best discriminative models achieve F1 scores of 94.3127% and 82.1733% on the English and Spanish data, respectively, while ChatGPT achieves only 63.3183% and 47.7925%, respectively.
Machine Translation (MT) has taken off dramatically in recent years due to the advent of Deep Learning methods, and Neural Machine Translation (NMT) has enhanced the quality of automatic translation significantly. While most work has covered the automatic translation of technical, legal and medical texts, the application of MT to literary texts and the human role in this process have been underexplored. In an effort to bridge the gap in this under-researched area, this paper presents the results of a study which seeks to evaluate the performance of three MT systems applied to two different literary genres: two novels (1984 by George Orwell and Pride and Prejudice by Jane Austen) and two poems (I Felt a Funeral in my Brain by Emily Dickinson and Siren Song by Margaret Atwood) representing different literary periods and timelines. The evaluation was conducted by way of the automatic evaluation metric BLEU to objectively assess the performance of each MT system on each genre. The limitations of this study are also outlined.
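For readers unfamiliar with BLEU, the sketch below shows its core computation: clipped n-gram precisions combined by a geometric mean and scaled by a brevity penalty. It is a simplified single-reference, sentence-level variant without smoothing, written for illustration only; the study above does not state which BLEU implementation or settings it used, and the example sentences are invented.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, reference, max_n=4):
    """Single-reference BLEU with uniform weights and no smoothing."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU = 0
        log_precisions.append(math.log(clipped / total))
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c >= r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

# Invented example sentences, tokenised by whitespace.
reference = "the cat sat on the red mat".split()
candidate = "the cat sat on the mat".split()
print(round(sentence_bleu(candidate, reference), 3))
```

Since the metric rewards exact n-gram overlap with a reference, it can penalise legitimate stylistic variation, which is one reason its use on literary text calls for caution.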
This paper explores the readability of translated and interpreted texts compared to the original source texts and target language texts in the same domain. It was shown in the literature that translated and interpreted texts could exhibit lexical and syntactic properties that make them simpler, and hence, easier to process than their sources or comparable non-translations. In translation, this effect is attributed to the tendency to simplify and disambiguate the message. In interpreting, it can be enhanced by the temporal and cognitive constraints. We use readability annotations from the Newsela corpus to formulate a number of classification and regression tasks and fine-tune a multilingual pre-trained model on these tasks, obtaining models that can differentiate between complex and simple sentences. Then, the models are applied to predict the readability of sources, targets, and comparable target language originals in a zero-shot manner. Our test data – parallel and comparable – come from English-German bidirectional interpreting and translation subsets from the Europarl corpus. The results confirm the difference in readability between translated/interpreted targets and sentences originally authored in the source and target languages. In addition, we find consistent differences between the translation directions in the English-German language pair.
The task of machine reading comprehension (MRC) is a useful benchmark to evaluate the natural language understanding of machines. It has gained popularity in the natural language processing (NLP) field mainly due to the large number of datasets released for many languages. However, MRC has been understudied in several domains, including religious texts. The goal of the Qur’an QA 2022 shared task is to fill this gap by producing state-of-the-art question answering and reading comprehension research on the Qur’an. This paper describes the DTW entry to the Qur’an QA 2022 shared task. Our methodology uses transfer learning to take advantage of available Arabic MRC data. We further improve the results using various ensemble learning strategies. Our approach achieved a partial Reciprocal Rank (pRR) score of 0.49 on the test set, demonstrating its strong performance on the task.
Most studies on word-level Quality Estimation (QE) of machine translation focus on language-specific models. The obvious disadvantages of these approaches are the need for labelled data for each language pair and the high cost required to maintain several language-specific models. To overcome these problems, we explore different approaches to multilingual, word-level QE. We show that multilingual QE models perform on par with the current language-specific models. In the cases of zero-shot and few-shot QE, we demonstrate that it is possible to accurately predict word-level quality for any given new language pair from models trained on other language pairs. Our findings suggest that the word-level QE models based on powerful pre-trained transformers that we propose in this paper generalise well across languages, making them more useful in real-world scenarios.
This paper presents a translationese study based on the parallel data from the Russian National Corpus (RNC). We explored differences between literary texts originally authored in Russian and fiction translated into Russian from 11 languages. The texts are represented with frequency-based features that capture structural and lexical properties of language. Binary classification results indicate that literary translations can be distinguished from non-translations with an accuracy ranging from 82 to 92% depending on the source language and feature set. Multiclass classification confirms that translations from distant languages are more distinct from non-translations than translations from languages that are typologically close to Russian. It also demonstrates that translations from same-family source languages share translationese properties. Structural features return more consistent results than features relying on external resources and capturing lexical properties of texts in both translationese detection and source language identification tasks.
Multiple-choice questions (MCQs) are widely used in knowledge assessment in educational institutions, during work interviews, in entertainment quizzes and games. Although the research on the automatic or semi-automatic generation of multiple-choice test items has been conducted since the beginning of this millennium, most approaches focus on generating questions from a single sentence. In this research, a state-of-the-art method of creating questions based on multiple sentences is introduced. It was inspired by semantic similarity matches used in the translation memory component of translation management systems. The performance of two deep learning algorithms, doc2vec and SBERT, is compared for the paragraph similarity task. The experiments are performed on the ad-hoc corpus within the EU domain. For the automatic evaluation, a smaller corpus of manually selected matching paragraphs has been compiled. The results prove the good performance of Sentence Embeddings for the given task.
Despite the enormous popularity of Translation Memory systems and the active research in the field, their language processing features still suffer from certain limitations. While many recent papers focus on semantic matching capabilities of TMs, this planned study will address how these tools perform when dealing with longer segments and whether this could be a cause of lower match scores. An experiment will be carried out on corpora from two different (repetitive) domains. Following the results, recommendations for future developments of new TMs will be made.
The paper reports the results of a translationese study of literary texts based on translated and non-translated Russian. We aim to find out if translations deviate from non-translated literary texts, and if the established differences can be attributed to typological relations between source and target languages. We expect that literary translations from typologically distant languages should exhibit more translationese, and that the fingerprints of individual source languages (and their families) are traceable in translations. We explore linguistic properties that distinguish non-translated Russian literature from translations into Russian. Our results show that non-translated fiction is different from translations to the degree that these two language varieties can be automatically classified. As expected, language typology is reflected in translations of literary texts. We identified features that point to linguistic specificity of Russian non-translated literature and to shining-through effects. Some translationese features cut across all language pairs, while others are characteristic of literary translations from languages belonging to specific language families.
The domain-specialised application of Named Entity Recognition (NER) is known as Biomedical NER (BioNER), which aims to identify and classify biomedical concepts that are of interest to researchers, such as genes, proteins, chemical compounds, drugs, mutations, diseases, and so on. The BioNER task is very similar to general NER, but recognising Biomedical Named Entities (BNEs) is more challenging than recognising proper names from newspapers due to the characteristics of biomedical nomenclature. In order to address the challenges posed by BioNER, seven machine learning models were implemented, comparing a transfer learning approach based on fine-tuned BERT with Bi-LSTM based neural models and a CRF model used as baseline. Precision, Recall and F1-score were used as performance scores for evaluating the models on two well-known biomedical corpora: JNLPBA and BIOCREATIVE IV (BC-IV). Strict and partial matching were considered as evaluation criteria. The reported results show that a transfer learning approach based on fine-tuned BERT outperforms all other methods, achieving the highest scores for all metrics on both corpora.
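The difference between strict and partial matching can be illustrated with a short sketch. It assumes one common convention: a strict match requires identical boundaries and entity type, while a partial match requires the same type and any token overlap. The exact criteria used in the study above are not given in the abstract, and the spans below are invented.

```python
def overlaps(a, b):
    """True if two (start, end, type) spans share at least one token; end is exclusive."""
    return a[0] < b[1] and b[0] < a[1]

def is_match(pred, gold, mode):
    """Compare a predicted and a gold entity under 'strict' or 'partial' matching."""
    if pred[2] != gold[2]:
        return False
    if mode == "strict":
        return pred[:2] == gold[:2]
    return overlaps(pred, gold)

def precision_recall_f1(predicted, gold, mode="strict"):
    """Entity-level precision, recall and F1 under the chosen matching mode."""
    matched_pred = sum(any(is_match(p, g, mode) for g in gold) for p in predicted)
    matched_gold = sum(any(is_match(p, g, mode) for p in predicted) for g in gold)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical token-index spans: (start, end_exclusive, entity_type).
gold = [(0, 2, "protein"), (5, 6, "DNA")]
predicted = [(0, 1, "protein"), (5, 6, "DNA"), (8, 9, "protein")]

print(precision_recall_f1(predicted, gold, mode="strict"))
print(precision_recall_f1(predicted, gold, mode="partial"))
```

In this toy case the boundary error on the first entity counts against the system under strict matching but is forgiven under partial matching, which is why the two criteria are usually reported side by side.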
Named Entity Recognition (NER) is an essential task in natural language processing to detect entities and classify them into predetermined categories. An entity is a meaningful word or phrase that refers to a proper noun. Named entities play an important role in different NLP tasks such as Information Extraction, Question Answering and Machine Translation. In Machine Translation, named entities often cause translation failures regardless of local context, affecting the output quality of translation. Annotating named entities is a time-consuming and expensive process, especially for low-resource languages. One solution for this problem is to use word alignment methods in bilingual parallel corpora in which just one side has been annotated. The goal is to extract named entities in the target language by using the annotated corpus of the source language. In this paper, we compare the performance of two alignment methods, the Grow-diag-final-and and Intersect Symmetrisation heuristics, to exploit the annotation projection of an English-Brazilian Portuguese bilingual corpus to detect named entities in Brazilian Portuguese. A NER model trained on annotated data extracted from the alignment methods is used to evaluate the performance of the aligners. Experimental results show that Intersect Symmetrisation is able to achieve superior performance scores compared to the Grow-diag-final-and heuristic in Brazilian Portuguese.
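The idea of annotation projection through intersect symmetrisation can be sketched as follows. The example keeps only the alignment links found in both translation directions and copies entity labels across them. The alignments are written by hand for illustration (in practice they would come from a word aligner), the function names are hypothetical, and the Grow-diag-final-and heuristic, which starts from this intersection and then adds further links, is not implemented here.

```python
def intersect(src2tgt, tgt2src):
    """Keep only links present in both directional alignments.

    src2tgt holds (src_idx, tgt_idx) pairs; tgt2src holds (tgt_idx, src_idx) pairs.
    """
    return src2tgt & {(s, t) for t, s in tgt2src}

def project_labels(src_labels, alignment, tgt_len, default="O"):
    """Copy non-default source labels onto aligned target tokens."""
    tgt_labels = [default] * tgt_len
    for s, t in sorted(alignment):
        if src_labels[s] != default:
            tgt_labels[t] = src_labels[s]
    return tgt_labels

# Invented sentence pair with source-side annotations only.
english = ["Maria", "lives", "in", "São", "Paulo"]
portuguese = ["Maria", "mora", "em", "São", "Paulo"]
english_labels = ["PER", "O", "O", "LOC", "LOC"]

# Hand-written directional alignments; (0, 1) is a deliberately spurious link.
forward = {(0, 0), (0, 1), (1, 1), (2, 2), (3, 3), (4, 4)}
backward = {(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)}

symmetric = intersect(forward, backward)
print(project_labels(english_labels, symmetric, len(portuguese)))
print(project_labels(english_labels, forward, len(portuguese)))
```

The intersection discards the spurious link, so the verb "mora" is not mislabelled as a person. This precision-oriented behaviour is the usual motivation for the heuristic, at the cost of leaving some target tokens unaligned.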
The exponential growth of the internet and social media in the past decade has given rise to an increase in the dissemination of false or misleading information. Since the 2016 US presidential election, the term “fake news” has become increasingly popular and this phenomenon has received more attention. In the past years several fact-checking agencies were created, but due to the great number of daily posts on social media, manual checking is insufficient. Currently, there is a pressing need for automatic fake news detection tools, either to assist manual fact-checkers or to operate as standalone tools. There are several projects underway on this topic, but most of them focus on English. This research-in-progress paper discusses the employment of deep learning methods, and the development of a tool, for detecting false news in Portuguese. As a first step we shall compare well-established architectures that were tested in other languages and analyse their performance on our Portuguese data. Based on the preliminary results of these classifiers, we shall choose a deep learning model or combine several deep learning models which hold promise to enhance the performance of our fake news detection system.
Despite the increasingly good quality of Machine Translation (MT) systems, MT outputs require corrections. Automatic Post-Editing (APE) models have been introduced to perform these corrections without human intervention. However, no system has been able to fully automate the Post-Editing (PE) process. Moreover, while numerous translation tools, such as Translation Memories (TMs), largely benefit from translators’ input, Human-Computer Interaction (HCI) remains limited when it comes to PE. This research-in-progress paper discusses APE models and suggests that they could be improved in more interactive scenarios, as previously done in MT with the creation of Interactive MT (IMT) systems. Based on the hypothesis that PE would benefit from HCI, two methodologies are proposed. Both suggest that traditional batch learning settings are not optimal for PE. Instead, online techniques are recommended to train and update PE models on the fly, via either real or simulated interactions with the translator.
This paper offers a comparative evaluation of four commercial ASR systems which are evaluated according to the post-editing effort required to reach “publishable” quality and according to the number of errors they produce. For the error annotation task, an original error typology for transcription errors is proposed. This study also seeks to examine whether there is a difference in the performance of these systems between native and non-native English speakers. The experimental results suggest that among the four systems, Trint obtains the best scores. It is also observed that most systems perform noticeably better with native speakers and that all systems are most prone to fluency errors.
Recent years have seen big advances in the field of sentence-level quality estimation (QE), largely as a result of using neural-based architectures. However, the majority of these methods work only on the language pair they are trained on and need retraining for new language pairs. This process can prove difficult from a technical point of view and is usually computationally expensive. In this paper we propose a simple QE framework based on cross-lingual transformers, and we use it to implement and evaluate two different neural architectures. Our evaluation shows that the proposed methods achieve state-of-the-art results outperforming current open-source quality estimation frameworks when trained on datasets from WMT. In addition, the framework proves very useful in transfer learning settings, especially when dealing with low-resourced languages, allowing us to obtain very competitive results.
This paper presents the team TransQuest’s participation in Sentence-Level Direct Assessment shared task in WMT 2020. We introduce a simple QE framework based on cross-lingual transformers, and we use it to implement and evaluate two different neural architectures. The proposed methods achieve state-of-the-art results surpassing the results obtained by OpenKiwi, the baseline used in the shared task. We further fine tune the QE framework by performing ensemble and data augmentation. Our approach is the winning solution in all of the language pairs according to the WMT 2020 official results.
This paper presents the RGCL team submission to SemEval 2020 Task 6: DeftEval, subtasks 1 and 2. The system classifies definitions at the sentence and token levels. It utilises state-of-the-art neural network architectures, which have some task-specific adaptations, including an automatically extended training set. Overall, the approach achieves acceptable evaluation scores, while maintaining flexibility in architecture selection.
Matching and retrieving previously translated segments from the Translation Memory is a key functionality in Translation Memory systems. However, this matching and retrieving process is still limited to algorithms based on edit distance, which we have identified as a major drawback in Translation Memory systems. In this paper, we introduce sentence encoders to improve the matching and retrieving process in Translation Memory systems, as an effective and efficient replacement for edit distance-based algorithms.
This article describes the system submitted by the RGCL-WLV team to the SemEval 2019 Task 12: Toponym resolution in scientific papers. The system detects toponyms using a bootstrapped machine learning (ML) approach which classifies names identified using gazetteers extracted from the GeoNames geographical database. The paper evaluates the performance of several ML classifiers, as well as how the gazetteers influence the accuracy of the system. Several runs were submitted. The highest precision achieved for one of the submissions was 89%, albeit at a relatively low recall of 49%.
We introduce a new method to tag Multiword Expressions (MWEs) using a linguistically interpretable language-independent deep learning architecture. We specifically target discontinuity, an under-explored aspect that poses a significant challenge to computational treatment of MWEs. Two neural architectures are explored: Graph Convolutional Network (GCN) and multi-head self-attention. GCN leverages dependency parse information, and self-attention attends to long-range relations. We finally propose a combined model that integrates complementary information from both, through a gating mechanism. The experiments on a standard multilingual dataset for verbal MWEs show that our model outperforms the baselines not only in the case of discontinuous MWEs but also in overall F-score.
While a number of studies have shown evidence of translationese phenomena, that is, statistical differences between original texts and translated texts (Gellerstam, 1986), studies searching for translationese features in post-edited texts (what has been called “posteditese” (Daems et al., 2017)) have presented mixed results. This paper reports a preliminary study aimed at identifying the presence of post-editese features in machine-translated post-edited texts and at understanding how they differ from translationese features. We test the influence of factors such as post-editing (PE) levels (full vs. light), translation proficiency (professionals vs. students) and text domain (news vs. literary). Results show evidence of post-editese features, especially in light PE texts and in certain domains.
Calculating Semantic Textual Similarity (STS) plays a significant role in many applications such as question answering, document summarisation, information retrieval and information extraction. All modern state-of-the-art STS methods rely on word embeddings in one way or another. The recently introduced contextualised word embeddings have proved more effective than standard word embeddings in many natural language processing tasks. This paper evaluates the impact of several contextualised word embeddings on unsupervised STS methods and compares them with existing supervised and unsupervised STS methods on different datasets in different languages and domains.
Calculating the Semantic Textual Similarity (STS) is an important research area in natural language processing which plays a significant role in many applications such as question answering, document summarisation, information retrieval and information extraction. This paper evaluates Siamese recurrent architectures, a special type of neural network, which are used here to measure STS. Several variants of the architecture are compared with existing methods.
The keynote speech presents the speaker’s vision that research should lead to the development of applications which benefit society. To support this, the speaker will present three original methodologies proposed by him which underpin applications jointly implemented with colleagues from across his research group. These Language Technology tools already have a substantial societal impact in the following areas: learning and assessment, translation and care for people with language disabilities.
When processing a text, humans and machines must disambiguate between different uses of the pronoun it, including non-referential, nominal anaphoric or clause anaphoric ones. In this paper we use eye-tracking data to learn how humans perform this disambiguation and use this knowledge to improve the automatic classification of it. We show that by using gaze data and a POS-tagger we are able to significantly outperform a common baseline and classify between three categories of it with an accuracy comparable to that of linguistic-based approaches. In addition, the discriminatory power of specific gaze features informs the way humans process the pronoun, which, to the best of our knowledge, has not been explored using data from a natural reading task.
This paper describes the systems submitted to SemEval 2018 Task 3 “Irony detection in English tweets” for both subtasks A and B. The first system leveraging a combination of sentiment, distributional semantic, and text surface features is ranked third among 44 teams according to the official leaderboard of the subtask A. The second system with slightly different representation of the features ranked ninth in subtask B. We present a method that entails decomposing tweets into separate parts. Searching for contrast within the constituents of a tweet is an integral part of our system. We embrace an extensive definition of contrast which leads to a vast coverage in detecting ironic content.
This paper describes the system submitted to SemEval 2018 shared task 10 ‘Capturing Discriminative Attributes’. We use a combination of knowledge-based and co-occurrence features to capture the semantic difference between two words in relation to an attribute. We define scores based on association measures, ngram counts, word similarity, and ConceptNet relations. The system is ranked 4th (joint) on the official leaderboard of the task.
This study investigates the supervised token-based identification of Multiword Expressions (MWEs). This is ongoing research that exploits the information contained in the contexts in which different instances of an expression could occur. This information is used to investigate the question of whether an expression is literal or an MWE. Lexical and syntactic context features derived from vector representations are shown to be more effective than traditional statistical measures for identifying tokens of MWEs.
Eye tracking studies from the past few decades have shaped the way we think of word complexity and cognitive load: words that are long, rare and ambiguous are more difficult to read. However, online processing techniques have been scarcely applied to investigating the reading difficulties of people with autism and what vocabulary is challenging for them. We present parallel gaze data obtained from adult readers with autism and a control group of neurotypical readers and show that the former required higher cognitive effort to comprehend the texts, as evidenced by three gaze-based measures. We divide all words into four classes based on their viewing times for both groups and investigate the relationship between longer viewing times and word length, word frequency, and four cognitively-based measures (word concreteness, familiarity, age of acquisition and imageability).
Translation Memory (TM) systems have changed the work of translators, and translators who do not benefit from these tools are now a tiny minority. These tools operate mostly on fuzzy (surface) matching and cannot benefit from already translated texts which are synonymous to (or paraphrased versions of) the text to be translated. The match score is mostly based on character-string similarity, calculated through Levenshtein distance. The TM tools have difficulties with detecting similarities even in sentences which represent a minor revision of sentences already available in the translation memory. This shortcoming of the current TM systems was the subject of the present study and was empirically proven in the experiments we conducted. To this end, we compiled a small translation memory (English-Spanish) and applied several lexical and syntactic transformation rules to the source sentences, with both English and Spanish being the source language. The results of this study show that current TM systems have a long way to go and highlight the need for TM systems equipped with NLP capabilities, which will spare the translator from having to translate a sentence again if an almost identical sentence has already been translated.
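The limitation described above can be made concrete with a small sketch of a character-level fuzzy match score. It assumes the score is one minus the Levenshtein distance divided by the length of the longer string; commercial TM tools use their own, often undisclosed, variants of this formula, and the example sentences are invented.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def fuzzy_match(a, b):
    """Similarity in [0, 1]; an assumed normalisation, not any specific tool's formula."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

# Invented segments: a minor revision and a paraphrase of the same stored sentence.
stored = "The committee approved the proposal."
minor_revision = "The committee approved the proposals."
paraphrase = "The proposal was approved by the committee."

print(round(fuzzy_match(stored, minor_revision), 3))
print(round(fuzzy_match(stored, paraphrase), 3))
```

The paraphrase carries the same meaning as the stored segment, yet its surface score falls well below that of the one-character revision, so a TM relying on this measure alone would be unlikely to offer it as a usable match.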
This paper presents an approach for automatic evaluation of the readability of text simplification output for readers with cognitive disabilities. First, we present our work towards the development of the EasyRead corpus, which contains easy-to-read documents created especially for people with cognitive disabilities. We then compare the EasyRead corpus to the simplified output contained in the LocalNews corpus (Feng, 2009), the accessibility of which has been evaluated through reading comprehension experiments including 20 adults with mild intellectual disability. This comparison is made on the basis of 13 disability-specific linguistic features. The comparison reveals that there are no major differences between the two corpora, which shows that the EasyRead corpus is at a similar reading level to the user-evaluated texts. We also discuss the role of Simple Wikipedia (Zhu et al., 2010) as a widely-used accessibility benchmark, in light of our finding that it is significantly more complex than both the EasyRead and the LocalNews corpora.
The paper presents a corpus of text data and its corresponding gaze fixations obtained from autistic and non-autistic readers. The data was elicited through reading comprehension testing combined with eye-tracking recording. The corpus consists of 1034 content words tagged with their POS, syntactic role and three gaze-based measures corresponding to the autistic and control participants. The reading skills of the participants were measured through multiple-choice questions and, based on the answers given, they were divided into groups of skillful and less-skillful readers. This division of the groups informs researchers on whether particular fixations were elicited from skillful or less-skillful readers and allows a fair between-group comparison for two levels of reading ability. In addition to describing the process of data collection and corpus development, we present a study on the effect that word length has on reading in autism. The corpus is intended as a resource for investigating the particular linguistic constructions which pose reading difficulties for people with autism and hopefully, as a way to inform future text simplification research intended for this population.
A syntactically complex text may pose a problem both for human comprehension and for various NLP tasks. A large number of studies in text simplification are concerned with this problem, and their aim is to transform a given text into a simplified form in order to make it accessible to a wider audience. In this study, we investigated the natural tendency of texts in 20th-century English. Are they becoming syntactically more complex over the years, requiring a higher literacy level and greater effort from the readers, or are they becoming simpler and easier to read? We examined several factors of text complexity (average sentence length, Automated Readability Index, sentence complexity and passive voice) in the 20th century for the two main English language varieties, British and American, using the 'Brown family' of corpora. In British English, we compared the complexity of texts published in 1931, 1961 and 1991, while in American English we compared the complexity of texts published in 1961 and 1992. Furthermore, we demonstrated how state-of-the-art NLP tools can be used for automatic extraction of some complex features from the raw text version of the corpora.
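Two of the surface measures named in the preceding abstract, average sentence length and the Automated Readability Index (ARI), can be illustrated with a minimal sketch. The ARI coefficients below are those of the standard published formula; the regex-based sentence and word splitting and the function names are simplifying assumptions made here for illustration and are not the tooling used in the study.

```python
import re


def _split(text: str):
    """Naive segmentation (an assumption): sentences end at . ! or ?,
    and words are runs of letters or digits."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9]+", text)
    return sentences, words


def average_sentence_length(text: str) -> float:
    """Mean number of words per sentence."""
    sentences, words = _split(text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    return len(words) / len(sentences)


def automated_readability_index(text: str) -> float:
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43,
    where characters counts only the letters and digits inside words."""
    sentences, words = _split(text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    characters = sum(len(w) for w in words)
    return (4.71 * characters / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)


if __name__ == "__main__":
    sample = "The cat sat. The dog ran."
    print(average_sentence_length(sample))
    print(automated_readability_index(sample))
```

Higher ARI values indicate text that is harder to read; very short, simple samples such as the one above can yield negative scores. A study on real corpora would need a proper sentence splitter and tokeniser rather than the regular expressions assumed here.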
This paper presents a freely available resource for research on handling negation and speculation in review texts. The SFU Review Corpus, consisting of 400 documents of movie, book, and consumer product reviews, was annotated at the token level with negative and speculative keywords and at the sentence level with their linguistic scope. We report statistics on corpus size and the consistency of annotations. The annotated corpus will be useful in many applications, such as document mining and sentiment analysis.
Due to the increasing number of emergency situations which can have substantial consequences, both financial and fatal, the Crisis Management (CM) domain is developing at an exponential speed. The efficient management of emergency situations relies on clear communication between all of the participants in a crisis situation. For these reasons, the Text Complexity (TC) of the CM domain needed to be investigated, and the investigation showed that CM domain texts exhibit high TC levels. This article presents a new linguistic resource in the form of Controlled Language (CL) guidelines for manual text simplification in the CM domain, which aims to address high TC in the CM domain and produce clear messages to be used in crisis situations. The effectiveness of the resource has been tested via evaluation from several different perspectives important for the domain. The overall results show that the CLCM simplification has a positive impact on TC, reading comprehension, manual translation and machine translation. Additionally, an investigation of the cognitive difficulty in applying manual simplification operations led to interesting discoveries. This article provides details of the evaluation methods, the conducted experiments, their results and indications about future work.
This paper is concerned with resources for controlled languages for alert messages and protocols in the European perspective. These resources have been produced as the outcome of a project (Alert Messages and Protocols: MESSAGE) which has been funded with the support of the European Commission - Directorate-General Justice, Freedom and Security, and with the specific objective of 'promoting and supporting the development of security standards, and an exchange of know-how and experience on protection of people'. The MESSAGE project involved the development and transfer of a methodology for writing safe and safely translatable alert messages and protocols created by Centre Tesnière in collaboration with the aircraft industry, the health profession, and emergency services by means of a consortium of four partners to their four European member states in their languages (ES, FR (Coordinator), GB, PL). The paper describes alert messages and protocols, controlled languages for safety and security, the target groups involved, controlled language evaluation, dissemination, the resources that are available, both Freely available and From Owner, together with illustrations of the resources, and the potential transferability to other sectors and users.
This paper describes a novel methodology to perform bilingual terminology extraction, in which automatic alignment is used to improve the performance of terminology extraction for each language. The strengths of monolingual terminology extraction for each language are exploited to improve the performance of terminology extraction in the other language, thanks to the availability of a sentence-level aligned bilingual corpus, and an automatic noun phrase alignment mechanism. The experiment indicates that weaknesses in monolingual terminology extraction due to the limitation of resources in certain languages can be overcome by using another language which has no such limitation.
Evaluation campaigns have become an established way to evaluate automatic systems which tackle the same task. This paper presents the first edition of the Anaphora Resolution Exercise (ARE) and the lessons learnt from it. This first edition focused only on English pronominal anaphora and NP coreference, and was organised as an exploratory exercise where various issues were investigated. ARE proposed four different tasks: pronominal anaphora resolution and NP coreference resolution on a predefined set of entities, and pronominal anaphora resolution and NP coreference resolution on raw texts. For each of these tasks different inputs and evaluation metrics were prepared. This paper presents the four tasks, their input data and the evaluation metrics used. Even though a large number of researchers in the field expressed interest in participating, only three institutions took part in the formal evaluation. The paper briefly presents their results, but does not try to interpret them because in this edition of ARE our aim was not to find out why certain methods are better, but to prepare the ground for a fully-fledged edition.
This paper discusses a framework for the development of bilingual and multilingual comprehension assistants and presents a prototype implementation of an English-Bulgarian comprehension assistant. The framework is based on the application of advanced graphical user interface techniques, WordNet and compatible lexical databases, as well as a series of NLP preprocessing tasks, including POS-tagging, lemmatisation, multiword expression recognition and word sense disambiguation. The aim of this framework is to speed up the process of dictionary look-up, to offer enhanced look-up functionalities and to perform a context-sensitive narrowing-down of the set of translation alternatives proposed to the user.
Convergence and simplification are two of the so-called universals in translation studies. The first one postulates that translated texts tend to be more similar to each other than non-translated texts. The second one postulates that translated texts are simpler and easier to understand than non-translated ones. This paper discusses the results of a project which applies NLP techniques to comparable corpora of translated and non-translated texts in Spanish, seeking to establish whether these two universals hold (Corpas Pastor, 2008).
The adverb "then" is among the most frequent English temporal adverbs, being also capable of filling a variety of semantic roles. The identification of anaphoric usages of "then" is important for temporal expression resolution, while the temporal relationship usage is important for event ordering. Given that previous work has not tackled the identification and temporal resolution of anaphoric "then", this paper presents a machine learning approach for setting apart anaphoric usages and a rule-based normaliser that resolves it with respect to an antecedent. The performance of the two modules is evaluated. The present paper also describes the construction of an annotated corpus and the subsequent derivation of training data required by the machine learning module.
Progress in Machine Translation (MT) during the last ten years has been observed at different levels, but discourse has yet to make a breakthrough. MT research and development has concentrated so far mostly on sentence translation (discourse analysis being a very complicated task) and the successful operation of most of the working MT systems does not usually go beyond the sentence level. To start with, the paper will refer to the MT research and development in the last ten years at the IAI in Saarbrücken. Next, the MT discourse issues will be discussed both from the point of view of source language analysis and target text generation, and on the basis of the preliminary results of an ongoing "discourse-oriented MT" project. Probably the most important aspect in successfully analysing multisentential source texts is the capacity to establish the anaphoric references to preceding discourse entities. The paper will discuss the problem of anaphora resolution from the perspective of MT. A new integrated model for anaphora resolution, developed for the needs of MT, will also be outlined. As already mentioned, most machine translation systems perform translation sentence by sentence. But even in the case of paragraph translation, the discourse structure of the target text tends to be identical to that of the source text. However, the sublanguage discourse structures may differ across languages, and thus a translated text which assumes the same discourse structure as the source text may sound unnatural and perhaps disguise the true intent of the writer. Finally, the paper will outline a new approach for generating discourse structures appropriate to the target sublanguage, and will discuss some of the complicated problems encountered.