Philippe Thomas


2024

Retrieval-Augmented Knowledge Integration into Language Models: A Survey
Yuxuan Chen | Daniel Röder | Justus-Jonas Erker | Leonhard Hennig | Philippe Thomas | Sebastian Möller | Roland Roller
Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024)

This survey analyses how external knowledge can be integrated into language models in the context of retrieval-augmentation. The main goal of this work is to give an overview of: (1) Which external knowledge can be augmented? (2) Given a knowledge source, how to retrieve from it and then integrate the retrieved knowledge? To achieve this, we define and give a mathematical formulation of retrieval-augmented knowledge integration (RAKI). We discuss retrieval and integration techniques separately in detail, for each of the following knowledge formats: knowledge graph, tabular and natural language.

Overview of #SMM4H 2024 – Task 2: Cross-Lingual Few-Shot Relation Extraction for Pharmacovigilance in French, German, and Japanese
Lisa Raithel | Philippe Thomas | Bhuvanesh Verma | Roland Roller | Hui-Syuan Yeh | Shuntaro Yada | Cyril Grouin | Shoko Wakamiya | Eiji Aramaki | Sebastian Möller | Pierre Zweigenbaum
Proceedings of The 9th Social Media Mining for Health Research and Applications (SMM4H 2024) Workshop and Shared Tasks

This paper provides an overview of Task 2 from the Social Media Mining for Health 2024 shared task (#SMM4H 2024), which focused on Named Entity Recognition (NER, Subtask 2a) and the joint task of NER and Relation Extraction (RE, Subtask 2b) for detecting adverse drug reactions (ADRs) in German, Japanese, and French texts written by patients. Participants were challenged with a few-shot learning scenario, necessitating models that can effectively generalize from limited annotated examples. Despite the diverse strategies employed by the participants, the overall performance across submissions from three teams highlighted significant challenges. The results underscored the complexity of extracting entities and relations in multi-lingual contexts, especially from the noisy and informal nature of user-generated content. Further research is required to develop robust systems capable of accurately identifying and associating ADR-related information in low-resource and multilingual settings.

Overview of the 9th Social Media Mining for Health Applications (#SMM4H) Shared Tasks at ACL 2024 – Large Language Models and Generalizability for Social Media NLP
Dongfang Xu | Guillermo Garcia | Lisa Raithel | Philippe Thomas | Roland Roller | Eiji Aramaki | Shoko Wakamiya | Shuntaro Yada | Pierre Zweigenbaum | Karen O’Connor | Sai Samineni | Sophia Hernandez | Yao Ge | Swati Rajwal | Sudeshna Das | Abeed Sarker | Ari Klein | Ana Schmidt | Vishakha Sharma | Raul Rodriguez-Esteban | Juan Banda | Ivan Amaro | Davy Weissenbacher | Graciela Gonzalez-Hernandez
Proceedings of The 9th Social Media Mining for Health Research and Applications (SMM4H 2024) Workshop and Shared Tasks

For the past nine years, the Social Media Mining for Health Applications (#SMM4H) shared tasks have promoted community-driven development and evaluation of advanced natural language processing systems to detect, extract, and normalize health-related information in publicly available user-generated content. This year, #SMM4H included seven shared tasks in English, Japanese, German, French, and Spanish from Twitter, Reddit, and health forums. A total of 84 teams from 22 countries registered for #SMM4H, and 45 teams participated in at least one task. This represents a growth of 180% and 160% in registration and participation, respectively, compared to the last iteration. This paper provides an overview of the tasks and participating systems. The data sets remain available upon request, and new systems can be evaluated through the post-evaluation phase on CodaLab.

Findings of the WMT 2024 Biomedical Translation Shared Task: Test Sets on Abstract Level
Mariana Neves | Cristian Grozea | Philippe Thomas | Roland Roller | Rachel Bawden | Aurélie Névéol | Steffen Castle | Vanessa Bonato | Giorgio Maria Di Nunzio | Federica Vezzani | Maika Vicente Navarro | Lana Yeganova | Antonio Jimeno Yepes
Proceedings of the Ninth Conference on Machine Translation

We present the results of the ninth edition of the Biomedical Translation Task at WMT’24. We released test sets for six language pairs, namely, French, German, Italian, Portuguese, Russian, and Spanish, from and into English. Each test set consists of 50 abstracts from PubMed. Differently from previous years, we did not split abstracts into sentences. We received submissions from five teams, and for almost all language directions. We used a baseline/comparison system based on Llama 3.1 and share the source code at https://github.com/cgrozea/wmt24biomed-ref.

A Dataset for Pharmacovigilance in German, French, and Japanese: Annotating Adverse Drug Reactions across Languages
Lisa Raithel | Hui-Syuan Yeh | Shuntaro Yada | Cyril Grouin | Thomas Lavergne | Aurélie Névéol | Patrick Paroubek | Philippe Thomas | Tomohiro Nishiyama | Sebastian Möller | Eiji Aramaki | Yuji Matsumoto | Roland Roller | Pierre Zweigenbaum
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

User-generated data sources have gained significance in uncovering Adverse Drug Reactions (ADRs), with an increasing number of discussions occurring in the digital world. However, the existing clinical corpora predominantly revolve around scientific articles in English. This work presents a multilingual corpus of texts concerning ADRs gathered from diverse sources, including patient fora, social media, and clinical reports in German, French, and Japanese. Our corpus contains annotations covering 12 entity types, four attribute types, and 13 relation types. It contributes to the development of real-world multilingual language models for healthcare. We provide statistics to highlight certain challenges associated with the corpus and conduct preliminary experiments resulting in strong baselines for extracting entities and relations between these entities, both within and across languages.

2023

MultiTACRED: A Multilingual Version of the TAC Relation Extraction Dataset
Leonhard Hennig | Philippe Thomas | Sebastian Möller
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Relation extraction (RE) is a fundamental task in information extraction, whose extension to multilingual settings has been hindered by the lack of supervised resources comparable in size to large English datasets such as TACRED (Zhang et al., 2017). To address this gap, we introduce the MultiTACRED dataset, covering 12 typologically diverse languages from 9 language families, which is created by machine-translating TACRED instances and automatically projecting their entity annotations. We analyze translation and annotation projection quality, identify error categories, and experimentally evaluate fine-tuned pretrained mono- and multilingual language models in common transfer learning scenarios. Our analyses show that machine translation is a viable strategy to transfer RE instances, with native speakers judging more than 83% of the translated instances to be linguistically and semantically acceptable. We find monolingual RE model performance to be comparable to the English original for many of the target languages, and that multilingual models trained on a combination of English and target language data can outperform their monolingual counterparts. However, we also observe a variety of translation and annotation projection errors, both due to the MT systems and linguistic features of the target languages, such as pronoun-dropping, compounding and inflection, that degrade dataset quality and RE model performance.

Findings of the WMT 2023 Biomedical Translation Shared Task: Evaluation of ChatGPT 3.5 as a Comparison System
Mariana Neves | Antonio Jimeno Yepes | Aurélie Névéol | Rachel Bawden | Giorgio Maria Di Nunzio | Roland Roller | Philippe Thomas | Federica Vezzani | Maika Vicente Navarro | Lana Yeganova | Dina Wiemann | Cristian Grozea
Proceedings of the Eighth Conference on Machine Translation

We present an overview of the Biomedical Translation Task that was part of the Eighth Conference on Machine Translation (WMT23). The aim of the task was the automatic translation of biomedical abstracts from the PubMed database. It included twelve language directions, namely, French, Spanish, Portuguese, Italian, German, and Russian, from and into English. We received submissions from 18 systems and for all the test sets that we released. Our comparison system was based on ChatGPT 3.5 and performed very well in comparison to many of the submissions.

2022

Cross-lingual Approaches for the Detection of Adverse Drug Reactions in German from a Patient’s Perspective
Lisa Raithel | Philippe Thomas | Roland Roller | Oliver Sapina | Sebastian Möller | Pierre Zweigenbaum
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this work, we present the first corpus for German Adverse Drug Reaction (ADR) detection in patient-generated content. The data consists of 4,169 binary annotated documents from a German patient forum, where users talk about health issues and get advice from medical doctors. As is common in social media data in this domain, the class labels of the corpus are very imbalanced. This and a high topic imbalance make it a very challenging dataset, since often, the same symptom can have several causes and is not always related to a medication intake. We aim to encourage further multi-lingual efforts in the domain of ADR detection and provide preliminary experiments for binary classification using different methods of zero- and few-shot learning based on a multi-lingual model. When fine-tuning XLM-RoBERTa first on English patient forum data and then on the new German data, we achieve an F1-score of 37.52 for the positive class. We make the dataset and models publicly available for the community.

MobASA: Corpus for Aspect-based Sentiment Analysis and Social Inclusion in the Mobility Domain
Aleksandra Gabryszak | Philippe Thomas
Proceedings of the First Computing Social Responsibility Workshop within the 13th Language Resources and Evaluation Conference

In this paper we show how aspect-based sentiment analysis might help public transport companies to improve their social responsibility for accessible travel. We present MobASA: a novel German-language corpus of tweets annotated with their relevance for public transportation, and with sentiment towards aspects related to barrier-free travel. We identified and labeled topics important for passengers limited in their mobility due to disability, age, or when travelling with young children. The data can be used to identify hurdles and improve travel planning for vulnerable passengers, as well as to monitor a perception of transportation businesses regarding the social inclusion of all passengers. The data is publicly available under: https://github.com/DFKI-NLP/sim3s-corpus

Findings of the WMT 2022 Biomedical Translation Shared Task: Monolingual Clinical Case Reports
Mariana Neves | Antonio Jimeno Yepes | Amy Siu | Roland Roller | Philippe Thomas | Maika Vicente Navarro | Lana Yeganova | Dina Wiemann | Giorgio Maria Di Nunzio | Federica Vezzani | Christel Gerardin | Rachel Bawden | Darryl Johan Estrada | Salvador Lima-lopez | Eulalia Farre-maduel | Martin Krallinger | Cristian Grozea | Aurelie Neveol
Proceedings of the Seventh Conference on Machine Translation (WMT)

In the seventh edition of the WMT Biomedical Task, we addressed a total of seven language pairs, namely English/German, English/French, English/Spanish, English/Portuguese, English/Chinese, English/Russian, and English/Italian. This year’s test sets covered three types of biomedical text genre. In addition to scientific abstracts and terminology items used in previous editions, we released test sets of clinical cases. The evaluation of clinical case translations was given special attention by involving clinicians in the preparation of reference translations and manual evaluation. For the main MEDLINE test sets, we received a total of 609 submissions from 37 teams. For the ClinSpEn sub-task, we had the participation of five teams.

2021

Findings of the WMT 2021 Biomedical Translation Shared Task: Summaries of Animal Experiments as New Test Set
Lana Yeganova | Dina Wiemann | Mariana Neves | Federica Vezzani | Amy Siu | Inigo Jauregi Unanue | Maite Oronoz | Nancy Mah | Aurélie Névéol | David Martinez | Rachel Bawden | Giorgio Maria Di Nunzio | Roland Roller | Philippe Thomas | Cristian Grozea | Olatz Perez-de-Viñaspre | Maika Vicente Navarro | Antonio Jimeno Yepes
Proceedings of the Sixth Conference on Machine Translation

In the sixth edition of the WMT Biomedical Task, we addressed a total of eight language pairs, namely English/German, English/French, English/Spanish, English/Portuguese, English/Chinese, English/Russian, English/Italian, and English/Basque. Further, our tests were composed of three types of textual test sets. New to this year, we released a test set of summaries of animal experiments, in addition to the test sets of scientific abstracts and terminologies. We received a total of 107 submissions from 15 teams from 6 countries.

2020

Findings of the WMT 2020 Biomedical Translation Shared Task: Basque, Italian and Russian as New Additional Languages
Rachel Bawden | Giorgio Maria Di Nunzio | Cristian Grozea | Inigo Jauregi Unanue | Antonio Jimeno Yepes | Nancy Mah | David Martinez | Aurélie Névéol | Mariana Neves | Maite Oronoz | Olatz Perez-de-Viñaspre | Massimo Piccardi | Roland Roller | Amy Siu | Philippe Thomas | Federica Vezzani | Maika Vicente Navarro | Dina Wiemann | Lana Yeganova
Proceedings of the Fifth Conference on Machine Translation

Machine translation of scientific abstracts and terminologies has the potential to support health professionals and biomedical researchers in some of their activities. In the fifth edition of the WMT Biomedical Task, we addressed a total of eight language pairs. Five language pairs were previously addressed in past editions of the shared task, namely, English/German, English/French, English/Spanish, English/Portuguese, and English/Chinese. Three additional language pairs were also introduced this year: English/Russian, English/Italian, and English/Basque. The task addressed the evaluation of both scientific abstracts (all language pairs) and terminologies (English/Basque only). We received submissions from a total of 20 teams. For recurring language pairs, we observed an improvement in the translations in terms of automatic scores and qualitative evaluations, compared to previous years.

2018

A German Corpus for Fine-Grained Named Entity Recognition and Relation Extraction of Traffic and Industry Events
Martin Schiersch | Veselina Mironova | Maximilian Schmitt | Philippe Thomas | Aleksandra Gabryszak | Leonhard Hennig
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Football and Beer - a Social Media Analysis on Twitter in Context of the FIFA Football World Cup 2018
Roland Roller | Philippe Thomas | Sven Schmeier
Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task

In many societies alcohol is a legal, common, and socially accepted recreational substance. Alcohol consumption often accompanies social events, as it helps people to increase their sociability and to overcome their inhibitions. On the other hand, we know that increased alcohol consumption can lead to serious health issues, such as cancer, cardiovascular diseases and diseases of the digestive system, to mention a few. This work examines alcohol consumption during the FIFA Football World Cup 2018, particularly the usage of alcohol-related information on Twitter. For this we analyse the tweeting behaviour and show that the tournament strongly increases the interest in beer. Furthermore, we show that countries that had to leave the tournament at an early stage might have done something good for their fans, as the interest in beer decreased again.

2017

Streaming Text Analytics for Real-Time Event Recognition
Philippe Thomas | Johannes Kirschnick | Leonhard Hennig | Renlong Ai | Sven Schmeier | Holmer Hemsen | Feiyu Xu | Hans Uszkoreit
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

A huge body of continuously growing written knowledge is available on the web in the form of social media posts, RSS feeds, and news articles. Real-time information extraction from such high velocity, high volume text streams requires scalable, distributed natural language processing pipelines. We introduce such a system for fine-grained event recognition within the big data framework Flink, and demonstrate its capabilities for extracting and geo-locating mobility- and industry-related events from heterogeneous text sources. Performance analyses conducted on several large datasets show that our system achieves high throughput and maintains low latency, which is crucial when events need to be detected and acted upon in real-time. We also present promising experimental results for the event extraction component of our system, which recognizes a novel set of event types. The demo system is available at http://dfki.de/sd4m-sta-demo/.

Common Round: Application of Language Technologies to Large-Scale Web Debates
Hans Uszkoreit | Aleksandra Gabryszak | Leonhard Hennig | Jörg Steffen | Renlong Ai | Stephan Busemann | Jon Dehdari | Josef van Genabith | Georg Heigold | Nils Rethmeier | Raphael Rubino | Sven Schmeier | Philippe Thomas | He Wang | Feiyu Xu
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics

Web debates play an important role in enabling broad participation of constituencies in social, political and economic decision-taking. However, it is challenging to organize, structure, and navigate a vast number of diverse argumentations and comments collected from many participants over a long time period. In this paper we demonstrate Common Round, a next generation platform for large-scale web debates, which provides functions for eliciting the semantic content and structures from the contributions of participants. In particular, Common Round applies language technologies for the extraction of semantic essence from textual input, aggregation of the formulated opinions and arguments. The platform also provides a cross-lingual access to debates using machine translation.

Findings of the WMT 2017 Biomedical Translation Shared Task
Antonio Jimeno Yepes | Aurélie Névéol | Mariana Neves | Karin Verspoor | Ondřej Bojar | Arthur Boyer | Cristian Grozea | Barry Haddow | Madeleine Kittner | Yvonne Lichtblau | Pavel Pecina | Roland Roller | Rudolf Rosa | Amy Siu | Philippe Thomas | Saskia Trescher
Proceedings of the Second Conference on Machine Translation

2016

Real-Time Discovery and Geospatial Visualization of Mobility and Industry Events from Large-Scale, Heterogeneous Data Streams
Leonhard Hennig | Philippe Thomas | Renlong Ai | Johannes Kirschnick | He Wang | Jakob Pannier | Nora Zimmermann | Sven Schmeier | Feiyu Xu | Jan Ostwald | Hans Uszkoreit
Proceedings of ACL-2016 System Demonstrations

2013

WBI-DDI: Drug-Drug Interaction Extraction using Majority Voting
Philippe Thomas | Mariana Neves | Tim Rocktäschel | Ulf Leser
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

2012

Improving Distantly Supervised Extraction of Drug-Drug and Protein-Protein Interactions
Tamara Bobić | Roman Klinger | Philippe Thomas | Martin Hofmann-Apitius
Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP

2011

Not all links are equal: Exploiting Dependency Types for the Extraction of Protein-Protein Interactions from Text
Philippe Thomas | Stefan Pietschmann | Illés Solt | Domonkos Tikk | Ulf Leser
Proceedings of BioNLP 2011 Workshop

Learning Protein–Protein Interaction Extraction using Distant Supervision
Philippe Thomas | Illés Solt | Roman Klinger | Ulf Leser
Proceedings of Workshop on Robust Unsupervised and Semisupervised Methods in Natural Language Processing
