Eyal Shnarch


2024

Efficient Benchmarking (of Language Models)
Yotam Perlitz | Elron Bandel | Ariel Gera | Ofir Arviv | Liat Ein-Dor | Eyal Shnarch | Noam Slonim | Michal Shmueli-Scheuer | Leshem Choshen
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

The increasing versatility of language models (LMs) has given rise to a new class of benchmarks that comprehensively assess a broad range of capabilities. Such benchmarks are associated with massive computational costs, extending to thousands of GPU hours per model. However, the efficiency aspect of these evaluation efforts has raised little discussion in the literature. In this work, we present the problem of Efficient Benchmarking, namely, intelligently reducing the computation costs of LM evaluation without compromising reliability. Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability trade-off. We propose to evaluate the reliability of such decisions by using a new measure – Decision Impact on Reliability, DIoR for short. We find, for example, that a benchmark leader may change by merely removing a low-ranked model from the benchmark, and observe that a correct benchmark ranking can be obtained by considering only a fraction of the evaluation examples. Based on our findings, we outline a set of concrete recommendations for efficient benchmark design and utilization practices. To take a step further, we use our findings to propose an evaluation algorithm that, when applied to the HELM benchmark, leads to dramatic cost savings with minimal loss of benchmark reliability, often reducing computation by x100 or more.

Label-Efficient Model Selection for Text Generation
Shir Ashury Tahan | Ariel Gera | Benjamin Sznajder | Leshem Choshen | Liat Ein-Dor | Eyal Shnarch
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Model selection for a given target task can be costly, as it may entail extensive annotation of the quality of outputs of different models. We introduce DiffUse, an efficient method to make an informed decision between candidate text generation models based on preference annotations. DiffUse reduces the required amount of annotations, thus saving valuable time and resources in performing evaluation. DiffUse intelligently selects instances by clustering embeddings that represent the semantic differences between model outputs. Thus, it is able to identify a subset of examples that are more informative for preference decisions. Our method is model-agnostic, and can be applied to any text generation model for selecting between models, prompts and configurations. Moreover, we propose a practical iterative approach for dynamically determining how many instances to annotate. In a series of experiments over hundreds of model pairs, we demonstrate that DiffUse can dramatically reduce the required number of annotations – by up to 75% – while maintaining high evaluation reliability.

2023

The Benefits of Bad Advice: Autocontrastive Decoding across Model Layers
Ariel Gera | Roni Friedman | Ofir Arviv | Chulaka Gunasekara | Benjamin Sznajder | Noam Slonim | Eyal Shnarch
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Applying language models to natural language processing tasks typically relies on the representations in the final model layer, as intermediate hidden layer representations are presumed to be less informative. In this work, we argue that due to the gradual improvement across model layers, additional information can be gleaned from the contrast between higher and lower layers during inference. Specifically, in choosing between the probable next token predictions of a generative model, the predictions of lower layers can be used to highlight which candidates are best avoided. We propose a novel approach that utilizes the contrast between layers to improve text generation outputs, and show that it mitigates degenerative behaviors of the model in open-ended generation, significantly improving the quality of generated texts. Furthermore, our results indicate that contrasting between model layers at inference time can yield substantial benefits to certain aspects of general language model capabilities, more effectively extracting knowledge during inference from a given set of model parameters.

2022

Cluster & Tune: Boost Cold Start Performance in Text Classification
Eyal Shnarch | Ariel Gera | Alon Halfon | Lena Dankin | Leshem Choshen | Ranit Aharonov | Noam Slonim
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In real-world scenarios, a text classification task often begins with a cold start, when labeled data is scarce. In such cases, the common practice of fine-tuning pre-trained models, such as BERT, for a target classification task, is prone to produce poor performance. We suggest a method to boost the performance of such models by adding an intermediate unsupervised classification task, between the pre-training and fine-tuning phases. As such an intermediate task, we perform clustering and train the pre-trained model on predicting the cluster labels. We test this hypothesis on various data sets, and show that this additional classification phase can significantly improve performance, mainly for topical classification tasks, when the number of labeled instances available for fine-tuning is only a couple of dozen to a few hundred.

Zero-Shot Text Classification with Self-Training
Ariel Gera | Alon Halfon | Eyal Shnarch | Yotam Perlitz | Liat Ein-Dor | Noam Slonim
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Recent advances in large pretrained language models have increased attention to zero-shot text classification. In particular, models finetuned on natural language inference datasets have been widely adopted as zero-shot classifiers due to their promising results and off-the-shelf availability. However, the fact that such models are unfamiliar with the target task can lead to instability and performance issues. We propose a plug-and-play method to bridge this gap using a simple self-training approach, requiring only the class names along with an unlabeled dataset, and without the need for domain expertise or trial and error. We show that fine-tuning the zero-shot classifier on its most confident predictions leads to significant performance gains across a wide range of text classification tasks, presumably since self-training adapts the zero-shot model to the task at hand.

Label Sleuth: From Unlabeled Text to a Classifier in a Few Hours
Eyal Shnarch | Alon Halfon | Ariel Gera | Marina Danilevsky | Yannis Katsis | Leshem Choshen | Martin Santillan Cooper | Dina Epelboim | Zheng Zhang | Dakuo Wang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Label Sleuth is an open source platform for building text classifiers which does not require coding skills nor machine learning knowledge.

- Project website: https://www.label-sleuth.org/
- Link to screencast video: https://vimeo.com/735675461

### Abstract

Text classification can be useful in many real-world scenarios, saving a lot of time for end users. However, building a classifier generally requires coding skills and ML knowledge, which poses a significant barrier for many potential users. To lift this barrier we introduce *Label Sleuth*, a free open source system for labeling and creating text classifiers. This system is unique for:

- being a no-code system, making NLP accessible for non-experts.
- guiding its users throughout the entire labeling process until they obtain their desired classifier, making the process efficient - from cold start to a classifier in a few hours.
- being open for configuration and extension by developers.

By open sourcing Label Sleuth we hope to build a community of users and developers that will widen the utilization of NLP models.

GrASP: A Library for Extracting and Exploring Human-Interpretable Textual Patterns
Piyawat Lertvittayakumjorn | Leshem Choshen | Eyal Shnarch | Francesca Toni
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Data exploration is an important step of every data science and machine learning project, including those involving textual data. We provide a novel language tool, in the form of a publicly available Python library for extracting patterns from textual data. The library integrates a first public implementation of the existing GrASP algorithm. It allows users to extract patterns using a number of general-purpose built-in linguistic attributes (such as hypernyms, part-of-speech tags, and syntactic dependency tags), as envisaged for the original algorithm, as well as domain-specific custom attributes which can be incorporated into the library by implementing two functions. The library is equipped with a web-based interface empowering human users to conveniently explore data via the extracted patterns, using complementary pattern-centric and example-centric views: the former includes a reading in natural language and statistics of each extracted pattern; the latter shows applications of each extracted pattern to training examples. We demonstrate the usefulness of the library in classification (spam detection and argument mining), model analysis (machine translation), and artifact discovery in datasets (SNLI and 20Newsgroups).

2020

Unsupervised Expressive Rules Provide Explainability and Assist Human Experts Grasping New Domains
Eyal Shnarch | Leshem Choshen | Guy Moshkowich | Ranit Aharonov | Noam Slonim
Findings of the Association for Computational Linguistics: EMNLP 2020

Approaching new data can be quite daunting: you do not know how your categories of interest are realized in it, there is commonly no labeled data at hand, and the performance of domain adaptation methods is unsatisfactory. Aiming to assist domain experts in their first steps into a new task over a new corpus, we present an unsupervised approach to reveal complex rules which cluster the unexplored corpus by its prominent categories (or facets). These rules are human-readable, thus providing an important ingredient which has become in short supply lately: explainability. Each rule provides an explanation for the commonality of all the texts it clusters together. The experts can then identify which rules best capture texts of their categories of interest, and utilize them to deepen their understanding of these categories. These rules can also bootstrap the process of data labeling by pointing at a subset of the corpus which is enriched with texts demonstrating the target categories. We present an extensive evaluation of the usefulness of these rules in identifying target categories, as well as a user study which assesses their interpretability.

Active Learning for BERT: An Empirical Study
Liat Ein-Dor | Alon Halfon | Ariel Gera | Eyal Shnarch | Lena Dankin | Leshem Choshen | Marina Danilevsky | Ranit Aharonov | Yoav Katz | Noam Slonim
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Real world scenarios present a challenge for text classification, since labels are usually expensive and the data is often characterized by class imbalance. Active Learning (AL) is a ubiquitous paradigm to cope with data scarcity. Recently, pre-trained NLP models, and BERT in particular, are receiving massive attention due to their outstanding performance in various NLP tasks. However, the use of AL with deep pre-trained models has so far received little consideration. Here, we present a large-scale empirical study on active learning techniques for BERT-based classification, addressing a diverse set of AL strategies and datasets. We focus on practical scenarios of binary text classification, where the annotation budget is very small, and the data is often skewed. Our results demonstrate that AL can boost BERT performance, especially in the most realistic scenario in which the initial set of labeled examples is created using keyword-based queries, resulting in a biased sample of the minority class. We release our research framework, aiming to facilitate future research along the lines explored here.

2019

Are You Convinced? Choosing the More Convincing Evidence with a Siamese Network
Martin Gleize | Eyal Shnarch | Leshem Choshen | Lena Dankin | Guy Moshkowich | Ranit Aharonov | Noam Slonim
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

With the advancement in argument detection, we suggest paying more attention to the challenging task of identifying the more convincing arguments. Machines capable of responding and interacting with humans in helpful ways have become ubiquitous. We now expect them to discuss with us the more delicate questions in our world, and they should do so armed with effective arguments. But what makes an argument more persuasive? What will convince you? In this paper, we present a new data set, IBM-EviConv, of pairs of evidence labeled for convincingness, designed to be more challenging than existing alternatives. We also propose a Siamese neural network architecture shown to outperform several baselines on both a prior convincingness data set and our own. Finally, we provide insights into our experimental results and the various kinds of argumentative value our method is capable of detecting.

2018

Will it Blend? Blending Weak and Strong Labeled Data in a Neural Network for Argumentation Mining
Eyal Shnarch | Carlos Alzate | Lena Dankin | Martin Gleize | Yufang Hou | Leshem Choshen | Ranit Aharonov | Noam Slonim
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

The process of obtaining high quality labeled data for natural language understanding tasks is often slow, error-prone, complicated and expensive. With the vast usage of neural networks, this issue becomes more notorious since these networks require a large amount of labeled data to produce satisfactory results. We propose a methodology to blend high quality but scarce strong labeled data with noisy but abundant weak labeled data during the training of neural networks. Experiments in the context of topic-dependent evidence detection with two forms of weak labeled data show the advantages of the blending scheme. In addition, we provide a manually annotated data set for the task of topic-dependent evidence detection. We believe that blending weak and strong labeled data is a general notion that may be applicable to many language understanding tasks, and can especially assist researchers who wish to train a network but have a small amount of high quality labeled data for their task of interest.

Semantic Relatedness of Wikipedia Concepts – Benchmark Data and a Working Solution
Liat Ein Dor | Alon Halfon | Yoav Kantor | Ran Levy | Yosi Mass | Ruty Rinott | Eyal Shnarch | Noam Slonim
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

GRASP: Rich Patterns for Argumentation Mining
Eyal Shnarch | Ran Levy | Vikas Raykar | Noam Slonim
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

GRASP (GReedy Augmented Sequential Patterns) is an algorithm for automatically extracting patterns that characterize subtle linguistic phenomena. To that end, GRASP augments each term of input text with multiple layers of linguistic information. These different facets of the text terms are systematically combined to reveal rich patterns. We report highly promising experimental results in several challenging text analysis tasks within the field of Argumentation Mining. We believe that GRASP is general enough to be useful for other domains too.

For example, each of the following sentences includes a claim for a [topic]:

1. Opponents often argue that the open primary is unconstitutional. [Open Primaries]
2. Prof. Smith suggested that affirmative action devalues the accomplishments of the chosen. [Affirmative Action]
3. The majority stated that the First Amendment does not guarantee the right to offend others. [Freedom of Speech]

These sentences share almost no words in common, however, they are similar at a more abstract level. A human observer may notice the following underlying common structure, or pattern: [someone][argue/suggest/state][that][topic term][sentiment term]. GRASP aims to automatically capture such underlying structures of the given data. For the above examples it finds the pattern [noun][express][that][noun,topic][sentiment], where [express] stands for all its (in)direct hyponyms, and [noun,topic] means a noun which is also related to the topic.

2013

PLIS: a Probabilistic Lexical Inference System
Eyal Shnarch | Erel Segal-haLevi | Jacob Goldberger | Ido Dagan
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

2012

A Probabilistic Lexical Model for Ranking Textual Inferences
Eyal Shnarch | Ido Dagan | Jacob Goldberger
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

2011

A Probabilistic Modeling Framework for Lexical Entailment
Eyal Shnarch | Jacob Goldberger | Ido Dagan
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Towards a Probabilistic Model for Lexical Entailment
Eyal Shnarch | Jacob Goldberger | Ido Dagan
Proceedings of the TextInfer 2011 Workshop on Textual Entailment

2010

Recognising Entailment within Discourse
Shachar Mirkin | Jonathan Berant | Ido Dagan | Eyal Shnarch
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

2009

Text Categorization from Category Name via Lexical Reference
Libby Barak | Ido Dagan | Eyal Shnarch
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

Evaluating the Inferential Utility of Lexical-Semantic Resources
Shachar Mirkin | Ido Dagan | Eyal Shnarch
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

Extracting Lexical Reference Rules from Wikipedia
Eyal Shnarch | Libby Barak | Ido Dagan
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2007

Instance-based Evaluation of Entailment Rule Acquisition
Idan Szpektor | Eyal Shnarch | Ido Dagan
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

2006

Lexical Reference: a Semantic Matching Subtask
Oren Glickman | Eyal Shnarch | Ido Dagan
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing