The LLM Evaluation Framework
Updated Nov 25, 2024 - Python
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and with agent frameworks such as CrewAI, LangChain, and AutoGen.
(IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!
[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.
OCTIS: Comparing Topic Models is Simple! A Python package to optimize and evaluate topic models (accepted at the EACL 2021 demo track)
Evaluate your speech-to-text system with similarity measures such as word error rate (WER)
📈 Implementation of eight evaluation metrics to assess the similarity between two images. The eight metrics are as follows: RMSE, PSNR, SSIM, ISSM, FSIM, SRE, SAM, and UIQ.
A Neural Framework for MT Evaluation
⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍
PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language models. There are also more complex data types and algorithms. Mor…
Data-Driven Evaluation for LLM-Powered Applications
Source code for "Taming Visually Guided Sound Generation" (Oral at BMVC 2021)
Resources for the "Evaluating the Factual Consistency of Abstractive Text Summarization" paper
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
A Python wrapper for the ROUGE summarization evaluation package
Code base for the precision, recall, density, and coverage metrics for generative models. ICML 2020.
A natural language processing project that performs sentiment analysis by classifying tweets as positive or negative with machine learning models, covering text mining, text analysis, data analysis, and data visualization.