# llm-evaluation

Here are 95 public repositories matching this topic...

langfuse / langfuse

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, LangChain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

open-source playground monitoring analytics evaluation self-hosted ycombinator openai gpt observability large-language-models llm prompt-engineering langchain llmops llama-index prompt-management llm-evaluation llm-observability

Updated Aug 19, 2024
TypeScript

promptfoo / promptfoo

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

testing ci evaluation ci-cd pentesting cicd vulnerability-scanners prompts evaluation-framework red-teaming rag llm prompt-engineering llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework

Updated Aug 20, 2024
TypeScript

Giskard-AI / giskard

🐢 Open-Source Evaluation & Testing for LLMs and ML models

Updated Aug 20, 2024
Python

confident-ai / deepeval

The LLM Evaluation Framework

evaluation-metrics evaluation-framework llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Updated Aug 19, 2024
Python

Helicone / helicone

🧊 Open source LLM-Observability Platform for Developers. One-line integration for monitoring, metrics, evals, agent tracing, prompt management, playground, etc. Supports OpenAI SDK, Vercel AI SDK, Anthropic SDK, LiteLLM, LlamaIndex, LangChain, and more. 🍓 YC W23

open-source playground monitoring analytics evaluation ycombinator openai gpt large-language-models llm prompt-engineering langchain llmops llama-index prompt-management llm-evaluation llm-observability agent-monitoring llm-cost

Updated Aug 20, 2024
TypeScript

Agenta-AI / agenta

The all-in-one LLM developer platform: prompt management, evaluation, human feedback, and deployment all in one place.

prompt-toolkit rag human-annotation large-language-models llm prompt-engineering llms langchain llmops llama-index prompt-management llm-tools llm-framework llm-evaluation rag-evaluation

Updated Aug 20, 2024
Python

relari-ai / continuous-eval

Data-Driven Evaluation for LLM-Powered Applications

information-retrieval evaluation-metrics evaluation-framework rag llmops retrieval-augmented-generation llm-evaluation

Updated Aug 15, 2024
Python

onejune2018 / Awesome-LLM-Eval

Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluating LLMs and foundation models, aiming to explore the technical boundaries of generative AI.

nlp benchmark machine-learning leaderboard evaluation dataset openai llama bert rag awsome-list gpt3 llm awsome-lists chatgpt large-language-model chatglm qwen llm-evaluation

Updated Jul 31, 2024

microsoft / prompty

Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.

promptengineering llms generative-ai llm-evaluation

Updated Aug 16, 2024
Python

Value4AI / Awesome-LLM-in-Social-Science

Awesome papers involving LLMs in Social Science.

social-network simulation-environment policy economics psychology alignment social-science large-language-models llms llm-agent llm-evaluation

Updated Aug 18, 2024

Psycoy / MixEval

The official evaluation suite and dynamic data release for MixEval.

benchmark evaluation benchmarking-suite evaluation-framework benchmarking-framework foundation-models large-language-models large-language-model llm-inference llm-evaluation large-multimodal-models llm-evaluation-framework benchmark-mixture mixeval

Updated Aug 19, 2024
Python

athina-ai / athina-evals

Python SDK for running evaluations on LLM generated responses

evaluation evaluation-metrics evaluation-framework llmops llm-eval llm-ops llm-evaluation llm-evaluation-toolkit

Updated Aug 19, 2024
Python

iMeanAI / WebCanvas

Connect agents to live web environments for evaluation.

agent benchmark-framework llm-agent llm-evaluation

Updated Aug 19, 2024
Python

PetroIvaniuk / llms-tools

A list of LLMs Tools & Projects

data-science machine-learning ai chatbots chat-bot llm chatgpt open-source-llm llm-evaluation

Updated Aug 18, 2024

villagecomputing / superpipe

Superpipe - optimized LLM pipelines for structured data

classification data-extraction structured-data data-labeling llm llm-evaluation llm-optimization

Updated Jun 18, 2024
Python

raga-ai-hub / raga-llm-hub

Framework for LLM evaluation, guardrails and security

guardrails llmops llm-security llm-evaluation

Updated Aug 9, 2024
Python

allenai / CommonGen-Eval

Evaluating LLMs with CommonGen-Lite

evaluation text-generation llm chatgpt gpt-evaluation llama2 llm-evaluation

Updated Mar 21, 2024
Python

rungalileo / hallucination-index

Initiative to evaluate and rank the most popular LLMs across common task types based on their propensity to hallucinate.

openai rag hallucinations large-language-models llm retrieval-augmented-generation llm-evaluation

Updated Jul 29, 2024

Re-Align / just-eval

A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.

evaluation gpt4 llm llm-eval llm-evaluation llm-evaluation-toolkit

Updated Jan 29, 2024
Python

loganrjmurphy / LeanEuclid

LeanEuclid is a benchmark for autoformalization in the domain of Euclidean geometry, targeting the proof assistant Lean.

theorem-proving formalization euclidean-geometry lean4 llm-evaluation autoformalization

Updated May 31, 2024
Lean
