llm-eval

Here are 27 public repositories matching this topic...

promptfoo / promptfoo

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

testing ci evaluation ci-cd pentesting cicd vulnerability-scanners prompts evaluation-framework red-teaming rag llm prompt-engineering llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework

Updated Aug 20, 2024
TypeScript

Giskard-AI / giskard

🐢 Open-Source Evaluation & Testing for LLMs and ML models

Updated Aug 20, 2024
Python

Arize-ai / phoenix

AI Observability & Evaluation

datasets mlops ai-monitoring ml-observability ai-observability ai-roi model-observability llmops llm-eval aiengineering

Updated Aug 20, 2024
Jupyter Notebook

uptrain-ai / uptrain

UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.

machine-learning monitoring evaluation experimentation jailbreak-detection autoevaluation root-cause-analysis prompt-engineering llmops openai-evals llm-prompting llm-eval llm-test hallucination-detection

Updated Aug 18, 2024
Python

iterative / datachain

DataChain 🔗 AI-dataframe to enrich, transform and analyze data from cloud storages for ML training and LLM apps

ai cv embeddings data-analytics data-wrangling multimodal mlops llm llm-eval

Updated Aug 20, 2024
Python

athina-ai / athina-evals

Python SDK for running evaluations on LLM generated responses

evaluation evaluation-metrics evaluation-framework llmops llm-eval llm-ops llm-evaluation llm-evaluation-toolkit

Updated Aug 19, 2024
Python

fiddlecube / fiddlecube-sdk

Generate ideal question-answers for testing RAG

synthetic-data llm-training llm-eval fine-tune-llms

Updated Jul 9, 2024
Python

Re-Align / just-eval

A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.

evaluation gpt4 llm llm-eval llm-evaluation llm-evaluation-toolkit

Updated Jan 29, 2024
Python

parea-ai / parea-sdk-py

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

metrics good-first-issue llm prompt-engineering generative-ai llmops llm-eval llm-tools llm-evaluation llm-evaluation-toolkit llms-benchmarking llm-evaluation-framework

Updated Aug 20, 2024
Python

kuk / rulm-sbs2

A benchmark comparing Russian ChatGPT analogues: Saiga, YandexGPT, Gigachat

russian-specific llm-eval

Updated Sep 26, 2023
Jupyter Notebook

Auto-Playground / ragrank

🎯 A free LLM evaluation toolkit that helps you assess your LLM application's factual accuracy, context understanding, tone, and more, so you can see how well it performs.

machine-learning evaluation language-model rag llm prompt-engineering llmops llm-eval

Updated Aug 15, 2024
Python

alan-turing-institute / prompto

An open source library for asynchronous querying of LLM endpoints

python nlp machine-learning natural-language-processing deep-learning transformers transformer hut23 large-language-models llms llm-eval llm-evaluation

Updated Aug 19, 2024
Python

Networks-Learning / prediction-powered-ranking

Code for "Prediction-Powered Ranking of Large Language Models", Arxiv 2024.

ranking-algorithm llm-eval llm-evaluation llm-evaluation-framework prediction-powered-inference rank-sets

Updated May 27, 2024
Python

honeyhiveai / realign

Realign is an AI testing and simulation framework for multi-turn AI applications. It simulates user interactions, evaluates AI performance, and generates adversarial scenarios to test LLM vulnerabilities.

ai simulation evaluation alignment red-teaming rag prompt-engineering llms llmops llm-eval llm-evaluation aiengineering llm-evaluation-framework

Updated Aug 19, 2024
Python

parea-ai / parea-sdk-ts

TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

llm prompt-engineering llms llm-eval llm-tools llm-evaluation llm-evaluation-toolkit llms-benchmarking llm-evaluation-framework

Updated Aug 16, 2024
TypeScript

prompt-foundry / python-sdk

The prompt engineering, prompt management, and prompt evaluation tool for Python

python python3 open-ai llm prompt-engineering prompt-management llm-eval llm-evaluation prompt-evaluation

Updated Aug 19, 2024
Python

harshagrawal523 / GenerativeAgents

Generative agents: computational software agents that simulate believable human behavior, built with OpenAI LLMs. Our main focus was to develop a game, “Werewolves of Miller’s Hollow”, that aims to replicate human-like behavior.

docker transformers openai mongodb-atlas pygame-gui llm generative-ai llm-eval

Updated Jul 27, 2023
Python

prompt-foundry / typescript-sdk

The prompt engineering, prompt management, and prompt evaluation tool for TypeScript, JavaScript, and NodeJS.

typescript gpt open-ai gpt-3 gpt-4 llm prompt-engineering llmops prompt-testing prompt-manager prompt-management llm-eval llm-test llm-ops llm-evaluation prompt-evaluation

Updated Aug 19, 2024
TypeScript

genia-dev / vibraniumdome-docs

LLM Security Platform Docs

security openai prompts llm prompt-engineering chatgpt llmops large-language-model prompt-injection llm-serving adverarial-attacks llm-agent llm-security llm-inference llm-eval llm-framework prompt-injection-tool llm-evaluation llm-firewall

Updated Apr 9, 2024
MDX

yuzu-ai / ShinRakuda

Shin Rakuda is a comprehensive framework for evaluating and benchmarking Japanese large language models, offering researchers and developers a flexible toolkit for assessing LLM performance across diverse datasets.

japanese llm llm-eval llm-evaluation llm-evaluation-framework

Updated Aug 8, 2024
Python
