by wujerry2000 on 9/20/23, 2:30 AM with 17 comments
by marcklingen on 9/20/23, 6:45 AM
by retrovrv on 9/20/23, 1:32 PM
by batshit_beaver on 9/20/23, 6:18 AM
by aethelyon on 9/21/23, 12:59 AM
======
Outside of us, here's what I see happening:
80% of folks aren't building in prod.
If you pull apart the 20% that are building, I've seen this, from largest to smallest population:
1. most people are not monitoring
2. home-grown solutions logged into existing observability/analytics platforms
3. LLMOps tooling like Klu
My 2 cents on the unfortunate truth: I think many of the AI bolt-on features are living the classic feature lifecycle: they get launched, no one monitors them for improvement, and feature retention sucks, so there's no top-down push to prioritize them. The people who are measuring and improving are exceptional builders regardless of LLMs/RAG.
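For the home-grown route (#2 above), the minimum viable version is usually a thin wrapper that emits one structured log line per LLM call into whatever observability/analytics pipeline already exists. A minimal sketch in Python; `call_model` and the field names are placeholders I made up, not any particular vendor's API:

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("llm_monitoring")
logging.basicConfig(level=logging.INFO)

def call_model(prompt: str) -> str:
    # Placeholder for your actual LLM call (OpenAI, Anthropic, self-hosted, ...).
    return "stub response"

def logged_llm_call(prompt: str, feature: str) -> str:
    """Wrap an LLM call and emit one structured log line per request,
    so an existing log/analytics pipeline can chart volume, latency,
    and per-feature usage."""
    request_id = str(uuid.uuid4())
    start = time.time()
    response = call_model(prompt)
    logger.info(json.dumps({
        "event": "llm_call",
        "request_id": request_id,
        "feature": feature,           # which bolt-on feature made the call
        "prompt_chars": len(prompt),  # avoid logging raw user text if it's sensitive
        "response_chars": len(response),
        "latency_ms": round((time.time() - start) * 1000),
    }))
    return response
```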
by sirspacey on 9/20/23, 4:28 AM
by ezedv on 9/21/23, 1:58 PM
Many AI companies use a combination of real-time monitoring, automated alerts, and regular audits to maintain the quality and fairness of their AI systems. It's an ongoing process that plays a vital role in responsible AI development.
In case you have an AI project in mind, feel free to contact us! https://www.ratherlabs.com
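In practice, "real-time monitoring with automated alerts" for LLM quality often reduces to tracking a rolling pass/fail rate over graded responses and paging when it crosses a threshold. A minimal sketch; the class name, window size, threshold, and `on_alert` hook are all illustrative, not any specific product's API:

```python
from collections import deque

class RollingFailureAlert:
    """Track the failure rate of the last N graded LLM responses and
    fire an alert callback when it exceeds a threshold."""

    def __init__(self, window: int = 200, threshold: float = 0.05, on_alert=print):
        self.results = deque(maxlen=window)
        self.threshold = threshold
        self.on_alert = on_alert  # swap in a pager/Slack hook in real use

    def record(self, passed: bool) -> None:
        self.results.append(passed)
        if len(self.results) == self.results.maxlen:
            failure_rate = 1 - sum(self.results) / len(self.results)
            if failure_rate > self.threshold:
                self.on_alert(
                    f"LLM failure rate {failure_rate:.1%} exceeds {self.threshold:.0%}"
                )

# Usage: alert.record(grade_passed) after each graded response.
```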
by jobseeker36 on 9/20/23, 5:52 AM
by tikkun on 9/20/23, 1:09 PM
Here are my notes on evals --
Things to consider when comparing options:
1) “Types of metrics supported (only NLP metrics, model-graded evals, or both); level of customizability; whether it supports component evals (i.e. single prompts) or pipeline evals (i.e. testing the entire pipeline, all the way from retrieval to post-processing)”
2) “Method of dataset & eval management (config vs UI), plus tracing to help debug failing evals”
3) “If you wanted to go deeper on evaluation, I'd probably also add:
What to evaluate for:
- Hallucination
- Safety
- Usefulness
- Tone / format (e.g. conciseness)
- Specific regressions
Tips:
- Model-graded evaluation is taking off
- Use GPT-4; GPT-3.5 is not good enough [for evals]
- Most big companies have some human oversight of the model-grading
- Conversational simulation is an emerging idea building on top of model-graded eval” - AI Startup Founder
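To make the model-graded evaluation tip concrete, here's a minimal sketch of a GPT-4 grader using the openai-python 0.x ChatCompletion API; the rubric prompt and PASS/FAIL parsing are my own simplification, not any specific eval tool's approach:

```python
import openai  # pip install openai; expects OPENAI_API_KEY in the environment

GRADER_PROMPT = """You are grading an LLM answer.
Question: {question}
Reference context: {context}
Answer to grade: {answer}

Does the answer stay faithful to the context and actually address the question?
Reply with exactly PASS or FAIL, then one sentence of reasoning."""

def model_graded_eval(question: str, context: str, answer: str) -> bool:
    """Return True if GPT-4 judges the answer as passing."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,  # keep grading as deterministic as possible
        messages=[{
            "role": "user",
            "content": GRADER_PROMPT.format(
                question=question, context=context, answer=answer
            ),
        }],
    )
    verdict = response.choices[0].message.content.strip()
    return verdict.upper().startswith("PASS")
```

Run this over a fixed dataset of (question, context, answer) triples and track the pass rate per release; human spot-checks of the grader's verdicts cover the "human oversight of the model-grading" point above.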
---
Here are a few that people are using for evals at production scale:
* Honeyhive https://honeyhive.ai
* Gentrace https://gentrace.ai
* Humanloop https://humanloop.com
* Gantry https://www.gantry.io
I've done calls with the founders of three of those four, and I've talked with enterprise customers who've been evaluating a couple of those.
I see there are a few others mentioned in this thread (langfuse, truera, langkit/whylabs) that I haven't heard about from customers but that also look promising. There's also langsmith, which I do know is popular amongst enterprises (enterprises hear of langchain and see that it has a big enterprise-oriented offering), but I haven't talked with anyone who uses it.
Then for evals at prototyping scale there are various small tools and open source tools that I've collected here: https://llm-utils.org/List+of+tools+for+prompt+engineering
[1]: I'm working on an AI infra handbook. Email me (email in profile) if you can review/add comments to my draft. It's 23 pages long :x