Shreya Shankar (@sh_reya)'s Twitter Profile
Shreya Shankar

@sh_reya

I study ML & AI engineers and try to make their lives a little better. PhD-ing in databases & HCI @Berkeley_EECS @UCBEPIC and MLOps-ing around town. She/they.

ID: 2286218053

Link: http://www.sh-reya.com · Joined: 11-01-2014 06:46:16

4.1K Tweets

39.4K Followers

593 Following

Shreya Shankar (@sh_reya):

got screwed by my evals today. lesson: if you rely on a separate labeled dataset for testing changes to your production LLM pipeline, and some prod LLM inputs may contain errors like typos, bad grammar, different capitalization, etc., the test set needs to reflect this!!
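
A minimal sketch of that lesson, assuming a Python test harness: augment the clean labeled test set with the kinds of noise production inputs actually contain. The perturbation helpers and the sample data below are hypothetical illustrations, not from any codebase mentioned in this thread.

```python
import random

random.seed(0)

def drop_char(text: str) -> str:
    """Simulate a typo by deleting one random character."""
    if len(text) < 2:
        return text
    i = random.randrange(len(text))
    return text[:i] + text[i + 1:]

def lowercase_all(text: str) -> str:
    """Simulate inconsistent capitalization."""
    return text.lower()

def strip_punctuation(text: str) -> str:
    """Simulate sloppy punctuation."""
    return text.replace(",", "").replace(".", "")

PERTURBATIONS = [drop_char, lowercase_all, strip_punctuation]

def noisy_variants(clean: str, n: int = 3) -> list[str]:
    """Return n noisy copies of a clean test input."""
    return [random.choice(PERTURBATIONS)(clean) for _ in range(n)]

# Augment a labeled test set so it also covers messy production-style inputs.
test_set = [("Summarize the attached Q3 report.", "summary")]
augmented = [(variant, label)
             for text, label in test_set
             for variant in [text] + noisy_variants(text)]
print(augmented)
```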

hci.social/@jbigham (@jeffbigham):

underappreciated in LLM-land is how humans ultimately decide what matters, what is valuable, and what is right.

Yuchen Jin (@Yuchenj_UW):

Who validates the validators?

An interesting observation from this paper is that even humans are not reliable judges of LLMs, because their evaluation criteria drift as they spend more time with LLMs.

Maybe the approach proposed in the paper could be incorporated into Chatbot Arena.

Evidently AI (@EvidentlyAI):

πŸ‘©β€πŸ« Who validates the validators?

Shreya Shankar et al. introduced EvalGen, a mixed-initiative approach that aligns LLM-generated evaluations of LLM outputs with human preferences, addressing inherent biases and improving reliability.

Implementation details:
arxiv.org/abs/2404.12272
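
As a rough sketch of the alignment idea (not the authors' implementation): candidate assertions are proposed for each criterion, a human grades a handful of pipeline outputs, and the assertion that agrees best with those grades is kept. The candidates, grades, and threshold below are hypothetical.

```python
from typing import Callable

# A candidate assertion maps one pipeline output to pass/fail.
Assertion = Callable[[str], bool]

def alignment(assertion: Assertion, graded: list[tuple[str, bool]]) -> float:
    """Fraction of human thumbs-up/down grades the assertion agrees with."""
    return sum(assertion(out) == ok for out, ok in graded) / len(graded)

def select_assertions(candidates: dict[str, list[Assertion]],
                      graded: list[tuple[str, bool]],
                      threshold: float = 0.8) -> dict[str, Assertion]:
    """Keep, per criterion, the candidate assertion that best matches the grades."""
    chosen = {}
    for criterion, options in candidates.items():
        best = max(options, key=lambda a: alignment(a, graded))
        if alignment(best, graded) >= threshold:
            chosen[criterion] = best
    return chosen

# Hypothetical candidates for a "concise" criterion (two code-based variants),
# plus a few human grades on sampled outputs.
candidates = {"concise": [lambda out: len(out.split()) < 50,
                          lambda out: len(out.split()) < 120]}
graded = [("a short, on-point answer", True), ("word " * 300, False)]
print(list(select_assertions(candidates, graded)))  # -> ['concise']
```

The key design choice in this sketch is that the human grades, not the LLM, are the ground truth the assertions are selected against.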

πŸ‘©β€πŸ« Who validates the validators? @sh_reya et al. introduced EvalGen, a mixed-initiative approach that aligns LLM-generated evaluations of LLM outputs with human preferences, addressing inherent biases and improving reliability. Implementation details: arxiv.org/abs/2404.12272
account_circle
Ian Arawjo (@IanArawjo, @ianarawjo@hci.social):

The Multi-Eval node is now in ChainForge! 🎉 Define multiple criteria and evaluators on one node, including a mix of code- and LLM-based evals. Table View is also improved. 📈 Check out v0.3.1.5 on PyPI, or on chainforge.ai/play. Release notes follow.
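
A concept-only sketch of what "multiple criteria and evaluators on one node" means, not ChainForge's actual API: several named evaluators, some code-based and one LLM-based stand-in, run over the same outputs and produce a score table. `llm_judge` here is a hypothetical placeholder for a real model call.

```python
# Concept sketch only -- not ChainForge's actual API.
def llm_judge(output: str, instruction: str) -> bool:
    """Stand-in for an LLM-based grader; a real one would call a model."""
    return "sorry" not in output.lower()  # toy heuristic in place of a model call

evaluators = {
    # code-based evals
    "non_empty": lambda out: bool(out.strip()),
    "under_100_words": lambda out: len(out.split()) < 100,
    # LLM-based eval
    "polite_tone": lambda out: llm_judge(out, "Is the tone polite?"),
}

outputs = ["Here is the summary you asked for.", ""]

# Score table: one row per output, one column per evaluator.
for out in outputs:
    print({name: ev(out) for name, ev in evaluators.items()})
```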

Ian Arawjo (@IanArawjo, @ianarawjo@hci.social):

We'll roll out EvalGen in ChainForge in the coming weeks. This week, I aim to push the Multi-Eval node, alongside the nifty improvements to the table view for showing scores across many evaluators.

Fred Jonsson (@enginoid):

100% relate to this. one reason that evaluation criteria are hard to work out in advance is that you don't know ahead of time how the model's going to fail, and that's usually what you're looking for in evals

i.e. there's no need to do evals over something that you know will…

Kyle Baxter (@kbaxter):

Shreya Shankar This is a super interesting workflow and something we are currently doing very manually. Build evals > test using a sample with human GT labels > refine evals. Then, in dev, run evals routinely, look for increased scores, but review outputs to apply qualitative adjustments.
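
A bare-bones version of that manual loop, with hypothetical names and data: each eval revision is scored by how well it agrees with a small human ground-truth (GT) sample before its scores are trusted in dev.

```python
def agreement(eval_fn, labeled_sample):
    """Share of the human ground-truth labels that the eval reproduces."""
    return sum(eval_fn(out) == label for out, label in labeled_sample) / len(labeled_sample)

# A small sample of pipeline outputs with human GT labels (hypothetical).
labeled_sample = [("The refund was issued on May 2.", True),
                  ("I cannot help with that.", False)]

# Two hand-refined revisions of the same eval.
eval_v1 = lambda out: len(out) > 10                      # too crude
eval_v2 = lambda out: "cannot help" not in out.lower()   # refined after reviewing outputs

for name, fn in [("v1", eval_v1), ("v2", eval_v2)]:
    print(name, agreement(fn, labeled_sample))  # v1 -> 0.5, v2 -> 1.0
```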

Shreya Shankar (@sh_reya):

Evals are arguably the hardest part of LLMOps. LLMs mess up, so we check them w/ other LLMs, but this feels icky. Who validates the validators??

We built an interface to align LLM-based evals with user preferences, learning a lot about why this is hard: arxiv.org/abs/2404.12272
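
For concreteness, a hedged sketch of the kind of LLM-based validator being questioned here, with `call_llm` as a hypothetical stand-in for a real model client; the paper asks whether such a judge's verdicts actually track the user's own preferences.

```python
JUDGE_PROMPT = """You are grading another model's output.
Criterion: {criterion}
Output: {output}
Answer with exactly PASS or FAIL."""

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real chat-completion client; always passes here."""
    return "PASS"

def llm_validator(output: str, criterion: str) -> bool:
    """An LLM-based eval: one model judging another model's output."""
    verdict = call_llm(JUDGE_PROMPT.format(criterion=criterion, output=output))
    return verdict.strip().upper().startswith("PASS")

print(llm_validator("The capital of France is Paris.", "The answer is factually correct."))
```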

David Tippett (@dtaivpp):

This has been one of the things I’ve been most concerned about when it comes to LLMs.

Scaling user judgements is hard so I’m looking forward to reading into this 👀

Fred Jonsson (@enginoid):

since I'm deeply immersed in evals right now (and the process of building them) I got a kick out of this paper from Shreya Shankar (@sh_reya), J.D. Zamfirescu (@jdzamfi), Bjoern Hartmann (@bjo3rn), Aditya Parameswaran (@adityagp), and Ian Arawjo (@IanArawjo)

it addresses the challenge of time-efficiently coming up with evals that are aligned with practitioners

some…
