Shreya Shankar (@sh_reya)'s Twitter Profile
Shreya Shankar

@sh_reya

I study ML & AI engineers and try to make their lives a little better. PhD-ing in databases & HCI @Berkeley_EECS @UCBEPIC and MLOps-ing around town. She/they.

ID: 2286218053

Link: http://www.sh-reya.com · Joined: 11-01-2014 06:46:16

4.1K Tweets

39.4K Followers

593 Following

Shreya Shankar (@sh_reya):

got screwed by my evals today. lesson: if you rely on a separate labeled dataset for testing changes to your production LLM pipeline, and some prod LLM inputs may contain errors like typos, bad grammar, different capitalization, etc., the test set needs to reflect this!!
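
A minimal sketch of that lesson, assuming a Python test harness: augment the clean labeled test set with the kinds of noise production inputs actually contain. The perturbation helpers and the sample data below are hypothetical illustrations, not from any codebase mentioned in this thread.

```python
import random

random.seed(0)

def drop_char(text: str) -> str:
    """Simulate a typo by deleting one random character."""
    if len(text) < 2:
        return text
    i = random.randrange(len(text))
    return text[:i] + text[i + 1:]

def lowercase_all(text: str) -> str:
    """Simulate inconsistent capitalization."""
    return text.lower()

def strip_punctuation(text: str) -> str:
    """Simulate sloppy punctuation."""
    return text.replace(",", "").replace(".", "")

PERTURBATIONS = [drop_char, lowercase_all, strip_punctuation]

def noisy_variants(clean: str, n: int = 3) -> list[str]:
    """Return n noisy copies of a clean test input."""
    return [random.choice(PERTURBATIONS)(clean) for _ in range(n)]

# Augment a labeled test set so it also covers messy production-style inputs.
test_set = [("Summarize the attached Q3 report.", "summary")]
augmented = [(variant, label)
             for text, label in test_set
             for variant in [text] + noisy_variants(text)]
print(augmented)
```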

hci.social/@jbigham (@jeffbigham):

underappreciated in LLM-land is how humans ultimately decide what matters, what is valuable, and what is right.

Yuchen Jin (@Yuchenj_UW):

Who validates the validators?

An interesting observation from this paper is that even humans are not reliable judges of LLMs, because their evaluation criteria drift as they spend more time with LLMs.

Maybe the approach proposed in the paper could be incorporated into Chatbot Arena.

Evidently AI (@EvidentlyAI):

πŸ‘©β€πŸ« Who validates the validators?

Shreya Shankar et al. introduced EvalGen, a mixed-initiative approach that aligns LLM-generated evaluations of LLM outputs with human preferences, addressing inherent biases and improving reliability.

Implementation details:
arxiv.org/abs/2404.12272
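
As a rough sketch of the alignment idea (not the authors' implementation): candidate assertions are proposed for each criterion, a human grades a handful of pipeline outputs, and the assertion that agrees best with those grades is kept. The candidates, grades, and threshold below are hypothetical.

```python
from typing import Callable

# A candidate assertion maps one pipeline output to pass/fail.
Assertion = Callable[[str], bool]

def alignment(assertion: Assertion, graded: list[tuple[str, bool]]) -> float:
    """Fraction of human thumbs-up/down grades the assertion agrees with."""
    return sum(assertion(out) == ok for out, ok in graded) / len(graded)

def select_assertions(candidates: dict[str, list[Assertion]],
                      graded: list[tuple[str, bool]],
                      threshold: float = 0.8) -> dict[str, Assertion]:
    """Keep, per criterion, the candidate assertion that best matches the grades."""
    chosen = {}
    for criterion, options in candidates.items():
        best = max(options, key=lambda a: alignment(a, graded))
        if alignment(best, graded) >= threshold:
            chosen[criterion] = best
    return chosen

# Hypothetical candidates for a "concise" criterion (two code-based variants),
# plus a few human grades on sampled outputs.
candidates = {"concise": [lambda out: len(out.split()) < 50,
                          lambda out: len(out.split()) < 120]}
graded = [("a short, on-point answer", True), ("word " * 300, False)]
print(list(select_assertions(candidates, graded)))  # -> ['concise']
```

The key design choice in this sketch is that the human grades, not the LLM, are the ground truth the assertions are selected against.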

πŸ‘©β€πŸ« Who validates the validators? @sh_reya et al. introduced EvalGen, a mixed-initiative approach that aligns LLM-generated evaluations of LLM outputs with human preferences, addressing inherent biases and improving reliability. Implementation details: arxiv.org/abs/2404.12272
account_circle
Ian Arawjo (@IanArawjo, @ianarawjo@hci.social):

The Multi-Eval node is now in ChainForge! 🎉 Define multiple criteria and evaluators on one node, including a mix of code- and LLM-based evals. Table View is also improved. 📈 Check out v0.3.1.5 on PyPI, or on chainforge.ai/play. Release notes follow.
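
A concept-only sketch of what "multiple criteria and evaluators on one node" means, not ChainForge's actual API: several named evaluators, some code-based and one LLM-based stand-in, run over the same outputs and produce a score table. `llm_judge` here is a hypothetical placeholder for a real model call.

```python
# Concept sketch only -- not ChainForge's actual API.
def llm_judge(output: str, instruction: str) -> bool:
    """Stand-in for an LLM-based grader; a real one would call a model."""
    return "sorry" not in output.lower()  # toy heuristic in place of a model call

evaluators = {
    # code-based evals
    "non_empty": lambda out: bool(out.strip()),
    "under_100_words": lambda out: len(out.split()) < 100,
    # LLM-based eval
    "polite_tone": lambda out: llm_judge(out, "Is the tone polite?"),
}

outputs = ["Here is the summary you asked for.", ""]

# Score table: one row per output, one column per evaluator.
for out in outputs:
    print({name: ev(out) for name, ev in evaluators.items()})
```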

Ian Arawjo (@IanArawjo, @ianarawjo@hci.social):

We'll roll out EvalGen in ChainForge in the coming weeks. This week, I aim to push the Multi-Eval node, alongside the nifty improvements to the table view for showing scores across many evaluators.

Fred Jonsson (@enginoid):

100% relate to this. one reason that evaluation criteria are hard to work out in advance is that you don't know ahead of time how the model's going to fail, and that's usually what you're looking for in evals

i.e. there's no need to do evals over something that you know will…

Kyle Baxter (@kbaxter):

Shreya Shankar This is a super interesting workflow and something we are currently doing very manually. Build evals > test using a sample with human GT labels > refine evals. Then, in dev, run evals routinely, look for increased scores, but review outputs to apply qualitative adjustments.
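
A bare-bones version of that manual loop, with hypothetical names and data: each eval revision is scored by how well it agrees with a small human ground-truth (GT) sample before its scores are trusted in dev.

```python
def agreement(eval_fn, labeled_sample):
    """Share of the human ground-truth labels that the eval reproduces."""
    return sum(eval_fn(out) == label for out, label in labeled_sample) / len(labeled_sample)

# A small sample of pipeline outputs with human GT labels (hypothetical).
labeled_sample = [("The refund was issued on May 2.", True),
                  ("I cannot help with that.", False)]

# Two hand-refined revisions of the same eval.
eval_v1 = lambda out: len(out) > 10                      # too crude
eval_v2 = lambda out: "cannot help" not in out.lower()   # refined after reviewing outputs

for name, fn in [("v1", eval_v1), ("v2", eval_v2)]:
    print(name, agreement(fn, labeled_sample))  # v1 -> 0.5, v2 -> 1.0
```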

Shreya Shankar (@sh_reya):

Evals are arguably the hardest part of LLMOps. LLMs mess up, so we check them w/ other LLMs, but this feels icky. Who validates the validators??

We built an interface to align LLM-based evals with user preferences, learning a lot about why this is hard: arxiv.org/abs/2404.12272
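
For concreteness, a hedged sketch of the kind of LLM-based validator being questioned here, with `call_llm` as a hypothetical stand-in for a real model client; the paper asks whether such a judge's verdicts actually track the user's own preferences.

```python
JUDGE_PROMPT = """You are grading another model's output.
Criterion: {criterion}
Output: {output}
Answer with exactly PASS or FAIL."""

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real chat-completion client; always passes here."""
    return "PASS"

def llm_validator(output: str, criterion: str) -> bool:
    """An LLM-based eval: one model judging another model's output."""
    verdict = call_llm(JUDGE_PROMPT.format(criterion=criterion, output=output))
    return verdict.strip().upper().startswith("PASS")

print(llm_validator("The capital of France is Paris.", "The answer is factually correct."))
```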

David Tippett (@dtaivpp):

This has been one of the things I’ve been most concerned about when it comes to LLMs.

Scaling user judgements is hard so I’m looking forward to reading into this 👀

Fred Jonsson (@enginoid):

since I'm deeply immersed in evals right now (and the process of building them) I got a kick out of this paper from Shreya Shankar (@sh_reya), J.D. Zamfirescu (@jdzamfi), Bjoern Hartmann (@bjo3rn), Aditya Parameswaran (@adityagp), and Ian Arawjo (@IanArawjo)

it addresses the challenge of time-efficiently coming up with evals that are aligned with practitioners

some…
