Tomasz Korbak
Anthropic
Verified email at anthropic.com
Title · Cited by · Year
Open problems and fundamental limitations of reinforcement learning from human feedback
S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ...
arXiv preprint arXiv:2307.15217, 2023
Cited by 115 · 2023
Pretraining language models with human preferences
T Korbak, K Shi, A Chen, RV Bhalerao, C Buckley, J Phang, SR Bowman, ...
International Conference on Machine Learning, 17506-17533, 2023
Cited by 69 · 2023
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
L Berglund, M Tong, M Kaufmann, M Balesni, AC Stickland, T Korbak, ...
arXiv preprint arXiv:2309.12288, 2023
Cited by 55* · 2023
Inverse Scaling: When Bigger Isn't Better
IR McKenzie, A Lyzhov, M Pieler, A Parrish, A Mueller, A Prabhu, ...
arXiv preprint arXiv:2306.09479, 2023
Cited by 51* · 2023
Training language models with language feedback at scale
J Scheurer, JA Campos, T Korbak, JS Chan, A Chen, K Cho, E Perez
arXiv preprint arXiv:2303.16755, 2023
Cited by 51 · 2023
Computational enactivism under the free energy principle
T Korbak
Synthese 198 (3), 2743-2763, 2021
Cited by 31 · 2021
Improving code generation by training with natural language feedback
A Chen, J Scheurer, T Korbak, JA Campos, JS Chan, SR Bowman, K Cho, ...
arXiv preprint arXiv:2303.16749, 2023
Cited by 27 · 2023
Aligning language models with preferences through f-divergence minimization
D Go, T Korbak, G Kruszewski, J Rozen, N Ryu, M Dymetman
arXiv preprint arXiv:2302.08215, 2023
Cited by 25 · 2023
Towards understanding sycophancy in language models
M Sharma, M Tong, T Korbak, D Duvenaud, A Askell, SR Bowman, ...
arXiv preprint arXiv:2310.13548, 2023
Cited by 23 · 2023
RL with KL penalties is better viewed as Bayesian inference
T Korbak, E Perez, CL Buckley
arXiv preprint arXiv:2205.11275, 2022
Cited by 23 · 2022
On reinforcement learning and distribution matching for fine-tuning language models with no catastrophic forgetting
T Korbak, H Elsahar, G Kruszewski, M Dymetman
Advances in Neural Information Processing Systems 35, 16203-16220, 2022
Cited by 20 · 2022
Controlling conditional language models without catastrophic forgetting
T Korbak, H Elsahar, G Kruszewski, M Dymetman
International Conference on Machine Learning, 11499-11528, 2022
Cited by 17 · 2022
Interaction history as a source of compositionality in emergent communication
T Korbak, J Zubek, Ł Kuciński, P Miłoś, J Rączaszek-Leonardi
Interaction Studies 22 (2), 212-243, 2021
Cited by 16* · 2021
Taken out of context: On measuring situational awareness in LLMs
L Berglund, AC Stickland, M Balesni, M Kaufmann, M Tong, T Korbak, ...
arXiv preprint arXiv:2309.00667, 2023
Cited by 15* · 2023
Catalytic role of noise and necessity of inductive biases in the emergence of compositional communication
Ł Kuciński, T Korbak, P Kołodziej, P Miłoś
Advances in Neural Information Processing Systems 34, 23075-23088, 2021
Cited by 13 · 2021
Scaffolded minds and the evolution of content in signaling pathways
T Korbak
Studies in Logic, Grammar and Rhetoric 41 (1), 89-103, 2015
Cited by 11 · 2015
Measuring non-trivial compositionality in emergent communication
T Korbak, J Zubek, J Rączaszek-Leonardi
arXiv preprint arXiv:2010.15058, 2020
Cited by 8 · 2020
Exploiting unsupervised pre-training and automated feature engineering for low-resource hate speech detection in Polish
R Korzeniowski, R Rolczyński, P Sadownik, T Korbak, M Możejko
arXiv preprint arXiv:1906.09325, 2019
Cited by 8 · 2019
Energy-based models for code generation under compilability constraints
T Korbak, H Elsahar, M Dymetman, G Kruszewski
arXiv preprint arXiv:2106.04985, 2021
Cited by 7 · 2021
Fine-tuning Tree-LSTM for phrase-level sentiment classification on a Polish dependency treebank
T Korbak, P Żak
Language and Technology Conference, 31-42, 2017
Cited by 4 · 2017
Articles 1–20