The Larger they are, the Harder they Fail: Language Models do not Recognize Identifier Swaps in Python AVM Barone*, F Barez*, I Konstas, SB Cohen The 61st Annual Meeting Of The Association For Computational Linguistics, 2023 | 23* | 2023 |
PMIC: Improving Multi-Agent Reinforcement Learning with Progressive Mutual Information Collaboration P Li, H Tang, T Yang, X Hao, T Sang, Y Zheng, J Hao, ME Taylor, Z Wang, ... arXiv preprint arXiv:2203.08553, 2022 | 23 | 2022 |
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark J Hoelscher-Obermaier*, J Persson*, E Kran, I Konstas, F Barez* Findings of the Association for Computational Linguistics 2023, 11548–11559, 2023 | 22 | 2023 |
Neuron to Graph: Interpreting Language Model Neurons at Scale A Foote*, N Nanda, E Kran, I Konstas, S Cohen, F Barez* arXiv preprint arXiv:2305.19911, 2023 | 10 | 2023 |
Sleeper agents: Training deceptive LLMs that persist through safety training E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, ... arXiv preprint arXiv:2401.05566, 2024 | 9 | 2024 |
Understanding Addition in Transformers P Quirke, F Barez International Conference on Learning Representations (ICLR), 2023 | 5 | 2023 |
System III: Learning with Domain Knowledge for Safety Constraints F Barez, H Hasanbeig, A Abbate NeurIPS ML Safety Workshop, 2022 | 5 | 2022 |
Benchmarking specialized databases for high-frequency data F Barez, P Bilokon, R Xiong arXiv preprint arXiv:2301.12561, 2023 | 4 | 2023 |
Discovering topics and trends in the UK Government web archive D Beavan, F Barez, M Bel, J Fitzgerald, E Goudarouli, K Kollnig, ... Data Study Group Final Report. Alan Turing Institute, London, 2021 | 4* | 2021 |
Large language models relearn removed concepts M Lo, SB Cohen, F Barez arXiv preprint arXiv:2401.01814, 2024 | 3 | 2024 |
Exploring the advantages of transformers for high-frequency trading F Barez, P Bilokon, A Gervais, N Lisitsyn arXiv preprint arXiv:2302.13850, 2023 | 3 | 2023 |
Identifying a preliminary circuit for predicting gendered pronouns in GPT-2 small C Mathwin, G Corlouer, E Kran, F Barez, N Nanda URL: https://itch.io/jam/mechint/rate/1889871, 2023 | 3 | 2023 |
Beyond Training Objectives: Interpreting Reward Model Divergence in Large Language Models M Luke, A Amir, N Clement, A Rauno, T Philip, B Fazl https://arxiv.org/abs/2310.08164, 2024 | 2* | 2024 |
Interpreting Shared Circuits for Ordered Sequence Prediction in a Large Language Model M Lan, F Barez https://arxiv.org/abs/2311.04131, 2023 | 2* | 2023 |
Increasing Trust in Language Models through the Reuse of Verified Circuits P Quirke, C Neo, F Barez arXiv preprint arXiv:2402.02619, 2024 | 1 | 2024 |
Measuring Value Alignment F Barez, P Torr arXiv preprint arXiv:2312.15241, 2023 | 1 | 2023 |
AI Systems of Concern K Matteucci, S Avin, F Barez, SÓ hÉigeartaigh arXiv preprint arXiv:2310.05876, 2023 | 1 | 2023 |
ED2: an environment dynamics decomposition framework for world model construction C Wang, T Yang, J Hao, Y Zheng, H Tang, F Barez, J Liu, J Peng, H Piao, ... arXiv preprint arXiv:2112.02817, 2021 | 1 | 2021 |
Near to Mid-term Risks and Opportunities of Open Source Generative AI F Eiras, A Petrov, B Vidgen, CS de Witt, F Pizzati, K Elkins, ... arXiv preprint arXiv:2404.17047, 2024 | | 2024 |
The Scaling Behavior of Large Language Models AV Miceli-Barone, F Barez, SB Cohen, E Voita, U Germann, M Lukasik Proceedings of the First edition of the Workshop on the Scaling Behavior of …, 2024 | | 2024 |