Estimating the reproducibility of psychological science Open Science Collaboration Science 349 (6251), aac4716, 2015 | 9575 | 2015 |
Dota 2 with large scale deep reinforcement learning C Berner, G Brockman, B Chan, V Cheung, P Dębiak, C Dennison, ... arXiv preprint arXiv:1912.06680, 2019 | 1883 | 2019 |
Training a helpful and harmless assistant with reinforcement learning from human feedback Y Bai, A Jones, K Ndousse, A Askell, A Chen, N DasSarma, D Drain, ... arXiv preprint arXiv:2204.05862, 2022 | 1099 | 2022 |
Constitutional ai: Harmlessness from ai feedback Y Bai, S Kadavath, S Kundu, A Askell, J Kernion, A Jones, A Chen, ... arXiv preprint arXiv:2212.08073, 2022 | 875 | 2022 |
An open, large-scale, collaborative effort to estimate the reproducibility of psychological science Open Science Collaboration Perspectives on Psychological Science 7, 657-660, 2012 | 754 | 2012 |
Tensorfuzz: Debugging neural networks with coverage-guided fuzzing A Odena, C Olsson, D Andersen, I Goodfellow International Conference on Machine Learning, 4901-4911, 2019 | 372 | 2019 |
Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned D Ganguli, L Lovitt, J Kernion, A Askell, Y Bai, S Kadavath, B Mann, ... arXiv preprint arXiv:2209.07858, 2022 | 341 | 2022 |
A general language assistant as a laboratory for alignment A Askell, Y Bai, A Chen, D Drain, D Ganguli, T Henighan, A Jones, ... arXiv preprint arXiv:2112.00861, 2021 | 303 | 2021 |
In-context learning and induction heads C Olsson, N Elhage, N Nanda, N Joseph, N DasSarma, T Henighan, ... arXiv preprint arXiv:2209.11895, 2022 | 256 | 2022 |
Predictability and surprise in large generative models D Ganguli, D Hernandez, L Lovitt, A Askell, Y Bai, A Chen, T Conerly, ... Proceedings of the 2022 ACM Conference on Fairness, Accountability, and …, 2022 | 242 | 2022 |
A mathematical framework for transformer circuits N Elhage, N Nanda, C Olsson, T Henighan, N Joseph, B Mann, A Askell, ... Transformer Circuits Thread 1 (1), 12, 2021 | 226 | 2021 |
Toy models of superposition N Elhage, T Hume, C Olsson, N Schiefer, T Henighan, S Kravec, ... arXiv preprint arXiv:2209.10652, 2022 | 201 | 2022 |
Discovering language model behaviors with model-written evaluations E Perez, S Ringer, K Lukošiūtė, K Nguyen, E Chen, S Heiner, C Pettit, ... arXiv preprint arXiv:2212.09251, 2022 | 181 | 2022 |
Discriminator rejection sampling S Azadi, C Olsson, T Darrell, I Goodfellow, A Odena arXiv preprint arXiv:1810.06758, 2018 | 155 | 2018 |
Dawn Drain N Elhage, N Nanda, C Olsson, T Henighan, N Joseph, B Mann, A Askell, ... Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson …, 2021 | 152 | 2021 |
Is generator conditioning causally related to GAN performance? A Odena, J Buckman, C Olsson, T Brown, C Olah, C Raffel, I Goodfellow International conference on machine learning, 3849-3858, 2018 | 142 | 2018 |
Dawn Drain C Olsson, N Elhage, NJ Neel Nanda, N DasSarma, T Henighan, B Mann, ... Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy …, 2022 | 138 | 2022 |
The capacity for moral self-correction in large language models D Ganguli, A Askell, N Schiefer, TI Liao, K Lukošiūtė, A Chen, A Goldie, ... arXiv preprint arXiv:2302.07459, 2023 | 125 | 2023 |
Language models (mostly) know what they know S Kadavath, T Conerly, A Askell, T Henighan, D Drain, E Perez, ... arXiv preprint arXiv:2207.05221, 2022 | 121 | 2022 |
Dota 2 with large scale deep reinforcement learning CB OpenAI, G Brockman, B Chan, V Cheung, P Debiak, C Dennison, ... arXiv preprint arXiv:1912.06680 2, 2019 | 115 | 2019 |