Syllabus
The syllabus is preliminary and subject to change.
Lecture notes
- Lecture notes
- Transformer code tutorial
- Attention is all you need
(Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin)
- Transformer: A Novel Neural Network Architecture for Language Understanding
(posted by Jakob Uszkoreit)
- Stanford cs224n lecture notes on self attention
(John Hewitt)
- The Annotated Transformer
(Austin Huang, Suraj Subramanian, Jonathan Sum, Khalid Almubarak, and Stella Biderman. Original by Sasha Rush)
- The Illustrated Transformer
(Jay Alammar)
Links (=optional)
- RASPy: Think like a Transformer
(Sasha Rush)
- Convolutional Sequence to Sequence Learning
(Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin)
- Layer Normalization
(Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton)
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting
(Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov)
- Improving Transformer Optimization Through Better Initialization
(Xiao Shi Huang, Felipe Perez, Jimmy Ba, Maksims Volkovs)
- A Mathematical Framework for Transformer Circuits
(Anthropic)
- Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
(Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, Ivan Titov)
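The Transformer materials above all build on the same core operation, scaled dot-product self-attention. The sketch below is a minimal NumPy illustration of that operation (not code from any of the linked tutorials); the function name, shapes, and toy inputs are assumptions chosen for readability.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Single-head scaled dot-product attention (illustrative sketch, not library code).

    Q, K: (seq_len, d_k); V: (seq_len, d_v); mask: optional boolean
    (seq_len, seq_len) array where True means "may attend".
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise similarity scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)        # hide masked positions from the softmax
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # attention-weighted sum of values

# Toy self-attention: 4 tokens, model dimension 8, queries = keys = values = token states.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)   # (4, 8)
```

Multi-head attention repeats this with separate learned projections per head and concatenates the results, as described in the papers and tutorials listed above.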
Lecture notes
- Lecture notes
- Measuring Massive Multitask Language Understanding
(Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt)
- MTEB: Massive Text Embedding Benchmark
(Niklas Muennighoff, Nouamane Tazi, Loïc Magne, Nils Reimers)
- AI and the Everything in the Whole Wide World Benchmark
(Deborah Raji, Emily Denton, Emily M. Bender, Alex Hanna, Amandalynne Paullada)
- Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data
(Emily M. Bender, Alexander Koller)
Lecture notes
Links (=optional)
- Semi-supervised Sequence Learning
(Andrew M. Dai, Quoc V. Le)
- Deep contextualized word representations
(Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer)
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
(Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov)
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
(Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov)
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
(Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning)
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
(Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu)
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
(Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut)
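Several of the pretraining papers above (BERT-style models such as RoBERTa and ALBERT) train with a masked language modeling objective. As a rough sketch of that objective, assuming plain Python lists of string tokens rather than any particular tokenizer or library API:

```python
import random

def mask_for_mlm(tokens, vocab, mask_token="[MASK]", p=0.15, seed=0):
    """BERT/RoBERTa-style masking, sketched on plain token lists.

    About 15% of positions are selected; a selected token becomes [MASK]
    80% of the time, a random vocabulary token 10% of the time, and stays
    unchanged 10% of the time. The model is trained to predict the original
    token only at the selected positions.
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < p:
            targets.append(tok)                 # this position contributes to the loss
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(mask_token)
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)
        else:
            corrupted.append(tok)
            targets.append(None)                # this position is ignored by the loss
    return corrupted, targets

sentence = "the cat sat on the mat".split()
print(mask_for_mlm(sentence, vocab=sentence))
```

RoBERTa's "dynamic masking" amounts to re-drawing this corruption every time a sequence is fed to the model instead of fixing the masks once during preprocessing.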
Lecture notes
- Lecture notes
- Prefix-Tuning: Optimizing Continuous Prompts for Generation
(Xiang Lisa Li, Percy Liang)
- AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning
(Yaqing Wang, Sahaj Agarwal, Subhabrata Mukherjee, Xiaodong Liu, Jing Gao, Ahmed Hassan Awadallah, Jianfeng Gao)
- LoRA: Low-Rank Adaptation of Large Language Models
(Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen)
- Adapter methods
(docs.adapterhub.ml)
Links (=optional)
- AdapterHub: A Framework for Adapting Transformers
(Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, Iryna Gurevych)
- Parameter-Efficient Transfer Learning for NLP
(Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, Sylvain Gelly)
- Simple, Scalable Adaptation for Neural Machine Translation
(Ankur Bapna, Naveen Arivazhagan, Orhan Firat)
- AdapterFusion: Non-Destructive Task Composition for Transfer Learning
(Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, Iryna Gurevych)
- Parameter-Efficient Tuning with Special Token Adaptation
(Xiaocong Yang, James Y. Huang, Wenxuan Zhou, Muhao Chen)
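The parameter-efficient methods listed for this lecture (adapters, prefix-tuning, LoRA) share one idea: keep the pretrained weights frozen and train only a small number of added parameters. Below is a minimal PyTorch sketch of a LoRA-style linear layer, written from the description in the LoRA paper above; the class name, rank, and scaling defaults are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (LoRA-style sketch).

    The pretrained weight W stays frozen; only the rank-r factors A and B are
    trained, so the effective weight is W + (alpha / r) * B @ A.
    """
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                      # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)   # down-projection
        self.B = nn.Parameter(torch.zeros(out_features, r))         # up-projection, starts at zero
        self.scaling = alpha / r

    def forward(self, x):
        # Base output plus the scaled low-rank correction.
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Only A and B receive gradients; the base weight does not.
layer = LoRALinear(64, 64)
print([name for name, p in layer.named_parameters() if p.requires_grad])  # ['A', 'B']
```

Because B starts at zero, the layer initially behaves exactly like the frozen pretrained layer, and the low-rank update can later be merged into W for inference.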
Lecture notes
Links (=optional)
- In-context Examples Selection for Machine Translation
(Sweta Agrawal, Chunting Zhou, Mike Lewis, Luke Zettlemoyer, Marjan Ghazvininejad)
- How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation
(Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, Hany Hassan Awadalla)
Lecture notes
Links (=optional)
- Scaling Instruction-Finetuned Language Models
(Google)
- Proximal Policy Optimization Algorithms
(John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov)
- LIMA: Less Is More for Alignment
(Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, Omer Levy)
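The RLHF-style pipelines behind the instruction-tuning papers above typically optimize the policy with PPO. As a minimal, hedged sketch of the clipped surrogate objective from the Schulman et al. paper listed here (policy term only; the value loss, entropy bonus, and KL penalty used in practice are omitted):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO policy objective (sketch of the surrogate loss only).

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and the old policy; advantages: advantage estimates.
    """
    ratio = torch.exp(logp_new - logp_old)                    # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()              # negated: we minimize this

# Toy numbers just to show the call.
logp_old = torch.log(torch.tensor([0.30, 0.60]))
logp_new = torch.log(torch.tensor([0.40, 0.50]))
advantages = torch.tensor([1.0, -0.5])
print(ppo_clip_loss(logp_new, logp_old, advantages))
```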
Lecture notes
- Do We Need to Create Big Datasets to Learn a Task?
(Swaroop Mishra, Bhavdeep Singh Sachdeva)
- Shortformer: Better Language Modeling using Shorter Inputs
(Ofir Press, Noah A. Smith, Mike Lewis)
- Active Learning for BERT: An Empirical Study
(Liat Ein-Dor, Alon Halfon, Ariel Gera, Eyal Shnarch, Lena Dankin, Leshem Choshen, Marina Danilevsky, Ranit Aharonov, Yoav Katz, Noam Slonim)
Lecture notes
- Lecture notes
- Efficient Transformers: A Survey
(Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler)
- Reformer: The Efficient Transformer
(Nikita Kitaev, Lukasz Kaiser, Anselm Levskaya)
- Rethinking Attention with Performers
(Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, Adrian Weller)
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
(Ofir Press, Noah A. Smith, Mike Lewis)
- Transformer-XL: Attentive Language Models beyond a Fixed-Length Context
(Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, Ruslan Salakhutdinov)
- ∞-former: Infinite Memory Transformer
(Pedro Henrique Martins, Zita Marinho, Andre Martins)
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
(Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré)
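Among the long-context papers in this block, ALiBi ("Train Short, Test Long") replaces positional embeddings with a per-head linear penalty on attention scores. The sketch below constructs such a bias matrix; the slope schedule follows the paper's choice for power-of-two head counts, while the causal mask itself (setting future positions to -inf) is left out for brevity.

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """ALiBi-style linear attention biases (illustrative sketch).

    Head h gets slope m_h = 2 ** (-8 * (h + 1) / num_heads); the score for a
    query at position i attending to a key at position j <= i is penalized
    by m_h * (i - j), so distant tokens contribute less.
    """
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    distance = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]   # i - j
    distance = np.tril(distance)                      # future positions get zero bias here
    return -slopes[:, None, None] * distance          # shape: (num_heads, seq_len, seq_len)

print(alibi_bias(seq_len=4, num_heads=2)[0])          # bias matrix for the first head
```

These biases are simply added to the attention scores before the softmax, which is what lets the model extrapolate to sequences longer than those seen in training.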
Lecture notes
- Efficient Methods for Natural Language Processing: A Survey
(Marcos Treviso, Ji-Ung Lee, Tianchu Ji, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Colin Raffel, Pedro H. Martins, André F. T. Martins, Jessica Zosa Forde, Peter Milder, Edwin Simpson, Noam Slonim, Jesse Dodge, Emma Strubell, Niranjan Balasubramanian, Leon Derczynski, Iryna Gurevych, Roy Schwartz)
- Efficient Transformers: A Survey
(Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler)