Syllabus
The syllabus is preliminary and subject to change.
Lecture notes
- Lecture notes
- Transformer code tutorial
- Attention is all you need
(Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin)
- Transformer: A Novel Neural Network Architecture for Language Understanding
(posted by Jakob Uszkoreit)
- Stanford cs224n lecture notes on self attention
(John Hewitt)
- The Annotated Transformer
(Austin Huang, Suraj Subramanian, Jonathan Sum, Khalid Almubarak, and Stella Biderman. Original by Sasha Rush)
- The Illustrated Transformer
(Jay Alammar)
Links (=optional)
- RASPy: Think like a Transformer
(Sasha Rush)
- Convolutional Sequence to Sequence Learning
(Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin)
- Layer Normalization
(Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton)
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting
(Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov)
- Improving Transformer Optimization Through Better Initialization
(Xiao Shi Huang, Felipe Perez, Jimmy Ba, Maksims Volkovs)
- A Mathematical Framework for Transformer Circuits
(Anthropic)
- Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
(Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, Ivan Titov)
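The Transformer materials above all build on the same core operation, scaled dot-product self-attention. The sketch below is a minimal NumPy illustration of that operation (not code from any of the linked tutorials); the function name, shapes, and toy inputs are assumptions chosen for readability.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Single-head scaled dot-product attention (illustrative sketch, not library code).

    Q, K: (seq_len, d_k); V: (seq_len, d_v); mask: optional boolean
    (seq_len, seq_len) array where True means "may attend".
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise similarity scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)        # hide masked positions from the softmax
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # attention-weighted sum of values

# Toy self-attention: 4 tokens, model dimension 8, queries = keys = values = token states.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)   # (4, 8)
```

Multi-head attention repeats this with separate learned projections per head and concatenates the results, as described in the papers and tutorials listed above.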
Lecture notes
- Lecture notes
- Measuring Massive Multitask Language Understanding
(Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt)
- MTEB: Massive Text Embedding Benchmark
(Niklas Muennighoff, Nouamane Tazi, Loïc Magne, Nils Reimers)
- AI and the Everything in the Whole Wide World Benchmark
(Deborah Raji, Emily Denton, Emily M. Bender, Alex Hanna, Amandalynne Paullada)
- Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data
(Emily M. Bender, Alexander Koller)
Lecture notes
Links (=optional)
- Semi-supervised Sequence Learning
(Andrew M. Dai, Quoc V. Le)
- Deep contextualized word representations
(Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer)
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
(Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov)
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
(Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov)
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
(Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning)
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
(Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu)
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
(Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut)
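Several of the pretraining papers above (BERT-style models such as RoBERTa and ALBERT) train with a masked language modeling objective. As a rough sketch of that objective, assuming plain Python lists of string tokens rather than any particular tokenizer or library API:

```python
import random

def mask_for_mlm(tokens, vocab, mask_token="[MASK]", p=0.15, seed=0):
    """BERT/RoBERTa-style masking, sketched on plain token lists.

    About 15% of positions are selected; a selected token becomes [MASK]
    80% of the time, a random vocabulary token 10% of the time, and stays
    unchanged 10% of the time. The model is trained to predict the original
    token only at the selected positions.
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < p:
            targets.append(tok)                 # this position contributes to the loss
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(mask_token)
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)
        else:
            corrupted.append(tok)
            targets.append(None)                # this position is ignored by the loss
    return corrupted, targets

sentence = "the cat sat on the mat".split()
print(mask_for_mlm(sentence, vocab=sentence))
```

RoBERTa's "dynamic masking" amounts to re-drawing this corruption every time a sequence is fed to the model instead of fixing the masks once during preprocessing.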
Lecture notes
- Lecture notes
- Prefix-Tuning: Optimizing Continuous Prompts for Generation
(Xiang Lisa Li, Percy Liang)
- AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning
(Yaqing Wang, Sahaj Agarwal, Subhabrata Mukherjee, Xiaodong Liu, Jing Gao, Ahmed Hassan Awadallah, Jianfeng Gao)
- LoRA: Low-Rank Adaptation of Large Language Models
(Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen)
- Adapter methods
(docs.adapterhub.ml)
Links (=optional)
- AdapterHub: A Framework for Adapting Transformers
(Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, Iryna Gurevych)
- Parameter-Efficient Transfer Learning for NLP
(Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, Sylvain Gelly)
- Simple, Scalable Adaptation for Neural Machine Translation
(Ankur Bapna, Naveen Arivazhagan, Orhan Firat)
- AdapterFusion: Non-Destructive Task Composition for Transfer Learning
(Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, Iryna Gurevych)
- Parameter-Efficient Tuning with Special Token Adaptation
(Xiaocong Yang, James Y. Huang, Wenxuan Zhou, Muhao Chen)
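The parameter-efficient methods listed for this lecture (adapters, prefix-tuning, LoRA) share one idea: keep the pretrained weights frozen and train only a small number of added parameters. Below is a minimal PyTorch sketch of a LoRA-style linear layer, written from the description in the LoRA paper above; the class name, rank, and scaling defaults are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (LoRA-style sketch).

    The pretrained weight W stays frozen; only the rank-r factors A and B are
    trained, so the effective weight is W + (alpha / r) * B @ A.
    """
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                      # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)   # down-projection
        self.B = nn.Parameter(torch.zeros(out_features, r))         # up-projection, starts at zero
        self.scaling = alpha / r

    def forward(self, x):
        # Base output plus the scaled low-rank correction.
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Only A and B receive gradients; the base weight does not.
layer = LoRALinear(64, 64)
print([name for name, p in layer.named_parameters() if p.requires_grad])  # ['A', 'B']
```

Because B starts at zero, the layer initially behaves exactly like the frozen pretrained layer, and the low-rank update can later be merged into W for inference.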
Lecture notes
Links (=optional)
- In-context Examples Selection for Machine Translation
(Sweta Agrawal, Chunting Zhou, Mike Lewis, Luke Zettlemoyer, Marjan Ghazvininejad)
- How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation
(Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, Hany Hassan Awadalla)
Lecture notes
Links (=optional)
- Scaling Instruction-Finetuned Language Models
(Google)
- Proximal Policy Optimization Algorithms
(John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov)
- LIMA: Less Is More for Alignment
(Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, Omer Levy)
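The RLHF-style pipelines behind the instruction-tuning papers above typically optimize the policy with PPO. As a minimal, hedged sketch of the clipped surrogate objective from the Schulman et al. paper listed here (policy term only; the value loss, entropy bonus, and KL penalty used in practice are omitted):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO policy objective (sketch of the surrogate loss only).

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and the old policy; advantages: advantage estimates.
    """
    ratio = torch.exp(logp_new - logp_old)                    # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()              # negated: we minimize this

# Toy numbers just to show the call.
logp_old = torch.log(torch.tensor([0.30, 0.60]))
logp_new = torch.log(torch.tensor([0.40, 0.50]))
advantages = torch.tensor([1.0, -0.5])
print(ppo_clip_loss(logp_new, logp_old, advantages))
```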
Lecture notes
- Do We Need to Create Big Datasets to Learn a Task?
(Swaroop Mishra, Bhavdeep Singh Sachdeva)
- Shortformer: Better Language Modeling using Shorter Inputs
(Ofir Press, Noah A. Smith, Mike Lewis)
- Active Learning for BERT: An Empirical Study
(Liat Ein-Dor, Alon Halfon, Ariel Gera, Eyal Shnarch, Lena Dankin, Leshem Choshen, Marina Danilevsky, Ranit Aharonov, Yoav Katz, Noam Slonim)
Lecture notes
- Lecture notes
- Efficient Transformers: A Survey
(Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler)
- Reformer: The Efficient Transformer
(Nikita Kitaev, Lukasz Kaiser, Anselm Levskaya)
- Rethinking Attention with Performers
(Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, Adrian Weller)
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
(Ofir Press, Noah A. Smith, Mike Lewis)
- Transformer-XL: Attentive Language Models beyond a Fixed-Length Context
(Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, Ruslan Salakhutdinov)
- ∞-former: Infinite Memory Transformer
(Pedro Henrique Martins, Zita Marinho, Andre Martins)
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
(Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré)
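Among the long-context papers in this block, ALiBi ("Train Short, Test Long") replaces positional embeddings with a per-head linear penalty on attention scores. The sketch below constructs such a bias matrix; the slope schedule follows the paper's choice for power-of-two head counts, while the causal mask itself (setting future positions to -inf) is left out for brevity.

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """ALiBi-style linear attention biases (illustrative sketch).

    Head h gets slope m_h = 2 ** (-8 * (h + 1) / num_heads); the score for a
    query at position i attending to a key at position j <= i is penalized
    by m_h * (i - j), so distant tokens contribute less.
    """
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    distance = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]   # i - j
    distance = np.tril(distance)                      # future positions get zero bias here
    return -slopes[:, None, None] * distance          # shape: (num_heads, seq_len, seq_len)

print(alibi_bias(seq_len=4, num_heads=2)[0])          # bias matrix for the first head
```

These biases are simply added to the attention scores before the softmax, which is what lets the model extrapolate to sequences longer than those seen in training.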
Lecture notes
- Efficient Methods for Natural Language Processing: A Survey
(Marcos Treviso, Ji-Ung Lee, Tianchu Ji, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Colin Raffel, Pedro H. Martins, André F. T. Martins, Jessica Zosa Forde, Peter Milder, Edwin Simpson, Noam Slonim, Jesse Dodge, Emma Strubell, Niranjan Balasubramanian, Leon Derczynski, Iryna Gurevych, Roy Schwartz)
- Efficient Transformers: A Survey
(Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler)