About Our Paper Club

Introduction

These are our notes for the LLM Paper Club (Asia Edition!) run out of the Latent Space Discord channel. If you're interested in attending, you can check out the event calendar here (opens in a new tab) for all Latent Space events.

Simple Rules

In short, each week we discuss an LLM paper. We'd appreciate it if you could read the paper beforehand so that when we work through it, the discussion is fruitful for everyone. That said, the club is very beginner-friendly.

For each paper, we will discuss the following questions:

  1. Big Idea: What is the big idea for this paper? Or the main point the authors are trying to get across?
  2. Relevance: Why does this research matter? How has it moved the field forward?
  3. Open Questions: What open-ended questions do you have after reading this piece?

At the end of each discussion, we choose the next paper based on a vote. Anyone can suggest a paper and notes will be uploaded after each session.

We've started recording some sessions. You can see the recordings here.

Timeline

Here are the papers we've covered or are currently covering:

  1. Attention Is All You Need: This is the paper that launched the LLM revolution by introducing the Transformer architecture, which relies entirely on attention. It uses an encoder-decoder architecture, in contrast to the decoder-only architectures common today, but it is an interesting read.
  2. Self-Rewarding Language Models: How we can improve model performance by having the model both generate outputs and evaluate them, providing its own reward signal for training.
  3. BERT: BERT is an encoder-only model that aims to encode rich contextual understanding of a sentence. It was released in 2018 and significantly improved the performance of Google's search engine.
  4. T5: T5 is a Transformer model trained on the C4 corpus, which the paper also introduces. The paper explores how contextual representations learned by language models through unsupervised pre-training can be transferred to specific downstream tasks.
  5. A Comprehensive Overview of Large Language Models: A guide to the major advancements and changes that have occurred in the language modelling space.
  6. GPT-2: Using GPT-2, this paper matched several state-of-the-art models across various benchmarks without custom architectures or dataset adjustments, relying solely on ample model capacity and data.
  7. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models: An introduction to the DeepSeekMoE model, which matches the performance of LLaMA2 7B with only about 40% of the computation.
  8. Improving Language Understanding by Generative Pre-Training: The original GPT-1 paper by Alec Radford et al., which showed how generative pre-trained models can be adapted to a variety of tasks without creating a task-specific architecture.
  9. Mixture of Depths: Training transformer models to dynamically route tokens at the layer level, reducing the computation required for attention.
  10. Mamba: An introduction to a new Transformer alternative using State Space Models. See the notes here (opens in a new tab).
  11. Medusa: How to train new language-modelling heads for speculative decoding, and how to distil a dataset from the original model to prevent model drift while training the new heads.
  12. Let's Think Dot by Dot (opens in a new tab): Do transformers really need chain-of-thought tokens, or are filler tokens all they need?
  13. Uncertainty Estimation and Quantification for LLMs: A Simple Supervised Approach: How can we estimate the uncertainty of an LLM's response using its hidden states directly as features?
  14. Monosemanticity: Using Sparse Autoencoders to extract and understand what individual transformer features mean.
  15. ORPO: This paper introduces a new fine-tuning method that modifies the loss function with an odds-ratio term based on the likelihoods of generated sequences. By doing so, it condenses RLHF and SFT into a single step while matching the original performance. Read the notes here (opens in a new tab).
  16. LoRA Learns Less and Forgets Less: A study of the performance difference between LoRA and full fine-tuning in terms of target-domain adaptation and retention of original-domain knowledge. Read the notes here (opens in a new tab).