
dataset cleaning, visualizations

Eric Wang 3 years ago
Parent
Commit
f7044049ab
3 changed files with 1344 additions and 4 deletions
  1. README.md (+12 −1)
  2. alpaca_data_cleaned.json (+1324 −0)
  3. finetune.py (+8 −3)

+ 12 - 1
README.md

@@ -40,9 +40,20 @@ PRs adapting this code to multi-GPU setups and larger models are always welcome.
 This file contains a script to convert the LoRA back into a standard PyTorch model checkpoint,
 which should help users who want to use the model with projects like [llama.cpp](https://github.com/ggerganov/llama.cpp).
 
+### Dataset
+
+In addition to `alpaca_data.json`, which contains the original Stanford Alpaca dataset,
+we also include `alpaca_data_cleaned.json`, which has been [stripped of various tokenization artifacts](https://github.com/tloen/alpaca-lora/pull/32)
+with the help of @gururise.
+This file is now used by default in the training script.
+
+@AndriyMulyar has also provided interactive, embedding-based visualizations of the original dataset's [instructions](https://atlas.nomic.ai/map/alpaca_instructions)
+and [outputs](https://atlas.nomic.ai/map/alpaca_outputs),
+as well as [clusters of bad examples](https://atlas.nomic.ai/map/d2139cc3-bc1c-441c-8d6f-3e6ffbbc2eda/838019ff-8fe2-42ba-809a-d86d2b98cd50/-18.11668742841587/-11.348087116836096/-20.88850316347706/-17.680468640801223/774455612).
+
 ### Notes
 
-- Before we try to tune the weights on 13B+ models, we should note (sorry Tatsu) that [the quality of the Stanford Alpaca dataset is not very good](https://github.com/tloen/alpaca-lora/pull/32). We can likely improve our model performance significantly if we combed through the data and fixed bad examples; in fact, dataset quality might be our bottleneck. _The most impactful contribution anyone can make to this project is to provide a way to systematically iterate on the training data._
+- We can likely improve our model performance significantly if we combed through the data and fixed bad examples; in fact, dataset quality might be our bottleneck.
 - We're continually fixing bugs and conducting training runs, and the weights on the Hugging Face Hub are being updated accordingly. In particular, those facing issues with response lengths should make sure that they have the latest version of the weights and code.
 

File diff suppressed because it is too large
+ 1324 - 0
alpaca_data_cleaned.json
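The cleaned file is too large to render here, but it keeps the usual Alpaca record schema: a JSON array of objects with `instruction`, `input`, and `output` string fields, so it is a drop-in replacement for `alpaca_data.json`. A minimal standard-library sketch of loading and sanity-checking it — the `looks_clean` check below is illustrative, not the actual cleaning rules applied in the linked PR:

```python
import json

def load_records(path):
    """Load an Alpaca-style dataset: a JSON array of dicts, each
    carrying instruction / input / output string fields."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    for rec in records:
        # Fail fast if a record is missing one of the expected fields.
        assert {"instruction", "input", "output"} <= rec.keys()
    return records

def looks_clean(rec):
    """Illustrative sanity check: flag leftover tokenization
    artifacts such as stray '</s>' or '<unk>' markers."""
    text = rec["instruction"] + rec["input"] + rec["output"]
    return "</s>" not in text and "<unk>" not in text
```

A filter like `[r for r in load_records(path) if looks_clean(r)]` would then drop flagged records before training.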


+ 8 - 3
finetune.py

@@ -23,13 +23,18 @@ from peft import (
 MICRO_BATCH_SIZE = 4  # this could actually be 5 but i like powers of 2
 BATCH_SIZE = 128
 GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE
-EPOCHS = 3  # we don't need 3 tbh
+EPOCHS = 3  # we don't always need 3 tbh
 LEARNING_RATE = 3e-4  # the Karpathy constant
 CUTOFF_LEN = 256  # 256 accounts for about 96% of the data
 LORA_R = 8
 LORA_ALPHA = 16
 LORA_DROPOUT = 0.05
 VAL_SET_SIZE = 2000
+TARGET_MODULES = [
+    "q_proj",
+    "v_proj",
+]
+DATA_PATH = "alpaca_data_cleaned.json"
 
 model = LlamaForCausalLM.from_pretrained(
     "decapoda-research/llama-7b-hf",
@@ -45,14 +50,14 @@ model = prepare_model_for_int8_training(model)
 config = LoraConfig(
     r=LORA_R,
     lora_alpha=LORA_ALPHA,
-    target_modules=["q_proj", "v_proj"],
+    target_modules=TARGET_MODULES,
     lora_dropout=LORA_DROPOUT,
     bias="none",
     task_type="CAUSAL_LM",
 )
 model = get_peft_model(model, config)
 tokenizer.pad_token_id = 0  # unk. we want this to be different from the eos token
-data = load_dataset("json", data_files="alpaca_data.json")
+data = load_dataset("json", data_files=DATA_PATH)
 
 train_val = data["train"].train_test_split(
     test_size=VAL_SET_SIZE, shuffle=True, seed=42
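The hyperparameters in this hunk fix an effective batch size via gradient accumulation, and LoRA with `r=8` on the `q_proj`/`v_proj` projections trains only a small adapter on top of the frozen base model. A back-of-the-envelope sketch of both — the 4096 hidden size and 32 layers are LLaMA-7B's published dimensions, assumed here for illustration:

```python
MICRO_BATCH_SIZE = 4
BATCH_SIZE = 128
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE  # 32 micro-steps

# Gradients accumulate over 32 micro-batches, so each optimizer
# step still sees the full effective batch of 128 examples.
assert MICRO_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS == BATCH_SIZE

# LoRA adapter size for q_proj / v_proj, assuming LLaMA-7B dimensions:
# each 4096x4096 projection gets two low-rank factors,
# A (r x d_in) and B (d_out x r).
LORA_R = 8
HIDDEN, LAYERS, TARGETS = 4096, 32, 2  # hidden size, layers, q_proj + v_proj
trainable = LAYERS * TARGETS * LORA_R * (HIDDEN + HIDDEN)
print(f"{trainable:,} trainable LoRA parameters")  # 4,194,304 (~4.2M)
```

Roughly 4.2M trainable parameters against ~6.7B frozen ones is what makes single-GPU int8 fine-tuning feasible here.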

Some files were not shown because too many files changed in this diff