Refactor tasks and models and fix format for various models

Max Kimmich requested to merge refactor-models into master

Refactor models and tasks so that the prompt format can be adapted depending on the model (sorry, the branch name is a bit misleading – I initially thought I would only refactor the models):

  • The model's chat format is respected when building prompts
  • Only parameters relevant to the respective model are used when computing the cache key (see the sketch below for both points)

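For illustration, a minimal sketch of what these two points could look like (class, method, and attribute names here are assumptions for illustration, not the actual code in this branch): the model applies its tokenizer's chat template when building a prompt, and the cache key is derived only from the parameters that actually influence that model's output.

```python
import hashlib
import json

from transformers import AutoTokenizer


class HfChatModel:
    """Hypothetical chat-model wrapper; all names are illustrative only."""

    # only these parameters feed into the cache key for this backend
    CACHE_KEY_PARAMS = ("model_name", "max_tokens", "temperature")

    def __init__(self, model_name, max_tokens=50, temperature=0.0, **other_options):
        self.model_name = model_name
        self.max_tokens = max_tokens
        self.temperature = temperature
        # options not listed in CACHE_KEY_PARAMS are accepted but do not affect caching
        self.other_options = other_options
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    def build_prompt(self, system_message, user_message):
        # respect the model's chat format instead of naive string concatenation
        messages = [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ]
        return self.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )

    def cache_key(self, prompt):
        # hash only the parameters that matter for this backend, plus the prompt
        relevant = {name: getattr(self, name) for name in self.CACHE_KEY_PARAMS}
        payload = json.dumps({"prompt": prompt, **relevant}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```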
There are still some things to do:

  • Test the llama-cpp backend with respect to the chat format
  • Make sure that all models behave similarly, i.e., that each has its own set of parameters

Current results for the prompt "Have your friend evaluate the movie they had just seen and provide a summary opinion (e.g. terrible, bad, okay, good, or great) to determine the sentiment of the movie review." on SST5 (dev set/test set), for comparison with the reference implementation:

  • AlpacaHfChat: "chavinlo/alpaca-native", no grammar: 52/49.23
  • HfChat: "meta-llama/Meta-Llama-3.1-8B-Instruct", no grammar: 54.5/52.44
  • LlamaChat: "QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF/Meta-Llama-3.1-8B-Instruct.Q8_0.gguf", no grammar: 55.5/53.48
  • LlamaChat: "TheBloke/Llama-2-70B-Chat-GGUF/llama-2-70b-chat.Q4_K_M.gguf", no grammar: 49/47.15
  • LlamaChat: "MaziyarPanahi/Meta-Llama-3.1-70B-Instruct-GGUF/Meta-Llama-3.1-70B-Instruct.Q4_K_M.gguf", no grammar: 57/53.53

Results on SST5 for Alpaca 7B ("chavinlo/alpaca-native") from the original paper are 49.91 or 52.26, depending on which table one refers to (Table 1 or Table 14, respectively); it is not clear where the difference comes from, since both should report scores on the test set.

Edited by Max Kimmich
