Refactor tasks and models and fix format for various models

Max Kimmich requested to merge refactor-models into master

Refactor models and tasks so that the prompt format can be adapted depending on the model (sorry, the branch name is a bit misleading – I initially thought I would only refactor the models):

  • The model's chat format is respected when building prompts
  • Only parameters relevant to the respective model are used when computing the cache key (see the sketch below for both points)

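For illustration, a minimal sketch of what these two points could look like (class, method, and attribute names here are assumptions for illustration, not the actual code in this branch): the model applies its tokenizer's chat template when building a prompt, and the cache key is derived only from the parameters that actually influence that model's output.

```python
import hashlib
import json

from transformers import AutoTokenizer


class HfChatModel:
    """Hypothetical chat-model wrapper; all names are illustrative only."""

    # only these parameters feed into the cache key for this backend
    CACHE_KEY_PARAMS = ("model_name", "max_tokens", "temperature")

    def __init__(self, model_name, max_tokens=50, temperature=0.0, **other_options):
        self.model_name = model_name
        self.max_tokens = max_tokens
        self.temperature = temperature
        # options not listed in CACHE_KEY_PARAMS are accepted but do not affect caching
        self.other_options = other_options
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    def build_prompt(self, system_message, user_message):
        # respect the model's chat format instead of naive string concatenation
        messages = [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ]
        return self.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )

    def cache_key(self, prompt):
        # hash only the parameters that matter for this backend, plus the prompt
        relevant = {name: getattr(self, name) for name in self.CACHE_KEY_PARAMS}
        payload = json.dumps({"prompt": prompt, **relevant}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```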
There are still some things to do:

  • Test the llama-cpp backend with respect to the chat format
  • Make sure that all models behave similarly, i.e., that each has its own set of parameters

Current results for the prompt "Have your friend evaluate the movie they had just seen and provide a summary opinion (e.g. terrible, bad, okay, good, or great) to determine the sentiment of the movie review." on SST5 (dev set/test set), for comparison with the reference implementation:

  • AlpacaHfChat: "chavinlo/alpaca-native", no grammar: 52/49.23
  • HfChat: "meta-llama/Meta-Llama-3.1-8B-Instruct", no grammar: 54.5/52.44
  • LlamaChat: "QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF/Meta-Llama-3.1-8B-Instruct.Q8_0.gguf", no grammar: 55.5/53.48
  • LlamaChat: "TheBloke/Llama-2-70B-Chat-GGUF/llama-2-70b-chat.Q4_K_M.gguf", no grammar: 49/47.15
  • LlamaChat: "MaziyarPanahi/Meta-Llama-3.1-70B-Instruct-GGUF/Meta-Llama-3.1-70B-Instruct.Q4_K_M.gguf", no grammar: 57/53.53

Results on SST5 for Alpaca 7B ("chavinlo/alpaca-native") from the original paper are 49.91 or 52.26, depending on which table one refers to (Table 1 or Table 14, respectively); it is not clear where the difference comes from, since both should report scores on the test set.

Edited by Max Kimmich
