Refactor the models and the task so that the prompt format can be adapted to the model. (Sorry, the branch name is a bit misleading; I initially thought I would only refactor the model.)
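To illustrate the idea, here is a minimal sketch (class and method names are hypothetical, not the actual identifiers in this branch): each model exposes its own prompt template, and the task builds its prompt through that hook instead of hard-coding one format.

```python
from abc import ABC, abstractmethod


class Model(ABC):
    """Base class: each model defines how a raw instruction is wrapped
    into the prompt format it was trained on."""

    @abstractmethod
    def format_prompt(self, instruction: str, input_text: str) -> str:
        ...


class AlpacaModel(Model):
    """Alpaca-style instruction/input/response template."""

    def format_prompt(self, instruction: str, input_text: str) -> str:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )


class SentimentTask:
    """The task no longer hard-codes a prompt layout; it delegates
    the formatting to whichever model it is evaluated with."""

    instruction = (
        "Have your friend evaluate the movie they had just seen and provide a "
        "summary opinion (e.g. terrible, bad, okay, good, or great) to determine "
        "the sentiment of the movie review."
    )

    def build_prompt(self, model: Model, review: str) -> str:
        return model.format_prompt(self.instruction, review)
```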
There are still some things to do:
Current results on SST5 (dev set/test set), for comparison with the reference implementation, using the prompt "Have your friend evaluate the movie they had just seen and provide a summary opinion (e.g. terrible, bad, okay, good, or great) to determine the sentiment of the movie review.":
For reference, the original paper reports SST5 results for Alpaca 7b ("chavinlo/alpaca-native") of 49.91 (Table 1) or 52.26 (Table 14); I'm not sure where the difference comes from, since both tables should report scores on the test set.