Use correct format for demonstration samples for evaluation and evolution
The goal is to use a chat-style format for demonstration samples in both evaluation and evolution (see the sketch below the todo list for what such a sample could look like).
Todos:
- demonstration data for evaluation
- demonstration data for evolution
- base prompt generation
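
As a minimal sketch (an illustration only, not the repo's exact schema or label wording), a chat-style demonstration sample would be a user/assistant turn pair instead of a single completion string:

```python
# Sketch only: one demonstration sample in chat-style format.
# Field names follow the common OpenAI/HF "role"/"content" convention;
# the example text and label are made-up placeholders.
demonstration_chat = [
    {"role": "user", "content": "A stunning film from start to finish."},
    {"role": "assistant", "content": "great"},
]
```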
Current results for the prompt "Have your friend evaluate the movie they had just seen and provide a summary opinion (e.g. terrible, bad, okay, good, or great) to determine the sentiment of the movie review." on SST5 (dev set/test set), using 1 demonstration sample per class, for comparison with the reference implementation:
- AlpacaHfChat: "chavinlo/alpaca-native", no grammar: 56.5/50.7
- HfChat: "meta-llama/Meta-Llama-3.1-8B-Instruct", no grammar: 53.5/56.24
- LlamaChat: "QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF/Meta-Llama-3.1-8B-Instruct.Q8_0.gguf", no grammar: 56.5/55.7
- LlamaChat: "MaziyarPanahi/Meta-Llama-3.1-70B-Instruct-GGUF/Meta-Llama-3.1-70B-Instruct.Q4_K_M.gguf", no grammar: 56.5/56.52
Results on SST5 for Alpaca 7B ("chavinlo/alpaca-native") from the original paper are 49.91 or 52.26, depending on the table one refers to (Table 1 and Table 14, respectively); it is unclear where the difference comes from, since both should report test-set scores.
The same for AG's News with prompt "Assess the entire concept of the news story and choose from the World, Sports, Business or Tech categories to categorize it into the correct category.":
- AlpacaHfChat: "chavinlo/alpaca-native", no grammar: 73.5/72.33
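
For context on how such a chat-style few-shot evaluation could be assembled, here is a hedged sketch (not the actual implementation in this repo): it builds a conversation for SST5 with one made-up demonstration per class and renders it with the Hugging Face chat template of the Llama 3.1 8B Instruct model listed above. Whether the task prompt belongs in the system message or the first user turn is an assumption, and the demonstration texts are placeholders.

```python
# Minimal sketch, not EvoPrompt's actual code: chat-style few-shot SST5 evaluation input.
from transformers import AutoTokenizer

PROMPT = (
    "Have your friend evaluate the movie they had just seen and provide a "
    "summary opinion (e.g. terrible, bad, okay, good, or great) to determine "
    "the sentiment of the movie review."
)

# Hypothetical placeholder demonstrations, one per SST5 class.
DEMONSTRATIONS = [
    ("An utter waste of two hours.", "terrible"),
    ("Mostly dull, with a few decent moments.", "bad"),
    ("Watchable, but nothing memorable.", "okay"),
    ("A solid, well-acted drama.", "good"),
    ("A stunning film from start to finish.", "great"),
]

def build_messages(test_input: str) -> list[dict]:
    # Assumption: the task prompt goes into the system message.
    messages = [{"role": "system", "content": PROMPT}]
    # Each demonstration becomes a user/assistant turn pair.
    for text, label in DEMONSTRATIONS:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": test_input})
    return messages

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
rendered = tokenizer.apply_chat_template(
    build_messages("The plot never quite comes together."),
    tokenize=False,
    add_generation_prompt=True,  # end with the assistant header so the model answers next
)
print(rendered)
```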
requested review from @griesshaber
assigned to @maximilian.kimmich
Comment on the diff that sets `temperature=1.2` for run-name generation:

    system_message=None,
    prompt=RUN_NAME_PROMPT,
    use_randomness=True,
    # a bit more randomness for the name is okay
    temperature=1.2,

The names did still repeat quite often.
Doesn't 1.2 look convincing?
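
(Side note, purely as an illustrative sketch and not the repo's code: if names keep repeating even at temperature 1.2, rejecting duplicates and resampling is one cheap option. `generate_name` below is a hypothetical stand-in for whatever LLM call produces the name.)

```python
# Sketch: resample run names until an unused one appears, with a numeric fallback.
def unique_run_name(generate_name, used: set[str], max_tries: int = 5) -> str:
    for _ in range(max_tries):
        name = generate_name()  # hypothetical helper wrapping the LLM call
        if name not in used:
            used.add(name)
            return name
    # Fall back to a numeric suffix if the model keeps repeating itself.
    name = f"{name}-{len(used)}"
    used.add(name)
    return name
```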
added 1 commit
- 49a22b31 - Fix not showing all inputs in ResponseEditor
mentioned in commit 5271cae4