Skip to content

Use correct format for demonstration samples for evaluation and evolution

Max Kimmich requested to merge refactor-models into master

The goal is to use chat-style format for demonstration samples for both evaluation and evolution.

Todos:

  • demonstration data for evaluation
  • demonstration data for evolution
  • base prompt generation

Current results for prompt "Have your friend evaluate the movie they had just seen and provide a summary opinion (e.g. terrible, bad, okay, good, or great) to determine the sentiment of the movie review." on SST5 (dev set/test set) using 1 demonstration sample per class for comparison with reference implementation are:

  • AlpacaHfChat: "chavinlo/alpaca-native", no grammar: 56.5/50.7
  • HfChat: "meta-llama/Meta-Llama-3.1-8B-Instruct", no grammar: 53.5/56.24
  • LlamaChat: "QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF/Meta-Llama-3.1-8B-Instruct.Q8_0.gguf", no grammar: 56.5/55.7
  • LlamaChat: "MaziyarPanahi/Meta-Llama-3.1-70B-Instruct-GGUF/Meta-Llama-3.1-70B-Instruct.Q4_K_M.gguf", no grammar: 56.5/56.52

Results on SST5 for Alpaca 7b ("chavinlo/alpaca-native") from the original paper are 49.91 or 52.26 depending on the table one refers to (table 1 and table 14 respectively, not sure where the difference is (both should report scores on the test set)).

The same for AG's News with prompt "Assess the entire concept of the news story and choose from the World, Sports, Business or Tech categories to categorize it into the correct category.":

  • AlpacaHfChat: "chavinlo/alpaca-native", no grammar: 73.5/72.33
Edited by Max Kimmich

Merge request reports

Loading