Use correct format for demonstration samples for evaluation and evolution

Review changes
Download
Patches
Plain diff

Merged Use correct format for demonstration samples for evaluation and evolution

refactor-models into master

Overview 2
Commits 12
Changes 16

Merged Max Kimmich requested to merge refactor-models into master 5 months ago

Overview 2
Commits 12
Changes 16

The goal is to use chat-style format for demonstration samples for both evaluation and evolution.

Todos:

demonstration data for evaluation
demonstration data for evolution
base prompt generation

Current results for prompt "Have your friend evaluate the movie they had just seen and provide a summary opinion (e.g. terrible, bad, okay, good, or great) to determine the sentiment of the movie review." on SST5 (dev set/test set) using 1 demonstration sample per class for comparison with reference implementation are:

AlpacaHfChat: "chavinlo/alpaca-native", no grammar: 56.5/50.7
HfChat: "meta-llama/Meta-Llama-3.1-8B-Instruct", no grammar: 53.5/56.24
LlamaChat: "QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF/Meta-Llama-3.1-8B-Instruct.Q8_0.gguf", no grammar: 56.5/55.7
LlamaChat: "MaziyarPanahi/Meta-Llama-3.1-70B-Instruct-GGUF/Meta-Llama-3.1-70B-Instruct.Q4_K_M.gguf", no grammar: 56.5/56.52

Results on SST5 for Alpaca 7b ("chavinlo/alpaca-native") from the original paper are 49.91 or 52.26 depending on the table one refers to (table 1 and table 14 respectively, not sure where the difference is (both should report scores on the test set)).

The same for AG's News with prompt "Assess the entire concept of the news story and choose from the World, Sports, Business or Tech categories to categorize it into the correct category.":

AlpacaHfChat: "chavinlo/alpaca-native", no grammar: 73.5/72.33

Edited 5 months ago by Max Kimmich

Merge request reports

Activity

Filter activity

Approvals
Assignees & reviewers
Comments (from bots)
Comments (from users)
Commits & branches
Edits
Labels
Lock status
Mentions
Merge request status
Tracking

Please register or sign in to reply

Assignee

0 Reviewers

Request review from

Labels

None

Select labels

Manage project labels

Milestone

None

Time tracking

Participants