Implemented demonstration data for evaluation: the CLI option -n-evaluation-demo n adds n samples per class to the prompt as demonstrations
This changes the LM's response format, which required new answer parsing
Currently only implemented for text classification tasks
Still needs more testing, especially the new answer parsing
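
For reviewers, a rough sketch of the idea for a text classification task: sample n demonstrations per class, prepend them to the prompt, and parse the label from the LM's continuation. This is an illustration, not the code in this change. The function names (`sample_demos`, `build_prompt`, `parse_answer`), the `Text:` / `Label:` prompt layout, and the assumption that the LM answers with the label on the first line are all hypothetical.

```python
import random
from collections import defaultdict


def sample_demos(examples, n_per_class, seed=0):
    """Pick up to n_per_class (text, label) pairs for each label (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    demos = []
    for label in sorted(by_label):
        pool = by_label[label]
        # Never ask for more samples than the class actually has.
        demos.extend(rng.sample(pool, min(n_per_class, len(pool))))
    # Shuffle so demonstrations are not grouped by class in the prompt.
    rng.shuffle(demos)
    return demos


def build_prompt(demos, query):
    """Assumed layout: one 'Text/Label' block per demo, then the open query."""
    blocks = [f"Text: {text}\nLabel: {label}" for text, label in demos]
    blocks.append(f"Text: {query}\nLabel:")
    return "\n\n".join(blocks)


def parse_answer(response, labels):
    """Map the raw LM continuation to a known label, or None if nothing matches."""
    stripped = response.strip()
    if not stripped:
        return None
    # With demonstrations the LM may keep generating further 'Text/Label'
    # blocks, so only the first line of the continuation is inspected.
    first_line = stripped.splitlines()[0].strip().lower()
    # Try longer labels first so a label is not shadowed by one of its prefixes.
    for label in sorted(labels, key=len, reverse=True):
        if first_line.startswith(label.lower()):
            return label
    return None
```

Returning `None` for unparseable responses keeps parsing failures visible in the evaluation instead of silently counting them as a class, which should help with the testing mentioned above.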