from Hacker News

Show HN: Use Purple LLaMA to test ChatGPT safeguards

by saqadri on 12/11/23, 3:31 PM with 1 comments

I spent some time this weekend playing with LLaMA Guard, a fine-tuned LLaMA-7B model by Meta that lets you add guardrails around generative AI. I recorded a quick demo showing what it does and how to use it.

The best part is that you can define your own “safety taxonomy” with it — custom policies for what is safe vs unsafe interactions between humans (prompts) and AI (responses).

I wanted to see how “safe” conversations with OpenAI’s ChatGPT were, so I ran a bunch of prompts (a mixture of innocuous and inappropriate) and asked LLaMA Guard to classify the interactions as safe/unsafe.

My key takeaways from the exercise: 1. OpenAI has done a good job of adding guardrails for its models. LLaMA Guard helped confirm this.

2. What makes this really cool is I may have a very specific set of policies I want to enforce ON TOP of the standard guardrails that a model ships with. LLaMA Guard makes this possible.

3. This kind of model chaining — passing responses from OpenAI models to LLaMA is becoming increasingly common, and I think we’ll have even more complex pipelines in the near future. It helped to have a consistent interface to store this multi-model pipeline as a config, especially because that same config also contains my safety taxonomy.

Try it out yourself:

GitHub: https://github.com/lastmile-ai/aiconfig/tree/main/cookbooks/LLaMA-Guard

Colab: https://colab.research.google.com/drive/1CfF0Bzzkd5VETmhsniksSpekpS-LKYtX

YouTube: <https://www.youtube.com/watch?v=XxggqoqIVdg>

Would love the community's feedback on the overall approach.

by saqadri on 12/11/23, 3:34 PM
Happy to answer any questions on the approach here! One thing I was slightly disappointed by was the instruction fine-tuning of LLaMA Guard was good for conversations, but not for declarative statements. So framing things as questions flagged the safeguards, but other styles of interactions didn't.
I wonder if it'll be better with LLaMA-13B instead of 7B.
Also link doesn't render nicely in the text above -- here it is: https://github.com/lastmile-ai/aiconfig/tree/main/cookbooks/...