by lappa on 12/13/24, 1:54 AM with 143 comments
by simonw on 12/15/24, 11:32 PM
Microsoft haven't officially released the weights yet, but there are unofficial GGUFs up on Hugging Face already. I tried this one: https://huggingface.co/matteogeniaccio/phi-4/tree/main
I got it working with my LLM tool like this:
llm install llm-gguf
llm gguf download-model https://huggingface.co/matteogeniaccio/phi-4/resolve/main/phi-4-Q4_K_M.gguf
llm chat -m gguf/phi-4-Q4_K_M
Here are some initial transcripts: https://gist.github.com/simonw/0235fd9f8c7809d0ae078495dd630...
More of my notes on Phi-4 here: https://simonwillison.net/2024/Dec/15/phi-4-technical-report...
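If you'd rather drive the same GGUF from Python instead of the CLI, llama-cpp-python can load it directly. A minimal sketch, assuming the file downloaded above is in the working directory (the prompt is just an illustration):

from llama_cpp import Llama

# Load the quantized GGUF; n_ctx sets the context window
llm = Llama(model_path="phi-4-Q4_K_M.gguf", n_ctx=4096)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain GGUF quantization in one paragraph"}]
)
print(response["choices"][0]["message"]["content"])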
by thot_experiment on 12/15/24, 11:46 PM
by xeckr on 12/13/24, 5:04 AM
How far are we from running a GPT-3/GPT-4 level LLM on regular consumer hardware, like a MacBook Pro?
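One rough way to frame the question: the RAM needed just to hold a quantized model's weights is about parameters x bits-per-weight / 8. A back-of-the-envelope sketch in Python (it ignores KV cache and runtime overhead, and the ~5 bits/weight figure is an approximation for Q4_K_M-style quantization):

def weight_ram_gb(n_params: float, bits_per_weight: float) -> float:
    # bytes = params * bits / 8; divide by 1e9 for GB
    return n_params * bits_per_weight / 8 / 1e9

print(weight_ram_gb(14e9, 5.0))  # Phi-4 14B at ~5 bits/weight: ~8.8 GB
print(weight_ram_gb(70e9, 5.0))  # a 70B model at ~5 bits/weight: ~44 GB

So a quantized 14B model already fits on a 16 GB MacBook Pro, while 70B-class models want 48-64 GB of unified memory.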
by excerionsforte on 12/16/24, 4:33 AM
by jsight on 12/16/24, 4:06 AM
I'm not sure how I can be impressed by a 14B Phi-4. That isn't really small any more, and I doubt it will be significantly better than Llama 3 or Mistral at this point. Maybe I'll be wrong, but I don't have high hopes.
by travisgriggs on 12/16/24, 1:45 AM
by mupuff1234 on 12/16/24, 8:36 AM
I wonder what will be next month's buzzphrase.
by zurfer on 12/16/24, 9:16 AM
The worst was the GPT-4o update in November: basically a two-liner on what it is better at, when in reality it regressed on multiple benchmarks.
Here we just get MMLU, which is widely known to be saturated, and since they trained on synthetic data we have no idea how much "weight" was given to MMLU-like training data.
Benchmarks are not perfect, but they give me context to build upon.
---
edit: the benchmarks are covered in the paper: https://arxiv.org/pdf/2412.08905
by PoignardAzur on 12/16/24, 2:11 PM
by ai_biden on 12/16/24, 6:13 AM
Microsoft Research just dropped Phi-4 14B, an open-source model that's turning heads. It claims to rival Llama 3.3 70B with a fraction of the parameters (5x fewer, to be exact).
What's the secret? Synthetic data: higher quality, less misinformation, more diversity.
The Phi models always have great benchmark scores, but they always disappoint me in real-world use cases.
The Phi series is famous for being trained on benchmarks.
I tried again with phi-4 through Ollama, but it's not satisfactory.
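For anyone wanting to reproduce this, and assuming the model is published under the phi4 tag in the Ollama library, the invocation is just:

ollama run phi4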
To me, at the moment, IFEval is the most important LLM benchmark.
But look at Microsoft's smart business strategy (a sketch of this recipe follows the list):
- have unlimited access to GPT-4
- prompt it to generate ~30B tokens
- train a 1B parameter model
- call it phi-1
- show benchmarks beating models 10x the size
- never release the data
- never detail how to generate the data (this time they described it, but only at a very high level)
- claim victory over small models
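Microsoft hasn't released the data or the generation pipeline, so the caricatured recipe above can only be guessed at. A hypothetical sketch of the general idea (the client, teacher model, seed topics, and prompt are all assumptions for illustration, not the actual Phi process):

from openai import OpenAI  # assumed teacher-model client, purely illustrative

client = OpenAI()
seed_topics = ["algebra word problems", "unit conversions", "code review dialogues"]

def synthesize(topic: str, n_examples: int = 10) -> str:
    # Ask a strong teacher model to write training examples for one seed topic.
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in; the Phi reports don't name the exact teacher
        messages=[{
            "role": "user",
            "content": f"Write {n_examples} diverse Q&A training examples about {topic}.",
        }],
    )
    return response.choices[0].message.content

# Scale the topic list up and loop until you have billions of tokens.
corpus = [synthesize(topic) for topic in seed_topics]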
by liminal on 12/16/24, 6:17 PM
by parmesean on 12/13/24, 2:18 AM