from Hacker News

OpenAI GPT-4 vs. Groq Mistral-8x7B

by tanyongsheng on 3/22/24, 8:44 AM with 133 comments

  • by wruza on 3/22/24, 11:16 AM

    The prompt, for those interested. I find it pretty underspecified, but maybe that's the point. For example, "Business operating hours" could be expanded a little, because "Closed - Opens at XX" is still non-processable in both cases.

      You are an expert in Web Scraping, so you are capable to find the information in HTML and label them accordingly. Please return the final result in JSON.
    
      Data to scrape: 
      title: Name of the business
      type: The business nature like Cafe, Coffee Shop, many others
      phone: The phone number of the business
      address: Address of the business, can be a state, country or a full address
      years_in_business: Number of years since the business started
      hours: Business operating hours
      rating: Rating of the business
      reviews: Number of reviews on the business
      price: Typical spending on the business
      description: Extra information that is not mentioned yet in any of the data
      service_options: Array of shopping options from the business, for example, in store shopping, delivery and many others. It should be in format -> option_name: true
      is_operating: Whether the business is operating
      
      HTML: 
      {html}
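    To make the quoted template concrete: a minimal sketch of how the {html} placeholder might be filled and packaged as an OpenAI-style chat request. The template here is abbreviated (the full prompt is quoted above), the model name and the JSON response mode are assumptions, and no network call is made.

```python
# Hypothetical sketch: fill the {html} placeholder in the quoted prompt
# template and build a chat-completions request body. Abbreviated template;
# model name and response_format support are assumptions.
PROMPT_TEMPLATE = (
    "You are an expert in Web Scraping, so you are capable to find the "
    "information in HTML and label them accordingly. Please return the "
    "final result in JSON.\n\nHTML:\n{html}"
)

def build_payload(html: str, model: str = "mixtral-8x7b-32768") -> dict:
    """Return a chat-completions request body for this extraction prompt."""
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": PROMPT_TEMPLATE.format(html=html)}
        ],
        # Many providers offer a JSON output mode; name/availability varies.
        "response_format": {"type": "json_object"},
    }

payload = build_payload("<div class='title'>Joe's Cafe</div>")
```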
  • by feintruled on 3/22/24, 10:20 AM

    Brave new world, where our machines are sometimes wrong but by gum they are quick about it.
  • by RUnconcerned on 3/22/24, 10:18 AM

    Finally, something more offensive than parsing HTML with regular expressions: parsing HTML with LLMs.
  • by retrac98 on 3/22/24, 10:15 AM

    There are so many applications for LLMs where a perfect score is much more important than speed, because getting it wrong is so expensive, damaging, or time-consuming for an organisation to resolve.
  • by infecto on 3/22/24, 12:19 PM

    This test is interesting as a general high-level benchmark, but the way they extract data with an LLM is suboptimal, so I don't think the takeaway means much. You could extract this type of data with a low-end model like 8x7B with a high degree of accuracy.
  • by emporas on 3/22/24, 12:59 PM

    Mixtral works very well with JSON output in my personal experience. The GPT family is excellent of course, and I would bet Claude and Gemini are pretty good. Mixtral, however, is the smallest of these models and the most efficient.

    Especially running on Groq's infrastructure, it's blazing fast. In some examples I ran on Groq's API, the query was completed in 70ms. Groq has released API libraries for Python and JavaScript; I wrote a simple Rust example of how to use the API [1].

    Groq's API documents how long it takes to generate the tokens for each request. 70ms for a page of text is well over 100 times faster than GPT, and faster than every other capable model. Accounting for internet latency and whatever queueing might exist, the user receives the response in about a second. But how fast would this model run locally? Fast enough to generate natural-language tokens, synthesize a voice, listen, and decode the user's next spoken request, all in real time.

    With a technology like that, why not talk to internet services through bare APIs, with no web interface at all? Just functions exposed on the internet that take JSON as input, validate it, and send JSON back to the user. The same goes for every other interface and button: why press buttons on every electric appliance instead of just talking to the machine through a JSON schema? Why should users on an internet forum have to press the add-comment button every time, instead of just saying "post it"? Pretty annoying, actually.

    [1] https://github.com/pramatias/groq_test
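    The "functions exposed on the internet" idea above reduces each service to JSON in, validation, JSON out. A minimal stdlib-only sketch of one such handler; the field names and error shape are invented for illustration.

```python
import json

# Hypothetical JSON-only service: no HTML, no buttons. A request is a JSON
# string; the response is always a JSON string. Field names are invented.
REQUIRED_FIELDS = {"action": str, "body": str}

def handle_request(raw: str) -> str:
    """Validate a JSON request and answer with JSON (never a web page)."""
    try:
        req = json.loads(raw)
    except json.JSONDecodeError:
        return json.dumps({"ok": False, "error": "invalid JSON"})
    for field, typ in REQUIRED_FIELDS.items():
        if not isinstance(req.get(field), typ):
            return json.dumps({"ok": False,
                               "error": f"missing or bad field: {field}"})
    # A real service would dispatch on req["action"] here.
    return json.dumps({"ok": True, "echo": req["body"]})
```

    Saying "post it" to a forum would then be one speech-to-JSON step followed by a call like `handle_request('{"action": "post_comment", "body": "..."}')`.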

  • by imaurer on 3/22/24, 5:12 PM

    Groq will soon support function calling. At that point, you would want to describe your data specification and use function calling to do extraction. Tools such as Pydantic and Instructor are good starting points.

    I am collecting these approaches and tools here: https://github.com/imaurer/awesome-llm-json
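    Function-calling extraction works by handing the model a machine-readable data specification. Tools like Pydantic and Instructor generate a JSON Schema like the one below from a class definition; here it is written by hand as a stdlib-only sketch, with the field set taken from the prompt quoted in this thread and the tool name invented.

```python
# Hand-written JSON Schema for a function-calling "tool" definition.
# Pydantic/Instructor would derive this from a model class; the tool name
# and the nullable-type choices are assumptions for illustration.
business_schema = {
    "name": "extract_business",
    "description": "Extract business details from an HTML fragment.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "type": {"type": "string"},
            "phone": {"type": ["string", "null"]},
            "rating": {"type": ["number", "null"]},
            "reviews": {"type": ["integer", "null"]},
            "service_options": {
                "type": "object",
                "additionalProperties": {"type": "boolean"},
            },
            "is_operating": {"type": "boolean"},
        },
        "required": ["title", "type"],
    },
}
```

    The win over a free-form prompt is that optional fields are declared nullable up front, so the model has no reason to invent "N/A" strings.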

  • by bambax on 3/22/24, 10:59 AM

    Interesting post, but isn't the prompt missing? How do the LLMs generate the keys? It's likely the mistakes could be corrected with a better prompt or a post-processing check.

    Also, a Google SERP page is deterministic (it always has the same structure for the same kind of query), so it would probably be much more effective to use AI to write a parser once, then refine and reuse it.
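    The "write the parser once" approach amounts to a small deterministic extractor over a fixed structure. A stdlib-only sketch; the class names ("biz-title", "biz-rating") are invented, since a real SERP uses different (and obfuscated) ones.

```python
from html.parser import HTMLParser

# Deterministic extractor for a fixed, known HTML structure. An AI could
# write this once; it then runs in microseconds per page, no LLM needed.
class BizParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self._field = None  # which field the next text node belongs to
        self.data = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls == "biz-title":
            self._field = "title"
        elif cls == "biz-rating":
            self._field = "rating"

    def handle_data(self, text):
        if self._field and text.strip():
            self.data[self._field] = text.strip()
            self._field = None

parser = BizParser()
parser.feed('<div class="biz-title">Joe\'s Cafe</div>'
            '<span class="biz-rating">4.6</span>')
```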

  • by tosh on 3/22/24, 10:30 AM

    I initially thought the blog post was about scraping using screenshots and multi-modal LLMs.

    Scraping is quite complex by now (front-end JS, deep and irregular nesting, obfuscated html, …).

  • by crowdyriver on 3/22/24, 11:53 AM

    There are lots of comments here about how stupid it is to parse HTML using LLMs.

    Have you ever had to scrape multiple sites with wildly varying HTML?

  • by malux85 on 3/22/24, 10:10 AM

    Sorry to be nit-picky, but that's the essence of these benchmarks: Mistral putting "N/A" for not available is weird. In every use I have ever seen, N/A means not applicable, and they DON'T mean the same thing. I would expect null for not available and "N/A" for not applicable.
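    If the model can't be stopped from emitting "N/A", the distinction above can at least be repaired in a post-processing pass that maps missing-value sentinels back to JSON null (Python None). The sentinel list here is an assumption.

```python
# Normalize sentinel strings an LLM emits for missing values to None.
# Which strings count as sentinels is an assumption for illustration.
MISSING_SENTINELS = {"n/a", "na", "not available", "none", ""}

def normalize(record: dict) -> dict:
    """Replace 'N/A'-style placeholders with None, recursing into dicts."""
    out = {}
    for key, value in record.items():
        if isinstance(value, str) and value.strip().lower() in MISSING_SENTINELS:
            out[key] = None
        elif isinstance(value, dict):
            out[key] = normalize(value)
        else:
            out[key] = value
    return out
```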

    Impressive inference speed difference though

  • by huqedato on 3/22/24, 10:29 AM

    Can somebody explain why this Groq is more performant than Microsoft's infrastructure? Is an LPU better than a TPU/GPU?
  • by ttrrooppeerr on 3/22/24, 10:17 AM

    A bit off-topic, but maybe not? Any word on GPT-5? Is it coming? Or is OpenAI just focusing on the Sora model?
  • by dns_snek on 3/22/24, 11:12 AM

    For all the posturing and crypto hate on HN, we're entering a world where it's socially acceptable to use 1000W of computing power and 5 seconds of inference time to parse a tiny HTML fragment which would take microseconds with traditional methods - and people are cheering about it. Time for some self-reflection? That's not very green.