by tanyongsheng on 3/22/24, 8:44 AM with 133 comments
by wruza on 3/22/24, 11:16 AM
You are an expert in Web Scraping, so you are capable of finding the information in HTML and labeling it accordingly. Please return the final result in JSON.
Data to scrape:
title: Name of the business
type: The nature of the business, e.g. Cafe, Coffee Shop, and many others
phone: The phone number of the business
address: Address of the business; can be a state, a country, or a full address
years_in_business: Number of years since the business started
hours: Business operating hours
rating: Rating of the business
reviews: Number of reviews on the business
price: Typical spending at the business
description: Extra information not already covered by any of the fields above
service_options: Array of shopping options offered by the business, for example in-store shopping, delivery, and many others. Each should be in the format -> option_name: true
is_operating: Whether the business is operating
HTML:
{html}
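For context, here is a minimal sketch of how a prompt like the one above could be sent to a model. It assumes Groq's Python client and an arbitrary model name; neither is necessarily what the article actually used, and the JSON parse will raise if the model's output isn't valid JSON.

    import json
    from groq import Groq

    PROMPT = "..."  # the full prompt shown above, ending with "HTML:"

    def scrape(html: str) -> dict:
        client = Groq()  # reads GROQ_API_KEY from the environment
        completion = client.chat.completions.create(
            model="mixtral-8x7b-32768",  # assumption; any capable model works
            messages=[{"role": "user", "content": PROMPT + "\n" + html}],
        )
        # The prompt asks for JSON, so parse the model's reply directly.
        return json.loads(completion.choices[0].message.content)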
by emporas on 3/22/24, 12:59 PM
Running on Groq's infrastructure in particular, it's blazing fast. In some examples I ran against Groq's API, the query completed in 70ms. Groq has released API libraries for Python and JavaScript; I wrote a simple Rust example of how to use the API here [1].
Groq's API reports how long it took to generate the tokens for each request. 70ms for a page of text is well over 100 times faster than GPT, and faster than every other capable model. Accounting for internet latency and whatever queueing exists, the user still receives the response within a second. But how fast would this model run locally? Fast enough to generate natural-language tokens, synthesize a voice, then listen and decode the user's next spoken request, all in real time.
With a technology like that, why not talk to internet services through plain APIs, with no web interface at all? Just functions exposed on the internet that take JSON as input, validate it, and send JSON back to the user (a sketch of the idea below). The same goes for every other interface and button around. Why press buttons on every electric appliance instead of just talking to the machine through a JSON schema? Why should users on an internet forum have to press the "add comment" button every time, instead of just saying "post it"? Pretty annoying, actually.
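As a rough illustration of that idea, here is a hypothetical JSON-in/JSON-out "post a comment" service using only the Python standard library. The endpoint, port, and field names are invented for the sketch:

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class CommentHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            try:
                req = json.loads(self.rfile.read(length))
                text = req["text"]  # validate the input schema
                assert isinstance(text, str) and text.strip()
                reply, status = {"ok": True, "posted": text}, 200
            except (KeyError, AssertionError, json.JSONDecodeError):
                reply, status = {"ok": False, "error": "expected {\"text\": ...}"}, 400
            data = json.dumps(reply).encode()
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    HTTPServer(("", 8000), CommentHandler).serve_forever()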
by imaurer on 3/22/24, 5:12 PM
I am collecting these approaches and tools here: https://github.com/imaurer/awesome-llm-json
by bambax on 3/22/24, 10:59 AM
Also, the Google SERP page is deterministic (it always has the same structure for the same kind of query), so it would probably be much more effective to use AI to write a parser once, then refine and reuse it.
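A sketch of what that generated-parser approach might look like, assuming BeautifulSoup. The selectors below are entirely hypothetical; real SERP markup differs and changes over time, which is exactly the part you'd refine by hand:

    from bs4 import BeautifulSoup

    def parse_serp(html: str) -> list[dict]:
        soup = BeautifulSoup(html, "html.parser")
        results = []
        for block in soup.select("div.result"):  # hypothetical selector
            link = block.select_one("a")
            title = block.select_one("h3")
            if link and title:
                results.append({
                    "title": title.get_text(strip=True),
                    "url": link.get("href"),
                })
        return results

Once the parser works, every subsequent page costs a few milliseconds of CPU instead of an LLM call.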
by tosh on 3/22/24, 10:30 AM
Scraping is quite complex by now (front-end JS, deep and irregular nesting, obfuscated HTML, …).
by crowdyriver on 3/22/24, 11:53 AM
Have you ever had to scrape multiple sites, each with wildly varying HTML?
by malux85 on 3/22/24, 10:10 AM
Impressive inference speed difference though