from Hacker News

Ask HN: Do you know a good resource for a large data scraping job?

by hugo31370 on 2/9/12, 9:15 PM with 10 comments

My company, Easy Vino (easyvino.com), is gearing up for beta release and we need to populate our database with wine lists. The job consists of extracting information from wine lists (which we already have, usually as PDFs, HTML, or images) and putting it into our database.

We have a simple back office that connects to a wine API to search for wine info, and we need help inputting the data. I'd rather have the same person (or team) doing this, as the learning curve is significant.

Does anyone know a cheap resource for this type of task? Any help or reference is appreciated.

Thanks a lot!

  • by devs1010 on 2/9/12, 11:13 PM

    I'm not sure exactly what sort of answer you are expecting. Unless the data you want is in a standardized format (such as a standardized XML schema), any effort to extract data would require writing custom parsers for each set of data that has a different structure. Are you asking for advice on which technology stack to use for writing this, or are you looking for a pre-made tool that can extract this for you? There may be some tools that can "attempt" to do this without requiring you to write custom code, but I am not sure how effective they would be.
  • by ig1 on 2/10/12, 12:31 AM

    The typical way of doing this is to use Mechanical Turk. There are some third-party services (their names escape me) built on top of MTurk that add reliability.

    The typical way they do this is to have two different people enter the data and when there's a mismatch have a supervisor decide which is right.
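
    A rough sketch of that double-entry check, in Python. The function name `reconcile`, the record layout (row number mapped to a dict of fields), and the sample wines are all made up for illustration; this is not how any particular MTurk service does it, just the general idea of comparing two independent transcriptions and queuing mismatches for a supervisor.

```python
def _norm(record):
    """Normalize a record so trivial differences (case, extra spaces) don't count as mismatches."""
    return {field: " ".join(str(value).split()).casefold()
            for field, value in record.items()}


def reconcile(entry_a, entry_b):
    """Compare two independent transcriptions of the same wine list.

    Both arguments map a row number to a dict of fields (hypothetical layout).
    Returns (agreed, disputed): rows both typists agree on, and rows that
    need a supervisor's decision, stored as (version_a, version_b).
    """
    agreed = {}
    disputed = {}
    for row in sorted(set(entry_a) | set(entry_b)):
        a = entry_a.get(row)
        b = entry_b.get(row)
        if a is not None and b is not None and _norm(a) == _norm(b):
            agreed[row] = a
        else:
            # Mismatch, or one typist skipped the row entirely.
            disputed[row] = (a, b)
    return agreed, disputed


# Made-up sample data: two people typing in the same two-line wine list.
typist_a = {
    1: {"name": "Chateau Margaux", "vintage": "2005"},
    2: {"name": "Opus One", "vintage": "2007"},
}
typist_b = {
    1: {"name": "chateau  margaux ", "vintage": "2005"},
    2: {"name": "Opus One", "vintage": "2001"},
}

agreed, disputed = reconcile(typist_a, typist_b)
for row, (a, b) in disputed.items():
    print(f"row {row} needs review: {a} vs {b}")
```

    Only the disputed rows go to the supervisor, so the review workload scales with the error rate rather than with the size of the list.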

  • by polyfractal on 2/10/12, 4:03 AM

    You might have good luck just hiring some cheap virtual assistants to do this work for you. oDesk or Elance are pretty good for these types of administrative tasks.