by kolchinski on 2/4/24, 11:56 PM with 47 comments
Originally this was a Christmas present for my fiancée, who loves books but has an eye problem that makes it hard for her to read more than a few pages at a time. She mostly listens to audiobooks while following along with the paper book, but some books aren't available in audiobook or even e-book form, and all of the existing apps we tried were surprisingly bad at scanning paper books into audio — they make lots of mistakes, include footnotes and page numbers, etc., in a way that really degrades the experience.
Being an AI-oriented engineer by training, I had a crack at solving the problem myself, and was pleasantly surprised at how well the proof of concept worked. I then had some free time while shutting down my previous company (Mezli, YC W21), during which I polished the app into its current state.
The way it works:
On the front end, it's a SwiftUI app (mostly written by ChatGPT!) that consists mostly of a document scanner (VNDocumentCameraViewController) and a custom-built audio player.
The back end is more complex — book photos are first sent to an OCR API, then some custom code I wrote does a first pass at stitching together and correcting the results. Then, the corrected OCR results are sent to GPT-3.5-turbo for further post-processing and re-stitching together, and finally to a text-to-speech API for conversion to audio.
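To make the pipeline above concrete, here is a minimal sketch of what the first-pass stitching step and the overall flow could look like. It is not the app's actual code: the function names (`clean_text`, `stitch_pages`, `scan_to_speech`) are hypothetical, and the OCR, LLM, and TTS stages are passed in as plain callables rather than tied to any real API.

```python
import re
from typing import Callable, List

# A line consisting only of 1-4 digits is treated as a page number (assumption).
PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")


def clean_text(ocr_text: str) -> str:
    """Drop bare page-number lines and rejoin words hyphenated across lines."""
    lines = [ln.strip() for ln in ocr_text.splitlines()]
    lines = [ln for ln in lines if ln and not PAGE_NUMBER.match(ln)]
    text = "\n".join(lines)
    # "jum-\nped" -> "jumped". This also merges real compounds such as
    # "well-\nknown", which a later correction pass would have to repair.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return text.replace("\n", " ")


def stitch_pages(pages: List[str]) -> str:
    """Join pages before cleaning so hyphenation across a page break is fixed too."""
    return clean_text("\n".join(pages))


def scan_to_speech(
    page_images: List[str],
    ocr: Callable[[str], str],       # stand-in for the OCR API call
    llm_fix: Callable[[str], str],   # stand-in for the GPT post-processing call
    tts: Callable[[str], bytes],     # stand-in for the text-to-speech API call
) -> bytes:
    pages = [ocr(img) for img in page_images]
    draft = stitch_pages(pages)
    return tts(llm_fix(draft))
```

Keeping each stage behind a plain callable means the stitching logic can be exercised with fake OCR output, without touching any paid API.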
The hardest part of this process was actually getting the GPT calls right — I ended up writing a custom LLM eval framework for making sure the LLM wasn't making edits relative to the true text of the book.
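For a sense of what such an eval can check, here is a small sketch that compares the LLM output against a hand-verified reference transcription at the word level. The helper names (`word_edits`, `fidelity`) are made up for illustration, and this shows only one plausible metric, not the framework I actually built.

```python
import difflib
from typing import List, Tuple


def word_edits(reference: str, candidate: str) -> List[Tuple[str, str, str]]:
    """List every word-level change the candidate makes relative to the reference."""
    ref, cand = reference.split(), candidate.split()
    matcher = difflib.SequenceMatcher(a=ref, b=cand, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((tag, " ".join(ref[i1:i2]), " ".join(cand[j1:j2])))
    return edits


def fidelity(reference: str, candidate: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means the word sequences are identical."""
    ref, cand = reference.split(), candidate.split()
    return difflib.SequenceMatcher(a=ref, b=cand, autojunk=False).ratio()


if __name__ == "__main__":
    truth = "He said I ain't got no money"
    output = "He said I don't have any money"
    print(word_edits(truth, output))
    print(round(fidelity(truth, output), 3))
```

Running a check like this over a set of pages with known-good text gives a number to track across prompt changes, plus a concrete list of the edits to inspect.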
A few issues remain, which I'll work on fixing if the app gets a significant amount of traction, including:
1) It can take multiple minutes to get audio back from a scan, especially if it's on the longer side (10+ pages). I'll be able to bring this down by spinning up dedicated servers for the OCR and TTS back-end.
2) The LLM sometimes does TOO good of a job at correcting "mistakes" in book text. This issue crops up particularly often when an author deliberately uses improper grammar, e.g. in dialogue.
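One possible guard against the over-correction in issue 2, sketched here as an assumption rather than something the app does today: accept an LLM change only when it looks like an OCR typo fix (a one-for-one word swap with high character similarity) and otherwise keep the OCR text. The function names and the 0.7 threshold are illustrative.

```python
import difflib


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity of two words, in [0, 1]."""
    return difflib.SequenceMatcher(a=a, b=b, autojunk=False).ratio()


def conservative_merge(ocr_text: str, llm_text: str, min_similarity: float = 0.7) -> str:
    """Keep LLM edits that resemble typo fixes; revert everything else to the OCR text."""
    ocr, llm = ocr_text.split(), llm_text.split()
    matcher = difflib.SequenceMatcher(a=ocr, b=llm, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            # One-for-one swaps: judge each word pair on its own.
            for old, new in zip(ocr[i1:i2], llm[j1:j2]):
                out.append(new if char_similarity(old, new) >= min_similarity else old)
        else:
            # "equal", "delete", "insert", and uneven replaces all fall back
            # to the OCR words (for "insert" that slice is empty).
            out.extend(ocr[i1:i2])
    return " ".join(out)


if __name__ == "__main__":
    ocr = "I ain't got no rnoney, he said."
    llm = "I don't have any money, he said."
    print(conservative_merge(ocr, llm))
```

In this example the OCR slip "rnoney," is repaired while the deliberately ungrammatical dialogue is left alone. The heuristic is deliberately blunt: it would also block legitimate removals such as footnotes, and a short grammatical "fix" like "ain't" to "isn't" is similar enough character-wise to slip through, so it would need tuning against real pages.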
The app is priced at $9.99/month for up to 250 pages/month right now, which I estimate will just about cover the costs of API calls. I'll be bringing the price point down as the pricing of the required AI APIs comes down. There's also a 3-day free trial if you want to try it out.
If you do find this useful, or know somebody who might, I'd appreciate you giving it a try or letting them know! And please let me know if you have any feedback, including issues or feature requests.
by spacemanspiff01 on 2/7/24, 1:22 AM
1. Scanning the books to text.
2. Reading text to the user.
3. Having a good interface.
Number 1 seems to be where you put the most effort, along with 3.
I guess, at least for me, there are often digital copies of books, either in EPUB or Kindle format. When one is available, it should be used.
And if one is not available, wouldn't it make more sense to have a document-scanner-to-EPUB tool?
I guess I'm just thinking that it is relatively rare that you really need document scanning in order to get an audiobook. Since most of the cost seems to come from the document scanning side, it seems worthwhile to split them up.
It also seems like it would make sense to think of these as two separate products: specialized document scanning and audio generation. I can definitely see uses for one without the other.
by LeoNatan25 on 2/7/24, 11:27 AM
I’m sorry, but LOL. Not even a full book.
That has to be one of the most terrible business models. I guess it’s in line with most app subscription models these days, only much worse. And if the excuse is “well it costs me too much on Azure and the phone native APIs are not good enough”, perhaps the answer is “don’t do it then”. No thanks.
by broth on 2/6/24, 11:57 PM
by moritz64 on 2/7/24, 9:20 AM
All apps that I know of use the iOS built-in TTS (sounds awful, not as good as Siri). Then there is also Voice Dream Reader, and even with the paid premium voices it is still not pleasant to listen to. Siri-grade TTS or ElevenLabs would be pleasant enough, though.
by ummonk on 2/6/24, 11:47 PM
by ssttoo on 2/7/24, 9:37 AM
I recently read an Isaac Asimov book where he was describing a device that takes a book and acts it out for you. Made me think we’re probably pretty close.
by closetkantian on 2/5/24, 8:24 AM
by carbone_12 on 2/9/24, 5:50 AM
by rickcarlino on 2/7/24, 1:46 PM
Very excited to see all the cool things people publish once LLM pricing drops.
by aryamaan on 2/5/24, 7:32 AM
by blatherard on 2/7/24, 2:23 AM
by Gys on 2/6/24, 11:30 PM
by tamimio on 2/5/24, 12:24 AM
English book.
by quickthrower2 on 2/7/24, 9:50 AM