from Hacker News

Video Surveillance with YOLO+llava

by psychip on 10/8/24, 12:21 AM with 68 comments

by 01100011 on 10/8/24, 2:54 PM
If you're interested in DIY security+AI, check out Frigate NVR (https://frigate.video/), Scrypted (https://www.scrypted.app/), and Viseron (https://viseron.netlify.app/).
by yu3zhou4 on 10/8/24, 6:13 AM
Congrats! What hardware do you use to run the inference 24/7? I built a simpler version that runs on low-end hardware [0] to recognize whether there's a person on my parcel, so I know when someone has trespassed and can trigger a siren, lights, etc.
[0] https://github.com/jmaczan/yolov3-tiny-openvino
by pmontra on 10/8/24, 6:38 AM
This runs on a GeForce GTX 1060. A quick search says it draws 120 W. Maybe that's only the peak power consumption, but it's still a lot. Do commercial products, if there are any, consume that much power?
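To put the 120 W figure in perspective, here is a rough upper-bound calculation. It assumes the card draws its full 120 W around the clock, which is probably pessimistic if that number is only the peak. The function name is just for illustration.

```python
def energy_kwh(power_watts: float, hours: float) -> float:
    """Energy in kilowatt-hours for a constant power draw over a period."""
    return power_watts * hours / 1000


# Assumption: the GPU draws a constant 120 W, 24 hours a day.
daily_kwh = energy_kwh(120, 24)
yearly_kwh = energy_kwh(120, 24 * 365)

print(f"per day:  {daily_kwh:.2f} kWh")
print(f"per year: {yearly_kwh:.1f} kWh")
```

Multiply the yearly number by your local electricity price to get a running cost. Real consumption depends on how busy the GPU actually is.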
by rocauc on 10/8/24, 5:12 AM
A suggestion: I'd swap llava for Florence-2 for your open set text description. Florence-2 seems uniformly more descriptive in its outputs.
by xrd on 10/8/24, 11:53 AM
I'm confused about why you need yolo and llava. Can't you simply use yolo without a multimodal LLM? What does that add? You can use yolo to detect and grab screen coordinates on its own, right?
by vaylian on 10/8/24, 9:53 AM
Hello from the privacy crowd! Please use this responsibly. Tech can be a lot of fun and I encourage you to play around with things and I appreciate it when you push the boundaries of what is technically feasible. But please be mindful that surveillance tech can also be used to oppress people and infringe on their freedoms. Use tech for good!
by matrik on 10/8/24, 5:11 PM
MobileNetV3 and EfficientDet are other possible alternatives to YOLO. I was able to get higher than 1.5 FPS on a Raspberry Pi Zero 2 W, which draws 1 W on average. With an efficient queuing approach, one can eliminate all bottlenecks.
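The comment doesn't say which queuing scheme is meant. One common reading is a small bounded queue between capture and inference that drops the oldest frame when full, so a slow detector never stalls the camera loop. A minimal sketch under that assumption (`put_latest`, `run_pipeline`, and the `detect` callback are hypothetical names, not from any project in this thread):

```python
import queue
import threading


def put_latest(q: queue.Queue, item) -> None:
    """Put item into q; if q is full, discard the oldest entry to make room."""
    while True:
        try:
            q.put_nowait(item)
            return
        except queue.Full:
            try:
                q.get_nowait()  # drop the stalest frame
            except queue.Empty:
                pass  # the consumer emptied the queue in the meantime; retry


def run_pipeline(frames, detect, maxsize: int = 2) -> list:
    """Feed frames to a detector thread without ever blocking the producer."""
    q = queue.Queue(maxsize=maxsize)
    results = []
    stop = object()  # sentinel telling the worker to exit

    def worker():
        while True:
            item = q.get()
            if item is stop:
                break
            results.append(detect(item))

    t = threading.Thread(target=worker)
    t.start()
    for frame in frames:
        put_latest(q, frame)
    put_latest(q, stop)
    t.join()
    return results
```

The trade-off is that frames are skipped under load, which is usually acceptable for surveillance: you want the detector looking at the most recent frame rather than working through a backlog.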
by ferar on 10/8/24, 4:09 AM
Can you specify ideal hardware (camera, computer) to deploy the solution? Thanks
by doctorhandshake on 10/8/24, 10:38 AM
>> It calculates the center of every detection box, pinpoint on screen and gives 16px tolerance on all directions. Script tries to find closest object as fallback and creates a new object in memory in last resort. You can observe persistent objects in /elements folder
I’ve never implemented this kind of object persistence algo - is this a good approach? Seems naive but maybe that’s just because it’s simple.
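For anyone wanting to see the shape of it, the quoted description can be sketched roughly like this. This is not the project's actual code: the class names are made up, and `fallback_radius` is an assumption, since the description gives no distance limit for the "closest object" fallback.

```python
import math
from dataclasses import dataclass
from itertools import count


@dataclass
class TrackedObject:
    object_id: int
    center: tuple  # (x, y) in pixels


class CenterTracker:
    def __init__(self, tolerance: float = 16, fallback_radius: float = 64):
        self.tolerance = tolerance              # 16 px, from the quoted text
        self.fallback_radius = fallback_radius  # assumed, not from the source
        self.objects = []
        self._ids = count(1)

    def update(self, box) -> TrackedObject:
        """Match one detection box (x1, y1, x2, y2) to a persistent object."""
        x1, y1, x2, y2 = box
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2

        # 1. An object whose center is within the tolerance in both directions.
        for obj in self.objects:
            if (abs(obj.center[0] - cx) <= self.tolerance
                    and abs(obj.center[1] - cy) <= self.tolerance):
                obj.center = (cx, cy)
                return obj

        # 2. Fallback: the closest known object, if it is near enough.
        if self.objects:
            nearest = min(self.objects,
                          key=lambda o: math.dist(o.center, (cx, cy)))
            if math.dist(nearest.center, (cx, cy)) <= self.fallback_radius:
                nearest.center = (cx, cy)
                return nearest

        # 3. Last resort: create a new object.
        obj = TrackedObject(next(self._ids), (cx, cy))
        self.objects.append(obj)
        return obj
```

It is simple, and it has the usual weaknesses of nearest-center matching: two objects crossing paths can swap identities, and a fast mover can jump past the tolerance between frames. Trackers such as SORT add motion prediction and IoU-based assignment to handle those cases.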
by nikolayasdf123 on 10/8/24, 6:37 AM
How about llama3.2 vision? Would it perform better?
by _giorgio_ on 10/8/24, 2:53 AM
All I see, usually, is some AI YOLO algorithm applied to an offline video.
This is the first time that I've seen a "complete" setup. Any info to learn more about applying YOLO and similar models to real-time streams (whatever the format)?
by anshumankmr on 10/8/24, 1:28 PM
You could try Florence by Microsoft instead of YOLO and llava, though the results are not going to be as great. Florence will do the inference on CPU. This is just for fun.