from Hacker News

Object Detection from 9 FPS to 650 FPS

by briggers on 10/10/20, 1:24 PM with 38 comments

by lostdog on 10/10/20, 5:55 PM
This is such a great post. It really shows how much room for improvement there is in all released deep learning code. Almost none of the open source work is really production-ready for fast inference, and tuning the systems requires a good working knowledge of the GPU.
The article does skip the most important step for getting great inference speeds: Drop Python and move fully into C++.
by t-vi on 10/10/20, 6:54 PM
> The solution to Python’s GIL bottleneck is not some trick, it is to stop using Python for data-path code.
At least for the PyTorch bits of it, using the PyTorch JIT works well. When you run PyTorch code through Python, the intermediate results are created as Python objects (with GIL and all), whereas when you run it in TorchScript, the intermediates exist only as C++ PyTorch tensors, all without the GIL. We have a small comment about it in our PyTorch book, in the section on what improvements to expect from the PyTorch JIT, and it seems rather relevant in practice.
by nraynaud on 10/10/20, 11:15 PM
How do you keep track of the shutter clock in this kind of system? For example, the camera clocks at 60 fps but the image processing runs a few frames late, the gyroscope clocks at 4 kHz, the accelerometer is way slower, lidar is a slug, etc. Then you have to feed all of that into your Kalman filter to estimate the state, and the central question is: “when did you collect this data?” I guess “no clue, it came in over USB and then disappeared into a GPU pipeline” is not a scientifically sound answer; you want to know whether it goes before or after sample no. 3864 of the gyroscope.
Long story short: that’s good, you’ve used a neural net to avoid using a human or an animal as a pose estimation datum, but how do you correlate that with the rest of the sensor suite?
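One common way to approach the question above is to stamp each frame at capture time, carry that stamp through the pipeline, and look up the matching gyroscope sample only when the result reaches the filter. The sketch below assumes a shared monotonic clock in integer microseconds and evenly spaced gyro samples; the names `gyro_sample_times_us` and `nearest_sample_index` are hypothetical, and this is not the article author's method.

```python
from bisect import bisect_left


def gyro_sample_times_us(t0_us, rate_hz, count):
    """Timestamps (integer microseconds) of evenly spaced gyro samples.

    Assumes the sample rate divides one second evenly (4 kHz -> 250 us).
    """
    period_us = 1_000_000 // rate_hz
    return [t0_us + i * period_us for i in range(count)]


def nearest_sample_index(sample_times_us, t_us):
    """Index of the sample whose timestamp is closest to t_us.

    sample_times_us must be sorted ascending. Ties go to the earlier sample.
    """
    i = bisect_left(sample_times_us, t_us)
    if i == 0:
        return 0
    if i == len(sample_times_us):
        return len(sample_times_us) - 1
    before = sample_times_us[i - 1]
    after = sample_times_us[i]
    return i if (after - t_us) < (t_us - before) else i - 1


if __name__ == "__main__":
    gyro_times = gyro_sample_times_us(t0_us=0, rate_hz=4000, count=5000)
    # A frame whose shutter fired 100 us after gyro sample 3864. The detection
    # may come out of the GPU several frames later, but it is fused at the
    # capture time, not at the arrival time.
    frame_capture_us = 3864 * 250 + 100
    print(nearest_sample_index(gyro_times, frame_capture_us))
```

The lookup only helps if the capture timestamp is trustworthy in the first place, which usually means taking it as close to the sensor as the hardware allows rather than when the frame arrives in user space.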
by NikolaeVarius on 10/10/20, 4:58 PM
I've been trying to coax better performance out of a Jetson Nano camera, currently using Python's OpenCV library with some threading, and can only manage about 29 fps at best.
I would love an alternative that is reasonably simple to implement. I dislike having to handle raw bits.
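For readers unfamiliar with the threading approach mentioned above, here is a minimal sketch of one common pattern: a background thread keeps reading frames and the consumer always takes the most recent one, so slow processing drops stale frames instead of queuing them. The class name `LatestFrameReader` is hypothetical, and `read_fn` is assumed to be any callable returning an `(ok, frame)` pair, which is the shape of OpenCV's `VideoCapture.read`. Whether this gets past 29 fps on a given board is not something the sketch can promise.

```python
import threading


class LatestFrameReader:
    """Reads frames on a background thread and keeps only the newest one."""

    def __init__(self, read_fn):
        self._read_fn = read_fn  # callable returning (ok, frame)
        self._lock = threading.Lock()
        self._frame = None
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()
        return self

    def _run(self):
        while not self._stopped.is_set():
            ok, frame = self._read_fn()
            if not ok:
                break  # source exhausted or failed
            with self._lock:
                self._frame = frame  # overwrite: older frames are dropped

    def latest(self):
        """Most recent frame read so far, or None if nothing has arrived."""
        with self._lock:
            return self._frame

    def join(self, timeout=None):
        """Wait for the reader thread to finish (source ended or stopped)."""
        self._thread.join(timeout)

    def stop(self):
        self._stopped.set()
        self._thread.join()


def make_fake_source(n):
    """Stand-in for a camera: yields the integers 0..n-1, then reports failure."""
    it = iter(range(n))

    def read():
        try:
            return True, next(it)
        except StopIteration:
            return False, None

    return read
```

With a real camera, `read_fn` would typically be the `read` method of an opened `cv2.VideoCapture`; the fake source is only there so the sketch runs without hardware.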
by vj44 on 10/11/20, 3:29 AM
Good job digging into all of this, Paul! At my company (onspecta.com) we solve similar problems (and more!) to accelerate AI/deep learning/computer vision workloads across CPUs, GPUs, and other types of chips.
This is a fascinating space, and there are tons of speed-up opportunities. Depending on the type of workload you're running, you might be able to ditch the GPU entirely and run everything on the CPU, greatly reducing cost and deployment complexity. Or, at the very least, improve SLAs and cut the GPU (or CPU) cost by 10x.
I've seen this over and over again. Glad someone's documenting this publicly :-) If any of you readers have more questions about this, I'm happy to discuss in the comments here. Or you can reach out to me at victor at onspecta dot com.
by spockz on 10/10/20, 8:44 PM
I think this is a great explanation. Are these kinds of manual optimisations still needed when using the higher-level frameworks? At the very least, those frameworks should make it clear in the types when a pipeline moves from CPU to GPU and vice versa.
by threatripper on 10/10/20, 8:31 PM
How would one accelerate object tracking on a video stream where each frame depends on the result of the previous one? Batching and multi-threading don't work here.
Are there any CNN libraries that have much less overhead for small batch sizes? TensorFlow (GPU accelerated) seems to drop from 10000 fps on large batches to 200 fps on single frames for a small CNN.
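One partial answer to the dependency problem above is pipelining: the tracking step itself stays strictly sequential, but the stateless work for frame N+1 (decode, resize, normalise) can overlap with tracking on frame N. The sketch below is a stdlib-only illustration with hypothetical names (`run_pipeline`, `preprocess`, `track_step`); with Python threads the overlap only materialises when the preprocessing releases the GIL, as native decoders and I/O typically do. It does not address the per-call framework overhead the comment also asks about.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def run_pipeline(frames, preprocess, track_step, initial_state, depth=2):
    """Overlap per-frame preprocessing with a sequential tracking step.

    preprocess(frame) must not depend on tracker state.
    track_step(state, item) returns the new state and runs strictly in order.
    depth bounds how many preprocessed frames may wait in the queue.
    """
    q = queue.Queue(maxsize=depth)

    def producer():
        try:
            for frame in frames:
                q.put(preprocess(frame))
        finally:
            # Always signal the end so the consumer cannot block forever,
            # even if preprocess raised.
            q.put(_DONE)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    state = initial_state
    states = []
    while True:
        item = q.get()
        if item is _DONE:
            break
        state = track_step(state, item)  # depends on the previous state
        states.append(state)
    worker.join()
    return states


if __name__ == "__main__":
    # Toy stand-ins: "preprocessing" doubles the value, "tracking" accumulates.
    print(run_pipeline(range(5), lambda x: x * 2, lambda s, x: s + x, 0))
```

The bounded queue keeps the producer from running arbitrarily far ahead, which matters for latency on a live stream.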
by O5vYtytb on 10/10/20, 4:31 PM
> The solution to Python’s GIL bottleneck is not some trick, it is to stop using Python for data-path code.
What about using PyTorch multiprocessing[1]?
[1] https://pytorch.org/docs/stable/notes/multiprocessing.html
by andrewbridger on 10/11/20, 12:57 AM
Has anyone looked at Julia? Its claim is C-like performance with the ease of use of a language like Python.
by mleonhard on 10/13/20, 11:36 AM
Has any company tried putting the GPU and CPU in the same chip, sharing the same data caches? That could greatly increase the performance of the CPU-GPU data transfers.
by egberts1 on 10/10/20, 5:41 PM
Try this one.
https://github.com/streamlit/demo-self-driving
It uses Streamlit
https://github.com/streamlit/streamlit