by romain_g on 10/16/13, 11:55 AM with 104 comments
by nailer on 10/16/13, 2:18 PM
It cost, IIRC, a tenth of a cent per image URL. Rather than being based on skin tone, it was built on algorithms that specifically identify labia, anuses, penises, etc. REST API: send a URL, get back a yes/no/maybe. You decided what to do with the maybes.
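For illustration, a rough sketch of what calling that kind of yes/no/maybe endpoint might look like (the URL, parameter names, and response field here are invented, not pifilter's actual API):

```python
import requests

def check_image(image_url, api_key):
    """Ask a pifilter-style service whether an image URL is adult content."""
    resp = requests.get(
        "https://api.example-filter.com/v1/classify",  # placeholder endpoint
        params={"url": image_url, "key": api_key},
        timeout=10,
    )
    resp.raise_for_status()
    # "yes" / "no" / "maybe" -- the caller decides what to do with the maybes
    return resp.json()["verdict"]
```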
My experience:
- Before launch, I tested it with 4chan's /b/ as a feed, and was able to produce a mostly clean version of /b/, with the exception of cartoon imagery.
- It caught most of the stuff people tried to post to the site. Small-breasted women (breasts being considered 'adult' in the US) were the only thing that would get through, and that wasn't a huge concern. Completely unmaintained pubic hair (as revealing as a black bikini) would also get through.
- Since people didn't know what I was testing with, they didn't work around it (so nobody tried posting drawings or cartoons), but I imagine, e.g., a photo of a prolapse might not trigger the anus detection, as the shape would be too different.
- pifilter erred on the side of false negatives, but there was one notable false positive: a pastrami sandwich.
by Theodores on 10/16/13, 1:41 PM
Notionally the oscilloscope was there to show that the luminance and chroma in the signal were okay (i.e. that it could be broadcast over the airwaves, PAL/NTSC, and look as intended at the other end); however, porn, and anything likely to be porn, had a distinctive pattern on the oscilloscope screen. If porn was suspected, the source material would obviously be patched through to a monitor 'just in case'.
Note that the oscilloscope was analog and that the image would be changing 25/30 times a second. Also, back then there were not so many false positives on broadcast TV, e.g. the kind of pop videos that today's audience deems artful rather than porn.
If I had to solve the problem programmatically I would find a retired broadcast engineer and start from there, with what can be learned from a 'scope.
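For what it's worth, a crude digital stand-in for the 'scope idea is to look at where an image's chroma sits. A minimal sketch with Pillow, using commonly cited YCbCr skin-tone ranges (the thresholds and the 30% cutoff are illustrative, not calibrated):

```python
from PIL import Image

def skin_chroma_fraction(path, cb_range=(77, 127), cr_range=(133, 173)):
    """Fraction of pixels whose Cb/Cr chroma falls inside a typical skin-tone window."""
    img = Image.open(path).convert("YCbCr")
    pixels = list(img.getdata())
    skin = sum(
        1 for _, cb, cr in pixels
        if cb_range[0] <= cb <= cb_range[1] and cr_range[0] <= cr <= cr_range[1]
    )
    return skin / len(pixels)

# Flag frames/images dominated by skin-tone chroma for human review:
# suspicious = skin_chroma_fraction("frame.jpg") > 0.3
```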
by adorable on 10/16/13, 2:10 PM
I found out that no single technique works great. If you want an efficient algorithm, you probably have to blend different ideas and compute a "nudity score" for each image. That's at least what I do.
I'd be happy to discuss how it works. Here are a few techniques used:
- color recognition (as discussed in other comments)
- haar-wavelets to detect specific shapes (that's what Facebook and others use to detect faces for example)
- texture recognition (skin and wood may have the same colors but not the same texture)
- shape/contour recognition (machine learning of course)
- matching with a growing database of NSFW images
The algorithm is open for testing here: http://sightengine.com. It works OK right now, but once version 2 is out it should really be great.
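As a sketch of the blending itself (the detector functions, weights, and 0.6 cutoff below are made-up placeholders, not sightengine's internals):

```python
def nudity_score(image, detectors):
    """Blend per-technique scores (each 0..1) into a single weighted score.

    `detectors` maps a technique name to (score_fn, weight); the detector
    functions and weights shown below are placeholders.
    """
    total = sum(weight for _, weight in detectors.values())
    return sum(fn(image) * weight for fn, weight in detectors.values()) / total

# Example wiring (all score functions hypothetical):
# detectors = {
#     "skin_color":  (skin_color_score, 0.3),
#     "haar_shapes": (haar_shape_score, 0.2),
#     "texture":     (texture_score,    0.2),
#     "contours":    (contour_score,    0.2),
#     "known_nsfw":  (hash_match_score, 0.1),
# }
# flagged = nudity_score(img, detectors) > 0.6
```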
by asolove on 10/16/13, 12:56 PM
Source: I helped implement a Mechanical Turk (MT) job to filter adult content for a large hosting company.
by ma2rten on 10/16/13, 3:03 PM
I used the so-called Bag of Visual Words approach, which at that time was the state of the art in image recognition (now it's neural networks). You can read about it on Wikipedia. The only major change from the standard approach (SIFT + k-means + histograms + SVM + chi2 kernel) was that I used a version of SIFT that uses color features. In addition to this I used a second machine-learning classifier based on the context of the picture: who posted it? Is it a new user? What are the words in the title? How many views does the picture have?...
In combination the two classifiers worked nearly flawlessly.
Shortly after that, Chatroulette was having its porn problem and it was in the media that the founder was working on a porn filter. I sent an email to offer my help, but didn't get a reaction.
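A rough sketch of that bag-of-visual-words pipeline with OpenCV and scikit-learn, using plain SIFT rather than the colour variant mentioned above (vocabulary size and classifier settings are illustrative):

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def bovw_histograms(image_paths, n_words=500):
    """SIFT descriptors -> k-means vocabulary -> one visual-word histogram per image."""
    sift = cv2.SIFT_create()
    per_image = []
    for path in image_paths:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(gray, None)
        per_image.append(desc if desc is not None else np.empty((0, 128), np.float32))
    vocab = KMeans(n_clusters=n_words).fit(np.vstack(per_image))
    hists = []
    for desc in per_image:
        words = vocab.predict(desc) if len(desc) else np.array([], int)
        hist, _ = np.histogram(words, bins=np.arange(n_words + 1))
        hists.append(hist / max(hist.sum(), 1))   # normalise to frequencies
    return np.array(hists)

# Chi-squared-kernel SVM on the histograms (labels: 1 = porn, 0 = clean):
# X = bovw_histograms(train_paths)
# clf = SVC(kernel=chi2_kernel).fit(X, train_labels)
```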
by VLM on 10/16/13, 1:23 PM
puritanweirdos.example.com with no skin showing between toes and top of turtleneck (edited to add no pokies either)
and
normalpeople.example.com with 99% of the human race
The best solution to a problem involving computers is sometimes computer-related, but sometimes it's social. The puritans are never going to get along with the normal people anyway, so it's not like sharding them is going to hurt.
Another way to hack the system is not to hire or accept holier-than-thou puritans: personality doesn't mesh with the team, doesn't fit the culture, etc. You have to draw the line somewhere, and weirdos at either end should get cut, so no CP or animals at one extreme, and no holy rollers at the other.
The final social hack: it's kind of like dealing with bullies via appeasement. They're blocking reasonable stuff today; tomorrow they want to block all women not wearing burqas, or depictions of women damaging their ovaries by driving. Appeasing bullies never really works in the long run, so why bother starting? "If you claim not to like it, or at least enjoy telling everyone else repeatedly how you claim not to like it, stop looking at it so much, case closed."
by _mulder_ on 10/16/13, 1:53 PM
Develop a bot to trawl NSFW sites and hash each image (combined with the 'skin detecting' algorithms detailed previously). Then compare the user-uploaded image's hash with those in the NSFW database.
This technique relies on the assumption that NSFW images that are spammed onto social media sites will use images that already exist on NSFW sites (or are very similar to). Then it simply becomes a case of pattern recognition, much like SoundHound for audio, or Google Image search.
It wouldn't reliably detect 'original' NSFW material, but given enough cock shots as source material, it could probably find a common pattern over time.
edit: I've just noticed rfusca in the OP suggests a similar method
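A minimal sketch of the hashing side, assuming a perceptual hash (dHash here) rather than an exact hash, so near-duplicates and re-encodes still match (the 64-bit size and the 5-bit distance cutoff are illustrative):

```python
from PIL import Image

def dhash(path, size=8):
    """Difference hash: shrink, greyscale, compare adjacent pixels -> 64-bit fingerprint."""
    img = Image.open(path).convert("L").resize((size + 1, size))
    px = list(img.getdata())
    bits = 0
    for row in range(size):
        for col in range(size):
            left = px[row * (size + 1) + col]
            right = px[row * (size + 1) + col + 1]
            bits = (bits << 1) | (left > right)
    return bits

def hamming(a, b):
    return bin(a ^ b).count("1")

# An upload is suspicious if it's within a few bits of anything in the crawled NSFW set:
# flagged = any(hamming(dhash(upload), h) <= 5 for h in nsfw_hashes)
```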
by mixmax on 10/16/13, 12:53 PM
Detecting smurf-porn(1) (yes that's a thing...) is even harder since all the actors are blue.
http://pinporngifs.blogspot.dk/2012/09/smurfs-porn.html?zx=7... - obviously very NSFW, but quite funny.
by eksith on 10/16/13, 12:45 PM
Edit: No shortage of stock image reviewer jobs https://google.com/search?hl=en&q=%22image%20reviewer%22
I'm trying to find an interview with one of these people describing what it's like on the other end. It wasn't a pleasant story. These folks are employed by the likes of Facebook, Photobucket, etc. Most are outsourced, obviously, and they all have very high turnover.
by VLM on 10/16/13, 1:29 PM
If you're trying for "must not offend any human being on the planet" then you've got an AI problem that exceeds even my own human intelligence. Especially when it extends past pr0n and into stuff like satire: is that just some dude's weird self-portrait, or a satire of the prophet, and are you qualified to figure it out?
by betterunix on 10/16/13, 1:36 PM
The classic problem with filtering pornography is separating it from information about human bodies. I suspect that doing this with images will be even harder than doing it with text.
by quarterto on 10/16/13, 12:37 PM
by nathanb on 10/16/13, 2:43 PM
We as humans can readily classify images into three vague categories: clean, questionable, and pornographic. The problem of classification is not only one of determining which bucket an image falls into but also one of determining where the boundaries between buckets are. Is a topless woman pornographic? A topless man? A painting of a topless woman created centuries ago by a well-recognized artist? A painting of a topless woman done yesterday by a relatively unknown artist? An infant being bathed? A woman breastfeeding her baby? Reasonable people may disagree on which bucket these examples fall in.
So what if I create three filter sets: restrictive, moderate, and permissive, and then categorize 1,000 sample images into one of those three buckets for each filter set (restrictive could be equal to moderate, but filter questionable images as well as pornographic ones)?
Assuming that the learning algorithm was programmed to look at a sufficiently large number of image attributes, this approach should easily be capable of creating the most robust (and learning!) filter to date.
Has anyone done this?
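As a sketch of how the labelled buckets could drive the filter sets (the feature extraction and the RandomForest classifier are placeholders, not a recommendation):

```python
from sklearn.ensemble import RandomForestClassifier

# Human-assigned labels per sample image: 0 = clean, 1 = questionable, 2 = pornographic
BLOCKED = {
    "moderate":    {2},      # block only pornographic
    "restrictive": {1, 2},   # also block questionable, as described above
}

def train_filter(features, labels, level="moderate"):
    """Fit one 3-class model on the labelled sample images, return a per-level filter.

    `features` is an (n_images, n_attributes) array of whatever image attributes
    the learning algorithm looks at; RandomForest is just a placeholder classifier.
    """
    model = RandomForestClassifier(n_estimators=200).fit(features, labels)
    return lambda feat: model.predict([feat])[0] in BLOCKED[level]
```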
by Houshalter on 10/16/13, 8:47 PM
>There are already a few image based search engines as well as face recognition stuff available so I am assuming it wouldn't be rocket science and it could be done.
Just do a reverse image search for the image, see if it comes up on any porn sites or is associated with porn words.
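As a sketch, with the actual reverse-image-search call left as a placeholder (the domain and term lists are illustrative seeds, not real data):

```python
from urllib.parse import urlparse

NSFW_DOMAINS = {"example-tube.com", "example-porn.net"}   # seed list, illustrative only
NSFW_TERMS = {"porn", "xxx", "nsfw", "nude"}

def looks_like_porn(image_path, reverse_image_search):
    """`reverse_image_search` stands in for whatever search API you use;
    assume it returns (page_url, page_title) pairs for visually similar hits."""
    for url, title in reverse_image_search(image_path):
        domain = urlparse(url).netloc.lower()
        text = (url + " " + title).lower()
        if domain in NSFW_DOMAINS or any(term in text for term in NSFW_TERMS):
            return True
    return False
```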
by lectrick on 10/16/13, 7:38 PM
http://en.wikipedia.org/wiki/I_know_it_when_I_see_it
Basically, it's impossible to completely accurately identify pornography without a human actor in the mix, due to the subjectivity... and especially considering that not all nudity is pornographic.
by primaryobjects on 10/16/13, 2:07 PM
Take a look at the scores for classifying dogs vs. cats with 97% accuracy: http://www.kaggle.com/c/dogs-vs-cats/leaderboard. You could use a technique of digitizing the image pixels and feeding them to a learning algorithm, similar to http://www.primaryobjects.com/CMS/Article154.aspx.
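A minimal sketch of the "digitize the pixels and feed a learner" idea with Pillow and scikit-learn (the 32x32 size and logistic regression are arbitrary choices):

```python
import numpy as np
from PIL import Image
from sklearn.linear_model import LogisticRegression

def pixel_features(path, size=(32, 32)):
    """Digitize an image into a flat, normalised pixel vector."""
    img = Image.open(path).convert("RGB").resize(size)
    return np.asarray(img, dtype=np.float32).ravel() / 255.0

# labels: 1 = NSFW, 0 = clean, assigned by hand for the training set
# X = np.array([pixel_features(p) for p in train_paths])
# clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
# is_nsfw = clf.predict([pixel_features("upload.jpg")])[0] == 1
```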
by denzil_correa on 10/16/13, 9:05 PM
[0] Shih, J. L., Lee, C. H., & Yang, C. S. (2007). An adult image identification system employing image retrieval technique. Pattern Recognition Letters, 28(16), 2367-2374.
http://sjl.csie.chu.edu.tw/sjl/albums/userpics/10001/An_adul...
by jmngomes on 10/16/13, 1:19 PM
by racbart on 10/16/13, 1:43 PM
Nudity != porn and certainly half-nudity != porn.
I'd rather go for pattern recognition. There's a lot of image recognition software these days that can distinguish the Eiffel Tower from the Statue of Liberty, and it might be useful to detect certain body parts and certain body configurations (for those shots that don't contain any private body parts but do show two bodies in an unambiguous configuration).
by hugofirth on 10/16/13, 1:23 PM
If you assume that porn tends to cluster, rather than exist in isolation, then crawling the other images on the source pages and applying computer vision techniques should allow you to block pages that score above a threshold number of positive results (thus accounting for inaccuracy and false positives).
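Something like this, as a sketch (the single-image scorer is whatever classifier you already have, and the cutoffs are illustrative):

```python
def page_is_nsfw(image_urls, score_image, threshold=0.4, min_images=3):
    """Score every image found on the source page (score_image returns 0..1) and
    block the whole page if the fraction of positives clears a threshold."""
    if len(image_urls) < min_images:
        return False                      # too little evidence to judge the page
    positives = sum(1 for url in image_urls if score_image(url) > 0.5)
    return positives / len(image_urls) >= threshold
```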
by ismaelc on 10/16/13, 2:47 PM
by unoti on 10/16/13, 6:42 PM
by beat on 10/18/13, 4:46 AM
Depending on the site, I'd go with a trust-based solution. New users get their images approved by a human censor (pr0n == spambot in most cases). Established users can add images without approval.
If you're going to try software, try something that errs on the side of caution, and send everything to a human for final decision-making, just like spam filters.
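A sketch of that routing logic (the user attribute, the score function, and the 0.1 cutoff are made up for illustration):

```python
def route_upload(user, image, nsfw_score):
    """Trust-based routing: new accounts and anything the filter isn't sure about
    go to a human reviewer; only trusted users with clearly clean images skip
    the queue."""
    if not user.is_established:
        return "human_review"             # new accounts always get checked
    if nsfw_score(image) > 0.1:           # err on the side of caution
        return "human_review"
    return "publish"
```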
by npatten on 10/16/13, 7:00 PM
hilarious!
by hcarvalhoalves on 10/16/13, 5:28 PM
Maybe a good approach is an image lookup: try to find the image on the web and see if it appears on a porn site or in a pornographic context.
by jcfiala on 10/16/13, 3:01 PM
by nate510 on 10/16/13, 10:13 PM
Um, so to speak.
by singlow on 10/16/13, 3:32 PM
by wehadfun on 10/16/13, 1:27 PM
by djent on 10/16/13, 3:44 PM
by dschiptsov on 10/16/13, 2:00 PM
by bedhead on 10/16/13, 2:40 PM
by level09 on 10/16/13, 3:41 PM
by digitalsushi on 10/16/13, 3:39 PM
by bicknergseng on 10/16/13, 4:58 PM
by bachback on 10/16/13, 12:55 PM