from Hacker News

Using Doc2Vec to Suggest SubReddits

by jmportilla on 8/25/15, 5:17 AM with 12 comments

  • by sdrothrock on 8/25/15, 7:31 AM

    This is pretty neat, but the biggest problem for me is the case sensitivity; reddit itself doesn't use case sensitivity, so it's hard to remember the exact capitalization of a subreddit name.
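One way to address the capitalization issue raised above is to resolve names through a lowercased index. This is a minimal sketch, not the site's actual code: the `known` list and both function names are hypothetical, and a real model would supply its own labels.

```python
def build_case_index(names):
    """Map lowercased subreddit names to their canonical capitalization."""
    index = {}
    for name in names:
        # If two names differ only by case, the first one seen wins.
        index.setdefault(name.lower(), name)
    return index


def resolve(query, index):
    """Return the canonical name for a query, ignoring case, or None."""
    return index.get(query.strip().lower())


# Hypothetical vocabulary, for illustration only.
known = ["AskReddit", "pcmasterrace", "MachineLearning"]
index = build_case_index(known)
print(resolve("askreddit", index))
print(resolve("MACHINELEARNING", index))
```

The index is built once, so each lookup stays a single dictionary access regardless of how the user capitalizes the name.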
  • by utunga on 8/25/15, 8:52 AM

    Hi!

    Great work. I guess my question is: do you use 'averaging' of word vectors or the Chinese Restaurant process to get to subreddit vectors? You describe the Chinese Restaurant process as a "more sophisticated method" that you "can" use, but in my experiments with word2vec and reddit (https://github.com/utunga/gensimred) I quickly discovered that simple averaging just does not work. Averaging has this awful 'revert to mean' thing that turns all the paragraph vectors into a sort of bland gray goo where they are all the same.

    If you did use the Chinese Restaurant process (I love that phrase - brings back memories of an occasion at a Dim Sum restaurant where this almost literally happened) it'd be great to see any source code you may feel like releasing ;-) ... well, it can't hurt to ask..
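The 'revert to mean' effect described above can be reproduced on toy data. The sketch below is an illustration under stated assumptions, not the commenter's experiment: it uses synthetic vectors (a shared offset of 0.5 plus Gaussian noise) rather than word2vec output or reddit data, and every name in it is hypothetical. The shared offset is what makes the effect visible: as more words are averaged, the noise cancels and every document vector drifts toward the same point.

```python
import math
import random

random.seed(0)
DIM, VOCAB_SIZE = 20, 1000

# Synthetic "word vectors" (assumption): shared offset plus per-word noise.
word_vectors = [[0.5 + random.gauss(0.0, 1.0) for _ in range(DIM)]
                for _ in range(VOCAB_SIZE)]


def average_doc_vector(num_words):
    """Average the vectors of num_words randomly chosen words."""
    words = [random.choice(word_vectors) for _ in range(num_words)]
    return [sum(column) / num_words for column in zip(*words)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_cosine(num_docs, num_words):
    """Mean cosine similarity over all distinct pairs of averaged documents."""
    docs = [average_doc_vector(num_words) for _ in range(num_docs)]
    sims = [cosine(docs[i], docs[j])
            for i in range(num_docs) for j in range(i + 1, num_docs)]
    return sum(sims) / len(sims)


short_sim = mean_pairwise_cosine(num_docs=20, num_words=5)
long_sim = mean_pairwise_cosine(num_docs=20, num_words=1000)
print(f"5 words per document:    {short_sim:.3f}")
print(f"1000 words per document: {long_sim:.3f}")
```

With long documents the pairwise similarity approaches 1, which is the "gray goo" outcome: the averaged vectors become nearly indistinguishable.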

  • by joelthelion on 8/25/15, 3:33 PM

    Very cool. Little tip: use "-funny" to get high-quality subs :)
  • by Yadi on 8/25/15, 7:57 AM

    Awesome seeing someone use the reddit dataset :)!

    Wouldn't w2v as a recommender for the user have been better?

    Taking the user's comments/likes/subreddits as features.

  • by riffraff on 8/25/15, 9:28 AM

    Neat. I'd suggest treating spaces as "+", i.e. "cats awww" should be the same as "cats+awww", I guess :)
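The suggestion above can be sketched as a small query parser. This is a hedged illustration, not the site's implementation: `parse_query` is a hypothetical name, and the handling of a leading "-" is an assumption based on the "-funny" tip earlier in the thread. It also assumes subreddit names contain only letters, digits, and underscores.

```python
import re

# An optional sign, optional whitespace, then a subreddit-like name.
TOKEN = re.compile(r"([+-]?)\s*([A-Za-z0-9_]+)")


def parse_query(query):
    """Split a query into (positive, negative) name lists.

    Names joined by spaces or "+" are positive; a "-" prefix marks a negative.
    """
    positive, negative = [], []
    for sign, name in TOKEN.findall(query):
        (negative if sign == "-" else positive).append(name)
    return positive, negative


print(parse_query("cats awww"))
print(parse_query("cats+awww"))
print(parse_query("cats -funny"))
```

Lists of this shape match what gensim's `most_similar(positive=..., negative=...)` accepts, though the thread does not say whether the site queries its model that way.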
  • by haxiomic on 8/25/15, 9:35 AM

    Nice idea :), works well. Spotted a small typo in the examples:

    pcmasterace+mac should be pcmasterrace+mac (missing an r)