from Hacker News

mrjob: Yelp open sources its Elastic MapReduce framework for Python

by pretz on 10/29/10, 9:26 PM with 13 comments
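
For readers who have not seen mrjob, a job is a single Python class with mapper and reducer methods, and the same file can be pointed at local input, a Hadoop cluster, or EMR. The sketch below follows the word-count example from mrjob's documentation; the signatures and the runner flags in the trailing comments are illustrative and may differ slightly across mrjob versions, and the bucket name is made up.

    # word_count.py -- minimal mrjob-style word count (a sketch based on
    # mrjob's documented MRJob API; details may vary by version)
    from mrjob.job import MRJob

    class MRWordCount(MRJob):

        def mapper(self, _, line):
            # The input key is ignored; each value is one line of text.
            for word in line.split():
                yield word.lower(), 1

        def reducer(self, word, counts):
            # counts iterates over every 1 emitted for this word.
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCount.run()

    # Run locally:   python word_count.py input.txt
    # Run on EMR:    python word_count.py -r emr s3://my-bucket/input/   (bucket is hypothetical)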

  • by stevejohnson on 10/30/10, 12:31 AM

    This past week I started working on a Python 3 port of this, mostly to learn. No EMR support unfortunately, but Hadoop should be possible. I just got back from a trip, so it's not very far along yet; it only runs the "local" version, but it should get a bit farther next week.

    I can confirm that it is a great way to learn about MapReduce.

    Link: http://github.com/irskep/mrjob/tree/py3k

    I will likely totally restart the py3k port now that I know what I am doing a bit better. I've been writing Python 3 for about, oh, two weeks.
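
As a companion to the local runner mentioned in the comment above, mrjob jobs can also be driven in-process, which is handy while learning or porting. This is a sketch of the testing pattern from mrjob's docs (sandbox, make_runner, stream_output, parse_output_line); the exact hooks may differ between versions, and MRWordCount refers to the illustrative job sketched earlier.

    # run_local.py -- drive the illustrative word-count job through the local
    # runner in-process (a sketch of mrjob's documented testing pattern)
    from io import BytesIO

    from word_count import MRWordCount  # the hypothetical module sketched above

    # '-r local' keeps the whole job on one machine; '-' reads input from stdin.
    job = MRWordCount(['-r', 'local', '--no-conf', '-'])
    job.sandbox(stdin=BytesIO(b'one fish two fish red fish blue fish\n'))

    with job.make_runner() as runner:
        runner.run()
        for line in runner.stream_output():
            key, value = job.parse_output_line(line)
            print('%s\t%r' % (key, value))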

  • by ashika on 10/30/10, 3:10 AM

    Amazon EMR is an amazing value proposition for virtually any research need, and it's very cool to see wrapper frameworks targeting it directly. Still, for anyone managing their own compute clusters and wanting to do MR in Python, I'd suggest checking out Disco.

    Disco (http://discoproject.org) is a really elegant MR framework implemented in Erlang and Python, with additional support for jobs in C and Java. I've used it for a little over a year and am convinced it is the superior MR platform (Hadoop's terasort victories notwithstanding). New features are being integrated quickly, the core platform is rock solid, management is simple, and it's extremely flexible.
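
For comparison with the mrjob sketch above, Disco's canonical word count looks roughly like the following. This is a sketch in the style of the Disco tutorial, so treat the exact imports and call signatures as approximate for whichever release you install; the input path is a placeholder.

    # disco_word_count.py -- sketch in the style of Disco's tutorial word count
    from disco.core import Job, result_iterator

    def map(line, params):
        # Each map call receives one line of input.
        for word in line.split():
            yield word, 1

    def reduce(iter, params):
        # kvgroup groups the sorted (word, count) pairs by word.
        from disco.util import kvgroup
        for word, counts in kvgroup(sorted(iter)):
            yield word, sum(counts)

    if __name__ == '__main__':
        job = Job().run(input=['path/to/input.txt'],  # placeholder: any file or URL Disco can read
                        map=map,
                        reduce=reduce)
        for word, count in result_iterator(job.wait(show=True)):
            print('%s %d' % (word, count))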

  • by derwiki on 10/29/10, 10:10 PM

    This was a game changer for us: instead of everyone contending for the Hadoop cluster, each developer has their own personal arsenal of Hadoop clusters. Huge win.

  • by deathflute on 10/30/10, 7:27 PM

    On this note, does anyone know a good MapReduce tutorial for experienced programmers? Basically, I want to learn how to frame advanced problems in terms of MR; I am particularly interested in expressing my discrete event simulation that way.
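
One generic way to frame the simulation case, when the replications are independent, is to map over (scenario, seed) pairs and reduce the per-run results into aggregate statistics. The sketch below is purely hypothetical and written mrjob-style to match the thread: run_simulation and the input layout are invented for illustration, and it says nothing about runs whose events are coupled.

    # simulation_sweep.py -- hypothetical sketch of framing independent
    # simulation replications as MapReduce (mrjob-style); run_simulation and
    # the input layout are invented for illustration.
    import random

    from mrjob.job import MRJob

    def run_simulation(scenario, seed):
        # Placeholder for one discrete event simulation run.
        random.seed(seed)
        return random.expovariate(1.0)  # stand-in for e.g. mean waiting time

    class MRSimulationSweep(MRJob):

        def mapper(self, _, line):
            # Each input line names one independent replication: "<scenario> <seed>"
            scenario, seed = line.split()
            yield scenario, run_simulation(scenario, int(seed))

        def reducer(self, scenario, results):
            # Aggregate the independent runs for each scenario.
            results = list(results)
            yield scenario, sum(results) / len(results)

    if __name__ == '__main__':
        MRSimulationSweep.run()
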
  • by FraaJad on 10/29/10, 10:56 PM

    Nice to see one more production use of Cython.

  • by LiveTheDream on 10/30/10, 8:54 PM

    So does most of your data live in S3 in JSON format?
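
If the data does sit in S3 as one JSON object per line, mrjob's protocol hooks can decode each record before it reaches the mapper. The sketch below uses JSONValueProtocol as described in mrjob's protocol documentation; the spelling of the class-attribute hook has changed across versions, and the field names and S3 path in the comments are invented for illustration.

    # review_counts.py -- sketch of reading line-delimited JSON with an mrjob
    # protocol; field names and the S3 path below are hypothetical.
    from mrjob.job import MRJob
    from mrjob.protocol import JSONValueProtocol

    class MRReviewCounts(MRJob):

        # Decode each input line as JSON before it reaches the mapper.
        INPUT_PROTOCOL = JSONValueProtocol

        def mapper(self, _, record):
            # record is already a dict, e.g. {"business_id": "...", "stars": 4}
            yield record['business_id'], 1

        def reducer(self, business_id, counts):
            yield business_id, sum(counts)

    if __name__ == '__main__':
        MRReviewCounts.run()

    # e.g. python review_counts.py -r emr s3://my-bucket/reviews/   (path is hypothetical)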