from Hacker News

Pypeline: A Python library for creating concurrent data pipelines

by cgarciae on 9/24/18, 4:21 AM with 46 comments

  • by anentropic on 9/24/18, 11:43 AM

    Too much abbreviation!

    pypeline --> pypeln

    multiprocessing pipeline --> pr

    threads pipeline --> th

    asyncio pipeline --> io

    This is totally unnecessary.

    If I want to use short, abbreviated names in my code, I can always `from pypeline import multiprocess_pipeline as pr`.

    Your library shouldn't export them like this by default.

    `io` is especially bad, since it shadows the `io` module in the Python stdlib.
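
    A minimal sketch of that clash, using the `pypeln` package name and `io` export described above:

      import io                 # the stdlib io module
      from pypeln import io     # the export in question; rebinds the local name "io"

      buf = io.BytesIO()        # would now fail: "io" no longer points at the stdlib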

  • by elsherbini on 9/24/18, 12:50 PM

    Snakemake [0] is a tool worth checking out. You can use it to create declarative workflows and, like make, it builds a DAG of dependencies from the output you ask for. Each rule can specify how many threads it needs, plus other arbitrary resources, and the scheduler uses those to constrain execution (see the rule sketch below). Workflows are architecture-independent: you should be able to execute a Snakemake workflow on a laptop, in the cloud, or on an HPC cluster.

    It also allows you to use UNIX pipes with your dependent jobs when that is appropriate [1].

    [0] https://snakemake.readthedocs.io/en/stable/index.html

    [1] https://snakemake.readthedocs.io/en/stable/snakefiles/rules....
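
    As a sketch, here is a minimal Snakefile rule declaring its thread needs (rule name, file patterns, and thread count are hypothetical):

      rule count_words:
          input: "data/{sample}.txt"
          output: "results/{sample}.counts"
          threads: 4    # the scheduler reserves up to 4 cores per instance of this rule
          shell: "wc -w {input} > {output}"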

  • by somewhatoff on 9/24/18, 10:30 AM

    I wonder if you might compare this to Bonobo [https://www.bonobo-project.org/], which I think has similar design goals?
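
    For comparison, Bonobo's style is a graph of plain generator functions; a minimal sketch following the quickstart pattern from its docs (the steps themselves are made up):

      import bonobo

      def extract():
          yield from range(3)   # produce some rows

      def transform(x):
          yield x * 2           # per-row transformation

      def load(x):
          print(x)              # write each result out

      graph = bonobo.Graph(extract, transform, load)
      bonobo.run(graph)
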
  • by adamcharnock on 9/24/18, 9:12 AM

        Pypeline was designed to solve simple medium
        data tasks that require concurrency and parallelism
        but where using frameworks like Spark or Dask
        feel exaggerated or unnatural.
    
    This is exactly what I was looking for very recently. Thank you for writing this; I'll certainly look into it.

  • by chrisjc on 9/24/18, 3:25 PM

    Seems like a good time to link to this curated list of pipeline toolkits (not all Python).

    https://github.com/pditommaso/awesome-pipeline/blob/master/R...

  • by snidane on 9/24/18, 7:50 PM

    From my experience building similar pipelining and reverse-Polish function-application tooling in Python:

    Piping with the | operator can make tracebacks pretty ugly for some operators.

    If you want to keep the code somewhat 'pythonic' without introducing that syntax magic, then instead of writing:

      range(10)
      | pp.flatmap(lambda x: [x + 1, x + 2])
      | pp.map(lambda x: x * x)
      ...
    
    you can do this:

      xs = range(10)
      xs = pp.flatmap(xs, lambda x: [x + 1, x + 2])
      xs = pp.map(xs, lambda x: x * x)
      ...
    
    It helps to keep the operand as the first argument instead of the last, because those lambdas are best kept at the end.

    So instead of

      map(fn, xs)
    
    do

      map(xs, fn)
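
    A runnable sketch of that operand-first style, with hypothetical fmap/flatmap helpers (not pypeline's actual API):

      def flatmap(xs, fn):
          # apply fn (which returns an iterable) to each element, then flatten
          return (y for x in xs for y in fn(x))

      def fmap(xs, fn):
          # operand first, callable last, so the lambda can trail naturally
          return (fn(x) for x in xs)

      xs = range(10)
      xs = flatmap(xs, lambda x: [x + 1, x + 2])
      xs = fmap(xs, lambda x: x * x)
      print(list(xs)[:6])  # [1, 4, 4, 9, 9, 16]
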
  • by roel_v on 9/24/18, 1:48 PM

    None of these frameworks (there are many) seem to have support for repeating a certain target multiple times, with different arguments. For example, say you have a data set with per-country data; how do you repeat the same analysis on each country? This simple example is easy with a loop, but when you have multiple dimensions like this, you want to call each target with all possible permutations, depending on which type of dimension is actually relevant for that target. Does any ETL framework support that?
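
    In plain Python, the pattern being described is just a product over the dimensions; a quick sketch (run_analysis and the dimension values are hypothetical):

      from itertools import product

      countries = ["US", "DE", "JP"]
      years = [2016, 2017, 2018]

      # one target invocation per combination of the relevant dimensions;
      # the ask is for the framework to generate these calls by itself
      for country, year in product(countries, years):
          run_analysis(country=country, year=year)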

    (I was actually writing a spec this afternoon for a new tool that does exactly this, because I can't find anything suitable.)

  • by bayesian_horse on 9/24/18, 5:20 PM

    Dask is actually relatively lightweight, because it is pure Python.

    Also, there is "Streamz", which solves a similar problem, seems more mature, and can work with or without Dask or Dask-Distributed.
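
    For flavor, a minimal sketch of Streamz's core Stream API:

      from streamz import Stream

      source = Stream()
      source.map(lambda x: x * 2).sink(print)

      for i in range(3):
          source.emit(i)   # pushes values through the pipeline; prints 0, 2, 4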

  • by timkpaine on 9/24/18, 3:05 PM

    Similar to a library I've been working on: https://github.com/timkpaine/tributary

  • by TBastiani on 9/24/18, 12:11 PM

    mpipe might also be of interest.

    http://vmlaker.github.io/mpipe/

  • by davidnet on 9/24/18, 4:41 PM

    Wow. It seems to save a lot of boilerplate code for ETL.

  • by make3 on 9/24/18, 3:50 PM

    Looks similar to what tf.data does for TensorFlow.
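
    The resemblance, sketched with tf.data (the toy transform is illustrative, and AUTOTUNE is assumed to be available as tf.data.AUTOTUNE):

      import tensorflow as tf

      ds = tf.data.Dataset.range(10)
      ds = ds.map(lambda x: x * x,
                  num_parallel_calls=tf.data.AUTOTUNE)  # map elements concurrently
      ds = ds.prefetch(tf.data.AUTOTUNE)                # overlap producer and consumer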