from Hacker News

Bad Data and Data Engineering: Dissecting Google Play Music Takeout Data

by otter-in-a-suit on 12/25/21, 10:17 PM with 15 comments

  • by faizshah on 12/26/21, 1:23 PM

    Great post. For this pipeline I would probably have used a Makefile for the batch pipeline instead of Airflow, just to keep it simple. I would also make my sink a SQLite database so that you can easily search through it with a web interface using Datasette.
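
    A minimal sketch of the SQLite-sink idea, using only Python's standard-library sqlite3 module. The table name, column names, and sample rows are hypothetical placeholders and are not taken from the article's pipeline.

```python
import sqlite3

# Hypothetical records; a real pipeline would derive these from the Takeout export.
tracks = [
    ("Song A", "Artist X", "Album 1"),
    ("Song B", "Artist Y", "Album 2"),
]


def write_tracks(conn, rows):
    """Replace the contents of a 'tracks' table with (title, artist, album) rows."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "CREATE TABLE IF NOT EXISTS tracks (title TEXT, artist TEXT, album TEXT)"
        )
        conn.execute("DELETE FROM tracks")  # keeps repeated batch runs idempotent
        conn.executemany("INSERT INTO tracks VALUES (?, ?, ?)", rows)


def search_by_artist(conn, artist):
    """Return the titles stored for one artist, sorted alphabetically."""
    cur = conn.execute(
        "SELECT title FROM tracks WHERE artist = ? ORDER BY title", (artist,)
    )
    return [row[0] for row in cur]


if __name__ == "__main__":
    conn = sqlite3.connect("tracks.db")  # hypothetical output file
    write_tracks(conn, tracks)
    print(search_by_artist(conn, "Artist X"))
    conn.close()
```

    Once the file exists, running `datasette tracks.db` should serve it for browsing, assuming Datasette is installed.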

    Where bash was used, I would just use Python and call any CLI tools through subprocess. It’s much simpler, and I can run the scripts in a REPL, execute cells in Jupyter, or just use plain PyCharm, so it’s quick and interactive.
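
    A minimal sketch of the subprocess approach. The helper name run_tool is hypothetical, and the command being run is just the Python interpreter itself, standing in for whatever CLI tool the pipeline would actually call.

```python
import subprocess
import sys


def run_tool(args):
    """Run a command, raise CalledProcessError on a non-zero exit, return stdout as text."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


# Stand-in command: the current Python interpreter, so the sketch runs anywhere.
output = run_tool([sys.executable, "-c", "print('hello from a subprocess')"])
print(output.strip())
```

    Passing the arguments as a list avoids shell quoting problems, and check=True turns a failing tool into a Python exception instead of a silently ignored exit code.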

    Love that you included something on building a data dictionary. I am honestly guilty of not including a good data dictionary for the source data in the past; I would just leave the output of df.describe() or df.info() at the top of the Jupyter notebook where I restructured the source data before processing it. I now think you should save a data dictionary of both the source data and the final data as a CSV, since it’s more maintainable, or at least leave a comment in your script.
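
    A rough sketch of saving a data dictionary as CSV. It deliberately uses only the standard library rather than pandas, and the sample records, function names, and output columns are all hypothetical.

```python
import csv
import io

# Hypothetical source records; None marks a missing value.
rows = [
    {"title": "Song A", "artist": "Artist X", "play_count": 3},
    {"title": "Song B", "artist": None, "play_count": 0},
]

FIELDS = ["column", "types", "non_null", "example", "description"]


def build_data_dictionary(records):
    """Summarize each column: observed types, non-null count, and one example value."""
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)  # preserve first-seen column order
    summary = []
    for col in columns:
        values = [rec.get(col) for rec in records if rec.get(col) is not None]
        summary.append({
            "column": col,
            "types": "|".join(sorted({type(v).__name__ for v in values})),
            "non_null": len(values),
            "example": values[0] if values else "",
            "description": "",  # left blank, to be written by hand
        })
    return summary


def write_data_dictionary(summary, out):
    """Write the summary to any text file object as CSV with a header row."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(summary)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_data_dictionary(build_data_dictionary(rows), buffer)
    print(buffer.getvalue())
```

    The empty description column is the part that needs a human; the rest can be regenerated whenever the source data changes.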

    Otherwise everything else is pretty similar to what I would do. I just went to my Google Takeout and apparently all my Google Play data and songs are gone, so I guess I can’t try this myself…

  • by progbits on 12/26/21, 1:23 PM

    So are the mp3 files not the same as what the author uploaded? I could imagine weird organization for tracks from the service, but for self-uploaded data I would be surprised if they didn't just give back the same files.

    The article never mentioned how this showed up in the GPM app itself, which feels lacking.

    Otherwise a nice article but it reminds me why I long ago gave up on media metadata organization. So much work, so much mess...

  • by wodenokoto on 12/26/21, 4:50 PM

    > The script should be decently self-explanatory [...] Please note that this is all single-threaded, which I don’t recommend - with nohup and the like, you can trivially parallelize this.

    How do you parallelize a loop in bash without getting all the echo output interleaved and jumbled together?
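
    One possible approach, sketched in Python rather than bash: have each job collect its own log lines and let a single place print them, one finished job at a time. The process function and file names below are hypothetical stand-ins for the per-file work in the article's loop.

```python
from concurrent.futures import ThreadPoolExecutor


def process(name):
    """Hypothetical per-file job: buffer log lines instead of printing them directly."""
    lines = [f"[{name}] start", f"[{name}] done"]
    return "\n".join(lines)


def run_all(names, workers=4):
    """Run jobs in parallel and return one buffered output block per job."""
    # pool.map yields results in input order, so each job's output
    # stays contiguous and the overall order is deterministic.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, names))


if __name__ == "__main__":
    for block in run_all(["a.mp3", "b.mp3", "c.mp3"]):
        print(block)
```

    The same idea carries over to the shell: write each job's output to its own buffer or file and emit it only when the job finishes. GNU parallel, for instance, groups output per job by default.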