from Hacker News

Ask HN: As a data scientist, what should be in my toolkit in 2018?

by mxgr on 2/20/18, 7:51 AM with 169 comments

  • by ms013 on 2/20/18, 10:29 AM

    Mathematics. Which branch of math you need is domain dependent. Stats come up everywhere. Graphs do too. In addition to baseline math, you really need to understand the problem domain and the goals of the analysis.

    Languages and libraries are just tools: knowing APIs doesn’t tell you at all how to solve a problem. They just give you things to throw at a problem. You need to know a few tools, but to be honest, they’re easy and you can go surprisingly far with few and relatively simple ones. Knowing how, when, and where to apply them is the hard part: and that often boils down to understanding the mathematics and domain you are working in.

    And don’t overuse viz. Pictures do communicate effectively, but often people visualize without understanding. The result is pretty pictures that, people eventually realize, communicate little real domain insight. You’d be surprised how often simple, ugly pictures communicate more insight than beautiful ones do.

    My arsenal of tools: python, scipy/matplotlib, Mathematica, Matlab, various specialized solvers (eg, CPLEX, Z3). Mathematical arsenal: stats, probability, calculus, Fourier analysis, graph theory, PDEs, combinatorics.

    (Context: Been doing data work for decades, before it got its recent “data science” name.)

  • by elsherbini on 2/20/18, 3:38 PM

    I'm a scientist (PhD student in microbiology) who works with lots of data. My data is on the order of hundreds of gigabytes (genome collections and other sequencing data) or megabytes (flat files).

    I use the `tidyverse` from R[0] for everything people use `pandas` for. I think the syntax is soooo much more pleasant to use. It's declarative, and thanks to pipes and "quosures" it's highly readable. Combined with the power of `broom`, fitting simple models to the data and working with the results is really nice. On top of that, `ggplot` (+ any sane styling defaults like `cowplot`) is the fastest way to iterate on data visualizations that I've ever found. "R for Data Science" [1] is a great free resource for getting started.

    Snakemake [2] is a pipeline tool that submits steps of the pipeline to a cluster and handles waiting for steps to finish before submitting dependent steps. As a result, my pipelines have very little boilerplate, they are self-documented, and the cluster is abstracted away, so the same pipeline can work on a cluster or a laptop.

    [0] https://www.tidyverse.org/

    [1] http://r4ds.had.co.nz/

    [2] http://snakemake.readthedocs.io/en/stable/

  • by Xcelerate on 2/20/18, 1:13 PM

    As a data scientist who has been using Julia for 5 years now, I find it is by far the best programming language for analyzing and processing data. That said, it’s common to find many Julia packages that are only half-maintained and don’t really work anymore. (I still don’t know how to connect to Postgres in a bug-free way using Julia.) And you’d be hard pressed to find teams of data scientists that use Julia. So in that sense, Python has much more mature and stable libraries, and it’s used everywhere. (But I really hope Julia overtakes it in the next couple of years because it’s such a well-designed language.)

    Aside from programming languages, Jupyter notebooks and interactive workflows are invaluable, along with maintaining reproducible coding environments using Docker.

    I think memorizing basic stats is not as useful as understanding deeper concepts like information theory, because most statistical tests can easily be performed nowadays with a library call. No one asks people to program in assembler to prove they can program anymore, so why would you memorize 30 different frequentist statistical tests and all of the assumptions that go along with each? Concepts like algorithmic complexity, minimum description length, and model selection are much more valuable.
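
    To make the model-selection point concrete, here is a toy sketch (made-up data, plain numpy) that picks a polynomial degree by BIC, one of the simplest penalized criteria and one with a minimum-description-length interpretation:

      import numpy as np

      np.random.seed(0)
      x = np.linspace(0, 1, 200)
      y = 1.5 * x - 2.0 * x**2 + np.random.normal(scale=0.1, size=x.size)  # truth: quadratic

      n = x.size
      for degree in range(1, 6):
          coefs = np.polyfit(x, y, degree)
          rss = np.sum((y - np.polyval(coefs, x)) ** 2)
          k = degree + 1                              # number of fitted parameters
          bic = n * np.log(rss / n) + k * np.log(n)   # lower is better; penalizes complexity
          print(degree, round(bic, 1))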

  • by chewxy on 2/20/18, 12:22 PM

    My toolkit hasn't changed since 2016:

    - Jupyter + Pandas for exploratory work and quickly defining a model

    - Go (Gonum/Gorgonia) for production quality work. (here's a cheatsheet: https://www.cheatography.com/chewxy/cheat-sheets/data-scienc... . Additional write-up on why Go: https://blog.chewxy.com/2017/11/02/go-for-data-science/)

    I echo ms013's comment very much. Everything is just tools; it's more important to understand the math and the domain.

  • by trevz on 2/20/18, 8:56 AM

    A couple of thoughts, off the top of my head:

    Programming languages:

      - python (for general purpose programming)
      - R (for statistics)
      - bash (for cleaning up files)
      - SQL (for querying databases)
    
    Tools:

      - Pandas (for Python)
      - RStudio (for R)
      - Postgres (for SQL)
      - Excel (the format your customers will want ;-) )
    
    Libraries:

      - SciPy (ecosystem for scientific computing)
      - NLTK (for natural language)
      - D3.js (for rendering results online)
  • by xitrium on 2/20/18, 4:01 PM

    If you care about quantifying uncertainty, knowing about Bayesian methods is a good idea I don't see represented here yet. I care so much about uncertainty quantification and propagation that I work on the Stan project[0] which has an extremely complete manual (600+ pages) and many case studies illustrating different problems. Full Bayesian inference such as that provided by Stan's Hamiltonian Monte Carlo inference algorithm is fairly computationally expensive so if you have more data than fits into RAM on a large server, you might be better served by some approximate methods (but note the required assumptions) like INLA[1].

    [0] http://mc-stan.org/ [1] http://www.r-inla.org/
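
    Stan is the right tool for this, but as a hedged illustration of what "full Bayesian inference with uncertainty" means, here is a tiny random-walk Metropolis sketch (not Stan's HMC) estimating a mean and its posterior uncertainty from made-up data:

      import numpy as np

      np.random.seed(1)
      data = np.random.normal(loc=2.0, scale=1.0, size=50)   # toy data, sigma known = 1

      def log_posterior(mu):
          # Flat prior on mu; Gaussian likelihood with known sigma = 1.
          return -0.5 * np.sum((data - mu) ** 2)

      mu, samples = 0.0, []
      for _ in range(10000):
          proposal = mu + np.random.normal(scale=0.5)
          if np.log(np.random.uniform()) < log_posterior(proposal) - log_posterior(mu):
              mu = proposal
          samples.append(mu)

      posterior = np.array(samples[2000:])         # drop burn-in
      print(posterior.mean(), posterior.std())     # point estimate plus uncertainty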

  • by piqufoh on 2/20/18, 9:31 AM

    > what tools should be in my arsenal

    A sound understanding of mathematics, in particular statistics.

    It's amazing how many people will talk endlessly about the latest python/R packages (with interactive charting!!!) who can't explain Student's t-test.
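
    (For what it's worth, running the test is the easy part; a minimal scipy sketch on made-up data for two site variants is below. Explaining the assumptions behind the numbers it prints is the part being asked for.)

      import numpy as np
      from scipy import stats

      np.random.seed(0)
      a = np.random.normal(loc=10.0, scale=2.0, size=200)   # variant A
      b = np.random.normal(loc=10.5, scale=2.0, size=200)   # variant B

      t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)   # Welch's t-test
      print(t_stat, p_value)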

  • by justusw on 2/20/18, 11:48 AM

    Dealing with large data processing problems my main tools are as follows:

    Libs:

      - Dask for distributed processing
      - matplotlib/seaborn for graphing
      - IPython/Jupyter for creating shareable data analyses

    Environment:

      - S3 for data warehousing; I mainly use parquet files with pyarrow/fastparquet
      - EC2 for Dask clustering
      - Ansible for EC2 setup

    My problems can usually be solved with 2 memory-heavy EC2 instances. This setup works really well for me. Reading and writing intermediate results to S3 is blazing fast, especially when partitioning data by day if you work with time series.

    Lots of difficult problems require custom mapping functions. I usually use them together with dask.dataframe.map_partitions, which is still extremely fast.

    The most time-consuming activity is usually nunique/unique counting across large time series. For this, Dask offers hyperloglog based approximations.

    To sum it up, Dask alone makes all the difference for me!
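
    A minimal sketch of the workflow described above (bucket, columns, and the per-partition logic are hypothetical):

      import dask.dataframe as dd

      # Day-partitioned parquet on S3.
      df = dd.read_parquet("s3://my-bucket/events/2018-01-*.parquet")

      # Custom logic runs per partition as plain pandas.
      def add_session_minutes(part):
          part = part.copy()
          part["session_minutes"] = (part["end"] - part["start"]).dt.total_seconds() / 60
          return part

      df = df.map_partitions(add_session_minutes)

      # Approximate distinct counts (HyperLogLog) are much cheaper than exact ones.
      print(df["user_id"].nunique_approx().compute())

      df.to_parquet("s3://my-bucket/enriched/", engine="pyarrow")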

  • by trollied on 2/20/18, 10:30 AM

    What does "Data Scientist" actually mean these days? Does it mean "Write 10 lines of Python or R, and not fully understand what it actually does"? Or something else?

    I just see the term flung around so much recently, and applied to so many different roles, that it has all become a tad blurred.

    Maybe we need a Data Scientist to work out what a Data Scientist is?

  • by schaunwheeler on 2/20/18, 1:08 PM

    A lot of people in this thread are focusing on technical tools, which is normal for a discussion of this type, but I think that focus is misplaced. Most technical tools are easily learnable and are not the limiting factor in creating good data science products.

    https://towardsdatascience.com/data-is-a-stakeholder-31bfdb6...

    (Disclaimer: I wrote the post at the above link).

    If you have a sound design you can still create a huge amount of value even with a very simple technical toolset. By the same token, you can have the biggest, baddest toolset in the world and still end up with a failed implementation if you have bad design.

    There are resources out there for learning good design. This is a great introduction and points to many other good materials:

    https://www.amazon.com/Design-Essays-Computer-Scientist/dp/0...

  • by severo on 2/20/18, 11:36 AM

    I'd say:

    1. You need research skills that will allow you to ask the right questions, define the problem and put it in a mathematical framework.

    2. Familiarity with math (which? depends on what you are doing) to the point where you can read articles that may have a solution to your problem and the ability to propose changes, creating proprietary algorithms.

    3. Some scripting language (Python, R, w/e)

    4. (optional) Software Engineering skills. Can you put your model into production? Will your algorithm scale? Etc.

  • by dxbydt on 2/21/18, 12:00 AM

    > What’s the fizzbuzz test for data scientists anyway?

    Here are 3 questions I was recently asked in a bunch of DS interviews in the Valley.

    1. The probability of seeing a whale in the first hour is 80%. What's the probability you'll see one by the next hour? The next two hours?

    2. In a closely contested election with 2 parties, what's the chance that a single person will swing the vote if there are n = 5 voters? n = 10? n = 100?

    3. Difference between Adam and SGD.
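
    For the two probability questions, here is one common reading and a quick sanity check (the assumptions are in the comments; an interviewer may of course intend a different model):

      from math import factorial

      def comb(n, k):
          return factorial(n) // (factorial(k) * factorial(n - k))

      # Q1: read "80% in the first hour" as an independent 80% chance each hour,
      # so P(whale within k hours) = 1 - 0.2**k.
      print(1 - 0.2**2, 1 - 0.2**3)      # 0.96 within two hours, 0.992 within three

      # Q2 (odd n): your vote swings the result exactly when the other n - 1
      # voters split evenly, each side equally likely.
      def p_pivotal(n):
          return comb(n - 1, (n - 1) // 2) * 0.5 ** (n - 1)

      print(p_pivotal(5))                 # 0.375
      print(p_pivotal(101))               # roughly sqrt(2 / (pi * n)) for large odd n
      # Even n (10, 100) needs a tie-breaking convention before the same
      # binomial argument gives a number.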

  • by ever1 on 2/20/18, 9:22 AM

    Python: Jupyter, pandas, numpy, scipy, scikit-learn

    Numba for custom algorithms.

    Dataiku (amazing tool for preprocessing and complex flows)

    Amazon RDS (Postgres), but thinking about Redshift.

    Spark

    Tableau or plotly/seaborn

  • by closed on 2/20/18, 11:42 AM

    I would think about which of these you see yourself doing more:

    * statistical methods (more math)

    * big, in-production model fitting (more python)

    * quick, scrappy data analyses for internal use (more R)

    For example, I would feel weird writing a robust web server in R, but it's straightforward in python. On the other hand, R's shiny lets you put up quick, interactive web dashboards (though I wouldn't trust exposing them to users).
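
    A rough illustration of the "straightforward in python" side: a tiny model-serving endpoint with Flask (names and the stand-in model are hypothetical):

      from flask import Flask, jsonify, request

      app = Flask(__name__)

      def predict(features):
          # Stand-in for a real fitted model (e.g. something from scikit-learn).
          return sum(features) / len(features)

      @app.route("/predict", methods=["POST"])
      def predict_endpoint():
          features = request.get_json()["features"]
          return jsonify({"prediction": predict(features)})

      if __name__ == "__main__":
          app.run(port=5000)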

  • by greyman on 2/20/18, 9:09 AM

    If you work at a bigger company doing data analytics, you may also come across Tableau instead of Excel. Apart from SQL, if there is more data, you might want to use BigQuery or something similar.
  • by kmax12 on 2/20/18, 5:50 PM

    One crucial skill you will need is feature engineering. Formal methods for it aren’t typically in data science classes. Still, it’s worth understanding in order to build ML applications. Unfortunately, there aren't many available tools today, but I expect that to change this year.

    Deep learning addresses it to some extent, but isn’t always the best choice if you don’t have image / text data (eg tabular datasets from databases, log files) or a lot of training examples.

    I’m the developer of a library called Featuretools (https://github.com/Featuretools/featuretools) which is a good tool to know for automated feature engineering. Our demos are also a useful resource to learn using some interesting datasets and problems: https://www.featuretools.com/demos
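
    To make "feature engineering" concrete, here is a hand-rolled pandas sketch (hypothetical transactions table) of the kind of per-customer aggregation features that automated tools aim to generate for you:

      import pandas as pd

      tx = pd.DataFrame({
          "customer_id": [1, 1, 2, 2, 2],
          "amount": [10.0, 25.0, 5.0, 5.0, 40.0],
          "timestamp": pd.to_datetime(
              ["2018-01-01", "2018-01-15", "2018-01-03", "2018-01-20", "2018-02-01"]),
      })

      g = tx.groupby("customer_id")
      features = pd.DataFrame({
          "total_spend": g["amount"].sum(),
          "mean_spend": g["amount"].mean(),
          "n_transactions": g["amount"].count(),
          "days_active": g["timestamp"].apply(lambda s: (s.max() - s.min()).days),
      })
      print(features)    # one row of model-ready features per customer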

  • by fredley on 2/20/18, 10:23 AM

    IPython/Jupyter, Pandas/Numpy and Python will get you everywhere you need to go. For now, until Go maybe gets decent DataFrame support, I'd be amazed if any other setup got you to a solution quicker in terms of total time.
  • by cwyers on 2/20/18, 3:49 PM

    You can get a lot of mileage out of just using R, dplyr, ggplot2 and lm/glm. OLS still performs well in a lot of problem spaces. Understanding your data is the key there, and a lot of exploratory visualization there will help a lot.
  • by innovather on 2/21/18, 3:08 PM

    Hey everyone, I'm not a data scientist or a developer but I work with a lot of them. My company, Introspective Systems, recently released xGraph, an executable graph framework for intelligent and collaborative edge computing that solves big problems: those that have massive decision spaces, tons of data, are highly distributed, dynamically reconfigure, and need instantaneous decision making. It's great for the modeling work that data scientists do. Comment if you want more info.
  • by drej on 2/20/18, 11:59 AM

    grep, cut, cat, tee, awk, sed, head, tail, g(un)zip, sort, uniq, split; curl; jq, python3
  • by Jeff_Brown on 2/20/18, 5:27 PM

    Static typing lets you catch errors before running the code.

    Pattern matching helps you write code faster (that is, spending less human time).

    Algebraic data types, particularly sum types, let you represent complicated kinds of data concisely.

    Coconut is an extension of Python that offers all of those.
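
    As a rough plain-Python approximation of what a sum type plus pattern matching looks like (Coconut gives you dedicated, more concise syntax for both):

      from typing import NamedTuple, Union

      # A tiny sum type: a measurement is either a reading or a recorded failure.
      class Reading(NamedTuple):
          value: float

      class SensorError(NamedTuple):
          reason: str

      Measurement = Union[Reading, SensorError]

      def describe(m: Measurement) -> str:
          # Hand-rolled "pattern matching" via isinstance checks.
          if isinstance(m, Reading):
              return "value=%.2f" % m.value
          if isinstance(m, SensorError):
              return "failed: " + m.reason
          raise TypeError(m)

      print(describe(Reading(3.14)))
      print(describe(SensorError("timeout")))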

    Test driven development also helps you write more correct code.

  • by ChrisRackauckas on 2/20/18, 12:48 PM

    A good understanding of calculus (probability), linear algebra, and your dataset/domain. Anything else can be picked up as you need it. Oh, and test-driven development in some programming language; otherwise you can't develop code you know is correct.
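
    A minimal example of what that looks like for data code (hypothetical function and file names), runnable with `pytest`: write the expectation, then make the transform satisfy it.

      # test_cleaning.py
      import pandas as pd

      def drop_bad_rows(df):
          # Keep rows with a non-null id and a non-negative amount.
          return df[df["id"].notna() & (df["amount"] >= 0)]

      def test_drop_bad_rows():
          raw = pd.DataFrame({"id": [1, None, 3], "amount": [10.0, 5.0, -1.0]})
          cleaned = drop_bad_rows(raw)
          assert list(cleaned["id"]) == [1.0]
          assert (cleaned["amount"] >= 0).all()
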
  • by ak_yo on 2/20/18, 3:26 PM

    Experimental design and observational causal inference would be excellent skills to have, especially if you’re working with people who are asking you “why” questions; ML is helpful but isn’t going to cut it alone.
  • by pentium10 on 2/20/18, 12:44 PM

    With 1 TB of processing free every month, the winner for us is Google BigQuery, using standard SQL (SQL:2011) combined with JavaScript UDFs, together with Dataprep.
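
    A hedged sketch of that combination from Python, using the google-cloud-bigquery client (the project, dataset, and the UDF itself are made up):

      from google.cloud import bigquery

      sql = '''
      CREATE TEMP FUNCTION cleaned(name STRING)
      RETURNS STRING
      LANGUAGE js AS """
        return name.trim().toLowerCase();
      """;
      SELECT cleaned(customer_name) AS customer, COUNT(*) AS orders
      FROM `my_project.my_dataset.orders`
      GROUP BY customer;
      '''

      client = bigquery.Client()
      for row in client.query(sql).result():
          print(row["customer"], row["orders"])
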
  • by bitL on 2/20/18, 1:20 PM

    Spark + MLlib, Python + Pandas + NumPy + Keras + TensorFlow + PyTorch, R, SQL, top placement in some Kaggle competitions. This would get you a long way.
  • by larrykwg on 2/20/18, 8:26 PM

    Nobody mentioned this yet: ETE: http://etetoolkit.org/docs/latest/tutorial/tutorial_trees.ht...

    a fantastic tree visualization framework; it's intended for phylogenetic analysis but can really be used for any type of tree/hierarchical structure
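
    A tiny sketch (hypothetical newick string) of how little it takes to get going with ete3:

      from ete3 import Tree

      t = Tree("((A:1,B:1):0.5,(C:1,(D:1,E:1):0.5):0.5);")
      print(t)                      # quick ASCII view in the terminal
      t.render("tree.png", w=400)   # or write an image for a report (needs PyQt)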

  • by nrjames on 2/20/18, 1:47 PM

    There are two "poles" in data science: math/modeling and backend/data-wrangling. Most of the time, the backend/data-wrangling piece is a prerequisite to the math/modeling. The vast majority of small and medium sized companies have not set up the systems they would need to support a data scientist who knows only math/modeling. Depending on the domain, it's not uncommon to find that a small/medium company outsourced analytics to Firebase, Flurry, etc...

    That's fine, but when it comes time to create some customer segmentation models (or whatever) the data scientist they hire is going to need to know how to get the raw data. Questions become: how do I write code to talk to this API? How do I download 6 months of data, normalize it (if needed) and store it in a database? Those questions flow over into: how do I set up a hosted database with a cloud provider? What happens if I can't use the COPY command to load in huge CSV files? How do I tee up 5 TB of data so that I can extract from it what I need to do the modeling? Then you start looking at BigQuery or Hadoop or Kafka or NiFi or Flink and you drown for a while in the Apache ecosystem.

    If you take a job at a place that has those needs, be prepared to spend months or even up to a year to set up processes that allow you to access the data you need for modeling without going through a painful 75 step process each time.

    Case in point: I recently worked on a project where the raw data came to me in 1500 different Excel workbooks, each of which had 2-7 worksheets. All of the data was in 25-30 different schemas, in Arabic, and the Arabic was encoded with different codepages, depending on whether it came from Jordan, Lebanon, Turkey, or Syria. My engagement was to do modeling with the data and, as is par for the course, it was an expectation that I would get the data organized. Well - to be more straightforward, the team with the data did not even know that the source format would present a problem. There were ~7500 worksheets, all riddled with spelling errors and the type of things that happen when humans interact with Excel: added/deleted columns, blank rows with ID numbers, comments, different date formats, PII scattered everywhere, etc.

    A data scientist's toolkit needs to be flexible. If you have in mind that you want to do financial modeling with an airline or a bank, then you probably can focus on the mathematics and forget the data wrangling. If you want the flexibility to move around, you're going to have to learn both. The only way to really learn data wrangling is through experience, though, since almost every project is fundamentally different. From that perspective, having a rock solid understanding of some key backend technologies is important. You'll need to know Postgres (or some SQL database) up and down; how to install, configure, deploy, secure, access, query, tweak, delete, etc. You really need to know a very flexible programming language that comes with a lot of libraries for working with data of all formats. My choice there was Python. Not only do you need to know the language well, you need to know the common libraries you can use for wrangling data quickly and then also for modeling.
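
    As one small, hedged example of that Postgres-plus-Python loop (connection details, table, and columns are hypothetical): pull a query straight into a DataFrame and wrangle from there.

      import pandas as pd
      import psycopg2

      conn = psycopg2.connect(host="localhost", dbname="analytics", user="me")
      df = pd.read_sql(
          "SELECT user_id, event_type, created_at FROM events WHERE created_at >= %s",
          conn,
          params=("2018-01-01",),
      )
      print(df.groupby("event_type").size())
      conn.close()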

    IMO, job descriptions for "Data Scientist" positions cover too broad of a range, often because the people hiring have just heard that they need to hire one. Think about where you want to work and/or the type of business. Is it established? New? Do they have a history of modeling? Are you their first "Data Scientist?" All of these questions will help you determine where to focus first with your skill development.

  • by in9 on 2/20/18, 5:54 PM

    I saw a simple tool somewhere a while ago (maybe a month or so ago): a simple CLI for data inspection in the terminal. It seemed very useful for inspecting data while ssh'ed into a machine.

    However, I can't seem to recall the name. Has anyone seen what I'm talking about?

  • by anc84 on 2/20/18, 9:59 AM

    Any programming language that you are proficient in. A solid understanding of how a computer works. A solid basis in statistics. Anything else is just sprinkles, trends, and field-specific extras.
  • by eggie5 on 2/20/18, 9:05 AM

    Are a lot of people using Spark?
  • by latenightcoding on 2/20/18, 9:17 AM

    If you use Python: scikit-learn, Pandas, NumPy, Tensorflow or PyTorch

    Language agnostic: XGBoost, LibLinear, Apache Arrow, MXNet

  • by spdustin on 2/20/18, 9:23 PM

    OpenRefine (openrefine.org) is definitely a handy (and automate-able) part of my data-cleansing workflow.
  • by eps on 2/20/18, 9:27 AM

    You probably mean "data analyst".

    "Data scientist" title would apply only if you are applying scientific method to discover new fact about natural world exclusively through data analysis (as opposed to observation and experiments).

  • by sdfjkl on 2/20/18, 12:45 PM

    numpy, Jupyter (formerly IPython Notebook), and probably Mathematica anyway.
  • by amelius on 2/20/18, 10:30 AM

    Any book recommendations?
  • by ellisv on 2/20/18, 3:58 PM

    Counting and dividing.
  • by topologie on 2/26/18, 5:14 PM

    Random Matrix Theory.
  • by kome on 2/20/18, 9:28 AM

    Excel, VBA, SPSS ;)