by mxgr on 2/20/18, 7:51 AM with 169 comments
by ms013 on 2/20/18, 10:29 AM
Languages and libraries are just tools: knowing APIs tells you nothing about how to solve a problem. They just give you things to throw at a problem. You need to know a few tools, but to be honest, they're easy, and you can go surprisingly far with a few relatively simple ones. Knowing how, when, and where to apply them is the hard part, and that often boils down to understanding the mathematics and the domain you are working in.
And don't overuse viz. Pictures do communicate effectively, but often people visualize without understanding. The result is pretty pictures that, people eventually realize, communicate little real domain insight. You'd be surprised that sometimes simple and ugly pictures communicate more insight than beautiful ones do.
My arsenal of tools: python, scipy/matplotlib, Mathematica, Matlab, various specialized solvers (e.g., CPLEX, Z3). Mathematical arsenal: stats, probability, calculus, Fourier analysis, graph theory, PDEs, combinatorics.
(Context: Been doing data work for decades, before it got its recent “data science” name.)
by elsherbini on 2/20/18, 3:38 PM
I use the `tidyverse` from R[0] for everything people use `pandas` for. I think the syntax is soooo much more pleasant to use: it's declarative and, thanks to pipes and "quosures", highly readable. Combined with the power of `broom`, fitting simple models to the data and working with the results is really nice. Add to that that `ggplot` (plus sane styling defaults like `cowplot`) is the fastest way to iterate on data visualizations that I've ever found. "R for Data Science" [1] is a great free resource for getting started.
Snakemake [2] is a pipeline tool that submits steps of the pipeline to a cluster and handles waiting for steps to finish before submitting dependent steps. As a result, my pipelines have very little boilerplate, are self-documented, and abstract the cluster away, so the same pipeline works on a cluster or a laptop.
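For anyone who hasn't used it, a minimal Snakefile looks roughly like this (rule names and paths here are made up, just to show how dependencies are declared):

    # Snakefile (hypothetical example): the first rule is the default target,
    # and Snakemake works out which other rules must run to produce its inputs.
    SAMPLES = ["a", "b"]

    rule all:
        input: expand("results/{sample}.sorted.txt", sample=SAMPLES)

    rule sort:
        input: "data/{sample}.txt"
        output: "results/{sample}.sorted.txt"
        shell: "sort {input} > {output}"

Run it with `snakemake --cores 4` on a laptop, or point it at a cluster submission command/profile and the same file submits jobs for you.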
by Xcelerate on 2/20/18, 1:13 PM
Aside from programming languages, Jupyter notebooks and interactive workflows are invaluable, along with maintaining reproducible coding environments using Docker.
I think memorizing basic stats is not as useful as understanding deeper concepts like information theory, because most statistical tests can easily be performed nowadays with a library call. No one asks people to program in assembler to prove they can program anymore, so why would you memorize 30 different frequentist statistical tests and all of the assumptions that go along with each? Concepts like algorithmic complexity, minimum description length, and model selection are much more valuable.
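To make the "library call" point concrete, here's a toy illustration with made-up data; each classical test really is one line in scipy, and the hard part is knowing which one applies and what its assumptions are:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    a = rng.normal(0.0, 1.0, 200)
    b = rng.normal(0.2, 1.0, 200)

    # each classical test is a single call
    print(stats.ttest_ind(a, b, equal_var=False))  # Welch's t-test
    print(stats.mannwhitneyu(a, b))                # rank-based alternative
    print(stats.ks_2samp(a, b))                    # compare whole distributions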
by chewxy on 2/20/18, 12:22 PM
- Jupyter + Pandas for exploratory work and quickly defining a model
- Go (Gonum/Gorgonia) for production-quality work. (Here's a cheatsheet: https://www.cheatography.com/chewxy/cheat-sheets/data-scienc... . Additional write-up on why Go: https://blog.chewxy.com/2017/11/02/go-for-data-science/)
I echo ms013's comment very much. Everything is just tools; it's more important to understand the math and the domain.
by trevz on 2/20/18, 8:56 AM
Programming languages:
- python (for general purpose programming)
- R (for statistics)
- bash (for cleaning up files)
- SQL (for querying databases)
Tools:
- Pandas (for Python)
- RStudio (for R)
- Postgres (for SQL)
- Excel (the format your customers will want ;-) )
Libraries:
- SciPy (ecosystem for scientific computing)
- NLTK (for natural language)
- D3.js (for rendering results online)
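To make that concrete, a lot of day-to-day work is just gluing those together: query Postgres with SQL from Python, wrangle in Pandas, hand the result back as Excel. A rough sketch, where the connection string, table, and column names are all placeholders:

    import pandas as pd
    from sqlalchemy import create_engine

    # placeholder connection string and query
    engine = create_engine("postgresql://user:password@localhost:5432/mydb")
    df = pd.read_sql("SELECT region, revenue, signup_date FROM sales", engine)

    summary = (df.assign(month=df["signup_date"].dt.strftime("%Y-%m"))
                 .groupby(["region", "month"])["revenue"]
                 .sum()
                 .reset_index())

    summary.to_excel("monthly_revenue.xlsx", index=False)  # the format customers want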
by xitrium on 2/20/18, 4:01 PM
by piqufoh on 2/20/18, 9:31 AM
A sound understanding of mathematics, in particular statistics.
It's amazing how many people will talk endlessly about the latest python/R packages (with interactive charting!!!) who can't explain Student's t-test.
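For what it's worth, "being able to explain it" doesn't require much; here's a toy check on made-up data that Welch's t statistic is just the difference in means scaled by its standard error:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    a = rng.normal(0.0, 1.0, 50)
    b = rng.normal(0.5, 1.5, 60)

    # t = (mean_a - mean_b) / sqrt(var_a/n_a + var_b/n_b)
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    t_manual = (a.mean() - b.mean()) / se

    t_scipy, p_value = stats.ttest_ind(a, b, equal_var=False)
    print(t_manual, t_scipy)  # the two agree; the p-value comes from a t distribution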
by justusw on 2/20/18, 11:48 AM
Libs:
- Dask for distributed processing
- matplotlib/seaborn for graphing
- IPython/Jupyter for creating shareable data analyses
Environment:
- S3 for data warehousing; I mainly use parquet files with pyarrow/fastparquet
- EC2 for Dask clustering
- Ansible for EC2 setup
My problems can usually be solved with 2 memory-heavy EC2 instances. This setup works really well for me. Reading and writing intermediate results to S3 is blazing fast, especially when partitioning data by day if you work with time series.
Lots of difficult problems require custom mapping functions. I usually use them together with dask.dataframe.map_partitions, which is still extremely fast.
The most time-consuming activity is usually nunique/unique counting across large time series. For this, Dask offers hyperloglog-based approximations.
To sum it up, Dask alone makes all the difference for me!
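For anyone curious what that looks like, here's a rough sketch; the bucket, paths, and column names are placeholders, and reading from S3 assumes s3fs is installed:

    import dask.dataframe as dd

    # partitioned parquet on S3, e.g. one directory per day
    df = dd.read_parquet("s3://my-bucket/events/2018-02-*/part-*.parquet")

    def clean(partition):
        # custom per-partition logic runs as plain pandas
        partition = partition.dropna(subset=["user_id"])
        partition["user_id"] = partition["user_id"].astype("int64")
        return partition

    df = df.map_partitions(clean)

    # approximate distinct count via hyperloglog
    approx_users = df["user_id"].nunique_approx().compute()
    print(approx_users)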
by trollied on 2/20/18, 10:30 AM
I just see the term flung around so much lately, and applied to so many different roles, that it has all become a tad blurred.
Maybe we need a Data Scientist to work out what a Data Scientist is?
by schaunwheeler on 2/20/18, 1:08 PM
https://towardsdatascience.com/data-is-a-stakeholder-31bfdb6...
(Disclaimer: I wrote the post at the above link).
If you have a sound design you can still create a huge amount of value even with a very simple technical toolset. By the same token, you can have the biggest, baddest toolset in the world and still end up with a failed implementation if you have bad design.
There are resources out there for learning good design. This is a great introduction and points to many other good materials:
https://www.amazon.com/Design-Essays-Computer-Scientist/dp/0...
by severo on 2/20/18, 11:36 AM
1. You need research skills that will allow you to ask the right questions, define the problem and put it in a mathematical framework.
2. Familiarity with math (which math depends on what you are doing), to the point where you can read papers that may contain a solution to your problem, plus the ability to propose changes and create proprietary algorithms.
3. Some scripting language (Python, R, w/e)
4. (optional) Software Engineering skills. Can you put your model into production? Will your algorithm scale? Etc.
by dxbydt on 2/21/18, 12:00 AM
Here are 3 questions I was recently asked in a bunch of DS interviews in the Valley.
1. The probability of seeing a whale in the first hour is 80%. What's the probability you'll see one by the next hour? The next two hours?
2. In a closely contested election between 2 parties, what's the chance that only one person will swing the vote if there are n = 5 voters? n = 10? n = 100?
3. Difference between Adam and SGD.
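For reference, one common way to read the first two (the phrasing is ambiguous, so these assumptions are mine): treat each hour as an independent 80% chance of a sighting, and for the election ask how likely a single voter is to be pivotal, i.e. the other n-1 voters split exactly evenly. The third is conceptual: SGD uses one global learning rate, while Adam adapts a per-parameter step size from running estimates of the gradient's first and second moments.

    from math import comb

    # Q1: P(see a whale within t hours), assuming independent hours with p = 0.8
    p = 0.8
    for t in (1, 2):
        print(t, 1 - (1 - p) ** t)        # 0.8, then 0.96

    # Q2: P(a given voter is pivotal) = P(the other n-1 voters split evenly),
    # assuming each of them votes 50/50 independently. For n = 5:
    k = 4
    print(comb(k, k // 2) / 2 ** k)       # 6/16 = 0.375
    # for even n, the other n-1 voters can't tie under this simple model,
    # so you first have to decide what "swing the vote" means there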
by ever1 on 2/20/18, 9:22 AM
Numba for custom algorithms (see the sketch after this list).
Dataiku (amazing tool for preprocessing and complex flows)
Amazon RDS (Postgres), but thinking about Redshift.
Spark
Tableau or plotly/seaborn
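On the Numba point above: the appeal is that a plain Python loop gets compiled to machine code with one decorator, so custom algorithms that don't map onto vectorized NumPy stay fast. A toy sketch (the function itself is made up):

    import numpy as np
    from numba import njit

    @njit
    def largest_gap(xs):
        # largest difference between consecutive elements of a sorted array;
        # an explicit loop like this would be slow in pure Python
        gap = 0.0
        for i in range(1, xs.shape[0]):
            d = xs[i] - xs[i - 1]
            if d > gap:
                gap = d
        return gap

    xs = np.sort(np.random.rand(1_000_000))
    print(largest_gap(xs))  # first call compiles; later calls run at near-C speed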
by closed on 2/20/18, 11:42 AM
* statistical methods (more math)
* big, in-production model fitting (more python)
* quick, scrappy data analyses for internal use (more R)
For example, I would feel weird writing a robust web server in R, but it's straightforward in python. On the other hand, R's shiny lets you put up quick, interactive web dashboards (that I wouldn't trust exposing to users).
by greyman on 2/20/18, 9:09 AM
by kmax12 on 2/20/18, 5:50 PM
Deep learning addresses feature engineering to some extent, but it isn't always the best choice if you don't have image/text data (e.g. you have tabular datasets from databases, or log files) or don't have a lot of training examples.
I’m the developer of a library called Featuretools (https://github.com/Featuretools/featuretools), which is a good tool to know for automated feature engineering. Our demos are also a useful resource to learn from, using some interesting datasets and problems: https://www.featuretools.com/demos
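A minimal example of what deep feature synthesis looks like, using the small mock customer dataset that ships with the library (exact argument names can differ between versions, so check the docs for yours):

    import featuretools as ft

    # small synthetic retail dataset bundled with the library
    es = ft.demo.load_mock_customer(return_entityset=True)

    # Deep Feature Synthesis: automatically builds aggregation and transform
    # features for each customer from the related tables
    feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="customers")

    print(feature_matrix.head())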
by fredley on 2/20/18, 10:23 AM
by cwyers on 2/20/18, 3:49 PM
by innovather on 2/21/18, 3:08 PM
by drej on 2/20/18, 11:59 AM
by Jeff_Brown on 2/20/18, 5:27 PM
Pattern matching helps you write code faster (that is, spending less human time).
Algebraic data types, particularly sum types, let you represent complicated kinds of data concisely.
Coconut is an extension of Python that offers all of those.
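Coconut has its own syntax for this, but the same two ideas can be sketched in plain Python 3.10+ (dataclasses as a sum type, `match` for pattern matching), just to show what they buy you:

    from dataclasses import dataclass
    from math import pi

    @dataclass
    class Circle:
        radius: float

    @dataclass
    class Rect:
        width: float
        height: float

    Shape = Circle | Rect   # a sum type: a Shape is exactly one of these

    def area(shape: Shape) -> float:
        match shape:                       # pattern matching: one case per variant
            case Circle(radius=r):
                return pi * r * r
            case Rect(width=w, height=h):
                return w * h

    print(area(Circle(2.0)), area(Rect(3.0, 4.0)))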
Test driven development also helps you write more correct code.
by ChrisRackauckas on 2/20/18, 12:48 PM
by ak_yo on 2/20/18, 3:26 PM
by pentium10 on 2/20/18, 12:44 PM
by bitL on 2/20/18, 1:20 PM
by larrykwg on 2/20/18, 8:26 PM
A fantastic tree visualization framework; it's intended for phylogenetic analysis but can really be used for any type of tree/hierarchical structure.
by nrjames on 2/20/18, 1:47 PM
That's fine, but when it comes time to create some customer segmentation models (or whatever) the data scientist they hire is going to need to know how to get the raw data. Questions become: how do I write code to talk to this API? How do I download 6 months of data, normalize it (if needed) and store it in a database? Those questions flow over into: how do I set up a hosted database with a cloud provider? What happens if I can't use the COPY command to load in huge CSV files? How do I tee up 5 TB of data so that I can extract from it what I need to do the modeling? Then you start looking at BigQuery or Hadoop or Kafka or NiFi or Flink and you drown for a while in the Apache ecosystem.
If you take a job at a place that has those needs, be prepared to spend months or even up to a year to set up processes that allow you to access the data you need for modeling without going through a painful 75 step process each time.
Case in point: I recently worked on a project where the raw data came to me in 1500 different Excel workbooks, each of which had 2-7 worksheets. All of the data was in 25-30 different schemas, in Arabic, and the Arabic was encoded with different codepages, depending on whether it came from Jordan, Lebanon, Turkey, or Syria. My engagement was to do modeling with the data and, as is par for the course, it was an expectation that I would get the data organized. Well - to be more straightforward, the team with the data did not even know that the source format would present a problem. There were ~7500 worksheets, all riddled with spelling errors and the type of things that happen when humans interact with Excel: added/deleted columns, blank rows with ID numbers, comments, different date formats, PII scattered everywhere, etc.
A data scientist's toolkit needs to be flexible. If you have in mind that you want to do financial modeling with an airline or a bank, then you probably can focus on the mathematics and forget the data wrangling. If you want the flexibility to move around, you're going to have to learn both. The only way to really learn data wrangling is through experience, though, since almost every project is fundamentally different. From that perspective, having a rock solid understanding of some key backend technologies is important. You'll need to know Postgres (or some SQL database) up and down; how to install, configure, deploy, secure, access, query, tweak, delete, etc. You really need to know a very flexible programming language that comes with a lot of libraries for working with data of all formats. My choice there was Python. Not only do you need to know the language well, you need to know the common libraries you can use for wrangling data quickly and then also for modeling.
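As a concrete example of the wrangling side, even the Excel mess above starts with something mundane like this (paths and file patterns are made up, and the real work is in the per-schema cleanup that follows):

    import glob
    import pandas as pd

    frames = []
    for path in glob.glob("raw/**/*.xlsx", recursive=True):
        # sheet_name=None returns every worksheet in the workbook as a dict
        for sheet, df in pd.read_excel(path, sheet_name=None).items():
            df["source_file"] = path
            df["source_sheet"] = sheet
            frames.append(df)

    raw = pd.concat(frames, ignore_index=True, sort=False)
    raw.to_csv("staging/raw_dump.csv", index=False)  # then load into Postgres, e.g. with COPY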
IMO, job descriptions for "Data Scientist" positions cover too broad a range, often because the people hiring have just heard that they need to hire one. Think about where you want to work and/or the type of business. Is it established? New? Do they have a history of modeling? Are you their first "Data Scientist"? All of these questions will help you determine where to focus first with your skill development.
by in9 on 2/20/18, 5:54 PM
However, I can't seem to recall the name. Has anyone seen what I'm talking about?
by anc84 on 2/20/18, 9:59 AM
by eggie5 on 2/20/18, 9:05 AM
by latenightcoding on 2/20/18, 9:17 AM
Language agnostic: XGBoost, LibLinear, Apache Arrow, MXNet
by spdustin on 2/20/18, 9:23 PM
by eps on 2/20/18, 9:27 AM
"Data scientist" title would apply only if you are applying scientific method to discover new fact about natural world exclusively through data analysis (as opposed to observation and experiments).
by sdfjkl on 2/20/18, 12:45 PM
by amelius on 2/20/18, 10:30 AM
by ellisv on 2/20/18, 3:58 PM
by topologie on 2/26/18, 5:14 PM
by kome on 2/20/18, 9:28 AM