from Hacker News

Data science is different now

by AhtiK on 2/14/19, 1:09 PM with 52 comments

by lordnacho on 2/14/19, 7:53 PM
There's something very relevant in this story. Data Science is too glamorous a term. There's an implication that the DS person is some sort of magician who maybe isn't as good at general coding but has special data magic skills, making them more valuable than your average grunt.
In my years in finance, there was a similar problem. One guy in particular I worked with reckoned himself an "ideas guy" and would simply spout gibberish that he expected the rest of us to implement. He could barely use Excel himself, let alone code.
The fact is the best coders I met never fancied themselves as specialists. They could certainly fit some models for you, but they could also write some SQL, set up replication and other maintenance, write cron jobs, set up ssh keys, merge some git branches, and write front and back end code in several different languages, declarative and imperative. I always put it down to a mix of curiosity and humility, giving these people a very good grasp of the fundamentals plus a foothold in almost every area of coding that I could think of.
by perturbation on 2/14/19, 6:11 PM
I have been a data scientist for the last 4 years.
I think (one of) the problems with the data science career field is that there are a lot of juniors who want to run sklearn and call it a day, following tutorials that seem to 'just work' in a way that real-world data never does without a fight.
To get value out of the work, you have to be methodical, careful, and really dig into the data. The observation that 85% of the time is cleaning doesn't eliminate the need to know what you're doing, what approaches to use, how to judge success, how to communicate results, etc.
Another thing to consider: I've found big, boring companies are usually better to do DS at than small ones. Big, boring companies have better discipline in collecting and managing data. Also, a 1% improvement to an existing process matters a lot at BigCo, and very little at a startup - and a lot of DS models are that sort of incremental progress over rules engines or heuristics.
by rjbwork on 2/14/19, 5:36 PM
According to this, I'm a data scientist. I've done and do everything on that list except for "put python in production" and "Scaling sharing of Jupyter notebooks". I've put R in production (albeit not my code, I am responsible for making sure it runs correctly and surfacing errors to the system/developer of that code). I maintain a data lake, multiple SQL servers, deal with gobs of json, version control my SQL Schemas, vc our data types (admittedly they change quite rarely), etc. etc.
But I'm really just a developer who's good at databases and ETL, along with my regular tasks of writing near-realtime background processing systems, web APIs, SQL, etc.
I think the data science industry seems to have been massively overhyped, and now they want people who can use AI and statistical learning methods and all this other stuff I don't know to do plain old data engineer work.
A sad outcome for a discipline that once held so much promise.
by gipp on 2/14/19, 4:09 PM
I worked in 3 DS roles over ~5 years, and recently made the "official" jump to SWE. I've also interviewed dozens of candidates for several openings during that time.
This post rings extremely true to my experience, and largely aligns with what I've been telling people for the last couple of years. I see so many bootcamp or Masters grads with a wildly skewed understanding of what the job entails. I also see a lot of MBA types diluting the meaning of the DS term as a whole.
A "data science" curriculum as such will basically prepare you only for an analyst role. You're not going to be able to compete with the glut of science PhDs flooding every open role, either. DS may be your title but you will not be doing any of the exciting things you want to be doing. To differentiate yourself you need to specialize, and good engineering skills are a prime way to do that.
by itronitron on 2/14/19, 3:52 PM
>> ...in the past 2 years, % of any given project that involves ML: 15%, that involves moving, monitoring, and counting data to feed ML: 85%
As it should be. In order to have confidence in your ML you need to really understand your data and data processing.
by twic on 2/14/19, 5:17 PM
Couple of notes.
>> Be prepared for most of your data scientist work to not be data science. Adjust your skillset for that.
Same in real science - for every minute you spend thinking about what nature might be doing, you spend tens of hours carrying things around, mixing things, checking things, repeating things, etc. This is how all real work is.
>> Most modern languages are procedural: Java, Python, Scala, R, Go, etc.
If someone has a friend who does Scala, can they read them this quote and film the reaction? Thanks.
by alexgmcm on 2/14/19, 1:47 PM
As someone working in DS for the last 4 years this is pretty accurate.
If you have a good academic background it can be possible to enter a DS role immediately but often you will be doing work far more towards the Business Intelligence end of things rather than deploying Deep Neural Nets in production or whatever.
I have friends who transitioned into Data Engineering and it does seem like the outlook is better there.
It's an excellent post.
by minimaxir on 2/14/19, 5:07 PM
There's nothing wrong with the data science industry becoming different, as long as expectations are managed. Specifically, as this article notes, that means being clear about the odds of getting hired given the increased competition, and about the realities of the real-world job.
Neither is currently transparent enough for data science newbies, which is why on my end I try to be as transparent as possible whenever the topic comes up (I wrote a post similar to the OP last year: https://minimaxir.com/2018/10/data-science-protips/).
by binalpatel on 2/14/19, 5:14 PM
The market value (i.e. the big bucks) I think will shift into Data Engineering and the role that's abstractly called "Machine Learning Engineer".
Reliably getting any data science analysis or model running in a real world setting is a demand that's naturally going to follow from the Data Science glut.
by wirrbel on 2/14/19, 6:27 PM
When I started my first data-science role, my company's role description sounded a bit like "software engineer who happens to know stats and ML". The description was fairly specific that data scientists would build and deploy models and services. Nowadays the role seems not to fall under the software engineering umbrella. And I do think the change started with the deep learning craze. It distorted a lot in the field. Nowadays I see so many overfitted and complicated models that cannot be operated in production. But they sure make impressive slides and reports.
by Mortiffer on 2/14/19, 5:28 PM
Totally agree. Have been consulting in the data world for some years now. Most companies want to do data science but they have so much low-hanging fruit that it makes no sense to do any ML. If they actually manage to get a senior data scientist hired then they typically torture them with boring BI dashboard creation.
by wdavidw on 2/14/19, 7:39 PM
I have been making the same arguments for the last 3 years. Schools have all launched Data Science programs, flooding the market with statisticians reconverted into data scientists who have basic programming skills and an even lighter notion of data engineering, DevOps tooling and operational understanding. In 2015, our Big Data major was renamed Data Science, even though we are still teaching NoSQL, Hadoop, Spark... I've been careful never to take Adaltas down the DS road, not because we didn't like it but because of the hype around it and the market distortion it created. I tell my customers that we have Data Engineers who can excel in Machine Learning if needed, placing their models in stream processing with Spark or Flink and pushing them into production with operational constraints in mind. Lately, we hired a young Data Scientist consultant with the resume to back it up; the first thing we did was put him on a 4-month diet to teach him how to deploy and secure a platform as an InfraOps and how to write data ingestion as a Data Engineer.
by pooya13 on 2/15/19, 2:05 AM
"In those early years [2012], there was no real formalized way to learn 'data science,'"
Yeah they were called quants (aka mathematics/statistics graduates).
by jillesvangurp on 2/14/19, 5:46 PM
I'm not a data scientist but I've worked with a few over the past 10 years and I strongly agree with this article that the work has changed a lot over that time.
The first generation of machine learning experts were proper scientists with proper Ph.D. degrees, academic track records, etc. They would typically be very opinionated about which algorithms to use (and had quite possibly written a few of their own) but were not necessarily experienced engineers. I saw a lot of clumsy engineering and convoluted testing and evaluation processes.
This explains a lot about the current state of the art which involves a lot of tools that are aimed at people who are not primarily engineers and need to be shielded from complex infrastructure and code but do know a lot about statistics, machine learning algorithms, and all the stuff that first generation machine learning experts would know.
The second generation of machine learning experts is basically riding an ongoing commoditization boom. They use toolkits from Google, Facebook and others pretty much as is. These tools are easy to use for them but not necessarily for non expert engineers that know a lot about pumping data around but not necessarily about machine learning algorithms. This is getting a lot easier. I've heard of high school kids getting ML jobs with no college training whatsoever and just high school math and a bit of online training. My impression is that you can get nice results with a little effort.
The next generation of machine learning engineers won't be scientists and they'll indeed mostly work on manipulating data. All the machine learning algorithms will be provided in the form of black box libraries and tools that will mostly work in a fully automated mode. IMHO the whole point of deep learning is that the algorithms figure things out by themselves. Even the job of picking the right algorithms and configuring them is ultimately going to be something that machine learning algorithms will be better at than a junior engineer with no relevant scientific background.
Or indeed an experienced software engineer with a classic computer science background, like myself. I have no clue what e.g. a tensor is. Articles on the topic seem to be very math heavy and tend to give me headaches. But should I even have to care to be able to configure some black boxes that process data and produce models that I can plug into my runtime? My pet theory is that we're already past that point and that lots of companies are getting decent results not having to care about the underlying algorithms already.
I went to a great meetup at Soundcloud last week about how they used off-the-shelf machine learning tooling to improve their search ranking in Elasticsearch. It was all about the training data, the parameters in the search query that they wanted to machine learn, their tooling for evaluating model performance in terms of being able to rank real queries against real data, tooling for annotating training data, integrating models with their software, the devops for retraining the models, etc.
My experience working with the machine learning team in the search group at Nokia Maps (now Here) eight years ago was that the tools were an obstacle to getting results fast and that iterations on model improvements were measured in months. A lot of engineering went into things like feature extraction, model tuning, and other stuff that scientists do, as well as building essentially all of the tools from the ground up so that models could actually be generated, evaluated, and integrated. Only problem: many of these people weren't experienced engineers, so the tools were kind of clunky and there were lots of integration headaches, insanely long integration cycles, and lots of missed opportunities to fix (rather obvious) data problems due to a bias towards endless tweaking of algorithms instead of applying pragmatic fixes to the data. It kind of worked and the search wasn't horrible, but the biggest problem was that the underlying data wasn't great to begin with (mis-categorized, full of duplicates, incomplete/stale, etc.).
The people at Soundcloud got it down to iterating in hours with a few months of engineering. That's from idea to proof of concept to having code in production that outperformed a manually crafted query.
That sounds like something I could do but it also sounds like a greenfield for proper tools to emerge that make all of this a lot less painful than it currently is. The next generation hopefully won't have to build a lot of in house tooling and reinvent a lot of wheels while doing so.
by TrackerFF on 2/15/19, 12:01 PM
Yeah - tons of traditional analyst jobs (Business Intelligence / Analysis, Marketing analyst, etc.) have been re-labeled as Data Science.
I'd be amazed if even 10% of the people are able to do anything more than just import scikit-learn, and train a classifier through tutorials.
This is IMO no different than when the software dev. craze started, and people with 3 weeks of coding experience started applying for entry-level jobs. You start interviewing them, and they can't even explain the difference between a for loop and a while loop.
In the end, there's just more noise. Both qualified candidates and employers need to find a good way to cut through it.
by anotheryou on 2/14/19, 7:26 PM
fast.ai youtube lesson view numbers:
Lesson 1: 355k
Lesson 2: 144k
Lesson 7: 34k
Surprisingly close to those 7%.
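For anyone who wants to check the arithmetic, here is a minimal sketch that turns the view counts above into percentages of lesson 1 views. The dictionary `views_k` and the helper `retention` are names made up for this illustration, and the counts are just the rounded figures quoted in this comment.

```python
# View counts quoted above, in thousands (lesson number -> views).
# These are the rounded figures from the comment, not fresh data.
views_k = {1: 355, 2: 144, 7: 34}


def retention(views, lesson, baseline=1):
    """Views of `lesson` as a percentage of views of the `baseline` lesson."""
    return 100.0 * views[lesson] / views[baseline]


for lesson in sorted(views_k):
    print(f"Lesson {lesson}: {retention(views_k, lesson):.1f}% of lesson 1 views")
```

With these numbers lesson 7 comes out a little under 10% of lesson 1, so it is in the same ballpark as the 7% figure rather than an exact match.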
by tanilama on 2/14/19, 6:15 PM
This is a pretty honest and acute description of the industry landscape and a prediction of where it is going.
I think DS has been abused by some people as an umbrella excuse for not producing quality code, yet they somehow hold themselves in higher regard in the value chain.
However, I do see a real position for DS in the industry, but it should be a specialization that senior SDEs move into when they decide to further their careers, not its own job family. Otherwise it should be renamed data analyst for clarity.
by triplee on 2/15/19, 4:01 PM
I loved the tone of this article because it's fairly relevant, and with a small facelift, could have been advice to web developers circa the early 2000s.
Data science is still a thing, and it's maturing in the way that applied sciences do when they get to the point of needing a little more engineering background. Tech just is never that glamorous, but the dirty secret is that only people in tech seem to really get that, so we have this hype cycle every few years.