by AhtiK on 2/14/19, 1:09 PM with 52 comments
by lordnacho on 2/14/19, 7:53 PM
In my years in finance, there was a similar problem. One guy in particular I worked with reckoned himself an "ideas guy" and would simply spout out gibberish that he expected the rest of us to implement. He could barely use Excel himself, let alone code.
The fact is the best coders I met never fancied themselves as specialists. They could certainly fit some models for you, but they could also write some SQL, set up replication and other maintenance, write cron jobs, set up ssh keys, merge some git branches, and write front and back end code in several different languages, declarative and imperative. I always put it down to a mix of curiosity and humility, giving these people a very good grasp of the fundamentals plus a foothold in almost every area of coding that I could think of.
by perturbation on 2/14/19, 6:11 PM
I think (one of) the problems with the data science career field is that there are a lot of juniors who want to run sklearn and call it a day, following tutorials that seem to 'just work' in a way that real-world data never does without a fight.
To get value out of the work, you have to be methodical, careful, and really dig into the data. The observation that 85% of the time is cleaning doesn't eliminate the need to know what you're doing, what approaches to use, how to judge success, how to communicate results, etc.
Another thing to consider: I've found big, boring companies are usually better to do DS at than small ones. Big, boring companies have better discipline in collecting and managing data. Also, a 1% improvement to an existing process matters a lot at BigCo, and very little at a startup - and a lot of DS models are that sort of incremental progress over rules engines or heuristics.
by rjbwork on 2/14/19, 5:36 PM
But I'm really just a developer who's good at databases and ETL, along with my regular tasks of writing near-realtime background processing systems, web APIs, SQL, etc.
I think the data science industry seems to have been massively overhyped, and now they want people who can use AI and statistical learning methods and all this other stuff I don't know to do plain old data engineer work.
A sad outcome for a discipline that once held so much promise.
by gipp on 2/14/19, 4:09 PM
This post rings extremely true to my experience, and largely aligns with what I've been telling people for the last couple of years. I see so many bootcamp or Masters grads with a wildly skewed understanding of what the job entails. I also see a lot of MBA types diluting the meaning of the DS term as a whole.
A "data science" curriculum as such will basically prepare you only for an analyst role. You're not going to be able to compete with the glut of science PhDs flooding every open role, either. DS may be your title but you will not be doing any of the exciting things you want to be doing. To differentiate yourself you need to specialize, and good engineering skills are a prime way to do that.
by itronitron on 2/14/19, 3:52 PM
As it should be. In order to have confidence in your ML you need to really understand your data and data processing.
by twic on 2/14/19, 5:17 PM
Be prepared for most of your data scientist work to not be data science. Adjust your skillset for that.
Same in real science - for every minute you spend thinking about what nature might be doing, you spend tens of hours carrying things around, mixing things, checking things, repeating things, etc. This is how all real work is.
Most modern languages are procedural: Java, Python, Scala, R, Go, etc.
If someone has a friend who does Scala, can they read them this quote and film the reaction? Thanks.
by alexgmcm on 2/14/19, 1:47 PM
If you have a good academic background it can be possible to enter a DS role immediately but often you will be doing work far more towards the Business Intelligence end of things rather than deploying Deep Neural Nets in production or whatever.
I have friends who transitioned into Data Engineering and it does seem like the outlook is better there.
It's an excellent post.
by minimaxir on 2/14/19, 5:07 PM
Both are currently not transparent enough for the data science newbies, which is why on my end I try to be as transparent as possible whenever the topic comes up (I wrote a post similar to the OP last year: https://minimaxir.com/2018/10/data-science-protips/).
by binalpatel on 2/14/19, 5:14 PM
Reliably getting any data science analysis or model running in a real world setting is a demand that's naturally going to follow from the Data Science glut.
by wirrbel on 2/14/19, 6:27 PM
by Mortiffer on 2/14/19, 5:28 PM
by wdavidw on 2/14/19, 7:39 PM
by pooya13 on 2/15/19, 2:05 AM
Yeah they were called quants (aka mathematics/statistics graduates).
by jillesvangurp on 2/14/19, 5:46 PM
The first generation machine learning experts were proper scientists with proper Ph.D. degrees, academic track records, etc. They would typically be very opinionated about which algorithms to use (and quite possibly wrote a few of their own), but they were not necessarily experienced engineers. I saw a lot of clumsy engineering and convoluted testing and evaluation processes.
This explains a lot about the current state of the art which involves a lot of tools that are aimed at people who are not primarily engineers and need to be shielded from complex infrastructure and code but do know a lot about statistics, machine learning algorithms, and all the stuff that first generation machine learning experts would know.
The second generation of machine learning experts is basically riding an ongoing commoditization boom. They use toolkits from Google, Facebook and others pretty much as is. These tools are easy to use for them but not necessarily for non expert engineers that know a lot about pumping data around but not necessarily about machine learning algorithms. This is getting a lot easier. I've heard of high school kids getting ML jobs with no college training whatsoever and just high school math and a bit of online training. My impression is that you can get nice results with a little effort.
The next generation of machine learning engineers won't be scientists and they'll indeed mostly work on manipulating data. All the machine learning algorithms will be provided in the form of black box libraries and tools that will mostly work in a fully automated mode. IMHO the whole point of deep learning is that the algorithms figure things out by themselves. Even the job of picking the right algorithms and configuring them is ultimately going to be something that machine learning algorithms will be better at than a junior engineer with no relevant scientific background.
Or indeed an experienced software engineer with a classic computer science background, like myself. I have no clue what e.g. a tensor is. Articles on the topic seem to be very math heavy and tend to give me headaches. But should I even have to care to be able to configure some black boxes that process data and produce models that I can plug into my runtime? My pet theory is that we're already past that point and that lots of companies are getting decent results not having to care about the underlying algorithms already.
I went to a great meetup at Soundcloud last week about how they used off the shelf machine learning tooling to improve their search ranking in Elasticsearch. It was all about the training data, the parameters in the search query that they wanted to machine learn, their tooling for evaluating model performance in terms of being able to rank real queries against real data, tooling for annotating training data, integrating models with their software, the devops for retraining the models, etc.
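As a rough illustration of what "evaluating model performance in terms of being able to rank real queries against real data" can look like, here is a minimal sketch. The metric (mean reciprocal rank), the queries, and the track ids are all assumptions made for the example; the comment above does not say which metric or data the Soundcloud team used.

```python
def mean_reciprocal_rank(rankings, relevant):
    """Average of 1/rank of the judged-correct result across all queries.

    rankings: query -> ordered list of result ids (best first)
    relevant: query -> the one result id judged correct for that query
    """
    total = 0.0
    for query, results in rankings.items():
        target = relevant[query]
        if target in results:
            total += 1.0 / (results.index(target) + 1)
        # A query whose correct result is missing contributes 0.
    return total / len(rankings)


# Hypothetical judged queries and two hypothetical rankers to compare.
judged = {"daft punk": "t1", "lofi beats": "t2"}
baseline = {"daft punk": ["t9", "t1", "t4"], "lofi beats": ["t7", "t2"]}
model = {"daft punk": ["t1", "t9", "t4"], "lofi beats": ["t2", "t7"]}

baseline_mrr = mean_reciprocal_rank(baseline, judged)
model_mrr = mean_reciprocal_rank(model, judged)
print(f"baseline={baseline_mrr:.2f} model={model_mrr:.2f}")
```

A harness like this is what makes fast iteration possible: every candidate model gets scored against the same judged queries before it goes anywhere near production.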
My experience working with the machine learning team in the search group at Nokia Maps (now Here) eight years ago was that the tools were an obstacle to getting results fast and that iterations on model improvements were measured in months. A lot of engineering went into things like feature extraction, model tuning, and other stuff that scientists do, as well as building essentially all of the tools from the ground up so that models could actually be generated, evaluated, and integrated. Only problem: many of these people weren't experienced engineers, so the tools were kind of clunky and there were lots of integration headaches, insanely long integration cycles, and lots of missed opportunities to fix (rather obvious) data problems due to a bias towards endless tweaking of algorithms instead of applying pragmatic fixes to the data. It kind of worked and the search wasn't horrible, but the biggest problem was that the underlying data wasn't great to begin with (mis-categorized, full of duplicates, incomplete/stale, etc.).
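To make "pragmatic fixes to the data" concrete, here is a minimal sketch of the kind of cleanup that addresses duplicates, inconsistent categories, and stale entries. The record layout, field names, and cutoff date are hypothetical and only for illustration; they do not describe the actual Nokia Maps data.

```python
from datetime import date

# Hypothetical place records; field names are illustrative assumptions.
records = [
    {"name": "Cafe Luna ", "category": "cafe", "updated": date(2010, 5, 1)},
    {"name": "cafe luna", "category": "Cafe", "updated": date(2011, 3, 9)},
    {"name": "Old Mill", "category": "museum", "updated": date(2004, 1, 2)},
]


def clean(records, stale_before):
    """Normalize fields, drop stale rows, keep the newest copy of each duplicate."""
    newest = {}
    for rec in records:
        if rec["updated"] < stale_before:
            continue  # stale entry: skip it
        # Duplicates are detected on a normalized (name, category) key.
        key = (rec["name"].strip().lower(), rec["category"].strip().lower())
        if key not in newest or rec["updated"] > newest[key]["updated"]:
            newest[key] = {
                "name": rec["name"].strip(),
                "category": key[1],
                "updated": rec["updated"],
            }
    return list(newest.values())


cleaned = clean(records, stale_before=date(2008, 1, 1))
print(cleaned)
```

Nothing here involves a model at all, which is the point: a pass like this over the input can matter more than another round of algorithm tuning.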
The people at Soundcloud got it down to iterating in hours with a few months of engineering. That's from idea to proof of concept to having code in production that outperformed a manually crafted query.
That sounds like something I could do but it also sounds like a greenfield for proper tools to emerge that make all of this a lot less painful than it currently is. The next generation hopefully won't have to build a lot of in house tooling and reinvent a lot of wheels while doing so.
by TrackerFF on 2/15/19, 12:01 PM
I'd be amazed if even 10% of the people are able to do anything more than just import scikit-learn, and train a classifier through tutorials.
This is IMO no different than when the software dev. craze started, and people with 3 weeks of coding experience started applying for entry-level jobs. You start interviewing them, and they can't even explain the difference between a for and a while loop.
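For reference, the distinction in question, as a minimal Python sketch (the values are arbitrary and chosen only for illustration):

```python
# for: iterate over a known collection; the number of iterations is fixed up front.
total = 0
for n in [1, 2, 3, 4]:
    total += n

# while: repeat until a condition stops holding; the iteration count
# depends on how the state changes inside the loop.
value, halvings = 100, 0
while value > 1:
    value //= 2
    halvings += 1

print(total, halvings)
```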
In the end, there's just more noise. Both qualified candidates and employers need to find a good way to cut through it.
by anotheryou on 2/14/19, 7:26 PM
1. Lesson: 355k
2. Lesson: 144k
7. Lesson: 34k
Surprisingly close to those 7%.
by tanilama on 2/14/19, 6:15 PM
I think DS has been abused by some people as an umbrella to not produce quality code, yet they somehow hold themselves in higher regard in the value chain.
However I do see there is a real position for DS in the industry, but it should be a specialization of senior SDE when they decide to further their career, not its own job family. Otherwise it should be renamed as data analyst for clarity.
by triplee on 2/15/19, 4:01 PM
Data science is still a thing, and it's maturing in the way that applied sciences do when they get to the point of needing a little more engineering background. Tech. just is never that glamorous, but the dirty secret is that only people in tech. seem to really get that, so we have this hype cycle every few years.