by danielfriedman on 4/7/20, 7:11 PM with 163 comments
by hectormalot on 4/7/20, 8:06 PM
This. Universities and online challenges provide clean labeled data and score on model performance. The real world will provide you... “real data” and score you (hopefully) by impact. Real data work requires much more than modeling. Understanding the data, the business, and the value you create is important.
As per #6, better data and model infrastructure is crucial in keeping the time spent on these activities manageable, but I do think they’re important parts of the job.
I’ve seen data science teams at other companies working for years on topics that never see production because they only saw modeling as their responsibility. Even the best data and infrastructure in the world won’t help if data scientists do not feel co-responsible for the realization of measurable value for their business.
Training integrative data professionals could be a great opportunity for bootcamps. Universities will (understandably) focus on the academically interesting topic of models, while companies will increasingly realize they need people with skills across the data value chain. I know I would be interested in such profiles. :)
by throwaway713 on 4/7/20, 10:29 PM
The really interesting research type work (Bayesian modeling, convolutional neural networks, etc.) takes a long time to implement and may produce no useful results, which is a really bad outcome at a company that measures performance in six month units of work and highly values scheduled deliverables and concrete impact. Many of the data scientists I work with tend to stick to methods that are actually quite simple (e.g., logistic regression, ARIMA) because these at least deliver something quickly, despite the fact that many of my coworkers come from research-heavy backgrounds.
In my org, there's nothing stopping anyone from pursuing advanced machine learning; for the most part we set our own agenda (in fact, determining priorities is part of the job role). And some people do in fact go after state-of-the-art ML, with some really cool results to show for it. But in terms of career progression and job safety, the risk is just way too high, at least for me personally. I save the highly mathematical stuff for a hobby.
Edit: while this may sound a bit negative, I will add that my description of data science isn't a complaint per se; I am mainly trying to inform those who are seeking a career in data science of what to expect compared to what is often promised. The work that is most valuable to a business is not exciting all of the time, but I don't think there is another job in the tech industry that I would find more enjoyable than my current one at the moment.
by resolaibohp on 4/7/20, 9:33 PM
Applying for and interviewing for data science jobs is a total nightmare. You are competing against 100s or even 1000s of applicants for every job posting because someone said it was one of the sexiest careers of the 21st century. Further exacerbating this, everyone believes that data is the new oil, and that large profit multipliers are just waiting to be discovered in the virgin data that companies are sitting on. All that is missing is someone to run some neural network or deep learning algo on it to discover the insights that nobody else can see.
The reality is that there is an army of people who know how to run these algos. MOOCs, blogs, YouTube, etc. have been teaching everyone how to use these Python/R packages for years. The lucky few who get that coveted data science job can't wait to apply these libraries to the virgin data, only to find that they have to do all kinds of data manipulation to make the algos even work, which takes days and weeks of mundane effort. Finally they find out the data is so lacking that their deep learning model does very little to provide actual business value. It is overly complicated, computationally expensive, and in the back of their minds they know they could get the same results with some simple logic.
Managers who don't understand data science fundamentals learn about it from the news and have their data scientists implement those buzzwords so they can look good in front of their bosses.
I think there is a place for data scientists who understand the fundamentals of the models out there and know when you should not use them. Data science is also increasingly a subset of software engineering, and a good data scientist in a tech company should be able to code well. I also think that there is not some huge unmet demand for data scientists, just a huge amount of hype and managers wanting to look good by saying they managed a data science team.
by wenc on 4/7/20, 10:05 PM
You see, decision-making involves (1) getting data, (2) summarizing and predicting, and (3) taking action. Continuous decision-making -- the kind that leads to impact -- involves doing this repeatedly in a principled fashion, which means creating a system around the decision process.
For systems thinkers, this is analogous to a feedback control loop which includes sensor measurements + filters, controllers and actuators.
(1) involves programmers/data engineers who have to create/manage/monitor data pipelines (that often break). This is the sensor + filters part, which is ~40% of the system.
(2) involves data scientists creating a model that guides the decision-making process. This is the model of the controller (not even the controller itself!), which is ~20% of the system. Having the right model is great, but as most control engineers will tell you, even having the wrong model is not as terrible as most people think because the feedback loop is self-correcting. A good-enough model is all you need.
(3) involves business/front-line people who actually implement decisions in real-life. This is where impact is delivered. ~40% of the system. This is the controller + actuator part, which makes the decisions and carries them out.
Most data scientists think their value is in creating the most accurate model possible in Jupyter. This is nice, but in real-life not really that critical because the feedback-loop inherently moderates the error when deployed in a complex, stochastic environment. The right level of optimization would be to optimize the entire decision-making control feedback loop instead of just the small part that is "data science".
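To make the loop concrete, here is a toy sketch (mine, not the parent's) of a continuous decision process in which a deliberately crude forecast model still produces sensible actions because the feedback corrects it every period; the inventory framing and all numbers are invented for illustration.

    import random

    # Toy decision loop: reorder stock each period with a deliberately crude
    # demand model. The inventory framing and all numbers are made up.
    random.seed(0)
    history = []      # (1) measurements flowing in from the "sensor"
    stock = 50.0

    for period in range(12):
        demand = random.gauss(100, 15)                   # the real, noisy world
        history.append(demand)

        # (2) a good-enough model: average of the last few observations
        forecast = sum(history[-4:]) / len(history[-4:])

        # (3) the action: order enough to cover the forecast plus a buffer
        order = max(0.0, forecast * 1.1 - stock)
        stock = max(0.0, stock + order - demand)         # feedback closes the loop

        print(f"period {period:2d}  forecast {forecast:6.1f}  order {order:6.1f}  stock {stock:6.1f}")

The point of the toy: the "model" here is embarrassingly simple, yet because the loop keeps running, the errors stay bounded; optimizing only the forecasting step would barely move the outcome.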
p.s. data scientists who have particularly low-impact are those who focus on producing once-off reports (like consultant reports). Reports are rarely read, and often forgotten. Real impact comes from continuous decision-making and implementing actions with feedback.
Source: practicing data scientist
by danmostudco on 4/7/20, 9:36 PM
The worst cases I have seen are when executives take a problem and ask data scientists to "do some of that data science" on it - looking for trends, patterns, automating workflows, making recommendations, etc. This is high-level, pie-in-the-sky stuff that works well in pitch meetings and client meetings, but when it comes down to brass tacks it leaves very little vision of what they are trying to achieve and even less of a viable execution path.
More successful deployments have had a few things in common:
1. A reasonably solid understanding of what the data could and couldn't do. What can we actually expect our data to achieve? What does it do well? What does it do poorly? Will we need to add other data sets? Propagate new data? How will we get or generate that data?
2. The business case or user problem was understood up front. In our most successful project, we saw that users continuously miscategorized items on input, and we built a model to make recommendations. It greatly improved the efficacy of our ingested user data.
3. Break it into small chunks and wins. Promising a mega-model that will do all the things is never a good way to deliver aspirational data goals. Little model wins were celebrated regularly and we found homes and utility for those wins in our codebase along the way.
4. Make it accessible to other members of the company. We always ensure our models have an API that can be accessed by any other services in our ecosystem, so other feature teams can tap into data science work. There's a big difference between "I can run this model on my computer, let me output the results" and "this model can be called anywhere at any time."
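A minimal sketch of what that last point can look like in practice: a trained model wrapped behind an HTTP endpoint that any other service can call. The Flask setup, model, and feature names here are illustrative assumptions, not the poster's actual stack.

    # Minimal sketch: expose a trained model behind an HTTP endpoint so any
    # other service can call it. Model and feature names are illustrative only.
    from flask import Flask, request, jsonify
    from sklearn.linear_model import LogisticRegression
    import numpy as np

    app = Flask(__name__)

    # Stand-in for a real training pipeline: two features, binary label.
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
    y = np.array([0, 0, 1, 1])
    model = LogisticRegression().fit(X, y)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json()                     # e.g. {"features": [2.5, 3.0]}
        features = np.array(payload["features"]).reshape(1, -1)
        proba = float(model.predict_proba(features)[0, 1])
        return jsonify({"probability": proba})

    if __name__ == "__main__":
        app.run(port=8000)

Once the model answers over HTTP, "this model can be called anywhere at any time" stops being a slide bullet and becomes a curl command.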
While not exhaustive, a few solid fundamentals like the above I think align data science capabilities to business objectives and let the organization get "smarter" as time goes on as to what is possible and not possible.
by kristjansson on 4/7/20, 8:06 PM
The point of data scientists and the related roles listed in the article is not to just churn out the fun stuff, but to wade through the institutional and technical muck and mire it takes to bring the fun stuff to bear on a relevant business problem, and to communicate the results in a way that people from all walks can understand.
by Barrin92 on 4/8/20, 4:31 AM
The problem with all this data talk isn't just about implementation or bad structure, the limitations of putting all your bets on inductive reasoning are systemic.
The insight that economists had in the 70s and 80s was that reasoning from aggregated quantities is extremely limited. Without understanding at a structural level the generators of your data, trying to create policy based on outputs is like trying to reason about the inhabitants of a city by looking at light pollution from the sky.
My guess why data science so rarely delivers what it promises is because you can't get any value from historical data if your circumstances change to the point where past data is irrelevant. Which in the world of business happens pretty quickly. To have a competitive advantage, one needs to figure out what has not been seen yet.
And trying to exploit signals suffers from the issue laid out above. There was a funny case of an AI hiring startup trying to predict good applicants, and the result was applicants putting "Oxford" in their applications in a font matching the background color.
by analog31 on 4/7/20, 10:30 PM
This was certainly the vibe that I got from "design of experiments" when it was the statistical method du jour. Then from "Bayesian everything" and now "data science." I remember "design of experiments" studies being conducted with great fanfare and success theater, while producing zero results.
The long term theme is that science is hard for reasons that managers don't understand, can't manage, and are reluctant to reward.
by rafiki6 on 4/7/20, 9:04 PM
Some other industries have been doing "data science" for ages. Credit Risk Modelling, insurance and so on.
Every time I read one of these articles, I feel it's just an individual who entered a kind of crummy situation and they're learning what it means to work in a corporate environment. Some are better than others. Some are more motivated than others. Some have better cultures than others. Some are more willing to make technology a key part of their business strategy. Some are more data driven than others.
My recommendation is to always ask the fundamental question before joining: what are you trying to achieve with data science, and is it actually achievable?
by Optimal_Persona on 4/7/20, 8:25 PM
- It's essential to have/develop domain expertise in your industry.
- Beware plausible but incorrect (or poorly interpreted) data that supports your (or others') assumptions/biases.
- Add on to #4 - at least as bad as this is having well-intentioned people on your team who "know enough" (a bit of SQL or a low/no-code data tool) to be dangerous. Um, why are you joining unnecessary tables, or using a different alias for the same columns/tables in different queries, with no comments or standard formatting?
- Hold your nose, but anything you do in SQL/R/Python/an even fancier programming tool/language is going to pass through MS Excel at least once sooner or later, and Excel can irreversibly bastardize CSVs (even just opening without saving!), truncate precision to 15 digits, change data types, etc.
- So glad for the callout in #7 - there are clearly devs/data folks out there who are happy to take on an "interesting programming project at a great paying job" - that isn't serving the best interests of humanity.
by Icathian on 4/7/20, 7:59 PM
I'll just add one: the business absolutely doesn't care how you get your answers, only whether they're reliable enough (hand-grenade close is better than most companies have today).
While this seems obvious enough to anyone with a few years under their belt, to the new DS grad who has their time series analysis canned in favor of a simple moving average that gets slapped in place and shipped, it can be rather disillusioning.
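For what it's worth, the "simple moving average slapped in place" baseline really is a handful of lines, which is exactly why it tends to win the scheduling argument. A hedged pandas sketch; the file, column names, and window size are invented:

    # The "good enough" baseline: a trailing moving average, shifted so each
    # day's forecast only uses prior days. File and column names are invented.
    import pandas as pd

    df = pd.read_csv("daily_sales.csv", parse_dates=["date"]).set_index("date")
    df["forecast"] = df["units_sold"].rolling(window=7).mean().shift(1)

    mae = (df["units_sold"] - df["forecast"]).abs().mean()
    print(f"7-day moving-average baseline MAE: {mae:.2f}")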
by Vaslo on 4/8/20, 12:42 AM
1). I am hearing about Data Science Teams being furloughed during these times. That isn’t happening in my function (Corporate Finance). I am glad to be secure even though I enjoy much of the data sci work.
2) I’m able to apply Data Science concepts in my current role, and it’s adding a lot of job security and providing me with exposure. I am much less interested now in moving to straight Data Science and instead am applying my learnings in my current role as a sort of in-house Data Science guy. But I have a lot to learn to be honest.
3). There seem to be a lot of “thought leaders” acting like they are big experts in the area who really don't know anything many of us amateur data scientists don't know. They pull perfect clean datasets and show magic transformations they just copy from others to get YouTube hits or Twitter followers. That just never happens in real life, and many leaders are seeing this and losing interest in the function given the returns they are getting from standalone data science folks.
by s1t5 on 4/7/20, 9:26 PM
Neither of those quite matches the article's title; perhaps it just refers to the author's personal expectations. Neither of them seems that specific to data science, or without parallels in other software jobs. And neither of the points reads like a slight against data science to me, as some of the other commenters here suggest.
by UweSchmidt on 4/7/20, 8:48 PM
Progress may only come slowly, ideally through products bought from 3rd parties whose results are understood and controlled by management.
by mirimir on 4/8/20, 4:52 AM
-- write discovery requests
-- review production, and check out data and documentation
-- write supplementary discovery requests
-- review production, and check out data and documentation
[repeat as needed]
-- analyze data, and write deposition questions
-- help attorneys wring answers from deponents
[repeat as needed]
-- analyze data, and produce required output
-- write parts of briefs and expert reports
I generally did that in consultation with testimonial experts and their data analysts. Sometimes that didn't happen until we'd documented the case enough to know that it was worth it. And occasionally small cases settled with just me as the "expert".
It's a small industry, and not easy to get into, unless you know key players at key firms. But the money's pretty good, and the work can be exciting. I loved being that guy in depositions whispering questions to the attorneys :)
This all involved pretty simple calculation of damages, through comparing what actually happened vs what would have happened but for the illegal behavior. But-for models were typically based on benchmarks.
After data cleanup in UltraEdit, I did most of the analysis in SQL Server. I used Excel for charting and final calculations.
by avip on 4/7/20, 9:53 PM
by op03 on 4/7/20, 8:10 PM
If it's just data folk by themselves getting dumped with org data and told to find pirate gold... then it's a crapshoot.
by agentofoblivion on 4/7/20, 10:25 PM
I know this because I've been on that journey. But there's no reason to expect some department head that's never been exposed to DS to know this. They just copy/paste some other company's job req. If you're more junior, here are my tips:
- If it's a "new DS team" that supports a variety of teams: beware. Bolt-on DS doesn't work well, as it's really hard to build a meaningful solution that's not deeply integrated.
- If it's an old company or in a conservative industry: beware. There are likely to be data silos and difficult ownership models that make it nearly impossible to get and join the data you need.
- If it's a small company: beware. You're likely going to need a broad set of knowledge that's won with several years of experience to be able to build end-to-end solutions that are integrated into the rest of the tech stack.
- If it's not an engineering-driven culture: beware. DS will often be used to provide evidence to someone who's already made up their mind so they can pretend they're being data-driven, and you'll be the disrespected nerd who's expected to do whatever it takes to deliver the answer they want. Most companies claim to be "data-driven"; few are, and even fewer understand that data-driven isn't always possible or desirable.
Industry is still trying to figure out how to use ML and is still learning that it's not as easy as hiring someone who knows all the algorithms; rather, it takes deep technological changes to data infrastructure to enable the datasets that can then be used by the ML experts. But you don't have to be the person who helps them figure this out the hard way (i.e., by being paid to not accomplish much due to problems outside of your control). Better to find a place with a healthy data science team that can help you learn and contribute. They exist.
by deppp on 4/7/20, 8:45 PM
For example, I'm working on a tool to make data management easier and convert datasets into a structured representation. If you find that you spend a lot of time preparing and analyzing data, and that it is tedious, please reach out to me at michael at heartex.net; I would love to get your feedback on the product we have built so far.
by AndrewKemendo on 4/7/20, 11:30 PM
You can't compress information until you have it in a format that is appropriate for compression.
That is:
You can't compress (apply/create algorithms) information (data) until you have it (instrumented data collection) in a format (schema) that is appropriate for efficient compression (structured logging/cleaning).
99% of that is Data Engineering and building good engineering practices which have good data practices as a priority.
For any organization that has more than a handful of employees and more than one product, that is a non trivial task and gets more difficult the larger the organization gets.
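A small illustration of the "appropriate format" point (my sketch, with invented field names): emitting events as structured records with a fixed schema, instead of free-text log lines, is most of what later makes the "compression" possible downstream.

    # Structured-logging sketch: emit events as JSON records with a fixed
    # schema rather than free-text lines. Field names are invented.
    import json, logging, time

    logging.basicConfig(level=logging.INFO, format="%(message)s")
    logger = logging.getLogger("events")

    def log_event(event_type, **fields):
        record = {"ts": time.time(), "event": event_type, **fields}
        logger.info(json.dumps(record))

    log_event("checkout_completed", user_id=42, cart_value=19.99, items=3)
    # -> {"ts": ..., "event": "checkout_completed", "user_id": 42, ...}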
by minimaxir on 4/7/20, 8:54 PM
As noted in the submission, there's a lot of flexibility in what a "data scientist" is. Normally that's good and healthy for the industry. However, it contradicts a lot of optimistic bootcamps/Medium/YouTube videos, and many won't be prepared for the difference.
by stuxnet79 on 4/7/20, 8:10 PM
It's overall hurting my ability to build my personal brand and seek roles that are a fit for my existing skillset and aspirations.
What exactly does 'ML Engineer' communicate to employers in terms of baseline skills? Is the role closer to that of a data engineer or an analyst?
by arafa on 4/7/20, 7:46 PM
High Data Scientist salaries and expectations combined with a shortage of qualified people often mean you're expected to be a one-person band, which I find to be miserable.
by ajeet_dhaliwal on 4/7/20, 11:53 PM
by smitty1e on 4/7/20, 8:57 PM
via https://www.reddit.com/r/QuotesPorn/comments/b76ujr/if_you_t...
by FridgeSeal on 4/7/20, 11:42 PM
Nothing has summed up my entire working experience more than this, it’s almost painfully accurate.
On one hand it’s an exciting challenge, you learn a lot and you get good at adapting to these situations.
On the downside I have practically no senior data science people to turn to for help when I do need it, which is frustrating.
by mikorym on 4/8/20, 10:02 AM
I don't mean manufacturing (which is doing really well), but companies like Microsoft, Google, Facebook (and even Apple) and others do encourage you to try to compete against their founders (or maybe society does that) rather than focusing on being solid mathematically. Yes, Google pays people well with those skills, but movies portray mostly their founders, emphasising how rich they are, while mathematicians are generally portrayed as weird. Society as a whole puts more emphasis on Bill Gates than on fundamental researchers.
In fact, if you really want to have a rich representative, you can pick the Simons guy. (See, I don't even know his name.) His Medallion hedge fund was built on mathematics. Ironically, Bill Gates is these days one of the biggest financial supporters of people with science skills that he doesn't have.
It is a fad to be a techie. Mathematics is not a fad, although it does have internal fads.
by worik on 4/7/20, 8:43 PM
by erdos4d on 4/8/20, 4:56 AM
by zeveb on 4/8/20, 12:52 PM
The problem is that this can all too easily become motivated reasoning: one provides a stakeholder with support for the decision he already made. From his point of view, this is a valuable service, but it does the organisation a disservice: decisions should be made after considering the data, rather than by considering only those data which support a decision already made.
Also, while ethical issues certainly arise, I think that Greyball is not a good example. Uber evading police enforcing the taxi monopoly is no more unethical than the Underground Railroad evading fugitive-slave agents. The taxi monopoly is itself unethical, and evading it increases the common good.
by sgt101 on 4/7/20, 9:11 PM
I have never had hopes about the potential impact of being a Data Scientist. I felt every company should be a “data company”, but everything I knew told me that companies are political institutions bounded by the pressures of late stage capitalism. Anyone who thinks differently is dim; anyone who blogs about it is a moron.
My expectations did meet reality.
Where did my expectations come from?
I attended a four year Computer Science degree, followed by four and a half years of earning a Ph.D. I then spent 20 years in industry. 19 of those 20 years' focus was not on machine learning (ML) and artificial intelligence (AI).
I figured I’d spend most of my time buried in code and data, I was right, I had to find shit buried in it, and dig it out with my teeth. Executives hated me because I was a threat, but they needed me so I continued to get paid. I continue to be able to create insight and predictions that almost no one else can, and until this stops I will get a 200k a year salary, benefits and a Tesla.
All of this happened, I can't be bothered to waste my time commenting on this moronic blog post.
by Ididntdothis on 4/7/20, 9:05 PM
by DrNuke on 4/7/20, 8:51 PM
by kovac on 4/8/20, 3:04 AM
I'm not fully convinced that data science with ML and other modern techniques is applicable across domains out of the box. I think there is value to be added if data scientists can specialise in domains.
If we take humans as an analogy, even with the kind of general intelligence we have, we need domain expertise to be able to have advanced intuitions and make predictions about the future. I believe this is true for data science as well.
by starchild_3001 on 4/8/20, 4:58 AM
by StonyRhetoric on 4/8/20, 1:59 AM
So here's what I think I did right:
1. Provide indisputable, obvious business value every month. You should consider yourself an in-house consultant to whichever cost center your salary is drawn from. If you're product development, prove value to them. If you're operations, or sales, or marketing, prove value to them. After about two months, you should be able to justify your existence in two sentences. Just remember, most of your company probably thinks of you as an optional add-on.
Your first few projects should attack high-impact pain points with the simplest solutions possible. My first projects were basically ETL into some basic regression into a dashboard. No machine learning required (a rough sketch of that kind of starter project follows after this list). But it was better than what they had (which was often nothing), and it was STABLE and RELIABLE. And that leads to the next point...
2. Build trust. With my dead-simple models, nothing ever blew up, there were no nonsensical answers, and there wasn't much brittleness when new categorical features or more cardinality was added. It mostly just worked. And that built my reputation for me. They didn't have to understand what was going on in the model, but they knew, from experience, that they could trust the result. Once I had the credibility, I could start building more complex, more elaborate models, and asked them to trust those as well. If they don't trust your models, then no business value has been created, and your job is worthless.
3. Recognize that data science is being done everywhere in the organization, and respect it. Every department has someone who has built a monster spreadsheet that contains more embedded domain knowledge than you could hope to learn in a month. As data scientists, we like to think that we're helping the organization by building critical metrics to improve performance. But here's the catch. If the metric was truly critical, someone has built it already. It might be ad-hoc, use poor methodology, and be somewhat wrong, but it works and is good enough. You have to find that person, learn from them, and improve on it.
4. Be as self-contained as possible. Ideally, your critical path should not depend on other teams doing things for you (except for IT setting up data access). You should be able to do it all, from front-end dashboards, to ETL, to DevOps. Remember, you're an in-house consultancy. You should be able to take problems and just handle them, rather than be a perpetual bother and distraction to other teams.
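(Referring back to point 1: here is a rough sketch of what one of those "ETL into basic regression into a dashboard" starter projects can look like. The file, columns, and dashboard hand-off are stand-ins, not the poster's actual project.)

    # Dead-simple first project: pull a CSV, fit a plain regression, write the
    # numbers where the existing dashboard tool can read them. Names are stand-ins.
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.read_csv("monthly_ops.csv")                          # extract
    df = df.dropna(subset=["headcount", "tickets_closed"])       # minimal transform

    model = LinearRegression().fit(df[["headcount"]], df["tickets_closed"])
    df["expected_tickets"] = model.predict(df[["headcount"]])

    df[["month", "tickets_closed", "expected_tickets"]].to_csv(
        "dashboard_feed.csv", index=False)                       # load / hand off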
There's more, but if you do these four things, I think you can build the reputation in your company for creating useful, accurate data tools that help other people do their jobs better. After that's achieved, people will be breaking down your door to get your help. That's where my team is now - we've got a backlog for at least 18 months, with our work priorities often being set directly by the CEO.
by simonkafan on 4/7/20, 10:01 PM
In fact, they actually don't need a data scientist. At best they need someone who cleans data and creates pie charts, or even worse, they relabel the database admin job as "Data scientist".
by Rainymood on 4/8/20, 7:23 AM
I personally love it but am doing more pure software engineering now as the infrastructure is not there and I need to build it myself.
by alixedi on 4/8/20, 3:31 PM
An astonishingly large fraction of Data Science output goes to die in pretty presentations.
From what's left, a large fraction ends up in Spreadsheets.
A disappointingly small fraction ends up in live services.
by new_learner on 4/10/20, 1:50 PM
by dzonga on 4/7/20, 8:53 PM
by mjparrott on 4/8/20, 4:44 AM
by graycat on 4/8/20, 1:44 AM
I view all of such work as applied math.
My experience is that applied math, from the fields I mentioned and some more recent ones, and more, with emphasis on the more, can be valuable and result in attention, usage, and maybe money.
I've had such good results and have seen more by others.
Some examples:
(1) Airline fleet scheduling and crew scheduling long were important, taken seriously, pursued heavily, with results visible and wanted all the way up to the C-suite.
(2) Similarly for optimization for operating oil refineries: So, here is the inventory of the crude oil inputs and the prices of the possible outputs. Now what outputs to make? The first cut, decades ago, was linear programming, and IBM sold some big blue boxes for that. More recently the work has been nonlinear programming.
(3) The rumors are, and I believe some of them, that linear programming is just accepted and used every day in mixing animal feed.
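For readers who haven't seen it, the feed-mix problem is the textbook LP: minimize cost per tonne of feed subject to nutrient floors. A toy sketch using scipy's linprog; every ingredient and number below is made up for illustration.

    # Toy feed-mix LP: kg of each ingredient per tonne of feed, minimizing cost
    # subject to nutrient minimums. All figures are invented.
    from scipy.optimize import linprog

    cost = [0.30, 0.90, 0.20]          # $/kg: corn, soymeal, filler

    # Nutrients per kg (protein fraction, energy in Mcal/kg).
    # linprog expects A_ub @ x <= b_ub, so ">= requirement" gets sign-flipped.
    A_ub = [[-0.09, -0.45, -0.02],
            [-3.4,  -2.9,  -1.1 ]]
    b_ub = [-160.0, -2800.0]           # per tonne: >=160 kg protein, >=2800 Mcal

    A_eq = [[1.0, 1.0, 1.0]]           # ingredients must sum to 1000 kg
    b_eq = [1000.0]

    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * 3)
    print(res.x, res.fun)              # kg of each ingredient, total cost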
No surprise and common enough, IMHO what really talks is money. If you can save significant bucks and clearly demonstrate that, then you can be taken seriously.
But from 50,000 feet up, tough to get rich saving money for others. If they have a $100 million project and you save them $10 million, then maybe you will get a raise.
What's better, quite generally in US careers, is to start, own, and run a successful business. If that business is to supply the results of some applied math, and the results pass the KFC test, "finger lick'n good", then charge what the work is worth.
Maybe now Internet ad targeting is an example.
I'm doing a startup, a Web site. The crucial enabling core of what I'm doing has some advanced pure math and some applied math I derived. Users won't be aware of anything mathematical. But if users really like the site, then it will be mostly because of the math. So, it's some math -- not really statistics, operations research, optimization, machine learning, artificial intelligence, or management science -- it's just some math. The research libraries have rows and rows of racks of math; I'm using some of it and have derived some more.
Generally I found that the best customer for math is US national security, especially near DC. E.g., now some people are building models to predict the growth of COVID-19. Likely the core of that work is continuous time, discrete state space Markov processes, maybe subordinated to Poisson processes. Okay: One of the military projects I did was to evaluate the survivability of the US SSBN (ballistic missile firing submarines) under a special scenario of global nuclear war limited to sea -- a continuous time, discrete state space Markov process subordinated to a Poisson process. Another project was to measure the power spectra of ocean waves and, then, generate sample paths with that power spectrum -- for some submarines. There was some more applied math in nonlinear game theory of nuclear war.
Here's some applied math, curiously also related to the COVID-19 pandemic: Predict revenue for FedEx. So, for time t, let y(t) be the revenue per day at time t. Let b be the total market. Assume growth via virality, i.e., word of mouth advertising from current customers communicating with remaining target customers. So, ..., get the simple first order differential equation, for some k,
y'(t) = k y(t) (b - y(t))
where the solution is the logistic curve which can also be applied to make predictions for epidemics. This little puppy pleased the FedEx BoD and saved the company. Now, what was that, data science, AI, ML, OR, MS, optimization? Nope -- just some applied math.
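For completeness (this part is my addition, with synthetic numbers): that ODE integrates to the logistic curve y(t) = b / (1 + ((b - y0)/y0) * exp(-k*b*t)), and fitting it to a handful of observed revenue points is a few lines:

    # Fit the logistic solution of y' = k*y*(b - y) to observed revenue per day.
    # The "observed" points below are synthetic, purely for illustration.
    import numpy as np
    from scipy.optimize import curve_fit

    def logistic(t, b, k, y0):
        return b / (1.0 + ((b - y0) / y0) * np.exp(-k * b * t))

    t_obs = np.array([0, 1, 2, 3, 4, 5], dtype=float)         # e.g. months
    y_obs = np.array([12, 20, 31, 45, 60, 73], dtype=float)   # revenue per day

    (b_hat, k_hat, y0_hat), _ = curve_fit(logistic, t_obs, y_obs,
                                          p0=[100.0, 0.005, 12.0])
    print(f"estimated market size b ~ {b_hat:.0f}, "
          f"forecast at t=12: {logistic(12.0, b_hat, k_hat, y0_hat):.0f}")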
I have high hopes for the importance, relevance, power, and fortunes from applied math, but can't pick good applications like apples from a tree.
by scottlocklin on 4/7/20, 9:16 PM
Yeah, well there's your problem, my dude. I've been doing what might be described as "data science" since I quit physics in 2004. Aka before the term existed. It's a great area to work in for intelligent people who want to use their brains to impact the real world; vastly better than what people get paid to do in physics. If customers don't know what the tools can do, it's because you as the data scientist have failed to explain it to the customer. If your work product isn't in front of the decision makers, you've also failed: they can tell the bottom line impact and will reward you accordingly. Sometimes there is no data in their data; they should know that up front.
As for whining about poor data quality: n00b. What do you think they're paying you for? Nobody gives a shit what people do in Kaggle competitions.
by codingslave on 4/7/20, 9:20 PM
by eanzenberg on 4/7/20, 10:46 PM
by johndoe42377 on 4/8/20, 2:55 AM
Korzybski formulated these principles, among other things.
Most data science models are as wrong as astrology and numerology. They have no connection to reality, or rather an inadequate one.
This principle explains the abysmal failures of all model-based "sciences", starting from financial markets and going up to virus-spreading models.
A simulation of a non-discrete, non-fully-observable (AI terminology) system has exactly the same relationship with the underlying reality as a Disney cartoon has to the real world.
This is why expectations will never be met, except for natural (non-imaginary) pattern recognition.
A drop of proper philosophy is worth years of virtue signalling.