by thewhitetulip on 12/27/22, 9:39 AM with 88 comments
New languages like Rust/OCaml/Nim... if yes, then which?
I don't think learning an ETL tool will be helpful because essentially they are all one and the same.
Any tips?
by slotrans on 12/27/22, 6:20 PM
Completely irrelevant. DE is SQL, Python, sometimes Scala/Java.
Get really good at SQL. Learn relational fundamentals. Learn row- and column-store internals. Understand why databases are the way they are. Familiarize yourself with the differences between systems and why you need to care.
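To make "really good at SQL" concrete: window functions are one of the skills that separate basic from fluent SQL. A tiny sketch using Python's built-in sqlite3 (table and data are invented; needs SQLite 3.25+ for window functions):

    import sqlite3

    # Toy orders table, in memory.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (customer TEXT, region TEXT, amount REAL)")
    con.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [("alice", "east", 120.0), ("bob", "east", 80.0),
         ("carol", "west", 200.0), ("dave", "west", 50.0)],
    )

    # Total per customer plus their rank within their region, in one query:
    # aggregates feed the window function, which runs after GROUP BY.
    for row in con.execute("""
        SELECT customer,
               region,
               SUM(amount) AS total,
               RANK() OVER (PARTITION BY region ORDER BY SUM(amount) DESC) AS rnk
        FROM orders
        GROUP BY customer, region
    """):
        print(row)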
Get familiar with at least minimal Hadoop shit. Set up Spark from scratch on your machine, mess it up, do it again. Grapple with how absurdly complex it is.
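If you want a gentler first rung before wrestling a real cluster, PySpark's local mode runs the whole thing on one machine. A minimal sketch (assumes `pip install pyspark` and a local JVM; the file and columns are invented):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .master("local[*]")   # all local cores, no cluster required
             .appName("scratch")
             .getOrCreate())

    # Top 10 users by event count from a local CSV.
    df = spark.read.csv("events.csv", header=True, inferSchema=True)
    (df.groupBy("user_id")
       .agg(F.count("*").alias("events"))
       .orderBy(F.desc("events"))
       .show(10))

    spark.stop()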
Understand system-level architecture. Analytics systems tend to organize into ingest -> transform -> serving (BI tools etc)... why? There are good reasons! Non-analytics data systems have different patterns, but you will see echoes of this one.
Above all, spend time understanding the data itself. DE isn't just machinery. Semantics matter. What does the data "know"? What can it never know? Why? What mistakes do we make at the modeling step (often skipped) that result in major data shortcomings, or even permanently corrupt data that can never be salvaged? Talk to people building dashboards. Talk to data end-users. Build some reports end-to-end: understand the business problem, define the metrics, collect the data, compute the metrics, present them. The ultimate teacher.
(who am I? no one, just been doing data engineering since 2005, long before it was even a term in the industry)
by khaledh on 12/27/22, 2:13 PM
Without tools to help you repair data, sooner or later you'll run into a problem that will take you days (even weeks) to fix, and your stakeholders will be breathing down your neck.
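One pattern that makes repairs tractable is writing every load as an idempotent delete-and-reload of a partition, so re-running it is always safe. A minimal sketch with sqlite3 (table and data invented):

    import sqlite3

    def repair_day(con, day, fresh_rows):
        # One transaction: the delete and insert commit together or not at all.
        with con:
            con.execute("DELETE FROM facts WHERE day = ?", (day,))
            con.executemany("INSERT INTO facts (day, value) VALUES (?, ?)", fresh_rows)

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE facts (day TEXT, value REAL)")
    rows = [("2022-12-01", 1.5), ("2022-12-01", 2.5)]
    repair_day(con, "2022-12-01", rows)
    repair_day(con, "2022-12-01", rows)  # safe to re-run: still 2 rows, not 4
    print(con.execute("SELECT COUNT(*) FROM facts").fetchone())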
by panda888888 on 12/27/22, 4:51 PM
1) Data modeling: Fully read Kimball's book, The Data Warehouse Toolkit, 3rd edition. It's outdated and boring but it's an excellent foundation, so don't skip it (a toy star schema sketch follows this list).
2) Data modeling: After #1 above, spend at least 30 minutes learning about Inmon's approach, the ER model, 3NF, etc. You can find some decent YouTube videos. You don't need to learn this deeply, but you should understand how it's different from Kimball's approach.
3) Data warehousing & data lakes: Read the academic paper titled "Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics." For an academic paper, it's surprisingly approachable.
4) If you don't know this already, do 15 minutes of googling to understand ETL vs ELT and the pros and cons of each approach (a small runnable contrast also follows this list).
5) Putting it all together: Once you've done the things above, then read Kleppmann's book, Designing Data-Intensive Applications.
6) Focus on the future: Spend a little time learning about streaming architectures, including batch/microbatch/streaming and pros and cons for each. You may also want to learn about some of the popular tools, but you can easily google those.
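As promised above, a toy Kimball-style star schema: one fact table at a declared grain, joined to descriptive dimensions. All names and data are invented; sqlite3 keeps it runnable anywhere:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Dimensions: descriptive context, one row per thing
        CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
        CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
        -- Fact: measurements at one grain (here: one row per sale)
        CREATE TABLE fact_sales (
            date_key    INTEGER REFERENCES dim_date(date_key),
            product_key INTEGER REFERENCES dim_product(product_key),
            amount      REAL
        );
        INSERT INTO dim_date    VALUES (20221201, '2022-12-01', '2022-12');
        INSERT INTO dim_product VALUES (1, 'widget', 'hardware');
        INSERT INTO fact_sales  VALUES (20221201, 1, 9.99);
    """)

    # The typical analytic query shape: join facts to dimensions, group by attributes.
    print(con.execute("""
        SELECT d.month, p.category, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_date d    ON d.date_key = f.date_key
        JOIN dim_product p ON p.product_key = f.product_key
        GROUP BY d.month, p.category
    """).fetchall())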
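And the ETL-vs-ELT contrast from #4 as a runnable toy (schema and data invented). The ordering is the whole point: ETL shapes data before it reaches the warehouse; ELT lands it raw and transforms inside the warehouse, keeping the raw copy around for reprocessing:

    import sqlite3

    raw = [("2022-12-01", "9.99"), ("2022-12-02", "bad")]  # amounts arrive as text
    con = sqlite3.connect(":memory:")

    # ETL: clean in Python *before* loading; bad rows never reach the warehouse.
    con.execute("CREATE TABLE sales_etl (day TEXT, amount REAL)")
    cleaned = [(d, float(a)) for d, a in raw if a.replace(".", "", 1).isdigit()]
    con.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)

    # ELT: land everything raw, then transform with the database's own SQL.
    con.execute("CREATE TABLE sales_raw (day TEXT, amount TEXT)")
    con.executemany("INSERT INTO sales_raw VALUES (?, ?)", raw)
    con.execute("""
        CREATE TABLE sales_elt AS
        SELECT day, CAST(amount AS REAL) AS amount
        FROM sales_raw
        WHERE amount GLOB '[0-9]*'
    """)

    print(con.execute("SELECT * FROM sales_etl").fetchall())
    print(con.execute("SELECT * FROM sales_elt").fetchall())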
by swyx on 12/27/22, 11:46 AM
take your pick from https://airbyte.com/blog/becoming-a-data-engineer-2023 :
- Language: SQL and Python Continue to Dominate, With Rust Looking Promising
- Abstraction: Custom ETL Is Replaced by Data Connectors
- Latency: Better Tools Available In The Streaming Domain
- Architecture: The Lakehouse and the Semantic Layer Combine Strengths
- Trust: Data Quality and Observability Become an Essential
- Usability: The New Focus Is on Data Products and Data Contracts
- Openness: When It Comes to Data Tools, Open Source Is the Answer
- Standards: Database Open Standard Simplified by DuckDB (see the sketch below)
- Roles: Data Practitioners Expand Their Knowledge into Data Engineering
- Collaboration: DataOps Reduces Data Silos and Improves Teamwork
- Adoption: Data Engineering Is Key Regardless of Industry or Business Size
- Foundations: The Data Engineering Lifecycle is Evergreen
(not my article, but I work at Airbyte; this was the result of our 2023 trends review)
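For the DuckDB bullet, the pitch fits in a few lines: no server, no load step, query files directly by path. A minimal sketch (assumes a recent `pip install duckdb`; the file name is arbitrary):

    import duckdb

    # Write a tiny Parquet file, then query it directly by path.
    duckdb.sql("COPY (SELECT 42 AS answer, 'hi' AS note) TO 'tiny.parquet' (FORMAT PARQUET)")
    print(duckdb.sql("SELECT * FROM 'tiny.parquet'").fetchall())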
by skrtskrt on 12/27/22, 12:57 PM
You say all workflow engines are the same, but even just reading the Pachyderm docs will give you an idea of modern data engineering best practices - data versioning and data lineage, incremental computation, etc.
Temporal also has a very cool, modern approach (distributed, robust, event-driven) to generalized workflow management (not big-data specific). If you're used to stuff like Airflow, Temporal is a jump 10 years forward.
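The incremental-computation idea those tools formalize fits in a few lines of tool-agnostic Python: persist a watermark, process only what's newer, advance the watermark. Everything here is an invented stand-in:

    records = [{"id": 1, "ts": 100}, {"id": 2, "ts": 200}, {"id": 3, "ts": 300}]

    def run_increment(records, watermark):
        new = [r for r in records if r["ts"] > watermark]
        for r in new:
            print("processing", r["id"])  # stand-in for the real transform
        return max((r["ts"] for r in new), default=watermark)

    watermark = 0  # in a real system this lives in durable state
    watermark = run_increment(records, watermark)  # processes 1, 2, 3
    watermark = run_increment(records, watermark)  # no new data: processes nothing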
by rodrigodlu on 12/27/22, 11:25 AM
I'm just doing Python/Spark/AWS related tools. Most of my time is spent trying to break the bureaucratic ice between multiple layers, from insane devops requirements to missing documentation on several "hidden requirements" (the ones you find during the development process). So it's not much different from a lead role.
It's definitely different from a pure dev experience, because you're expected to lead changes inside the organization to make the pipelines work consistently. Without that part (i.e., if it were just plumbing), you could rely on "regular" backend devs.
This is obviously a high level data engineering perspective, not a low level like DB hacking or hyper optimizing existing pipelines and data transformations.
by Dowwie on 12/27/22, 11:58 AM
I manage Elixir systems for data-engineering-related work, so I can attest to its use in this domain within a production environment. I have also used Rust for comparable data-engineering systems.
Elixir is a phenomenal language for data engineering. I've written message processing pipelines in Rust and didn't get anywhere near the level of design consideration that Broadway / GenStage have. Some day there may be robust open source offerings in Rust as there are in Elixir, but the ecosystem hasn't reached that state yet. Rust's async ecosystem is also in a minimum-viable-product condition, lacking the sweet structured concurrency that Erlang/OTP solved long ago and that Elixir benefits from. Data pipeline processing in Elixir, utilizing all available cores exposed to the runtime, is straightforward and manageable. Telemetry patterns have been standardized across the Elixir ecosystem. There are background worker libraries like Oban that help with audit trails/transparency. Smart, helpful developer communities.
Elixir is not going to beat Rust on performance. CPU-bound work is going to take orders of magnitude longer to complete in Elixir than in Rust. You could extend Elixir with Rust in CPU-intensive situations using what are known as NIFs (native implemented functions), but you'll need to become familiar with the tradeoffs associated with using Rust NIFs.
Writing in Rust involves waiting for compilation. When developers wait for compilation, they switch to something else and lose focus. You can use incremental compilation for local development, which speeds things up. You also need a very modern workstation for development, preferably an M1 laptop or a 16-core Ryzen, with at least 32GB of RAM and an SSD. Elixir, however, has quick compile times, as it doesn't do anywhere near the level of work that the Rust compiler does. There is a tradeoff for that, though: Elixir is a dynamic language and consequently has all the problems dynamic languages have that are automatically solved by a strongly typed, compiled language such as Rust. You also discover problems at runtime in Elixir that would often be caught by the Rust compiler.
One final mention: Elixir Livebook. Elixir has thrown down the gauntlet with Livebook. Coupling ad hoc data-scientist workflows in notebooks with robust data engineering backends makes a lot of sense. There are high-performance Livebook dataframes backed by Rust. Elixir backed by Rust is beating Python backed by Rust on developer experience all the way down.
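For reference, the "Python backed by Rust" being compared against usually means Polars. A minimal example (assumes `pip install polars`; data invented):

    import polars as pl

    df = pl.DataFrame({"user": ["a", "a", "b"], "amount": [1.0, 2.0, 5.0]})
    # `group_by` on recent Polars; older releases spell it `groupby`.
    print(df.group_by("user").agg(pl.col("amount").sum()))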
by beckingz on 12/27/22, 2:33 PM
We're starting to see an unbundling of data engineering into analytics engineering (BI developer on steroids), ML engineering (AutoML is good enough that, if you can do good feature engineering and pipelines, the marginal value of adding a data scientist is not worth the cost), and data platform engineering (K8s. K8s everywhere).
by ed_elliott_asc on 12/27/22, 11:36 AM
Then think about the tools they use; in Azure it will likely be ADF, Databricks, maybe Synapse.
Languages: Python, SQL, Python, Python, SQL, some Scala, more Python, more SQL. Then get a general understanding of a few different languages (C, TypeScript, C#, Java, Kotlin).
I've never seen a data engineering role asking for Rust/OCaml/Nim. I'm not saying they don't exist, but I've not seen them, and I've rarely seen a data engineering role not asking for either Python or Scala.
by sammyd56 on 12/27/22, 11:50 AM
For short-term career growth, $YOUR_COMPANY's current preferred ETL tool will have the biggest ROI. Focus on design patterns: while APIs will come and go, the concepts, as you rightly say, are transferable.
If you're looking to land a new role: the market says dbt, Databricks, and Snowflake are pretty strong bets.
If it's personal interest, or a high-risk, high-reward long term play, take your pick from any of the new hotness!
by harlanji on 12/27/22, 11:34 AM
Might be an out-there take, but being able to develop a shoestring web app that can be maintained by a solopreneur might be a good skill. It should translate to rapid prototyping concepts for big corporate managers as well. I'd argue Python could eat PHP's old niche in these regards, because it's way easier to get rolling from scratch with Flask and pip than it was with PHP as recently as 2016.
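For scale, the Flask version of "getting rolling from scratch" really is this small (assumes `pip install flask`):

    from flask import Flask

    app = Flask(__name__)

    @app.route("/")
    def index():
        return "hello from a one-file app"

    if __name__ == "__main__":
        app.run(debug=True)  # dev server only; put gunicorn or similar in front in production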
by travoltaj on 12/27/22, 10:48 AM
Personally, I'm planning to learn about the internal implementation of databases, starting with the book Designing Data-Intensive Applications. This is so that I learn about the current ways data is stored.
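A taste of what that book covers: its storage chapter opens with a two-function key-value store, append-only writes and a scan-for-the-last-write read. Here's that idea in Python (real log-structured engines add indexes, segments, and compaction on top):

    def db_set(key, value, path="db.txt"):
        # Append-only: writes are sequential, hence fast.
        with open(path, "a") as f:
            f.write(f"{key},{value}\n")

    def db_get(key, path="db.txt"):
        # O(n) scan; the most recent write for a key wins.
        value = None
        with open(path) as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition(",")
                if k == key:
                    value = v
        return value

    db_set("42", "san francisco")
    db_set("42", "oakland")
    print(db_get("42"))  # oakland -- the latest value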
by PaulHoule on 12/27/22, 2:49 PM
(The exception to that is that a professional programmer is frequently a maintenance programmer or problem solver and it may well be you have to learn a bit of language X in a hurry to solve a specific problem in a system written in language X.)
by morelandjs on 12/27/22, 1:44 PM
If I were to switch functions and find some protected time, I'd go off into the woods and build an example deployment compatible with company infra that is as lightweight as possible. Then evangelize my team.