by thewhitetulip on 12/27/22, 9:39 AM with 88 comments
New languages like Rust/OCaml/Nim... if yes, then which?
I don't think learning an ETL tool will be helpful because essentially they are all one and the same.
Any tips?
by slotrans on 12/27/22, 6:20 PM
Completely irrelevant. DE is SQL, Python, sometimes Scala/Java.
Get really good at SQL. Learn relational fundamentals. Learn row- and column-store internals. Understand why databases are the way they are. Familiarize yourself with the differences between systems and why you need to care.
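To make "really good at SQL" concrete: window functions are one of the skills that separate basic from fluent SQL. A tiny sketch using Python's built-in sqlite3 (table and data are invented; needs SQLite 3.25+ for window functions):

    import sqlite3

    # Toy orders table, in memory.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (customer TEXT, region TEXT, amount REAL)")
    con.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [("alice", "east", 120.0), ("bob", "east", 80.0),
         ("carol", "west", 200.0), ("dave", "west", 50.0)],
    )

    # Total per customer plus their rank within their region, in one query:
    # aggregates feed the window function, which runs after GROUP BY.
    for row in con.execute("""
        SELECT customer,
               region,
               SUM(amount) AS total,
               RANK() OVER (PARTITION BY region ORDER BY SUM(amount) DESC) AS rnk
        FROM orders
        GROUP BY customer, region
    """):
        print(row)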
Get familiar with at least minimal Hadoop shit. Set up Spark from scratch on your machine, mess it up, do it again. Grapple with how absurdly complex it is.
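If you want a gentler first rung before wrestling a real cluster, PySpark's local mode runs the whole thing on one machine. A minimal sketch (assumes `pip install pyspark` and a local JVM; the file and columns are invented):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .master("local[*]")   # all local cores, no cluster required
             .appName("scratch")
             .getOrCreate())

    # Top 10 users by event count from a local CSV.
    df = spark.read.csv("events.csv", header=True, inferSchema=True)
    (df.groupBy("user_id")
       .agg(F.count("*").alias("events"))
       .orderBy(F.desc("events"))
       .show(10))

    spark.stop()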
Understand system-level architecture. Analytics systems tend to organize into ingest -> transform -> serving (BI tools etc)... why? There are good reasons! Non-analytics data systems have different patterns, but you will see echoes of this one.
Above all, spend time understanding the data itself. DE isn't just machinery. Semantics matter. What does the data "know"? What can it never know? Why? What mistakes do we make at the modeling step (often skipped) that result in major data shortcomings, or even permanently corrupt data that can never be salvaged? Talk to people building dashboards. Talk to data end-users. Build some reports end-to-end: understand the business problem, define the metrics, collect the data, compute the metrics, present them. The ultimate teacher.
(who am I? no one, just been doing data engineering since 2005, long before it was even a term in the industry)
by khaledh on 12/27/22, 2:13 PM
Without tools to help you repair data, sooner or later you'll run into a problem that will take you days (even weeks) to fix, and your stakeholders will be breathing down your neck.
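One pattern that makes repairs tractable is writing every load as an idempotent delete-and-reload of a partition, so re-running it is always safe. A minimal sketch with sqlite3 (table and data invented):

    import sqlite3

    def repair_day(con, day, fresh_rows):
        # One transaction: the delete and insert commit together or not at all.
        with con:
            con.execute("DELETE FROM facts WHERE day = ?", (day,))
            con.executemany("INSERT INTO facts (day, value) VALUES (?, ?)", fresh_rows)

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE facts (day TEXT, value REAL)")
    rows = [("2022-12-01", 1.5), ("2022-12-01", 2.5)]
    repair_day(con, "2022-12-01", rows)
    repair_day(con, "2022-12-01", rows)  # safe to re-run: still 2 rows, not 4
    print(con.execute("SELECT COUNT(*) FROM facts").fetchone())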
by panda888888 on 12/27/22, 4:51 PM
1) Data modeling: Fully read Kimball's book, The Data Warehouse Toolkit, 3rd edition. It's outdated and boring but it's an excellent foundation, so don't skip it (a toy star schema sketch follows this list).
2) Data modeling: After #1 above, spend at least 30 minutes learning about Inmon's approach, the ER model, 3NF, etc. You can find some decent YouTube videos. You don't need to learn this deeply, but you should understand how it's different from Kimball's approach.
3) Data warehousing & data lakes: Read the academic paper titled "Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics." For an academic paper, it's surprisingly approachable.
4) If you don't know this already, do 15 minutes of googling to understand ETL vs ELT and the pros and cons of each approach (a small runnable contrast also follows this list).
5) Putting it all together: Once you've done the things above, then read Kleppmann's book, Designing Data-Intensive Applications.
6) Focus on the future: Spend a little time learning about streaming architectures, including batch/microbatch/streaming and pros and cons for each. You may also want to learn about some of the popular tools, but you can easily google those.
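As promised above, a toy Kimball-style star schema: one fact table at a declared grain, joined to descriptive dimensions. All names and data are invented; sqlite3 keeps it runnable anywhere:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Dimensions: descriptive context, one row per thing
        CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
        CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
        -- Fact: measurements at one grain (here: one row per sale)
        CREATE TABLE fact_sales (
            date_key    INTEGER REFERENCES dim_date(date_key),
            product_key INTEGER REFERENCES dim_product(product_key),
            amount      REAL
        );
        INSERT INTO dim_date    VALUES (20221201, '2022-12-01', '2022-12');
        INSERT INTO dim_product VALUES (1, 'widget', 'hardware');
        INSERT INTO fact_sales  VALUES (20221201, 1, 9.99);
    """)

    # The typical analytic query shape: join facts to dimensions, group by attributes.
    print(con.execute("""
        SELECT d.month, p.category, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_date d    ON d.date_key = f.date_key
        JOIN dim_product p ON p.product_key = f.product_key
        GROUP BY d.month, p.category
    """).fetchall())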
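And the ETL-vs-ELT contrast from #4 as a runnable toy (schema and data invented). The ordering is the whole point: ETL shapes data before it reaches the warehouse; ELT lands it raw and transforms inside the warehouse, keeping the raw copy around for reprocessing:

    import sqlite3

    raw = [("2022-12-01", "9.99"), ("2022-12-02", "bad")]  # amounts arrive as text
    con = sqlite3.connect(":memory:")

    # ETL: clean in Python *before* loading; bad rows never reach the warehouse.
    con.execute("CREATE TABLE sales_etl (day TEXT, amount REAL)")
    cleaned = [(d, float(a)) for d, a in raw if a.replace(".", "", 1).isdigit()]
    con.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)

    # ELT: land everything raw, then transform with the database's own SQL.
    con.execute("CREATE TABLE sales_raw (day TEXT, amount TEXT)")
    con.executemany("INSERT INTO sales_raw VALUES (?, ?)", raw)
    con.execute("""
        CREATE TABLE sales_elt AS
        SELECT day, CAST(amount AS REAL) AS amount
        FROM sales_raw
        WHERE amount GLOB '[0-9]*'
    """)

    print(con.execute("SELECT * FROM sales_etl").fetchall())
    print(con.execute("SELECT * FROM sales_elt").fetchall())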
by swyx on 12/27/22, 11:46 AM
take your pick from https://airbyte.com/blog/becoming-a-data-engineer-2023 :
- Language: SQL and Python Continue to Dominate, With Rust Looking Promising
- Abstraction: Custom ETL Is Replaced by Data Connectors
- Latency: Better Tools Available In The Streaming Domain
- Architecture: The Lakehouse and the Semantic Layer Combine Strengths
- Trust: Data Quality and Observability Become an Essential
- Usability: The New Focus Is on Data Products and Data Contracts
- Openness: When It Comes to Data Tools, Open Source Is the Answer
- Standards: Database Open Standard Simplified by DuckDB (see the sketch below)
- Roles: Data Practitioners Expand Their Knowledge into Data Engineering
- Collaboration: DataOps Reduces Data Silos and Improves Teamwork
- Adoption: Data Engineering Is Key Regardless of Industry or Business Size
- Foundations: The Data Engineering Lifecycle is Evergreen
(not my article, but I work at Airbyte; this was the result of our 2023 trends review)
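For the DuckDB bullet, the pitch fits in a few lines: no server, no load step, query files directly by path. A minimal sketch (assumes a recent `pip install duckdb`; the file name is arbitrary):

    import duckdb

    # Write a tiny Parquet file, then query it directly by path.
    duckdb.sql("COPY (SELECT 42 AS answer, 'hi' AS note) TO 'tiny.parquet' (FORMAT PARQUET)")
    print(duckdb.sql("SELECT * FROM 'tiny.parquet'").fetchall())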
by skrtskrt on 12/27/22, 12:57 PM
You say all workflow engines are the same, but even just reading the Pachyderm docs will give you an idea of modern data engineering best practices - data versioning and data lineage, incremental computation, etc.
Temporal also has a very cool, modern approach (distributed, robust, event-driven) to generalized workflow management (not big-data specific). If you're used to stuff like Airflow, Temporal is a jump 10 years forward.
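The incremental-computation idea those tools formalize fits in a few lines of tool-agnostic Python: persist a watermark, process only what's newer, advance the watermark. Everything here is an invented stand-in:

    records = [{"id": 1, "ts": 100}, {"id": 2, "ts": 200}, {"id": 3, "ts": 300}]

    def run_increment(records, watermark):
        new = [r for r in records if r["ts"] > watermark]
        for r in new:
            print("processing", r["id"])  # stand-in for the real transform
        return max((r["ts"] for r in new), default=watermark)

    watermark = 0  # in a real system this lives in durable state
    watermark = run_increment(records, watermark)  # processes 1, 2, 3
    watermark = run_increment(records, watermark)  # no new data: processes nothing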
by rodrigodlu on 12/27/22, 11:25 AM
I'm just doing Python/Spark/AWS related tools. Most of my time is spent trying to break the bureaucratic ice between multiple layers, from insane devops requirements to missing documentation on several "hidden requirements" (the ones you find during the development process). So it's not much different from a lead role.
It's definitely different from a pure dev experience, because you're expected to lead changes inside the organization to make the pipelines work consistently. Without that part (i.e., if it were just plumbing), you could rely on "regular" backend devs.
This is obviously a high level data engineering perspective, not a low level like DB hacking or hyper optimizing existing pipelines and data transformations.
by Dowwie on 12/27/22, 11:58 AM
I manage Elixir systems for data-engineering-related work, so I can attest to its use in this domain within a production environment. I have also used Rust for comparable data-engineering systems.
Elixir is a phenomenal language for data engineering. I've written message processing pipelines in Rust and didn't get anywhere near the level of design consideration that Broadway / GenStage have. Some day there may be robust open source offerings in Rust as there are in Elixir, but the ecosystem hasn't reached that state yet. Rust's async ecosystem is also in a minimum-viable-product condition, lacking the sweet structured concurrency that Erlang/OTP solved long ago and that Elixir benefits from. Data pipeline processing in Elixir, utilizing all available cores exposed to the runtime, is straightforward and manageable. Telemetry patterns have been standardized across the Elixir ecosystem. There are background worker libraries like Oban that help with audit trails/transparency. Smart, helpful developer communities.
Elixir is not going to beat Rust on performance. CPU-bound work is going to take orders of magnitude longer to complete in Elixir than in Rust. You could extend Elixir with Rust in CPU-intensive situations using what are known as NIFs (native implemented functions), but you'll need to become familiar with the tradeoffs associated with using Rust NIFs.
Writing in Rust involves waiting for compilation. When developers wait for compilation, they switch to something else and lose focus. You can use incremental compilation for local development, which speeds things up. You also need a very modern workstation for development, preferably an M1 laptop or a 16-core Ryzen, with at least 32GB of RAM and an SSD. Elixir, however, has quick compile times, as it doesn't do anywhere near the level of work that the Rust compiler does. There is a tradeoff for that, though: Elixir is a dynamic language and consequently has all the problems dynamic languages have that are automatically solved by a strongly typed, compiled language such as Rust. You also discover problems at runtime in Elixir that would often be caught by the Rust compiler.
One final mention: Elixir Livebook. Elixir has thrown down the gauntlet with Livebook. Coupling ad hoc data-scientist workflows in notebooks with robust data engineering backends makes a lot of sense. There are high-performance Livebook dataframes backed by Rust. Elixir backed by Rust is beating Python backed by Rust on developer experience all the way down.
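For reference, the "Python backed by Rust" being compared against usually means Polars. A minimal example (assumes `pip install polars`; data invented):

    import polars as pl

    df = pl.DataFrame({"user": ["a", "a", "b"], "amount": [1.0, 2.0, 5.0]})
    # `group_by` on recent Polars; older releases spell it `groupby`.
    print(df.group_by("user").agg(pl.col("amount").sum()))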
by beckingz on 12/27/22, 2:33 PM
We're starting to see an unbundling of data engineering into analytics engineering (BI developer on steroids), ML engineering (AutoML is good enough that, if you can do good feature engineering and pipelines, the marginal value of adding a data scientist is not worth the cost), and data platform engineering (K8s. K8s everywhere).
by ed_elliott_asc on 12/27/22, 11:36 AM
Then think about the tools they use; in Azure it will likely be ADF, Databricks, maybe Synapse.
Languages: Python, SQL, Python, Python, SQL, some Scala, more Python, more SQL. Then get a general understanding of a few different languages (C, TypeScript, C#, Java, Kotlin).
I've never seen a data engineering role asking for Rust/OCaml/Nim. I'm not saying they don't exist, but I've not seen them, and I've rarely seen a data engineering role not asking for either Python or Scala.
by sammyd56 on 12/27/22, 11:50 AM
For short-term career growth, $YOUR_COMPANY's current preferred ETL tool will have the biggest ROI. Focus on design patterns: while APIs will come and go, the concepts, as you rightly say, are transferable.
If you're looking to land a new role: the market says dbt, Databricks, and Snowflake are pretty strong bets.
If it's personal interest, or a high-risk, high-reward long term play, take your pick from any of the new hotness!
by harlanji on 12/27/22, 11:34 AM
Might be an out-there take, but being able to develop a shoestring web app that can be maintained by a solopreneur might be a good skill. It should translate to rapid prototyping concepts for big corporate managers as well. I'd argue Python could eat PHP's old niche in these regards, because it's way easier to get rolling from scratch with Flask and pip than it was with PHP as recently as 2016.
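For scale, the Flask version of "getting rolling from scratch" really is this small (assumes `pip install flask`):

    from flask import Flask

    app = Flask(__name__)

    @app.route("/")
    def index():
        return "hello from a one-file app"

    if __name__ == "__main__":
        app.run(debug=True)  # dev server only; put gunicorn or similar in front in production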
by travoltaj on 12/27/22, 10:48 AM
Personally, I'm planning to learn about the internal implementation of databases, starting with the book Designing Data-Intensive Applications. This is so that I learn about the current ways data is stored.
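A taste of what that book covers: its storage chapter opens with a two-function key-value store, append-only writes and a scan-for-the-last-write read. Here's that idea in Python (real log-structured engines add indexes, segments, and compaction on top):

    def db_set(key, value, path="db.txt"):
        # Append-only: writes are sequential, hence fast.
        with open(path, "a") as f:
            f.write(f"{key},{value}\n")

    def db_get(key, path="db.txt"):
        # O(n) scan; the most recent write for a key wins.
        value = None
        with open(path) as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition(",")
                if k == key:
                    value = v
        return value

    db_set("42", "san francisco")
    db_set("42", "oakland")
    print(db_get("42"))  # oakland -- the latest value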
by PaulHoule on 12/27/22, 2:49 PM
(The exception to that is that a professional programmer is frequently a maintenance programmer or problem solver and it may well be you have to learn a bit of language X in a hurry to solve a specific problem in a system written in language X.)
by morelandjs on 12/27/22, 1:44 PM
If I were to switch functions and find some protected time, I'd go off into the woods and build an example deployment compatible with company infra that is as lightweight as possible. Then evangelize my team.