We will start the book with an introduction to the topics we'll explore in detail in the following two chapters, beginning with the history and current state of data engineering.
After that, I'll share my personal introduction to data engineering for context: a quick look back at 2003, before data engineering was a thing, and how I got into the field, before jumping to today, when data engineering was the hottest job, at least until the generative AI hype.
In the ever-evolving world of technology, the term data engineering has only recently become part of our lexicon. However, its roots are deep, tracing back to concepts and practices predating the term. When I embarked on my journey in the realm of computer science back in 2003, roles like Business Intelligence, Data Warehouse Developer, and ETL Developer dominated the scene. It was a world yet to discover the complexities and possibilities of what we now call data engineering.
The formal advent of data engineering as a distinct field can be pinpointed to the late 2010s, marked by a significant shift in how we approach data. This transformation was not sudden but a result of a gradual evolution from foundational concepts like SQL and data warehousing, established by pioneers like Edgar F. Codd and Bill Inmon in the 70s and 80s, to more complex systems and architectures.
The journey from traditional data warehousing to the big data era was fueled by innovations in massively parallel processing databases and cloud computing. Technologies like MapReduce and Hadoop, and cloud services like AWS, redefined the boundaries of data processing and storage, ushering in an era of unprecedented data scalability and accessibility.
As we progressed, the role of a data engineer evolved, morphing from managing traditional business intelligence solutions to navigating the complexities of the modern data stack. This evolution marked a shift from the era of big data to a more nuanced understanding of data engineering, as highlighted in Maxime Beauchemin's "The Rise of the Data Engineer."
We'll delve into how each development, from the inception of SQL to the modern cloud-based ecosystems, has shaped the dynamic, multifaceted world of data engineering as we know it today.
In the rapidly evolving realm of data engineering, understanding the present is as crucial as knowing the past. This section delves into the latest trends and transformations defining the field today. We explore how the landscape of data engineering has shifted, embracing new methodologies and technologies reshaping our data approach.
The years 2022 and 2023 have been pivotal in steering the direction of data engineering, so let's take a glimpse at them. The field has seen significant advancements, from the adoption of declarative approaches in various domains to emerging metadata trends. The Rust programming language has made notable strides, challenging traditional frameworks with its robust performance and type safety. Amidst these developments, the rise of AI and vector databases, alongside the increasing importance of privacy and governance, underscores the evolving nature of data engineering.
Navigating the modern data engineering landscape is critical. As we delve deeper, we encounter the Modern Data Stack (MDS) and its impact on data modeling, especially in complex enterprise environments. The dynamic landscape of the MDS and the growing focus on open standards and defragmented data stacks reveal the ongoing innovation and challenges in the field.
Next, in the world of data engineering, understanding and overcoming challenges is critical to unlocking the full potential of our data-driven solutions. This section delves into the daily hurdles data engineers face and helps us set the stage for patterns in this book.
The data engineering lifecycle is at the heart of these challenges, a journey from data collection to actionable insights. Each stage presents its unique set of obstacles, from ensuring data quality and consistency to deriving meaningful insights and making informed decisions based on the data. We'll explore these stages and the common pain points encountered, offering insights into how to address them efficiently.
Another aspect we examine is the Pyramid of Work Product, which maps the function and outcome of our work in data engineering. This framework helps us understand how specific tasks, from setting up data infrastructure to making data accessible, contribute to the overarching goal of generating insights. We'll delve into how to optimize our approach to these tasks, balancing the use of tools and resources for the best outcomes.
Finally, we will dissect the Data Engineering Lifecycle, addressing the end-to-end flow of a data engineering project. This comprehensive view helps us identify and address challenges across core dimensions, including generation, storage, ingestion, transformation, and serving. By exploring these challenges in detail, we aim to equip data engineers with the knowledge and strategies to navigate this complex landscape successfully.
This section contrasts the general history of data engineering with the history I lived through myself, and serves as my personal introduction.
Data engineering as a term has only existed for a short time; in 2003, when I started my career, it didn't exist at all. Back then, the field was called business intelligence. Was it all that different? Yes and no.
Back then, the focus was on the business logic and bringing value to the user; the how was less sophisticated. Usually, you chose one of the dominant vendors, Oracle, SAP, or Microsoft, and built all the business logic in whatever tool was available.
We added triggers and date columns to the source databases to extract the data daily into our data warehouse. We had an ODS (Operational Data Store) for a fast, daily-accurate representation, plus a core layer and data marts. We used materialized views to crunch huge SQL statements containing massive business logic and persist the results into quickly retrievable tables.
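A materialized view of that era might have looked something like the following: a minimal sketch in Oracle-style SQL, with hypothetical table and column names, that precomputes heavy business logic into a table refreshed on a nightly schedule.

```sql
-- Hypothetical example: precompute an expensive aggregation once per night
-- so reports read from a plain, fast table instead of re-running the join.
-- (Table names orders/customers are illustrative, not from a real schema.)
CREATE MATERIALIZED VIEW mv_daily_revenue
  BUILD IMMEDIATE
  REFRESH COMPLETE
  START WITH SYSDATE NEXT SYSDATE + 1   -- refresh once a day
AS
SELECT o.order_date,
       c.region,
       SUM(o.amount) AS total_revenue
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id
GROUP  BY o.order_date, c.region;
```

Reports and BI tools then queried `mv_daily_revenue` directly, paying the cost of the big join only once per refresh rather than on every query.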
We had OBIEE, SAP BO, and other BI tools to visualize the data. Dashboards didn't exist yet, but plenty of legal and pixel-perfect reports were sent via email! Sometimes it felt like playing Lego to fit all your data into a readable PDF 😉.
We were doing lots of automation with bash scripts, PL/SQL, T-SQL, and everything else available back then, mostly procedural. In a way, these were the early days of the orchestrators we know today, sequentially starting the right tasks. We even prefixed stored procedures with i<something>, as we thought they were so advanced that they had particular intelligence.
At that time, I was getting used to terms like Operational Data Store, views, data marts, and many more; frankly, I was learning the world of data warehouses. We debated how to access and extract the data: Can't we access the source directly? Only once? Can we export it into Excel? Business people asked if we could pull the data quickly for them, and so on.
Fast-forward 20 years to today, and everyone talks about data engineering.
But I increasingly notice that most of the challenges, tools, and strategies do the same thing we did 20 years ago. Sometimes they have different names, more sophisticated technology under the hood, or an open-source license. No matter what, things seem repetitive.
But I also notice that we are drifting away from the fundamentals. Back then, we spent all our time on data modeling or thinking about the data architecture, whereas today, it often seems to be only about stacking as many tools as possible.
Sure, times are different, but the fundamentals and their patterns are more important than ever.
With all this out of the way, let's get into the history of data engineering and its current state in the next two chapters. These lay the groundwork for exploring convergent evolution, the foundation of this book, before we dive into its patterns and design patterns.