The Modern Data Stack™ (MDS) is a new-ish set of tools that data teams are using to collect, transform, explore, and make use of their company’s data.
Although the data team as we know it is a relatively modern concept, there’s a rich history of data-related software and stacks
Older data stacks were run mostly on open source data stores, hosted on premises, and required tons of configuration and maintenance
The modern data stack focuses on speed and simplicity: serverless cloud data warehouses, easy cloud-based SaaS tools for modeling, and native integrations with other tools
The MDS has become a bit of a meme in the data community for growing into an overused marketing gimmick
In this post we’ll cut through the marketing noise and focus on what’s actually important about the MDS, how it helps data teams be more effective, and what specific tools people are using to do that.
Data teams exist, more or less, to build knowledge at your company. It’s their job to figure out what’s going on with the business, what might happen next, and how that information can help teams like Product, Marketing, and Sales make more money and such. So when we talk about a data stack, it just means what tools these teams use to get their jobs done.
There are a million ways to cut the data stack, but generally it will fit into a few categories:
Something to pull in data from where it’s generated
A place to store your data
Something to transform your data with
Something to visualize and analyze your data with
Yes, it turns out there’s a lot of logistics involved with “simple questions” like can you pull last month’s revenue for me?
🔍 Deeper Look 🔍
For a deeper dive into specific tools that data teams use for their data stacks, check out our “What Your Data Team Is Using” post here.
🔍 Deeper Look 🔍
Let’s break these down a bit more.
Problem number 1 when you want to analyze data is that the data is nowhere to be found. Typically, analytics questions like “how much revenue did we do this month?” require integrating data from multiple sources – perhaps in our case Stripe and our production database with user information. More generally, important data gets generated in a ton of tools your company might be using:
…you get the idea
When you use these tools to categorize a lead or answer a support question, that’s generating data that sits in a database owned by the SaaS provider. And to analyze that data alongside other data, you need to get it out of there and into your data warehouse. You’ve also probably got data that your company’s application is generating – like who your users are and what they’re doing – that you’ll want in there too.
You need somewhere to put all of that nice, useful data we’ve been talking about. The simple answer here is a database: but what kind of database? Generally there are two categories:
OLAP databases: made for analytics, long and multidimensional queries
You don’t want to use the same type of database for your application and you do for your analytics, because your queries for each are so different. Special purpose analytics databases help data teams run their queries faster and more efficiently, without having to worry about impacting whatever application their company sells.
You’re probably familiar with the “data warehouse” – a type of OLAP database that’s becoming increasingly standard. But let’s not get ahead of ourselves.
The data you’re pulling in from your systems is rarely (read: never) in the exact format you want it to be in. From the Technically post on ETL:
Here’s what a typical transform workflow might look like:
Remove dollar signs from the “order_value” column (clean) →
Fill in missing values in the “username” column with “anonymous” (clean) →
Aggregate orders on a monthly basis (transform) →
Join order data to user data to get shipping location (transform)
Typical ETL workflows can have tens or even hundreds of these kinds of transformations per dataset.
You can transform your data with SQL, Python, or any other language – no matter what you use though, you’ll need some sort of system and server for running this stuff, notifying you when it fails, debugging, etc.
Once your data is in the format and place it needs to be, it’s time to analyze it. There are a lot of different types of analysis:
Exploratory, open ended analysis
Ad hoc question answering (“can you cut the data this way for me?”)
Dashboards and regular reporting
Building ML models
Building data apps
Each requires, or at least benefits from, some degree of specialized tooling. Data Scientists need somewhere to write one off SQL queries, but also a place to build and maintain charts, which might be a separate place from where they share experiment results, and so on and so forth.
The sum of all tools required for the above (plus maybe a few others) makes up the data stack.
To talk about a modern data stack implies that there’s an old data stack. Generally, the steps were the same but the tools were different. A few highlights (or lowlights, depending on if you were around):
Oracle – for storing and moving data around
Teradata – an on-premise data warehouse
Hadoop – a set of tech for storying and querying large datasets
Trifacta – ETL tool1
Informatica – ETL tool
Outside of Oracle, you probably haven’t heard of any of these tools. And if you have, you are either an ancient data scientist, a disgruntled former stakeholder of an ancient data team, or worse, an equity research analyst.
🖇 Workplace Example 🖇
If you want to understand the kinds of companies we’re dealing with here…Teradata has an entire page on their website dedicated to headlines about their 40 year history. Here’s a particularly thoughtful, practical one from 2019: “Teradata celebrates 40 years of delivering technology innovation.” Riveting.
🖇 Workplace Example 🖇
Anyway there are a few themes that tie these tools together:
On premise: the tools are software that you set up and run in your own data center, as opposed to accessing over the internet
Open source (ish): Hadoop is open source, and was one of the big building blocks of data from 2006-2015 (RIP)
Highly complex: getting these tools set up and configured was a real hassle
These themes touch on another important theme: real analytics was mostly reserved for larger enterprises with significant resources to put towards data. Think banks, insurance companies, etc. This was also mostly before the cloud really existed…which brings us to the modern data stack.
The modern data stack basically just applied cloud philosophy to the data stack. Instead of large, highly configurable, on premise software, companies started using cloud-based, easy to get started with, more opinionated software. Tools in the modern data stack are:
Cloud first – your data sits on someone else’s servers in the cloud; no need to manage your own, deal with upgrades, etc.
Simple – products are designed to get started with quickly and require minimal configuration; you should be able to get something working in a single sitting
It’s worth noting that the old data stack didn’t suck because anyone wanted it to: technology has just progressed, a lot.
So what does all of this mean in practice? Let’s run through our 4 pieces of the data stack again:
Fivetran is the classic here, with open source competitor Airbyte up and coming. With Fivetran you just connect your source (let’s say Stripe) and your destination (let’s say your data warehouse), and then set up your syncs. You can configure how often you want your data to get moved, what you want to keep and throw out, and even configure alerts for when things go wrong.
Fivetran is entirely cloud based, which means you don’t need to set up or manage servers to do this data moving. Airbyte can be deployed on premises, but also has a cloud option you can pay for.
Gone are the days of Oracle and Teradata: in the modern data stack, teams use cloud based data warehouses. Quoting from the Technically post about data warehouse details:
Snowflake probably has the lead in terms of enterprise adoption (according to my very scientific conjecture). Here’s what the basic UI looks like:
Snowflake highlights (i.e. what make it unique) include:
Support for diverse file types (Parquet, Avro, other random names)
High quality permissioning and data sharing capabilities
Credits-based pricing model, charging for query run time
Getting started with Snowflake is literally as easy as signing up on their website – no infrastructure required. You also get a nice query UI so you can write SQL in your browser. Everything is designed to be powerful, but also simple to get started with.
dbt, that’s it.
Speaking of which, if any of you are trying to write more, I gave a talk at their conference recently about how to write better blog posts.
There’s so much to cover here, and a lot of legacy baggage from old BI tools. Suffice it to say that there are a million ways to analyze and chart your data. My personal favorite is Hex – like other tools in the modern data stack, it’s cloud based and incredibly easy to get started with.
You get a UI for querying and writing Python, collaboration via comments and multiplayer, beautiful charts, and more.
So to sum it all up: the modern data stack is a new wave of tools that moves analytics to the cloud and focuses on simple, intuitive interfaces.
But the line between new and old is blurry, and many companies in the so-called modern stack have been way overusing the term: just search “modern data stack” and you’ll see a truly remarkable amount of spam content and countless Google ads. Stay vigilant, my friends. Until next time.
To end, a message from a friend of mine. He helps people find jobs with 4 day work weeks, so you can focus on the stuff you like (for me, cocktails and music):
4 day work week is a job board for jobs with a better work-life balance. They’re constantly listing remote jobs with a 4 day work week (~32hrs/week) so you can find that elusive work life balance you’re looking for. They have jobs across the spectrum, from PM to Data Scientist to Customer Success Manager. Check it out here, and tell them Technically sent you!
Thank you to Technically reader Ashwin for pointing out that historically, Trifacta specifically marketed themselves as circumventing legacy ETL tools. Today, I think ETL has a much broader, looser meaning and it’s fair to include Trifacta’s main value prop – “data prep” – under that umbrella.