Welcome to the New Database Era
The new category of cloud database services emerging
— Originally posted on Techcrunch —
One of the most profound, and maybe non-obvious, shifts driving this is the emergence of the cloud database. Services such as Amazon S3, Google BigQuery, Snowflake, Databricks, and others have solved computing on large volumes of data and have made it easy to store data from every available source. Enterprises want to store everything they can in the hope of delivering improved customer experiences and new market capabilities.
It has been a good time to be a database company.
Database companies have raised more than $8.7B over the last 10 years, with almost half of that, $4.1B, coming in the last 24 months, up from $849M in 2019 (according to CB Insights).
It’s not surprising, given the sky-high valuations of Snowflake and Databricks and the $16B in new revenue up for grabs in 2021 simply from market growth. A market that doubled over the last four years to almost $90B is expected to double again over the next four. Safe to say, there is a huge opportunity to go after.
See here for a solid list of database financings in 2021.
20 years ago, you had one option: a relational database.
Today, thanks to the cloud, microservices, distributed applications, global scale, real-time data, deep learning, and more, new database architectures have emerged to solve for hyper-specific performance requirements: different systems for fast reads and fast writes; systems built specifically for ad-hoc analytics, or for data that is unstructured, semi-structured, transactional, relational, graph, or time-series; and systems for data used for caching, search, indexes, events, and more.
Each came with different performance and operational needs, including high availability, horizontal scale, distributed consistency, failover protection, partition tolerance, and serverless or fully managed operation.
As a result, enterprises store data across seven or more different databases on average (e.g., Snowflake as your data warehouse, Clickhouse for ad-hoc analytics, Timescale for time-series data, Elastic for search, S3 for logs, Postgres for transactions, Redis for caching or application data, Cassandra for complex workloads, and Dgraph for relationship data or dynamic schemas). And that’s all assuming you are colocated in a single cloud and have built a modern data stack from scratch.
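To make that fragmentation concrete, here is a minimal sketch of how an application layer ends up holding one client per specialized store. The hostnames, table, and key names are hypothetical, and the client choices are illustrative rather than prescriptive.

```python
# Hypothetical application code that already spans two of the stores above
# (Redis for caching, Postgres for transactions); in practice a search client,
# a time-series client, and a warehouse connection get bolted on the same way.
import json

import psycopg2
import redis

cache = redis.Redis(host="cache.internal", port=6379)
oltp = psycopg2.connect("dbname=shop user=app host=pg.internal")

def get_product(product_id: str) -> dict:
    # 1. Check the cache first.
    cached = cache.get(f"product:{product_id}")
    if cached is not None:
        return json.loads(cached)
    # 2. Fall back to the transactional store.
    with oltp.cursor() as cur:
        cur.execute(
            "SELECT id, name, price FROM products WHERE id = %s", (product_id,)
        )
        row = cur.fetchone()
    product = {"id": row[0], "name": row[1], "price": float(row[2])}
    # 3. Repopulate the cache for the next reader (5-minute TTL).
    cache.set(f"product:{product_id}", json.dumps(product), ex=300)
    return product
```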
The level of performance and guarantees from these services and platforms is unparalleled compared to 5–10 years ago. At the same time, the proliferation and fragmentation of the database layer are creating new challenges: syncing across the different schemas and systems, writing new ETL jobs to bridge workloads across multiple databases, constant cross-talk and connectivity issues, the overhead of managing active-active clustering across so many different systems, and data transfers when new clusters or systems come online, each with different scaling, branching, propagation, sharding, and resource requirements.
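As one small example of the glue work this creates, below is a hedged sketch of the kind of ETL job teams end up writing to bridge the transactional store and the analytics side. Table names, the bucket, and the watermark handling are hypothetical.

```python
# Copy rows written to the transactional store since the last run into object
# storage for the analytics side. Assumes AWS credentials are already configured.
import datetime
import json

import boto3
import psycopg2

def sync_orders_to_lake(since: datetime.datetime) -> None:
    pg = psycopg2.connect("dbname=shop user=etl host=pg.internal")
    s3 = boto3.client("s3")
    with pg.cursor() as cur:
        cur.execute(
            "SELECT id, customer_id, total, created_at FROM orders WHERE created_at > %s",
            (since,),
        )
        rows = cur.fetchall()
    # Serialize and ship to the lake; a real job would batch, partition by date,
    # and record a new watermark atomically.
    payload = "\n".join(
        json.dumps(
            {
                "id": r[0],
                "customer_id": r[1],
                "total": float(r[2]),
                "created_at": r[3].isoformat(),
            }
        )
        for r in rows
    )
    key = f"logs/orders/{since:%Y-%m-%d}.jsonl"
    s3.put_object(Bucket="analytics-lake", Key=key, Body=payload.encode("utf-8"))
```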
What’s more, new databases emerge monthly to solve the next challenge of enterprise scale.
The New Age Database
So the question is: will the future of the database continue to be defined by what a database is today?
I’d make the case that it shouldn’t.
Instead, I hope the next generation of databases will look very different from the last. They should have the following capabilities:
- Act primarily as compute, query, and/or infrastructure engines that can sit on top of commodity storage layers.
- Require no migration or restructuring of the underlying data.
- Require no re-writing or parsing of queries.
- Work on top of multiple storage engines, whether columnar, non-relational, or graph.
- Move the complexity of configuration, availability, and scale into code.
- Allow applications to call into a single interface, regardless of the underlying data infrastructure (see the sketch after this list).
- Work out of the box as a serverless or managed service.
- Be built for developer-first experiences, in both single-player and multiplayer modes.
- Deliver day-0 value for both existing (brownfield) and new (greenfield) projects.
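Picking up the "single interface" item above, here is a minimal sketch, assuming a hypothetical facade class and a deliberately naive routing rule, of how applications could call one interface while the underlying engines vary. Nothing here is a real product's API; the point is only that engine choice and routing become implementation details living in code.

```python
# A hypothetical facade: callers code against one object, and the decision of
# which engine serves a request stays behind it. Swapping Redis for another
# cache, or Postgres for another OLTP engine, does not touch application code.
class UnifiedData:
    def __init__(self, cache, oltp, warehouse):
        self._cache = cache          # e.g. a Redis client
        self._oltp = oltp            # e.g. a Postgres wrapper
        self._warehouse = warehouse  # e.g. a Snowflake/BigQuery wrapper

    def get(self, key: str):
        """Point reads: try the cache, then the transactional store."""
        hit = self._cache.get(key)
        return hit if hit is not None else self._oltp.get(key)

    def query(self, sql: str, params: tuple = ()):
        """Queries: route analytical-looking SQL to the warehouse, the rest to OLTP."""
        target = self._warehouse if "GROUP BY" in sql.upper() else self._oltp
        return target.query(sql, params)
```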
There are many secular trends driving this future:
1. No one wants to migrate to a new database. The cost of every new database introduced into an organization grows roughly as N² with the number of databases you already have: migrating to a new architecture, schema, and configuration, then re-optimizing for rebalancing, query planning, scaling, resource requirements, and more, often leaves the value/(time + cost) ratio close to zero (a back-of-the-envelope example follows this list). It may come as a surprise, but there are still billions of dollars of Oracle instances powering critical apps today, and they likely aren’t going anywhere.
2. The majority of the killer features won’t be in the storage layer. Separating compute and storage has enabled new levels of performance, allowing for super-cheap raw storage and finely tuned, elastically scaled compute/query/infra layers. The storage layer can sit at the center of the data infrastructure and be leveraged in various ways, by multiple tools, to solve routing, parsing, availability, scale, translation, and more (an illustration follows this list).
3. The database is slowly unbundling into highly specialized services, moving away from the overly complex, locked-in approaches of the past. No single database can fully solve transactional and analytical use cases, with fast reads and writes, high availability and consistency, all while solving caching at the edge and horizontally scaling as needed. But unbundling into a set of layers that sit on top of the storage engine can introduce new services that deliver new levels of performance and guarantees: for example, a dynamic caching service that optimizes caches based on user, query, and data awareness; sharding managed by data distribution, query demand, and data change rates; a proxy layer for high availability and horizontal scale, with connection pooling and resource management; a data management framework to handle async and sync propagation between schemas; or translation layers between GraphQL and relational databases. These multi-dimensional problems can be built as programmatic solutions, in code, decoupled from the database itself, and can perform significantly better (a toy example follows this list).
4. Scale and simplicity have been trade-offs up until now. Postgres, MySQL, and Cassandra are very powerful but difficult to get right. Firebase and Heroku are super easy to use but don’t scale. The former have massive install bases and robust engines, and have withstood the test of time at Facebook- and Netflix-level scale. But tuning them for your needs often requires a Ph.D. and a team of database experts, as teams at Facebook, Netflix, Uber, and Airbnb all have. The rest of us struggle with consistency and isolation, sharding, locking, clock skews, query planning, security, networking, and so on. What companies like Supabase and Hydras are doing, leveraging standard Postgres installs and building powerful compute and management layers on top, allows for the power of Postgres with the simplicity of Firebase or Heroku.
5. The database index model hasn’t changed in 30+ years. Today we rely on general-purpose, one-size-fits-all indexes such as B-trees and hash maps, taking a black-box view of our data. Being more data-aware, such as leveraging a cumulative distribution function (CDF) as we’ve seen with Learned Indexes, can lead to smaller indexes, faster lookups, increased parallelism, and reduced CPU usage. We’ve barely begun to explore next-generation indexes that adapt to both the shape of our data and how it changes (a minimal sketch follows this list).
6. There is little to no machine learning used to improve database performance. Instead, today we define static rule sets and configurations to optimize query performance, cost modeling, and workload forecasting. These combinatorial, multi-dimensional problem sets are too complex for humans to configure and are perfect machine learning problems: resources such as disk, RAM, and CPU are well characterized, query history is well understood, and data distribution can be defined. We could see 10x step-ups in query performance, cost, and resource utilization, and never see another nested loop join again (a sketch of the framing follows this list).
7. Data platform and engineering teams don’t want to be DBAs, DevOps, or SREs. They want their systems and services to just work, out of the box, and not have to think about resources, connection pooling, cache logic, vacuuming, query planning, updating indexes, and more. Teams today want a robust set of endpoints that are easy to deploy, and just work.
8. The need for operational real-time data is driving a need for hybrid systems. Transactional systems can write new records into a table rapidly, with a high level of accuracy, speed, and reliability. Analytical systems can search across a set of tables and data rapidly to find an answer. With streaming data and the need for faster responsiveness in analytical systems, the idea of HTAP (hybrid transaction/analytical processing) systems is emerging, particularly for use cases that are highly operational in nature, meaning a very high rate of new writes/records combined with more responsive telemetry or analytics on business metrics. This introduces a new architectural paradigm, where transactional and analytical data and systems start to reside much closer to each other, but not together.
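A back-of-the-envelope reading of the N² claim in point 1: if every database has to interoperate with every other one (sync jobs, connectors, schema mappings), the number of pairwise integration paths grows quadratically.

```python
def integration_paths(n_databases: int) -> int:
    # Pairwise connections between stores: n * (n - 1) / 2.
    return n_databases * (n_databases - 1) // 2

for n in (2, 4, 7, 10):
    print(f"{n} databases -> up to {integration_paths(n)} integration paths")
# 2 -> 1, 4 -> 6, 7 -> 21, 10 -> 45: the tenth database adds nine new paths on its own.
```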
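For point 2, a hedged illustration of compute sitting directly on top of a commodity storage layer: an embedded query engine (DuckDB here) scanning Parquet files in object storage without a load step. The bucket path is hypothetical, and credential/region setup is omitted.

```python
import duckdb

con = duckdb.connect()            # in-memory engine; the data stays in S3
con.execute("INSTALL httpfs;")    # extension that enables s3:// paths
con.execute("LOAD httpfs;")

# Aggregate straight over raw Parquet in object storage
# (assumes S3 credentials are configured for the session).
top_customers = con.execute(
    """
    SELECT customer_id, sum(total) AS lifetime_value
    FROM read_parquet('s3://analytics-lake/orders/*.parquet')
    GROUP BY customer_id
    ORDER BY lifetime_value DESC
    LIMIT 10
    """
).fetchall()
```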
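For point 3, a toy sketch of one such unbundled layer: a cache that keys results by (normalized query, params) and drops them when the tables a query reads from receive writes. Real products in this space are far more sophisticated; this only shows the logic living in code, decoupled from the database itself.

```python
import re
from collections import defaultdict

class QueryCache:
    def __init__(self, db):
        self._db = db                      # anything with .query(sql, params)
        self._results = {}                 # (sql, params) -> cached rows
        self._by_table = defaultdict(set)  # table name -> cache keys to drop on write

    @staticmethod
    def _tables(sql: str) -> set:
        # Extremely naive "data awareness": pull table names after FROM/JOIN.
        return set(re.findall(r"(?:FROM|JOIN)\s+(\w+)", sql, re.IGNORECASE))

    def query(self, sql: str, params: tuple = ()):
        key = (sql.strip().lower(), params)
        if key not in self._results:
            self._results[key] = self._db.query(sql, params)
            for table in self._tables(sql):
                self._by_table[table].add(key)
        return self._results[key]

    def notify_write(self, table: str) -> None:
        # Called by a replication/CDC hook whenever a table changes.
        for key in self._by_table.pop(table, set()):
            self._results.pop(key, None)
```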
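For point 5, a minimal sketch of the learned-index idea: approximate the CDF of a sorted key array with a single linear model, predict a position, and search only within the model's known error bound. This is a one-segment toy version of the approach, not a production structure.

```python
import bisect
import numpy as np

class LearnedIndex:
    def __init__(self, keys: np.ndarray):
        self.keys = np.sort(keys)
        positions = np.arange(len(self.keys))
        # Fit position ~ slope * key + intercept (a linear approximation of the CDF).
        self.slope, self.intercept = np.polyfit(self.keys, positions, deg=1)
        predicted = self.slope * self.keys + self.intercept
        self.max_err = int(np.ceil(np.max(np.abs(predicted - positions))))

    def lookup(self, key) -> int:
        guess = int(self.slope * key + self.intercept)
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        # Binary search only inside the model's error window.
        i = lo + bisect.bisect_left(self.keys[lo:hi].tolist(), key)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        raise KeyError(key)

idx = LearnedIndex(np.random.randint(0, 10**9, size=1_000_000))
pos = idx.lookup(idx.keys[12345])   # bounded search instead of a full tree walk
```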
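For point 6, a hedged sketch of framing knob tuning as a learning problem: fit a latency model on observed (configuration, workload) pairs, then rank candidate configurations with it instead of relying on hand-written rules. The knob names and numbers are made up for illustration; systems like Ottertune formulate this far more rigorously.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Observed history: [shared_buffers_gb, work_mem_mb, parallel_workers, qps]
X = np.array([
    [4, 64, 2, 500],
    [8, 64, 4, 500],
    [8, 256, 4, 900],
    [16, 256, 8, 900],
])
latency_ms = np.array([120.0, 95.0, 88.0, 70.0])  # measured p95 latency

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, latency_ms)

# Score candidate configurations for an expected workload of 900 qps
# and suggest the one with the lowest predicted latency.
candidates = np.array([[8, 128, 4, 900], [16, 128, 8, 900], [16, 512, 8, 900]])
best = candidates[np.argmin(model.predict(candidates))]
print("suggested config:", best)
```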
A New Category of Databases
A new category of cloud database companies is emerging, effectively deconstructing the traditional, monolithic database stack into core layered services: storage, compute, optimization, query planning, indexing, functions, and more. Companies like ReadySet, Hasura, Xata, Ottertune, Apollo, Polyscale, and others are examples of this movement, and they are quickly becoming the new developer standard.
These new unbundled databases are focused on solving the hard problems of caching, indexing, scale, and availability, and they are beginning to remove the trade-off between performance and guarantees: fast, always-on, data-aware databases that handle massive scale and blur the traditional divide between operational and analytical systems. The future looks bright.