Everyone loves Snowflake. The ease of use when it comes to elastic scaling, data backups, and table optimizations amongst other features make Snowflake a great choice when setting up a data stack. Given the proper budget and data resources, Snowflake is undoubtedly the leading data warehouse candidate. But has your company been considering alternatives to Snowflake, whether it be for cost, vendor lock-in concerns, or other reasons?
We’ve done this. Here at Definite we’ve been building an alternative approach that uses an open source database, DuckDB, and a self-hosted infrastructure. With this setup we were able to migrate all our analytical data and queries from Snowflake to DuckDB. Everything from internal team dashboards to ad-hoc queries (eg Stripe MRR metrics, Hubspot deal progress, etc) now runs on our self-hosted DuckDB database.
TLDR: it required a good bit of work but we saw a radical reduction in cost (>70%).
DuckDB is a columnar database management system built for analytical workloads. It’s fast (see below), open source and has a very active community. The Github repo has 17.5k stars on GitHub with 34 authors have pushing 575 commits to main in the past month. Some features of DuckDB:
*pip install duckdb*
and pointing it to your filesThis article has a good overview of DuckDB and where it fits in your data stack.
Unlike Snowflake, which supports multiple concurrent processes, DuckDB is designed for single-user workloads and becomes locked and un-queryable during data writes or table edits. To work around this issue, we set up two DuckDB instances, a write-to database and a read-only database:
INSERT
, CREATE
, UPDATE
, etc are actioned on this database first.SELECT...
), the query is routed to the read-only DuckDB which in turn fetches the data from GCS and returns it.The benefit of having write-to and read-only DuckDB instances is that the data warehouse is still queryable while edits are happening in the background. This could be a long running ETL job such as loading Stripe transaction data or a user initiated UPDATE
statement; we want to ensure that the data is still readable even while these are running.
Why not also use GCS as the file store for the write-to database given how much cheaper it is relative to using a persistent disk? Currently at the time of writing, DuckDB only supports read-only mode for blob storage such as GCS or S3.
Though the set up outlined above requires some engineering effort, it plays a crucial role in enabling data warehouse-like functionality required for analytics and BI use cases, and we are constantly working to optimize this piece of the infrastructure.
To run the DuckDB setup described above we need two Google Compute Engine instances, one persistent disk, and access to GCS. To estimate the daily cost of running all this we will make the following configuration and storage assumptions:
Since most data warehouses are not in use 24 hours a day, we will estimate the cost under different usage scenarios. In typical BI use cases involving dashboard refreshes, data ingestion, and ad-hoc queries, compute may only be needed during peak business hours.
We also compare the cost of different Snowflake warehouse sizes (compute) running under the different usage scenarios:
As seen in the table on the right above, at 12 hours of usage per day DuckDB on GCP is ~55% cheaper than the smallest Snowflake warehouse. If we compare it to the next smallest Snowflake warehouse, Small, we can see it is ~77% cheaper and savings increase as we go up the Snowflake warehouse sizes.
The same table as before but in chart form:
Of course there are a lot of assumptions in this estimate that could vary; GKE specs can be scaled up or down, Snowflake users may utilize multiple warehouses of different sizes, total data storage size could be larger or smaller, etc, but directionally self hosting DuckDB should yield cost savings relative to Snowflake (see bottom for GCP and Snowflake pricing pages). As mentioned in the beginning, anecdotally at Definite we’ve seen >70% savings by switching to DuckDB.
If you want to give DuckDB a try, here is an excellent guide on how to get it running on your local machine. If you want a more cloud data warehouse-like experience and have the resources to set up your own infrastructure, the DIY approach we outlined in this post is an option. Alternatively, managed solutions like Motherduck or Definite can be a pain-free way to quickly spin up your own DuckDB powered data warehouse.
With Definite, not only do you get a managed DuckDB warehouse, we’ve also added hundreds of integrations to a wide range of other products (Salesforce, Stripe, Postgres, etc) so you can seamlessly move data in and out of DuckDB. Additionally, we also built a BI layer and AI-enabled natural language query layer on top of DuckDB, allowing anyone on your team to derive data insights. Definite’s goal is to be your data-stack in a box with ETL, data warehousing, and analytics all deployed with a few clicks. If you would like to learn more, feel free to reach out!
Resource Pricing References:
Get the new standard in analytics. Sign up below or get in touch and we’ll set you up in under 30 minutes.