Snowflake Offers The Benefits Of Partitions, Indexes & Stats Without The Work

Paul M Fosse
5 min read · Apr 6, 2021

I’ll tell a story of how the development of a new data mart goes in a typical on-premises environment and how that changes as we move to Snowflake, a modern cloud analytical database. (Originally published on LinkedIn.)

Using Traditional Oracle to load a data mart

When designing a data mart on a traditional Oracle database, before Exadata, developers are trained to perform many steps before loading any data.

  1. Gather requirements on how the data will be used. Will your business partners run reports by time period or state or region or product line? How many years of history will be loaded initially and maintained for the life of the project?
  2. Best practice is to partition the data by time and, if the table is more than 10 million rows or so, maybe add a subpartition key as well. These partitions have to be created manually (or a script written to create new ones as needed and drop old ones that are no longer needed), as in the first sketch after this list. You must define the partitions before loading the data, and if you set anything up wrong, the load will fail.
  3. Next, you analyze the expected queries and the columns likely to be filtered on most, and set up single-column and composite indexes on many of the important columns. Later, you need to test your queries to see whether the indexes are even beneficial, and remove the ones that aren’t (but nobody remembers to do this step).
  4. You then write your load programs and must decide among three bad choices. First, if your data loads are small, you can load with all the indexes in place on the tables, but the loads run slowly because of the overhead of incrementally updating the indexes as each row goes in. Second, you can drop the indexes before the load, load the data, and then recreate the indexes. This speeds up the load, but recreating the indexes on the whole table when you only loaded a few rows is wasteful. The third bad choice, shown in the second sketch after this list, is to drop (or disable) only the indexes for the partitions you are loading, but you usually can’t do that since updates can arrive for data in any partition.
  5. As you load each partitioned table, you can either gather stats on just that partition and its indexes (which sounds like a great idea until you try it) or wait until all the partition loads are complete. There are issues with both approaches.
  6. After you do all these tasks, your queries run well for a while. But then a new release of the database comes out, or some settings on the database are changed without notification, and all of a sudden a query that took 20 minutes is taking 20 hours. Traditional optimizers are very fragile: they depend on partitions, indexes, and stats all being kept up to date, or they will frequently pick a plan that is much slower than the one the same query used yesterday.
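
To make steps 2 and 3 concrete, here is a minimal sketch of the kind of DDL involved. The table, column, and partition names are hypothetical, and a real design would have many more partitions and indexes.

    -- Hypothetical fact table, range-partitioned by date; every new period
    -- needs another partition added by hand or by a maintenance script.
    CREATE TABLE sales_fact (
        sale_date   DATE NOT NULL,
        state_code  VARCHAR2(2),
        product_id  NUMBER,
        amount      NUMBER(12,2)
    )
    PARTITION BY RANGE (sale_date) (
        PARTITION p2020q1 VALUES LESS THAN (DATE '2020-04-01'),
        PARTITION p2020q2 VALUES LESS THAN (DATE '2020-07-01'),
        PARTITION p2020q3 VALUES LESS THAN (DATE '2020-10-01')
    );

    -- A composite index on the columns the reports are expected to filter on.
    CREATE INDEX sales_fact_state_prod_ix
        ON sales_fact (state_code, product_id) LOCAL;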
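
And here is a sketch of the third load choice from step 4 together with the per-partition stats from step 5, again using those hypothetical names; the exact calls vary by Oracle version and by how your DBAs prefer to manage statistics.

    -- Mark only the loaded partition's local index segments unusable,
    -- run the bulk load, then rebuild just those segments.
    ALTER TABLE sales_fact MODIFY PARTITION p2020q3 UNUSABLE LOCAL INDEXES;

    -- ... bulk load rows for that partition here ...

    ALTER TABLE sales_fact MODIFY PARTITION p2020q3 REBUILD UNUSABLE LOCAL INDEXES;

    -- Gather stats on just that partition; the GRANULARITY setting decides
    -- whether global stats are refreshed too, which is where the trouble starts.
    BEGIN
      DBMS_STATS.GATHER_TABLE_STATS(
        ownname     => USER,
        tabname     => 'SALES_FACT',
        partname    => 'P2020Q3',
        granularity => 'PARTITION');
    END;
    /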

Using Snowflake to load a data mart

So how are those six steps different using a cloud database? It seems these tools apply the 80/20 rule, or Pareto principle: can we get 80% of the benefit of partitions, indexes, and stats for 20% of the effort? Even better, that 20% of effort happens on the backend without the developer lifting a finger.

  1. Ever since the data lake concept was popularized, the idea has been to put the data in the data lake first: whatever the requirements turn out to be, you are going to need the data, so just load it in whatever format it arrived in.
  2. Snowflake’s micro-partitions capture the low and high value of each column in every file it creates. The execution engine can then use this metadata to skip reading files that can’t satisfy the query. If you have a big table that you want clustered by something other than load order, Snowflake lets you set that up and maintains it in the background (see the sketch after this list), but you can do that later if needed; you don’t have to do it before getting the data mart up and running.
  3. Snowflake doesn’t use indexes, but since it stores data by column and range of values, it can scan the data needed to find specific values pretty quickly. It wouldn’t be optimal for an online transaction processing system, but it works great for analytical queries.
  4. You write your data loads, and any reclustering (if needed) is done in the background, so you don’t have to pick among the three bad choices above.
  5. Likewise, stats are automatically gathered on all data loaded, so you don’t have to schedule another process to do that.
  6. Your queries end up with consistent plans because Snowflake always knows the current high and low values and the number of distinct values for each column in each file. You don’t have the problems of stale indexes or stale statistics. You could still have the data clustered on the wrong columns (similar to having the data partitioned on the wrong columns in Oracle), but if you run into this issue, you can correct it in the background with no downtime and without writing a script to #1 back up the data (hopefully you have enough space on the server to do this), #2 drop the table, #3 recreate the table, #4 reload the data, #5 recreate the indexes, and #6 gather statistics. On Snowflake you just run “alter table a cluster by (c1)” to change table a to be clustered by column c1, as sketched below. Snowflake will recluster your table in the background while continuing to support queries of all types on the table.
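
Here is a minimal sketch of the Snowflake side, again with hypothetical table and column names: create the table, load it, and only add a clustering key later if pruning on a large table turns out to need it.

    -- No partitions, indexes, or stats to define up front.
    CREATE TABLE sales_fact (
        sale_date   DATE,
        state_code  VARCHAR,
        product_id  NUMBER,
        amount      NUMBER(12,2)
    );

    -- Later, if needed: cluster by the column most queries filter on.
    -- Reclustering runs in the background with no downtime.
    ALTER TABLE sales_fact CLUSTER BY (sale_date);

    -- See how well the table is clustered on a candidate key.
    SELECT SYSTEM$CLUSTERING_INFORMATION('sales_fact', '(sale_date)');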

Conclusion

As you can see, there is a lot more to Snowflake than the flexibility of scaling your warehouse without a long-term commitment to buy more hardware and software. The software is designed to streamline the development of analytical databases. Does this cut development time in half, or to a quarter? I don’t know; it depends on how much time you spend on partitions, indexes, and statistics today. In my experience, small projects don’t need to spend much time on them, but larger projects spend weeks or even months figuring those things out.

To read more about Snowflake’s table structures, check out their online documentation; no registration needed. I hope you enjoyed my article contrasting development on Oracle and Snowflake. If you like my writing, I’ve written hundreds of articles on electric vehicles (EVs) and solar at CleanTechnica and TorqueNews.
