5 Features Of Snowflake That Data Engineers Must Know

Deepti J 19 Oct, 2021 • 7 min read
This article was published as a part of the Data Science Blogathon
Snowflake is a cloud data platform that comes with many unique features compared to traditional on-premise RDBMS systems. In this tutorial, you will see the top 5 features that developers should know before implementing a solution on the Snowflake data platform, each with a quick example. If you are just beginning with Snowflake, then I would recommend going through my introductory article first.

 

Image Source: Canva

The 5 features are listed below:

1. Stages
2. Zero copy cloning
3. Time Travel
4. Streams
5. Tasks
 

Let’s discuss these in detail.

Stages in Snowflake

Stages are locations used to store data. From a stage, data is loaded into Snowflake tables, and stages can also be used to unload data from Snowflake. Stages are classified as below:

a. Internal Stage

An internal stage is where the data that has to be loaded into Snowflake is stored inside Snowflake itself.

b. External Stage

An external stage is where the data that has to be loaded into Snowflake is stored outside Snowflake, for example in AWS S3. Using an external stage, you can query the files that are available in AWS S3.

Internal Stages are further divided as below:

a. User Stages

User stages are created by default and are specific to a particular user. If a user wants to load data from local files into many tables, this stage can be used.

b. Table Stages

Table stages are created by default when a table is created and are specific to that particular table.

c. Named Stages

Named stages can be explicitly created by us.
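For a quick feel of the syntax, here is a small sketch (my_stage and my_table are hypothetical names). The user stage is referenced as @~, a table stage as @%table_name, and a named stage by its own name:

create stage my_stage;   -- named stage, created explicitly
list @~;                 -- files in the current user's stage
list @%my_table;         -- files in the table stage of my_table
list @my_stage;          -- files in the named stage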

If you are looking for a deep dive into stages with hands-on examples, then do check out my article here.

Time Travel in Snowflake

Remember the time machine from the good old movies? You could go back in time and see the world as it was. The Time Travel feature offered by Snowflake is similar to this.

Time Travel can be defined on objects like databases, schemas, and tables. The amount of time for which we can see the historical data of an object is defined by the data retention period parameter. By default, it is set to 1 day for all objects in all editions of Snowflake.
For Enterprise and Business Critical editions, you can extend this parameter up to 90 days.
 

The Time Travel duration on an object is set using the parameter DATA_RETENTION_TIME_IN_DAYS. Once the Time Travel duration has passed, the object enters the Fail-safe region. If you need to retrieve an object that is in Fail-safe, you have to contact Snowflake Support.
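For example, the retention period of an existing table can be changed with an ALTER statement (a quick sketch; T1 is a hypothetical table, and a value above 1 day requires Enterprise edition or higher):

alter table T1 set data_retention_time_in_days = 30;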

Let us see this with an example

1. Create an employee table with a data retention period of 4 days. Note that I am using the DEMO_DB database and the PUBLIC schema.

---TIME TRAVEL EXAMPLE
create or replace table T_EMP_DETLS (empid number,empname varchar) data_retention_time_in_days=4;

insert into T_EMP_DETLS values(1,'Micheal');

insert into T_EMP_DETLS values(2,'Nick');

insert into T_EMP_DETLS values(3,'George');

insert into T_EMP_DETLS values(4,'Donald');

insert into T_EMP_DETLS values(5,'Vincent');

insert into T_EMP_DETLS values(6,'Leo');

select * from T_EMP_DETLS;

Image Source: Author

After 5 minutes, I inserted another row with EMPID 7 as below:

insert into T_EMP_DETLS values(7,'William');

Now there are 7 rows in the table, but let us go back 5 minutes and see how the table looked then.
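To look at the table as it was 5 minutes ago, you can query it with the AT clause using a negative offset in seconds:

select * from T_EMP_DETLS at(offset => -60*5);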

Image Source: Author

Like this, you can go back in time and check what data the table contained.

Zero Copy Cloning in Snowflake

In a typical RDBMS system, when you copy a table from one database to another, the table along with its data is copied, which means the data is duplicated and storage costs are higher.

Snowflake has a feature called zero-copy cloning where you can create a copy of a table without copying the entire data. The best part is that we are charged for only one copy of the data. For example, if you need to copy data of a particular table from the Production to the Development environment, you can use zero-copy cloning. Here the data is only stored in Production.
 

So what happens if you apply changes in development? Every table in Snowflake is stored internally as micro-partitions.

When a change happens in development, only the affected micro-partition is rewritten: the new version belongs to the development clone, while the original remains in Production. So you can create any number of clones and pay only once as long as the data is not changed.
In this example, you will learn how to clone a table. Let us clone the table that was created in the above section.
create or replace table T_EMP_DETLS_CLN clone T_EMP_DETLS ;

Let us see how much storage this cloned table occupies when compared to the original table T_EMP_DETLS.

We can see that the storage for the cloned table is 0 and both tables share the same clone group ID.

SELECT * FROM T_EMP_DETLS_CLN;

Image Source: Author
select * from demo_db.information_schema.table_storage_metrics where table_name like '%T_EMP_DETLS%';

Image Source: Author

Streams in Snowflake

A stream is a feature that lets you track the changes happening to the data in a table. You can capture all inserts, updates, and deletes. You can capture changes from the time the stream is created, but you will not be able to capture changes made before that.

Once the stream is created, you can either query the stream or consume it. If you consume the stream, that is, once you insert the data from the stream into another table, the stream becomes empty. This is very handy while building continuous data pipelines.
 

Creating a stream on a table adds hidden columns to the table that track metadata about the changes. Please note that the stream itself does not contain any table data.

The following 3 metadata columns are returned when you query a stream:

1. METADATA$ACTION – This column specifies whether the record was inserted or deleted.
2. METADATA$ISUPDATE – This column is set to TRUE if the record was part of an update.
3. METADATA$ROW_ID – This column is a unique and immutable ID for each row.
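To see how an update is represented, here is a small sketch (t and s are hypothetical names for a table and a stream on it). An UPDATE is recorded in the stream as a DELETE row plus an INSERT row, both flagged as an update:

update t set empname = 'new' where empid = 1;
select metadata$action, metadata$isupdate, empid, empname from s;
-- returns two rows for the updated record: one DELETE and one INSERT,
-- both with METADATA$ISUPDATE = TRUE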

Snowflake supports 3 types of streams; a quick syntax sketch follows this list.

1. Standard – This supports tracking of all inserts, updates, and deletes.
2. Append-only – This supports tracking of inserts only on the table.
3. Insert-only – This is supported only on external tables.
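The creation syntax for each type looks like this (a sketch; t and ext_t are hypothetical table names):

create stream s_standard on table t;                                -- standard (default)
create stream s_append on table t append_only = true;               -- append-only
create stream s_insert on external table ext_t insert_only = true;  -- insert-only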
Let us create a stream on the existing table as below and insert the data into the main table.

create stream st_emp_dtls on table T_EMP_DETLS;
insert into T_EMP_DETLS values(8,'Arnold');

Let us see what we have in the stream now
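Querying a stream is an ordinary SELECT, and this alone does not consume it:

select * from st_emp_dtls;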

Image Source: Author

As you can see, the record of the row that was inserted is displayed in the stream. There is a difference between querying a stream and consuming it. We have queried the stream in the above section; we will see what happens when we consume the stream in the next section.

Tasks in Snowflake

Tasks are one of the handiest features when creating a data pipeline in Snowflake. If you want to schedule any SQL statement in Snowflake, use tasks. Tasks can be ordered in a hierarchical manner. At any point in time, only a single instance of a task runs. For example, if you schedule a task for every hour and the run from the first hour is still executing, then the second run will not be initiated; it will be skipped. Once you define a task, it will not run immediately. You need to resume the task to start it.

Let us now create a task that runs a SQL statement

 

CREATE TASK T_TASK1
  WAREHOUSE = COMPUTE_WH
  SCHEDULE = '5 minute'
WHEN
  SYSTEM$STREAM_HAS_DATA('st_emp_dtls')
AS
  INSERT INTO test_emp_tabel(EMPID,EMPNAME) SELECT empid,empname FROM st_emp_dtls WHERE METADATA$ACTION = 'INSERT';
Image Source: Author

Here we are trying to capture the newly inserted records on T_EMP_DETLS into another table called test_emp_tabel, which is a form of change data capture. The task also consumes the data from the stream, which makes the stream empty. Once the task is created, we need to start it, as by default it will be in suspended mode. Ensure that the table test_emp_tabel is created before you run the task, for example as below.
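Since the DDL for the target table is not shown above, here is a minimal sketch assuming the same column types as T_EMP_DETLS:

create or replace table test_emp_tabel (empid number, empname varchar);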

Image Source: Author

Let us start the task

alter task T_TASK1 resume;
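To confirm that the task is now scheduled and to inspect its runs, you can use SHOW TASKS and the TASK_HISTORY table function:

show tasks like 'T_TASK1';
select * from table(information_schema.task_history(task_name => 'T_TASK1')) order by scheduled_time desc;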


Image Source: Author

Now let us check the data in the target table and also in the stream. The stream will be empty, as we have consumed it.
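The two queries below check both, using the objects created earlier:

select * from test_emp_tabel;
select * from st_emp_dtls; -- returns no rows, since the task consumed the stream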

Image Source: Author

Image Source: Author

Conclusion

I hope you got an overview of the most important features in Snowflake. These features will help you design your solutions better, especially when you are migrating from an on-premise environment to Snowflake or building continuous data pipelines. If you want to explore more, then refer to the official documentation from Snowflake here.

Happy Learning!!

Let me know your comments below if any.

If you would like to connect with me, then reach out here.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
