Snowflake is a cloud-based software-as-a-service (SaaS) data warehouse. It offers separate and on-the-fly scalable services for data storage, compute, and analytics.
To make the most of your Snowflake data warehouse, you need an ETL tool that can ingest data into Snowflake, transform it (beyond Snowflake’s own SQL transformation capabilities), and move data from Snowflake to other tables and databases within Snowflake, to external files, to applications, or anywhere the data drives impact.
In this blog, we will review five ETL tools that are currently available on the market:
Keboola
Stitch
Matillion
Apache Airflow
Blendo (now RudderStack)
1. Keboola
Keboola is a data platform as a service that helps you build and automate all your data pipelines.
By automating ETL, ELT, and reverse ETL pipelines, you save precious data engineering time so you can focus on more revenue-generating tasks.
Keboola is fully self-service: it offers intuitive no-code tools so business experts can create data integrations without writing code, as well as a feature-rich developer toolbox so engineers can fully customize their data pipelines.
Pros:
250+ connectors help you build data pipelines with out-of-the-box components in a couple of clicks. If you stumble upon a data source or destination that is not covered by Keboola’s pre-built connectors, you can use the Generic Extractor and Generic Writer to extract and load data from any endpoint.
Easy to use. Business experts can use Keboola as a self-service ETL tool without needing to rely on the IT department. The Visual Flow Builder empowers them to build ETL pipelines within a drag-and-drop GUI, and the no-code transformations let them clean the data in a couple of clicks. Data engineers can build pipelines in the same friendly GUI, or use developer tools such as code-driven data transformations (SQL, Python, R, or Julia), a dedicated CLI for data pipelines, or their own dbt code (see the short Python sketch after this list).
ETL, ELT, and reverse ETL. Keboola can transfer data between any endpoints. Move data to your cloud data warehouse or database (Snowflake, Microsoft Azure, Google BigQuery, Amazon Redshift, SQL Server, …), BI tools (PowerBi, Tableau, …), or SaaS apps (SalesForce, Mailchimp, …).
Keboola is a Snowflake partner, offering the highest quality of native data integrations with your Snowflake instance.
Keboola offers near real-time data integration, with orchestrators that can keep data fresh down to 1-minute granularity. It does not handle fully real-time data integration (data streaming).
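To illustrate the code-driven side of the toolbox, here is a minimal sketch of what a Python transformation might look like; the in/tables and out/tables paths follow Keboola’s transformation conventions, while the table and column names are illustrative assumptions.

```python
# Minimal sketch of a Keboola Python transformation.
# Input tables are staged as CSV files; outputs are picked up from out/tables.
# The table name "orders" and the "status" column are illustrative assumptions.
import pandas as pd

# Read the input table staged by the platform.
orders = pd.read_csv("in/tables/orders.csv")

# Example clean-up: drop duplicate rows and keep only completed orders.
orders = orders.drop_duplicates()
orders = orders[orders["status"] == "completed"]

# Write the cleaned table where output tables are expected.
orders.to_csv("out/tables/orders_clean.csv", index=False)
```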
Pricing:
Freemium model with 300 compute minutes for free every month, without any limitation on the number of users, connectors, types of operations, etc. Basically, you get the entire Keboola experience for free.
After the 300 minutes are used up, Keboola has a fair pricing model that scales with your needs. And you only pay for what you use at 14 cents per compute minute.
Automate every step of your ETL process and save time for revenue-generating tasks. Start for free. No credit card required.
2. Stitch
Stitch is a cloud ETL tool built around the open-source Singer framework. It focuses on extracting data and loading it into a data warehouse or data lake. Stitch was acquired by Talend and integrates with many complementary paid, proprietary Talend data services.
Pros:
Can extract data from multiple sources. Sources not covered out of the box can still be added through Stitch’s extensibility framework (see the short sketch at the end of this section).
Very intuitive to use.
Great for replication of relational databases.
Very polished product, with little to no bugs.
Cons:
The out-of-the-box data sources and data destinations covered by the tool are quite limited. Expect a lot of manual coding.
Depending on your pricing tier, you might be very limited in the number of data destinations (only 1 destination is allowed in the lowest pricing tier). This makes ELT and reverse ETL harder.
A lot of the best-working parts of Stitch (advanced scheduling, monitoring, …) are only available on the highest-paid tier. In general, Stitch is on the higher end of pricing for ETL tools.
Users are charged by the number of rows processed by the tool. Multiple passes over the same dataset (e.g. once for extract, once for load) are counted at each pass.
The first import of historical data is free and does not count toward the row quota.
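For context on the extensibility framework mentioned in the pros: Stitch’s extraction layer is built around the open-source Singer specification, so an uncovered source can be added with a custom Singer tap. A minimal sketch, assuming the singer-python package and an illustrative "customers" stream:

```python
# Minimal sketch of a custom Singer tap, the mechanism behind Stitch's
# extensibility. The stream name, schema, and records are illustrative.
import singer

SCHEMA = {
    "properties": {
        "id": {"type": "integer"},
        "email": {"type": "string"},
    }
}


def main():
    # Announce the stream and its schema, then emit records for Stitch to load.
    singer.write_schema("customers", SCHEMA, key_properties=["id"])
    singer.write_records("customers", [
        {"id": 1, "email": "ada@example.com"},
        {"id": 2, "email": "grace@example.com"},
    ])


if __name__ == "__main__":
    main()
```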
3. Matillion
Matillion ETL is a data integration tool that can build ETL data pipelines through a simple no-code/low-code drag-and-drop GUI.
Pros:
Intuitive drag-and-drop GUI that helps you build ETL pipelines with code or no-code.
Individual data operations scale well, with change data capture and batch processing built into the product.
Full support for ETL, ELT, and reverse ETL. The number and types of connectors covered by Matillion are extensive enough to cover the vast majority of use cases.
Cons:
The no-code ETL features are unlocked at the higher tier pricing.
Can have issues with scaling hardware infrastructure, especially EC2 instances under higher loads.
Pricing is compounded: you pay for Matillion and for the compute resources it uses to perform data operations on your cloud.
Users often report that documentation can be stale, that new versions of Matillion are rarely backward compatible (so updating the software requires a lot of maintenance), and that support for versioning (git) is poor.
Pricing:
Freemium model for loading data, but not for building entire ETL pipelines.
The ETL process is covered in a consumption-based pricing model, which combines rows of processed data, users, and active ETL processes into a single consumption formula.
4. Apache Airflow
Apache Airflow is a data platform that builds, schedules, and monitors workflows. Workflows can be ETL pipelines, or they can be extended to other developer use cases (reverse ETL, data enrichment, transformations, …). Workflows are written as Python DAGs (directed acyclic graphs).
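To make the “Python DAG” idea concrete, here is a minimal sketch of an Airflow workflow; the task names, placeholder functions, and the daily schedule are illustrative assumptions rather than a prescribed pattern.

```python
# Minimal illustrative Airflow DAG: extract, transform, and load steps
# chained as Python tasks. Function bodies are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Pull raw data from a source API or database (placeholder).
    pass


def transform():
    # Clean and reshape the extracted data (placeholder).
    pass


def load_to_snowflake():
    # Write the cleaned data into Snowflake (placeholder).
    pass


with DAG(
    dag_id="snowflake_etl_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load_to_snowflake)

    t_extract >> t_transform >> t_load
```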
Pros:
Extremely flexible. Covers all extract, transform, and load operations within Python scripts, for any combination of data source and data destination.
Very intuitive UI with easy-to-understand monitoring for every workflow and scheduled task.
The ecosystem for monitoring is very advanced. You can send alerts and logs to communication software (like Slack) or to monitoring platforms (like Grafana).
Cons:
It is Python-based, so not friendly for non-developers.
Sometimes Apache Airflow does not scale smoothly with data volumes and speed.
Maintenance and bug resolution can be painful since the product is not as polished as the paid alternatives from competitors.
Pricing:
Fully open-source and free. You just have to deploy it on a server, and then allocate and monitor its memory and compute consumption.
Paid, commercially supported versions are available (e.g. Google Cloud Platform offers Cloud Composer, a managed version of Apache Airflow). The pricing policy differs between commercial vendors, so make sure to check which vendor works best for you.
5. Blendo (now RudderStack)
After its acquisition by RudderStack, you might find the popular Blendo tool under either its original name or under the RudderStack brand.
Blendo is tailored towards speeding up ETL pipelines from raw sources to the data warehouse, and finally integrating the data into your business intelligence tool. It is mostly geared towards BI and data analytics use cases, simplifying the data engineering pipelines behind them.
Pros:
Very simple to use, especially for no-code users.
Covers a large number of sources and destinations, including the majority of marketing, HR, sales, PR, and data applications.
Cons:
Not as flexible as other tools. Because Blendo is primarily concerned with automating the no-code use cases, users find it hard to customize the ETL process.
In case a data source or destination is not covered by Blendo’s connectors, you’ll have to talk to customer support to get it implemented (there is no universal component like you can find in Keboola, Matillion, or Airflow).
Pricing:
No freemium model, only a 14-day free trial.
Pricing is not transparent and is based on the volume of rows processed by the tool (either by inserts, updates, deletions, or other operations). You will need to contact the vendor to get a quote.
With five strong tools to choose from, how do you pick the right one?
How to evaluate Snowflake ETL tools?
The ETL tool will help you partially or fully automate each stage of the extract, transform, and load data pipelines.
To make sure you do not miss any crucial feature, match your shortlisted ETL tools against these 7 criteria:
Coverage of data sources. ETL tools differ in the number and type of data sources they can extract raw data from. Some are focused on marketing and sales APIs, while others are more agnostic and can also handle incoming files (CSV, Excel, JSON, …) and different endpoints. Prepare a list of all your data sources to match against the ETL tool vendor’s coverage.
Transformation capabilities. Some tools are marketed as “ETL tools” but offer no (or only limited) transformations. The Snowflake data cloud has powerful, optimized transformation capabilities with ANSI SQL, so technically you do not need a separate tool for it. But having a dedicated tool helps you keep transformations consolidated in a single location and automate many transformation tasks (see the sketch after this list for how a transformation can be pushed down to Snowflake).
Target audience. Some tools are no-code, while others require you to know how to code in SQL, Python, Scala, Java, etc. Pick the right tool depending on who will use the ETL tool (ex.: data engineer vs business expert).
Data load destinations. Snowflake ETL tools assume they can ingest data into Snowflake. Keep an eye out for other destinations that might be beneficial to your company. For example, can you use the tool to build a data integration pipeline from raw data sources to Snowflake and then pipe the sanitized data to a business intelligence app? In other words: is the tool capable of Reverse ETL or ELT?
Pricing. Open-source solutions are usually cheaper than data platforms and tools offered by vendors (no fees, no licenses, no throttling at “unfair” consumption thresholds, and other vendor tricks). But the Total Cost of Ownership is usually greater for open-source solutions because of maintenance, debugging, and the DataOps work of provisioning the servers and instances the tool needs. Keep in mind all the possible expenses and hidden fees when evaluating your tool of choice.
Ease of use. Intuitiveness will help you speed up the deployment of new data pipelines. Also check the tool for all the automation that makes your life easier. Does it offer scheduling? Collaboration and user-role handling? Versioning of data? All the little automations that ease the work surrounding ETL pipelines add up.
Support and documentation. When things go wrong, is there a strong support system, such as vendor-guaranteed SLAs for support? Or if the tool is open-source, is there a strong community of users who can answer your questions? Is there extensive documentation you can rely on?
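As an example of the transformation criterion above, much of the heavy lifting can be pushed down to Snowflake’s own SQL engine, with the ETL tool (or a script) merely orchestrating the statement. A minimal sketch using the snowflake-connector-python package; the credentials, warehouse, and table names are placeholders.

```python
# Illustrative ELT-style transformation pushed down to Snowflake's SQL engine
# via snowflake-connector-python. Credentials and table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",    # placeholder
    user="your_user",          # placeholder
    password="your_password",  # placeholder
    warehouse="COMPUTE_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

try:
    cur = conn.cursor()
    # Build a cleaned table directly inside Snowflake from the raw load.
    cur.execute("""
        CREATE OR REPLACE TABLE sessions_clean AS
        SELECT user_id, session_date, sessions
        FROM raw_sessions
        WHERE sessions > 0
    """)
finally:
    conn.close()
```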
The seven criteria cover the most common “gotchas” when picking the best ETL tool for Snowflake. But there are also more subtle considerations to keep in mind.
Pro tips for data engineers
Beyond the general criteria for evaluating the ETL process coverage and tradeoffs, keep these tips in mind to narrow down your tool selection:
Pick a Snowflake partner. Snowflake partners offer native integrations that are quality checked, are in sync with the latest version of Snowflake, are backward compatible, and have many other features. Snowflake partners are the “seal of approval” for the ETL tool.
No-code and full-code are not necessarily a tradeoff. Some tools offer both no-code and full-code implementations. For example, with a drag-and-drop GUI for non-developers and a terminal/IDE for developers. Tools that offer both mechanisms usually have higher adoption rates throughout the company.
Check data security. Snowflake secures data with many mechanisms (end-to-end encryption, role-based access, a fully inspectable log of changes, etc.). But your ETL tool will access raw data before it enters Snowflake, and the cleaned data if you pipe it to external tools. Make sure the tools comply with your regulatory standards.
Verify the ETL tool can scale. Snowflake speeds up queries using MPP (massively parallel processing) compute clusters. It would be a shame if your ETL tool slowed down the entire ETL process with non-scalable operations. Whether a tool can scale is hard to determine during the “window shopping” phase. Things that hint at good scalability include genuine “big data” features (not just a name-drop, but actual distributed or parallel processing), cloud-based compute backing the tool (or at least multiprocessing), a change data capture (CDC) architecture, etc.
Pick tools that help you with DBA. Database administration (DBA) is becoming an outdated term, but it still covers a lot of data warehousing work, such as database replication, schema control, etc. The best ETL tools help you automate DBA work alongside the ETL pipelines they build.
Enough theory. Let’s put all this knowledge into action. We’ll build an ETL data pipeline from scratch.
Example of Snowflake ETL in practice
Keboola makes it easy to build ETL pipelines.
Step 1: Log in to Keboola
Navigate to the login/signup page and sign in, or create a new account if you don’t already have one. It is free, and no credit card is required.
Once inside Keboola’s UI, start creating a new data pipeline with the Visual Flow Builder:
Click “Flow” in the navigation menu.
Click “+ New Flow” to create a new ETL process.
Name your process “My first ETL in Keboola” and click “Create Flow”.
Step 2: Pick the source to extract raw data
Inside the Visual Flow Builder:
Click “Select first step”. A component list will show up.
Use the search bar to find the data source you want to extract raw data from. In the example below we picked Google Analytics data.
Click through the configuration wizard to connect Keboola with your data source so Keboola can start extracting data.
Step 3: Clean the data
Click on the + icon and add a new component. This time, we’ll use the no-code transformations to filter out rows from Google Analytics.
The wizard will guide you through the configuration of the no-code transformation layer. Once set up, you can pick from the many transformation operations available. We’ll use the filter functionality to keep only certain rows (the ones where sessions are greater than 0).
Step 4: Load data to Snowflake
Click on the + icon and add a new component. In the example below, we added a component that sends the cleaned data to Snowflake as the data destination.
Step 5: Run and automate
You can either run your ETL data pipeline right away (click the green button “Run Flow” in the upper right corner), or set it to run automatically on a schedule (click the link “Set Schedule”).
There you have it. In under 10 minutes, we have created an ETL pipeline that extracts data from Google Analytics, cleans it, sends it to Snowflake, and is fully automated to run on a schedule.
Deploy ETL pipelines faster and at a fraction of the costs
Keboola cuts complexity and simplifies your data engineering tasks.
Keboola automates the repetitive manual engineering tasks, takes care of maintaining and scaling the infrastructure, and saves you precious time with tools that empower everyone to self-serve their analytic needs.
Use Keboola as the fastest way to get your data into Snowflake.