Read & subscribe to my blogs at siddhithakkar.com. Details of the slides below:
This document aims to give you a short introduction to ETL tools and processes. Please do not forget to look at the notes section, which includes a descriptive narration of the points shown in the slides.
2. Agenda
What is ETL?
How does it work?
Its advantages and disadvantages.
Popular ETL tools in the market.
What are its alternatives?
Comparison of ETL with its alternatives.
3. What is ETL?
Extract, Transform, Load
Three functions combined in one tool
Pull data from one database to put it in another
4. How does it work?
All processes can be performed simultaneously.
5. Process 1- Extraction
• Objective: To retrieve data from all sources
• Process of gathering data from multiple sources.
• Data is often loaded into a staging area.
• Special care is needed to avoid impacting the performance of the source system.
• Three ways to perform the extraction:
• Update notification
• Incremental update
• Full Extraction
6. Process 2- Transform
• Objective- Improve quality of data
• Some data sets can skip this process
• Key step in delivering meaningful business insights
• Consists of three steps:
• Cleanse
• Map
• Transform
7. Process 3- Load
• Objective- To load data into a data warehouse solution/target repository
• Optimization for performance is a high priority
• Resilient processes
• The following three types of loading are possible:
• Initial load
• Incremental load
• Full refresh
8. Advantages of ETL tools:
• Less time-consuming
• Identify delta changes
• Convert heterogeneous data into homogeneous form
• Derive greater business insights
• Advanced data cleansing mechanisms
• Support for Big data
9. Disadvantages of ETL tools:
• Needs special skills
• Takes long time to set up
• Not ideal for near real-time data access
• Not suitable for changing requirements
• Meant for batch processing
12. ELT versus ETL
ELT (Extract, Load, Transform):
- Transformation takes place directly on the target server.
- Transformations are processed by the target database.
- Load time is reduced.
- Most suitable when big volumes of data need to be processed, or when the source and target databases are the same.
- New age.
ETL (Extract, Transform, Load):
- Transformation takes place on an intermediate server.
- Transformations are processed by the ETL tool.
- Load time is substantial.
- Not so suitable for huge amounts of data.
- Old school.
Hello and welcome to this short introduction to ETL tools and processes.
I plan to start by giving you a short definition of ETL, followed by how it really works.
We will move on further by talking about advantages and disadvantages of ETL processes.
We will also have a short look at some of the common ETL tools in the market.
Finally, we will look at alternatives to ETL tools and a short comparison among them.
Starting with the definition…
ETL is the short form of three different processes: Extract, Transform, Load.
ETL tools are primarily the ones that allow you to combine these three database functions into one tool.
Such a process is used to pull data from one database and put it into another.
But how does it really work? Let’s look at each of these processes one by one.
Starting with Extract:
- This is the most important step of ETL. It is the process of gathering data, often from multiple sources.
- These sources of data could be relational or hierarchical databases, Excel files, XML files, or anything else.
Secondly, Transform:
- The entire data set gathered via extraction is analyzed and made ready for conversion to another format.
- Lookup rules or tables are used in this process
- Usual tasks performed are: cleansing data (converting from one date format to another, or changing Male to M), filtering (removing some data), enriching (populating first, middle and last names), splitting columns (dividing a name column into first name and last name), removing duplicates, standardizing, translating, and verifying data sources.
Load-
Final stage of ETL process
Transformed data is loaded into target repository (usually databases)
All three processes can be performed simultaneously. Extraction is time-consuming, so while extraction is still in progress, already-extracted data can be transformed and loaded without waiting for the previous steps to complete.
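The overlap described above can be sketched in Python with generators: each stage consumes rows as the previous stage produces them, so nothing waits for a full extraction to finish. This is only an illustrative toy, not any particular ETL tool's design; the field names and cleaning rules are invented.

```python
# Minimal sketch of a pipelined ETL run: extract, transform, and load
# overlap because each stage is a generator that processes one row at a time.

def extract(rows):
    """Yield raw rows one at a time (stand-in for reading a source DB)."""
    for row in rows:
        yield row

def transform(rows):
    """Normalize each row as soon as it arrives."""
    for row in rows:
        yield {"name": row["name"].strip().title(),
               "gender": row["gender"][0].upper()}

def load(rows, target):
    """Append each transformed row to the target as soon as it is ready."""
    for row in rows:
        target.append(row)

source = [{"name": " jon smith ", "gender": "male"},
          {"name": "MARY JONES", "gender": "Female"}]
warehouse = []
load(transform(extract(source)), warehouse)
print(warehouse[0])  # {'name': 'Jon Smith', 'gender': 'M'}
```

Because the stages are chained lazily, the first row can be loaded while later rows are still being extracted.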
The main objective of the extract step is to retrieve all the required data from the source system with as little resources as possible.
Data is extracted into a staging area, since it often needs to be cleaned up before loading into the target repository.
During extraction, it must be ensured that the performance and response times of the source systems are not impacted, because these source systems are often live production databases.
Update notification: in some cases, source systems are able to provide a notification when a record is changed, along with details of the change. That is the easiest way to perform data extraction.
Incremental update: some databases are not able to provide notifications and details of changes, but can tell you which records changed and provide an extract of those specific records.
Full extraction: some databases can provide neither notifications nor incremental extracts; in those cases, a full extraction of the complete data set is needed. This case can handle deletions as well.
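An incremental update is often driven by a watermark such as a last-modified timestamp. The sketch below assumes a hypothetical `customers` table with an `updated_at` column (both names invented for illustration) and pulls only rows changed since the previous run:

```python
# Hypothetical incremental extraction: the source cannot push change
# notifications, but each record carries a last-modified timestamp,
# so we query only records changed since the previous run's watermark.
import sqlite3

def extract_incremental(conn, since):
    """Return only rows modified after the given watermark."""
    cur = conn.execute(
        "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
        (since,))
    return cur.fetchall()

# Tiny in-memory source standing in for a production database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Jon", "2024-01-01"), (2, "Mary", "2024-02-15")])

changed = extract_incremental(conn, "2024-01-31")
print(changed)  # only the record updated after the watermark
```

Note that a timestamp watermark alone cannot detect deleted rows, which is why full extraction remains the fallback when deletions matter.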
Data gathered in the extraction stage is raw and, most often, can't be used as such.
Data that doesn't need any transformation is called direct data or pass-through data.
Cleanse: Different spellings of the same person (Jon, John)
Different ways to denote same company (Google, Google Inc.)
Male written as Male, M
Validate address fields against each other (Street, State, City etc.)
Map: Mapping all address fields to one column
Concatenating first and last name fields into name
Transform: a set of functions is performed on the data set to convert it from the source format into the target format.
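The cleanse and map steps above can be sketched as two small functions; the alias table and gender-code rule are made up for illustration, not taken from any specific tool:

```python
# Toy sketch of the cleanse -> map steps described above.
# Rules (single-letter gender codes, company aliases) are invented examples.

COMPANY_ALIASES = {"Google Inc.": "Google", "Google LLC": "Google"}

def cleanse(row):
    # Standardize gender to a single-letter code and unify company names.
    row["gender"] = row["gender"].strip()[0].upper()
    row["company"] = COMPANY_ALIASES.get(row["company"], row["company"])
    return row

def map_fields(row):
    # Concatenate first and last name fields into one 'name' column.
    row["name"] = f"{row.pop('first_name')} {row.pop('last_name')}"
    return row

record = {"first_name": "Jon", "last_name": "Smith",
          "gender": "male", "company": "Google Inc."}
print(map_fields(cleanse(record)))
# {'gender': 'M', 'company': 'Google', 'name': 'Jon Smith'}
```

Real transformations would also apply the lookup tables, validation, and deduplication mentioned in the notes, but the shape is the same: a chain of row-level functions from source format to target format.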
Load is the last step of ETL process
To load data into a data warehouse solution/target repository. A data warehouse is a central repository that consolidates data from disparate sources for reporting and analysis.
Optimization for performance is a high priority, since huge amounts of data need to be loaded in a short time (often overnight).
Resilient processes: recovery mechanisms have to be robust, so that if loading of data fails, the process is able to pick back up from the point of failure.
Initial load: when the data warehouse tables are populated for the first time
Incremental load: when only changes in the data are copied
Full refresh: erasing the contents of a table and reloading it with fresh data
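The three loading modes can be sketched against an in-memory SQLite table standing in for the warehouse; real warehouse loaders differ in mechanics, but the pattern is the same (the `sales` table is an invented example):

```python
# Sketch of the three load modes using an in-memory SQLite target.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

def initial_load(rows):
    # First-time population of the warehouse table.
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

def incremental_load(rows):
    # Only changed/new rows arrive; existing rows are upserted.
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount", rows)

def full_refresh(rows):
    # Erase the table contents and reload with fresh data.
    conn.execute("DELETE FROM sales")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

initial_load([(1, 100.0), (2, 200.0)])
incremental_load([(2, 250.0), (3, 300.0)])   # only the delta is sent
print(conn.execute("SELECT * FROM sales ORDER BY id").fetchall())
# [(1, 100.0), (2, 250.0), (3, 300.0)]
```

The upsert form (`ON CONFLICT ... DO UPDATE`, SQLite 3.24+) is what makes the incremental load resumable: re-sending an already-applied delta leaves the table in the same state, which helps recovery from a failed run.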
Less time-consuming: the older way of moving data from one database to another was to write humongous amounts of code. ETL tools often provide a GUI that helps in defining rules and choosing source and target columns in different databases. That is why ETL is also regarded highly for its ease of use.
Identify delta changes: ETL tools are also capable of identifying delta changes, and therefore allow copying only the changes rather than the entire data set.
Convert heterogeneous data into homogeneous form: we generate huge amounts of data, and much of it is of little use unless ETL operations bring it into a consistent form.
Derive greater business insights: ETL tools help improve access to data, which helps businesses take data-driven decisions accurately
Advanced data cleansing mechanisms: most ETL tools provide built-in modules to clean up your data, or provide easy interfaces to other systems that can help you clean data (e.g. deduplication, enrichment).
Support for Big data: In most simple terms, big data refers to data sets that are too complex and huge for traditional systems to manage. This could mean both structured and unstructured data. ETL tools are now increasingly supporting both structured and unstructured data in one mapping table.
Needs special skills: you need to be a database analyst to use ETL tools; they are not easily accessible to business users.
Meant for batch processing: ETL is essentially a batch process, which means it picks up a data set, transforms it, loads it, and then moves to the next data set. Batch processes are inherently delayed.
ELT: a data integration process that sends data from the source server to the target server and then transforms it there for relevant downstream use cases.
SQL: Structured Query Language is used to query data
ELT and ETL are processes to load data so that it can be queried further.
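The ELT flow described above can be sketched with SQLite standing in for the target warehouse: raw rows are loaded untransformed, and the transformation is then expressed as SQL executed by the target database itself. Table and column names are invented for illustration.

```python
# Minimal ELT sketch: load raw data first, then let the target database
# do the transformation with plain SQL (SQLite stands in for a warehouse).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_people (first_name TEXT, last_name TEXT, gender TEXT)")

# Load: raw rows go in untransformed, so load time stays minimal.
conn.executemany("INSERT INTO raw_people VALUES (?, ?, ?)",
                 [("jon", "smith", "male"), ("mary", "jones", "Female")])

# Transform: the target database does the work in SQL.
conn.execute("""
    CREATE TABLE people AS
    SELECT first_name || ' ' || last_name AS name,
           UPPER(SUBSTR(gender, 1, 1)) AS gender
    FROM raw_people
""")
print(conn.execute("SELECT * FROM people").fetchall())
# [('jon smith', 'M'), ('mary jones', 'F')]
```

This is why ELT suits big data volumes: the heavy transformation work runs where the data already lives, instead of on an intermediate ETL server.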