"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
Chapter 6.pptx
1. Fair Use Notice
The material used in this presentation, i.e., pictures/graphs/text, etc., is solely intended for educational/teaching purposes, offered free of cost to students under the special circumstances of online education due to the COVID-19 lockdown, and may include copyrighted material, the use of which may not have been specifically authorized by the copyright owners. Its application constitutes fair use of any such copyrighted material as provided in the globally accepted law of many countries. The contents of this presentation are intended only for the attendees of the class being conducted by the presenter.
2. ETL: Data Extraction, Transformation, and Loading
Disclaimer: The contents of this presentation have been taken from multiple resources available on the internet, including books, notes, reports, websites, and presentations.
3. ETL Overview
• The technical architecture of the data warehouse comprises three functional areas: data acquisition, data storage, and information delivery.
• Data extraction, transformation, and loading (the ETL process) encompasses the areas of data acquisition and data storage.
• These are back-end processes that cover the extraction of data from the source systems. Next, they include all the functions and procedures for changing the source data into the exact formats and structures appropriate for storage in the data warehouse database.
• After the transformation of the data, these processes consist of all the functions for physically moving the data into the data warehouse repository.
4. • You will not extract all of the operational data and dump it into the data warehouse.
• Extraction is driven by user requirements: your requirements definition should guide you as to what data you need to extract and from which source systems.
5. Challenges in ETL
• Source systems are very diverse and disparate.
• There is usually a need to deal with source systems on multiple platforms and
different operating systems.
• Many source systems are older legacy applications running on obsolete
database technologies.
• Generally, historical data on changes in values is not preserved in source operational systems, yet historical information is critical in a data warehouse.
• The quality of data is dubious in many old source systems that have evolved over time.
• Source system structures keep changing over time because of new business
conditions. ETL functions must also be modified accordingly.
6. • A gross lack of consistency among source systems is common. The same data is likely to be represented differently in the various source systems.
• For example, data on salary may be represented as monthly salary, weekly salary,
and bimonthly salary in different source payroll systems.
• Even when inconsistent data is detected among disparate source systems, lack of a
means for resolving mismatches escalates the problem of inconsistency.
• Most source systems do not represent data in types or formats that are meaningful
to the users. Many representations are cryptic and ambiguous.
(This is a prevalent problem in legacy systems, many of which were not designed with
end-users in mind. For example, the customer status codes could have been started
with R = Regular and N = New. Then at one time, another code D = Deceased could
have been added. Down the road, a further code A = Archive could have been
included.)
• Project teams spend as much as 50–70% of the project effort on ETL functions.
7. DATA EXTRACTION
• Data extraction for a new operational system is quite different from the data
extraction for a data warehouse.
• First, for a data warehouse, you have to extract data from many disparate sources.
• Next, for a data warehouse, you have to extract data on the changes for ongoing
incremental loads as well as for a one-time initial full load. For operational
systems, all you need is one-time extractions and data conversions.
• The complexity of data extraction for a data warehouse therefore warrants the use of third-party data extraction tools (generally more expensive) in addition to in-house programs or scripts.
8. • On the other hand, in-house programs increase the cost of maintenance and are hard to keep current as source systems change.
• Effective data extraction is a key to the success of your data warehouse.
Therefore, you need to pay special attention to the issues and formulate a
data extraction strategy for your data warehouse.
9. Data Extraction issues:
1. Source Identification: Identify source applications and source structures.
2. Method of extraction: For each data source, define whether the extraction
process is manual or tool-based.
3. Extraction frequency: For each data source, establish how frequently the data extraction must be done: daily, weekly, quarterly, and so on.
4. Time window: For each data source, denote the time window for the
extraction process.
5. Job sequencing: Determine whether the beginning of one job in an
extraction job stream has to wait until the previous job has finished
successfully.
6. Exception handling: Determine how to handle input records that cannot be
extracted.
10. Data Extraction Techniques
• Source data is in a state of constant flux, because business transactions keep changing the data in the source systems.
• Data in the source systems are said to be time-dependent or temporal. This is
because source data changes with time.
• For example: When a customer moves to another state, the data about that
customer changes in the customer table in the source system.
• When two additional package types are added to the way a product may be sold,
the product data changes in the source system.
• When a correction is applied to the quantity ordered, the data about that order
gets changed in the source system.
11. Now consider the following scenario:
• A customer moves from New York state to California.
• In the operational system, only the current address of the customer, with CA as the state code, is important.
• The actual change transaction itself, stating that the previous state code was NY
and the revised state code is CA, need not be preserved.
• But how does this change affect the information in the data warehouse?
• If the state code is used for analyzing some measurements such as sales, the
sales to the customer prior to the change must be counted in New York state
and those after the move must be counted in California.
• In other words, the history cannot be ignored in the data warehouse.
• So how do you capture the history from the source systems?
12. Data in Operational Systems
• These source systems generally store data in two ways:
• Current Value. Most of the attributes in the source systems fall into this
category. Here the stored value of an attribute represents the value of the
attribute at this moment of time.
• Periodic Status. This category is not as common as the previous category. In this
category, the value of the attribute is preserved as the status every time a
change occurs. At each of these points in time, the status value is stored with
reference to the time when the new value became effective.
14. Extraction Techniques
• When you deploy your data warehouse, the initial data as of a certain time
must be moved to the data warehouse to get it started. This is the initial load.
• After the initial load, your data warehouse must be kept updated so the
history of the changes and statuses are reflected in the data warehouse.
• Broadly, there are two major types of data extractions from the source
operational systems:
• “as is” (static) data and data of revisions.
• “As is” or static data is the capture of data at a given point in time. It is like
taking a snapshot of the relevant source data at a certain point in time.
• You will use static data capture primarily for the initial load of the data
warehouse.
15. Extraction Techniques
• Data of revisions is also known as incremental data capture.
• For periodic status data or periodic event data, the incremental data
capture includes the values of attributes at specific times.
• Extract the statuses and events that have been recorded since the last
date of extract.
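A minimal sketch of incremental capture in Python, using an in-memory sqlite3 table as a stand-in source and assuming the source carries a last_updated column (a common convention; many real sources need log- or trigger-based capture instead). The table and field names are hypothetical:

```python
import sqlite3

# Stand-in source system with a last_updated column (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, state TEXT, last_updated TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "NY", "2024-01-10"), (2, "CA", "2024-03-05"), (3, "TX", "2024-03-20")],
)

last_extract = "2024-03-01"  # high-water mark saved from the previous run
changed = conn.execute(
    "SELECT customer_id, state, last_updated FROM customers WHERE last_updated > ?",
    (last_extract,),
).fetchall()
print(changed)  # only the rows revised since the last date of extract
```

After each run, the highest last_updated value seen becomes the new high-water mark for the next incremental extract.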
16. DATA TRANSFORMATION
• The extracted data is raw and cannot be applied to the data warehouse as is.
• All the extracted data must be made usable in the data warehouse.
• Having information that is usable for strategic decision making is the underlying
principle of the data warehouse.
• The relevant data extracted from the dissimilar source systems varies in source data formats, data values, and the condition of data quality. (Data quality is crucial: it assesses whether information can serve its purpose in a particular context such as data analysis. Quality traits include accuracy, completeness, reliability, relevance, and timeliness.)
• You have to perform several types of transformations to make the source data
suitable for your data warehouse.
• Transformation of source data encompasses a wide variety of manipulations to
change all the extracted source data into usable information to be stored in the data
warehouse.
17. Major Transformation Types
Format Revisions
• You will come across these quite often. These revisions include changes to the
data types and lengths of individual fields.
• In your source systems, product package types may be indicated by codes and
names in which the fields are numeric and text data types.
• Again, the lengths of the package types may vary among the different source
systems. It is wise to standardize and change the data type to text to provide
values meaningful to the users.
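A minimal format revision sketch, assuming one hypothetical source stores the package type as a numeric code and another as free-length text; both are standardized to a meaningful text field of one fixed length:

```python
# Hypothetical code table: one source uses numeric package-type codes.
PACKAGE_NAMES = {1: "BOX", 2: "CARTON", 3: "PALLET"}

def revise_format(value, width=10):
    """Change the data type to text and standardize the field length."""
    if isinstance(value, int):          # numeric code from one source system
        value = PACKAGE_NAMES[value]
    text = str(value).strip().upper()   # free-length text from another source
    return text[:width].ljust(width)    # one standard length for the warehouse

print(repr(revise_format(2)))           # 'CARTON    '
print(repr(revise_format("carton ")))   # 'CARTON    '
```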
18. Decoding
• This is also a common type of data transformation. When you deal with
multiple source systems, you are bound to have the same data items
described by a plethora of field values.
• The classic example is the coding for gender, with one source system using 1
and 2 for male and female and another system using M and F.
• Also, many legacy systems use cryptic codes to represent business values.
What do the codes AC, IN, RE, and SU mean in a customer file? You need to
decode all such cryptic codes and change these into values that make sense to
the users.
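A minimal decoding sketch; the decode tables below are invented for illustration (the slide's point is precisely that such meanings must be tracked down and documented):

```python
# Hypothetical decode tables for cryptic source codes.
GENDER_DECODE = {"1": "Male", "2": "Female", "M": "Male", "F": "Female"}
STATUS_DECODE = {"AC": "Active", "IN": "Inactive", "RE": "Regular", "SU": "Suspended"}

record = {"gender": "1", "status": "SU"}  # as extracted from a legacy file
decoded = {
    "gender": GENDER_DECODE[record["gender"]],
    "status": STATUS_DECODE[record["status"]],
}
print(decoded)  # {'gender': 'Male', 'status': 'Suspended'}
```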
19. Calculated and Derived Values
• What if you want to keep profit margin along with sales and cost amounts in
your data warehouse tables?
• The extracted data from the sales system contains sales amounts, sales units,
and operating cost estimates by product. You will have to calculate the total
cost and the profit margin before data can be stored in the data warehouse.
• Calculating the total cost and the profit margin before the data is stored in the data warehouse is an example of a calculated value. You may also want to store the customer's age separately; that would be an example of a derived value.
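A minimal sketch of both cases, with hypothetical field names and figures: profit margin is calculated from extracted sales and cost data, and age is derived from a stored birth date:

```python
from datetime import date

# Calculated values: total cost and profit margin from extracted figures.
sale = {"sales_amount": 1200.0, "sales_units": 40, "unit_cost_estimate": 22.5}
total_cost = sale["sales_units"] * sale["unit_cost_estimate"]
profit_margin = (sale["sales_amount"] - total_cost) / sale["sales_amount"]

# Derived value: age computed from a stored birth date.
birth_date = date(1985, 6, 15)
today = date.today()
age = today.year - birth_date.year - (
    (today.month, today.day) < (birth_date.month, birth_date.day)
)

print(f"total cost {total_cost:.2f}, margin {profit_margin:.1%}, age {age}")
```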
20. Splitting of Single Fields
• In legacy systems, large text fields are generally used to store data such as names (comprising first name, middle initials, and last name) and addresses (city, state, and ZIP code) of customers and employees.
• You mostly need to store individual components in separate fields in your
data warehouse for two good reasons:
• First, you may improve the operating performance by indexing on individual
components.
• Second, your users may need to perform analysis by using individual components such as city, state, and ZIP code.
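A minimal splitting sketch, assuming a hypothetical legacy layout where the whole address sits in one text field:

```python
# Split a single legacy address field into city, state, and ZIP columns
# so each component can be indexed and analyzed separately.
raw_address = "Albany, NY 12207"

city, rest = raw_address.split(",", 1)
state, zip_code = rest.split()

row = {"city": city.strip(), "state": state, "zip": zip_code}
print(row)  # {'city': 'Albany', 'state': 'NY', 'zip': '12207'}
```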
21. Merging of Information
• This type of data transformation does not literally mean the merging of several
fields to create a single field of data.
• For example, information about a product may come from different data
sources. The product code and description may come from one data source.
The relevant package types may be found in another data source. The cost
data may be from yet another source.
• In this case, merging of information denotes the combination of the product
code, description, package types, and cost into a single entity.
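A minimal merging sketch: attributes of the same product arrive from three hypothetical sources and are combined into a single entity keyed by product code:

```python
# Each source contributes different attributes for the same product code.
from_catalog = {"P100": {"description": "Steel bolt 5mm"}}
from_packaging = {"P100": {"package_type": "BOX"}}
from_costing = {"P100": {"unit_cost": 0.12}}

merged = {}
for source in (from_catalog, from_packaging, from_costing):
    for code, attrs in source.items():
        merged.setdefault(code, {"product_code": code}).update(attrs)

print(merged["P100"])
# {'product_code': 'P100', 'description': 'Steel bolt 5mm',
#  'package_type': 'BOX', 'unit_cost': 0.12}
```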
22. Character Set Conversion
• This type of data transformation relates to the conversion of character sets to
an agreed standard character set for textual data in the data warehouse.
• If you have mainframe legacy systems as source systems, the source data from
these systems will be in EBCDIC characters. If PC-based architecture is the
choice for your data warehouse, then you must convert the mainframe EBCDIC
format to the ASCII format.
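A minimal conversion sketch using Python's built-in cp037 codec (US EBCDIC); the sample bytes stand in for a mainframe extract:

```python
# Simulate an EBCDIC extract, then convert it to ASCII for the warehouse.
ebcdic_bytes = "CUSTOMER 42".encode("cp037")   # stand-in for mainframe source data
print(ebcdic_bytes)                            # unreadable if treated as ASCII

text = ebcdic_bytes.decode("cp037")            # EBCDIC bytes -> Python string
ascii_bytes = text.encode("ascii")             # string -> target character set
print(ascii_bytes)                             # b'CUSTOMER 42'
```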
23. Conversion of Units of Measurements
• Many companies today have global branches.
• Measurements in many European countries are in metric units. If your company has overseas operations, you may have to convert the measurements so that the numbers are all in one standard unit of measurement.
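A minimal unit conversion sketch with hypothetical shipment records, standardizing kilograms to pounds:

```python
KG_TO_LB = 2.20462  # conversion factor

shipments = [{"branch": "Berlin", "weight_kg": 120.0},
             {"branch": "Lyon", "weight_kg": 85.5}]

# Standardize every measurement to one unit before loading.
for s in shipments:
    s["weight_lb"] = round(s["weight_kg"] * KG_TO_LB, 2)

print(shipments)
```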
24. Date/Time Conversion
• This type relates to the representation of date and time in standard formats.
Summarization
• This type of transformation is the creation of summaries to be loaded into the data warehouse instead of loading the most granular level of data.
• For example, for a credit card company to analyze sales patterns, it may not
be necessary to store in the data warehouse every single transaction on each
credit card. Instead, you may want to summarize the daily transactions for
each credit card and store the summary data.
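A minimal sketch of both transformations on hypothetical credit card rows: mixed source date formats are standardized to ISO dates, then transactions are summarized to daily totals per card:

```python
from collections import defaultdict
from datetime import datetime

raw = [("4111", "03/15/2024", 25.00),   # (card, transaction date, amount)
       ("4111", "2024-03-15", 40.00),
       ("4222", "15-03-2024", 10.50)]

def to_iso(text):
    """Convert a date in any of the known source formats to ISO 8601."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d", "%d-%m-%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            pass
    raise ValueError(f"unrecognized date: {text}")

# Summarize to one row per card per day instead of loading each transaction.
daily_totals = defaultdict(float)
for card, txn_date, amount in raw:
    daily_totals[(card, to_iso(txn_date))] += amount

print(dict(daily_totals))
# {('4111', '2024-03-15'): 65.0, ('4222', '2024-03-15'): 10.5}
```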
25. Key Restructuring
• While extracting data from your input sources, look at the primary keys of
the extracted records.
• You will have to come up with keys for the fact and dimension tables based
on the keys in the extracted records.
• For example, suppose the product code in an organization is structured to have inherent meaning, such as encoding the warehouse where the product is stored.
• If you use this product code as the primary key, there would be problems.
• If the product is moved to another warehouse, the warehouse part of the
product key will have to be changed.
• When choosing keys for your data warehouse database tables, avoid such
keys with built-in meanings.
• Transform such keys into generic keys generated by the system itself. This is
called key restructuring.
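A minimal key restructuring sketch: a hypothetical "smart" product code is replaced as the primary key by a system-generated surrogate, while the source code is kept as an ordinary attribute:

```python
import itertools

surrogate_seq = itertools.count(1)  # system-generated generic keys
key_map = {}                        # source "smart" key -> surrogate key

def surrogate_for(source_key):
    if source_key not in key_map:
        key_map[source_key] = next(surrogate_seq)
    return key_map[source_key]

row = {"product_key": surrogate_for("WH1-ELEC-0042"),  # generic, no built-in meaning
       "source_product_code": "WH1-ELEC-0042"}         # kept only as an attribute
print(row)                             # {'product_key': 1, ...}
print(surrogate_for("WH1-ELEC-0042"))  # 1: stable on repeated lookups
```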
27. Deduplication
• In many companies, the customer files have several records for the same
customer. Mostly, the duplicates are the result of creating additional records
by mistake.
• In your data warehouse, you need to keep a single record for one customer
and link all the duplicates in the source systems to this single record.
• This process is called deduplication of the customer file.
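A minimal deduplication sketch: one survivor is kept per normalized customer name, and every source record is linked to it. Matching on name alone is a deliberate simplification; real deduplication usually needs richer matching rules:

```python
records = [
    {"id": 101, "name": "Jane Doe "},
    {"id": 102, "name": "JANE DOE"},
    {"id": 103, "name": "John Smith"},
]

survivors = {}  # normalized name -> surviving record id
links = {}      # source record id -> surviving record id
for rec in records:
    key = " ".join(rec["name"].split()).lower()  # normalize before matching
    survivors.setdefault(key, rec["id"])
    links[rec["id"]] = survivors[key]

print(links)  # {101: 101, 102: 101, 103: 103}: duplicates point to one record
```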
28. DATA LOADING
• The next major set of functions consists of the ones that take the prepared
data, apply it to the data warehouse, and store it in the database there.
• Initial Load—populating all the data warehouse tables for the very first time
• Incremental Load—applying ongoing changes as necessary in a periodic
manner
• Full Refresh—completely erasing the contents of one or more tables and
reloading with fresh data (initial load is a refresh of all the tables).
• During the loads, the data warehouse has to be offline.
• You need to find a window of time when the loads may be scheduled without
affecting your data warehouse users.
29. • Therefore, consider dividing up the whole load process into smaller chunks
and populating a few files at a time. This will give you two benefits.
• You may be able to run the smaller loads in parallel. Also, you might be
able to keep some parts of the data warehouse up and running while
loading the other parts.
• It is hard to estimate the running times of the loads, especially the initial
load or a complete refresh. Do test loads to verify the correctness and to
estimate the running times.
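A minimal sketch of the chunked, parallel approach: each chunk (here one per table) is loaded by an independent worker, and the elapsed time of a test run helps estimate the real load window. The load_chunk body is a stand-in for the actual database work:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_chunk(name, rows):
    time.sleep(0.1)  # stand-in for the real insert work
    return f"{name}: loaded {len(rows)} rows"

# One chunk per table; independent chunks can run in parallel,
# and the untouched parts of the warehouse can stay online.
chunks = {"customers": range(1000), "products": range(200), "sales": range(5000)}

start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    for line in pool.map(lambda item: load_chunk(*item), chunks.items()):
        print(line)
print(f"elapsed {time.perf_counter() - start:.2f}s")  # test-load timing estimate
```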
30. Modes for applying data
1. Load
2. Append
3. Destructive merge
4. Constructive merge
31. Load
• If the target table to be loaded already exists and data exists in the
table, the load process wipes out the existing data and applies the
data from the incoming file.
• If the table is already empty before loading, the load process simply
applies the data from the incoming file.
32. Append
• You may think of the append as an extension of the load.
• If data already exists in the table, the append process unconditionally
adds the incoming data, preserving the existing data in the target
table.
• When an incoming record is a duplicate of an already existing record,
you may define how to handle an incoming duplicate.
• The incoming record may be allowed to be added as a duplicate, or
the incoming duplicate record may be rejected during the append
process.
33. Destructive Merge
• In this mode, you apply the incoming data to the target data. If the
primary key of an incoming record matches with the key of an existing
record, update the matching target record.
• If the incoming record is a new record without a match with any
existing record, add the incoming record to the target table.
34. Constructive Merge
• This mode is slightly different from the destructive merge.
• If the primary key of an incoming record matches with the key of an
existing record, leave the existing record, add the incoming record, and
mark the added record as superseding the old record.
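A minimal sketch of all four modes over a dict standing in for a warehouse table (keys are primary keys, values are records); the superseding-record convention in the constructive merge is one possible representation, not a fixed standard:

```python
def load(target, incoming):
    return dict(incoming)            # existing data is wiped; incoming applied

def append(target, incoming, reject_duplicates=True):
    out = dict(target)               # existing data is preserved
    for key, value in incoming:
        if key in out and reject_duplicates:
            continue                 # policy decision: reject or allow duplicates
        out[key] = value
    return out

def destructive_merge(target, incoming):
    out = dict(target)
    out.update(incoming)             # key match -> update; no match -> add
    return out

def constructive_merge(target, incoming):
    out = dict(target)
    for key, value in incoming:
        if key in out:               # keep the old record, add a superseding one
            out[f"{key}#new"] = {**value, "supersedes": key}
        else:
            out[key] = value
    return out

table = {"C1": {"state": "NY"}}
print(destructive_merge(table, [("C1", {"state": "CA"})]))
# {'C1': {'state': 'CA'}}
print(constructive_merge(table, [("C1", {"state": "CA"})]))
# {'C1': {'state': 'NY'}, 'C1#new': {'state': 'CA', 'supersedes': 'C1'}}
```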
37. ETL versus ELT
• "Data Lakes" are special kinds of data stores that—unlike OLAP data
warehouses—accept any kind of structured or unstructured data.
• Data lakes don't require you to transform your data before loading it. You can
immediately load any kind of raw information into a data lake, no matter the
format or lack thereof.
• Data transformation is still necessary before analyzing the data with a
business intelligence platform. However, data cleansing, enrichment, and
transformation occur after loading the data into the data lake.
38. • ELT is a relatively new approach, made possible by modern, cloud-based server technologies.
• Cloud-based data warehouses, such as Amazon Redshift and Google BigQuery, offer near-endless storage capabilities and scalable processing power.
• ELT paired with a data lake lets you ingest an ever-expanding pool of raw data
immediately, as it becomes available. There's no requirement to transform the
data into a special format before saving it in the data lake.
• ELT transforms only the data you need: only the data required for a particular analysis is transformed.
• Although this can slow down the process of analyzing the data, it offers more flexibility, because you can transform the data in different ways on the fly to produce different types of metrics, forecasts, and reports.
39. • ELT is less reliable than ETL.
• It's important to note that the tools and systems of ELT are still evolving, so they're not as reliable as ETL paired with an OLAP database.
• Although it takes more effort to set up, ETL provides more accurate insights when dealing with massive pools of data.