Introduction To
Data Analysis, Storage, & Processing Solutions
Data analysis and data analytics solutions
Generally, analysis is the examination of something in order to understand
its nature or determine its essential features.
Specifically, data analysis is the process of compiling, processing, and
analyzing data so that you can use it to make informed decisions.
Analytics is the systematic analysis of data.
Data analytics is the specific analytical process applied during analysis.
Why use data analytics? 🤔
It’s simple: to stop making business (or any other) decisions based on
intuition alone and to start basing them on data. Other specific use cases are:
● Customer Personalization
● Fraud Detection
● Security Threat Detection
● User Behaviour
● Financial modeling and forecasting
...and many more
Components of a data analytics solution
Steps of a data analysis solution
1. Get the data [Collect, Store]: Know where your data comes
from.
2. Discover and analyze your data [Analyze/Process]: Know the
options for processing your data
3. Visualize and learn from your data [Consume/Visualize]: Know
what you need to learn from the data
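These three steps can be sketched end to end as a tiny pipeline. The sketch below is a minimal illustration with made-up in-memory records; in a real solution each function would be backed by the AWS services covered later:

```python
# Toy end-to-end data analysis pipeline (hypothetical records).
def collect():
    # 1. Get the data [Collect, Store]
    return [{"product": "A", "sales": 100}, {"product": "B", "sales": 250}]

def process(records):
    # 2. Discover and analyze your data [Analyze/Process]
    return sum(r["sales"] for r in records)

def visualize(total):
    # 3. Visualize and learn from your data [Consume/Visualize]
    return f"Total sales: {total}"

report = visualize(process(collect()))
print(report)   # Total sales: 350
```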
Challenges of data analytics 😑
Knowledge check
Scenario
My business has a set of 15 JSON data files that are
each about 2.5 GB in size. They are placed on a file
server once an hour. They must be ingested as soon
as they arrive in this location. This data must be
combined with all transactions from the financial
dashboard for this same period, then compared to the
recommendations from the marketing engine. All data
is fully cleansed. The results from this time period
must be made available to decision makers by 10
minutes after the hour in the form of financial
dashboards.
Based on the scenario, which of
the following Vs pose a
challenge for this business?
● Volume
● Velocity
● Variety
● Veracity
● Value
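Before answering, it helps to put the scenario's numbers side by side. A quick back-of-the-envelope calculation (numbers taken directly from the scenario) shows how much data must move through the system, and how little time there is to do it:

```python
# Back-of-the-envelope check of the scenario's numbers.
files_per_hour = 15
gb_per_file = 2.5
total_gb = files_per_hour * gb_per_file        # 37.5 GB arriving each hour

deadline_min = 10                              # dashboards due 10 minutes after the hour
required_throughput = total_gb / deadline_min  # GB that must be handled per minute

print(f"{total_gb} GB/hour; ~{required_throughput} GB/minute to meet the deadline")
```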
Volume - data storage
When businesses have more data than they are able to process and analyze,
they have a volume problem.
Classification of data source types:
● Structured data
● Semistructured data
● Unstructured data
Unstructured data is every file we store, every picture we take, and every email we send.
Introduction to
Amazon S3
Amazon S3 is object
storage built to store and
retrieve any amount of
data from anywhere.
It is the perfect place to
store your semistructured
and unstructured data
over the internet.
Amazon S3 concepts
How does S3 store your data?
- Amazon S3 stores data as objects within buckets.
How do you access your content?
- Through its object key
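The bucket/key model can be pictured as a dictionary of dictionaries. The sketch below is a pure-Python toy, not the real S3 API (with the boto3 SDK, the equivalent calls are `put_object` and `get_object`); the bucket and key names are made up:

```python
# Toy model of S3's storage concept: objects live in buckets,
# and each object is addressed by its unique object key.
buckets = {}

def put_object(bucket, key, body):
    # Store an object's bytes under its key within a bucket.
    buckets.setdefault(bucket, {})[key] = body

def get_object(bucket, key):
    # Retrieve an object by bucket name and object key.
    return buckets[bucket][key]

put_object("my-analytics-bucket", "raw/2024/01/sales.json", b'{"total": 42}')
print(get_object("my-analytics-bucket", "raw/2024/01/sales.json"))
```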
Data Analysis solution on Amazon S3
● Decoupling of storage from compute and data processing
● Centralized data architecture
● Integration with clusterless and serverless AWS services
● Standardized Application Programming Interfaces (APIs)
Knowledge Check
Which of the following elements does an Amazon S3 object URL contain?
● Object key
● Bucket
● User key
● Access token
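The check above can be verified by pulling an S3 object URL apart. The sketch below assumes a virtual-hosted-style URL (bucket name in the hostname) and uses a made-up bucket and key; only the bucket and the object key are present — no user key, no access token:

```python
from urllib.parse import urlparse

# Hypothetical virtual-hosted-style S3 object URL.
url = "https://amzn-s3-demo-bucket.s3.us-east-1.amazonaws.com/photos/cat.jpg"

parsed = urlparse(url)
bucket = parsed.netloc.split(".")[0]   # hostname begins with the bucket name
key = parsed.path.lstrip("/")          # the path after the host is the object key

print(bucket, key)   # amzn-s3-demo-bucket photos/cat.jpg
```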
Introduction to
data lakes
● A data lake is a centralized repository that allows you to store
structured, semistructured, and unstructured data at any scale.
Benefits of data lakes
● Single source of truth
● Store any type of data, regardless of structure
● Can be analyzed using Artificial Intelligence and Machine
Learning
Introduction to data storage methods
- Data Lakes
- Data Warehouse
Data Warehouse
A data warehouse is a central repository of structured data from many data
sources. This data is transformed, aggregated, and prepared for business
reporting and analysis.
Data Marts
A data mart is generally a copy of a subset of the data already contained in a
data warehouse, limited to the tables most relevant to a particular group of
analytics users. This makes data marts fast and simple to implement.
Traditional data warehousing: pros and cons
Pros:
● Fast data retrieval
● Curated data sets
● Centralized storage
● Better business intelligence
Cons:
● Costly to implement
● Maintenance can be challenging
● Security concerns
● Hard to scale to meet demand
Amazon Redshift
It is a cloud-based, scalable, secure environment for your data warehouse.
Benefits of Amazon Redshift
● Faster performance: 10x faster than other data warehouses
● Easy to set up, deploy, and manage
● Secure
● Scales quickly to meet your needs
Data storage at scale
We have discussed several recommendations for storing data:
● When storing individual objects or files, AWS recommends Amazon S3.
● When storing massive volumes of data, both semistructured and
unstructured, AWS recommends building a data lake on Amazon S3.
● When storing massive amounts of structured data for complex analysis,
AWS recommends storing your data in Amazon Redshift.
Apache Hadoop
Hadoop uses a distributed processing architecture, in which a task is mapped to a cluster of
commodity servers for processing.
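Hadoop's map/reduce idea can be illustrated with a toy word count: the work is split into map steps that could run on many commodity servers, and the partial results are merged in a reduce step. This is a pure-Python sketch of the concept, not Hadoop itself, and the documents are made up:

```python
from collections import Counter
from functools import reduce

documents = [
    "big data needs big storage",
    "storage and processing of big data",
]

def map_step(doc):
    # Each "node" counts words in its own shard of the data.
    return Counter(doc.split())

def reduce_step(left, right):
    # Partial counts from the nodes are merged into one result.
    return left + right

word_counts = reduce(reduce_step, map(map_step, documents))
print(word_counts["big"])   # 3
```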
Velocity - data processing
When businesses need rapid insights from the data they are collecting, but the systems
in place simply cannot meet the need, there's a velocity problem.
Data processing means the collection and manipulation of data to produce meaningful
information. Data processing is divided into two parts: batch processing and
stream processing.
Introduction to data processing methods
Introduction to batch data processing
Batch processing is the execution of a series of programs, or jobs, on one or
more computers without manual intervention.
Batch processing architecture
● Amazon EMR - It is used for processing vast amounts of data and performs
extract, transform, and load (ETL) operations.
● AWS Glue - It is used for processing vast amounts of data. It helps us with
data discovery, conversion, mapping, and job scheduling.
● AWS Lambda - It is a serverless compute service that runs your code in
response to events and automatically manages the underlying compute
resources for you.
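The batch pattern itself is simple: records accumulate asynchronously, and the whole batch is processed without manual intervention once a condition is met. The sketch below is a minimal illustration with made-up numbers and a size-based trigger (in practice the trigger is often a scheduled time of day, and the processing step would run on a service like Amazon EMR or AWS Glue):

```python
# Minimal sketch of batch processing.
batch = []
results = []

def process_batch(records):
    # Stand-in for a heavy ETL job run over the whole batch.
    return sum(records)

def ingest(record, batch_size=5):
    batch.append(record)
    if len(batch) >= batch_size:          # condition that triggers the job
        results.append(process_batch(batch))
        batch.clear()

for value in range(1, 11):                # ten incoming records
    ingest(value)

print(results)   # [15, 40] -> two batches of five records each
```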
Batch processing architecture
Basic batch processing architecture using Amazon EMR
Basic batch processing architecture using AWS Glue
Introduction to stream data processing
Stream processing is the collection and processing of a constant stream of
data.
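In contrast to batch processing, each record in a stream is handled as it arrives, so the system can react within moments of collection. The sketch below is a toy illustration with made-up sensor readings and a hypothetical alert threshold (in AWS, the stream itself would be carried by a service like Amazon Kinesis):

```python
# Minimal sketch of stream processing: act on each record on arrival.
def sensor_stream():
    # Stand-in for a constant stream of incoming sensor readings.
    for reading in [21.0, 22.5, 19.8, 35.2, 20.1]:
        yield reading

alerts = []
for reading in sensor_stream():
    if reading > 30.0:                 # react immediately to new information
        alerts.append(reading)

print(alerts)   # [35.2]
```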
Benefits of stream processing
Stream processing architecture
1. Amazon Kinesis - It makes it easy to collect, process, and analyze
real-time streaming data so you can get timely insights and react quickly
to new information.
2. Amazon Athena - It is used for querying data directly in Amazon S3.
3. Amazon QuickSight - It is used to produce insightful dashboards and
reports.
How does stream processing take place?
Stream processing architecture
Combined processing architecture
Thank you!!!


Editor's Notes

  • #2 Hadoop Video:
  • #3 <<You may be wondering why there are two different topics, analysis and analytics, as they sound so similar>> <<My story with this>>
  • #5 A data analysis solution has many components. The analytics performed in each of these components may require different services and different approaches.
  • #6 Data analysis solutions incorporate many forms of analytics to store, process, and visualize data. Planning a data analysis solution begins with knowing what you need out of that solution, i.e., looking at the big picture. What does the existing solution look like? What is the end result of the model's output?
  • #12 An object key is the unique identifier for an object in the bucket. There is no user key or access token built into the URL itself.
  • #14 Every Amazon S3 object URL contains the bucket and object key for the item. An object key is the unique identifier for an object in the bucket. There is no user key or access token built into the URL itself.
  • #15 In the last topic, we discussed data storage and Amazon S3. Now it’s time to discuss how the data is organized in this service. Amazon S3 is an amazing object container. Like any bucket, you can put content in it in a neat and orderly fashion, or you can just dump it in. But no matter how the data gets there, once it’s there, you need a way to organize it in a meaningful way so you can find it when you need it.
  • #16 Need to add Knowledge Check section. Think about it.
  • #17 As the volume of data has increased, so have the options for storing data. Traditional storage methods such as data warehouses are still very popular and relevant. However, data lakes have become more popular recently. These new options can confuse businesses that are trying to be financially wise and technically relevant. So which is better: data warehouses or data lakes? Neither and both. They are different solutions that can be used together to maintain existing data warehouses while taking full advantage of the benefits of data lakes.
  • #18 A data warehouse is a central repository of information coming from one or more data sources. Data flows into a data warehouse from transactional systems, relational databases, and other sources. These data sources can include structured, semistructured, and unstructured data. These data sources are transformed into structured data before they are stored in the data warehouse. Data is stored within the data warehouse using a schema. A schema defines how data is stored within tables, columns, and rows.... Business analysts, data scientists, and decision makers access the data through business intelligence (BI) tools, SQL clients, and other analytics applications....
  • #19 Data warehouses can be massive. Analyzing these huge stores of data can be confusing. Many organizations need a way to limit the tables to those that are most relevant to the analytics users will be performing. Because data marts are generally a copy of data already contained in a data warehouse, they are often fast and simple to implement.
  • #21 Amazon Redshift Spectrum: works like data lake Video source: https://www.youtube.com/watch?v=_qKm6o1zK3U AWS Data warehouse
  • #23 Each of the AWS processing services we will cover in the next lesson incorporate a temporary storage layer that houses data while it is being processed and analyzed. This data is eventually moved to permanent storage within one of the other solutions we have already discussed. Apache Hadoop can consume data from an Amazon S3 data lake and process it in batches, scripted, or real-time. Hadoop can analyze for AI or machine learning. When many people think of working with a massive volume of fast-moving data, the first thing that comes to mind is Hadoop. Within AWS, Hadoop frameworks are implemented using Amazon EMR and AWS Glue.
  • #25 Scheduled batch processing represents data that is processed in a very large volume on a regularly scheduled basis. For instance, once a week or once a day. It is generally the same amount of data with each load, making these workloads predictable. Periodic batch processing is a batch of data that is processed at irregular times. These workloads are often run once a certain amount of data has been collected. This can make them unpredictable and hard to plan around. Near Real-time processing represents streaming data that is processed in small individual batches. The batches are continuously collected and processed within minutes of the data generation. Real-time processing represents streaming data that is processed in very small individual batches. The batches are continuously collected and processed within milliseconds of the data generation.
  • #26 Data is collected into batches asynchronously. The batch is sent to a processing system when specific conditions are met, such as a specified time of day. The results of the processing job are then sent to a storage location that can be queried later as needed
  • #28 Batch processing can be performed in different ways using AWS services. The first architecture diagram depicts the components and the data flow of a basic batch analytics system using a traditional approach. This approach uses Amazon S3 for storing data, AWS Lambda for intermediate file-level ETL, Amazon EMR for aggregated ETL (the heavy-lifting, consolidated transformation, and loading engine), and Amazon Redshift as the data warehouse hosting the data needed for reporting. The second diagram depicts the same data flow but uses AWS Glue for aggregated ETL. AWS Glue is a fully managed service, as opposed to Amazon EMR, which requires management and configuration of all of the components within the service. It helps us with data discovery, conversion, mapping, and job scheduling. In simple words: it simplifies data processing.
  • #29 Stream data processing gives companies the ability to get insights from their data within seconds of the data being collected.
  • #30 Consuming data in parallel allows multiple users to work simultaneously on the same data.
  • #31 In this architecture, sensor data is being collected in the form of a stream. The streaming data is being collected from the sensor devices by Amazon Kinesis Data Firehose. This service is configured to send the data to be processed using Amazon Kinesis Data Analytics. This service filters the data for relevant records and send the data into another Kinesis Data Firehose process, which places the results into an Amazon S3 bucket at the serving layer. Using Amazon Athena, the data in the Amazon S3 bucket can now be queried to produce insightful dashboards and reports using Amazon QuickSight.
  • #33 In this architecture, sensor data is being collected in the form of a stream. The streaming data is being collected from the sensor devices by Amazon Kinesis Data Firehose. This service is configured to send the data to be processed using Amazon Kinesis Data Analytics. This service filters the data for relevant records and send the data into another Kinesis Data Firehose process, which places the results into an Amazon S3 bucket at the serving layer. Using Amazon Athena, the data in the Amazon S3 bucket can now be queried to produce insightful dashboards and reports using Amazon QuickSight.