Introduction To
Data Analysis, Storage, & Processing Solutions
Data analysis and data analytics solutions
Generally, analysis is the examination of something in order to understand
its nature or determine its essential features.
Specifically, data analysis is the process of compiling, processing, and
analyzing data so that you can use it to make informed decisions.
Analytics is the systematic analysis of data.
Data analytics is the specific analytical process applied during analysis.
Why use data analytics? 🤔
It’s simple: to stop making business (or any other) decisions based on
intuition alone and to start basing them on data. Other specific use cases are:
● Customer Personalization
● Fraud Detection
● Security Threat Detection
● User Behaviour
● Financial modeling and forecasting
...and many more
Components of a data analytics solution
Steps of a data analysis solution
1. Get the data [Collect, Store]: Know where your data comes
from.
2. Discover and analyze your data [Analyze/Process]: Know the
options for processing your data
3. Visualize and learn from your data [Consume/Visualize]: Know
what you need to learn from the data
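These three steps can be sketched end to end as a tiny pipeline. The sketch below is a minimal illustration with made-up in-memory records; in a real solution each function would be backed by the AWS services covered later:

```python
# Toy end-to-end data analysis pipeline (hypothetical records).
def collect():
    # 1. Get the data [Collect, Store]
    return [{"product": "A", "sales": 100}, {"product": "B", "sales": 250}]

def process(records):
    # 2. Discover and analyze your data [Analyze/Process]
    return sum(r["sales"] for r in records)

def visualize(total):
    # 3. Visualize and learn from your data [Consume/Visualize]
    return f"Total sales: {total}"

report = visualize(process(collect()))
print(report)   # Total sales: 350
```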
Challenges of data analytics 😑
Knowledge check
Scenario
My business has a set of 15 JSON data files that are
each about 2.5 GB in size. They are placed on a file
server once an hour. They must be ingested as soon
as they arrive in this location. This data must be
combined with all transactions from the financial
dashboard for this same period, then compared to the
recommendations from the marketing engine. All data
is fully cleansed. The results from this time period
must be made available to decision makers by 10
minutes after the hour in the form of financial
dashboards.
Based on the scenario, which of
the following Vs pose a
challenge for this business?
● Volume
● Velocity
● Variety
● Veracity
● Value
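Before answering, it helps to put the scenario's numbers side by side. A quick back-of-the-envelope calculation (numbers taken directly from the scenario) shows how much data must move through the system, and how little time there is to do it:

```python
# Back-of-the-envelope check of the scenario's numbers.
files_per_hour = 15
gb_per_file = 2.5
total_gb = files_per_hour * gb_per_file        # 37.5 GB arriving each hour

deadline_min = 10                              # dashboards due 10 minutes after the hour
required_throughput = total_gb / deadline_min  # GB that must be handled per minute

print(f"{total_gb} GB/hour; ~{required_throughput} GB/minute to meet the deadline")
```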
Volume - data storage
When businesses have more data than they are able to process and analyze,
they have a volume problem.
Classification of data source types:
● Structured data
● Semistructured data
● Unstructured data
Unstructured data is every file we store, every picture we take, and every email we send.
Introduction to
Amazon S3
Amazon S3 is object
storage built to store and
retrieve any amount of
data from anywhere.
It is the perfect place to
store your semistructured
and unstructured data
over the internet.
Amazon S3 concepts
How does S3 store your data?
- Amazon S3 stores data as objects within buckets.
How do you access your content?
- Through its object key
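The bucket/key model can be pictured as a dictionary of dictionaries. The sketch below is a pure-Python toy, not the real S3 API (with the boto3 SDK, the equivalent calls are `put_object` and `get_object`); the bucket and key names are made up:

```python
# Toy model of S3's storage concept: objects live in buckets,
# and each object is addressed by its unique object key.
buckets = {}

def put_object(bucket, key, body):
    # Store an object's bytes under its key within a bucket.
    buckets.setdefault(bucket, {})[key] = body

def get_object(bucket, key):
    # Retrieve an object by bucket name and object key.
    return buckets[bucket][key]

put_object("my-analytics-bucket", "raw/2024/01/sales.json", b'{"total": 42}')
print(get_object("my-analytics-bucket", "raw/2024/01/sales.json"))
```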
Data Analysis solution on Amazon S3
● Decoupling of storage from compute and data processing
● Centralized data architecture
● Integration with clusterless and serverless AWS services
● Standardized Application Programming Interfaces (APIs)
Knowledge Check
Which of the following elements does an Amazon S3 object URL contain?
● Object key
● Bucket
● User key
● Access token
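The check above can be verified by pulling an S3 object URL apart. The sketch below assumes a virtual-hosted-style URL (bucket name in the hostname) and uses a made-up bucket and key; only the bucket and the object key are present — no user key, no access token:

```python
from urllib.parse import urlparse

# Hypothetical virtual-hosted-style S3 object URL.
url = "https://amzn-s3-demo-bucket.s3.us-east-1.amazonaws.com/photos/cat.jpg"

parsed = urlparse(url)
bucket = parsed.netloc.split(".")[0]   # hostname begins with the bucket name
key = parsed.path.lstrip("/")          # the path after the host is the object key

print(bucket, key)   # amzn-s3-demo-bucket photos/cat.jpg
```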
Introduction to
data lakes
● A data lake is a centralized repository that allows you to store
structured, semistructured, and unstructured data at any scale.
Benefits of data lakes
● Single source of truth
● Store any type of data, regardless of structure
● Can be analyzed using Artificial Intelligence and Machine
Learning
Introduction to data storage methods
- Data Lakes
- Data Warehouse
Data Warehouse
A data warehouse is a central repository of structured data from many data
sources. This data is transformed, aggregated, and prepared for business
reporting and analysis.
Data Marts
A data mart is generally a copy of a subset of the data already contained in a
data warehouse, limited to the tables most relevant to a particular group of
analytics users. This makes data marts fast and simple to implement.
Traditional data warehousing: pros and cons
Pros:
● Fast data retrieval
● Curated data sets
● Centralized storage
● Better business intelligence
Cons:
● Costly to implement
● Maintenance can be challenging
● Security concerns
● Hard to scale to meet demand
Amazon Redshift
It is a cloud-based, scalable, secure environment for your data warehouse.
Benefits of Amazon Redshift
● Faster performance: 10x faster than other data warehouses
● Easy to set up, deploy, and manage
● Secure
● Scales quickly to meet your needs
Data storage at scale
We have discussed several recommendations for storing data:
● When storing individual objects or files, AWS recommends Amazon S3.
● When storing massive volumes of data, both semistructured and
unstructured, AWS recommends building a data lake on Amazon S3.
● When storing massive amounts of structured data for complex analysis,
AWS recommends storing your data in Amazon Redshift.
Apache Hadoop
Hadoop uses a distributed processing architecture, in which a task is mapped to a cluster of
commodity servers for processing.
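Hadoop's map/reduce idea can be illustrated with a toy word count: the work is split into map steps that could run on many commodity servers, and the partial results are merged in a reduce step. This is a pure-Python sketch of the concept, not Hadoop itself, and the documents are made up:

```python
from collections import Counter
from functools import reduce

documents = [
    "big data needs big storage",
    "storage and processing of big data",
]

def map_step(doc):
    # Each "node" counts words in its own shard of the data.
    return Counter(doc.split())

def reduce_step(left, right):
    # Partial counts from the nodes are merged into one result.
    return left + right

word_counts = reduce(reduce_step, map(map_step, documents))
print(word_counts["big"])   # 3
```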
Velocity - data processing
When businesses need rapid insights from the data they are collecting, but the systems
in place simply cannot meet the need, there's a velocity problem.
Data processing means the collection and manipulation of data to produce meaningful
information. Data processing is divided into two parts: batch processing and
stream processing.
Introduction to data processing methods
Introduction to batch data processing
Batch processing is the execution of a series of programs, or jobs, on one or
more computers without manual intervention.
Batch processing architecture
● Amazon EMR - It is used for processing vast amounts of data and performs
extract, transform, and load (ETL) operations.
● AWS Glue - It is used for processing vast amounts of data. It helps us with
data discovery, conversion, mapping, and job scheduling.
● AWS Lambda - It is a serverless compute service that runs your code in
response to events and automatically manages the underlying compute
resources for you.
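The batch pattern itself is simple: records accumulate asynchronously, and the whole batch is processed without manual intervention once a condition is met. The sketch below is a minimal illustration with made-up numbers and a size-based trigger (in practice the trigger is often a scheduled time of day, and the processing step would run on a service like Amazon EMR or AWS Glue):

```python
# Minimal sketch of batch processing.
batch = []
results = []

def process_batch(records):
    # Stand-in for a heavy ETL job run over the whole batch.
    return sum(records)

def ingest(record, batch_size=5):
    batch.append(record)
    if len(batch) >= batch_size:          # condition that triggers the job
        results.append(process_batch(batch))
        batch.clear()

for value in range(1, 11):                # ten incoming records
    ingest(value)

print(results)   # [15, 40] -> two batches of five records each
```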
Batch processing architecture
Basic batch processing architecture using Amazon EMR
Basic batch processing architecture using AWS Glue
Introduction to stream data processing
Stream processing is the collection and processing of a constant stream of
data.
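In contrast to batch processing, each record in a stream is handled as it arrives, so the system can react within moments of collection. The sketch below is a toy illustration with made-up sensor readings and a hypothetical alert threshold (in AWS, the stream itself would be carried by a service like Amazon Kinesis):

```python
# Minimal sketch of stream processing: act on each record on arrival.
def sensor_stream():
    # Stand-in for a constant stream of incoming sensor readings.
    for reading in [21.0, 22.5, 19.8, 35.2, 20.1]:
        yield reading

alerts = []
for reading in sensor_stream():
    if reading > 30.0:                 # react immediately to new information
        alerts.append(reading)

print(alerts)   # [35.2]
```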
Benefits of stream processing
Stream processing architecture
1. Amazon Kinesis - It makes it easy to collect, process, and analyze
real-time streaming data so you can get timely insights and react quickly
to new information.
2. Amazon Athena - It is used for querying data directly in Amazon S3.
3. Amazon QuickSight - It is used to produce insightful dashboards and
reports.
How does stream processing take place?
Stream processing architecture
Combined processing architecture
Thank you!!!


Editor's Notes

  • #2 Hadoop Video:
  • #3 <<You may be wondering why there are two different topics, analysis and analytics, as they sound so similar>> <<My story with this>>
  • #5 A data analysis solution has many components. The analytics performed in each of these components may require different services and different approaches.
  • #6 Data analysis solutions incorporate many forms of analytics to store, process, and visualize data. Planning a data analysis solution begins with knowing what you need out of that solution, i.e., looking at the big picture. What does the existing solution look like? What is the end result of the model's output?
  • #12 An object key is the unique identifier for an object in the bucket. There is no user key or access token built into the URL itself.
  • #14 Every Amazon S3 object URL contains the bucket and object key for the item. An object key is the unique identifier for an object in the bucket. There is no user key or access token built into the URL itself.
  • #15 In the last topic, we discussed data storage and Amazon S3. Now it’s time to discuss how the data is organized in this service. Amazon S3 is an amazing object container. Like any bucket, you can put content in it in a neat and orderly fashion, or you can just dump it in. But no matter how the data gets there, once it’s there, you need a way to organize it in a meaningful way so you can find it when you need it.
  • #16 Need to add Knowledge Check section. Think about it.
  • #17 As the volume of data has increased, so have the options for storing data. Traditional storage methods such as data warehouses are still very popular and relevant. However, data lakes have become more popular recently. These new options can confuse businesses that are trying to be financially wise and technically relevant. So which is better: data warehouses or data lakes? Neither and both. They are different solutions that can be used together to maintain existing data warehouses while taking full advantage of the benefits of data lakes.
  • #18 A data warehouse is a central repository of information coming from one or more data sources. Data flows into a data warehouse from transactional systems, relational databases, and other sources. These data sources can include structured, semistructured, and unstructured data. These data sources are transformed into structured data before they are stored in the data warehouse. Data is stored within the data warehouse using a schema. A schema defines how data is stored within tables, columns, and rows.... Business analysts, data scientists, and decision makers access the data through business intelligence (BI) tools, SQL clients, and other analytics applications....
  • #19 Data warehouses can be massive. Analyzing these huge stores of data can be confusing. Many organizations need a way to limit the tables to those that are most relevant to the analytics users will be performing. Because data marts are generally a copy of data already contained in a data warehouse, they are often fast and simple to implement.
  • #21 Amazon Redshift Spectrum: works like data lake Video source: https://www.youtube.com/watch?v=_qKm6o1zK3U AWS Data warehouse
  • #23 Each of the AWS processing services we will cover in the next lesson incorporate a temporary storage layer that houses data while it is being processed and analyzed. This data is eventually moved to permanent storage within one of the other solutions we have already discussed. Apache Hadoop can consume data from an Amazon S3 data lake and process it in batches, scripted, or real-time. Hadoop can analyze for AI or machine learning. When many people think of working with a massive volume of fast-moving data, the first thing that comes to mind is Hadoop. Within AWS, Hadoop frameworks are implemented using Amazon EMR and AWS Glue.
  • #25 Scheduled batch processing represents data that is processed in a very large volume on a regularly scheduled basis. For instance, once a week or once a day. It is generally the same amount of data with each load, making these workloads predictable. Periodic batch processing is a batch of data that is processed at irregular times. These workloads are often run once a certain amount of data has been collected. This can make them unpredictable and hard to plan around. Near Real-time processing represents streaming data that is processed in small individual batches. The batches are continuously collected and processed within minutes of the data generation. Real-time processing represents streaming data that is processed in very small individual batches. The batches are continuously collected and processed within milliseconds of the data generation.
  • #26 Data is collected into batches asynchronously. The batch is sent to a processing system when specific conditions are met, such as a specified time of day. The results of the processing job are then sent to a storage location that can be queried later as needed
  • #28 Batch processing can be performed in different ways using AWS services. The first architecture diagram depicts the components and the data flow of a basic batch analytics system using a traditional approach. This approach uses Amazon S3 for storing data, AWS Lambda for intermediate file-level ETL, Amazon EMR for aggregated ETL (the heavy-lifting, consolidated transformation, and loading engine), and Amazon Redshift as the data warehouse hosting the data needed for reporting. The second diagram depicts the same data flow but uses AWS Glue for aggregated ETL. AWS Glue is a fully managed service, as opposed to Amazon EMR, which requires management and configuration of all of the components within the service. It helps us with data discovery, conversion, mapping, and job scheduling. In simple words: it simplifies data processing.
  • #29 Stream data processing gives companies the ability to get insights from their data within seconds of the data being collected.
  • #30 Consuming data in parallel allows multiple users to work simultaneously on the same data.
  • #31 In this architecture, sensor data is being collected in the form of a stream. The streaming data is being collected from the sensor devices by Amazon Kinesis Data Firehose. This service is configured to send the data to be processed using Amazon Kinesis Data Analytics. This service filters the data for relevant records and send the data into another Kinesis Data Firehose process, which places the results into an Amazon S3 bucket at the serving layer. Using Amazon Athena, the data in the Amazon S3 bucket can now be queried to produce insightful dashboards and reports using Amazon QuickSight.
  • #33 In this architecture, sensor data is being collected in the form of a stream. The streaming data is being collected from the sensor devices by Amazon Kinesis Data Firehose. This service is configured to send the data to be processed using Amazon Kinesis Data Analytics. This service filters the data for relevant records and send the data into another Kinesis Data Firehose process, which places the results into an Amazon S3 bucket at the serving layer. Using Amazon Athena, the data in the Amazon S3 bucket can now be queried to produce insightful dashboards and reports using Amazon QuickSight.