1. Handling and Processing Big
Data
Big Data & IoT
Lecture #2
Umair Shafique (21015956-003)
Scholar MS Information Technology - University of Gujrat
2. Recap
• What is Big Data?
• Why Is Big Data Important?
• Big Data Analytics
• Benefits of Big Data Analytics
• Types of Big Data
• Characteristics of Big Data
• Source of Big Data
• Big Data Tools and Software
3. What is Big Data?
• Big data is one of the biggest buzzwords in technology today, and
with good reason.
• Basically, big data is data that is generated in high volume, variety,
and velocity. Many other concepts, theories, and facts surround
big data and its popularity.
4.
5. What Is Big Data?
• In simple words, big data is defined as
mass amounts of data that may
involve complex, unstructured data,
as well as semi-structured data.
• Previously, it was too difficult to
interpret huge data accurately and
efficiently with traditional database
management systems. But big data
tools like Apache Hadoop and Apache
Spark make it easier. For example, a
human genome, which took about ten
years to process, can now be
processed in just about one week.
6. How Big Is Big Data?
• There is no fixed threshold that qualifies data as big data, but the term
generally refers to volumes in the petabyte and exabyte range. It includes
vast amounts of data gathered from a given company, its customers,
its channel partners, and suppliers, as well as external data sources.
• Big data analytics is the often complex process of examining big data
to uncover information -- such as hidden patterns, correlations,
market trends and customer preferences -- that can help
organizations make informed business decisions.
8. Handling and Processing Big Data
• Big Data management is the systematic organization, administration as well as
governance of massive amounts of data.
• The process includes management of both unstructured and structured data.
• The primary objective is to ensure the data is of high quality and accessible for
business intelligence along with big data analytics applications.
• To contend with the rapidly growing data pools, government agencies,
corporations and other large organizations have begun implementing Big Data
management solutions.
• The data involves several terabytes or even petabytes of data that has been
saved in a broad range of file formats.
• Effective Big Data management enables an organization to find valuable
information with ease irrespective of how large or unstructured the data is. The
data is gathered from different sources such as call records, system logs and social
media sites.
9. Handling Big Data
• Here are some ways to effectively handle Big Data:
1. Outline Your Goals
• The first tick on the checklist when handling Big Data is knowing which data to gather and
which data need not be collected. This requires clearly defined goals; without them, an
organization ends up gathering large amounts of data that are not aligned with its ongoing
business requirements.
• Many enterprises collect unnecessary data because they lack clearly defined goals and
well-mapped strategies for achieving them. It is of paramount importance that organizations
collect data with a laser focus on their business objectives.
2. Do Not Ignore Audit Regulations
• Offsite database managers should keep the right database components in order, especially when
an audit is at hand. Whether the data consists of payment records, credit scores, or information of
lesser importance, it should be managed accordingly. Doing so steers the organization clear of
liability and progressively earns the client's trust.
10. Handling Big Data
3. Secure the Data
• The next step in managing Big Data is to ensure the relevant data collected is protected by a broad
range of measures. To keep the data both accessible and secure, it must be guarded by firewall
security measures, spam filtering, malware scanning and removal, and, most importantly, team
permission control.
• Data has the power to drive a business to new heights of success or to crash it into oblivion.
Data management should therefore not be taken lightly: securing organizational data is the
highest priority in Big Data management.
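The team permission control mentioned above can be illustrated with a minimal sketch. The roles, actions, and lookup logic below are illustrative assumptions, not a specific product's access-control API:

```python
# A minimal sketch of team permission control for a shared data store.
# Roles and the actions they allow are illustrative assumptions.
PERMISSIONS = {
    "analyst":  {"read"},
    "engineer": {"read", "write"},
    "admin":    {"read", "write", "delete"},
}

def is_allowed(role: str, action: str) -> bool:
    """Return True if the given role may perform the action."""
    return action in PERMISSIONS.get(role, set())
```

A real deployment would back such checks with an authentication system and audit logging; the point here is simply that every data access passes through a single permission check.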
4. Keep the Data Protected
• A database is susceptible not only to threats from human actions and synthetic anomalies but also
to damage from natural elements such as heat, humidity, and extreme cold, all of which can
easily corrupt data. Whenever data is damaged, system failures are bound to follow, leading to
expensive downtime and related overheads.
• Organizations have to safeguard databases against adverse environmental conditions and put
considerable effort into protecting their data. In addition to implementing safety features, it is
essential to create and maintain a backup of the database elsewhere, updated at planned,
frequent intervals.
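The backup routine described above can be sketched with the standard library. The file layout and naming scheme are illustrative assumptions; a production system would back up to a remote location on a schedule:

```python
import shutil
import time
from pathlib import Path

def backup_database(db_file: str, backup_dir: str) -> Path:
    """Copy the database file into backup_dir under a timestamped name,
    so each planned backup run leaves an independent copy."""
    src = Path(db_file)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{src.stem}-{stamp}{src.suffix}"
    shutil.copy2(src, dest)  # copy2 preserves file metadata as well
    return dest
```

Scheduling this at frequent, planned intervals (for example via cron) gives the "backup elsewhere" protection the slide calls for.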
11. Handling Big Data
5. Data Has to Be Interlinked
• Since organizational databases are accessed through a number of channels, using different,
disconnected software for each solution is not recommended. In essence, all organizational
data must be able to talk to each other; communication problems between applications and
data, in either direction, can lead to huge problems.
• Cloud storage is the natural answer to the data interlinking issue; a remote database
administrator, among other resources, is also useful here. The objective is seamless data
synchronization, which matters all the more when more than one team is accessing and
working on the same data simultaneously.
6. Know the Data You Need to Capture
• The key to successful Big Data management is knowing which data suits a particular solution,
and therefore which data needs to be collected in which situations.
• Organizations must know which data has to be collected and when. To do this correctly,
objectives must be clearly defined and a plan formulated for how to accomplish them.
12. 7. Adapt to the New Changes
• One of the most important aspects of Big Data management is
keeping up with the latest trends in the field. Software and data in all
their forms change constantly, almost daily, across the globe.
Keeping up with the newest technologies and adoption strategies
enables organizations to stay ahead of the curve and build highly
optimized, efficient databases. Being flexible and open to new
trends and technologies goes a long way toward gaining an edge over
the competition.
13. Meta Data for Big Data Handling and
Processing
• Traditionally in the world of data management, metadata has often
been ignored and implemented as a post-implementation afterthought.
• With Big Data, you need to build a strong metadata library up front,
because you will have no prior idea of the content or format of the
data you need to process. Remember that in the Big Data world, we
ingest and process data, then tag it, and only after these steps
consume it for further processing.
• There are fundamentally nine types of metadata that are useful for
information technology and data management:
14. Meta Data for Big Data Handling and
Processing
• Technical metadata
• Data transformation rules, data storage structures,
semantic layers, and interface layers
• Business metadata
• Data describing the content: the structure, values,
and meaning of attributes
• Contextual metadata
• Context for large objects such as text, images, and
videos
• Process design–level metadata
• Source and target tables, algorithms, business
rules, etc.
• Program-level metadata
• ETL information
• Infrastructure metadata
• Source and target platforms, network,
contacts, etc.
• Core business metadata
• Frequency of update, valid entries, basic
business metadata, etc.
• Operational metadata
• Usage, record count, processing time, security,
etc.
• Business intelligence metadata
• Information about how data is queried, filtered,
analyzed, and displayed in business intelligence
and analytics tools
• Data mining metadata (data sets, algorithms, and
queries)
• OLAP metadata (dimensions, cubes, measures
(metrics), hierarchies, levels, and drill paths)
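A metadata library like the one described above can be sketched as a simple registry that records several of these metadata types per data set. The field names and example values are illustrative assumptions:

```python
# A small sketch of a metadata library: each ingested data set is
# registered with technical, business, and operational metadata entries.
metadata_library = {}

def register_dataset(name, technical, business, operational):
    """Record metadata for a data set so later processing stages can
    look up its structure, meaning, and operational profile."""
    metadata_library[name] = {
        "technical": technical,      # e.g. storage format, columns
        "business": business,        # e.g. meaning of attributes
        "operational": operational,  # e.g. record count, update frequency
    }

register_dataset(
    "call_records",
    technical={"format": "parquet", "columns": ["caller", "callee", "ts"]},
    business={"caller": "originating phone number"},
    operational={"record_count": 1_200_000, "update": "hourly"},
)
```

The value of such a registry in the Big Data flow is that data can be tagged at ingest time and its meaning looked up later, when it is finally consumed for processing.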
15. Big Data Processing Requirements
• What is unique about Big Data processing?
• What makes it different or mandates new thinking?
• To understand this better let us look at the underlying requirements.
• We can classify Big Data requirements based on its five main
characteristics:
1. Volume:
● The size of the data to be processed is large; it needs to be broken into manageable chunks.
● Data needs to be processed in parallel across multiple systems.
● Data needs to be processed across several program modules simultaneously.
● Data needs to be processed once, and to completion, due to its volume.
● Processing needs to be resumable from the point of failure, since the data is far too large to
restart the process from the beginning.
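The Volume requirements above (chunking, parallel processing, and resuming from the point of failure) can be illustrated with a small sketch. Summing numbers stands in for real per-chunk work, and a set of completed chunk ids stands in for a persisted checkpoint; both are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    """Process one manageable chunk; summing stands in for real work."""
    return sum(chunk)

def process_in_chunks(data, chunk_size=1000, done=None):
    """Break data into chunks, process them in parallel, and skip any
    chunk id already in `done`, so a failed run resumes where it stopped."""
    if done is None:
        done = set()  # checkpoint of completed chunk ids
    chunks = [(i, data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]
    todo = [(i, c) for i, c in chunks if i not in done]
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(process_chunk, [c for _, c in todo]))
    done.update(i for i, _ in todo)  # record progress for possible restart
    return sum(results)
```

In a real cluster the chunks would live on a distributed file system and the checkpoint in durable storage, but the shape of the logic is the same: split, process in parallel, record progress, resume on failure.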
16. Big Data Processing Requirements
2. Velocity:
● Data needs to be processed at streaming speeds during data collection.
● Data needs to be processed from multiple acquisition points.
3. Variety:
● Data of different formats needs to be processed.
● Data of different types needs to be processed.
● Data of different structures needs to be processed.
● Data from different regions needs to be processed.
4. Ambiguity:
● Big Data is ambiguous by nature due to the lack of relevant metadata and context in many
cases. An example is the use of M and F in a sentence—it can mean, respectively, Monday and
Friday, male and female, or mother and father.
● Big Data that is within the corporation also exhibits this ambiguity to a lesser degree. For
example, employment agreements have standard and custom sections and the latter is
ambiguous without the right context.
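The M/F ambiguity described above can be sketched as a toy context-processing step: keywords in the surrounding sentence pick the interpretation. The context names and keyword lists are illustrative assumptions:

```python
# A toy sketch of resolving the ambiguous tokens "M" and "F" from context.
# Each context pairs trigger keywords with an expansion table.
CONTEXTS = {
    "weekday": ({"meeting", "schedule", "deadline"}, {"M": "Monday", "F": "Friday"}),
    "gender":  ({"patient", "gender", "sex"},        {"M": "male",   "F": "female"}),
    "family":  ({"parent", "guardian", "family"},    {"M": "mother", "F": "father"}),
}

def disambiguate(token: str, sentence: str) -> str:
    """Expand an ambiguous token using keywords found in the sentence;
    without relevant context, the token stays ambiguous."""
    words = set(sentence.lower().split())
    for keywords, mapping in CONTEXTS.values():
        if words & keywords:
            return mapping.get(token, token)
    return token
```

This is exactly the role metadata and context play in Big Data processing: without them the token cannot be interpreted at all.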
17. Big Data Processing Requirements
5. Complexity:
● Big Data's complexity requires many algorithms to process data quickly
and efficiently.
● Several types of data need multi-pass processing, and scalability is extremely
important.
• Processing large-scale data requires an extremely high-performance
computing environment that can be managed with ease and
performance-tuned with linear scalability.
18. Processing Limitations
• There are a couple of processing limitations for processing Big Data:
● Write-once model—with Big Data there is no update processing logic due to
the intrinsic nature of the data that is being processed. Data with changes will
be processed as new data.
● Data fracturing—due to the intrinsic storage design, data can be fractured
across the Big Data infrastructure. Processing logic needs to understand the
appropriate metadata schema used in loading the data. If this match is missed,
then errors could creep into processing the data.
• Big Data processing can have combinations of these limitations and
complexities, which will need to be accommodated in the processing
of the data.
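The write-once model described above can be sketched as an append-only store: a change never updates a record in place but arrives as a new record, and reads resolve to the latest version. The class and method names are illustrative:

```python
# A minimal sketch of the write-once model: records are only ever
# appended; "updates" are new records, and the latest version wins.
class WriteOnceStore:
    def __init__(self):
        self._log = []  # append-only record log

    def write(self, key, value):
        """Append a new record; earlier versions are never modified."""
        self._log.append((key, value))

    def read(self, key):
        """Return the most recently written value for the key, if any."""
        for k, v in reversed(self._log):
            if k == key:
                return v
        return None
```

Note that both versions of a changed record remain in the log; this is why changed data is "processed as new data" rather than through update logic.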
19. Processing Big Data
• Big Data processing involves steps
very similar to processing data in
the transactional or data
warehouse environments.
• Figure shows the different stages
involved in the processing of Big
Data; the approach to processing
Big Data is:
● Gather the data.
● Analyze the data.
● Process the data.
● Distribute the data.
Figure: Processing Big Data
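The four stages above can be sketched as a tiny pipeline. Each stage is a placeholder function with illustrative data; real systems would gather into Hadoop or NoSQL storage and distribute to analytical platforms:

```python
# A sketch of the gather -> analyze -> process -> distribute flow.
def gather():
    """Acquire raw records from several sources."""
    return ["WEB log entry", "sensor reading", "CALL record"]

def analyze(records):
    """Tag records before processing; here the first word is the tag."""
    return [{"raw": r, "tag": r.split()[0].lower()} for r in records]

def process(tagged):
    """Apply processing logic using the tags from the analyze stage."""
    return [dict(r, processed=True) for r in tagged]

def distribute(processed):
    """Hand results to downstream systems, here grouped by tag."""
    out = {}
    for r in processed:
        out.setdefault(r["tag"], []).append(r["raw"])
    return out

result = distribute(process(analyze(gather())))
```

The ordering matters: analysis (tagging) happens before processing, which is exactly the key difference from traditional data processing noted on the next slide.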
20. Processing Big Data
• While the stages are similar to
traditional data processing the key
differences are:
● Data is first analyzed and then
processed.
● Data standardization occurs in the
analyze stage, which forms the
foundation for the distribute stage
where the data warehouse integration
happens.
● There is no special emphasis on data
quality beyond the use of metadata,
master data, and semantic libraries to
enhance and enrich the data.
● Data is prepared in the analyze stage
for further processing and integration.
21. Processing Big Data
1. Gather stage
• Data is acquired from multiple
sources including real-time
systems, near-real-time systems,
and batch-oriented applications.
The data is collected and loaded to
a storage environment like Hadoop
or NoSQL.
• Another option is to process the
data through a knowledge
discovery platform and store the
output rather than the whole data
set.
2. Analysis stage
• The analysis stage is the data
discovery stage for processing Big
Data and preparing it for
integration to the structured
analytical platforms or the data
warehouse.
• The analysis stage consists of
tagging, classification, and
categorization of data, which
closely resembles the subject area
creation data model definition
stage in the data warehouse.
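The tagging and classification work of the analysis stage can be illustrated with a toy keyword classifier. The categories and keyword sets are illustrative assumptions, not a real taxonomy:

```python
# A toy sketch of the analysis stage: incoming text records are
# classified by keyword before integration with structured platforms.
CATEGORIES = {
    "complaint": {"refund", "broken", "late"},
    "praise":    {"great", "thanks", "excellent"},
}

def classify(record: str) -> str:
    """Assign a record to the first category whose keywords it contains."""
    words = set(record.lower().split())
    for category, keywords in CATEGORIES.items():
        if words & keywords:
            return category
    return "uncategorized"
```

In practice this step uses metadata, master data, and semantic libraries rather than a hand-written keyword table, but the output is the same kind of subject-area tag the data warehouse model expects.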
22. Processing Big Data
3. Process stage
• Processing Big Data has several substages, and the
data transformation at each substage is critical to
producing correct output.
Context processing
• Context processing relates to exploring the context of
occurrence of data within the unstructured or Big Data
environment. The relevancy of the context will help the
processing of the appropriate metadata and master
data set with the Big Data.
Metadata, master data, and semantic linkage
• The most important step in creating the integration of
Big Data into a data warehouse is the ability to use
metadata, semantic libraries, and master data as the
integration links.
Standardize
• Preparing and processing Big Data for integration with
the data warehouse requires standardizing of data,
which will improve the quality of the data.
4. Distribute stage
• Big Data is distributed to downstream systems by
processing it within analytical applications and
reporting systems. Using the data processing
outputs from the processing stage where the
metadata, master data, and metatags are available,
the data is loaded into these systems for further
processing.
• Another distribution technique involves exporting
the data as flat files for use in other applications
like web reporting and content management
platforms.
• From here, Big Data analytics begins.
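The standardize substage described above can be sketched as a small normalization pass: field values arriving in different formats from different sources are converted to one canonical form before warehouse integration. The rules shown (title-case names, ISO dates) are illustrative assumptions:

```python
from datetime import datetime

def standardize(record: dict) -> dict:
    """Normalize a record: names to title case, dates to ISO YYYY-MM-DD."""
    out = dict(record)
    out["name"] = record["name"].strip().title()
    # accept either US-style or ISO dates from upstream sources
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            out["date"] = datetime.strptime(record["date"], fmt).strftime("%Y-%m-%d")
            break
        except ValueError:
            continue
    return out
```

Standardizing at this point is what makes the distribute stage workable: downstream reporting systems and flat-file exports can then assume a single format.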
23. Technologies for Big Data Processing
• Various technologies form the foundations of Big Data
processing.
• The evolution and implementation of these technologies revolve
around
● Data movement
● Data storage
● Data management
24. Technologies for Big Data Processing
• Hadoop
• Hadoop has taken the world by storm in providing the solution architecture to solve Big Data processing on a
cheaper commodity platform with faster scalability and parallel processing.
• Google file system
• Google discovered that its requirements could not be met by traditional file systems, and thus was born the
need to create a file system that could meet the demands and rigor of an extremely high-performance file
system for large-scale data processing on commodity hardware clusters
• MapReduce
• MapReduce is a programming model for processing extremely large data sets and was originally developed by
Google in the early 2000s for solving the scalability of search computation.
• Zookeeper
• Zookeeper is an open-source, distributed coordination service, backed by a replicated in-memory
data store, that is used for managing distributed applications. It provides a simple set of primitives
that can be used to build services for synchronization, configuration maintenance, group
membership, and naming.
• Pig
• Analyzing large data sets introduces dataflow complexities that become harder to implement in a
raw MapReduce program as data volumes and processing complexity increase; Pig addresses this
with a high-level dataflow language (Pig Latin) that compiles down to MapReduce jobs.
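The MapReduce model named above can be sketched in pure Python: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. Word counting is the classic example; real MapReduce distributes these same phases across a cluster:

```python
from collections import defaultdict

def map_phase(documents):
    """Emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big deal"])))
```

Because map and reduce are independent per key, each phase parallelizes naturally, which is what made the model scale to Google-sized search computation.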
25. Technologies for Big Data Processing
• HBase
• HBase is an open-source, nonrelational, column-
oriented, multidimensional, distributed database
developed on Google’s BigTable architecture. It is
designed with high availability and high
performance as drivers to support storage and
processing of large data sets on the Hadoop
framework.
• Hive
• Hive is an open-source data warehousing
solution that has been built on top of Hadoop.
• Chukwa
• Chukwa is an open-source data collection system
for monitoring large distributed systems. Chukwa
is built on top of HDFS (Hadoop Distributed File
System ) and MapReduce frameworks. There is a
flexible and powerful toolkit for displaying,
monitoring, and analyzing results to make the
best use of the collected data available in
Chukwa.
27. Big Data Recent Research Trends
• Big Data in Retail
• Big Data in Healthcare
• Big Data in Education
• Big Data in E-commerce
• Big Data in Media and
Entertainment
• Big Data in Finance
• Big Data in Travel Industry
• Big Data in Telecom
• Big Data in Automobile
28.
29. References
• Big Data Databases: the Essence
https://www.scnsoft.com/analytics/big-data/databases
• Big Data Applications – A manifestation of the hottest buzzword
https://data-flair.training/blogs/big-data-applications/
• Big Data Tutorial For Beginners | What Is Big Data?
https://www.softwaretestinghelp.com/big-data-tutorial/#Big_Data_Benefits_Over_Traditional_Database
• Healthcare Big Data and the Promise of Value-Based Care
https://catalyst.nejm.org/doi/full/10.1056/CAT.18.0290
30. S.No. | TRADITIONAL DATA | BIG DATA
01. | Traditional data is generated at the enterprise level. | Big data is generated both inside and outside the enterprise.
02. | Its volume ranges from gigabytes to terabytes. | Its volume ranges from petabytes to exabytes or zettabytes.
03. | Traditional database systems deal with structured data. | Big data systems deal with structured, semi-structured, and unstructured data.
04. | Traditional data is generated per hour, per day, or less often. | Big data is generated far more frequently, mainly per second.
05. | The traditional data source is centralized, and the data is managed in centralized form. | Big data sources are distributed, and the data is managed in distributed form.
06. | Data integration is very easy. | Data integration is very difficult.
07. | A normal system configuration is capable of processing traditional data. | A high-end system configuration is required to process big data.
08. | The size of the data is very small. | The size is far larger than traditional data.
09. | Traditional database tools are required to perform any database operation. | Special kinds of database tools are required to perform any database operation.
10. | Normal functions can manipulate the data. | Special kinds of functions are needed to manipulate the data.
11. | Its data model is strictly schema-based and static. | Its data model is flat-schema-based and dynamic.
12. | Traditional data is stable, with known inter-relationships. | Big data is not stable, with unknown relationships.
13. | Traditional data is of manageable volume. | Big data is of huge volume, which becomes unmanageable.
14. | It is easy to manage and manipulate the data. | It is difficult to manage and manipulate the data.
15. | Its data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc. | Its data sources include social media, device data, sensor data, video, images, audio, etc.
Editor's Notes
Structured. If your data is structured, it means that it is already organized and convenient to work with. An example is data in Excel or SQL databases that is tagged in a standardized format and can be easily sorted, updated, and extracted.
Unstructured. Unstructured data does not have any pre-defined order. Google search results are an example of what unstructured data can look like: articles, e-books, videos, and images.
Semi-structured. Semi-structured data has been pre-processed but it doesn’t look like a ‘normal’ SQL database. It can contain some tags, such as data formats. JSON or XML files are examples of semi-structured data. Some tools for data analytics can work with them.
Quasi-structured. It is something in between unstructured and semi-structured data. An example is textual content with erratic data formats such as the information about what web pages a user visited and in what order.
Secure Data: While most organizations gather data from customers via interactions with their websites and products, not many businesses take the time to employ measures that guarantee the security of the data collected. If collected data is damaged, it can harm the relationship with the customer through loss of trust, or even bankrupt the business through the loss of essential customer data.
EDW: Enterprise Data Warehouse
Linkage of different units of data from multiple data sets is not a new concept by itself.
This process can be repeated multiple times for a given data set, as the business rule for each component is different.