BIG DATA
Definition, History
Assistant Professor Dr. Ş. Esra Dinçer, Software Engineering Dept.
EMT516 Big Data Management
1
Definition of DATA
• Data is a collection of facts, numbers, words, observations or other
useful information.
• Through data processing and data analysis, raw data is
transformed into valuable insights that improve decision-
making and drive better business outcomes.
• In recent years, the rise of AI has further increased the focus
on data to train machine learning models and refine predictive
algorithms.
• As data’s volume, complexity and importance grow,
effective data management processes are needed to keep
information organized and accessible for data analysis.
2
Data versus Information
Data:
• raw,
• chaotic,
• lacking meaningful structure or
context.
Information:
• refined,
• analyzed,
• structured output,
• derived from this data,
• facilitates strategic decision-making.
3
Purposes of Data Use
• Predict customer behaviour
• Optimise supply chain
• Forecast demand
• Predictive analytics
• Generative AI
• Healthcare innovations
• Social science research
• Cybersecurity and risk management
• Operational efficiency
• Business intelligence (BI)
4
Data collection steps
1. Set clear objectives,
2. Identify relevant sources,
3. Acquire the data,
4. Clean the data,
5. Integrate it into a unified data set,
6. Run ongoing quality checks to help ensure the collected data is
accurate and reliable.
Proper data collection yields complete, accurate data, which leads to
better analyses, insights and decision-making.
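The cleaning, integration and quality-check steps can be sketched in a few lines of Python. This is only a minimal illustration: the two sources (crm_rows, web_rows), the field names and all values are hypothetical, and a real project would read them from files, databases or APIs.

```python
# Hypothetical raw records from two sources; names and values are illustrative only.
crm_rows = [
    {"customer": " Alice ", "city": "izmir", "spend": "120.5"},
    {"customer": "Bob", "city": "Ankara", "spend": ""},  # missing value
]
web_rows = [
    {"customer": "alice", "city": "Izmir", "spend": "30"},
    {"customer": "Carol", "city": "Bursa", "spend": "75"},
]


def clean(row):
    """Step 4: trim whitespace, normalise case, convert spend to a number."""
    spend = row["spend"].strip()
    return {
        "customer": row["customer"].strip().title(),
        "city": row["city"].strip().title(),
        "spend": float(spend) if spend else None,
    }


def integrate(*sources):
    """Step 5: merge cleaned rows into one data set, summing spend per customer."""
    unified = {}
    for source in sources:
        for row in map(clean, source):
            key = (row["customer"], row["city"])
            entry = unified.setdefault(
                key,
                {"customer": row["customer"], "city": row["city"], "spend": 0.0, "missing": 0},
            )
            if row["spend"] is None:
                entry["missing"] += 1
            else:
                entry["spend"] += row["spend"]
    return list(unified.values())


def quality_report(rows):
    """Step 6: a simple check that counts customers with missing values."""
    return {"rows": len(rows), "rows_with_missing": sum(1 for r in rows if r["missing"])}


dataset = integrate(crm_rows, web_rows)
report = quality_report(dataset)
print(dataset)
print(report)
```

Keeping each step in its own function makes it easy to rerun the quality check every time new data arrives.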
5
Noisy data
• Noisy data is data that is corrupted or distorted.
• The noise in data can lead to a false sense of accuracy or false
conclusions.
• Data that contains a large amount of additional meaningless
information is called noisy; the meaningless part is the noise.
Example of noisy data:
Name      Phone number        City          Order number
Whitten   +1 (5423)196354     New Y o r k
S mith    1 3654 4 1 5 9 63   #%Phonix      23987
*Lowell   **1 (3612)          Phonix        ?//*+
6
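As a rough illustration, the noisy rows above could be cleaned with a few regular expressions. This is a sketch under assumptions, not a general-purpose cleaner: the column assignment of each value, the helper names and the CITY_FIXES lookup table are all hypothetical, and a real pipeline would also validate phone number lengths.

```python
import re

# Rows copied from the noisy example; None marks the missing order number.
noisy_rows = [
    {"name": "Whitten", "phone": "+1 (5423)196354", "city": "New Y o r k", "order": None},
    {"name": "S mith", "phone": "1 3654 4 1 5 9 63", "city": "#%Phonix", "order": "23987"},
    {"name": "*Lowell", "phone": "**1 (3612)", "city": "Phonix", "order": "?//*+"},
]

# Hypothetical lookup table for known misspellings.
CITY_FIXES = {"Phonix": "Phoenix"}

# Matches a space that sits between two single letters, as in "Y o r k".
SINGLE_LETTER_GAP = re.compile(r"(?<=\b[A-Za-z]) (?=[A-Za-z]\b)")


def clean_name(value):
    # Keep letters only, which removes stray symbols and inner spaces.
    return re.sub(r"[^A-Za-z]", "", value)


def clean_phone(value):
    # Keep digits only.
    return re.sub(r"\D", "", value)


def clean_city(value):
    # Drop symbols, rejoin spaced-out letters, then fix known misspellings.
    letters = re.sub(r"[^A-Za-z ]", "", value).strip()
    joined = SINGLE_LETTER_GAP.sub("", letters)
    return CITY_FIXES.get(joined, joined)


def clean_order(value):
    # An order number is kept only if it consists entirely of digits.
    return value if value is not None and value.isdigit() else None


cleaned = [
    {
        "name": clean_name(row["name"]),
        "phone": clean_phone(row["phone"]),
        "city": clean_city(row["city"]),
        "order": clean_order(row["order"]),
    }
    for row in noisy_rows
]

for row in cleaned:
    print(row)
```

Values that cannot be repaired, such as the order number made of symbols, are set to None rather than guessed, so later analysis can treat them as missing.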
Data resources
• Social media interactions: Real-time data from platforms such as Twitter
• Public data: Freely available data sets from governments and
organizations
• Open data sets: Data sets from academic institutions and governments
• Transactional data: Data from business transactions, such as sales records
• Surveys and questionnaires: Qualitative or quantitative data collected
through customer feedback or research surveys
• Web analytics: Data from website interactions
• IoT devices: Data from Internet of Things (IoT) devices
7
Types of DATA
Some of the most common types of data include:
• Quantitative data
• Qualitative data
• Structured data
• Unstructured data
• Semi-structured data
• Metadata
• Big data
8
Quantitative data
• Quantitative data consists of values that can be measured
numerically.
• It may include discrete data points (like the number of
products sold) or continuous data points (such as temperature
or revenue figures).
• It is often structured, making it easy to analyze using
mathematical tools and algorithms.
• Common use cases of quantitative data include trend
forecasting, statistical analysis, budgeting, pattern
identification and performance measurement.
9
Qualitative data
• It is descriptive and non-numerical, capturing characteristics,
concepts or experiences that numbers cannot measure.
• Examples include customer feedback, product reviews and
social media comments.
• Qualitative data can be structured (such as coded survey
responses) or unstructured (such as free-text responses or
interview transcripts).
• Common use cases for qualitative data include understanding
customer behavior, market trends and user experiences.
10
Structured data
• Structured data is organized in a clear, defined format, often
stored in relational databases or spreadsheets.
• It can consist of both quantitative data (such as sales figures) and
qualitative data (such as categorical labels like “yes” or “no”).
• The highly organized nature of structured data allows for quick
querying and data analysis, making it useful for business
intelligence systems and reporting processes.
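A small sketch of why structured data is quick to query, using Python's built-in sqlite3 module. The sales table, its columns and its values are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory relational database with a hypothetical "sales" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL, returned TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", 120.0, "no"), ("North", 80.0, "yes"), ("South", 200.0, "no")],
)

# Every row follows the same schema, so filtering and aggregating
# is a single declarative query.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "WHERE returned = 'no' GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(totals)
```

The table mixes a quantitative column (amount) with a qualitative one (the yes/no returned label), matching the description above.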
11
Unstructured data
• It does not have a strictly defined format. It often comes in
complex forms such as text documents, images and videos.
• It can include both qualitative information and quantitative
elements.
• Examples: emails, social media content and multimedia files.
• Unstructured data doesn’t easily fit into traditional relational
databases, and organizations often use techniques such
as natural language processing (NLP) and machine learning to
streamline analysis of unstructured data.
12
Semi-structured data
• Semi-structured data blends elements of structured and
unstructured data.
• It doesn't follow a rigid format but can include tags or markers
that make it easier to organize and analyze.
• Examples: XML files and JSON objects.
• Semi-structured data is widely used in scenarios such as web
scraping and data integration projects because it offers
flexibility while retaining some structure for search and
analysis.
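The following sketch shows what "tags or markers without a rigid format" can look like in practice. The JSON records are hypothetical: each one has named fields, but not every record has the same fields, so the code reads optional fields defensively when flattening them into uniform rows.

```python
import json

# Hypothetical JSON records: the keys act as markers, but the shape varies.
raw = """[
  {"id": 1, "name": "Ada", "tags": ["vip"], "address": {"city": "Izmir"}},
  {"id": 2, "name": "Can"},
  {"id": 3, "name": "Ece", "address": {"city": "Ankara", "zip": "06000"}}
]"""

records = json.loads(raw)

# Flatten into uniform rows; missing optional fields become None or 0.
rows = [
    {
        "id": record["id"],
        "name": record["name"],
        "city": record.get("address", {}).get("city"),
        "tag_count": len(record.get("tags", [])),
    }
    for record in records
]

for row in rows:
    print(row)
```

Once flattened this way, the rows could be loaded into a structured store such as a relational table.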
13
Metadata
• Metadata is data about data.
• It includes information about the attributes of a data point or
data set, such as file names, authors, creation dates or data
types.
• Metadata enhances data organization, searchability and
management.
• It is critical to systems such as databases, digital libraries and
content management platforms because it helps users more
easily sort and find the data they need.
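File-system metadata is an easy way to see the difference between data and data about data. The sketch below writes a small temporary file (the file name and content are hypothetical) and then reads its metadata with Python's standard os.stat call.

```python
import os
import tempfile
from datetime import datetime, timezone

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "report.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("quarterly sales figures")  # the data itself

    info = os.stat(path)  # the metadata kept by the file system
    metadata = {
        "file_name": os.path.basename(path),
        "extension": os.path.splitext(path)[1],
        "size_bytes": info.st_size,
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
    }

print(metadata)
```

None of these values describe what the report says; they describe the file, which is what makes sorting and searching large collections possible.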
14
Big data
• Big data refers to massive, complex data sets that traditional
systems can't handle.
• It includes both structured and unstructured data.
• Big data analytics helps organizations process and analyze
these large data sets to systematically extract valuable
insights. It often requires advanced tools such as machine
learning.
• Common use cases for big data include customer behavior
analysis, fraud detection and predictive maintenance.
15
History of Big Data
• Around 2005, users generated huge amounts of data through
Facebook, YouTube and other online services.
• Open-source frameworks such as Hadoop and, later, Spark were created
specifically to store and analyze big data sets.
• They make big data easier to work with and cheaper to store.
• The volume of big data has skyrocketed in recent years.
• Servers on the internet have been gathering data on customer
usage patterns and product performance.
• Cloud computing has expanded big data possibilities even further.
16
Data management
• Data management is the practice of collecting, processing and
using data securely and efficiently to improve business
outcomes.
• It addresses critical challenges such as managing large data
sets, breaking down silos and handling inconsistent data
formats.
• Data management solutions typically help ensure access to
high-quality, usable data for data scientists, analysts and other
stakeholders.
17
Two of the most significant roles in the field
• Data scientists: create models and algorithms to find insights
in large data sets, often using advanced tools such as
machine learning and predictive modeling.
• Data analysts: use statistics to analyze data and answer
specific business questions. Their main goal is to find useful
insights that help with everyday decisions and strategies.
18
The 3 Vs of Big Data
• Volume: The amount of data matters.
• Velocity: The fast rate at which data is received and (perhaps)
acted on.
• Variety: The many types of data available. In today’s big data
world, data comes in new unstructured data types.
19
Examples from Companies
• Companies like Netflix and Procter & Gamble use big data to
anticipate customer demand. To do so, they:
• Classify key attributes of past and current products or services,
• Model the relationship between those attributes and the commercial
success of the offerings,
• Build predictive models for new products and services.
• P&G uses data and analytics from focus groups, social media, test
markets, and early store rollouts to plan, produce, and launch new
products.
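The idea of modelling the relationship between a product attribute and commercial success can be illustrated with a one-variable least-squares fit. This is a toy sketch, not the method any of the companies above actually use: the attribute (advertising budget), the success measure (units sold) and all numbers are hypothetical.

```python
# Hypothetical past products: (advertising budget, units sold).
past_products = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]


def fit_line(points):
    """Ordinary least squares for one attribute: returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Model the relationship between the attribute and success.
slope, intercept = fit_line(past_products)


def predict(budget):
    """Predict units sold for a new product with the given budget."""
    return slope * budget + intercept


print(slope, intercept, predict(5.0))
```

Real predictive models use many attributes and far more data, but the workflow is the same: describe past products by their attributes, fit a model, then apply it to a new product.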
20
Big Data Challenges
• Data volumes are increasing in size and organizations need to find
ways to store them effectively.
• Data is valuable only when it is clean. Producing clean data, that is,
data that is relevant and organized in a way that enables meaningful
analysis, requires a lot of work.
• Data technology is changing and keeping up with big data technology
is an ongoing challenge. A combination of the Apache Hadoop and
Apache Spark frameworks currently appears to be the best approach.
21
How Big Data Works: Steps
1. Integrate: bring together data from different sources and format it
in the required form
2. Store: in the cloud, on-premises or both
3. Analyse: analyse and act on the data
22
Big Data Platforms
1. Apache Hadoop
2. Snowflake
3. Apache Spark
4. Google BigQuery
5. AWS Big Data Solutions
6. Microsoft Azure HDInsight
23
Resources
(https://www.ibm.com/think/topics/data, oracle.com)
24
