BIG DATA
Agenda
• What is Data?
• What is Big Data?
• What is an Example of Big Data?
• Types Of Big Data
• Characteristics Of Big Data
• Advantages Of Big Data Processing
• Big data use cases
• How is big data stored and processed?
• Big data management technologies
• Big data challenges
What is Data?
• The quantities, characters, or symbols on which operations are
performed by a computer, which may be stored and transmitted in
the form of electrical signals and recorded on magnetic, optical, or
mechanical recording media.
• Now, let’s look at the definition of Big Data
What is Big Data?
• Big Data is a collection of data that is huge in volume, yet grows
exponentially with time.
• It is data of such large size and complexity that no traditional data
management tool can store or process it efficiently.
• In short, big data is still data, just at a huge scale.
What is an Example of Big Data?
• The New York Stock Exchange:
• generates about one terabyte of new trade data per day.
• Social media:
• Statistics show that 500+ terabytes of new data are ingested into the databases of the social
media site Facebook every day.
• This data is generated mainly by photo and video uploads, message exchanges,
comments, etc.
• Jet engine:
• a single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of
flights per day, total data generation reaches many petabytes.
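The petabyte figure follows from simple arithmetic. A back-of-envelope sketch, where the flight duration and per-day flight count are illustrative assumptions, not figures from the slide:

```python
# Back-of-envelope check of the jet-engine example above.
# Assumptions (illustrative only): 10 TB per 30 minutes of flight,
# a 2-hour flight, and 25,000 flights per day worldwide.
tb_per_half_hour = 10
flight_hours = 2
flights_per_day = 25_000  # hypothetical figure

tb_per_flight = tb_per_half_hour * (flight_hours * 60 // 30)
tb_per_day = tb_per_flight * flights_per_day
pb_per_day = tb_per_day / 1_000  # 1,000 TB = 1 PB (decimal units)
print(pb_per_day)
```

Even with these modest assumptions, engine telemetry alone reaches roughly a thousand petabytes per day.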
Types Of Big Data
1. Structured
2. Unstructured
3. Semi-structured
Structured
• Any data that can be stored, accessed and processed in a fixed format
is termed structured data.
• Over time, computer science has had great success in developing
techniques for working with such data (where the format is known in
advance) and deriving value from it.
• However, we now foresee issues when such data grows to a huge
extent; typical sizes are in the range of multiple zettabytes.
• Looking at these figures, one can easily understand why the name Big
Data was coined and imagine the challenges involved in its storage and
processing.
Do you know?
• 10²¹ bytes (one billion terabytes) equal one zettabyte.
• Data stored in a relational database management system is one
example of ‘structured’ data.
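The unit arithmetic can be checked directly, using decimal (SI) prefixes:

```python
# Decimal byte units: 1 TB = 10**12 bytes, 1 ZB = 10**21 bytes.
TB = 10**12
ZB = 10**21
terabytes_per_zettabyte = ZB // TB
print(terabytes_per_zettabyte)  # one billion
```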
Examples Of Structured Data
• An ‘Employee’ table in a database is an example of Structured Data.
Employee_ID | Employee_Name   | Gender | Department | Salary_In_Rs
----------- | --------------- | ------ | ---------- | ------------
2365        | Rajesh Kulkarni | Male   | Finance    | 650000
3398        | Pratibha Joshi  | Female | Admin      | 650000
7465        | Shushil Roy     | Male   | Admin      | 500000
7500        | Shubhojit Das   | Male   | Finance    | 500000
7699        | Priya Sane      | Female | Finance    | 550000
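To make the "structured" label concrete, the table above can be loaded into SQLite (a relational DBMS shipped with Python) and queried with standard SQL. This is a minimal sketch; the schema is our own choice:

```python
import sqlite3

# Load the 'Employee' table into an in-memory relational database.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Employee (
    Employee_ID INTEGER PRIMARY KEY, Employee_Name TEXT,
    Gender TEXT, Department TEXT, Salary INTEGER)""")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?, ?)",
    [(2365, "Rajesh Kulkarni", "Male", "Finance", 650000),
     (3398, "Pratibha Joshi", "Female", "Admin", 650000),
     (7465, "Shushil Roy", "Male", "Admin", 500000),
     (7500, "Shubhojit Das", "Male", "Finance", 500000),
     (7699, "Priya Sane", "Female", "Finance", 550000)])

# A fixed schema makes aggregate queries trivial.
avg_finance = conn.execute(
    "SELECT AVG(Salary) FROM Employee WHERE Department = 'Finance'"
).fetchone()[0]
print(round(avg_finance))
```

The fixed, known-in-advance format is exactly what lets a query like `AVG(Salary)` work without any preprocessing.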
Unstructured
• Any data whose form or structure is unknown is classified as
unstructured data.
• In addition to being huge in size, unstructured data poses multiple
challenges when it comes to processing it to derive value.
• A typical example of unstructured data is a heterogeneous data
source containing a combination of simple text files, images, videos,
etc.
• Nowadays organizations have a wealth of data available to them but,
unfortunately, don’t know how to derive value from it, since the data
is in raw, unstructured form.
Examples Of Un-structured Data
• The output returned by ‘Google Search’
Semi-structured
• Semi-structured data can contain both forms of data.
• We can see semi-structured data as structured in form, but it is not
actually defined with, e.g., a table definition as in a relational DBMS.
Examples Of Semi-structured Data
• Personal data stored in an XML file:
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
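Because the tags make the structure machine-readable, such records are easy to parse. A minimal sketch using Python's standard XML parser (the records above are wrapped in an assumed root element so they form a well-formed document):

```python
import xml.etree.ElementTree as ET

# Wrap the <rec> records from the slide in a root element and parse.
doc = """<people>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
</people>"""
root = ET.fromstring(doc)
ages = [int(rec.findtext("age")) for rec in root]
print(len(ages), max(ages))
```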
Data Growth over the years
The three Vs of big data
Volume
• The amount of data matters.
• With big data, you’ll have to process high volumes of low-density, unstructured data.
• This can be data of unknown value, such as Twitter data feeds, clickstreams on a web page or a mobile
app, or readings from sensor-enabled equipment.
• For some organizations, this might be tens of terabytes of data. For others, it may be hundreds of
petabytes.
Velocity
• Velocity is the fast rate at which data is received and (perhaps) acted on.
• Normally, the highest velocity of data streams directly into memory versus being written to disk.
• Some internet-enabled smart products operate in real time or near real time and will require real-time
evaluation and action.
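The "streams directly into memory" point can be sketched with a generator that acts on each reading as it arrives, with nothing written to disk first (the readings are hypothetical):

```python
# Running mean over an in-memory stream: each value is consumed and
# acted on immediately rather than being stored and batch-processed.
def running_mean(stream):
    total = count = 0
    for value in stream:
        total += value
        count += 1
        yield total / count

readings = [20.0, 22.0, 21.0, 25.0]  # hypothetical sensor feed
means = list(running_mean(readings))
print(means)
```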
Variety
• Variety refers to the many types of data that are available.
• Traditional data types were structured and fit neatly in a relational database.
• With the rise of big data, data arrives in new unstructured types.
• Unstructured and semi-structured data types, such as text, audio, and video, require additional
preprocessing to derive meaning and support metadata.
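As a toy illustration of that preprocessing step, a raw text blob can be reduced to a few structured metadata fields that downstream systems can index (the fields chosen here are our own assumptions):

```python
from collections import Counter

# Derive simple structured metadata from an unstructured text blob.
def text_metadata(raw):
    words = raw.lower().split()
    top_word = Counter(words).most_common(1)[0][0]
    return {"chars": len(raw), "words": len(words), "top_word": top_word}

meta = text_metadata("big data needs big storage and big compute")
print(meta)
```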
The value—and truth—of big data
• Two more Vs have emerged over the past few years:
• Value:
• Data has intrinsic value. But it’s of no use until that value is discovered.
• Veracity:
• How truthful is your data—and how much can you rely on it?
• Finding value in big data isn’t only about analyzing it.
• It’s an entire discovery process that requires insightful analysts,
business users, and executives who ask the right questions, recognize
patterns, make informed assumptions, and predict behavior.
Advantages Of Big Data Processing
• The ability to process Big Data brings multiple benefits, such as:
 Businesses can use outside intelligence when making decisions
• Access to social data from search engines and sites like Facebook and Twitter enables
organizations to fine-tune their business strategies.
 Improved customer service
• Traditional customer feedback systems are being replaced by new systems designed
with Big Data technologies. In these new systems, Big Data and natural language
processing technologies are used to read and evaluate consumer responses.
 Early identification of risks to products/services, if any
 Better operational efficiency
• Big Data technologies can be used to create a staging area or landing zone for new
data before deciding what should be moved to the data warehouse. In addition,
such integration of Big Data technologies with the data warehouse helps an
organization offload infrequently accessed data.
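As a toy sketch of the natural-language-processing idea above, consumer responses can be scored against keyword lexicons. Real systems use full NLP pipelines; the word lists and responses here are illustrative assumptions:

```python
# Score feedback by counting positive vs. negative keywords.
POSITIVE = {"great", "love", "fast", "helpful"}
NEGATIVE = {"slow", "broken", "bad", "refund"}

def score(response):
    # Normalize lightly: lowercase and treat commas as spaces.
    words = set(response.lower().replace(",", " ").split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

pos = score("great service, fast delivery")
neg = score("broken on arrival, want a refund")
print(pos, neg)
```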
Big data benefits:
• Big data makes it possible for you to gain more complete answers
because you have more information.
• More complete answers mean more confidence in the data—which
means a completely different approach to tackling problems.
Big data use cases
• Big data can help you address a range of business activities, from
customer experience to analytics. Here are just a few.
Product development
• Companies like Netflix and Procter & Gamble use big data to
anticipate customer demand. They build predictive models for new
products and services by classifying key attributes of past and current
products or services and modeling the relationship between those
attributes and the commercial success of the offerings. In addition,
P&G uses data and analytics from focus groups, social media, test
markets, and early store rollouts to plan, produce, and launch new
products.
Predictive maintenance
• Factors that can predict mechanical failures may be deeply buried in
structured data, such as the year, make, and model of equipment, as
well as in unstructured data that covers millions of log entries, sensor
data, error messages, and engine temperature. By analyzing these
indications of potential issues before the problems happen,
organizations can deploy maintenance more cost effectively and
maximize parts and equipment uptime.
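A heavily simplified sketch of the idea: combine a structured attribute (equipment age) with a signal derived from sensor logs (repeated high temperature readings) to flag units early. Every threshold here is an illustrative assumption:

```python
# Flag equipment for maintenance before failure occurs.
def needs_maintenance(age_years, temps, temp_limit=95.0):
    # Three or more readings over the limit counts as overheating.
    overheating = sum(t > temp_limit for t in temps) >= 3
    return age_years > 10 or overheating

hot_unit = needs_maintenance(4, [90.0, 96.5, 97.1, 98.0])
ok_unit = needs_maintenance(6, [88.0, 91.0])
print(hot_unit, ok_unit)
```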
Customer experience
• The race for customers is on. A clearer view of customer experience is
more possible now than ever before. Big data enables you to gather
data from social media, web visits, call logs, and other sources to
improve the interaction experience and maximize the value delivered.
Start delivering personalized offers, reduce customer churn, and
handle issues proactively.
Fraud and compliance
• When it comes to security, it’s not just a few rogue hackers—you’re
up against entire expert teams. Security landscapes and compliance
requirements are constantly evolving. Big data helps you identify
patterns in data that indicate fraud and aggregate large volumes of
information to make regulatory reporting much faster.
Machine learning
• Machine learning is a hot topic right now. And data—specifically big
data—is one of the reasons why. We are now able to teach machines
instead of program them. The availability of big data to train machine
learning models makes that possible.
Operational efficiency
• Operational efficiency may not always make the news, but it’s an area
in which big data is having the most impact. With big data, you can
analyze and assess production, customer feedback and returns, and
other factors to reduce outages and anticipate future demands. Big
data can also be used to improve decision-making in line with current
market demand.
Drive innovation
• Big data can help you innovate by studying interdependencies among
humans, institutions, entities, and processes, and then determining new
ways to use those insights. Use data insights to improve decisions
about financial and planning considerations. Examine trends and
what customers want in order to deliver new products and services.
Implement dynamic pricing. The possibilities are endless.
How is big data stored and processed?
• Big data is often stored in a data lake.
• While data warehouses are commonly built on relational databases
and contain only structured data, data lakes can support various data
types and are typically based on Hadoop clusters, cloud object storage
services, NoSQL databases or other big data platforms.
Big data management technologies
• Hadoop, an open source distributed processing framework released
in 2006, initially was at the center of most big data architectures.
• The development of Spark and other processing engines
pushed MapReduce, the engine built into Hadoop, more to the side.
• The result is an ecosystem of big data technologies that can be used
for different applications but often are deployed together.
Big data platforms
• Big data platforms and managed services offered by IT vendors
combine many of those technologies in a single package, primarily for
use in the cloud.
• Amazon EMR (formerly Elastic MapReduce)
• Cloudera Data Platform
• Google Cloud Dataproc
• HPE Ezmeral Data Fabric (formerly MapR Data Platform)
• Microsoft Azure HDInsight
• For organizations that want to deploy big data systems themselves,
either on premises or in the cloud, the technologies that are available
to them in addition to Hadoop and Spark include the following
categories of tools:
storage repositories
• such as the Hadoop Distributed File System (HDFS) and cloud object
storage services that include Amazon Simple Storage Service (S3),
Google Cloud Storage and Azure Blob Storage;
cluster management frameworks
• like Kubernetes, Mesos and YARN (Hadoop's built-in resource
manager and job scheduler; the name stands for Yet Another Resource
Negotiator, though the acronym alone is what's commonly used);
stream processing engines
• such as Flink, Hudi, Kafka, Samza, Storm and the Spark Streaming and
Structured Streaming modules built into Spark;
NoSQL databases
• that include Cassandra, Couchbase, CouchDB, HBase, MarkLogic Data
Hub, MongoDB, Neo4j, Redis and various other technologies;
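What makes document-oriented NoSQL databases a fit for varied big data is schema flexibility: two records in one collection need not share fields. A sketch with plain Python dicts standing in for documents (MongoDB and CouchDB behave analogously; the records are illustrative):

```python
# Two "documents" in one collection with different shapes.
collection = [
    {"_id": 1, "name": "Prashant Rao", "age": 35},
    {"_id": 2, "name": "Seema R.", "age": 41, "tags": ["admin", "remote"]},
]
# Queries must tolerate missing fields, hence .get() with a default.
over_30 = [doc["name"] for doc in collection if doc.get("age", 0) > 30]
print(over_30)
```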
data lake and data warehouse platforms
• among them Amazon Redshift, Delta Lake, Google BigQuery, Kylin and
Snowflake; and
SQL query engines
• like Drill, Hive, Impala, Presto and Trino.
Big data challenges
• In connection with the processing capacity issues, designing a big
data architecture is a common challenge for users. Big data systems
must be tailored to an organization's particular needs, a DIY
undertaking that requires IT and data management teams to piece
together a customized set of technologies and tools. Deploying and
managing big data systems also require new skills compared to the
ones that database administrators and developers focused on
relational software typically possess.
• Both of those issues can be eased by using a managed cloud service,
but IT managers need to keep a close eye on cloud usage to make
sure costs don't get out of hand. Also, migrating on-premises data
sets and processing workloads to the cloud is often a complex
process.
• Other challenges in managing big data systems include making the
data accessible to data scientists and analysts, especially in
distributed environments that include a mix of different platforms
and data stores. To help analysts find relevant data, data management
and analytics teams are increasingly building data catalogs that
incorporate metadata management and data lineage functions. The
process of integrating sets of big data is often also complicated,
particularly when data variety and velocity are factors.
