Big Data and Machine Learning
Big Data
Big Data is similar to small data but bigger in size; because the data is bigger, it requires different approaches, techniques, tools, and architectures, with the aim of solving new problems, or old problems in a better way. Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques.
The challenges include:
• capture
• curation
• storage
• search
• sharing
• transfer
• analysis
• visualization
Characteristics of Big Data
Volume
Big Data indicates huge volumes of data being generated on a daily basis from various sources such as social media platforms, business processes, machines, networks, human interactions, etc. Such large amounts of data are stored in data warehouses.
Data volume is increasing exponentially
◦ 44x increase from 2009 to 2020
◦ From 0.8 zettabytes to 35 zettabytes
•A typical PC might have had 10 gigabytes of storage in 2000
•Today, Facebook ingests 500 terabytes of new data every day.
•A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.
EarthScope
EarthScope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas Fault, as well as the plume of magma underneath.
Velocity
The term velocity refers to the speed at which data is generated and how fast it must be processed to meet demands.
Big Data velocity deals with the speed at which data flows in from sources like business
processes, applications, networks, social media, sensors and mobile devices etc. The flow of
data is massive and continuous.
•Data is being generated fast and needs to be processed fast
•Online Data Analytics
•Late decisions → missed opportunities
Examples
E-Promotions: Based on your current location, your purchase history, and what you like → send promotions right now for the store next to you
Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurements require immediate reaction
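The healthcare-monitoring case can be sketched in a few lines of code. This is a hedged illustration only: the normal range, the reading values, and the `monitor` function are all invented for the example, and a real system would consume a live sensor feed rather than a list.

```python
from typing import Iterable, Iterator

def monitor(readings: Iterable[float], low: float = 50.0,
            high: float = 120.0) -> Iterator[str]:
    """Yield an alert for any reading outside the (hypothetical) normal range."""
    for value in readings:
        if value < low or value > high:
            yield f"ALERT: abnormal reading {value}, immediate reaction required"

# Simulated heart-rate stream; in production this would be a live sensor feed.
for alert in monitor([72.0, 75.0, 180.0, 74.0]):
    print(alert)
```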
Variety
Variety of Big Data refers to structured, unstructured, and semi-structured data gathered from multiple sources. While in the past data could only be collected from spreadsheets and databases, today data comes in an array of forms such as emails, PDFs, photos, videos, audio, social media posts, and much more.
•Relational Data (Tables/Transaction/Legacy Data)
•Text Data (Web)
•Semi-structured Data (XML)
•Graph Data
◦ Social Network, Semantic Web (RDF), …
• Streaming Data
◦ You can only scan the data once
•Big Public Data (online, weather, finance, etc)
Veracity
Big Data veracity refers to the biases, noise, and abnormality in data: is the data being stored and mined meaningful to the problem being analyzed? Veracity is often defined as the quality or trustworthiness of the data you collect, and it is frequently considered the biggest challenge in data analysis when compared to volume and velocity.
It is important to consider how accurate the data you collect and analyze is. In this sense, when it comes to big data, quality is always preferred over quantity. To focus on quality, it is important to set metrics around what type of data you may collect and from what sources.
The Model Has Changed…
The model of generating and consuming data has changed.
Old model: a few companies generate data; all others consume it.
New model: all of us generate data, and all of us consume data.
The Types of Big Data
•Structured: most traditional data sources
•Semi-structured: many sources of big data
•Unstructured: video data, audio data
Types
Structured: By structured data, we mean data that can be processed, stored, and retrieved in a fixed format. For instance, the employee table in a company database is structured: the employee details, their job positions, their salaries, etc., are present in an organized manner.
Unstructured: Unstructured data refers to the data that lacks any specific form or structure
whatsoever. This makes it very difficult and time-consuming to process and analyze unstructured
data. Email is an example of unstructured data.
Semi-structured: Semi-structured data pertains to data containing both of the formats mentioned above, that is, structured and unstructured data. To be precise, it refers to data that, although it has not been classified under a particular repository (database), contains vital information or tags that segregate individual elements within the data.
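To make the three types concrete, here is a small, purely illustrative Python sketch showing the same employee record in each form; the field names and values are invented.

```python
import json

# Structured: a fixed schema, like a row in a relational employee table.
structured_row = ("E101", "Asha Rao", "Data Engineer", 85000)

# Semi-structured: self-describing tags (JSON here; XML works the same way),
# so records can vary in shape while individual elements stay identifiable.
semi_structured = json.loads(
    '{"id": "E101", "name": "Asha Rao", "skills": ["Hadoop", "Hive"]}'
)

# Unstructured: free text with no schema at all, e.g. the body of an email.
unstructured = "Hi team, Asha joins the Hadoop migration project next week."

print(structured_row[2], semi_structured["skills"], len(unstructured.split()))
```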
Storing Big Data
Analyzing your data characteristics
Selecting data sources for analysis
Eliminating redundant data
Establishing the role of NoSQL
Overview of Big Data stores
Data models: key-value, graph, document, column-family (see the sketch after this list)
Hadoop Distributed File System
HBase
Hive
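As a rough illustration of the four data models listed above, the following Python sketch mimics each one with plain in-memory structures. Real stores such as HBase (column-family) add distribution, persistence, and indexing on top; all keys and values here are invented.

```python
# Key-value: an opaque value looked up by a single key.
kv = {"user:42": b"...serialized profile bytes..."}

# Document: nested, self-describing records.
doc = {"_id": 42, "name": "Asha", "orders": [{"sku": "A1", "qty": 2}]}

# Column-family: rows grouped into named families of columns (HBase-style).
column_family = {"row42": {"info": {"name": "Asha"}, "orders": {"A1": "2"}}}

# Graph: nodes plus labeled edges (social network / RDF-style triples).
graph_edges = [("Asha", "follows", "Ben"), ("Ben", "follows", "Asha")]

print(list(kv), doc["orders"], column_family["row42"]["info"], graph_edges[0])
```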
Processing Big Data
Integrating disparate data stores
• Mapping data to the programming framework
• Connecting and extracting data from storage
• Transforming data for processing
• Subdividing data in preparation for Hadoop
MapReduce
Employing Hadoop MapReduce
• Creating the components of Hadoop MapReduce jobs
• Distributing data processing across server farms
• Executing Hadoop MapReduce jobs
• Monitoring the progress of job flows
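The following is a minimal single-machine sketch of the MapReduce pattern, not Hadoop itself: a map phase emits (key, value) pairs, a shuffle groups them by key, and a reduce phase aggregates each group. The word-count task and input documents are the usual illustrative example.

```python
from collections import defaultdict

def map_phase(document: str):
    """Map: emit one (word, 1) pair per word occurrence."""
    for word in document.lower().split():
        yield (word, 1)

def reduce_phase(key: str, values: list):
    """Reduce: aggregate all counts emitted for one word."""
    return (key, sum(values))

documents = ["big data big value", "data in motion"]

# Shuffle: group intermediate pairs by key, as Hadoop does between phases.
groups = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        groups[key].append(value)

print([reduce_phase(k, v) for k, v in groups.items()])
# [('big', 2), ('data', 2), ('value', 1), ('in', 1), ('motion', 1)]
```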
Big Data Sources
Data generation points
Applications
Benefits
•Real-time big data isn’t just a process for storing petabytes or exabytes of data in a data warehouse; it’s about the ability to make better decisions and take meaningful actions at the right time.
•Fast forward to the present, and technologies like Hadoop give you the scale and flexibility to store data before you know how you are going to process it.
•Technologies such as MapReduce, Hive, and Impala enable you to run queries without changing the data structures underneath.
•Organizations are using big data to target customer-centric outcomes, tap into internal data, and build a better information ecosystem.
Machine learning
Introduction
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
The basic premise of machine learning is to build algorithms that can receive input data and use statistical analysis to predict an output, while updating outputs as new data becomes available.
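As a small illustration of that premise, the sketch below trains a linear classifier on an initial batch and then updates it as new data arrives, rather than retraining from scratch. It assumes scikit-learn is installed; the toy fruit measurements and labels are invented.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(random_state=0)

# Initial batch: [weight_g, diameter_cm] -> 0 = apple, 1 = banana (toy data).
X0 = np.array([[150, 7], [170, 8], [120, 3], [130, 4]])
y0 = np.array([0, 0, 1, 1])
model.partial_fit(X0, y0, classes=[0, 1])

# Later, new data becomes available: update the model incrementally
# instead of retraining from scratch.
model.partial_fit(np.array([[160, 7.5]]), np.array([0]))

print(model.predict(np.array([[125, 3.5]])))  # expected: [1] (banana)
```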
How Machine Learning Works
Machine learning algorithms are often categorized as supervised, unsupervised, or reinforcement learning.
Supervised learning
Supervised learning is learning in which we teach or train the machine using data that is well labeled, meaning the data is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data), so that the supervised learning algorithm analyses the training data (the set of training examples) and produces a correct outcome from the labeled data.
Example: suppose you are given a basket filled with different kinds of fruit. The first step is to train the machine on all the different fruits, one by one, like this:
If the shape of the object is rounded with a depression at the top and its color is red, then it is labelled as Apple.
If the shape of the object is a long curving cylinder and its color is green-yellow, then it is labelled as Banana.
Now suppose that, after training, the machine is given a new fruit from the basket, say a banana, and is asked to identify it. Since the machine has already learned from the previous data, it can use that knowledge: it first classifies the fruit by its shape and color, confirms the fruit name as BANANA, and puts it in the banana category.
Supervised learning is classified into two categories of algorithms:
Classification: A classification problem is when the output variable is a category, such as “red” or “blue”, or “disease” and “no disease”.
Regression: A regression problem is when the output variable is a real value, such as “dollars” or “weight”.
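The fruit example above is a classification problem, and it can be sketched directly in code. This is a hedged illustration assuming scikit-learn; the numeric encoding of shape and color is our own invention.

```python
from sklearn.tree import DecisionTreeClassifier

# [shape, color] per fruit: shape 0 = rounded, 1 = long cylinder;
# color 0 = red, 1 = green-yellow. Labels are the "correct answers".
X = [[0, 0], [0, 0], [1, 1], [1, 1]]
y = ["apple", "apple", "banana", "banana"]

clf = DecisionTreeClassifier().fit(X, y)

# A new, unseen fruit: long cylinder, green-yellow -> should be banana.
print(clf.predict([[1, 1]]))  # ['banana']
```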
Unsupervised Learning
Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. Unlike supervised learning, no labeled training data is given to the machine; the machine must therefore find the hidden structure in unlabeled data by itself.
Example: suppose the machine is given an image containing both dogs and cats that it has never seen before.
The machine has no idea about the features of dogs and cats, so it cannot label them as dogs and cats. But it can categorize them according to their similarities, patterns, and differences: the picture can be split into two groups, the first containing all the pictures with dogs and the second all the pictures with cats, even though the machine learned nothing beforehand, i.e., it had no training data or examples.
Unsupervised learning is classified into two categories of algorithms:
Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.
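The customer-grouping case can be sketched with k-means, as below. This assumes scikit-learn; the spend and visit figures are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# [monthly_spend, visits_per_month] for six hypothetical customers.
X = np.array([[500, 20], [480, 18], [520, 22],
              [50, 2], [60, 3], [40, 1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # two groups: frequent big spenders vs occasional buyers
```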
Reinforcement Learning
This area of machine learning involves models iterating over many attempts to complete a process. Steps that produce favorable outcomes are rewarded and steps that produce undesired outcomes are penalized, until the algorithm learns the optimal process.
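A minimal tabular Q-learning sketch of this reward-and-penalty loop is shown below. The five-state corridor environment, rewards, and hyperparameters are all invented for illustration; real reinforcement learning setups are considerably richer.

```python
import random

N_STATES, GOAL = 5, 4             # states 0..4; reaching state 4 is rewarded
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action]; 0=left, 1=right

random.seed(0)
for _ in range(200):                       # many attempts at the process
    s = 0
    while s != GOAL:
        # Mostly exploit the best-known action, sometimes explore.
        a = (random.randrange(2) if random.random() < EPS
             else max((0, 1), key=lambda act: Q[s][act]))
        s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
        r = 1.0 if s2 == GOAL else -0.01   # reward the goal, penalize wandering
        # Q-learning update: nudge the value toward reward + discounted future.
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2

print([max((0, 1), key=lambda act: Q[s][act]) for s in range(N_STATES - 1)])
# expected: [1, 1, 1, 1] -- the learned optimal process is "always move right"
```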
Examples of Machine Learning
Facebook news feeds
Machine learning is being used in a wide range of applications today. One of the most well-known
examples is Facebook's News Feed. The News Feed uses machine learning to personalize each
member's feed. If a member frequently stops scrolling to read or like a particular friend's posts, the
News Feed will start to show more of that friend's activity earlier in the feed. Behind the scenes,
the software is simply using statistical analysis and predictive analytics to identify patterns in the
user's data and use those patterns to populate the News Feed. Should the member no longer stop
to read, like or comment on the friend's posts, that new data will be included in the data set and
the News Feed will adjust accordingly.
Examples of ML
Self-driving cars
Machine learning also plays an important role in self-driving cars. Deep learning neural networks are used to identify objects and determine optimal actions for safely steering a vehicle down the road.
Types of Machine Learning Algorithms
Here are a few of the most commonly used models:
Decision trees. These models use observations about certain actions and identify an optimal
path for arriving at a desired outcome.
K-means clustering. This model groups a specified number of data points into a specific number
of groupings based on like characteristics.
Neural networks. These deep learning models utilize large amounts of training data to identify correlations between many variables, learning how to process incoming data in the future.
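As a small sketch of the neural-network model type, the example below fits a tiny multi-layer perceptron to XOR, a pattern whose output depends on the interaction of both inputs, so no single linear rule captures it. It assumes scikit-learn; the hyperparameters are illustrative, and small networks can fail to converge on XOR for some seeds.

```python
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]   # XOR: the output depends on both inputs interacting

# lbfgs suits tiny datasets; seed chosen so the toy net usually fits XOR.
mlp = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=2000, random_state=1)
mlp.fit(X, y)
print(mlp.predict(X))  # ideally [0 1 1 0]
```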
The Future of Machine Learning
Current machine learning (ML) algorithms identify statistical regularities in complex data sets and
are regularly used across a range of application domains, but they lack the robustness and
generalizability associated with human learning. If ML techniques could enable computers to
learn from fewer examples, transfer knowledge between tasks, and adapt to changing contexts
and environments, the results would have very broad scientific and societal impacts.
How Are Big Data and Machine Learning Related?
Machine learning (ML) is based on algorithms that can learn from data without relying on rules-based programming. Big data is the kind of data that can be fed into the analytical system so that an ML model can ‘learn’, or in other words, improve the accuracy of its predictions.
A quick example: preventive machinery maintenance. We use big data from sensors (temperature, humidity, pressure, and vibration readings for each machinery part, arriving every second) to train, test, and retrain an ML model. The role of the model is to identify hidden patterns that lead to machinery failure and to check newly incoming data against the identified patterns. As a final step, the analytical system may trigger alerts to the maintenance team if the model identifies a match with a pre-failure condition pattern.
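A hedged sketch of that maintenance pipeline: train a classifier on labeled sensor readings, then check newly incoming readings against the learned pre-failure patterns. It assumes scikit-learn; every reading, label, and feature name is invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# [temperature_C, humidity_pct, pressure_kPa, vibration_mm_s] per reading,
# labeled 1 if the part failed soon after the reading, else 0 (all invented).
X = np.array([[60, 40, 101, 2.0], [62, 42, 100, 2.2], [61, 41, 101, 1.9],
              [95, 55, 90, 8.5], [97, 58, 88, 9.1], [94, 54, 91, 8.8]])
y = np.array([0, 0, 0, 1, 1, 1])

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Newly incoming reading: alert the maintenance team on a pre-failure match.
new_reading = np.array([[96, 56, 89, 9.0]])
if model.predict(new_reading)[0] == 1:
    print("ALERT: reading matches a pre-failure condition pattern")
```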
A good way to think about the relationship between Big Data and Machine Learning is that the
data is the raw material that feeds the machine learning process. The tangible benefit to a
business is derived from the predictive model(s) that comes out at the end of the process, not
the data used to construct it.