BIG DATA AND MACHINE LEARNING
Big Data & IoT
Lecture #3
Umair Shafique (03246441789)
Scholar MS Information Technology - University of Gujrat
Table of contents
• Define big data
• Big data as 10 V's
• Some pros and cons of big data
• Perceived challenges of big data
• Define machine learning
• Real-world examples
• Working flow of ML
• Types of ML
• Challenges of ML
• Relate big data with ML
• Features of ML with big data
• Framework based on ML for big data processing
• Tools and technologies for big data and ML
• Difference between ML and big data
• Research challenges and open issues
• Summary
• References
What is Big Data?
Big Data is a collection of data that is huge in volume and keeps growing exponentially with time. Its size and complexity are so large that no traditional data management tool can store or process it efficiently. In short, big data is simply data at an enormous scale.
Who’s Generating Big Data?
Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely and scalable fashion.
Big Data as 10 V's:
Some Pros of Big Data:
• Better decision-making
• Increased productivity
• Reduced costs
• Improved customer service
• Fraud detection
• Greater innovation
Cons of Big Data:
• Need for talent
• Data quality
• Need for cultural change
• Rapid change
• Hardware needs
• Costs
Perceived Challenges of Big Data
What is Machine Learning?
Machine learning is an application of AI that gives systems the ability to learn on their own and improve from experience without being explicitly programmed. If your computer had machine learning, it might be able to play difficult parts of a game or solve a complicated mathematical equation for you.
Real world examples of machine learning
Machine learning is relevant in many fields and industries, and its use continues to grow over time. Here are five real-life examples of how machine learning is being used.
1. Image recognition
Image recognition is a well-known and widespread example of machine learning in the real world. It can identify an object in a digital image, based on the intensity of the pixels in black-and-white or colour images.
e.g.
• Label an x-ray as cancerous or not
• Assign a name to a photographed face (aka “tagging” on social media)
• Recognise handwriting by segmenting a single letter into smaller images
• Machine learning is also frequently used for facial recognition within an image. Using a database of
people, the system can identify commonalities and match them to faces. This is often used in law
enforcement.
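As a toy illustration of recognising images from pixel intensities, the sketch below classifies tiny flattened "images" with a nearest-neighbour rule. The 3x3 patterns and their labels are invented for this example; real systems work on far larger images with learned models.

```python
# A toy nearest-neighbour classifier over pixel-intensity vectors.
# The 3x3 "images" and labels are made up for illustration.

def distance(a, b):
    """Squared Euclidean distance between two flattened images."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(train, test_image):
    """Return the label of the training image closest to test_image."""
    best_label, best_dist = None, float("inf")
    for image, label in train:
        d = distance(image, test_image)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

# Flattened 3x3 grayscale images: a bright cross vs. a dark corner blob.
train = [
    ([0, 9, 0, 9, 9, 9, 0, 9, 0], "cross"),
    ([9, 9, 0, 9, 0, 0, 0, 0, 0], "corner"),
]

print(predict(train, [0, 8, 0, 8, 9, 8, 0, 8, 0]))  # a slightly noisy cross
```

The same idea, scaled up to learned feature spaces instead of raw pixels, underlies face matching against a database.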
2. Speech recognition
Machine learning can translate speech into text. Certain software applications can convert live voice and recorded
speech into a text file. The speech can be segmented by intensities on time-frequency bands as well.
• Voice search
• Voice dialling
• Appliance control
• Some of the most common uses of speech recognition software are devices like Google Home or Amazon Alexa.
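The idea of segmenting speech by intensity can be sketched in a few lines: split a signal into short frames and keep the stretches whose energy crosses a threshold. The signal, frame length, and threshold below are illustrative assumptions, not a real recogniser, which would operate on time-frequency bands of sampled audio.

```python
# A minimal sketch of intensity-based segmentation: find the regions of a
# 1-D signal where short-frame energy exceeds a threshold ("speech" vs. silence).

def segment(signal, frame=4, threshold=1.0):
    """Return (start, end) index pairs of contiguous high-energy frames."""
    regions, start = [], None
    for i in range(0, len(signal), frame):
        energy = sum(x * x for x in signal[i:i + frame]) / frame
        if energy >= threshold and start is None:
            start = i
        elif energy < threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(signal)))
    return regions

quiet, loud = [0.1] * 8, [2.0] * 8
print(segment(quiet + loud + quiet))  # one loud region in the middle
```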
3. Medical diagnosis
Machine learning can help with the diagnosis of diseases. Many physicians use chatbots with speech recognition
capabilities to discern patterns in symptoms.
• Assisting in formulating a diagnosis or recommending a treatment option
• Oncology and pathology use machine learning to recognise cancerous tissue
• Analyse bodily fluids
• In the case of rare diseases, the joint use of facial recognition software and machine learning helps scan patient
photos and identify phenotypes that correlate with rare genetic diseases.
4. Predictive analytics
Machine learning can classify available data into groups, which are then defined by rules set by analysts. When the
classification is complete, the analysts can calculate the probability of a fault.
• Predicting whether a transaction is fraudulent or legitimate
• Improve prediction systems to calculate the possibility of fault
• Predictive analytics is one of the most promising examples of machine learning. It is applicable to everything from product development to real-estate pricing.
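As a minimal sketch of such a probability calculation, the snippet below scores a transaction with a logistic model. The feature names, weights, and bias are invented for illustration rather than learned from real data.

```python
import math

# A hedged sketch of predictive scoring: a logistic model turns a few
# transaction features into a fraud probability. Weights are made up.

WEIGHTS = {"amount_usd": 0.002, "foreign_country": 1.5, "night_time": 0.8}
BIAS = -4.0

def fraud_probability(tx):
    """Probability in (0, 1) that a transaction is fraudulent."""
    score = BIAS + sum(WEIGHTS[k] * tx[k] for k in WEIGHTS)
    return 1 / (1 + math.exp(-score))

routine = {"amount_usd": 40, "foreign_country": 0, "night_time": 0}
odd = {"amount_usd": 900, "foreign_country": 1, "night_time": 1}

print(round(fraud_probability(routine), 3))  # small probability
print(round(fraud_probability(odd), 3))      # much larger probability
```

In practice the weights would be fitted to labelled historical transactions; the scoring step stays exactly this simple.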
5. Extraction
Machine learning can extract structured information from unstructured data. Organizations amass huge volumes of
data from customers. A machine learning algorithm automates the process of annotating datasets for predictive
analytics tools.
• Generate a model to predict vocal cord disorders
• Develop methods to prevent, diagnose, and treat the disorders
• Help physicians diagnose and treat problems quickly
• Typically, these processes are tedious. But machine learning can track and extract information to obtain billions
of data samples.
How Does Machine Learning Work?
Consider a system with input data that contains photos of various kinds of fruits. You want the system to
group the data according to the different types of fruits.
First, the system will analyze the input data. Next, it tries to find patterns, like shapes, size, and color. Based
on these patterns, the system will try to predict the different types of fruit and segregate them. Finally, it
keeps track of all the decisions it made during the process to ensure it is learning. The next time you ask
the same system to predict and segregate the different types of fruits, it won't have to go through the
entire process again. That’s how machine learning works.
Types of Machine Learning
• Supervised machine learning: You supervise the machine while training it to work on its own. This requires labeled training data.
• Unsupervised learning: There is training data, but it is not labeled.
• Reinforcement learning: The system learns on its own through trial and error, guided by rewards and penalties.
Supervised Learning
To understand how supervised learning works, look at the example
below, where you have to train a model or system to recognize an
apple.
• First, you have to provide a data set that contains pictures of a
kind of fruit, e.g., apples.
• Then, provide another data set that lets the model know that
these are pictures of apples. This completes the training phase.
• Next, provide a new set of data that only contains pictures of
apples. At this point, the system can recognize what the fruit is and
will remember it.
• That's how supervised learning works. You are training the model
to perform a specific operation on its own. This kind of model is
often used in filtering spam mail from your email accounts.
Supervised learning includes:
Classification: Classification is a typical supervised learning task. The spam filter we spoke about above is one such example. It is trained on many example emails along with their class (spam or not spam) and then automatically classifies new emails.
Used for:
• Spam filtering
• Sentiment analysis
• Recognition of handwritten characters and numbers
• Fraud detection
Popular algorithms: Naive Bayes, Decision Tree, Linear Regression, Logistic Regression, K-Nearest
Neighbors, Support Vector Machine, Neural Networks
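To make the spam-filter example concrete, here is a from-scratch Naive Bayes sketch with Laplace smoothing. The four training emails are invented; a real filter would train on many thousands of labelled messages.

```python
import math
from collections import Counter

# A compact Naive Bayes spam filter, built from scratch for illustration.

def train(emails):
    """emails: list of (word list, label). Returns per-class counts and priors."""
    counts = {"spam": Counter(), "ham": Counter()}
    priors = Counter()
    for words, label in emails:
        counts[label].update(words)
        priors[label] += 1
    return counts, priors

def classify(counts, priors, words):
    """Pick the class with the highest log posterior, Laplace-smoothed."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    best, best_score = None, -math.inf
    for label in counts:
        total = sum(counts[label].values())
        score = math.log(priors[label])
        for w in words:
            score += math.log((counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

emails = [
    ("win cash prize now".split(), "spam"),
    ("cheap prize click now".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("lunch on monday".split(), "ham"),
]
counts, priors = train(emails)
print(classify(counts, priors, "win a prize".split()))  # prints: spam
```

Laplace smoothing (the `+ 1`) keeps a single unseen word from zeroing out a whole class, which is the standard fix for sparse training data.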
Regression: Regression is essentially classification where we forecast a number instead of a category. Examples are a car's price given its mileage, traffic given the time of day, or demand volume given the growth of the company. Regression is a natural fit when something depends on time.
Used for:
• Stock price forecasts
• Demand and sales volume analysis
• Medical diagnosis
• Any number-time correlations
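Regression can be illustrated with a one-feature least-squares fit, e.g. predicting a car's price from its mileage. The numbers below are made up and happen to lie exactly on a line, so the fit is exact; real data would scatter around the fitted line.

```python
# A minimal least-squares fit (one feature): price as a function of mileage.
# The data points are invented for illustration.

def fit_line(xs, ys):
    """Return (slope, intercept) minimising the squared error."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

mileage = [10, 30, 50, 70]   # thousands of km
price = [20, 16, 12, 8]      # thousands of dollars

slope, intercept = fit_line(mileage, price)
print(slope, intercept)            # the line through the toy data
print(intercept + slope * 40)      # predicted price at 40k km
```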
Unsupervised Learning
Consider a cluttered dataset: a collection of pictures of different fruits. You feed this data to the model, and the model analyzes it to recognize patterns. In the end, the machine categorizes the photos into groups based on their similarities. Flipkart uses this kind of model to find and recommend products that are well suited to you.
It includes:
• Clustering: A clustering algorithm tries to find objects that are similar (by some features) and merges them into a cluster. Objects with many similar features are joined in one class. With some algorithms, you can even specify the exact number of clusters you want.
Used:
• For market segmentation (types of customers, loyalty)
• For image compression
• To analyze and label new data
• To detect abnormal behavior
Popular Clustering algorithms are:
• K-Means
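A from-scratch K-Means sketch on 2-D points (imagine fruit size vs. colour intensity) shows the assign-then-update loop at the heart of the algorithm. The six points and the choice of k = 2 are illustrative assumptions.

```python
import random

# K-Means from scratch: alternate between assigning points to their
# nearest centre and moving each centre to its cluster's mean.

def kmeans(points, k, steps=10, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(steps):
        # Assignment step: each point joins its nearest centre's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                  + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each centre to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers, clusters

points = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.2, 7.9), (7.8, 8.1)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # two clusters of three points each
```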
Reinforcement Learning
In reinforcement learning, the system learns on its own through trial and error: it takes actions in an environment and receives rewards or penalties, rather than learning from labeled examples.
Used today for:
• Game playing (chess, Go, video games)
• Robotics (e.g., robot vacuums)
• Self-driving vehicles
• Automated trading
• Resource management
Main Challenges of Machine Learning
• Poor-Quality Data
• Irrelevant Features
• Testing and Validating
Big Data & Machine Learning (How Do They Relate?)
To recap, big data refers to vast amounts of data that traditional storage methods cannot handle. Machine learning is the ability of computer systems to learn to make predictions from observations and data. Machine learning can use the information provided by the study of big data to generate valuable business insights.
Machine learning tools use data-driven algorithms and statistical models to analyze data
sets and then draw inferences from identified patterns or make predictions based on them.
The algorithms learn from the data as they run against it, as opposed to traditional rules-
based analytics systems that follow explicit instructions.
Big data provides ample amounts of raw material from which machine learning systems
can derive insights. By combining them, organizations are producing significant analytics
findings and results.
Features of Machine Learning with Big Data
• Sparse Representation
• Mining Structured Relations
• High Scalability and High Speed
Reference Framework Based on Machine Learning for Big Data Processing
Big data processing procedure with machine learning:
We suppose the big data processing procedure mainly consists of the following four phases:
• Pre-processing phase
• Analysis phase
• Model establishment phase
• Model updating phase
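The four phases can be sketched as plain functions to show how they chain together. The cleaning rule, the statistic, and the threshold "model" below are placeholders invented for illustration; a real pipeline would substitute proper cleaning logic and a learned model at each step.

```python
# Schematic of the four-phase big data processing procedure.

def preprocess(records):
    """Pre-processing: drop invalid/dirty records and redundancies."""
    seen, clean = set(), []
    for r in records:
        key = tuple(r.items())
        if r.get("value") is not None and key not in seen:
            seen.add(key)
            clean.append(r)
    return clean

def analyze(records):
    """Analysis: extract a simple statistic to guide model building."""
    values = [r["value"] for r in records]
    return sum(values) / len(values)

def establish_model(mean):
    """Model establishment: a placeholder model flagging outlying values."""
    return lambda v: abs(v - mean) > 2

def update_model(new_mean):
    """Model updating: rebuild with parameters from fresh, real-time data."""
    return establish_model(new_mean)

raw = [{"value": 1}, {"value": 1}, {"value": None}, {"value": 2}, {"value": 9}]
clean = preprocess(raw)             # duplicates and invalid entries removed
model = establish_model(analyze(clean))
print([r["value"] for r in clean if model(r["value"])])  # flagged outliers
```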
Tools and technologies for big data and ML:
• Snowflake
• Matplotlib
• TensorFlow
• BigML
• Apache Spark
• KNIME
• Cloudera
Key differences between big data and ML:
Summary of lecture
• In this lecture, we first provided an overview of big data and summarized its characteristics.
• We then gave an overview of machine learning. To highlight the differences of machine learning techniques in the context of big data, we analyzed the new features of machine learning with big data.
• Next, we related big data and machine learning.
• We also proposed a reference framework for processing big data based on machine learning techniques with the power of distributed storage and parallel computing. Finally, we presented several research challenges and open issues.
• We hope that this lecture can stimulate more interest in research and development of machine learning techniques for big data processing.
References
• https://towardsdatascience.com/machine-learning-and-big-data-real-world-applications-3ba3a3345cf5
• https://www.salesforce.com/eu/blog/2020/06/real-world-examples-of-machine-learning.html
• https://www.techtarget.com/searchbusinessanalytics/tip/Big-data-vs-machine-learning-How-they-differ-and-relate
• https://geekflare.com/big-data-tools-for-data-scientist/


Editor's Notes

• #7 Better decision-making: In the NewVantage Partners survey, 36.2 percent of respondents said that better decision-making was the number one goal of their big data analytics efforts. In addition, 84.1 percent had started working toward that goal, and 59.0 percent had experienced some measurable success, for an overall success rate of 69.0 percent. Analytics can give business decision-makers the data-driven insights they need to help their companies compete and grow.
Increased productivity: A separate survey from vendor Syncsort found that 59.9 percent of respondents were using big data tools like Hadoop and Spark to increase business user productivity. Modern big data tools are allowing analysts to analyze more data, more quickly, which increases their personal productivity. In addition, the insights gained from those analytics often allow organizations to increase productivity more broadly throughout the company.
Reduced costs: Both the Syncsort and the NewVantage surveys found that big data analytics were helping companies decrease their expenses. Nearly six out of ten (59.4 percent) respondents told Syncsort big data tools had helped them increase operational efficiency and reduce costs, and about two thirds (66.7 percent) of respondents to the NewVantage survey said they had started using big data to decrease expenses. Interestingly, however, only 13.0 percent of respondents selected cost reduction as their primary goal for big data analytics, suggesting that for many this is merely a very welcome side benefit.
Improved customer service: Among respondents to the NewVantage survey, improving customer service was the second most common primary goal for big data analytics projects, and 53.4 percent of companies had experienced some success in this regard. Social media, customer relationship management (CRM) systems and other points of customer contact give today's enterprises a wealth of information about their customers, and it is only natural that they would use this data to better serve those customers.
Fraud detection: Another common use for big data analytics, particularly in the financial services industry, is fraud detection. One of the big advantages of big data analytics systems that rely on machine learning is that they are excellent at detecting patterns and anomalies. These abilities can give banks and credit card companies the ability to spot stolen credit cards or fraudulent purchases, often before the cardholder even knows that something is wrong.
Greater innovation: Innovation is another common benefit of big data, and the NewVantage survey found that 11.6 percent of executives are investing in analytics primarily as a means to innovate and disrupt their markets. They reason that if they can glean insights that their competitors don't have, they may be able to get out ahead of the rest of the market with new products and services.
• #8 Need for talent: Data scientists and big data experts are among the most highly coveted, and highly paid, workers in the IT field. The AtScale survey found that the lack of a big data skill set has been the number one big data challenge for the past three years. And in the Syncsort survey, respondents ranked skills and staff as the second biggest challenge when creating a data lake. Hiring or training staff can increase costs considerably, and the process of acquiring big data skills can take considerable time.
Data quality: In the Syncsort survey, the number one disadvantage to working with big data was the need to address data quality issues. Before they can use big data for analytics efforts, data scientists and analysts need to ensure that the information they are using is accurate, relevant and in the proper format for analysis. That slows the reporting process considerably, but if enterprises don't address data quality issues, they may find that the insights generated by their analytics are worthless, or even harmful if acted upon.
Need for cultural change: Many of the organizations that are utilizing big data analytics don't just want to get a little bit better at reporting; they want to use analytics to create a data-driven culture throughout the company. In fact, in the NewVantage survey, a full 98.6 percent of executives said that their firms were in the process of creating this new type of corporate culture. However, changing culture is a tall order. So far, only 32.4 percent were reporting success on this front.
Rapid change: Another potential drawback to big data analytics is that the technology is changing rapidly. Organizations face the very real possibility that they will invest in a particular technology only to have something much better come along a few months later. Syncsort respondents ranked this disadvantage of big data fourth among all the potential challenges they faced.
Hardware needs: Another significant issue for organizations is the IT infrastructure necessary to support big data analytics initiatives. Storage space to house the data, networking bandwidth to transfer it to and from analytics systems, and compute resources to perform those analytics are all expensive to purchase and maintain. Some organizations can offset this problem by using cloud-based analytics, but that usually doesn't eliminate the infrastructure problems entirely.
Costs: Many of today's big data tools rely on open source technology, which dramatically reduces software costs, but enterprises still face significant expenses related to staffing, hardware, maintenance and related services. It's not uncommon for big data analytics initiatives to run significantly over budget and to take more time to deploy than IT managers had originally anticipated.
• #20 Main Challenges of Machine Learning: In short, since our main task is to select a learning algorithm and train it on some data, the two things that can go wrong are "bad algorithm" and "bad data." It takes a lot of data for most machine learning algorithms to work properly.
Poor-Quality Data: Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poor-quality measurements), it will be harder for the system to detect the underlying patterns, so your system is less likely to perform well.
Irrelevant Features: Your system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones.
Testing and Validating: The only way to know how well a model will generalize to new cases is to try it out on new cases. The recommended option is to split your data into two sets: the training set and the test set. As these names imply, you train the model using the training set, and you test it using the test set. The error rate on new cases is called the generalization error, and by evaluating your model on the test set, you get an estimate of this error. This value tells you how well your model will perform on instances it has never seen before.
• #22 In this section, we highlight three abilities of machine learning techniques that are useful for big data problems: sparse representation and feature selection, mining structured relations, and high scalability and high speed.
Sparse Representation: High-dimensional data is difficult to handle with traditional data processing methods. Therefore, effective dimension reduction is increasingly viewed as a necessary step. For high-dimensional big data, we highlight feature selection and sparse representation, two commonly adopted approaches. Feature selection is a key issue in building robust data processing models through selecting a subset of meaningful features. It should help visualize the data, construct better statistical models, and improve prediction accuracy by mapping the high-dimensional data onto its underlying low-dimensional manifold. For high-dimensional big data, a sparse data representation is more and more important for many algorithms.
Mining Structured Relations: Big data generally comes from different sources with heterogeneous types, including structured, unstructured and semi-structured representation forms. In dealing with such heterogeneous datasets, the challenge is that a machine learning system needs to infer the structure behind the data when it is not known beforehand. One way of structuring data is to discover relevance based on inherent data properties through structured learning and structured prediction. The main purpose of mining structured relations from a set of data is to aggregate massive amounts of data and divide it into smaller chunks that machine learning systems can easily handle.
High Scalability and High Speed: The unprecedented volumes of big data require high scalability of data mining and processing tools. Current techniques to enhance the scalability of machine learning algorithms mainly focus on two aspects: i) the scalability of cloud computing makes it possible to analyze enormous datasets by aggregating multiple workloads with varying performance goals into multi-tenanted computing clusters, making machine learning with cloud computing more efficient and higher-performing for processing and analyzing big data; ii) distributed storage and parallel computing have helped to solve machine learning algorithms' scalability problems. A useful approach to boost the speed of big data processing is to maximally identify and exploit the potential parallelism in the machine learning algorithms. High scalability and high speed give machine learning the power to handle big data.
• #24 Pre-processing phase: Because data sources cover many different domains, raw big data collected from the environment is highly complex and contains tremendous redundancies. Therefore, in the pre-processing phase we first need to delete invalid and dirty data. In addition, we frequently face massive uncertain and incomplete data in real life, and we need to append some important attributes to improve its practicability for processing.
Analysis phase: After the pre-processing phase, we need to analyze the valid and useful data to find out how to utilize it through trial and error. Data visualization is a fundamental problem in the analysis of big data, and we can adopt sparse representation to achieve effective dimension reduction for high-dimensional data.
Model establishment phase: Through analysis of the essential parameters, we should be able to select some important features to establish a feasible model for dealing with real problems. In this phase, we first try to mine the structured relations between data to obtain statistical information and trends, then split the data into training and testing sets, decide what kind of model should be generated, and build up the corresponding model.
Model updating phase: Once the model is established, we need to configure its parameters and apply the generated model in actual operations to test the performance of the big data processing model. In this phase, we emphasize that the input data is real-time, and we should make dynamic adjustments to update the model based on the effects of its application.
Of the four phases, the first three are offline processing. In these phases, we can adopt offline learning methods, which include the two categories of supervised learning and unsupervised learning. In the model testing and updating phase, we mainly focus on the real-time characteristic of the input data; to deal with real-time processing, online learning methods are necessary and reinforcement learning is preferred.