- 1. Introduction to Big Data: Types, Characteristics & Benefits. In order to understand 'Big Data', we first need to know what 'data' is. The Oxford dictionary defines 'data' as: "The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media."
- 3. So, 'Big Data' is also data, but of huge size. 'Big Data' is a term used to describe a collection of data that is huge in size and yet growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently. Mr. V.C.Bhagawat, ATRIA, Dept. of MCA.
- 4. Examples Of 'Big Data': The New York Stock Exchange generates about one terabyte of new trade data per day. Social media impact: statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges, posting comments, etc.
- 5. A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
- 6. Categories Of 'Big Data': Big Data can be found in three forms. Structured: Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data. Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and deriving value from it. However, nowadays we are foreseeing issues when the size of such data grows to a huge extent, with typical sizes in the range of multiple zettabytes.
- 7. Examples Of Structured Data: An 'Employee' table in a database is an example of structured data.
  Employee_ID | Employee_Name   | Gender | Department | Salary_In_lacs
  2365        | Rajesh Kulkarni | Male   | Finance    | 650000
  3398        | Pratibha Joshi  | Female | Admin      | 650000
  7465        | Shushil Roy     | Male   | Admin      | 500000
  7500        | Shubhojit Das   | Male   | Finance    | 500000
  7699        | Priya Sane      | Female | Finance    | 550000
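As a small sketch of what "fixed format" buys you, the Employee table above can be loaded into a relational store and queried directly. The snippet below uses Python's built-in sqlite3 module; the table and column names simply mirror the slide and are not part of any real system.

```python
import sqlite3

# In-memory database holding the Employee table from the slide.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Employee (
        Employee_ID   INTEGER PRIMARY KEY,
        Employee_Name TEXT,
        Gender        TEXT,
        Department    TEXT,
        Salary        INTEGER
    )
""")
rows = [
    (2365, "Rajesh Kulkarni", "Male",   "Finance", 650000),
    (3398, "Pratibha Joshi",  "Female", "Admin",   650000),
    (7465, "Shushil Roy",     "Male",   "Admin",   500000),
    (7500, "Shubhojit Das",   "Male",   "Finance", 500000),
    (7699, "Priya Sane",      "Female", "Finance", 550000),
]
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?, ?, ?)", rows)

# Because the schema is known in advance, queries are straightforward.
finance = conn.execute(
    "SELECT COUNT(*) FROM Employee WHERE Department = 'Finance'"
).fetchone()[0]
```

The known-in-advance schema is exactly what traditional tools exploit; it is also what unstructured data lacks.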
- 8. Semi-structured: Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is actually not defined with, e.g., a table definition in a relational DBMS. Examples of semi-structured data include XML files, email, JSON documents, HTML, EDI, and RDF. Examples Of Semi-structured Data: personal data stored in an XML file -
  <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
  <rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
  <rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
  <rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
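The XML records above can be parsed without any predefined schema, which is what makes them semi-structured. A minimal sketch using Python's standard xml.etree.ElementTree (the `<people>` wrapper element is added here only to make the document well-formed):

```python
import xml.etree.ElementTree as ET

# The personal records from the slide, wrapped in a root element.
xml_data = """<people>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
</people>"""

root = ET.fromstring(xml_data)
# Tags act like column names, but no schema enforces them.
names = [rec.find("name").text for rec in root.findall("rec")]
ages = [int(rec.find("age").text) for rec in root.findall("rec")]
```

Nothing stops a sixth record from adding or omitting a tag; that flexibility is the trade-off versus the fixed Employee table.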
- 9. Unstructured Any data with unknown form or the structure is classified as unstructured data. In addition to the size being huge, un-structured data poses multiple challenges in terms of its processing for deriving value out of it. Typical example of unstructured data is, a heterogeneous data source containing a combination of simple text files, images, videos etc. Now a day organizations have wealth of data available with them but unfortunately they don't know how to derive value out of it since this data is in its raw form or unstructured format.
- 10. Examples Of Unstructured Data: the output returned by a 'Google Search'.
- 11. Characteristics Of 'Big Data': Volume. The name 'Big Data' itself is related to an enormous size. The size of data plays a very crucial role in determining the value of data. Whether particular data can actually be considered Big Data or not also depends upon its volume. Hence, 'Volume' is one characteristic that needs to be considered while dealing with Big Data. • 500 hours of video are uploaded to YouTube per minute.
- 13. Velocity: The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet demands determines the real potential in the data. Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous. Facebook users, for example, upload more than 900 million photos a day. Facebook's data warehouse stores upwards of 300 petabytes of data, but the velocity at which new data is created should also be taken into account: Facebook claims 600 terabytes of incoming data per day. Google alone processes on average more than "40,000 search queries every second," which roughly translates to more than 3.5 billion searches per day.
- 14. Variety: Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining, and analyzing data.
- 16. Veracity: Big Data veracity refers to the biases, noise, and abnormality in data. Is the data that is being stored and mined meaningful to the problem being analyzed? Veracity is the biggest challenge in data analysis when compared to things like volume and velocity: uncertainty arises from inconsistency, incompleteness, ambiguity, etc.
- 18. Validity: Validity refers to how accurate and correct the data is for its intended use. According to Forbes, an estimated 60 percent of a data scientist's time is spent cleansing data before being able to do any analysis. Volatility: How old does your data need to be before it is considered irrelevant, historic, or no longer useful? How long does data need to be kept? Vulnerability: A vulnerability may refer to any type of weakness in a computer system itself, in a set of procedures, or in anything that leaves information security exposed to a threat.
- 19. Visualization You can't rely on traditional graphs when trying to plot a billion data points, so you need different ways of representing data such as data clustering or using tree maps, sunbursts, parallel coordinates, circular network diagrams, or cone trees. Value Last, but arguably the most important of all, is value. The other characteristics of big data are meaningless if you don't derive business value from the data.
- 20. Quiz Time : • 1. Data in bytes size is called big data.
- 21. Data Analytics
- 22. Data Analytics: Data analysis, also known as analysis of data or data analytics, is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, risk, health care, web, science, and social science domains. Analytics is not a tool or technology; rather, it is a way of thinking or acting.
- 23. Types Of Analyses Every Data Scientist Should Know Descriptive Predictive Prescriptive Diagnostic Exploratory Inferential
- 24. Types Of Analyses Every Data Scientist Should Know. Descriptive (least amount of effort): Descriptive analytics uses existing information from the past to understand decisions in the present and helps decide an effective course of action for the future. It is the discipline of quantitatively describing the main features of a collection of data; in essence, it describes a set of data. –Typically the first kind of data analysis performed on a data set. –Commonly applied to large volumes of data, such as census data. –The description and interpretation processes are different steps. –Univariate and bivariate are two types of statistical descriptive analyses. –Type of data set applied to: census data set (a whole population).
- 25. Descriptive Analytics: What is happening? Insight into the past. Descriptive analysis or statistics does exactly what the name implies: it "describes", or summarizes, raw data and makes it something interpretable by humans. These are analytics that describe the past. They are useful because they allow us to learn from past behaviors and understand how those might influence future outcomes. Common examples of descriptive analytics are •reports that provide historical insights regarding the company's production, financials, operations, sales, finance, inventory, and customers.
- 26. Predictive Analytics: What is likely to happen? Understanding the future. Predictive analytics "predict" what might happen; these analytics are about understanding the future. Predictive analytics provide estimates of the likelihood of a future outcome; their foundation is based on probabilities. A few examples are: •understanding how sales might close at the end of the year, •predicting what items customers will purchase together, or •forecasting inventory levels based upon a myriad of variables.
- 27. Prescriptive Analytics: What do I need to do? Advice on possible outcomes. Prescriptive analytics allows users to "prescribe" a number of different possible actions and guides them towards a solution. It predicts not only what will happen, but also why it will happen, providing recommendations regarding actions that will take advantage of the predictions. Prescriptive analytics uses a combination of techniques and tools such as business rules, algorithms, machine learning, and computational modelling procedures. Example: optimize production, scheduling, and inventory in the supply chain to make sure the right products are delivered at the right time, optimizing the customer experience.
- 28. Diagnostic Analytics is a form of advanced analytics which examines data or content to answer the question “Why did it happen?”, and is characterized by techniques such as drill-down, data discovery, data mining and correlations. Ex: healthcare provider compares patients’ response to a promotional campaign in different regions;
- 29. Exploratory An approach to analyzing data sets to find previously unknown relationships. –Exploratory models are good for discovering new connections –They are also useful for defining future studies/questions –Exploratory analyses are usually not the definitive answer to the question at hand, but only the start –Exploratory analyses alone should not be used for generalizing and/or predicting Ex: Microarray studies : aimed to find uncharacterised genes, which act at specific points during the cell cycle
- 30. Inferential Aims to test theories about the nature of the world in general (or some part of it) based on samples of “subjects” taken from the world (or some part of it). That is, use a relatively small sample of data to say something about a bigger population. Inference involves estimating both the quantity you care about and your uncertainty about your estimate – Inference depends heavily on both the population and the sampling scheme Type of data set applied to: Observational, Cross Sectional Time Study, and Retrospective Data Set – the right, randomly sampled population
- 33. EXAMPLE APPLICATIONS: The relevance, importance, and impact of analytics are now bigger than ever before, and, given that more and more data are being collected and that there is strategic value in knowing what is hidden in data, analytics will continue to grow.
- 34. ANALYTICS PROCESS MODEL • Problem Definition. • Identification of data source. • Selection of Data. • Data Cleaning. • Transformation of data. • Analytics. • Interpretation and Evaluation.
- 35. Problem Definition. • Problem identification and definition: a problem is a situation that is judged as something that needs to be corrected. • It is the job of the analyst to make sure that the right problem is solved. Problems can be identified through: • Comparative/benchmarking studies. Benchmarking is comparing one's business processes and performance metrics to industry bests and best practices from other companies. • Performance reporting: assessment of present performance against goals and objectives. • SWOT analysis.
- 36. SWOT Analysis
- 37. Depending on the type of problem, source data need to be identified. Data is the key ingredient in any analytical exercise, and the selection of data has a deterministic impact on the analytical models we build. A few data collection techniques: •Using data that has already been collected by others. •Systematically selecting and watching characteristics of people, objects, and events. •Oral questioning of respondents, either individually or as a group. •Facilitating free discussions on specific topics with selected groups of participants.
- 38. Data Storage: All data will then be gathered in a staging area, which could be, for example, a data mart or data warehouse.
- 39. Data Exploration / Data Cleaning: Before a formal data analysis can be conducted, the analyst must know how many cases there are in the data set, what variables are included, how many missing observations there are, and what general hypotheses the data is likely to support. Analysts commonly use visualization for data exploration because it allows users to quickly and simply view most of the relevant features of their data set. Basic exploratory analysis can be considered here, like online analytical processing (OLAP) facilities for multidimensional data analysis (e.g., roll-up, drill-down, slicing and dicing).
- 40. Analytics Model Building: This is the entire process of implementing the solution. The majority of the project time is spent in the solution implementation step. The analytical approach of building a model is a very iterative process because there is no final or perfect solution. Validate model: like model building, the process of validating a model is also iterative.
- 41. Evaluation / monitoring: an ongoing process essentially aimed at looking at the effectiveness of the solutions over time. Since the analytical problem-solving approach is different from other approaches, points to remember are: •There is a clear reliance on data to drive solution identification. •We are using analytical techniques based on numeric theories. •You need a good understanding of how theoretical concepts apply to business situations in order to build a feasible solution.
- 43. ANALYTICAL MODEL REQUIREMENTS: A good analytical model should satisfy several requirements, depending on the application area. Business relevance: the model should actually solve the business problem for which it was developed. It is of key importance that the business problem to be solved is appropriately defined, qualified, and agreed upon by all parties involved at the outset of the analysis.
- 44. Statistical Performance: The model should have statistical significance and predictive power. How this is measured depends on the type of analytics selected; various measures exist to quantify it.
- 46. Interpretable and Justifiable. Interpretability refers to understanding the patterns that the analytical model captures. Ex:- in credit risk modeling or medical diagnosis, interpretable models are absolutely needed to get good insight into the underlying data patterns.
- 47. Justifiability refers to the degree to which a model corresponds to prior business knowledge and intuition. (Intuition is the ability to acquire knowledge without proof, evidence, or conscious reasoning, or without understanding how the knowledge was acquired.) For example, a model stating that a higher debt ratio results in more creditworthy clients may be interpretable, but it is not justifiable because it contradicts basic financial intuition. Note that both interpretability and justifiability often need to be balanced against statistical performance.
- 48. Operationally Efficient: Analytical models should also be operationally efficient. This covers the efforts needed to collect the data, preprocess it, evaluate the model, and feed its outputs to the business application. Operational efficiency also entails the efforts needed to monitor and back-test the model, and to re-estimate it when necessary.
- 49. Economic cost: This includes the costs to gather and preprocess the data, the costs to analyze the data, and the costs to put the resulting analytical models into production. In addition, the software costs and human and computing resources should be taken into account here, e.g., via a cost–benefit analysis. Regulation and Legislation: Given the importance of analytics nowadays, more and more regulation is being introduced relating to the development and use of analytical models. In the context of privacy, many new regulatory developments are taking place at various levels, e.g., the use of cookies in a web analytics context, the Basel Accords for credit risk models, and Solvency II in the insurance sector.
- 50. External data providers such as Dun & Bradstreet, Thomson Reuters, and Verisk offer structured, low-level, detailed data.
- 51. DATA SAMPLING and PRE-PROCESSING: In real life, data can be dirty because of inconsistencies, incompleteness, duplication, and merging problems. Data sampling is a statistical analysis technique used to select, manipulate, and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined. It helps data scientists, predictive modelers, and other data analysts work with a small, manageable amount of data. Identifying and analyzing a representative sample is more efficient and cost-effective than surveying the entirety of the data or population.
- 52. Challenges of data sampling: • the size of the required data sample and the possibility of introducing a sampling error. • [A sampling error is a statistical error that occurs when an analyst does not select a sample that represents the entire population of data, so the results found in the sample do not represent the results that would be obtained from the entire population.]
- 53. Types of data sampling methods: • There are many different methods for drawing samples from data; the ideal one depends on the data set and situation. • Sampling can be based on probability or non-probability.
- 54. • Probability sampling is a sampling technique in which samples from a larger population are chosen using the theory of probability. For a participant to be considered a probability sample, he/she must be selected through random selection. • The most important requirement of probability sampling is that everyone in your population has a known and equal chance of getting selected. • For example, if you have a population of 100 people, every person has odds of 1 in 100 of getting selected. Probability sampling gives you the best chance to create a sample that is truly representative of the population.
- 55. Types of Probability Sampling: Simple random sampling, as the name suggests, is a completely random method of selecting the sample. This sampling method is as easy as assigning numbers to the individuals and then randomly choosing from those numbers through an automated process; the numbers that are chosen are the members included in the sample. Samples are chosen in this method using a lottery system or number-generating software / a random number table.
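The "number-generating software" route can be sketched in a few lines; this is a minimal illustration using Python's random module, with a made-up population of 100 numbered individuals and a fixed seed so the draw is reproducible:

```python
import random

# Population: 100 individuals, identified by assigned numbers 1..100.
population = list(range(1, 101))

random.seed(42)  # fixed seed only so the example is reproducible
sample = random.sample(population, k=10)  # draw 10 without replacement

# Every individual had the same 10/100 chance of being selected.
```

random.sample draws without replacement, so no individual can appear in the sample twice.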
- 56. Stratified random sampling involves a method where a larger population is divided into smaller groups that usually don't overlap but together represent the entire population. While sampling, these groups can be organized, and a sample then drawn from each group separately.
- 57. A common method is to arrange or classify by gender, age, ethnicity, and similar attributes: splitting subjects into mutually exclusive groups and then using simple random sampling to choose members from each group. Members in each of these groups should be distinct, so that every member of every group gets an equal opportunity to be selected using simple probability. This sampling method is also called "random quota sampling".
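The split-then-sample procedure can be sketched directly; here is a minimal illustration with a hypothetical 20-person population stratified by gender (the names and stratum sizes are invented for the example):

```python
import random

# Hypothetical population with a Gender attribute to stratify on.
population = [("P%02d" % i, "Male" if i % 2 else "Female")
              for i in range(1, 21)]

# Step 1: split into mutually exclusive strata.
strata = {}
for person, gender in population:
    strata.setdefault(gender, []).append(person)

# Step 2: simple random sampling within each stratum.
random.seed(0)
sample = []
for gender, members in strata.items():
    sample.extend(random.sample(members, k=3))
```

Sampling each stratum separately guarantees every group is represented, which plain simple random sampling cannot promise for small groups.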
- 58. Cluster random sampling is a way to randomly select participants when they are geographically spread out. For example, if you wanted to choose 100 participants from the entire population of the U.S., it is likely impossible to get a complete list of everyone. Instead, the researcher randomly selects areas (i.e. cities or counties) and randomly selects from within those boundaries. Cluster sampling usually analyzes a particular population in which the sample consists of more than a few elements, for example, city, family, university etc. The clusters are then selected by dividing the greater population into various smaller sections.
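Cluster sampling inverts the stratified approach: instead of sampling within every group, it randomly picks whole groups and takes all of their members. A minimal sketch with invented city clusters:

```python
import random

# Hypothetical clusters: cities, each listing all of its residents.
clusters = {
    "CityA": ["a1", "a2", "a3"],
    "CityB": ["b1", "b2"],
    "CityC": ["c1", "c2", "c3", "c4"],
    "CityD": ["d1", "d2"],
}

# Randomly select whole clusters, then include every member of each.
random.seed(1)
chosen = random.sample(sorted(clusters), k=2)
sample = [person for city in chosen for person in clusters[city]]
```

The randomness is at the cluster level, not the individual level, which is why this works when no complete list of individuals exists.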
- 59. Systematic Sampling is when you choose every “nth” individual to be a part of the sample. For example, you can choose every 3rd person to be in the sample. Systematic sampling is an extended implementation of the same old probability technique in which each member of the group is selected at regular periods to form a sample. There’s an equal opportunity for every member of a population to be selected using this sampling technique.
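The "every nth individual" rule is a one-line slice once a random starting offset is chosen; a minimal sketch with an invented population of 30 members and an interval of n = 3:

```python
import random

population = list(range(1, 31))  # 30 members
n = 3                            # sampling interval: every 3rd member

random.seed(7)
start = random.randrange(n)      # random offset within the first interval
sample = population[start::n]    # every nth member from that offset

# Each member's chance of selection is 1/n.
```

The random starting offset is what keeps this a probability method; starting at a fixed position would bias the sample whenever the list has a periodic ordering.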
- 60. Non-probability sampling: Non-probability sampling is defined as a sampling technique in which the researcher selects samples based on subjective judgment rather than random selection. It is a less stringent method. This sampling method depends heavily on the expertise of the researchers. It is carried out by observation, and researchers use it widely in qualitative research.
- 61. Non-probability sampling is a sampling method in which not all members of the population have an equal chance of participating in the study, unlike probability sampling, where each member of the population has a known chance of being selected. Non-probability sampling is most useful for exploratory studies like a pilot survey (deploying a survey to a smaller sample compared to pre-determined sample size). Researchers use this method in studies where it is not possible to draw random probability sampling due to time or cost considerations.
- 62. Convenience sampling: Convenience sampling is a non-probability sampling technique where samples are selected from the population only because they are conveniently available to the researcher. Researchers choose these samples just because they are easy to recruit, and the researcher did not consider selecting a sample that represents the entire population. Ideally, in research, it is good to test a sample that represents the population. But, in some research, the population is too large to examine and consider the entire population. It is one of the reasons why researchers rely on convenience sampling, which is the most common non-probability sampling method, because of its speed, cost-effectiveness, and ease of availability of the sample.
- 63. Consecutive sampling: This non-probability sampling method is very similar to convenience sampling, with a slight variation. Here, the researcher picks a single person or a group of a sample, conducts research over a period, analyzes the results, and then moves on to another subject or group if needed. Consecutive sampling technique gives the researcher a chance to work with many topics and fine-tune his/her research by collecting results that have vital insights.
- 64. Quota sampling: Hypothetically consider, a researcher wants to study the career goals of male and female employees in an organization. There are 500 employees in the organization, also known as the population. To understand better about a population, the researcher will need only a sample, not the entire population. Further, the researcher is interested in particular strata within the population. Here is where quota sampling helps in dividing the population into strata or groups.
- 65. Judgmental or Purposive sampling: In the judgmental sampling method, researchers select the samples based purely on the researcher’s knowledge and credibility. In other words, researchers choose only those people who they deem fit to participate in the research study. Judgmental or purposive sampling is not a scientific method of sampling, and the downside to this sampling technique is that the preconceived notions of a researcher can influence the results. Thus, this research technique involves a high amount of ambiguity.
- 66. TYPES OF STATISTICAL DATA ELEMENTS: Data types are an important concept in statistics that needs to be understood in order to correctly apply statistical measurements to your data and therefore correctly draw certain conclusions about it. This section introduces the different data types you need to know to do proper exploratory data analysis (EDA), which is one of the most underestimated parts of a machine learning project.
- 67. Categorical data represent characteristics such as a person’s gender, marital status, hometown, or the types of movies they like. Categorical data can take on numerical values (such as “1” indicating male and “2” indicating female), but those numbers don’t have mathematical meaning. You couldn’t add them together, for example. (Other names for categorical data are qualitative data, or Yes/No data.)
- 68. Numerical data. These data have meaning as a measurement, such as a person’s height, weight, IQ, or blood pressure; or they’re a count, such as the number of stock shares a person owns, how many teeth a dog has, or how many pages you can read of your favorite book before you fall asleep. (Statisticians also call numerical data quantitative data.) Numerical data can be further broken into two types: discrete and continuous.
- 69. Discrete data represent items that can be counted; they take on possible values that can be listed out. The list of possible values may be fixed (also called finite); or it may go from 0, 1, 2, on to infinity (making it countably infinite). For example, the number of heads in 100 coin flips takes on values from 0 through 100 (finite case), but the number of flips needed to get 100 heads takes on values from 100 (the fastest scenario) on up to infinity (if you never get to that 100th heads). Its possible values are listed as 100, 101, 102, 103, . . . (representing the countably infinite case).
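Both flavors of discrete data in the coin-flip example can be simulated; this is a small sketch using Python's random module (the seed is arbitrary):

```python
import random

random.seed(3)

# Finite discrete variable: number of heads in 100 flips (0..100).
heads = sum(random.choice([0, 1]) for _ in range(100))

# Countably infinite discrete variable: flips needed to reach 100 heads.
# At least 100, but with no upper bound.
flips, count = 0, 0
while count < 100:
    flips += 1
    count += random.choice([0, 1])
```

The first count is bounded by construction; the second loop is guaranteed to terminate with probability 1 but no run of it has a fixed maximum length, which is exactly the finite vs. countably infinite distinction.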
- 70. Continuous data represent measurements; their possible values cannot be counted and can only be described using intervals on the real number line. For example, the exact amount of gas purchased at the pump for cars with 20-gallon tanks would be continuous data from 0 gallons to 20 gallons, represented by the interval [0, 20], inclusive. You might pump 8.40 gallons, or 8.41, or 8.414863 gallons, or any possible number from 0 to 20. In this way, continuous data can be thought of as being uncountably infinite. For ease of recordkeeping, statisticians usually pick some point in the number to round off. Another example: the lifetime of a C battery can technically be anywhere from 0 hours to an infinite number of hours (if it lasts forever), with all possible values in between. Granted, you don't expect a battery to last more than a few hundred hours, but no one can put a cap on how long it can go (remember the Energizer Bunny?).
- 71. Fundamental levels of measurement scales: Nominal, ordinal, interval, and ratio are defined as the four fundamental levels of measurement scales used to capture data in the form of surveys and questionnaires.
- 72. Nominal Scale: 1st Level of Measurement. The nominal scale is a naming scale, where variables are simply "named" or labeled, with no specific order. Also called the categorical variable scale, it is defined as a scale used for labeling variables into distinct classifications and doesn't involve a quantitative value or order. This scale is the simplest of the four variable measurement scales. Where do you live? 1. Suburbs 2. City 3. Town. The nominal scale is often used in research surveys and questionnaires where only variable labels hold significance. "Which brand of smartphone do you prefer?" Options: "Apple" - 1, "Samsung" - 2, "OnePlus" - 3.
- 73. Ordinal Scale: 2nd Level of Measurement. The ordinal scale has all its variables in a specific order, beyond just naming them. It is a variable measurement scale used to depict the order of variables, not the difference between them. These scales are generally used to depict non-mathematical ideas such as frequency, satisfaction, happiness, degree of pain, etc. How satisfied are you with our services? Very Unsatisfied – 1, Unsatisfied – 2, Neutral – 3, Satisfied – 4, Very Satisfied – 5.
- 74. Interval Scale: 3rd Level of Measurement. The interval scale is defined as a numerical scale where the order of the variables is known as well as the difference between these variables. Variables that have familiar, constant, and computable differences are classified using the interval scale. What is your family income? What is the temperature in your city?
- 75. Ratio Scale: 4th Level of Measurement. The ratio scale is defined as a variable measurement scale that not only produces the order of variables but also makes the difference between variables known, along with information on the value of a true zero. With the option of a true zero, varied inferential and descriptive analysis techniques can be applied to the variables. What is your weight in kilograms? Less than 50 kilograms / 51-70 kilograms / 71-90 kilograms / 91-110 kilograms / More than 110 kilograms.
- 76. VISUAL DATA EXPLORATION AND EXPLORATORY STATISTICAL ANALYSIS: Visual data exploration is a very important part of getting to know our data in an "informal" way. It allows the analyst to get some initial insights into the data, which can then be usefully adopted throughout the modeling. Different plots/graphs can be useful here.
- 77. Chart Types: Pie Chart. A pie chart is a circular graph divided into slices; the larger a slice is, the bigger the portion of the total quantity it represents. A pie chart represents a variable's distribution as a pie, whereby each section represents the portion of the total percent taken by each value of the variable. So, pie charts are best suited to depict sections of a whole. Example: if a company operates three separate divisions, at year-end its top management would be interested in seeing what portion of total revenue each division accounted for.
- 78. Chart Types: Bar charts. Bar charts represent the frequency of each of the values (either absolute or relative) as bars. A bar chart is composed of a series of bars illustrating a variable's development. Given that bar charts are such a common chart type, people are generally familiar with them and can understand them easily. A bar chart with one variable is easy to follow. Bar charts are great when we want to track the development of one or two variables over time. For example, one of the most frequent applications of bar charts in corporate presentations is to show how a company's total revenues have developed during a given period.
- 79. Bar charts can also work well for the comparison of two variables over time. Let's say we would like to compare the revenues of two companies in the timeframe between 2014 and 2018.
- 80. Chart Types: Histogram charts A histogram is a plot that lets you discover, and show, the underlying frequency distribution (shape) of a set of continuous data. This allows the inspection of the data for its underlying distribution (e.g., normal distribution), outliers, skewness, etc. An example of a histogram, and the raw data it was constructed from, is shown below: 36 25 38 46 55 68 72 55 36 38 67 45 22 48 91 46 52 61 58 55
- 81. To construct a histogram from a continuous variable you first need to split the data into intervals, called bins. In the example above, age has been split into bins, with each bin representing a 10-year period starting at 20 years. Each bin contains the number of occurrences of scores in the data set that fall within that bin. For the above data set, the frequencies in each bin have been tabulated along with the scores that contributed to the frequency in each bin (see below):
- 82. Frequency table for the age data:
Bin      Frequency   Scores Included in Bin
20-30    2           25, 22
30-40    4           36, 38, 36, 38
40-50    4           46, 45, 48, 46
50-60    5           55, 55, 52, 58, 55
60-70    3           68, 67, 61
70-80    1           72
80-90    0           (none)
90-100   1           91
Notice that, unlike a bar chart, there are no "gaps" between the bars (although some bars might be "absent", reflecting no frequencies). This is because a histogram represents a continuous data set, and as such, there are no gaps in the data (although you will have to decide whether you round up or round down scores on the boundaries of bins).
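The binning step described above can be reproduced directly from the 20 ages listed earlier. A minimal pure-Python sketch, using half-open bins so each boundary score falls into the higher bin:

```python
# The 20 ages from the example above.
ages = [36, 25, 38, 46, 55, 68, 72, 55, 36, 38,
        67, 45, 22, 48, 91, 46, 52, 61, 58, 55]

# Bins are 10-year intervals starting at 20: [20, 30), [30, 40), ..., [90, 100).
frequencies = [0] * 8

for age in ages:
    # Half-open bins: a score of exactly 30 would land in the 30-40 bin.
    index = (age - 20) // 10
    frequencies[index] += 1

print(frequencies)  # [2, 4, 4, 5, 3, 1, 0, 1] -- matches the table
```

Note the empty 80-90 bin: in the drawn histogram it appears as an "absent" bar, not a gap in the axis.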
- 83. Chart Types: Scatter plots A scatter plot is a type of chart that is often used in the fields of statistics and data science. It consists of multiple data points plotted across two axes. Each variable depicted in a scatter plot would have multiple observations. If a scatter plot includes more than two variables, then we would use different colours to signify that. A scatter plot chart is a great indicator that allows us to see whether there is a pattern to be found between two variables.
- 84. The x-axis contains information about house price, while the y-axis indicates house size. There is an obvious pattern to be found – a positive relationship between the two. The bigger a house is, the higher its price.
- 85. Chart Types: Box plot Box plots (also known as box and whisker plots) are a type of chart often used in exploratory data analysis to visually show the distribution of numerical data and skewness through displaying the data quartiles (or percentiles) and the median.
- 86. Minimum Score The lowest score, excluding outliers (shown at the end of the left whisker). Lower Quartile Twenty-five percent of scores fall below the lower quartile value (also known as the first quartile). Median The median marks the mid-point of the data and is shown by the line that divides the box into two parts (sometimes known as the second quartile). Half the scores are greater than or equal to this value and half are less.
- 87. Upper Quartile Seventy-five percent of the scores fall below the upper quartile value (also known as the third quartile). Thus, 25% of data are above this value. Maximum Score The highest score, excluding outliers (shown at the end of the right whisker). Whiskers The upper and lower whiskers represent scores outside the middle 50% (i.e., the lower 25% of scores and the upper 25% of scores). The Interquartile Range (or IQR) This is the box of the box plot, showing the middle 50% of scores (i.e., the range between the 25th and 75th percentiles).
- 88. Box plots divide the data into sections that each contain approximately 25% of the data in that set.
- 89. Box plots are useful as they show the skewness (the degree of asymmetry observed in a probability distribution) of a data set. Box plots are also useful as they show the central score of a data set. The median is the middle value of a set of data and is shown by the line that divides the box into two parts. Half the scores are greater than or equal to this value and half are less.
- 90. When the median is in the middle of the box, and the whiskers are about the same on both sides of the box, then the distribution is symmetric. When the median is closer to the bottom of the box, and if the whisker is shorter on the lower end of the box, then the distribution is positively skewed (skewed right). When the median is closer to the top of the box, and if the whisker is shorter on the upper end of the box, then the distribution is negatively skewed (skewed left).
- 91. Box plots are useful as they show outliers within a data set. An outlier is an observation that is numerically distant from the rest of the data. When reviewing a box plot, an outlier is defined as a data point that is located outside the whiskers of the box plot.
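The quartiles, IQR, and outlier rule behind a box plot can be computed without any plotting library. A minimal sketch using the median-of-halves quartile convention (statistical packages may interpolate slightly differently) and Tukey's 1.5 × IQR fence for outliers, applied to the age data from the histogram example:

```python
def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def five_number_summary(values):
    # Quartiles via the median-of-halves method (one common convention).
    s = sorted(values)
    n = len(s)
    q1 = median(s[: n // 2])          # lower quartile
    q2 = median(s)                    # median
    q3 = median(s[(n + 1) // 2:])     # upper quartile
    iqr = q3 - q1
    # Tukey's rule: points beyond 1.5 * IQR from the box are outliers.
    outliers = [x for x in s if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
    return q1, q2, q3, iqr, outliers

ages = [36, 25, 38, 46, 55, 68, 72, 55, 36, 38,
        67, 45, 22, 48, 91, 46, 52, 61, 58, 55]
print(five_number_summary(ages))  # (38.0, 50.0, 59.5, 21.5, [])
```

For this data set, 91 lies just inside the upper fence (59.5 + 1.5 × 21.5 = 91.75), so the whisker extends to it and no point is flagged as an outlier.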
- 92. MISSING VALUE TREATMENT Missing data in the training data set can reduce the power / fit of a model or can lead to a biased model because we have not analysed the behavior and relationship with other variables correctly. It can lead to wrong prediction or classification.
- 93. Why do missing values occur? They may occur at two stages: 1. Data extraction: It is possible that there are problems with the extraction process. In such cases, we should double-check for correct data with the data guardians. Some hashing procedures can also be used to make sure data extraction is correct. Errors at the data extraction stage are typically easy to find and can be corrected easily as well. 2. Data collection: These errors occur at the time of data collection and are harder to correct. They can be categorized into four types:
- 94. 1. Missing completely at random: This is a case when the probability of a value being missing is the same for all observations. Data is missing independently of both observed and unobserved data. Example: a survey respondent randomly skips a question. For instance, respondents of a data collection process decide that they will declare their earnings after tossing a fair coin: if a head occurs, the respondent declares his or her earnings, otherwise not. Here each observation has an equal chance of a missing value. 2. Missing at random: This is a case when a variable is missing at random and the missing ratio varies for different values / levels of other input variables. The probability of missing data is related to the observed data but not the missing data. Example: whether people report their income varies with their job title and education level, which we have observed for everyone.
- 95. 3. Missing that depends on unobserved predictors: This is a case when the missing values are not random and are related to an unobserved input variable; the missingness is related to data we did not collect. For example, in a medical study, if a particular diagnostic causes discomfort, then there is a higher chance of dropping out from the study. This missing value is not at random unless we have included "discomfort" as an input variable for all patients. 4. Missing that depends on the missing value itself: This is a case when the probability of a missing value is directly correlated with the missing value itself. For example, people with higher or lower incomes are likely to provide a non-response to the earnings question; people with very low incomes might under-report because of stigma.
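The difference between the first two mechanisms can be made concrete with a small simulation. The records, probabilities, and variable names below are hypothetical; the point is only where the missingness probability comes from. (The "depends on unobserved predictors" and "depends on the missing value itself" cases would look like the MAR code, except the probability would use a variable we never recorded, or the income value itself.)

```python
import random

random.seed(0)  # deterministic for illustration

# Hypothetical records: income plus an observed covariate (education years).
people = [{"education": random.randint(8, 20),
           "income": random.randint(20, 200)} for _ in range(1000)]

# Missing completely at random: every record has the same 20% chance of a
# missing income, regardless of any observed or unobserved value.
mcar = [dict(p, income=None) if random.random() < 0.2 else dict(p)
        for p in people]

# Missing at random: the chance of a missing income depends only on the
# OBSERVED education value (40% for highly educated respondents, 5% otherwise).
mar = [dict(p, income=None)
       if random.random() < (0.4 if p["education"] >= 16 else 0.05)
       else dict(p)
       for p in people]

missing_mcar = sum(p["income"] is None for p in mcar)
missing_mar_high = sum(p["income"] is None for p in mar if p["education"] >= 16)
print(missing_mcar, missing_mar_high)
```

Under MCAR the missing rate is flat across all groups; under MAR it differs sharply between education levels, but since education is observed we can still model or correct for it.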
- 96. What are the methods to treat missing values? Some analytical techniques (e.g., decision trees) can directly deal with missing values. Other techniques need some additional preprocessing. The following are the most popular schemes to deal with missing values: 1. Deletion. This is the most straightforward option and consists of deleting observations or variables with lots of missing values. This, of course, assumes that information is missing at random and has no meaningful interpretation and/or relationship to the target.
- 97. Deletion is of two types: listwise deletion and pairwise deletion. In listwise deletion, we delete observations where any of the variables is missing. Simplicity is one of the major advantages of this method, but it reduces the power of the model because it reduces the sample size.
- 98. In pairwise deletion, we perform analysis with all cases in which the variables of interest are present. The advantage of this method is that it keeps as many cases as possible available for each analysis. One of its disadvantages is that it uses different sample sizes for different variables. Deletion methods are used when the nature of missing data is "missing completely at random"; otherwise, non-random missing values can bias the model output.
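The two deletion schemes can be contrasted on a tiny hypothetical data set (the rows below are invented; `None` marks a missing value):

```python
# Hypothetical survey rows; None marks a missing value.
rows = [
    {"age": 25, "income": 40,   "score": 7},
    {"age": 31, "income": None, "score": 9},
    {"age": None, "income": 55, "score": 6},
    {"age": 44, "income": 70,   "score": None},
    {"age": 52, "income": 80,   "score": 8},
]

# Listwise deletion: drop any row with at least one missing value.
listwise = [r for r in rows if all(v is not None for v in r.values())]

# Pairwise deletion: for each analysis, keep every row where the variables
# of interest are present -- so the sample size differs per analysis.
age_income = [(r["age"], r["income"]) for r in rows
              if r["age"] is not None and r["income"] is not None]
age_score = [(r["age"], r["score"]) for r in rows
             if r["age"] is not None and r["score"] is not None]

print(len(listwise), len(age_income), len(age_score))  # 2 3 3
```

Listwise deletion leaves only 2 of the 5 rows for every analysis, while pairwise deletion keeps 3 rows for each pair of variables, at the cost of inconsistent sample sizes.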
- 99. 2. Replace (imputation ). This implies replacing the missing value with a known value. Imputation is a method to fill in the missing values with estimated ones. The objective is to employ known relationships that can be identified in the valid values of the data set to assist in estimating the missing values. Mean / Mode / Median imputation is one of the most frequently used methods. It consists of replacing the missing data for a given attribute by the mean or median (quantitative attribute) or mode (qualitative attribute) of all known values of that variable.
- 100. Common imputation methods: Mean/Median/Mode Imputation: Replace missing values with the mean, median, or mode of the observed values. Simple, but can distort the data distribution. K-Nearest Neighbors (KNN) Imputation: Replace missing values using the nearest neighbors' values. Effective but computationally expensive. Multiple Imputation: Create multiple imputations for the missing values and pool the results. This approach preserves the uncertainty of the missing data. Regression Imputation: Use regression models to predict and fill in missing values based on other variables. Hot Deck Imputation: Replace missing values with values from similar records in the dataset. Machine Learning Models: Advanced models like random forests or neural networks can be used to predict missing values.
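The caveat that mean imputation "can distort the data distribution" is easy to demonstrate: filling gaps with the mean leaves the mean untouched but shrinks the spread, because every imputed point sits exactly at the centre. A small sketch with invented values:

```python
import statistics

# Hypothetical sample with missing values (None).
values = [10, 12, None, 14, 18, None, 20, 22]

observed = [v for v in values if v is not None]
mean = statistics.mean(observed)                 # 16
imputed = [mean if v is None else v for v in values]

# The mean is preserved, but the standard deviation shrinks.
print(statistics.mean(imputed))                                   # 16
print(statistics.pstdev(imputed) < statistics.pstdev(observed))   # True
```

This understated variance is one reason multiple imputation, which keeps the uncertainty of the missing data, is often preferred for inference.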
- 101. Two types of imputation: Generalized imputation: In this case, we calculate the mean or median of all non-missing values of a variable and then replace the missing values with it. In the table above, the variable "Manpower" has missing values, so we take the average of all non-missing values of "Manpower" (28.33) and replace each missing value with it. Similar case imputation: In this case, we calculate the average for gender "Male" (29.75) and "Female" (25) individually over non-missing values, then replace each missing value based on gender. For "Male", we replace missing values of Manpower with 29.75, and for "Female" with 25.
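Both variants can be sketched in plain Python. The rows below are hypothetical stand-ins for the slide's table (the original data is not reproduced here, so the computed means will not match the 28.33 / 29.75 / 25 figures above):

```python
# Hypothetical rows echoing the slide's example; None marks a missing value.
rows = [
    {"gender": "Male",   "manpower": 30},
    {"gender": "Male",   "manpower": None},
    {"gender": "Male",   "manpower": 28},
    {"gender": "Female", "manpower": 25},
    {"gender": "Female", "manpower": None},
    {"gender": "Female", "manpower": 27},
]

def mean(xs):
    return sum(xs) / len(xs)

# Generalized imputation: one mean over ALL non-missing values.
overall = mean([r["manpower"] for r in rows if r["manpower"] is not None])

# Similar case imputation: a separate mean per gender group.
by_gender = {}
for g in ["Male", "Female"]:
    group = [r["manpower"] for r in rows
             if r["gender"] == g and r["manpower"] is not None]
    by_gender[g] = mean(group)

# Fill each gap with its group's mean (the "similar case" variant).
filled = [dict(r, manpower=by_gender[r["gender"]]) if r["manpower"] is None
          else dict(r) for r in rows]

print(overall, by_gender)  # 27.5 {'Male': 29.0, 'Female': 26.0}
```

Similar case imputation respects between-group differences that a single overall mean would smooth away.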
- 102. 3. Keep. Missing values can be meaningful (e.g., a customer did not disclose his or her income because he or she is currently unemployed). Since this is clearly related to the target (e.g., good/bad risk), it needs to be considered as a separate category.