Bruno Voisin from the Irish Centre for High End Computing presented this Introduction to Data Analytics Techniques and their Implementation in R during the Big Data Workshop hosted by the Social Sciences Computing Hub at the Whitaker Institute on 14 November 2013.
1. Introduction to Data Analytics Techniques and their Implementation in R
Dr Bruno Voisin
Irish Centre for High End Computing (ICHEC)
November 14, 2013
2. Outline
Preparing vs Processing
Preparing the Data
◮ Outliers
◮ Missing values
◮ R data types: numerical vs factors
◮ Reshaping data
Forecasting, Predicting, Classifying...
◮ Linear Regression
◮ K nearest neighbours
◮ Decision Trees
◮ Time Series
Going Further
◮ Ensembles of models
◮ Rattle
3. Preparing vs Processing
Before considering what mathematical models could fit your data, ask yourself: "Is my data ready for this?"
Pro-tip: the answer is no. Sorry. Chances are...
It's "noisy".
It's wrong.
It's incomplete.
It's not in shape.
Spending 90% of your time preparing data, 10% fitting models isn’t
necessarily a bad ratio!
5. Outliers
Outliers are records with unusual values for an attribute or
combination of attributes. As a rule, we need to:
◮ detect them
◮ understand them (typo vs genuine but unusual value)
◮ decide what to do with them (remove them or not, correct them)
6. Detecting outliers: mean vs median
Both mean and median provide an expected 'typical' value useful to detect outliers.
Mean has some nice mathematical properties (standard deviation).
Median is more tolerant of outliers and asymmetrical data.
Rule of thumb:
◮ nicely symmetrical data with mean ≈ median: safe to use mean.
◮ noisy, asymmetrical data where mean ≠ median: use median.
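A quick illustrative check (simulated data, not from the slides): one wild value drags the mean but barely moves the median.
> x <- c(rnorm(99, mean = 50, sd = 5), 5000)  # 99 ordinary values plus one outlier
> mean(x)    # pulled far above the typical value by the outlier
> median(x)  # stays close to 50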
8. Detecting outliers: the boxplot
Graphical representation of the median, the quartiles, and the most extreme observations not considered outliers.
> data(iris)
> boxplot(iris[,c(1,2,3,4)], col=rainbow(4), notch=TRUE)
9. Detecting outliers: the boxplot
Use identify to turn outliers into clickable dots and have R return
their indices:
> boxplot(iris[,c(1,2,3,4)], col=rainbow(4), notch=TRUE)
> identify(array(2,length(iris[[2]])),iris$Sepal.Width)
[1] 16 33 34 61
> outliers <- identify(array(2,length(iris[[1]])),
iris$Sepal.Width)
> iris[outliers,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
16          5.7         4.4          1.5         0.4     setosa
33          5.2         4.1          1.5         0.1     setosa
34          5.5         4.2          1.4         0.2     setosa
61          5.0         2.0          3.5         1.0 versicolor
10. Detecting outliers: the boxplot
For automated tasks, use the boxplot object itself:
> x <- iris$Sepal.Width
> bp <- boxplot( iris$Sepal.Width )
> iris[x %in% bp$out,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
16          5.7         4.4          1.5         0.4     setosa
33          5.2         4.1          1.5         0.1     setosa
34          5.5         4.2          1.4         0.2     setosa
61          5.0         2.0          3.5         1.0 versicolor
11. Detecting outliers: Mk1 Eyeballs
Some weird cases may always show up which quick stats won't pick up.
Visual approach: plot the data and spot weird cases by eye, like:
*******.........*...........*********
^
outlier
12. Understanding Outliers
No general rule; this is pretty much a domain-dependent task.
Data analysts and domain experts work together to tell genuine records from obvious errors (a 127-year-old driver renting a car).
Class information is at the centre of automated classification: consider outliers with regard to their own class if available.
13. Understanding Outliers
Iris example: for Setosa, of the three extreme Sepal.Width values, only one is genuinely out of range. For Versicolor, the odd one out disappears, and new outliers appear on other variables:
> par(mfrow = c(1,2))
> boxplot(iris[iris$Species=="setosa",c(1,2,3,4)], main="Setosa")
> boxplot(iris[iris$Species=="versicolor",c(1,2,3,4)], main="Versicolor")
14. Managing outliers
Incorrect data should be treated as missing (to be ignored or
simulated, see below).
Genuine but unusual data is processed according to context:
◮ should generally be kept (sometimes the exceptions are even of particular interest, ex: fraud detection)
◮ may occasionally be removed (bad practice, but sometimes there is interest in modelling only the mainstream data)
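For instance, a minimal sketch of turning obviously incorrect values into NAs (a hypothetical drivers table, echoing the 127-year-old driver above):
> drivers <- data.frame(id = 1:5, age = c(34, 127, 58, 19, 45))
> drivers$age[drivers$age > 120] <- NA   # impossible age becomes missing
> drivers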
15. Missing values
Missing values are represented in R by the special NA value.
Amusingly, 'NA' in some data sets may mean 'North America', 'North American Airlines', etc. Keep it in mind while importing/exporting data.
Finding/counting them from a variable or data frame:
> library(mlbench)    # provides the BreastCancer data set
> data(BreastCancer)
> sum(is.na(BreastCancer[,7]))
[1] 16
> incomplete <- BreastCancer[!complete.cases(BreastCancer),]
> nrow(incomplete)
[1] 16
16. Strategies for missing values
removing NAs:
> nona <- BreastCancer[complete.cases(BreastCancer),]
replacing NAs:
◮ mean
> x <- iris$Sepal.Width
> x[sample(length(x),5)] <- NA
> x[is.na(x)] <- mean(x, na.rm=TRUE)
◮ median (can be grabbed from the boxplot; see the sketch below), or the most common value for a nominal variable
◮ value of the closest case in the other dimensions
◮ a value decided by a domain expert (caveat: data mining aims at finding what the domain experts don't already know)
◮ etc.
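A sketch of the median variant, mirroring the mean example above (the boxplot object stores the median in row 3 of its $stats matrix):
> x <- iris$Sepal.Width
> x[sample(length(x), 5)] <- NA
> bp <- boxplot(x, plot = FALSE)   # boxplot.stats drops NAs
> x[is.na(x)] <- bp$stats[3, 1]    # row 3 of $stats is the median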
17. R data types: numerical vs factors
Most analytics algorithms are number crunchers at heart.
However, discrete variables can be managed by some techniques. R modules generally require those to be stored as factors.
Discrete variables are a better fit for some techniques (decision trees):
◮ consider conversion of numerical to meaningful ranges (ex: customer
age range)
◮ integer variables can be used as either numerical or factor
18. Factor to numerical
as.numeric isn't sufficient since it would simply return the internal level codes of the variable. We need to 'translate' each level into its value.
> library(mlbench)
> data(BreastCancer)
> f <- BreastCancer$Cell.shape[1:10]
> as.numeric(levels(f))[f]
[1] 1 4 1 8 1 10 1 2 1 1
19. Numerical to factor
Converting numerical to factor "as is" with as.factor:
> s <- c(21, 43, 55, 18, 21, 50, 20, 67, 36, 33, 36)
> as.factor(s)
[1] 21 43 55 18 21 50 20 67 36 33 36
Levels: 18 20 21 33 36 43 50 55 67
Converting numerical ranges to a factor with cut:
> cut(s, c(-Inf, 21, 26, 30, 34, 44, 54, 64, Inf), labels=
c("21 and Under", "22 to 26", "27 to 30", "31 to 34",
"35 to 44", "45 to 54", "55 to 64", "65 and Over"))
 [1] 21 and Under 35 to 44     55 to 64     21 and Under 21 and Under
 [6] 45 to 54     21 and Under 65 and Over  35 to 44     31 to 34
[11] 35 to 44
8 Levels: 21 and Under 22 to 26 27 to 30 31 to 34 35 to 44 ... 65 and Over
20. Reshaping
More often than not, the 'shape' of the data as it comes won't be convenient.
Look at the following example:
> pop <- read.csv("http://2010.census.gov/2010census/data/pop
> pop <- pop[,1:12]
> colnames(pop)
[1] "STATE_OR_REGION" "X1910_POPULATION" "X1920_POPULATION"
[5] "X1940_POPULATION" "X1950_POPULATION" "X1960_POPULATION"
[9] "X1980_POPULATION" "X1990_POPULATION" "X2000_POPULATION"
> pop[1:10,]
        STATE_OR_REGION X1910_POPULATION X1920_POPULATION ...
1 United States 92228531 106021568
2 Alabama 2138093 2348174
3 Alaska 64356 55036
4 Arizona 204354 334162
5 Arkansas 1574449 1752204
6 California 2377549 3426861
7 Colorado 799024 939629
8 Connecticut 1114756 1380631
9 Delaware 202322 223003
10 District of Columbia 331069 437571
21. Reshaping: melt
The reshape2 package provides convenient functions for reshaping
data:
> library(reshape2)
> colnames(pop) <- c("state", seq(1910, 2010, 10))
> mpop <- melt(pop, id.vars="state", variable.name="year",
value.name="population")
> mpop[1:10,]
state year population
1 United States 1910 92228531
2 Alabama 1910 2138093
3 Alaska 1910 64356
4 Arizona 1910 204354
5 Arkansas 1910 1574449
6 California 1910 2377549
7 Colorado 1910 799024
8 Connecticut 1910 1114756
9 Delaware 1910 202322
10 District of Columbia 1910 331069
This 'long' format is also friendlier to a relational database table.
22. Reshaping: cast
acast and dcast reverse the melt and produce respectively an
array/matrix or a data frame:
> dcast(mpop, state ~ year, value.var="population")[1:10,]
                  state    1910    1920    1930    1940 ...
1               Alabama 2138093 2348174 2646248 2832961 ...
2                Alaska   64356   55036   59278   72524 ...
3               Arizona  204354  334162  435573  499261 ...
4              Arkansas 1574449 1752204 1854482 1949387 ...
5            California 2377549 3426861 5677251 6907387 ...
6              Colorado  799024  939629 1035791 1123296 ...
7           Connecticut 1114756 1380631 1606903 1709242 ...
8              Delaware  202322  223003  238380  266505 ...
9  District of Columbia  331069  437571  486869  663091 ...
10              Florida  752619  968470 1468211 1897414 ...
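For the matrix/array form, a minimal acast sketch on the same data (row and column labels come from the formula):
> apop <- acast(mpop, state ~ year, value.var = "population")
> apop["California", "1910"]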
23. Forecasting, Predicting, Classifying...
Ultimately, we’re trying to understand a behaviour from our data.
To this end, various mathematical models have been developed,
matching various known behaviours.
Each model will come with its own sweet/blind spots and its own
scaling issues when moving towards Big Data.
Today's overview of models will cover Linear Regression, kNN, Decision Trees and basic Time Series, but there are many more models around...
24. Linear Regression
One of the simplest models.
Establish a linear relationship between variables, predicting one
variable’s value (the response) from the others (the predictors).
Intuitively, it’s all about drawing a line. But the right line.
25. Simple Linear Regression
> data(trees)
> plot(trees$Girth, trees$Volume)
26. Simple Linear Regression
> lm(formula=Volume~Girth, data=trees)
Call:
lm(formula = Volume ~ Girth, data = trees)
Coefficients:
(Intercept) Girth
-36.943 5.066
> abline(-36.943, 5.066)
27. Simple Linear Regression
For a response variable r and predictor variables p1, p2, ..., pn, the lm() function generates a simple linear model based on a formula object of the form:
r ~ p1 + p2 + ... + pn
Example: building a linear model using both Girth and Height as
predictors for a tree’s Volume:
> lm(formula=Volume~Girth+Height, data=trees)
Call:
lm(formula = Volume ~ Girth + Height, data = trees)
Coefficients:
(Intercept) Girth Height
-57.9877 4.7082 0.3393
By default, lm() fits the model that minimizes the sum of square
errors.
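Once fitted, a model can serve for prediction with predict(); a minimal sketch (the Girth and Height values below are illustrative, not from the slides):
> fit <- lm(formula = Volume ~ Girth + Height, data = trees)
> predict(fit, newdata = data.frame(Girth = 10, Height = 80))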
28. Linear Model Evaluation
> fit <- lm(formula=Volume~Girth+Height, data=trees)
> summary(fit)
Call:
lm(formula = Volume ~ Girth + Height, data = trees)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
Girth 4.7082 0.2643 17.816 < 2e-16 ***
Height 0.3393 0.1302 2.607 0.0145 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
29. Refining the model
Low-significance predictors add complexity to a model for little gain.
The anova() function helps evaluate predictors.
The update() function allows us to remove predictors from the
model.
The step() function can repeat such a task using a different
criterion.
30. Refining the model: anova()
> data(airquality)
> fit <- lm(formula = Ozone ~ . , data=airquality)
> anova(fit)
Analysis of Variance Table
Response: Ozone
Df Sum Sq Mean Sq F value Pr(>F)
Solar.R 1 14780 14780 33.9704 6.216e-08 ***
Wind 1 39969 39969 91.8680 5.243e-16 ***
Temp 1 19050 19050 43.7854 1.584e-09 ***
Month 1 1701 1701 3.9101 0.05062 .
Day 1 619 619 1.4220 0.23576
Residuals 105 45683 435
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Sum Sq shows the reduction in the residual sum of squares as each
predictor is added. Small values contribute less.
31. Refining the model: update()
> fit2 <- update(fit, . ~ . - Day)
> summary(fit2)
Call:
lm(formula = Ozone ~ Solar.R + Wind + Temp + Month, data = airquality)
[...]
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -58.05384 22.97114 -2.527 0.0130 *
Solar.R 0.04960 0.02346 2.114 0.0368 *
Wind -3.31651 0.64579 -5.136 1.29e-06 ***
Temp 1.87087 0.27363 6.837 5.34e-10 ***
Month -2.99163 1.51592 -1.973 0.0510 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.9 on 106 degrees of freedom
(42 observations deleted due to missingness)
Multiple R-squared: 0.6199, Adjusted R-squared: 0.6055
F-statistic: 43.21 on 4 and 106 DF, p-value: < 2.2e-16
We removed the Day predictor from the model.
New model is slightly worse... but simpler...
32. Refining the model: step()
The step() function can automatically reduce the model:
> final <- step(fit)
Start: AIC=680.21
Ozone ~ Solar.R + Wind + Temp + Month + Day
Df Sum of Sq RSS AIC
- Day 1 618.7 46302 679.71
<none> 45683 680.21
- Month 1 1755.3 47438 682.40
- Solar.R 1 2005.1 47688 682.98
- Wind 1 11533.9 57217 703.20
- Temp 1 20845.0 66528 719.94
Step: AIC=679.71
Ozone ~ Solar.R + Wind + Temp + Month
Df Sum of Sq RSS AIC
<none> 46302 679.71
- Month 1 1701.2 48003 681.71
- Solar.R 1 1952.6 48254 682.29
- Wind 1 11520.5 57822 702.37
- Temp 1 20419.5 66721 718.26
33. K-nearest neighbours (KNN)
K-nearest neighbour classification is amongst the simplest classification algorithms.
It classifies an element as the majority class among the k elements of the learning set closest to it in the multidimensional feature space.
No training needed.
Classification can be compute-intensive for high k values (many distances to evaluate) and requires access to the learning data set.
Very intuitive for the end user, but does not provide any insight into the data.
34. An example
With k = 5, the central dot would be classified as red.
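In code, a toy version of this vote (simulated two-class data, not the slide's figure; knn accepts a single case as a plain vector):
> library(class)
> set.seed(1)
> train <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
+                matrix(rnorm(20, mean = 3), ncol = 2))
> cl <- factor(rep(c("red", "blue"), each = 10))
> knn(train, test = c(1.5, 1.5), cl, k = 5)  # majority vote of the 5 nearest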
35. What value for k?
Smaller k values are faster to process.
Higher k values are more robust to noise.
n-fold cross validation can be used on incremental values of k to
select a k value that minimises error.
36. KNN with R
The knn function (package class) provides KNN classification for R.
knn(train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)
Arguments:
   train: matrix or data frame of training set cases.
    test: matrix or data frame of test set cases. A vector will be
          interpreted as a row vector for a single case.
      cl: factor of true classifications of training set.
       k: number of neighbours considered.
       l: minimum vote for definite decision, otherwise 'doubt'. (More
          precisely, less than 'k-l' dissenting votes are allowed, even
          if 'k' is increased by ties.)
    prob: If this is true, the proportion of the votes for the winning
          class are returned as attribute 'prob'.
 use.all: controls handling of ties. If true, all distances equal to
          the 'k'th largest are included. If false, a random selection
          of distances equal to the 'k'th is chosen to use exactly 'k'
          neighbours.
37. Using knn()
> library(class)
> train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
> test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
> cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
> knn(train, test, cl, k = 3, prob=TRUE)
[1] s s s s s s s s s s s s s s s s s s s s s s s s s c c v c c
[39] c c c c c c c c c c c c v c c v v v v v c v v v v c v v v v
attr(,"prob")
[1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
[8] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
[15] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
[22] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
[29] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.6666667
[36] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
[43] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
[50] 1.0000000 1.0000000 0.6666667 0.7500000 1.0000000 1.0000000
[57] 1.0000000 1.0000000 0.5000000 1.0000000 1.0000000 1.0000000
[64] 0.6666667 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
[71] 1.0000000 0.6666667 1.0000000 1.0000000 0.6666667
Levels: c s v
38. Counting errors with cross-validation for different k values
knn.cv() performs leave-one-out cross-validation: it classifies each
item while leaving it out of the learning set.
> train <- rbind(iris3[,,1], iris3[,,2], iris3[,,3])
> cl <- factor(c(rep("s",50), rep("c",50), rep("v",50)))
> sum( (knn.cv(train, cl, k = 1) == cl) == FALSE )
[1] 6
> sum( (knn.cv(train, cl, k = 5) == cl) == FALSE )
[1] 5
> sum( (knn.cv(train, cl, k = 15) == cl) == FALSE )
[1] 4
> sum( (knn.cv(train, cl, k = 30) == cl) == FALSE )
[1] 8
39. Decision trees
A decision tree is a tree-structured representation of a dataset and its
class-relevant partitioning.
The root node 'contains' the entire learning dataset.
Each non-terminal node is split according to a particular
attribute/value combination.
The class distribution in terminal nodes is used to assign a class
probability to new, unclassified data.
Human readable!
41. Building the tree
A tree is built by successive partitioning.
Starting from the root, every attribute is considered for a potential
split of the data set.
For each attribute, every possible split is considered.
The "best split" is picked by comparing the resulting distribution of
classes in the generated child nodes.
Each child node is then considered for further partitioning, and so on
until:
◮ partitioning a node doesn't improve the class distribution (e.g. only
one class is represented in the node),
◮ a node's "population" is too small (min split),
◮ a node's potential partitioning would generate a child node with too
small a population (min bucket).
42. Decision trees with R: the rpart module
rpart is an R package providing functions for generating decision trees
(among other things).
rpart(formula, data, weights, subset, na.action = na.rpart,
method, model = FALSE, x = FALSE, y = TRUE,
parms, control, cost, ...)
formula: class ~ att1 + att2 + ... + attn
data: name of dataframe whose columns include attributes used in
the formula.
weights: optional case weights.
subset: optional subsetting of the data set for use in the fit.
na.action: strategies for missing values.
method: defaults to "class" for a factor response, which gives a
class-based decision tree.
control: rpart control options (such as min split/bucket; refer to
?rpart.control for details).
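The model object used on the next slides is defined on a slide not included in this excerpt; a minimal sketch of how it could be built on the iris data (the control values shown are simply rpart's defaults):
> library(rpart)
> # classification tree for iris species; minsplit/minbucket are the
> # stopping rules described on slide 41
> model <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
               data = iris, method = "class",
               control = rpart.control(minsplit = 20, minbucket = 7))
> print(model)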
44. Plotting the tree model
> # basic R graphics plot of tree:
> plot(model)
> text(model)
> # fancier postscript plot of tree:
> post(model, file="mytree.ps", title="Iris Classification")
46. rpart model evaluation
Use a confusion matrix to measure accuracy of predictions:
> pred <- predict(model, iris[,c(1,2,3,4)], type="class")
> conf <- table(pred, iris$Species)
> sum(diag(conf)) / sum(conf)
[1] 0.96
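Note that this 0.96 is measured on the very data used to grow the tree, so it is optimistic. A minimal sketch of a held-out evaluation instead (the split and seed are illustrative):
> set.seed(42)                     # illustrative seed
> idx <- sample(nrow(iris), 100)   # 100 cases for training
> m <- rpart(Species ~ ., data = iris[idx, ], method = "class")
> pred <- predict(m, iris[-idx, ], type = "class")
> conf <- table(pred, iris$Species[-idx])
> sum(diag(conf)) / sum(conf)      # accuracy on the 50 held-out cases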
47. Time Series
Another type of model, applying to time-ordered data.
The signal is decomposed, additively or multiplicatively, into components.
Many models and parameter choices exist to fit a series, which can then
be used for forecasting.
Some automated fitting is available in R; a sketch of creating a series
object follows.
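As a minimal sketch, a series object is created with ts() by giving the sampling frequency and start date (the data vector below is illustrative):
> # a monthly series starting in January 2010, from an arbitrary vector
> x <- ts(c(12, 15, 14, 18, 21, 19, 23, 25, 22, 20, 17, 13),
        frequency = 12, start = c(2010, 1))
> x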
49. Time Series
Simple things like changing the series frequency are handled natively
too:
> ts(AirPassengers, frequency=4, start=c(1949,1),
end=c(1960,4))
Qtr1 Qtr2 Qtr3 Qtr4
1949 112 118 132 129
1950 121 135 148 148
1951 136 119 104 118
1952 115 126 141 135
1953 125 149 170 170
1954 158 133 114 140
1955 145 150 178 163
1956 172 178 199 199
1957 184 162 146 166
1958 171 180 193 181
1959 183 218 230 242
1960 209 191 172 194
50. Time Series
As are plotting and decomposition:
> plot(AirPassengers)
> plot(decompose(AirPassengers))
51. Time Series Decomposition
In a simple seasonal time series, the signal can be decomposed into
three components that can then be analysed separately:
◮ the Trend component, that shows the progression of the series.
◮ the Seasonal component, that shows the periodic variation.
◮ the Irregular component, that shows the rest of the variations.
In an additive decomposition, our signal is
Trend + Seasonal + Irregular.
In a multiplicative decomposition, our signal is
Trend ∗ Seasonal ∗ Irregular.
Multiplicative decomposition makes sense when absolute differences in
values are of less interest than percentage changes.
A multiplicative signal can also be decomposed in an additive fashion
by working on log(data).
52. Additive/Multiplicative Decomposition
Our example shows typical multiplicative behaviour.
> plot(decompose(AirPassengers))
> plot(decompose(AirPassengers, type="multiplicative"))
53. Log of a multiplicative series
Using log() to decompose our series in additive fashion:
> plot(log(AirPassengers))
> plot(decompose(log(AirPassengers)))
54. The ARIMA model
ARIMA stands for AutoRegressive Integrated Moving Average.
ARIMA is one of the most general classes of models for time-series
forecasting.
An ARIMA model is characterized by three non-negative integer
parameters commonly called (p, d, q):
◮ p is the autoregressive order (AR).
◮ d is the integrated order (I).
◮ q is the moving average order (MA).
An ARIMA model with zero for some of those values is in fact a simpler
model, be it AR, MA or ARMA...
As with linear regression, an information criterion can be used to
evaluate which values of (p, d, q) provide a better fit (see the
sketch after this slide).
A Seasonal ARIMA model (p, d, q) × (P, D, Q) has three additional
parameters modelling the seasonal behaviour of the series in the same
fashion.
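As a sketch of such a comparison with base R's arima(), fitting two candidate seasonal models and comparing their AIC (the orders below are arbitrary examples, not recommendations):
> m1 <- arima(AirPassengers, order = c(1,1,1),
            seasonal = list(order = c(0,1,1), period = 12))
> m2 <- arima(AirPassengers, order = c(2,1,0),
            seasonal = list(order = c(0,1,1), period = 12))
> AIC(m1, m2)  # lower AIC indicates the better-fitting model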
55. Automated ARIMA fitting and forecasting
The auto.arima() function will explore a range of values for
(p, d, q) × (P, D, Q) and return the best fitting model, which can
then be used for forecasting:
> library(forecast)
> fit <- auto.arima(AirPassengers)
> plot(forecast(fit, h=20))
57. Ensembles of models
Models built with a specific set of parameters are limited in the data
relationships they can express.
The choice of model or initial parameters will create specific,
recurring misclassifications.
Solution: build several competing models and average their
classifications.
Some techniques are built around this idea, like random forests (see
the randomForest package in R, sketched below).
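A minimal sketch with the randomForest package (install it from CRAN first; the seed is illustrative):
> library(randomForest)
> set.seed(1)
> rf <- randomForest(Species ~ ., data = iris, ntree = 500)
> rf$confusion  # confusion matrix from out-of-bag predictions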
58. Rattle
Rattle is a data mining framework for R. Installable as a CRAN
package, it features:
◮ Graphical user interface to common mining modules
◮ Full mining framework: data preprocessing, analysis, mining, validating
◮ Automatic generation of R code
In addition to fast hands-on data mining, the rattle log is a great R
learning resource.
Introduction paper at:
http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf
> install.packages("RGtk2")
> install.packages("rattle")
> library(rattle)
Rattle: Graphical interface for data mining using R.
Version 2.5.40 Copyright (c) 2006-2010 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
> rattle()