A companion slide deck for this chapter:
Stanton, J. M. (2013). Data Mining: A Practical Introduction for Organizational Researchers. In Cortina, J. M., & Landis, R. S., Modern Research Methods for the Study of Behavior in Organizations. New York: Routledge Academic.
Abstract: Knowledge has played a significant role in human activities throughout human development. Data mining is the process of
knowledge discovery in which knowledge is gained by analyzing the data stored in very large repositories. The data are analyzed
from various perspectives and the results are summarized into useful information. Because extracting
knowledge and information from large data repositories is so important, data mining has become a very important branch of
engineering that affects human life in various spheres, directly or indirectly. The purpose of this paper is to survey many of the
future trends in the field of data mining, with a focus on those thought to have the most promise and applicability
to future data mining applications.
Keywords: Current and Future of Data Mining, Data Mining, Data Mining Trends, Data mining Applications.
This lecture gives several definitions of data mining and explains why data mining is needed. It also gives examples of classification, clustering, and association rules.
Introduction to Data Mining(Chapter 1)......Data Mining concepts and techniques by R. Deepa (IT) ..Batch(2016-2019) published on Oct-13 2018 from NS college of Arts and Science,Theni
Data Mining, KDD Process, Data mining functionalities, Characterization, Discrimination, Association, Classification, Prediction, Clustering, Outlier analysis, Data Cleaning as a Process
2. • RapidMiner Studio: 6.5 or greater
– [Plain text URL:
https://rapidminer.com/products/studio/]
• Please follow the link above to download the
free software.
Software
3. • Define the discipline of Data Mining
• List and define various types of data
• List and define various sources of data
• Explain the fundamental differences
between databases, data warehouses,
and data sets
Learning Objectives (1 of 2)
4. • Explain some of the ethical dilemmas
associated with data mining and outline
possible solutions
• Explain the CRISP-DM Method
Learning Objectives (2 of 2)
6. • 15 out of 17 sectors in the United States have more data
stored per company than the US Library of Congress
• $5 million vs. $400: Price of the fastest supercomputer in
1975 and an iPhone with equal performance
• $600 to buy a disk drive that can store all of the world’s music
• 5 billion mobile phones in use in 2010
• 30 billion pieces of content shared on Facebook every month
• 40% projected growth in global data generated per year vs.
5% growth in IT spending
• 235 terabytes of data collected by the US Library of Congress
by April 2011
Why Data Mining?
7. Why Mine Data?
Lots of data is being collected and stored
Web data, e-commerce, point of sale
Credit card transactions, social media
Computers have become cheaper and more powerful
Competitive Pressure is Strong
Provide better, customized services for an edge
e.g. in Customer Relationship Management
Information is valuable and can be monetized
8. • Demand for deep analytical talent in the
United States could be 50-60% greater
than its projected supply by 2018
Demand for Data Mining
9. • Data contains value and knowledge, but to
extract the knowledge, data needs to be
– Stored
– Managed
– Analyzed (the focus of this class)
• Data Mining ≈ Big Data ≈
Predictive Analytics ≈ Data Science
Why Data Mining?
10. • “An interdisciplinary subfield of computer
science. It is the computational process of
discovering patterns in large data sets
involving methods at the intersection of
artificial intelligence, machine learning,
statistics, and database systems. The overall
goal of data mining is to extract information
from a data set and transform it into an
understandable structure for further use.”
– (Wikipedia) [Plain text URL:
https://en.wikipedia.org/wiki/Data_mining]
What is Data Mining? (1 of 4)
11. • Data Enormity Issue
• Discover patterns and models that are:
• Valid: patterns hold on new data with some certainty
• Useful: should be able to act on the insight
• Unexpected: non-obvious to the system
• Understandable: humans can interpret the patterns
What is Data Mining? (2 of 4)
12. • Descriptive methods
– Find patterns that describe the data
• Example: Clustering with k-means
• Predictive methods
– Use some variables to predict unknown or
future values of a target variable
• Example: Scoring with neural networks
What is Data Mining? (3 of 4)
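As a rough illustration of the descriptive side, here is a minimal k-means sketch in plain Python. The six data points, the choice of two starting centroids, and the function name `kmeans` are assumptions made for this example. The course itself uses RapidMiner, so this is not how the course materials implement clustering.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Minimal k-means: assign points to the nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D observations forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

No target variable is involved: the algorithm only describes structure already present in the data, which is what separates it from the predictive methods on the same slide.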
13. What is Data Mining? (4 of 4)
Data Mining
• Prevalence of names in
US locations
• O’Brien, O’Rourke,
O’Reilly in Boston Area
• Group together similar
documents
• Returned by search
engine according to
their context
Not Data Mining
• Look up phone number
in a phone directory
• Query a web search
engine for information
about “Amazon”
16. • Scalability
• Dimensionality
• Complex and Heterogeneous Data
• Data Quality
• Data Ownership and Distribution
• Privacy Preservation
• Streaming Data
Challenges of Data Mining
17. • Garbage In, Garbage Out (GIGO)
– Collected incorrectly
– Out-of-date
• Day-to-day:
– Use available resources
– Acceptable risk
– Professional experience
– Common sense
Limits to Data Mining
18. • A risk with “Data mining” is that an analyst can
“discover” patterns that are meaningless.
• Statisticians call it Bonferroni’s principle:
– “If you look in more places for interesting patterns
than your amount of data will support, you are bound
to find crap”
Meaningfulness of Analytic Answers (1 of 2)
19. Meaningfulness of Analytic Answers (2 of 2)
National Security Agency example:
“We consider a pair of (unrelated) people suspicious when they stayed in the same
hotel on the same day at least twice”
◦ Suppose 1 billion people tracked during 1,000 days
◦ Each person stays in a hotel 1% of the time (1 day out of 100)
◦ Each hotel holds 100 people (so need 100,000 hotels)
“If everyone behaves randomly (i.e. no terrorist), can we still detect something
suspicious?”
• Probability that a specific pair of people visit same hotel on same day is 10^-9
• Probability this happens twice is 10^-18 (really, really, really small)
Expected number of “suspicious” pairs is actually about 250,000!
Example taken from Rajaraman et al., Mining of Massive Datasets
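The arithmetic behind the 250,000 figure can be checked with a short calculation. This sketch uses only the numbers stated on the slide. The variable names are chosen for this example, and the model assumes, as the slide does, that everyone behaves independently and at random.

```python
from math import comb

people = 10**9        # people tracked
days = 1000           # days of observation
hotels = 10**5        # hotels, each holding 100 people
p_in_hotel = 0.01     # chance a given person is in some hotel on a given day

# Two specific people are both in a hotel on a given day AND chose the same hotel.
p_same_hotel_one_day = p_in_hotel * p_in_hotel / hotels   # about 1e-9
# The same coincidence on two specific (different) days.
p_same_hotel_two_days = p_same_hotel_one_day ** 2          # about 1e-18

pairs_of_people = comb(people, 2)   # about 5e17 candidate pairs
pairs_of_days = comb(days, 2)       # about 5e5 candidate pairs of days

expected_suspicious = pairs_of_people * pairs_of_days * p_same_hotel_two_days
print(f"Expected 'suspicious' pairs: {expected_suspicious:,.0f}")
```

The per-pair probability is tiny, but it is multiplied by roughly 2.5 x 10^23 opportunities, so purely random behavior still yields about a quarter of a million "suspicious" pairs.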
20. • To mine different types of data:
– Data is highly dimensional
– Data is a graph
– Data is infinite / never-ending
– Data is labeled
What will we learn? (1 of 4)
21. • To solve real-world problems:
– Market basket analysis
– Customer segmentation
– Forecasting new product demand
– Evaluating athletic talent
– Probabilities of a health risk
– Text sentiment analysis
What will we learn? (2 of 4)
22. • Use of various “tools”:
– Association Rules
– Clustering with K-means
– Logistic and Linear regression
– Decision Trees
– Neural Networks
– Text Mining
What will we learn? (3 of 4)
23. • Regression
• Decision Trees
• Cluster Analysis
• Text Mining
• Ensemble Models
• Neural Nets
• Association Rules
What will we learn? (4 of 4)
25. • Consider the real people behind the data
• Ethical and moral obligations
• Protect against crimes including identity
theft
• Objectives should never justify unethical
means
Privacy & Security
26. • Things to consider in data mining efforts:
– Protection of privacy
– Respect for individual rights
– Willingness to embrace transparency of
actions and methods
– Ask for permission to gather and use data
– Ensure you are doing fair and just work that
will help and benefit others
Privacy & Security (1 of 2)
27. • We can protect privacy by:
– Aggregating data
– Anonymizing observations through removal of
names and personally identifiable information
(PII)
– Storing data in secure and protected
environments
Privacy & Security (2 of 2)
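The first two techniques can be sketched in a few lines of Python. The records, the field names, and the set `PII_FIELDS` are all hypothetical, invented for this example. Deciding which fields count as PII is a judgment that depends on the data set. Note that removing direct identifiers does not by itself guarantee anonymity, since remaining fields can sometimes be combined to re-identify a person.

```python
from collections import defaultdict

# Hypothetical customer records; every value is made up for illustration.
records = [
    {"name": "Ana Diaz", "ssn": "000-00-0001", "zip": "27603", "purchase": 40.0},
    {"name": "Bo Chen", "ssn": "000-00-0002", "zip": "27603", "purchase": 60.0},
    {"name": "Cy Reed", "ssn": "000-00-0003", "zip": "27511", "purchase": 25.0},
]

PII_FIELDS = {"name", "ssn"}  # assumed list of direct identifiers

def anonymize(record):
    """Return a copy of the record without the fields listed in PII_FIELDS."""
    return {k: v for k, v in record.items() if k not in PII_FIELDS}

def aggregate(rows, group_key, value_key):
    """Summarize rows per group so that no individual row is reported."""
    summary = defaultdict(lambda: {"count": 0, "total": 0.0})
    for row in rows:
        group = summary[row[group_key]]
        group["count"] += 1
        group["total"] += row[value_key]
    return dict(summary)

anonymized = [anonymize(r) for r in records]
summary = aggregate(anonymized, "zip", "purchase")
print(anonymized[0])
print(summary)
```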
29. • Organized grouping of information within a specific structure
• Table: a database container made up of rows (records) and columns (fields)
• Relational databases more common today
– Relate tables to one another in a logical fashion
– Tables are broken apart to reduce redundancy through normalization
Database
30. • Handles high volume of reads and writes
• Not efficient for analysis due to lengthy
retrieval of data
– Must use a query containing joins
– Intensive and time consuming
Online Transactional Processing (OLTP)
31. • Denormalized to intentionally combine
multiple tables into a single table
– Results in duplicate data in some columns
– Reduces number of joins necessary to query
related data
– Online Analytical Processing (OLAP)
Data Warehouse (1 of 2)
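The trade-off between normalized and denormalized tables can be sketched with Python's built-in `sqlite3` module. The table names (`customers`, `orders`, `orders_wide`) and the sample rows are assumptions made for this example. A real warehouse would be loaded by a separate process rather than created in the same database as the transactional tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized (OLTP-style): customer details are stored once.
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Raleigh'), (2, 'Durham');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);

-- Denormalized (warehouse-style): one wide table, city repeated on every order row.
CREATE TABLE orders_wide AS
SELECT o.order_id, o.amount, c.customer_id, c.city
FROM orders AS o JOIN customers AS c ON c.customer_id = o.customer_id;
""")

# Against the normalized tables, the analysis needs a join.
joined = conn.execute("""
    SELECT c.city, SUM(o.amount)
    FROM orders AS o JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.city ORDER BY c.city
""").fetchall()

# Against the denormalized table, the same question needs no join.
wide = conn.execute(
    "SELECT city, SUM(amount) FROM orders_wide GROUP BY city ORDER BY city"
).fetchall()

print(joined)
print(wide)
conn.close()
```

Both queries return the same totals. The wide table pays for the simpler query with duplicated `city` values, which is the redundancy the slide describes.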
32. Data Warehouse (2 of 2)
• Contain archived data copied from transactional database
o Can become out-of-sync if source data is updated
• Can contain data moved from transactional system
o Data may be unavailable for updates or viewing
33. • Organizational data store created to
meet the needs of a specific
business unit
• One-stop shop
• Must be known, current, accurate, and
well-managed (privacy and security)
Data Mart
34. • Subset of a database or data warehouse
• Usually denormalized
• Typically related to a specific:
– Business question
– Business problem
– Business unit
Data Set
35. • Database
– Rows = Records
– Columns = Fields
• RapidMiner
– Rows = Examples
• Data Warehouses and Data Sets
– Rows = Observations, Examples, or Cases
– Columns = Variables or Attributes
Rows and Columns
36. The Data Mining Process
• For this course, we will channel every
homework assignment through the
CRISP-DM process.
38. – Define the questions you want to answer.
– Who will you work with to understand the
issue?
– Design what you are going to build.
– Get buy-in of the problem to be solved
1. Business Understanding
39. – What is the source of the data?
– How was it collected?
– How accurate or reliable is it?
– What are the correct variables to collect?
2. Data Understanding
40. – Join necessary data sets
– Reduce data sets to only include pertinent
variables
– Scrub data to handle anomalies such as outliers or
missing data
– Reformat for consistency
3. Data Preparation
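The four preparation tasks above can be sketched in plain Python. The customer and order records, the field names, and the `max_age` cutoff are hypothetical. Dropping rows with missing or implausible values is only one possible scrubbing policy, and imputing values is another.

```python
# Hypothetical source data; all values are made up for illustration.
customers = {
    101: {"state": "nc", "age": 34},
    102: {"state": "NC ", "age": None},   # missing age
    103: {"state": "Va", "age": 290},     # implausible age (outlier)
    105: {"state": " va", "age": 51},
}
orders = [
    {"customer_id": 101, "amount": "25.50", "coupon": "X1"},
    {"customer_id": 102, "amount": "10.00", "coupon": None},
    {"customer_id": 103, "amount": "7.25", "coupon": None},
    {"customer_id": 104, "amount": "3.00", "coupon": "X2"},  # no matching customer
    {"customer_id": 105, "amount": "12.00", "coupon": None},
]

def prepare(orders, customers, max_age=120):
    """Join, reduce, scrub, and reformat the raw records into one clean data set."""
    prepared = []
    for order in orders:
        customer = customers.get(order["customer_id"])
        if customer is None:
            continue  # join: keep only orders with a matching customer
        age = customer["age"]
        if age is None or not (0 <= age <= max_age):
            continue  # scrub: drop rows with missing or out-of-range ages
        prepared.append({
            # reduce: keep only the pertinent variables (coupon is dropped)
            "customer_id": order["customer_id"],
            "state": customer["state"].strip().upper(),  # reformat for consistency
            "age": age,
            "amount": float(order["amount"]),            # reformat text to a number
        })
    return prepared

rows = prepare(orders, customers)
for row in rows:
    print(row)
```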
41. – Two types:
• Classification (Descriptive)
• Prediction
– Can be overlapping (Decision Trees)
– Note: We will spend most of our time in this
step
4. Modeling
42. – Is the insight useful?
• Should another technique be used?
– What can be done with the results?
– Testing for false positives
– Human experience and operational
knowledge
5. Evaluation
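Testing for false positives usually starts by comparing a model's predictions with known outcomes. The sketch below counts the four possible results for a binary model. The `actual` and `predicted` lists are invented for this example, and the function name `confusion_counts` is not from the course materials.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p and a:
            tp += 1      # predicted positive, really positive
        elif p and not a:
            fp += 1      # predicted positive, really negative (false positive)
        elif not p and a:
            fn += 1      # predicted negative, really positive (false negative)
        else:
            tn += 1      # predicted negative, really negative
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}

# Hypothetical known outcomes and model predictions.
actual = [1, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 1, 0, 0]

counts = confusion_counts(actual, predicted)
false_positive_rate = counts["fp"] / (counts["fp"] + counts["tn"])
precision = counts["tp"] / (counts["tp"] + counts["fp"])

print(counts)
print(f"False positive rate: {false_positive_rate:.2f}")
print(f"Precision: {precision:.2f}")
```

Whether a given false positive rate is acceptable is not a statistical question alone. It depends on the cost of acting on a wrong prediction, which is where human experience and operational knowledge come in.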
43. – Automation of model
– Communication with end-users
– Integration with existing systems
– Continuous monitoring and gaining feedback
for improvement (fine-tuning)
6. Deployment
44. • Clearly communicate model’s:
– Function
– Utility to stakeholders
• Thoroughly test and prove the model
• Plan for and monitor implementation
Keys to Successful Deployment
46. • Data mining is the use of statistical and logical methods of analysis
to describe large data sets and create predictive models that
uncover insights
• Databases, data warehouses, and data sets are unique kinds
of digital record keeping systems with some similarities
• Data mining is most effective on data sets extracted from
OLAP rather than OLTP
• Data is often high-dimensional and carries inherent risks, such as
poor quality
• Remember the human factor behind the manipulation of numbers
and figures, and the ethical responsibilities that come with it
• CRISP-DM is the most widely used standard process for data mining
Summary
47. “This workforce solution was funded by a grant awarded by the U.S. Department of
Labor’s Employment and Training Administration. The solution was created by the
grantee and does not necessarily reflect the official position of the U.S. Department of
Labor. The Department of Labor makes no guarantees, warranties, or assurances of any
kind, express or implied, with respect to such information, including any information on
linked sites and including, but not limited to, accuracy of the information or its
completeness, timeliness, usefulness, adequacy, continued availability, or ownership.”
Except where otherwise stated, this work by Wake Technical Community College Building
Capacity in Business Analytics, a Department of Labor, TAACCCT funded project, is
licensed under the Creative Commons Attribution 4.0 International License. To view a
copy of this license, visit http://creativecommons.org/licenses/by/4.0/
Copyright Information