This presentation covers what data mining is and which techniques and algorithms are available for it. It will help you understand the core concepts of data mining.
What is Datamining? Which algorithms can be used for Datamining?
1. DATAMINING
Seval Ünver
E1900810 | CENG 553
Middle East Technical University
Computer Engineering Department
14.05.2013
2. Outline
• Introduction
• Data vs. Information
• Who uses datamining?
• Common uses of datamining
• Datamining is…
• Supervised and Unsupervised Learning
• Predictive Models
• Datamining Process
• Some Popular Datamining Algorithms
• Data Warehouse
• Conceptual Modelling of Data Warehouse
• Example of Star Schema, Snowflake Schema, Fact Constellation
• Evolution of OLTP, OLAP and Data Warehouse
08.10.2013 Seval Ünver | CENG 553 2
3. Introduction
• Nowadays, large data sets have become available due to advances in technology.
• As a result, there is increasing interest in various scientific communities in exploring the use of emerging data mining techniques for the analysis of these large data sets *.
• Data mining is the semi-automatic discovery of patterns, associations, changes, anomalies, and statistically significant structures and events in data **.
* Grossman et al., 2001
** Shmueli, G., 2012
4. What is Datamining?
• Process of semi-automatically analyzing large databases to find patterns that are *:
– valid: hold on new data with some certainty
– novel: non-obvious to the system
– useful: should be possible to act on the item
– understandable: humans should be able to interpret the pattern
• Also known as Knowledge Discovery in Databases
* Prof. S. Sudarshan, CSE Dept., IIT Bombay
5. Big data: Cash Register
• Past: It was a calculator.
• Now: It saves every detail of every action.
– The movements of each product.
– The movements of each user.
6. Data vs. Information
• Data is useless by itself.
• Data is not just numbers or letters. It consists of numbers, letters and their meaning. The meaning is called metadata.
• Information is interpreted data.
• Converting data to information is called data processing.
7. Who uses Datamining?
• CapitalOne Bank
– future prediction
• Netflix (the largest DVD-by-mail rental company)
– recommendation (you might also be interested in…)
• Amazon.com
– recommendation
• British law enforcement
– crime trends or security threats
• Facebook
– prediction of how active a user will be after 3 months
• Children's Hospital in Boston
– detecting domestic abuse
• Pandora (an Internet music radio)
– chooses the next song to play
8. Common uses of Datamining:
• Direct mail marketing
• Web site personalization
• Credit card fraud detection
• Gas & jewelry
• Bioinformatics
• Text analysis
– SAS lie detector
• Market basket analysis
– Beer & baby diapers
9. Application Areas
08.10.2013 Seval Ünver | CENG 553 9
Industry                 Application
Finance                  Credit card analysis
Insurance                Claims and fraud analysis
Telecommunications       Call record analysis
Transport                Logistics management
Consumer goods           Promotion analysis
Data service providers   Value-added data
Utilities                Power usage analysis
11. Datamining is not…
• Data warehousing
• SQL / Ad Hoc Queries / Reporting
• Software Agents
• Online Analytical Processing (OLAP)
• Data Visualization
12. Supervised vs. Unsupervised Learning
• Supervised:
– Problem solving
– Driven by real business problems and historical data
– Quality of results depends on quality of data
• Unsupervised:
– Exploration (e.g. clustering)
– Relevance often an issue (beer and baby diapers)
– Useful when trying to get an initial understanding of the data
– Non-obvious patterns can sometimes pop out of a completed data analysis project
26. Pros and Cons of Neural Networks
• Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle a large number of features
• Cons
- Slow training time
- Hard to interpret
- Hard to implement: trial and error for choosing the number of nodes
27. Supervised Algorithm Summary
• Decision Trees
– Understandable
– Relatively fast
– Easy to translate into SQL queries
• kNN
– Quick and easy
– Models tend to be very large
• Neural Networks
– Difficult to interpret
– Can require significant amounts of time to train
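The summary above notes that kNN is "quick and easy" but that its models tend to be very large: there is no training phase, so the "model" is the entire training set. A minimal sketch in plain Python may make this concrete; the function name and toy data are illustrative, not taken from the lecture:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    `train` is a list of (features, label) pairs; no training step is needed,
    which is why the whole data set must be kept around as the model."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    # Sort the training set by squared Euclidean distance to the query
    nearest = sorted(train, key=lambda t: dist(t[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_predict(train, (0.5, 0.5)))  # prints "a"
```

Sorting the full training set on every query is also why application is slow for large data sets, as the notes below point out.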
28. K-Means Clustering
• The user starts by specifying the number of clusters (K)
• K data points are randomly selected as initial centroids
• Repeat until no change:
– Each point is assigned to its nearest centroid (the hyperplanes separating the K centroids partition the space)
– The centroid of each cluster is recomputed
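The loop above can be sketched in a few lines of Python. This is an illustrative implementation, not code from the lecture; it uses squared Euclidean distance and stops when the centroids no longer move:

```python
import random

def kmeans(points, k, iters=100):
    # Pick k of the data points at random as the initial centroids
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # Update step: recompute each centroid as the mean of its cluster
        new = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:   # no change -> converged
            break
        centroids = new
    return centroids, clusters
```

Run on a handful of points in two well-separated groups, the loop typically converges in two or three iterations and recovers the two groups.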
29. Data Warehouse
A data warehouse is a database used for reporting and data analysis.
30. Data Mining works with Warehouse Data
• Data Mining provides the enterprise with intelligence.
• Data Warehousing provides the enterprise with a memory.
31. Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions & measures
– Star schema: A fact table in the middle connected to a set of
dimension tables
– Snowflake schema: A refinement of the star schema in which some dimension hierarchies are normalized into sets of smaller dimension tables, forming a shape similar to a snowflake
– Fact constellations: Multiple fact tables share dimension
tables, viewed as a collection of stars, therefore called
galaxy schema or fact constellation
32. Example of Star Schema
Sales Fact Table (center): time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Dimension tables:
– time: time_key, day, day_of_the_week, month, quarter, year
– item: item_key, item_name, brand, type, supplier_type
– branch: branch_key, branch_name, branch_type
– location: location_key, street, city, state_or_province, country
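The practical point of a star schema is that analytical queries become simple joins from the fact table out to each dimension. The sketch below builds a cut-down version of the schema (only the time and item dimensions, with hypothetical rows) in Python's built-in sqlite3 module:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Dimension tables surround the central fact table in a star schema
cur.execute("CREATE TABLE time (time_key INTEGER PRIMARY KEY, "
            "day TEXT, month TEXT, quarter TEXT, year INTEGER)")
cur.execute("CREATE TABLE item (item_key INTEGER PRIMARY KEY, "
            "item_name TEXT, brand TEXT, type TEXT)")
cur.execute("CREATE TABLE sales (time_key INTEGER, item_key INTEGER, "
            "units_sold INTEGER, dollars_sold REAL)")
cur.execute("INSERT INTO time VALUES (1, '2013-10-08', 'October', 'Q4', 2013)")
cur.execute("INSERT INTO item VALUES (1, 'widget', 'Acme', 'hardware')")
cur.executemany("INSERT INTO sales VALUES (?,?,?,?)",
                [(1, 1, 10, 99.90), (1, 1, 5, 49.95)])
# A typical star-schema query: join the fact table to its dimensions
row = cur.execute("""
    SELECT t.year, i.item_name, SUM(s.units_sold)
    FROM sales s
    JOIN time t ON s.time_key = t.time_key
    JOIN item i ON s.item_key = i.item_key
    GROUP BY t.year, i.item_name""").fetchone()
print(row)  # (2013, 'widget', 15)
```

In the snowflake variant on the next slide, the same query would need extra joins (e.g. location to city) because the dimension hierarchies are normalized.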
33. Example of Snowflake Schema
Sales Fact Table (center): time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Dimension tables (partly normalized):
– time: time_key, day, day_of_the_week, month, quarter, year
– item: item_key, item_name, brand, type, supplier_key
– supplier: supplier_key, supplier_type
– branch: branch_key, branch_name, branch_type
– location: location_key, street, city_key
– city: city_key, city, state_or_province, country
34. Example of Fact Constellation
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped
Shared dimension tables:
– time: time_key, day, day_of_the_week, month, quarter, year
– item: item_key, item_name, brand, type, supplier_type
– branch: branch_key, branch_name, branch_type
– location: location_key, street, city, province_or_state, country
– shipper: shipper_key, shipper_name, location_key, shipper_type
35. Evolution of OLTP, OLAP and Data Warehouse
36. Evolutionary Step | Business Question | Enabling Technology
Data Collection (1960s) | "What was my total revenue in the last five years?" | computers, tapes, disks
Data Access (1980s) | "What were unit sales in New England last March?" | faster and cheaper computers with more storage; relational databases
Data Warehousing and Decision Support | "What were unit sales in New England last March? Drill down to Boston." | faster and cheaper computers with more storage; On-line Analytical Processing (OLAP); multidimensional databases; data warehouses
Data Mining | "What's likely to happen to Boston unit sales next month? Why?" | faster and cheaper computers with more storage; advanced computer algorithms
37. As a Result
• In order to apply data mining, a large amount of quality data is required.
• The aim of data mining is to acquire rules and equations that can be used to predict the future.
• Success in such work depends on database experts and data mining specialists working together.
• The work may take a long time; you need time and patience.
38. Thank You
If you have questions, you can contact me via email: e1900810@ceng.metu.edu.tr
Seval Ünver | METU CENG
Editor's Notes
• The US Government uses Data Mining to track fraud. A supermarket becomes an information broker. Basketball teams use it to track game strategy. Cross selling; target marketing; holding on to good customers; weeding out bad customers.
• Regression (linear or any other polynomial): a*x1 + b*x2 + c = Ci. Nearest neighbour. Decision tree classifier: divides the decision space into piecewise constant regions. Probabilistic/generative models. Neural networks: partition by non-linear boundaries.
• Decision tree: a tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels. A widely used learning method. Easy to interpret: can be re-represented as if-then-else rules. Approximates the function by piecewise constant regions. Does not require any prior knowledge of the data distribution and works well on noisy data. Has been applied to classifying medical patients by disease, equipment malfunctions by cause, and loan applicants by likelihood of payment.
• Decision trees — Pros: reasonable training time; fast application; easy to interpret; easy to implement; can handle a large number of features. Cons: cannot handle complicated relationships between features; simple decision boundaries; problems with lots of missing data.
• kNN — Pros: fast training. Cons: slow during application; no feature selection; notion of proximity vague.
• Neural networks: a set of nodes connected by directed weighted edges. Useful for learning complex data like handwriting, speech and image recognition.
• Neural networks — Pros: can learn more complicated class boundaries; fast application; can handle a large number of features. Cons: slow training time; hard to interpret; hard to implement: trial and error for choosing the number of nodes.
• Data warehouse mining: assimilate data from operational sources; mine static data. Mining log data. Continuous mining: example in process control. Stages in mining: data selection; pre-processing (cleaning); transformation; mining; result evaluation; visualization.