Neural Models for Information Retrieval (Bhaskar Mitra)
In the last few years, neural representation learning approaches have achieved very good performance on many natural language processing (NLP) tasks, such as language modelling and machine translation. This suggests that neural models will also yield significant performance improvements on information retrieval (IR) tasks, such as relevance ranking, addressing the query-document vocabulary mismatch problem by using semantic rather than lexical matching. IR tasks, however, are fundamentally different from NLP tasks leading to new challenges and opportunities for existing neural representation learning approaches for text.
We begin this talk with a discussion on text embedding spaces for modelling different types of relationships between items which makes them suitable for different IR tasks. Next, we present how topic-specific representations can be more effective than learning global embeddings. Finally, we conclude with an emphasis on dealing with rare terms and concepts for IR, and how embedding based approaches can be augmented with neural models for lexical matching for better retrieval performance. While our discussions are grounded in IR tasks, the findings and the insights covered during this talk should be generally applicable to other NLP and machine learning tasks.
Content:
Introduction
What is Big Data?
Big Data facts
Three Characteristics of Big Data
Storing Big Data
THE STRUCTURE OF BIG DATA
WHY BIG DATA
HOW IS BIG DATA DIFFERENT?
BIG DATA SOURCES
BIG DATA ANALYTICS
TYPES OF TOOLS USED IN BIG-DATA
Application Of Big Data analytics
HOW BIG DATA IMPACTS ON IT
RISKS OF BIG DATA
BENEFITS OF BIG DATA
Future of big data
In this presentation, let's have a look at what Data Science is and its applications. We discuss the most common use cases of Data Science.
I presented this at the LSPE-IN meetup held on 10th March 2018 at Walmart Global Technology Services.
A Seminar Presentation on Big Data for Students.
Big data refers to a process that is used when traditional data mining and handling techniques cannot uncover the insights and meaning of the underlying data. Data that is unstructured or time sensitive or simply very large cannot be processed by relational database engines. This type of data requires a different processing approach called big data, which uses massive parallelism on readily-available hardware.
Data mining Course
Chapter 1
Definition of Data Mining
Data Mining as an Interdisciplinary field
The process of Data Mining
Data Mining Tasks
Challenges of Data Mining
Data mining application examples
Introduction to RapidMiner
Data Science is a form of science that focuses on dealing with huge chunks of data by using modern data analysis tools and techniques to discover hidden patterns, meaningful insights, and make critical business decisions.
A Data Science professional has to utilize complicated machine learning algorithms to develop predictive models. There could be multiple sources present in different formats used in data analysis.
text mining, data mining, machine learning, unstructured data, big data, database, data warehouse, text mining (industry), research (industry), text analysis, text, text analytics, unstructured, data science, structured data, advanced analytics, what is data mining, data mining lecture, data mining techniques, information, learning from data, computer technology, technology, data process, data mining tutorial
My keynote talk at San Diego Superdata conference, looking at history and current state of Analytics and Data Mining, and examining the effects of Big Data
Machine Learning and Data Mining: 19 Mining Text And Web Data (Pier Luca Lanzi)
Course "Machine Learning and Data Mining" for the degree of Computer Engineering at the Politecnico di Milano. In this lecture we overview text and web mining. The slides are mainly taken from Jiawei Han textbook.
There are as many views and definitions of Data Mining as there are people working in and on the topic. Confusion reigns and people ask: what is it, why do we need it, and isn’t it just Data Mining rebranded? In this slide deck and presentation we set the scene and highlight the differences and need for Data Mining in order to give a framework for case studies and future projects.
So - why do we need it?
The economic, industrial, commercial, social, political and sustainability problems we face cannot be successfully addressed using the management techniques and models largely inherited from the Industrial Revolution. The world no longer appears infinite in resources, slow paced, linear and stable. We now see the limitations; feel the impact of rapid change; and we can conceptualize the non-linear and unstable nature of it all! We are also starting to comprehend the scale and the need for machine assistance.
Modeling our situation!
Sophisticated computer models for weather systems are now complemented by ecological, economic, conflict and resource modeling of varying depth and accuracy. However, the key is always the accuracy and coverage of the primary data. We started with modest databases and data mining, but they mostly proved inadequate, and we are now amassing vast databases on every aspect of life - people, planet and machines. This ‘BIG DATA’ explosion demands a rethink of how, what, and where we gather data; the way we analyze and model; and the way we make decisions.
So - what is the big difference?
Data Mining was limited, planar, simple, linear and constrained to a few relationships amongst people: what they did, where they went, who they knew and so on. In contrast, Big Data is unbounded and spans all peoples and machines in all domains and activities, with application to every aspect of life, business, industry, government and sustainability. It also takes into account the non-linear nature of relationships and events.
“Big Data is an almost unconscious outcome of the desire and need to sustain all peoples on a rapidly smaller looking planet”
meaning of data warehousing
needs of data warehousing
applications of data warehousing
architecture of data warehousing
advantages of data warehousing
disadvantages of data warehousing.
meaning of data mining
needs of data mining
applications of data mining
architecture of data mining
advantages of data mining
disadvantages of data mining
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
ODSC East 2017: Data Science Models For Good (Karry Lu)
Abstract: The rise of data science has been largely fueled by the promise of changing the business landscape - enhancing one's competitive advantage, increasing business optimization and efficiency, and ultimately delivering a better bottom-line. This promise reaches across sectors as machine learning methods are getting better, data access continues to grow, and computation power is easily accessible. However, because the practice of doing data science can be expensive, there is a danger that this so-called promise of data science may only be available to the most well-resourced organizations with sophisticated data capabilities and staff. For the past five years, DataKind has been working to ensure social change organizations too have access to data science, teaming them up with data scientists to build machine learning and artificial intelligence solutions that aim to reduce human suffering. In doing so, DataKind has learned what it takes to apply data science in the social sector and the many applications it has for creating positive change in the world. This session presents DataKind projects showcasing the wide range of applications for ML/AI for social good. From using satellite imagery and remote sensing techniques to detect wheat farm boundaries to protect livelihoods in Ethiopia, to leveraging NLP to automate the time consuming process of synthesizing findings from academic studies to inform conservation efforts and to classifying text records to better understand human rights conditions across the world to using machine learning to reduce traffic fatalities in U.S. cities, learn about some of the latest breakthroughs and findings in the data science for social good space and learn how you can get involved
Linked Data and Semantic Technologies can support a next generation of science. This talk shows examples of discovery, access, integration, analysis, and shows directions towards prediction and vision.
Introduction To Data Mining: Introduction - The evolution of database
system technology - Steps in knowledge discovery from database process
- Architecture of a data mining systems - Data mining on different kinds
of data - Different kinds of pattern - Technologies used - Applications -
Major issues in data mining - Classification of data mining systems - Data
mining task primitives - Integration of a data mining system with a
database or data warehouse system.
First, Firster, Firstest: Three lessons from history on information overload (mark madsen)
Keynote from the 2011 Strata New York conference.
The first person to conceive of something is usually not the first. They're the first to re-conceive at a point where the current technology caught up to someone else's idea. We're at a point today where many old ideas are being reinvented. Hear why looking to the past, beyond your core field of interest, is worthwhile.
Video can be found at http://www.youtube.com/watch?v=Qv0yF47L8WE
Big Data [sorry] & Data Science: What Does a Data Scientist Do? (Data Science London)
What 'kind of things' does a data scientist do? What are the foundations and principles of data science? What is a Data Product? What does the data science process look like? Learning from data: Data Modeling or Algorithmic Modeling? Talk by Carlos Somohano @ds_ldn at The Cloud and Big Data: HDInsight on Azure London 25/01/13
Presented by Rob Hanna at 2012 STC Summit in Rosemont, IL.
Take a journey into the Information Ecosystem where you will discover how structured information lives within your organization. Content is all around you—in places you may least expect. It exhibits predictable properties and behaviors that will help you capture and classify information for better management of your content.
Jiawei Han, Micheline Kamber and Jian Pei
Data Mining: Concepts and Techniques, 3rd ed.
The Morgan Kaufmann Series in Data Management Systems
Morgan Kaufmann Publishers, July 2011. ISBN 978-0123814791
DMTM Lecture 13 Representative based clustering (Pier Luca Lanzi)
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
2. Lecture outline
Why Data Mining?
What is Data Mining?
What are the typical tasks?
What are the primitives?
What are the typical applications?
What are the major issues?
Prof. Pier Luca Lanzi
4. Why Data Mining?
“Necessity is the mother of invention”
Explosive Growth of Data
Terabytes of available data
Data collections and data availability
Major sources of abundant data
Pressing need for the automated analysis of massive data
5. Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models
(extended-relational, OO, deductive, etc.)
Application-oriented DBMS
(spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases,
and Web databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration)
Global information systems
6. Examples
In vitro fertilization
Given: embryos described by 60 features
Problem: selection of embryos that will survive
Data: historical records of embryos and outcome
Cow culling
Given: cows described by 700 features
Problem: selection of cows that should be culled
Data: historical records and farmers’ decisions
7. Examples
Customer attrition
Given: customer information for the past months
Problem: predict who is likely to attrite next month, or estimate customer value
Data: historical customer records
Credit assessment
Given: a loan application
Problem: predict whether the bank should approve the loan
Data: records from other loans
9. What is Data Mining?
The non-trivial process of identifying
valid
novel
potentially useful, and
ultimately understandable patterns in data.
Alternative names:
Data Fishing, Data Dredging (1960-)
Data Mining (1990-), used by DB and business
Knowledge Discovery in Databases (1989-), used by AI
Business Intelligence, Information Harvesting,
Information Discovery, Knowledge Extraction, ...
Currently, Data Mining and Knowledge Discovery
are used interchangeably
11. Example: Credit Risk
IF salary < k THEN not repaid
[Figure: plot with axes loan and salary; the threshold k is marked on the salary axis]
12. Example: Credit Risk
Is it valid?
The pattern has to be valid with respect
to a certainty level (e.g., the rule holds in 86% of cases)
Is it novel?
The value k should be previously
unknown and not obvious
Is it useful?
The pattern should provide information
useful to the bank for assessing credit risk
Is it understandable?
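The validity check above can be sketched in a few lines. This is only an illustration: the loan records, the threshold value, and the function name rule_confidence are assumptions made up for the example, not data from the lecture.

```python
# Hypothetical loan records as (salary, repaid) pairs; values are invented.
records = [
    (18_000, False), (22_000, False), (25_000, True), (27_000, False),
    (31_000, True), (35_000, True), (40_000, True), (52_000, True),
]

def rule_confidence(records, k):
    """Fraction of applicants with salary < k who did not repay the loan."""
    covered = [repaid for salary, repaid in records if salary < k]
    if not covered:
        return 0.0  # the rule covers nobody, so there is nothing to validate
    return sum(1 for repaid in covered if not repaid) / len(covered)

k = 30_000  # assumed threshold, chosen by hand for the example
confidence = rule_confidence(records, k)
print(f"IF salary < {k} THEN not repaid holds for {confidence:.0%} of covered cases")
```

With these invented records the rule covers four applicants, three of whom did not repay, so it would be judged against a certainty level of 75%.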
13. What is the general idea?
Build computer programs that sift through databases
automatically, seeking regularities or patterns
There will be problems
Most patterns are banal and uninteresting
Most patterns are spurious, inexact, or contingent on
accidental coincidences in the particular dataset used
Real data is imperfect: Some parts will be garbled,
and some will be missing
Algorithms need to be robust enough to cope with imperfect
data and to extract regularities that are inexact but useful
14. What are the related fields?
[Figure: Knowledge Discovery and Data Mining, linked to Machine Learning, Visualization, Statistics, and Databases]
15. Statistics, Machine Learning, and Data Mining
Statistics:
more theory-based, focused on testing hypotheses
Machine learning
more heuristic, focused on building programs
that learn, more general than Data Mining
Knowledge Discovery
integrates theory and heuristics
focus on the entire process of discovery, including
data cleaning, learning, integration and visualization
Data Mining
focus on the algorithms to extract patterns from data
Distinctions are blurred!
16. Why Not Traditional Data Analysis?
Tremendous amount of data
High scalability to handle terabytes of data
High-dimensionality of data
Micro-array may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks
and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
17. Knowledge Discovery Process
[Figure: raw data → selection → cleaning → transformation → mining → evaluation]
18. Knowledge Discovery Process
What are the main steps?
Learning the application domain to extract
relevant prior knowledge and goals
Data selection
Data cleaning
Data reduction and transformation
Mining
Select the mining approach: classification,
regression, association, clustering, etc.
Choosing the mining algorithm(s)
Perform mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation,
removing redundant patterns, etc.
Use of discovered knowledge
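The main steps listed above can be chained as small functions. The following is a minimal sketch under stated assumptions: the customer records, the field names, the income cut-off of 40,000, and the "majority class per band" model are all hypothetical and stand in for real selection, cleaning, transformation, mining, and evaluation components.

```python
from collections import Counter

# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "age": 25, "income": 30_000, "churned": "yes"},
    {"id": 2, "age": None, "income": 45_000, "churned": "no"},
    {"id": 3, "age": 52, "income": 80_000, "churned": "no"},
    {"id": 4, "age": 31, "income": 28_000, "churned": "yes"},
    {"id": 5, "age": 47, "income": 62_000, "churned": "no"},
]

def select(rows, fields):
    """Data selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in rows]

def clean(rows):
    """Data cleaning: drop records with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def transform(rows):
    """Data reduction and transformation: discretize income into two bands."""
    return [{**r, "income_band": "low" if r["income"] < 40_000 else "high"} for r in rows]

def mine(rows):
    """Mining: learn the majority class for each income band."""
    counts = {}
    for r in rows:
        counts.setdefault(r["income_band"], Counter())[r["churned"]] += 1
    return {band: c.most_common(1)[0][0] for band, c in counts.items()}

def evaluate(rows, model):
    """Pattern evaluation: accuracy of the learned pattern on the given rows."""
    return sum(model[r["income_band"]] == r["churned"] for r in rows) / len(rows)

data = transform(clean(select(raw, ["age", "income", "churned"])))
model = mine(data)
accuracy = evaluate(data, model)
print(model, accuracy)
```

The accuracy here is measured on the same records used for mining, so it is optimistic; a realistic evaluation would use held-out data.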
19. Knowledge Discovery and Business Intelligence
Layers, from the top (highest potential to support business decisions) to the bottom:
Making Decisions (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts: OLAP, MDA (DBA)
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
20. Architecture of a Typical Knowledge Discovery System
[Figure: layered architecture, listed from top to bottom]
Graphical user interface
Pattern evaluation
KB
Data mining engine
Database or data warehouse server
DB, DW
22. Major Data Mining Tasks
Classification: predicting an item class
Clustering: finding clusters in data
Associations: frequently occurring events…
Visualization: to facilitate human discovery
Summarization: describing a group
Deviation Detection: finding changes
Estimation: predicting a continuous value
Link Analysis: finding relationships
23. Data Mining Tasks: classification
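IF salary < k THEN not repaid
[Figure: loan (vertical axis) vs. salary (horizontal axis), with threshold k on the salary axis and two points marked "?"]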
24. Data Mining Tasks: classification
Classification and Prediction
Finding models (functions) that describe
and distinguish classes or concepts
The goal is to describe the data or
to make future predictions
E.g., classify countries based on climate,
or classify cars based on gas mileage
Presentation: decision-tree, classification rule, neural
network
Prediction: Predict some unknown numerical values
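As an illustration of finding a model that distinguishes classes, the sketch below learns the threshold k of the credit rule from labelled examples. It is a one-attribute decision stump, named here as a stand-in technique; the samples (salary in thousands, repaid flag) and the function name best_threshold are assumptions for the example.

```python
def best_threshold(samples):
    """Return (k, accuracy) for the rule: salary < k -> not repaid, else repaid.

    Candidate thresholds are the midpoints between consecutive distinct
    salaries; the first candidate with the highest accuracy wins.
    Returns (None, -1.0) if there are fewer than two distinct salaries.
    """
    salaries = sorted({salary for salary, _ in samples})
    candidates = [(a + b) / 2 for a, b in zip(salaries, salaries[1:])]
    best_k, best_acc = None, -1.0
    for k in candidates:
        # The rule predicts "repaid" exactly when salary >= k.
        correct = sum((salary >= k) == repaid for salary, repaid in samples)
        acc = correct / len(samples)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc

# Hypothetical training data: (salary in thousands, repaid?)
samples = [(18, False), (22, False), (25, True), (27, False),
           (31, True), (35, True), (40, True), (52, True)]
k, acc = best_threshold(samples)
print(k, acc)
```

The learned rule can then be presented as a classification rule, which is one of the presentation forms listed above.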
26. Data Mining Tasks: clustering
Cluster analysis
The class label is unknown
Group data to form new classes, e.g., cluster houses to
find distribution patterns
Clustering based on the principle: maximizing the intra-class
similarity and minimizing the interclass similarity
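One common way to approximate this principle is k-means, which is used here as an assumed example technique rather than the method prescribed by the lecture. The sketch clusters hypothetical house prices (in thousands) in one dimension; the starting centroids are picked by hand.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on numbers: assign each value to the nearest centroid,
    then move each centroid to the mean of its cluster."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

prices = [95, 100, 110, 240, 250, 265]  # hypothetical house prices
centroids, clusters = kmeans_1d(prices, centroids=[95, 265])
print(centroids, clusters)
```

No class label is given to the algorithm; the two groups emerge only from the distances between values.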
27. Data Mining Tasks: associations
Market baskets:
1: Bread, Peanuts, Milk, Fruit, Jam
2: Bread, Jam, Soda, Chips, Milk, Fruit
3: Steak, Jam, Soda, Chips, Bread
4: Jam, Soda, Peanuts, Milk, Fruit
5: Jam, Soda, Chips, Milk, Bread
6: Fruit, Soda, Chips, Milk
7: Fruit, Soda, Peanuts, Milk
8: Fruit, Peanuts, Cheese, Yogurt
Is there something interesting?
28. Data Mining Tasks: associations
Association Rule Mining
Finds interesting associations and/or correlation
relationships among large sets of data items.
E.g., 98% of people who purchase tires and auto
accessories also get automotive services done
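A percentage such as the 98% above is the confidence of a rule. The sketch below computes support and confidence for one rule; the five baskets are hypothetical (only the item names echo the earlier basket slide), and the function names are assumptions.

```python
# Hypothetical market baskets, each a set of purchased items.
baskets = [
    {"bread", "milk", "jam"},
    {"bread", "jam", "soda"},
    {"milk", "soda", "chips"},
    {"bread", "milk", "jam", "fruit"},
    {"soda", "chips"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent. Assumes the antecedent occurs at least once."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

s = support({"bread", "jam"}, baskets)
c = confidence({"bread"}, {"jam"}, baskets)
print(f"bread => jam: support {s:.0%}, confidence {c:.0%}")
```

In this toy data every basket with bread also contains jam, so the rule bread => jam has confidence 100% at a support of 60%.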
29. Data Mining Tasks: others
Outlier analysis
Outlier: a data object that does not comply with the
general behavior of the data
It can be considered as noise or exception but is quite
useful in fraud detection, rare events analysis
Trend and evolution analysis
Trend and deviation: regression analysis
Sequential pattern mining, periodicity analysis
Similarity-based analysis
Text Mining, Graph Mining, Data Streams
Other pattern-directed or statistical analyses
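For outlier analysis, one simple statistical approach is the modified z-score based on the median absolute deviation (MAD). It is shown here as an assumed example technique; the transaction amounts are invented, and 3.5 is a commonly used cut-off rather than a value from the lecture.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median of the absolute deviations from the median.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median, so the score is undefined
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

amounts = [52, 48, 50, 51, 49, 50, 250]  # hypothetical transaction amounts
outliers = mad_outliers(amounts)
print(outliers)
```

Whether the flagged value is noise or a fraud signal is a domain decision; the code only identifies the object that does not comply with the general behavior.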
30. Are all the “Discovered” Patterns Interesting?
Data Mining may generate thousands of patterns;
not all of them are interesting.
Suggested approach: Human-centered, query-based,
focused mining
Interestingness measures: a pattern is interesting if it is
easily understood by humans, valid on new or test data with
some degree of certainty, potentially useful, novel, or
validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures:
Objective: based on statistics and structures of patterns,
e.g., support, confidence, etc.
Subjective: based on user’s belief in the data, e.g.,
unexpectedness, novelty, etc.
31. Can we find all and only interesting patterns?
Completeness: Find all the interesting patterns
Can a data mining system find all
the interesting patterns?
Association vs. classification vs. clustering
Optimization: Search for only interesting patterns:
Can a data mining system find only
the interesting patterns?
Approaches
• First generate all the patterns and then filter out the
uninteresting ones.
• Generate only the interesting patterns: mining query
optimization
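The two approaches can be contrasted on frequent itemsets, taking "interesting" to mean "appears in at least MIN_COUNT baskets". This is a sketch under assumptions: the baskets and the threshold are hypothetical, and the second function is a simplified level-wise search in the spirit of Apriori, not a full implementation.

```python
from itertools import combinations

# Hypothetical baskets and an assumed interestingness threshold.
baskets = [
    {"bread", "milk", "jam"},
    {"bread", "jam", "soda"},
    {"milk", "soda", "chips"},
    {"bread", "milk", "jam", "fruit"},
    {"soda", "chips"},
]
MIN_COUNT = 2
ITEMS = sorted(set().union(*baskets))

def count(itemset):
    """Number of baskets containing every item of itemset."""
    return sum(itemset <= basket for basket in baskets)

def generate_then_filter():
    """Approach 1: enumerate every non-empty itemset, then filter."""
    candidates = [frozenset(c) for n in range(1, len(ITEMS) + 1)
                  for c in combinations(ITEMS, n)]
    return {c for c in candidates if count(c) >= MIN_COUNT}, len(candidates)

def generate_only_frequent():
    """Approach 2: grow itemsets level by level from frequent ones only."""
    level = {frozenset([i]) for i in ITEMS if count(frozenset([i])) >= MIN_COUNT}
    frequent, evaluated = set(level), len(ITEMS)
    while level:
        size = len(next(iter(level))) + 1
        # Join pairs of frequent itemsets that differ in exactly one item.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        evaluated += len(candidates)
        level = {c for c in candidates if count(c) >= MIN_COUNT}
        frequent |= level
    return frequent, evaluated

all_then_filter, n_all = generate_then_filter()
only_frequent, n_pruned = generate_only_frequent()
print(len(all_then_filter), n_all, n_pruned)
```

Both functions return the same itemsets, but the second evaluates far fewer candidates, which is the motivation for pushing interestingness constraints into the search.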
32. Data Mining tasks
General functionality
Descriptive data mining
Predictive data mining
Different views, different classifications
Kinds of data to be mined
Kinds of knowledge to be discovered
Kinds of techniques utilized
Kinds of applications adapted
34. Primitives that Define a Data Mining Task
Task-relevant data
Type of knowledge to be mined
Background knowledge
Pattern interestingness measurements
Visualization/presentation of discovered patterns
35. Primitive 1: Task-Relevant Data
Database or data warehouse name
Database tables or data warehouse cubes
Condition for data selection
Relevant attributes or dimensions
Data grouping criteria
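The elements of this primitive map naturally onto a database query. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the sales table, its columns, and its rows are hypothetical.

```python
import sqlite3

# Database name: an in-memory database stands in for a real DB or warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("north", "tv", 2003, 400.0),
        ("north", "radio", 2003, 50.0),
        ("south", "tv", 2003, 300.0),
        ("south", "tv", 2002, 250.0),
        ("north", "tv", 2003, 450.0),
    ],
)

rows = conn.execute(
    """
    SELECT region, product, SUM(amount)   -- relevant attributes
    FROM sales                            -- database table
    WHERE year = 2003                     -- condition for data selection
    GROUP BY region, product              -- data grouping criteria
    ORDER BY region, product
    """
).fetchall()
print(rows)
```

The result set is the task-relevant data that a mining step would then consume.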
36. Primitive 2: Types of Knowledge to Be Mined
Characterization
Discrimination
Association
Classification/prediction
Clustering
Outlier analysis
Other data mining tasks
37. Primitive 3: Background Knowledge
A typical kind of background knowledge: Concept hierarchies
Schema hierarchy
E.g., Street < City < ProvinceOrState < Country
Set-grouping hierarchy
E.g., {20-39} = young, {40-59} = middle_aged
Operation-derived hierarchy
email address: hagonzal@cs.uiuc.edu
login-name < department < university < country
Rule-based hierarchy
LowProfitMargin (X) <= Price(X, P1) and Cost (X, P2)
and (P1 - P2) < $50
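Three of the hierarchy kinds above can be expressed as small functions. This is an illustrative sketch: the function names are made up, the mapping from the "edu" domain to a country is an assumption added for the example, and the email parser only handles addresses shaped like the one on the slide.

```python
def age_group(age):
    """Set-grouping hierarchy: {20-39} = young, {40-59} = middle_aged."""
    if 20 <= age <= 39:
        return "young"
    if 40 <= age <= 59:
        return "middle_aged"
    return None  # the slide defines no group outside 20-59

def email_levels(address):
    """Operation-derived hierarchy: login-name < department < university < country."""
    login, domain = address.split("@")
    department, university, tld = domain.split(".")
    # Assumption: the top-level domain is mapped to a country by hand.
    country = {"edu": "USA"}.get(tld, tld)
    return [login, department, university, country]

def low_profit_margin(price, cost):
    """Rule-based hierarchy: LowProfitMargin(X) <= Price(X, P1) and
    Cost(X, P2) and (P1 - P2) < $50."""
    return (price - cost) < 50

print(age_group(25), email_levels("hagonzal@cs.uiuc.edu"), low_profit_margin(120, 90))
```

Replacing raw values with their higher-level concepts in this way lets patterns be mined and presented at a coarser level of abstraction.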
39. Primitive 5: Presentation of Discovered Patterns
Different backgrounds/usages may require
different forms of representation
E.g., rules, tables, crosstabs, pie/bar chart, etc.
Concept hierarchy is also important
Discovered knowledge might be more understandable
when represented at a high level of abstraction
Interactive drill up/down, pivoting, slicing and dicing
provide different perspectives to data
Different kinds of knowledge require different representation:
association, classification, clustering, etc.
40. Integration of Data Mining and Data Warehousing
Coupling of data mining systems, DBMS, and data warehouse systems
No coupling, loose coupling, semi-tight coupling, tight coupling
On-line analytical mining of data
Integration of mining and OLAP technologies
Interactive mining of multi-level knowledge
Necessity of mining knowledge and patterns at different
levels of abstraction by drilling/rolling, pivoting,
slicing/dicing, etc.
Integration of multiple mining functions
Characterized classification, first clustering and then
association
41. Coupling Data Mining with Databases and Data Warehouses
No coupling—flat file processing, not recommended
Loose coupling
Fetching data from DB/DW
Semi-tight coupling—enhanced DM performance
Provide efficient implementations of a few data mining primitives
in a DB/DW system, e.g., sorting, indexing, aggregation,
histogram analysis, multiway join, precomputation of
some stat functions
Tight coupling—A uniform information processing
environment
DM is smoothly integrated into a DB/DW system; mining
queries are optimized based on mining query analysis, indexing,
query processing methods, etc.
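The difference between loose and semi-tight coupling can be sketched with the same counting task done in two places. The example assumes Python's built-in sqlite3 module and a hypothetical purchases table; it illustrates the idea only and does not model a real data mining system.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(1, "bread"), (1, "jam"), (2, "bread"), (3, "milk"), (3, "bread")],
)

# Loose coupling: fetch the raw rows from the database and do the
# aggregation inside the mining application.
fetched = conn.execute("SELECT item FROM purchases").fetchall()
loose = Counter(item for (item,) in fetched)

# Semi-tight coupling: the aggregation primitive runs inside the database,
# and only the summarized result reaches the application.
semi_tight = dict(conn.execute("SELECT item, COUNT(*) FROM purchases GROUP BY item"))

print(dict(loose), semi_tight)
```

Both routes yield the same counts; the second moves less data out of the database, which is where the performance gain of semi-tight coupling comes from.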
43. Major Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio,
stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing knowledge: knowledge
fusion
User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
Domain-specific data mining & invisible data mining
Protection of data security, integrity, and privacy
45. Summary
Data mining: Discovering interesting patterns
from large amounts of data
A natural evolution of database technology,
in great demand, with wide applications
A KDD process includes data cleaning, data integration,
data selection, transformation, data mining,
pattern evaluation, and knowledge presentation
Data mining functionalities: characterization, discrimination,
association, classification, clustering,
outlier and trend analysis, etc.
Data mining systems and architectures
Major issues in data mining