1. Mining Complex Data
• Vast amounts of data are stored in various complex forms, such as structured or unstructured,
hypertext, and multimedia.
• Thus, mining complex types of data, including object data, spatial data, multimedia data, text data, and
Web data, has become an increasingly important task in data mining.
• Multidimensional analysis and data mining can be performed in object-relational and object-oriented
databases by
• (1) class-based generalization of complex objects, including set-valued, list-valued, and other
sophisticated types of data, class/subclass hierarchies, and class composition hierarchies;
• (2) constructing object data cubes; and
• (3) performing generalization-based mining.
• A plan database can be mined by a generalization-based, divide-and-conquer approach in order to
find interesting general patterns at different levels of abstraction.
• Spatial data mining is the discovery of interesting patterns from large geospatial databases.
Spatial data cubes that contain spatial dimensions and measures can be constructed. Spatial OLAP
can be implemented to facilitate multidimensional spatial data analysis. Spatial data mining
includes mining spatial association and co-location patterns, clustering, classification, and
spatial trend and outlier analysis.
• Multimedia data mining is the discovery of interesting patterns from multimedia databases that
store and manage large collections of multimedia objects, including audio data, image data, video
data, sequence data, and hypertext data containing text, text markups, and linkages. Issues in
multimedia data mining include content-based retrieval and similarity search, and generalization
and multidimensional analysis. Multimedia data cubes contain additional dimensions and measures
for multimedia information. Other topics in multimedia mining include classification and
prediction analysis, mining associations, and audio and video data mining.
• A substantial portion of the available information is stored in text or document databases that
consist of large collections of documents, such as news articles, technical papers, books, digital
libraries, e-mail messages, and Web pages. Text information retrieval and data mining have thus
become increasingly important. Precision, recall, and the F-score are three basic measures from
Information Retrieval (IR). Various text retrieval methods have been developed. These typically
focus either on document selection (where the query is regarded as providing constraints) or on
document ranking (where the query is used to rank documents in order of relevance). The
vector-space model is a popular example of the latter kind. Latent Semantic Indexing (LSI),
Locality Preserving Indexing (LPI), and Probabilistic LSI can be used for text dimensionality
reduction. Text mining goes one step beyond keyword-based and similarity-based information
retrieval and discovers knowledge from semistructured text data using methods such as
keyword-based association analysis, document classification, and document clustering.
• The World Wide Web serves as a huge, widely distributed, global information service center for
news, advertisements, consumer information, financial management, education, government,
e-commerce, and many other services. It also contains a rich and dynamic collection of hyperlink
information, and access and usage information, providing rich sources for data mining. Web mining
includes mining Web linkage structures, Web contents, and Web access patterns. This involves
mining the Web page layout structure, mining the Web’s link structures to identify authoritative
Web pages, mining multimedia data on the Web, automatic classification of Web documents, and
Web usage mining.
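The precision, recall, and F-score measures named in the text-retrieval point above can be sketched as follows. This is a minimal illustration using the standard IR definitions; the function name `precision_recall_f`, the `beta` parameter, and the document ids are assumptions made for the example, not part of the source.

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Return (precision, recall, F-score) for a single query.

    retrieved: document ids returned by the retrieval system
    relevant:  document ids judged relevant to the query
    beta:      weight of recall relative to precision (1.0 gives the F1 score)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


# Hypothetical query: 4 documents retrieved, 5 relevant, 3 in common.
p, r, f = precision_recall_f(
    {"d1", "d2", "d3", "d7"},
    {"d1", "d2", "d3", "d4", "d5"},
)
print(p, r, f)
```

With `beta=1.0` the F-score is the harmonic mean of precision and recall, so it is high only when both are high.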
Trends in Data Mining
• The diversity of data, data mining tasks, and data mining approaches poses many challenging
research issues in data mining. The development of efficient and effective data mining methods
and systems, the construction of interactive and integrated data mining environments, the design
of data mining languages, and the application of data mining techniques to solve large application
problems are important tasks for data mining researchers and data mining system and application
developers. This section describes some of the trends in data mining that reflect the pursuit of
these challenges:
• Application exploration: Early data mining applications focused mainly on helping businesses
gain a competitive edge. The exploration of data mining for businesses continues to expand as
e-commerce and e-marketing have become mainstream elements of the retail industry. Data mining is
increasingly used for the exploration of applications in other areas, such as financial analysis,
telecommunications, biomedicine, and science. Emerging application areas include data mining for
counterterrorism (including and beyond intrusion detection) and mobile (wireless) data mining. As
generic data mining systems may have limitations in dealing with application-specific problems, we
may see a trend toward the development of more application-specific data mining systems.
• Scalable and interactive data mining methods: In contrast with traditional data analysis
methods, data mining must be able to handle huge amounts of data efficiently and, if possible,
interactively. Because the amount of data being collected continues to increase rapidly, scalable
algorithms for individual and integrated data mining functions become essential. One important
direction toward improving the overall efficiency of the mining process while increasing user
interaction is constraint-based mining. This provides users with added control by allowing the
specification and use of constraints to guide data mining systems in their search for interesting
patterns.
• Integration of data mining with database systems, data warehouse systems, and Web database
systems: Database systems, data warehouse systems, and the Web have become mainstream information
processing systems. It is important to ensure that data mining serves as an essential data
analysis component that can be smoothly integrated into such an information processing
environment. As discussed earlier, a data mining system should be tightly coupled with database
and data warehouse systems. Transaction management, query processing, on-line analytical
processing, and on-line analytical mining should be integrated into one unified framework. This
will ensure data availability, data mining portability, scalability, high performance, and an
integrated information processing environment for multidimensional data analysis and exploration.
• Standardization of data mining language: A standard data mining language or other
standardization efforts will facilitate the systematic development of data mining solutions,
improve interoperability among multiple data mining systems and functions, and promote the
education and use of data mining systems in industry and society. Recent efforts in this direction
include Microsoft’s OLE DB for Data Mining (the appendix of this book provides an introduction),
PMML, and CRISP-DM.
• Visual data mining: Visual data mining is an effective way to discover knowledge from huge
amounts of data. The systematic study and development of visual data mining techniques will
facilitate the promotion and use of data mining as a tool for data analysis.
• New methods for mining complex types of data: As shown in Chapters 8 to 10, mining complex types
of data is an important research frontier in data mining. Although progress has been made in
mining stream, time-series, sequence, graph, spatiotemporal, multimedia, and text data, there is
still a huge gap between the needs for these applications and the available technology. More
research is required, especially toward the integration of data mining methods with existing data
analysis techniques for these types of data.
• Biological data mining: Although biological data mining can be considered under “application
exploration” or “mining complex types of data,” the unique combination of complexity, richness,
size, and importance of biological data warrants special attention in data mining. Mining DNA and
protein sequences, mining high-dimensional microarray data, biological pathway and network
analysis, link analysis across heterogeneous biological data, and information integration of
biological data by data mining are interesting topics for biological data mining research.
• Data mining and software engineering: As software programs grow in size and complexity and
increasingly originate from the integration of multiple components developed by different
software teams, it is an increasingly challenging task to ensure software robustness and
reliability. The analysis of the executions of a buggy software program is essentially a data
mining process: tracing the data generated during program executions may disclose important
patterns and outliers that may lead to the eventual automated discovery of software bugs. We
expect that the further development of data mining methodologies for software debugging will
enhance software robustness and bring new vigor to software engineering.
• Web mining: Issues related to Web mining were also discussed in Chapter 10. Given the huge
amount of information available on the Web and the increasingly important role that the Web plays
in today’s society, Web content mining, Weblog mining, and data mining services on the Internet
will become one of the most important and flourishing subfields in data mining.
• Distributed data mining: Traditional data mining methods, designed to work at a centralized
location, do not work well in many of the distributed computing environments present today (e.g.,
the Internet, intranets, local area networks, high-speed wireless networks, and sensor networks).
Advances in distributed data mining methods are expected.
• Real-time or time-critical data mining: Many applications involving stream data (such as
e-commerce, Web mining, stock analysis, intrusion detection, mobile data mining, and data mining
for counterterrorism) require dynamic data mining models to be built in real time. Additional
development is needed in this area.
• Graph mining, link analysis, and social network analysis: Graph mining, link analysis, and
social network analysis are useful for capturing sequential, topological, geometric, and other
relational characteristics of many scientific data sets (such as for chemical compounds and
biological networks) and social data sets (such as for the analysis of hidden criminal networks).
Such modeling is also useful for analyzing links in Web structure mining. The development of
efficient graph and linkage models is a challenge for data mining.
• Multirelational and multidatabase data mining: Most data mining approaches search for patterns
in a single relational table or in a single database. However, most real-world data and
information are spread across multiple tables and databases. Multirelational data mining methods
search for patterns involving multiple tables (relations) from a relational database.
Multidatabase mining searches for patterns across multiple databases. Further research is expected
in effective and efficient data mining across multiple relations and multiple databases.
• Privacy protection and information security in data mining: An abundance of recorded personal
information available in electronic forms and on the Web, coupled with increasingly powerful data
mining tools, poses a threat to our privacy and data security. Growing interest in data mining for
counterterrorism also adds to the threat. Further development of privacy-preserving data mining
methods is expected.
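The constraint-based mining direction listed among the trends above can be illustrated with a small example. This is a minimal sketch, assuming a level-wise (Apriori-style) frequent-itemset search and a hypothetical price constraint supplied by the user; the function name `mine_with_constraint`, its parameters, and the toy transactions and prices are illustrative and do not come from the source.

```python
from itertools import combinations


def mine_with_constraint(transactions, prices, min_support, max_total_price):
    """Level-wise frequent-itemset search guided by a user constraint.

    The constraint (total price of the itemset <= max_total_price) is
    anti-monotone when prices are non-negative: once an itemset violates
    it, every superset does too, so violating itemsets are never extended.
    """
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    def satisfies(itemset):
        return sum(prices[i] for i in itemset) <= max_total_price

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    level = [s for s in level if satisfies(s) and support(s) >= min_support]
    result = {}
    while level:
        for s in level:
            result[s] = support(s)
        # Join itemsets of size k into candidates of size k + 1.
        candidates = {
            a | b for a, b in combinations(level, 2) if len(a | b) == len(a) + 1
        }
        level = [c for c in candidates if satisfies(c) and support(c) >= min_support]
    return result


# Hypothetical market-basket data and item prices.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "wine"},
    {"milk", "wine"},
    {"bread", "milk", "cheese"},
]
prices = {"bread": 2, "milk": 1, "wine": 12, "cheese": 6}
patterns = mine_with_constraint(transactions, prices, min_support=2, max_total_price=5)
for itemset, count in patterns.items():
    print(sorted(itemset), count)
```

Checking the constraint during candidate generation, rather than filtering the final result, is what gives the user control over the search: itemsets containing wine or cheese are discarded before any of their supersets are counted.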
Data Mining, Privacy, and Data Security
With more and more information accessible in electronic forms and available on the Web, and with
increasingly powerful data mining tools being developed and put into use, there are increasing
concerns that data mining may pose a threat to our privacy and data security. However, it is
important to note that most of the major data mining applications do not even touch personal data.
Prominent examples include applications involving natural resources, the prediction of floods and
droughts, meteorology, astronomy, geography, geology, biology, and other scientific and
engineering data. Furthermore, most studies in data mining focus on the development of scalable
algorithms and also do not involve personal data. The focus of data mining technology is on the
discovery of general patterns, not on specific information regarding individuals. In this sense,
we believe that the real privacy concerns are with unconstrained access to individual records, as
in credit card and banking applications, which must access privacy-sensitive information. For
those data mining applications that do involve personal data, in many cases, simple methods such
as removing sensitive IDs from data may protect the privacy of most individuals. Numerous data
security–enhancing techniques have been developed recently. In addition, there has been a great
deal of recent effort on developing privacy-preserving data mining methods. In this section, we
look at some of the advances in protecting privacy and data security in data mining.
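The simple method mentioned above, removing sensitive IDs from the data before mining, can be sketched as follows. This is a minimal illustration, assuming records are plain dictionaries; the field names in `SENSITIVE_FIELDS`, the function name `strip_identifiers`, and the sample records are hypothetical.

```python
SENSITIVE_FIELDS = frozenset({"name", "ssn", "email"})  # hypothetical identifier fields


def strip_identifiers(records, sensitive=SENSITIVE_FIELDS):
    """Return copies of the records with directly identifying fields removed."""
    return [
        {key: value for key, value in record.items() if key not in sensitive}
        for record in records
    ]


# Made-up records used only for illustration.
records = [
    {"name": "A. Example", "ssn": "000-00-0000", "age": 41, "purchase": "laptop"},
    {"name": "B. Example", "email": "b@example.org", "age": 29, "purchase": "phone"},
]
cleaned = strip_identifiers(records)
print(cleaned)
```

The original records are left unchanged, so the identifying fields stay with the data owner while only the stripped copies are handed to the mining step. As the text says, this protects most individuals in many cases; it is not a guarantee.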
In 1980, the Organization for Economic Co-operation and Development (OECD) established a set of
international guidelines, referred to as fair information practices. These guidelines aim to
protect privacy and data accuracy. They cover aspects relating to data collection, use, openness,
security, quality, and accountability. They include the following principles:
Purpose specification and use limitation: The purposes for which personal data are collected
should be specified at the time of collection, and the data collected should not exceed the stated
purpose. Data mining is typically a secondary purpose of the data collection. It has been argued
that attaching a disclaimer that the data may also be used for mining is generally not accepted as
sufficient disclosure of intent. Due to the exploratory nature of data mining, it is impossible to
know what patterns may be discovered; therefore, there is no certainty over how they may be used.
Openness: There should be a general policy of openness about developments, practices, and policies
with respect to personal data. Individuals have the right to know the nature of the data collected
about them, the identity of the data controller (responsible for ensuring the principles), and how
the data are being used.
Security safeguards: Personal data should be protected by reasonable security safeguards against
such risks as loss or unauthorized access, destruction, use, modification, or disclosure of data.
Individual participation: An individual should have the right to learn whether the data controller
has data relating to him or her, and if so, what that data is. The individual may also challenge
such data. If the challenge is successful, the individual has the right to have the data erased,
corrected, or completed. Typically, inaccurate data are only detected when an individual
experiences some repercussion from it, such as the denial of credit or the withholding of a
payment. The organization involved usually cannot detect such inaccuracies because it lacks the
necessary contextual knowledge.
“How can these principles help protect customers from companies that collect personal client
data?” One solution is for such companies to provide consumers with multiple opt-out choices,
allowing consumers to specify limitations on the use of their personal data, such as (1) the
consumer’s personal data are not to be used at all for data mining; (2) the consumer’s data can be
used for data mining, but the identity of each consumer or any information that may lead to the
disclosure of a person’s identity should be removed; (3) the data may be used for in-house mining
only; or (4) the data may be used in-house and externally as well. Alternatively, companies may
provide consumers with positive consent, that is, by allowing consumers to opt in on the secondary
use of their information for data mining. Ideally, consumers should be able to call a toll-free
number or access a company website in order to opt in or out and request access to their personal
data.
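The four opt-out choices above could be represented in a customer data pipeline roughly as follows. This is only a sketch under stated assumptions: the enum `MiningConsent`, the function `records_for_mining`, and the field names are hypothetical, and the rule that anonymized records stay in-house is an assumption the text does not state.

```python
from enum import Enum


class MiningConsent(Enum):
    NO_MINING = 1              # (1) not to be used for data mining at all
    ANONYMIZED = 2             # (2) usable once identifying information is removed
    IN_HOUSE_ONLY = 3          # (3) in-house mining only
    IN_HOUSE_AND_EXTERNAL = 4  # (4) in-house and external use


IDENTIFYING_FIELDS = frozenset({"customer_id", "name"})  # hypothetical field names


def records_for_mining(customers, external=False):
    """Return the records that may be mined, honoring each customer's choice.

    customers: iterable of (record, MiningConsent) pairs
    external:  True when the data would leave the company
    Assumption (not stated in the text): anonymized records are in-house only.
    """
    allowed = []
    for record, consent in customers:
        if consent is MiningConsent.NO_MINING:
            continue
        if external and consent is not MiningConsent.IN_HOUSE_AND_EXTERNAL:
            continue
        if consent is MiningConsent.ANONYMIZED:
            # Drop identifying fields before the record reaches the mining step.
            record = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
        allowed.append(record)
    return allowed


# Made-up customers, one per choice.
customers = [
    ({"customer_id": 1, "name": "A", "basket": ["milk"]}, MiningConsent.NO_MINING),
    ({"customer_id": 2, "name": "B", "basket": ["bread"]}, MiningConsent.ANONYMIZED),
    ({"customer_id": 3, "name": "C", "basket": ["wine"]}, MiningConsent.IN_HOUSE_ONLY),
    ({"customer_id": 4, "name": "D", "basket": ["tea"]}, MiningConsent.IN_HOUSE_AND_EXTERNAL),
]
in_house = records_for_mining(customers)
shared = records_for_mining(customers, external=True)
print(in_house)
print(shared)
```

An opt-in (positive consent) policy would use the same structure with `NO_MINING` as the default for any customer who has not made a choice.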
Counterterrorism is a new application area for data mining that is gaining interest. Data mining
for counterterrorism may be used to detect unusual patterns, terrorist activities (including
bioterrorism), and fraudulent behavior. This application area is in its infancy because it faces
many challenges. These include developing algorithms for real-time mining (e.g., for building
models in real time, so as to detect real-time threats such as that a building is scheduled to be
bombed by 10 a.m. the next morning); for multimedia data mining (involving audio, video, and image
mining, in addition to text mining); and in finding unclassified data to test such applications.
While this new form of data mining raises concerns about individual privacy, it is again important
to note that the data mining research is to develop a tool for the detection of abnormal patterns
or activities, and the use of such tools to access certain data to uncover terrorist patterns or
activities is confined only to authorized security agents.
“What can we do to secure the privacy of individuals while collecting and mining data?” Many data
security–enhancing techniques have been developed to help protect data. Databases can employ a
multilevel security model to classify and restrict data according to various security levels, with
users permitted access to only their authorized level. It has been shown, however, that users
executing specific queries at their authorized security level can still infer more sensitive
information, and that a similar possibility can occur through data mining. Encryption is another
technique in which individual data items may be encoded. This may involve blind signatures (which
build on public key encryption), biometric encryption (e.g., where the image of a person’s iris or
fingerprint is used to encode his or her personal information), and anonymous databases (which
permit the consolidation of various databases but limit access to personal information to only
those who need to know; personal information is encrypted and stored at different locations).
Intrusion detection is another active area of research that helps protect the privacy of personal
data.
Privacy-preserving data mining is a new area of data mining research that is emerging in response
to privacy protection during mining. It is also known as privacy-enhanced or privacy-sensitive
data mining. It deals with obtaining valid data mining results without learning the underlying
data values. There are two common approaches: secure multiparty computation and data obscuration.
In secure multiparty computation, data values are encoded using simulation and cryptographic
techniques so that no party can learn another’s data values. This approach can be impractical when
mining large databases. In data obscuration, the actual data are distorted by aggregation (such as
using the average income for a neighborhood, rather than the actual income of residents) or by
adding random noise. The original distribution of a collection of distorted data values can be
approximated using a reconstruction algorithm. Mining can be performed using these approximated
values, rather than the actual ones. Although a common framework for defining, measuring, and
evaluating privacy is needed, many advances have been made. The field is expected to flourish.
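The two forms of data obscuration described above, aggregation and added random noise, can be sketched as follows. This is a simplified illustration on synthetic data. It recovers only the mean and variance of the original values, assuming zero-mean Gaussian noise with a known standard deviation; it is a stand-in for, not an implementation of, the distribution reconstruction algorithm the text refers to. All function names and parameters are hypothetical.

```python
import random
import statistics


def aggregate_by_group(values, groups):
    """Replace each value by the average of its group (e.g., its neighborhood)."""
    totals, counts = {}, {}
    for value, group in zip(values, groups):
        totals[group] = totals.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    return [totals[group] / counts[group] for group in groups]


def obscure_with_noise(values, noise_sd, rng):
    """Distort each value by adding independent zero-mean Gaussian noise."""
    return [value + rng.gauss(0.0, noise_sd) for value in values]


def estimate_original_moments(distorted, noise_sd):
    """Estimate the mean and variance of the original values.

    Independent zero-mean noise leaves the mean unchanged in expectation
    and adds noise_sd ** 2 to the variance, so that amount is subtracted.
    """
    mean = statistics.fmean(distorted)
    variance = max(statistics.pvariance(distorted) - noise_sd ** 2, 0.0)
    return mean, variance


rng = random.Random(0)  # fixed seed so the sketch is repeatable
incomes = [rng.gauss(50_000, 10_000) for _ in range(20_000)]  # synthetic incomes
noise_sd = 20_000
distorted = obscure_with_noise(incomes, noise_sd, rng)
est_mean, est_var = estimate_original_moments(distorted, noise_sd)
print(round(est_mean), round(est_var ** 0.5))
print(aggregate_by_group([10, 30, 50], ["a", "a", "b"]))
```

Each distorted value is far from the true one, since the noise is twice as spread out as the data, yet the aggregate statistics a mining algorithm needs remain close to those of the original collection.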
Like any other technology, data mining may be misused. However, we must not lose sight of all the
benefits that data mining research can bring, ranging from insights gained from medical and
scientific applications to increased customer satisfaction by helping companies better suit their
clients’ needs. We expect that computer scientists, policy experts, and counterterrorism experts
will continue to work with social scientists, lawyers, companies, and consumers to take
responsibility in building solutions to ensure data privacy protection and security. In this way,
we may continue to reap the benefits of data mining in terms of time and money savings and the
discovery of new knowledge.