Text mining tutorial



  1. Getting Started with Text Mining: STM™, CART® and TreeNet®

     Dan Steinberg, Mykhaylo Golovnya, Ilya Polosukhin
     May 2011
  2. Text Mining and Data Mining

     Text mining is an important and fascinating area of modern analytics.
     On the one hand, text mining can be thought of as just another
     application area for powerful learning machines. On the other hand,
     text mining is a distinct field with its own dedicated concepts,
     vocabulary, tools, and techniques.

     In this tutorial we aim to illustrate important analytical methods and
     strategies from both perspectives on data mining:
     - introducing tools specific to the analysis of text, and
     - deploying general machine learning technology.

     The Salford Text Mining utility (STM) is a powerful text processing
     system that prepares data for advanced machine learning analytics. Our
     machine learning tools are the Salford Systems flagship CART® decision
     tree and the stochastic gradient boosting TreeNet®. Evaluation copies
     of the proprietary technology in CART and TreeNet, as well as the STM,
     are available from http://www.salford-systems.com

     Salford Systems © Copyright 2011
  3. For Readers of this Tutorial

     To follow along with this tutorial we recommend that you have the
     analytical tools we use installed on your computer. Everything you need
     may already be on a CD disk containing this tutorial and the analytical
     software.

     Create an empty folder named "stmtutor"; this is the root folder where
     all of the work files related to this tutorial will reside.

     You may also use the following link to download the Salford Systems
     Predictive Modeler (SPM):
     http://www.salford-systems.com/dist/SPM/SPM680_Mulitple_Installs_2011_06_07.zip

     After downloading the package, unzip its contents into "stmtutor",
     which will create a new folder named
     "SPM680_Mulitple_Installs_2011_06_07". Follow the installation steps
     described on the next slide.

     For the original DMC2006 competition website visit
     http://www.data-mining-cup.de/en/review/dmc-2006/
     We recommend that you visit the above site for information only; the
     data and the tools for preparing that data are available at the URL
     below.

     For the STM package, prepared data files, and other utilities developed
     for this tutorial, please visit
     http://www.salford-systems.com/dist/STM.zip
     After downloading the archive, unzip its contents into "stmtutor".
  4. Important! Installing the SPM Software

     The Salford Systems software you've just downloaded needs to be both
     installed and licensed. No-cost license codes for a 30-day period are
     available on request to visitors of this tutorial.*

     Double-click on the "Install_a_Transform_SPM.exe" file located in the
     "SPM680_Mulitple_Installs_2011_06_07" folder (see the previous slide)
     to install the specific version of SPM used in this tutorial.
     - Following the above procedure will ensure that all of the currently
       installed versions of SPM, if any, will remain intact!

     Follow the simple installation steps on your screen.

     * Salford Systems reserves the right to decline to offer a no-cost
       license at its sole discretion.
  5. Important! Licensing the SPM Software

     When you launch the Salford Systems Predictive Modeler (SPM) you will
     be greeted with a License dialog containing the information needed to
     secure a license via email.

     Please send the necessary information to Salford Systems, then secure
     your license by entering the "Unlock Code" which will be e-mailed back
     to you.

     The software will operate for 3 days without any licensing; however,
     you can secure a 30-day license on request.
  6. Installing the Salford Text Miner (STM)

     In addition to the Salford Predictive Modeler (SPM) you will also work
     with the Salford Text Miner (STM) software.

     No installation is needed: you should already have the "stm.exe"
     executable in the "stmtutor\STM\bin" folder as the result of unzipping
     the "STM.zip" package earlier.

     STM builds upon the Python 2.6 distribution and the NLTK (Natural
     Language Toolkit) but makes text data processing for analytics very
     easy to conduct and manage.
     - You do not need to add any other support software to use STM.

     Expect to see several folders and a large number of files located under
     the "stmtutor\STM" folder. It is important to leave these files in the
     location to which you have installed them.
     - Please do not move or alter any of the installed files other than
       those explicitly listed as user-modifiable!

     "stm.exe" will expire in the middle of 2012; contact Salford Systems to
     get an updated version beyond that.
  7. The Example Project

     The best examples are drawn from real world data sets, and we were
     fortunate to locate data publicly released by eBay.

     Good teaching examples also need to be simple.
     - Unfortunately, real world text mining could easily involve hundreds
       of thousands if not millions of features characterizing billions of
       records. Professionals need to be able to tackle such problems, but
       to learn we need to start with simpler situations.
     - Fortunately, there are many applications in which text is important
       but the dimensions of the data set are radically smaller, either
       because the data available is limited or because a decision has been
       made to work with a reduced problem.

     We use our simpler example to illustrate many useful ideas for
     beginning text miners while pointing the way to working on larger
     problems.
  8. The DMC2006 Text Mining Challenge

     In 2006 the DMC data mining competition (restricted to student
     competitors only) introduced a predictive modeling problem for which
     much of the predictive information was in the form of unstructured
     text.

     The datasets for the DMC 2006 data mining competition can be downloaded
     from http://www.data-mining-cup.de/en/review/dmc-2006/
     - For your convenience we have re-packaged this data and made it
       somewhat easier to work with. This re-packaged data is included in
       the STM package described near the beginning of this tutorial.

     The data summarizes 16,000 iPod auctions held at eBay in Germany from
     May 2005 through May 2006.

     Each auction item is represented by a text description written by the
     seller (in German) as well as a number of flags and features available
     to the seller at the time of the auction.

     Auction items were grouped into 15 mutually exclusive categories based
     on distinct iPod features: storage size, type (regular, mini, nano),
     and color.

     The competition goal was to predict whether the closing price would be
     above or below the category average.
  9. Comments on the Challenge

     One might think that a challenge with text in German might not be of
     general interest outside of Germany. However, working with a language
     essentially unfamiliar to any member of the analysis team helps to
     illustrate one important point:
     - Text mining via tools that have no "understanding" of the language
       can be strikingly effective.

     We have no doubt that dedicated tools which embed knowledge of the
     language being analyzed can yield predictive benefits.
     - We also believe we could have gained further valuable insight into
       the data if any of the authors spoke German! But our performance
       without this knowledge is still impressive.

     In contexts where simple methods can yield more than satisfactory
     results, or where the same methods must be applied uniformly across
     multiple languages, the methods described in this tutorial will be an
     excellent guide.
  10. Configuring the Work Location in SPM

      The original datasets from the DMC 2006 challenge reside in the
      "stmtutor\STM\dmc2006" folder.

      To facilitate further modeling steps, we will configure SPM to use
      this location as the default:
      - Start SPM
      - Go to the Edit – Options menu
      - Switch to the Directories tab
      - Enter the "stmtutor\STM\dmc2006" folder location in all text entry
        boxes except the last one
      - Press the [Save as Defaults] button so that the configuration is
        restored the next time you start SPM
  11. Configuring the TreeNet Engine

      Now switch to the TreeNet tab:
      - Configure the Plot Creation section as shown on the screen shot
      - Press the [Save as Defaults] button
      - Press the [OK] button to exit
  12. Steps in the Analysis: Data Overview

      1. Describe the data (data dictionary and dimensions of the data)
         a. What is the unit of observation? Each record of data is
            describing what?
         b. What is the dependent or target variable?
         c. What other variables (database fields) are available?
         d. How many records are available?
      2. Statistical summary
         a. Basic summary including means, quantiles, frequency tables
         b. Dimensions of categorical predictors
         c. Number of distinct values of continuous variables
      3. Outlier and anomaly assessment
         a. Detection of gross data errors such as extreme values
         b. Assessment of usability of levels of categorical predictors
            (rare levels)
  13. Data Fundamentals

      The original dataset is called "dmc2006.csv" and resides in the
      "stmtutor\STM\dmc2006" folder.

      16,000 records divided into two equal sized partitions:
      - Part 1: Complete data including the target, available for training
        during the competition
      - Part 2: Data to be scored; during the competition the target was
        not available

      25 database fields, two of which were unstructured text written by
      the seller.

      Each line of data describes an auction of an iPod, including the
      final winning bid price.

      An eBay seller must construct a headline and a description of the
      product being sold. Sellers can also pay for selling assistance.
      - E.g., a seller can pay to list the item title in BOLD.
  14. The Data: Available Fields

      The following variables describe general features of each auction
      event.

      Variable                     Description
      AUCT_ID                      ID number of the auction
      ITEM_LEAF_CATEGORY_NAME      product category
      LISTING_START_DATE           start date of the auction
      LISTING_END_DATE             end date of the auction
      LISTING_DURTN_DAYS           duration of the auction
      LISTING_TYPE_CODE            type of auction (normal auction, multi
                                   auction, etc.)
      QTY_AVAILABLE_PER_LISTING    quantity of offered items for a multi
                                   auction
      FEEDBACK_SCORE_AT_LISTIN     feedback rating of the seller of this
                                   auction listing
      START_PRICE                  start price in EUR
      BUY_IT_NOW_PRICE             buy-it-now price in EUR
      BUY_IT_NOW_LISTING_FLAG      option for buy-it-now on this auction
                                   listing
  15. Available Data Fields

      In addition, there are binary indicators of various "value added"
      features that can be turned on for each auction.

      Variable                       Description
      BOLD_FEE_FLAG                  option for bold font on this auction
                                     listing
      FEATUERD_FEE_FLAG              show this auction listing on top of
                                     the homepage
      CATEGORY_FEATURED_FEE_FLAG     show this auction listing on top of
                                     the category
      GALLERY_FEE_FLAG               auction listing with a picture gallery
      GALLERY_FEATURED_FEE_FLAG      auction listing with gallery (in
                                     gallery view)
      IPIX_FEATURED_FEE_FLAG         auction listing with IPIX (additional
                                     XXL, picture show, pack)
      RESERVE_FEE_FLAG               auction listing with a reserve price
      HIGHLIGHT_FEE_FLAG             auction listing with background color
      SCHEDULE_FEE_FLAG              auction listing including the
                                     definition of the starting time
      BORDER_FEE_FLAG                auction listing with a frame
  16. Target Variable

      Finally, the target variable is defined based on the winning bid
      revenue relative to the category average.

      Variable              Description
      GMS                   scored sales revenue in EUR
      CATEGORY_AVG_GMS      average sales revenue for the product category
      GMS_GREATER_AVG       zero when the revenue is less than or equal to
                            the category average and one otherwise

      The values were only disclosed on a randomly selected set of 8,000
      auctions, which we use to train a model:
      - 4,199 auctions with revenue below the category average
      - 3,801 auctions with revenue above the category average

      During the competition the results for the remaining 8,000 auctions
      were kept secret and used to score competitive entries. We will only
      use these records at the very end of this tutorial to validate the
      performance of the various models that will be built.
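      The target definition above amounts to a one-line rule. A minimal
      sketch in Python (the function name is ours; the field names come
      from the data dictionary above):

      ```python
      # GMS_GREATER_AVG: 0 when the revenue is less than or equal to the
      # category average, 1 otherwise. Function name is illustrative.
      def gms_greater_avg(gms, category_avg_gms):
          return 1 if gms > category_avg_gms else 0

      print(gms_greater_avg(250.0, 199.5))  # above the category average -> 1
      print(gms_greater_avg(180.0, 199.5))  # at or below the average -> 0
      ```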
  17. Comments on Methodology

      Predictive modeling and general analytics competitions are
      increasingly being launched both by private companies and by
      professional organizations, and provide both public data sets and a
      wealth of illustrative examples using different analytic techniques.

      When reviewing results from a competition, and especially when
      comparing results generated by analysts running models after the
      competition, it is important to keep in mind that there is an ocean
      of difference between being a competitor during the actual
      competition and an after-the-fact commentator.

      Regardless of what is reported, the after-the-fact analyst does have
      access to "what really happened", and it is nearly impossible to
      simulate the competitive environment once the results have been
      published.
      - We all learn in both direct and indirect ways from many sources,
        including the outcomes of public competitions. This can affect
        anything that comes later in time.

      In spite of this, we have tried to mimic the circumstances of the
      competitors by presenting analyses based only on the original
      training data, and using well-established guidelines we have been
      promoting for more than a decade to arrive at a final model.

      We urge you never to take at face value an analyst's report on what
      would have happened if they had hypothetically participated.
  18. First Round Modeling: Ignoring the TEXT Data

      Even before doing any type of data preparation it is always valuable
      to run a few preliminary CART models:
      - CART automatically handles missing values and is immune to
        outliers.
      - CART is flexible enough to adapt to any type of nonlinearity and
        interaction effects among predictors. The analyst does not need to
        do any data preparation to assist CART in this regard.
      - CART performs well enough out of the box that we are guaranteed to
        learn something of value without conducting any of the common data
        preparation operations.

      The only requirement for useful results is that we exclude any
      possible perfect or near-perfect illegitimate predictors.
      - Common examples of illegitimate predictors include repackaged
        versions of the dependent variable, ID variables, and data drawn
        from the future relative to the data to be predicted.

      We start with a quick model using 20 of the 25 available predictors.
      None of these involve any of the text data we will focus on later.
  19. Quick Modeling Round with CART

      We start by building a quick CART model using the original raw
      variables and all 8,000 complete auction records.

      Assuming that you already have SPM launched:
      - Go to the File – Open – Data File menu
      - Note that we have already configured the default working folder
        for SPM
      - Make sure that Files of Type is set to ASCII
      - Highlight the dmc2006.csv dataset
      - Press the [Open] button
  20. Dataset Summary Window

      The resulting window summarizes basic facts about the dataset.

      Note that even though the dataset has 16,000 records, only the top
      8,000 will be used for modeling, as was already pointed out.
  21. The View Data Window

      Press the [View Data…] button to get a quick impression of the
      physical contents of the dataset.

      Our goal is to eventually use the unstructured information contained
      in the text fields right next to the auction ID.
  22. Requesting Basic Descriptive Stats

      We next produce some basic stats for all available variables:
      - Go to the View – Data Info… menu
      - Set the Sort mode to File Order
      - Highlight the Include column
      - Check the Select box
      - Press the [OK] button
  23. Data Information Window

      All basic descriptive statistics for all requested variables are now
      summarized in one place.

      Note that the target variable GMS_GREATER_AVG is not defined for one
      half of the dataset (N Missing 8,000); all those records will be
      automatically discarded during model building.

      Press the [Full] button to see more details.
  24. Setting Up the CART Model

      We are now ready to set up a basic CART run:
      - Make the Classic Output window active
      - Go to the Model – Construct Model… menu (alternatively, you could
        press one of the buttons located on the bar right below the menu
        bar)
      - In the resulting Model Setup window make sure that the Analysis
        Method is set to CART
      - In the Model tab make sure that Sort is set to File Order and the
        Tree Type is set to Classification
      - Check GMS_GREATER_AVG as the Target
      - Check all of the remaining variables except AUCT_ID,
        LISTING_TITLE$, LISTING_SUBTITLE$, GMS, and CATEGORY_AVG_GMS as
        predictors
      - You should see something similar to what is shown on the next slide
  25. Model Setup Window: Model Tab
  26. Model Setup Window: Testing Tab

      Switch to the Testing tab and confirm that 10-fold cross-validation
      is used as the optimal model selection method.
  27. Model Setup Window: Advanced Tab

      Switch to the Advanced tab and set the minimum required number of
      records for the parent nodes and the child nodes to 15 and 5.

      These limits were chosen to avoid extremely small nodes in the
      resulting tree.
  28. Building the CART Model

      Press the [Start] button; a progress window will appear for a while
      and then the Navigator window containing the model results will be
      displayed.

      Press the little button right above the [+][-] pair of buttons, along
      the left border of the Navigator window; note that all trees within
      one standard error (SE) of the optimal tree are now marked in green.

      Use the arrow keys to select the 64-node tree from the tree sequence,
      which is the smallest 1SE tree.
  29. CART Model Observations

      The selected CART model contains 64 terminal nodes and is the
      smallest model with relative error still within one standard error of
      the optimal model (the model with the smallest relative error,
      pointed to by the green bar).
      - This approach to model selection is usually employed for easy
        comprehension.
      - We might also want to require terminal nodes to contain more than
        the 6-record minimum we observe in this out-of-the-box tree.

      All 20 predictor variables play a role in the tree construction,
      - but there is more to observe about this when we look at the
        variable importance details.

      The area under the ROC curve is a respectable 0.748.
  30. CART Model Performance

      Press the [Summary Reports…] button in the Navigator, select the
      Prediction Success tab, and press the [Test] button to display the
      cross-validated test performance of 68.66% classification accuracy.

      Now select the Variable Importance tab to review which variables
      entered into the model.

      Interestingly enough, none of the "added value" paid options are
      important; they exhibit practically no direct influence on the sales
      revenue.

      A detailed look at the nodes might also be instructive for
      understanding the model.
  31. Experimenting with TreeNet

      We almost always follow initial CART models with similar TreeNet
      models. We start with CART because some glaring errors, such as
      perfect predictors, are more quickly found and obviously displayed in
      CART.
      - A perfect predictor often yields a single-split tree (two terminal
        nodes) for classification trees.

      TreeNet models have strengths similar to CART regarding flexibility
      and robustness, and have advantages and disadvantages relative to
      CART:
      - TreeNet is an ensemble of small CART trees that have been linked
        together in special ways. Thus TreeNet shares many desirable
        features of CART.
      - TreeNet is superior to CART in the context of errors in the
        dependent variable (not relevant in this data).
      - TreeNet yields much more complex models but generally offers
        substantially better predictive accuracy. TreeNet may easily
        generate thousands of trees to arrive at an optimal model.
      - TreeNet yields more reliable variable importance rankings.
  32. A Few Words About TreeNet

      TreeNet builds predictive models in stages. It starts with a
      deliberately very small first-round tree (essentially a CART tree).

      Then TreeNet calculates the prediction error made by this simple
      model and builds a second tree to try to model that prediction error.
      The second tree serves as a tool to update, refine, and improve the
      first-stage model.

      A TreeNet model produces a "score" which is a simple sum of all the
      predictions made by each tree in the model.

      Typically the TreeNet score becomes progressively more accurate as
      the number of trees is increased, up to an optimal number of trees.
      Rarely is the optimal number of trees just one! Occasionally, a
      handful of trees is optimal. More typically, hundreds or thousands of
      trees are optimal.

      TreeNet models are very useful for the analysis of data with large
      numbers of predictors, as the models are built up in layers, each of
      which makes use of just a few predictors.

      More detail on TreeNet can be found at http://www.salford-systems.com
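      The stagewise idea above can be sketched in a few lines. This is not
      the TreeNet algorithm itself, only a toy illustration of the
      residual-fitting loop, with a one-split "stump" standing in for each
      small tree and a learn rate shrinking each stage's contribution; all
      names are ours.

      ```python
      # Toy stagewise boosting: each stump fits the current residuals and
      # the final score is the (shrunken) sum of all stages.
      def fit_stump(x, r):
          """Find the single split on x that best fits residuals r."""
          best = None
          for s in sorted(set(x)):
              left = [ri for xi, ri in zip(x, r) if xi <= s]
              right = [ri for xi, ri in zip(x, r) if xi > s]
              if not left or not right:
                  continue
              ml, mr = sum(left) / len(left), sum(right) / len(right)
              err = sum((ri - (ml if xi <= s else mr)) ** 2
                        for xi, ri in zip(x, r))
              if best is None or err < best[0]:
                  best = (err, s, ml, mr)
          _, s, ml, mr = best
          return lambda xi, s=s, ml=ml, mr=mr: ml if xi <= s else mr

      def boost(x, y, n_trees=50, learnrate=0.05):
          score = [0.0] * len(x)
          stages = []
          for _ in range(n_trees):
              resid = [yi - si for yi, si in zip(y, score)]  # current error
              tree = fit_stump(x, resid)                     # model the error
              stages.append(tree)
              score = [si + learnrate * tree(xi)
                       for si, xi in zip(score, x)]          # update score
          return lambda xi: sum(learnrate * t(xi) for t in stages)

      model = boost([1, 2, 3, 4, 5], [1.0, 1.2, 3.1, 3.0, 2.9])
      ```

      With enough stages the summed score approaches the training targets,
      mirroring the "progressively more accurate up to an optimal number of
      trees" behavior described above.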
  33. Setting Up the TN Model

      Switch to the Classic Output window and go to the Model – Construct
      Model… menu.

      Choose TreeNet as the Analysis Method.

      In the Model tab make sure that the Tree Type is set to Logistic
      Binary.
  34. Setting Up TN Parameters

      Switch to the TreeNet tab and do the following:
      - Set the Learnrate to 0.05
      - Set the Number of trees to use: to 800 trees
      - Leave all of the remaining options at their default values
  35. TN Results Window

      Press the [Start] button to initiate the TN modeling run; the TreeNet
      Results window will appear at the end.
  36. Checking TN Performance

      Press the [Summary] button and switch to the Prediction Success tab.

      Press the [Test] button to view the cross-validation results.

      Lower the Threshold: to 0.45 to roughly equalize the classification
      accuracy in both classes (this makes it easier to compare the TN
      performance with the earlier reported CART performance).
  37. The Performance Has Improved!

      The overall classification accuracy goes up to about 71%.

      Press the [ROC] button to see that the area under the ROC curve is
      now a solid 0.800.

      This comes at the cost of added model complexity: 796 trees, each
      with about 6 terminal nodes.

      Variable importance remains similar to CART.
  38. Understanding the TreeNet Model

      TreeNet produces partial dependency plots for every predictor that
      appears in the model; the plots can be viewed by pressing the
      [Display Plots…] button.

      Such plots are generally 2D illustrations of how the predictor in
      question affects an outcome.
      - For example, in the graph below the Y axis represents the
        probability that an iPod will sell at an above category average
        price.

      We see that for a buy-it-now price between 200 and 300 the
      probability of an above average winning bid rises sharply with
      BUY_IT_NOW_PRICE.

      For prices above 300 or below 200 the curve is essentially flat,
      meaning that changes in the predictor do not result in changes in the
      probable outcome.
  39. Understanding the Partial Dependency Plot (PD Plot)

      The PD Plot is not a simple description of the data. If you plotted
      the raw data as, say, the fraction of above average winning bids
      against price intervals, you might see a somewhat different curve.

      The PD Plot is extracted from the TreeNet model: it is generated by
      examining TreeNet predictions (and not input data).

      The PD Plot appears to relate two variables, but in fact other
      variables may well play a role in the graph construction. Essentially
      the PD Plot shows the relationship between a predictor and the target
      variable taking all other predictors into account.

      The important points to understand are that:
      - the graph is extracted from the model and not directly from raw
        data
      - the graph provides an honest estimate of the typical effect of a
        predictor
      - the graph displays not absolute outcomes but typical expected
        changes from some baseline as the predictor varies. The graph can
        be thought of as floating up or down depending on the values of the
        other predictors.
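      The "extracted from model predictions, not raw data" point can be
      made concrete with a short sketch of the standard partial dependence
      computation: fix the predictor of interest at each grid value for
      every record, average the model's predictions, and plot the averages.
      The function and argument names are ours, and `model` stands for any
      fitted prediction function; we are not claiming this is TreeNet's
      internal implementation.

      ```python
      # Generic partial dependence: average the model's predictions while
      # overriding one column with each grid value in turn.
      def partial_dependence(model, rows, col, grid):
          curve = []
          for v in grid:
              preds = []
              for row in rows:
                  modified = list(row)
                  modified[col] = v          # force the predictor to v
                  preds.append(model(modified))
              curve.append(sum(preds) / len(preds))
          return curve
      ```

      Because every record (with all its other predictor values) is scored
      at every grid point, the other predictors influence the curve's
      level, which is exactly why the plot "floats" as described above.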
  40. More TN Partial Dependency Plots
  41. Introducing the Text Mining Dimension

      To this point, we have been working only with the set of traditional
      structured data fields: continuous and categorical variables.

      Further substantial performance improvement can be achieved only if
      we utilize the text descriptions supplied by the seller in the
      following fields:

      Variable             Description
      LISTING_TITLE        title of the auction
      LISTING_SUBTITLE     subtitle of the auction

      Unfortunately, these two variables cannot be used "as is". Sellers
      were free to enter free-form text including misspellings, acronyms,
      slang, etc. So we must address the challenge of converting
      unstructured text strings of the type shown here into a well
      structured representation.
  42. The Bag of Words Approach to Text Mining

      The most straightforward strategy for dealing with free-form text is
      to represent each "word" that appears in the complete data set as a
      dummy (0/1) indicator variable.

      For iPods on eBay we could imagine sellers wanting to use words like
      "new", "slightly scratched", "pink", etc. to describe their iPod. Of
      course the descriptions may well be complete phrases like
      "autographed by Angela Merkel" rather than just single-term
      adjectives.

      Nevertheless, in the simplest Bag of Words (BOW) approach we just
      create dummy indicators for every word.

      Even though the headlines and descriptions are space limited, the
      number of distinct words that can appear in collections of free text
      can be huge. In text mining applications involving complete
      documents, e.g. newspaper articles, the number of distinct words can
      easily reach several hundred thousand or even millions.
  43. The End Goal of the Bag of Words

      Record_ID   RED   USED   SCRATCHED   CASE
      1001         0     1        0         1
      1002         0     0        0         0
      1003         1     0        0         0
      1004         0     0        0         0
      1005         1     1        1         0
      1006         0     0        0         0

      - Above we see an example of a database intended to describe each
        auction item by indicating which words appeared in the auction
        announcement.
      - Observe that Record_ID 1005 contains the three words "RED", "USED"
        and "SCRATCHED".
      - Data in the above format looks just like the kind of numeric data
        used in traditional data mining and statistical modeling.
      - We can use data in this form, as is, feeding it into CART, TreeNet,
        or regression tools such as Generalized Path Seeker (GPS) or
        everyday regression.
      - Observe that we have transformed the unstructured text into
        structured numerical data.
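      A table like the one above can be produced with a few lines of code.
      A minimal sketch (whitespace tokenization only, no stemming or stop
      word handling; the function name is ours):

      ```python
      # Build a 0/1 bag-of-words matrix: one column per distinct word, one
      # row per document, 1 if the word appears in the document.
      def bag_of_words(docs):
          vocab = sorted({w for d in docs for w in d.lower().split()})
          rows = []
          for d in docs:
              words = set(d.lower().split())
              rows.append([1 if w in words else 0 for w in vocab])
          return vocab, rows

      vocab, rows = bag_of_words(["red used case", "red scratched"])
      # vocab: ['case', 'red', 'scratched', 'used']
      # rows:  [[1, 1, 0, 1], [0, 1, 1, 0]]
      ```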
  44. Coding the Term Vector and TF Weighting

      In the sample data matrix on the previous slide we coded all of our
      indicators as 0 or 1 to indicate the presence or absence of a term.

      An alternative coding scheme is based on the FREQUENCY COUNT of the
      terms, with these variations:
      - 0 or 1 coding for presence/absence
      - Actual term count (0, 1, 2, 3, …)
      - Three-level indicator for absent, one occurrence, and more than
        one (0, 1, 2)

      The text mining literature has established some useful weighted
      coding schemes. We start with term frequency weighting (TF):
      - Text mining can involve blocks of text of considerably different
        lengths.
      - It is thus desirable to normalize counts based on relative
        frequency. Two text fields might each contain the term "RED"
        twice, but one of the fields contains 10 words while the other
        contains 40 words. We might want our coding to reflect the fact
        that 2/10 is more frequent than 2/40.
      - This is nothing more than making counts relative to the total
        length of the unit of text (or document), and such coding yields
        the term frequency weighting.
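      The coding variations listed above differ only in how a term's count
      within a document is mapped to a number. A sketch, with scheme names
      of our own choosing:

      ```python
      # One term, one document: map the raw count to the chosen coding.
      def code_term(term, doc_words, scheme):
          n = doc_words.count(term)
          if scheme == "presence":       # 0/1
              return 1 if n else 0
          if scheme == "count":          # 0, 1, 2, 3, ...
              return n
          if scheme == "three_level":    # absent / once / more than once
              return 0 if n == 0 else (1 if n == 1 else 2)
          if scheme == "tf":             # count relative to document length
              return n / len(doc_words)
          raise ValueError(scheme)

      words = "red ipod red mini".split()
      print(code_term("red", words, "count"))  # -> 2
      print(code_term("red", words, "tf"))     # -> 0.5 (2 of 4 words)
      ```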
  45. Inverse Document Frequency (IDF) Weighting

      IDF weighting is drawn from the information retrieval literature and
      is intended to reflect the value of a term in narrowing the search
      for a specific document within a larger corpus of documents.

      If a given term occurs very rarely in a collection of documents, then
      that term is very valuable as a tag to target those documents
      accurately. By contrast, if a term is very common, then knowing that
      such a term occurs within the document you are looking for is not
      helpful in narrowing the search.

      While text mining has somewhat different goals than information
      retrieval, the concept of IDF weighting has caught on. IDF weighting
      serves to upweight terms that occur relatively rarely:

      IDF(term) = log { (Number of documents) /
                        (Number of documents containing term) }

      The IDF increases with the rarity of a term and is at its maximum for
      words that occur in only one document.

      A common coding of the term vector uses the product: tf * idf
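      The IDF formula and the tf * idf product translate directly to code.
      A sketch (function names ours; documents represented as word lists):

      ```python
      import math

      # IDF(term) = log(N / number of documents containing the term)
      def idf(term, docs):
          containing = sum(1 for d in docs if term in d)
          return math.log(len(docs) / containing)

      # tf * idf: relative frequency in the document times corpus rarity
      def tf_idf(term, doc, docs):
          tf = doc.count(term) / len(doc)
          return tf * idf(term, docs)

      docs = [["red", "ipod"], ["ipod", "nano"], ["ipod", "mini", "red", "red"]]
      print(idf("ipod", docs))   # appears in every document -> 0.0
      print(idf("nano", docs))   # appears in only 1 of 3 -> log(3)
      ```

      Note that a term appearing in every document gets IDF 0, so its
      tf * idf weight vanishes: exactly the "common terms are unhelpful"
      intuition described above.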
  46. Coding the DMC2006 Text Data

      The DMC2006 text data is unusual principally because of the limit on
      the amount of text a seller was allowed to upload. This has the
      effect of making the lengths of all the documents very similar. It
      also sharply limits the possibility that a term in a document would
      occur with a high frequency.

      These factors contribute to making the TF-IDF weighting irrelevant to
      this challenge. In fact, for this prediction task other coding
      schemes allow more accurate prediction.

      STM offers these options for term vector coding:
      - 0 – no/yes
      - 1 – no/yes/many – this one will be used in the remainder of this
        tutorial
      - 2 – 0/1
      - 3 – 0/1/2
      - 4 – term frequency (relative to the document)
      - 5 – inverse document frequency (relative to the corpus)
      - 6 – TF-IDF (traditional IR coding)
  47. Text Mining Data Preparation

      The heavy lifting in text mining technology is devoted to moving us
      from raw unstructured text to structured numerical data. Once we have
      structured data we are free to use any of a large number of
      traditional data mining and statistical tools to move forward.
      Typical analytical tools include logistic and multiple regression,
      predictive modeling, and clustering tools.

      But before diving into the analysis stage we need to move through the
      text transformation stage in detail.

      The first step is to extract and identify the words or "terms", which
      can be thought of as creating the list of all words recognized in the
      training data set. This stage is essentially one of defining the
      "dictionary", the list of officially recognized terms. Any new term
      encountered in the future will be unrecognizable by the dictionary
      and will represent an unknown item.

      It is therefore very important to ensure that the training data set
      contains almost all terms of interest that would be relevant for
      future prediction.
  48. Automatic Dictionary Building

      The following steps will build an active dictionary for a collection
      of documents (in our case, auction item description strings):
      - Read all text values into one character string
      - Tokenize this string into an array of words (tokens)
      - Remove words without any letters or digits
      - Remove "stop words" (words like "the", "a", "in", "und", "mit",
        etc.) for both the English and German languages
      - Remove words that have fewer than 2 letters and are encountered
        fewer than 10 times across the entire collection of documents
        (rare small words)
      - At this point the too-common, too-rare, weird, obscure, and useless
        combinations of characters should have been eliminated
      - Lemmatize words using the WordNet lexical database
        - This step combines words present in different grammatical forms
          ("go", "went", "going", etc.) into the corresponding stem word
          ("go")
      - Remove all resulting words that appear fewer than MIN times (5 in
        the remainder of this tutorial)
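      The pipeline above can be sketched in a few lines. STM itself relies
      on NLTK's tokenizer and WordNet lemmatizer; in this dependency-free
      illustration, tokenization is a simple regex, the stop word list is a
      tiny sample, and the "lemmatizer" is a stand-in lookup table, so all
      of those details are assumptions for illustration only.

      ```python
      import re
      from collections import Counter

      STOP_WORDS = {"the", "a", "in", "und", "mit"}     # tiny sample list
      LEMMAS = {"went": "go", "going": "go"}            # stand-in lemmatizer
      MIN_COUNT = 5                                     # the MIN of the slide

      def build_dictionary(documents, min_count=MIN_COUNT):
          tokens = []
          for text in documents:
              for tok in re.findall(r"\w+", text.lower()):  # tokenize
                  if not re.search(r"[a-z0-9]", tok):       # need letter/digit
                      continue
                  if tok in STOP_WORDS:                     # drop stop words
                      continue
                  tokens.append(LEMMAS.get(tok, tok))       # collapse forms
          counts = Counter(tokens)
          # keep only terms seen at least min_count times
          return sorted(t for t, c in counts.items() if c >= min_count)
      ```

      For example, with the toy lemma table above, "going" and "went"
      collapse into "go" before the frequency filter is applied, so rare
      surface forms still contribute to a common stem's count.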
  49. 49. Build the Dictionary (or Term Vector)
For the purposes of automatic dictionary building and data preprocessing we developed the Salford Text Mining (STM) software, a stand-alone collection of tools that performs all the essential steps in preparing text documents for text mining.
STM builds on the Python "Natural Language Toolkit" (NLTK).
From NLTK we use the following tools:
- Tokenizer (extracts items most likely to be "words")
- Porter Stemmer (recognizes different simple forms of the same word, e.g. plurals)
- WordNet lemmatizer (more complex recognition of variations of the same word)
- Stop word list (words that contribute little to no value, such as "the", "a")
Future versions of STM might use other tools to accomplish these essential tasks.
"stm.exe" is a command line utility that must be run from a Command Prompt window (assuming you are running Windows, go to the Start – All Programs – Accessories – Command Prompt menu).
The version provided here resides in the stmtutor\STM\bin folder.
Salford Systems © Copyright 2011 49
  50. 50. STM Commands and Options
Open a Command Prompt window in Windows, then CD to the "stmtutor\STM" folder location; for example, on our system you would type in:
cd c:\stmtutor\STM
To obtain help type the following at the prompt:
bin\stm --help
This command will return very concise information about STM:
stm [-h] [-data DATAFILE] [-dict DICTFILE] [-source-dict SRCDICTFILE] [-score SCOREFILE] [-spm SPMAPP] [-t TARGET] [-ex EXCLUDE] etc.
The details for each command line option are contained in the software manual appearing in the appendix.
You will also notice the "stm.cfg" configuration file – this file controls the default behavior of the STM module and relieves you of specifying a large number of configuration options each time "stm.exe" is launched.
- Note the "TEXT_VARIABLES: ITEM_LEAF_CATEGORY_NAME, LISTING_TITLE, LISTING_SUBTITLE" line, which specifies the names of the text variables to be processed
50
  51. 51. Create Dictionary Options
For the purposes of this tutorial, we have prepackaged all of the text processing steps into individual command files (extension *.bat). You can either double-click on the referenced command file or alternatively type its contents into the Command Prompt window opened in the directory that contains the files.
The most important arguments for our purposes in this tutorial are:
- --dataset DATAFILE: name and location of your input CSV format data set
- --dictionary DICTFILE: name and location of the dictionary to be created
These two arguments are all you need to create your dictionary. By default, STM will process every text field in your input data set to create a single omnibus dictionary.
Simply double-click on "stm_create_dictionary.bat" to create the dictionary file for the DMC 2006 dataset, which will be saved in the "dmc2006_ynm.dict" file in the "stmtutor\STM\dmc2006" folder.
In typical text mining practice the process of generating the final dictionary will be iterative. A review of the first dictionary might reveal further words you wish to exclude ("stop" words).
Salford Systems © Copyright 2011 51
  52. 52. Internal Dictionary Format
The dictionary file is a simple text file with extension *.dict.
The file contents can be viewed and edited in a standard text editor.
The name of the text mining variable that will be created later on appears on the left of the "=" sign on each un-indented line.
The default value that will be assigned to this variable appears on the right side of the "=" sign of the un-indented lines and usually means the absence of the word(s) of interest.
Each indented line represents the value (left of the "=") which will be entered for a single occurrence in a document of any of the word(s) appearing on the right of the "=".
- More than one occurrence will be recorded as "many" when requested (always the case in this tutorial)
Salford Systems © Copyright 2011 52
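The layout just described (un-indented "variable=default" lines followed by indented value lines) is simple enough to parse in a few lines. The reader below is an illustrative sketch of that format, not STM's actual parser:

```python
def parse_dict(lines):
    """Parse the *.dict layout into {variable: (default, {value: [words]})}.
    Un-indented lines start a new variable; indented lines list its values."""
    entries = {}
    current = None
    for line in lines:
        if not line.strip():
            continue
        if not line[0].isspace():                  # un-indented: new variable
            name, default = line.strip().split("=", 1)
            current = entries[name] = (default, {})
        else:                                      # indented: one value line
            body = line.strip()
            if "=" in body:                        # value=word1,word2 form
                value, words = body.split("=", 1)
                current[1][value] = words.split(",")
            else:                                  # bare word is its own value
                current[1][body] = [body]
    return entries
```

Running it on the hand_model/hand_unused entries shown on the next slide yields "standard" as the default for hand_model and maps the "yes" value of hand_unused to its two German word variants.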
  53. 53. Hand Made Dictionary
To use a multi-level coding you need to create a "hand made dictionary", which is already supplied to you as "hand.dict" in the "stmtutor\STM\dmc2006" folder.
Here is an example of an entry in this file:
hand_model=standard
  mini
  nano
  standard
The un-indented line of an entry starts with the name we wish to give to the term (HAND_MODEL) and also indicates that a BLANK or missing value is to be coded with the default value of "standard".
The remaining indented entries are listed one per line and are an exhaustive list of the acceptable values which the term HAND_MODEL can receive in the term vector.
Another coding option is, for example:
hand_unused=no
  yes=unbenutzt,ungeoffnet
which sets "no" as the default value but substitutes "yes" if one of the two values listed above is encountered.
You may study additional examples in our stmtutor\STM\dmc2006\hand.dict file on your own; all of them were created manually based on common sense logic.
53
  54. 54. Why Create Hand Made Dictionary Entries
Let's revisit the variable HAND_MODEL which brings together the terms
- standard, mini, nano
Without a hand made dictionary entry we would have three terms created, one for each model type, with "yes" and "no" values, and possibly "many".
By creating the hand made entry we:
- Ensure that every auction is assigned a model (default="standard")
- Bring all three models together into one categorical variable with three possible values: "standard", "mini", and "nano"
This representation of the information is helpful when using tree-based learning machines but not helpful for regression-based learning machines.
- The best choice of representation may vary from project to project
- Salford regression-based learning machines automatically repackage categorical predictors into 0/1 indicators, meaning that you can work with one representation
- But if you need to use other tools you may not have this flexibility
Salford Systems © Copyright 2011 54
  55. 55. Further Dictionary Customization
The following table summarizes some of the important fields introduced in the custom dictionary for this tutorial:

Variable         Values           Combines word variants
CAPACITY         20               20gb, 20 gb, 20 gigabyte
                 30               30gb, 30 gb, 30 gigabyte
                 40               40gb, 40 gb, 40 gigabyte
                 80               80gb, 80 gb, 80 gigabyte
                 …                …
STATUS           wieneu           wie neu, super gepflegt, top gepflegt, top zustand, neuwertig
                 neu              neu, new, brandneu, brandneues
                 unbenutzt        unbenu
                 defekt           defekt., --defekt--, defekt, -defekt-, -defekt, defekter, defektes
MODEL            mini, nano,      Captures presence of the corresponding word in the auction
                 standard         description
COLOR            black, white,    Captures presence of the corresponding words or variants in
                 green, etc.      the auction description
IPOD_GENERATION  first,           Identifies the iPod generation from the information available
                 second, etc.     in the text description
Salford Systems © Copyright 2011 55
  56. 56. Final Stage Dictionary Extraction
To generate a final version of the dictionary in most real world applications you would also need to prepare an expanded list of stopwords.
The NLTK provides a ready-made list of stopwords for English and another 14 major languages spanning Europe, Russia, Turkey, and Scandinavia.
- These appear in the directory named stmtutor\STM\data\corpora\stopwords and should be left as they are
Additional stopwords, which might well vary from project to project, can be entered into the file named "stopwords.dat" in the "stmtutor\STM\data" folder.
- In the package distributed with this tutorial the "stopwords.dat" file is empty
- You can freely add words to this file, one stopword per line
Once the custom "stopwords.dat" and "hand.dict" files have been prepared you just run the dictionary extraction again but with the "--source-dictionary" argument added (see the command files introduced in the later slides).
The resulting dictionary will now include all the introduced customizations.
Salford Systems © Copyright 2011 56
  57. 57. Creating Structured Text Mining Variables
The resulting dictionary file "dmc2006_ynm.dict" contains about 600 individual stems.
In the final step of text processing the data dictionary is applied to each document entry.
Each stem from the dictionary is represented by a categorical variable (usually binary) with the corresponding name.
The preparation process checks whether any of the known word variants associated with each stem from the dictionary are present in the current auction description; if so, the corresponding value is set to "yes", otherwise it is set to "no".
- When the "--code YNM" option is set, multiple instances of "yes" will be coded as "many"
- You can also request integer codes 0, 1, 2 in place of the character codes "yes/no/many"
- We have experimented with alternative variants of coding (see the "--code" help entry in the STM manual) and came to the conclusion that the "YNM" approach works best in this tutorial
- Feel free to experiment with alternative coding schemes on your own
The resulting large collection of variables will be used as additional predictors in our modeling efforts.
Even though other, more computationally intense text processing methods exist, further investigation failed to demonstrate their utility on the current data, which is most likely related to the extremely terse nature of the auction descriptions.
Salford Systems © Copyright 2011 57
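The "yes/no/many" coding described above can be sketched as a small function. This is a simplified illustration of the idea (a naive substring count standing in for STM's variant matching), not STM's actual implementation:

```python
def code_ynm(document, variants):
    """Code one dictionary stem for one document as yes/no/many,
    given the list of word variants mapped to that stem."""
    # naive matching: count substring occurrences of each known variant
    hits = sum(document.lower().count(v) for v in variants)
    if hits == 0:
        return "no"
    return "yes" if hits == 1 else "many"
```

So an auction description mentioning "nano" once codes the MODEL-related stem as "yes", while repeated mentions of "defekt" code that stem as "many", exactly the behavior the "--code YNM" option requests.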
  58. 58. Creating Additional Variables
Finally, we spent additional effort on reorganizing the original raw variables into more useful measures:
- MONTH_OF_START – based on the recorded start date of the auction
- MONTH_OF_SALE – based on the recorded closing date of the auction
- HIGH_BUY_IT_NOW – set to "yes" if BUY_IT_NOW_PRICE exceeds the CATEGORY_AVG_GMS, as suggested by common sense and the nature of the classification problem
- In the original raw data, BUY_IT_NOW_PRICE was set to 0 on all items where that option was not available – we reset all such 0s to missing
All of these operations are encoded in the "preprocess.py" Python file located in the "stmtutor\STM\dmc2006" folder.
- This component of the STM is under active development
- The file is automatically called by the main STM utility
- You may add to or modify the contents of this file to allow alternative transformations of the original predictors
Salford Systems © Copyright 2011 58
  59. 59. Generation of the Analysis Data Set
At this point we are ready to move on to the next step, which is data creation.
This is nothing more than appending the relevant columns of data to the original data set. Remember that the dictionary may contain tens of thousands if not hundreds of thousands of terms.
For the DMC2006 dataset the dictionary is quite small by text mining standards, containing just a little over 600 words.
To generate the processed dataset simply double-click on the stm_ynm.bat command file or explicitly type in its contents in the Command Prompt.
- The "--dataset" option specifies the input dataset to be processed
- The "--code YNM" option requests the "yes/no/many" style of coding
- The "--source-dictionary" option specifies the hand dictionary
- The "--process" option specifies the output dataset
- Of course you may add other options as you prefer
This creates a processed dataset with the name dmc2006_res_ynm.csv which resides in the stmtutor\STM\dmc2006 folder.
Salford Systems © Copyright 2011 59
  60. 60. Analysis Data Set Observations
At this point we have a new modeling dataset with the text information represented by the extra variables.
- Note that the raw input data set is just shy of 3 MB in size in plain text format while the prepared analysis data set is about 40 MB in size, 13 times larger
Process only training data or all data?
- For prediction purposes all data needs to be processed, both the data that will be used to train the predictive models and the holdout or future data that will receive predictions later
- In the DMC2006 data we happen to have access to both training and holdout data and thus have the option of processing all the text data at the same time
- Generating the term vector based only on the training data would generally be the norm because future data flows have not yet arrived
- In this project we elected to process all the data together for convenience, knowing that the train and holdout partitions were created by random division of the data
- It is worth pointing out, though, that the final dictionary generated from training data only might be slightly different due to the infrequent word elimination component of the text processor
Salford Systems © Copyright 2011 60
  61. 61. Quick Modeling Round with CART
We are now ready to proceed with another CART run, this time using all of the newly created text fields as additional predictors.
Assuming that you already have SPM launched:
- Go to the File – Open – Data File menu
- Make sure that the Files of Type is set to ASCII
- Highlight the dmc2006_res_ynm.csv dataset
- Press the [Open] button
Salford Systems © Copyright 2011 61
  62. 62. Dataset Summary Window
Again, the resulting window summarizes basic facts about the dataset.
Note the dramatic increase in the number of available variables.
Salford Systems © Copyright 2011 62
  63. 63. The View Data Window
Press the [View Data…] button to have a quick look at the physical contents of the dataset.
Note how the individual dictionary word entries are now coded with the "yes", "no", or "many" values for each document row.
Salford Systems © Copyright 2011 63
  64. 64. Setting Up CART Model
Proceed with setting up a CART modeling run as before:
- Make the Classic Output window active
- Go to the Model – Construct Model… menu (alternatively, you could use one of the buttons located on the bar right below the menu)
- In the resulting Model Setup window make sure that the Analysis Method is set to CART
- In the Model tab make sure that the Sort is set to File Order and the Tree Type is set to Classification
- Check GMS_GREATER_AVG as the Target
- Check all of the remaining variables except AUCT_ID, LISTING_TITLE$, LISTING_SUBTITLE$, GMS, and CATEGORY_AVG_GMS as predictors
- You should see something similar to what is shown on the next slide
Salford Systems © Copyright 2011 64
  65. 65. Model Setup Window: Model Tab
Salford Systems © Copyright 2011 65
  66. 66. Model Setup Window: Testing Tab
Switch to the Testing tab and confirm that 10-fold cross-validation is used as the optimal model selection method.
Salford Systems © Copyright 2011 66
  67. 67. Model Setup Window: Advanced Tab
Switch to the Advanced tab and set the minimum required number of records for the parent nodes and the child nodes to 15 and 5, respectively.
These limits were chosen to avoid extremely small nodes in the resulting tree.
Salford Systems © Copyright 2011 67
  68. 68. Building CART Model
Press the [Start] button; a building progress window will appear for a while and then the Navigator window containing model results will be displayed (this time, the process takes a few minutes!).
Press the little button right above the [+][-] pair of buttons, along the left border of the Navigator window; note that all trees within one standard error (SE) of the optimal tree are now marked in green.
Use the arrow keys to select the 102-node tree from the tree sequence, which is the smallest 1SE tree.
Salford Systems © Copyright 2011 68
  69. 69. CART Model Performance
The selected CART model contains 102 terminal nodes where nearly all available predictor variables play a role in the tree construction.
The area under the ROC curve (Test) is now an impressive 0.830, especially when compared to the 0.748 reported earlier for the basic CART run or the 0.800 for the basic TN run.
Press the [Summary Reports] button in the Navigator window, select the Prediction Success tab, and finally press the [Test] button to see cross-validated test performance at 76.58% classification accuracy – a significant improvement!
Also note the presence of the original and derived variables on the list shown in the Variable Importance tab.
Salford Systems © Copyright 2011 69
  70. 70. Setting Up TN Model
Now switch to the Classic Output window and go to the Model – Construct Model… menu.
Choose TreeNet as the Analysis Method.
In the Model tab make sure that the Tree Type is set to Logistic Binary.
Salford Systems © Copyright 2011 70
  71. 71. Setting Up TN Parameters
Switch to the TreeNet tab and do the following:
- Set the Learnrate: to 0.05
- Set the Number of trees to use: to 800
- Leave all of the remaining options at their default values
Salford Systems © Copyright 2011 71
  72. 72. TN Results Window
Press the [Start] button to initiate the TN modeling run; the TreeNet Results window will appear in the end, though you might want to take a coffee break while the modeling run completes.
Salford Systems © Copyright 2011 72
  73. 73. Checking TN Performance
Press the [Summary] button and switch to the Prediction Success tab.
Press the [Test] button to view cross-validation results.
Lower the Threshold: to 0.47 to roughly equalize classification accuracy in both classes (this makes it easier to compare the TN performance with the earlier reported CART and TN model performance).
You can clearly see the improvement!
Salford Systems © Copyright 2011 73
  74. 74. Requesting TN Graphs
Here we present a sample collection of all 2-D contribution plots produced by TN for the resulting model.
The plots are available by pressing the [Display Plots…] button in the TreeNet Results window.
The list is arranged according to the variable importance table.
74
  75. 75. More Graphs
Salford Systems © Copyright 2011 75
  76. 76. Insights Suggested by the Model
Here is a list of insights we arrived at by looking into the selection of plots:
- There is a distinct effect of the iPod category once all the other factors have been accounted for
- A larger start price means an above-average sale (most likely relating to the quality of an item)
- A "new" and "unpacked" item should fetch a better price, while any "defect" brings the price down
- End of the year means better sales
- Having a good feedback score is important
- It is best to wait 10 days or more before closing the deal
- Interestingly, 1st and 3rd generations of iPod show poorer sales than the 2nd and 4th
- 2G started to fall out of favor in 2005-2006
- Black is much more popular in Germany than other colors
- Mentioning "photo", "video", "color display", etc. helps get a better price
- The paid advertising features are of little or marginal importance
Salford Systems © Copyright 2011 76
  77. 77. Final Validation of Models
At this point we are ready to check the performance of all our models using the remaining 8,000 auctions originally not available for training.
This way each model can be positioned with respect to all of the official 173 entries originally submitted to the DMC 2006 competition.
However, in order to proceed with the evaluation, we must first score the input data using all of the models we have generated up until now.
The following slides explain how to score the most recently constructed CART and TN models; the earlier models can be scored using similar steps.
You may choose to skip the scoring steps as we have already included the results of scoring in the "stmtutor\STM\scored" folder:
- Score_cart_raw.csv – simple CART model predictions
- Score_tn_raw.csv – simple TN model predictions
- Scored_cart_txt.csv – text mining enhanced CART model predictions
- Scored_tn_txt.csv – text mining enhanced TN model predictions
Salford Systems © Copyright 2011 77
  78. 78. Scoring a CART Model
Select the Navigator window for the model you wish to score.
Select the tree from the tree sequence (in our runs we pick the 1SE trees as more robust).
Press the [Score] button to open the "Score Data" window.
Make sure that the "Data file" is set to "dmc2006_res_ynm.csv"; if not, press the [Select…] button on the right and select the dataset to be scored.
Place a checkmark in the "Save results to a file" box, then press the [Select] button right next to it; this will open the "Save As" window.
Navigate to the "stmtutor\STM\scored" folder under the "Save in:" selection box, enter "Scored_cart_txt.csv" in the "File name:" text entry box, and press the [Save] button.
You should now see something similar to what's shown on the next slide.
Press the [OK] button to initiate the scoring process.
You should now have the Scored_cart_txt.csv file in the stmtutor\STM\scored folder.
Salford Systems © Copyright 2011 78
  79. 79. Scoring CART
Salford Systems © Copyright 2011 79
  80. 80. Scoring a TN Model
Select the "TreeNet Results" window for the model you wish to score.
Go to the "Model – Score Data…" menu to open the "Score Data" window.
Make sure that the "Data file" is set to "dmc2006_res_ynm.csv"; if not, press the [Select…] button on the right and select the dataset to be scored.
Place a checkmark in the "Save results to a file" box, then press the [Select] button right next to it; this will open the "Save As" window.
Navigate to the "stmtutor\STM\scored" folder under the "Save in:" selection box, enter "Scored_tn_txt.csv" in the "File name:" text entry box, and press the [Save] button.
You should now see something similar to what's shown on the next slide.
Press the [OK] button to initiate the scoring process.
You should now have the Scored_tn_txt.csv file in the stmtutor\STM\scored folder.
Salford Systems © Copyright 2011 80
  81. 81. Scoring TN
Salford Systems © Copyright 2011 81
  82. 82. Using STM to Validate Performance
We can now use the STM machinery to do final model validation.
Simply double-click the "stm_validate.bat" command file to proceed.
Note the use of the following options inside the command file:
- "-score" – specifies the output dataset where the model predictions will be written
- "--score-column" – specifies the name of the variable containing the actual model predictions (these variables are produced by CART or TN during the scoring process)
- "--check" – specifies the name of the dataset that contains the originally withheld values of the target
  - This dataset was used by the organizers of the DMC 2006 competition to select the actual winners
  - STM is currently configured to validate only the bottom 8,000 of the 16,000 predictions generated by the model; the top 8,000 records (used for learning) are simply ignored
The results will be saved into text files with the extension "*.result" appended to the original score file names in the "stmtutor\STM\scored" folder.
Salford Systems © Copyright 2011 82
  83. 83. Validation Results Format
The following window shows the validation results of the final TN model we built:
- 8,000 validation records were scored, of which:
- 719 ones were misclassified as zeroes
- 807 zeroes were misclassified as ones
- Thus 1,526 documents were misclassified
- This gives the final score of 8,000 – (1,526 × 2) = 4,948
Salford Systems © Copyright 2011 83
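The scoring rule above is simple enough to verify directly. A one-line helper (illustrative, not part of STM) reproduces both this slide's score and the table on the next slide:

```python
def dmc_score(n_records, missed_zeroes, missed_ones):
    """DMC 2006 scoring rule: number of validation records
    minus 2 points for every misclassified auction."""
    return n_records - 2 * (missed_zeroes + missed_ones)

# the final TN model from this slide: 807 + 719 = 1,526 errors
print(dmc_score(8000, 807, 719))  # → 4948
```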
  84. 84. Final Validation of Models
Based on the predicted class assignments, the final performance score is calculated as 8,000 minus twice the total number of auction items misclassified.
The following table summarizes how these virtually out-of-the-box elementary models perform on the holdout data (the values are extracted from the four *.result files produced by the STM validator):

Model           ROC Area  Missed 0s  Missed 1s  Score
CART raw data   75%       1123       1387       2980
TN raw data     80%       1308       926        3532
CART text data  83%       981        848        4342
TN text data    89%       807        719        4948
Salford Systems © Copyright 2011 84
  85. 85. Visual Validation of the Results
The following graph summarizes the positioning of the four basic models with respect to the 173 official competition entries.
The TN model with text mining processing is among the top 10 winners!
(Chart markers: TN text, CART text, TN raw, CART raw)
Salford Systems © Copyright 2011 85
  86. 86. Observations on the Results
We used the most basic form of text mining, the Bag of Words, with minor emendations.
- None of the authors speaks German, although we did look up some of the words in an on-line dictionary. If there are any subtleties to be picked up from seller wording choices we would have missed them.
We chose the coding scheme that performed best on the training data. We have six coding options and one stands out as clearly best.
We used common settings for the controls for CART and TreeNet.
We did not use any of the modeling refinement techniques we teach in our CART and TreeNet tutorials.
We thus invite you to see if you can tweak the performance of these models even higher.
Salford Systems © Copyright 2011 86
  87. 87. Command Line Automation in SPM
SPM has a powerful command line processing component which allows you to completely reproduce any modeling activity by creating and later submitting a command file.
We have packaged the command files for the four modeling and scoring runs you have conducted in the course of this tutorial.
- SPM command files must have the extension *.cmd
- The four command files are stored in the "stmtutor\STM\dmc2006" folder
You can create, open, or edit a command file using a simple text editor, like Notepad, etc.
SPM has a built-in editor; just go to the File – New Notepad… menu.
You may also access the command line directly from inside the SPM GUI; just make sure that the File – Command Prompt menu item is checked.
Just type in "help" in the Command Prompt part (starts with the ">" mark) of the Classic Output window to get a listing of all available commands.
Then you can request more detailed help for any specific command of interest; for example, "help battery" will produce a long list of the various batteries of automated runs available in SPM.
Furthermore, you may view all of the commands issued during the current session by going to the View – Open Command Log… menu; this way you can quickly learn which commands correspond to the recent GUI activity you were involved with.
Salford Systems © Copyright 2011 87
  88. 88. Basic CART Model Command File
You may now restart SPM to emulate a fresh new run.
Go to the File – Open – Command File… menu.
Select the "cart_raw.cmd" command file and press the [Open] button.
The file is now opened in the built-in Notepad window.
Salford Systems © Copyright 2011 88
  89. 89. CART Command File Contents
- OUT – saves the classic output into a text file
- USE – points to the modeling dataset
- GROVE – saves the model as a binary grove file
- MODEL – specifies the target variable
- CATEGORY – indicates which variables are categorical, including the target
- KEEP – specifies the list of predictors
- LIMIT – sets the node limits
- ERROR – requests cross-validation
- BUILD – builds a CART model
- SAVE – names the file where the CART model predictions will be saved
- HARVEST – specifies which tree is to be used in scoring
- IDVAR – requests saving of the additional variables into the output dataset
- SCORE – scores the CART model
- OUTPUT * – closes the current text output file
Note the use of relative paths in the GROVE and SAVE commands.
Also note the use of the forward slash "/" to separate folder names.
Salford Systems © Copyright 2011 89
  90. 90. Submitting Command File
With the Notepad window active, go to the File – Submit Window menu to submit the command file to SPM.
In the end you will see the Navigator and the Score windows opened, which should be identical to the ones you have already seen in the beginning of this tutorial.
Furthermore, you should now have:
- A "cart_raw.dat" text file created in the "stmtutor\STM\dmc2006" folder; the file contains the classic output you normally see in the "Classic Output" window
- A "cart_raw.grv" binary grove file created in the "stmtutor\STM\models" folder; the file contains the CART model itself, it can be opened in the GUI using the File – Open – Open Grove… menu which reopens the Navigator window, and it will also be needed for future scoring or translation
- A "Score_cart_raw.csv" data file created in the "stmtutor\STM\scored" folder; the file contains the selected CART model predictions on your data
You may now proceed with opening up the "tn_raw.cmd" file using the File – Open – Command File… menu.
Salford Systems © Copyright 2011 90
  91. 91. TN Command File Contents
- OUT, USE, GROVE, MODEL, CATEGORY, KEEP, ERROR, SAVE, IDVAR, SCORE, OUTPUT – same as in the CART command file introduced earlier
- MART TREES – sets the TN model size in trees
- MART NODES – sets the tree size in terminal nodes
- MART MINCHILD – sets the minimum individual node size in records
- MART OPTIMAL – sets the evaluation criterion that will be used for optimal model selection
- MART BINARY – requests logistic regression processing in our case
- MART LEARNRATE – sets the learnrate parameter
- MART SUBSAMPLE – sets the sampling rate
- MART INFLUENCE – sets the influence trimming value
- The rest of the MART commands request automatic saving of the 2-D and 3-D plots into the grove; type in "help mart" to get full descriptions
Salford Systems © Copyright 2011 91
  92. 92. Submitting the Rest of the Command Files
Again, with the current Notepad window active, use the File – Submit Window menu to launch the basic TN modeling run, automatically followed by scoring.
This will create the output, grove, and scored data files in the corresponding locations for the chosen TN model; also note the use of the EXCLUDE command in place of the KEEP command inside the command file – this saves a lot of typing.
Now go back to the Classic Output window and notice that the File menu has changed.
Go to the File – Submit Command File… menu, select the "cart_txt.cmd" command file, and press the [Open] button.
Notice the modeling activity in the Classic Output window, but no Results window is produced – this is how the Submit Command File… menu item differs from the Submit Window menu item used previously; nonetheless, the output, grove, and score files are still created in the specified locations.
Use the File – Open – Open Grove… menu to open the "tn_raw.grv" file located in the "stmtutor\STM\models" folder; you will need to navigate into this folder using the Look in: selection box in the Open Grove File window.
You may now proceed with the final TN run by submitting the "tn_txt.cmd" command file using either the File – Open – Command File… / File – Submit Window or the File – Submit Command File… menu routes – don't forget that it does take a long time to run!
Salford Systems © Copyright 2011 92
  93. 93. Final Remarks
This completes the Salford Systems Data Mining and Text Mining tutorial.
In the process of going through the tutorial you have learned how to use both the GUI and command line facilities of SPM as well as the command line text mining facility STM.
You managed to build two CART models and two TN models, as well as enrich the original dataset with a variety of text mining fields.
The final model puts you among the top winners in a major text mining competition – a proud achievement.
Even though we have barely scratched the surface, you are now ready to proceed with exploring the remainder of the vast data mining activities offered within SPM and STM on your own.
We wish you the best of luck on the exciting and never ending road of modern data analysis and exploration.
And don't forget that you can always reach us at www.salford-systems.com should you have further modeling questions and needs.
Salford Systems © Copyright 2011 93
  94. 94. References
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Pacific Grove: Wadsworth.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Hastie, T., Tibshirani, R., and Friedman, J.H. (2000). The Elements of Statistical Learning. Springer.
Freund, Y. and Schapire, R.E. (1996). Experiments with a new boosting algorithm. In L. Saitta, ed., Machine Learning: Proceedings of the Thirteenth National Conference, Morgan Kaufmann, pp. 148-156.
Friedman, J.H. (1999). Stochastic gradient boosting. Stanford: Statistics Department, Stanford University.
Friedman, J.H. (1999). Greedy function approximation: a gradient boosting machine. Stanford: Statistics Department, Stanford University.
Weiss, S.M., Indurkhya, N., Zhang, T., and Damerau, F.J. (2004). Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer.
Salford Systems © Copyright 2011 94
  95. 95. STM Command Reference
Salford Text Miner is a simple utility that should make the text mining process much easier. The application described in this manual accepts various parameters and can execute the Salford Predictive Miner as the data mining backend.
STM Workflow:
- Automatically generate a dictionary based on a dataset
- Process the dataset and generate a new one with additional columns based on the dictionary
- Generate a model folder with the dataset, command file, and dictionary
- Run Salford Predictive Miner with the generated command file
- Run the checking process, comparing results from scoring with the real classes
All of these steps can be done in separate STM calls or in one call.
Salford Systems © Copyright 2011 95
  96. 96. STM Command Reference
-data DATAFILE, --dataset DATAFILE: Specify the dataset to work with
-dict DICTFILE, --dictionary DICTFILE: Specify the dictionary to work with
-source-dict SDFILE, --source-dictionary SDFILE: Dictionary used as the source for the automatic dictionary retrieval process
-score SFILE, --scoreresult SFILE: Specify the file with scoring results for the checking step; default 'score.csv'
-spm SPMAPP, --spmapplication SPMAPP: Path to the SPM application; default 'spm.exe'
-t TARGET, --target TARGET: Target variable used when generating the command file; default 'GMS_GREATER_AVG'
-ex EXCLUDE, --exclude EXCLUDE: List of variables to exclude from the keep list when generating the command file
-cat CATEGORY, --category CATEGORY: List of variables to treat as categorical when generating the command file
  97. 97. STM Command Reference
-templ CMDTEMPL, --cmdtemplate CMDTEMPL: Template of the command file used for generation; default 'data/template.cmd'
-md MODEL_DIR, --modeldir MODEL_DIR: Directory where model folders will be created; default 'models'
-trees TREES, --trees TREES: For TreeNet command files, the number of trees to build; default 500
-maxnodes MAXNODES, --maxnodes MAXNODES: For TreeNet command files, the number of nodes per tree; default 6
-fixwords, --fixwords: Enable heuristics that try to fix words (finding the nearest word by various metrics, spell checking, etc.)
-textvars VARLIST, --text-variables VARLIST: Comma-separated list of variables used in the dictionary retrieval process
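STM does not document the exact -fixwords heuristics here; one common heuristic of this kind is replacing a token with its nearest dictionary word under Levenshtein edit distance. The sketch below illustrates only that idea; the fix_word helper and the distance cutoff are our own illustration, not STM internals.

```python
# Sketch of a "fixwords"-style heuristic: map a possibly misspelled token
# to the nearest dictionary word by Levenshtein edit distance.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

def fix_word(token, dictionary, max_dist=2):
    """Return the closest dictionary word within max_dist edits, else the token unchanged."""
    best = min(dictionary, key=lambda w: edit_distance(token, w))
    return best if edit_distance(token, best) <= max_dist else token

print(fix_word("ipdo", ["ipod", "phone", "charger"]))  # -> ipod
```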
  98. 98. STM Command Reference
-outrmwords, --output-removed-words: Write removed stop words to the file 'data/removed.dat'
-code CODE, --column-coding CODE: Specify how to code the absence/presence of a word in a row:
  YN or 0 – no/yes
  YNM or 1 – no/yes/many
  01 or 2 – 0/1
  012 or 3 – 0/1/2
  TF or 4 – term frequency
  IDF or 5 – inverse document frequency
  TF-IDF or 6 – TF-IDF
  TC or 7 – term count (0, 1, 2, …)
  Default: YN
-mp MODELPATH, --model-path MODELPATH: Specify the path where model files will be created
-cmd-path CMDPATH, --command-file-path CMDPATH: Specify the path to the command file to be executed by the Salford Predictive Miner
-ppfile PPFILE, --preprocess-file PPFILE: Path to Python code executed during the process step to manipulate the data
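The coding schemes listed above can be illustrated on a toy corpus. The sketch below uses the common textbook definitions of term frequency and inverse document frequency; STM's exact normalization may differ.

```python
import math

# Toy corpus: each document is a list of tokens (e.g., a listing title).
docs = [["red", "ipod", "nano"],
        ["ipod", "case"],
        ["red", "red", "shoes"]]

N = len(docs)

def codings(word, doc):
    """Compute the column-coding variants for one word in one document."""
    tc = doc.count(word)                    # TC: raw term count 0, 1, 2, ...
    tf = tc / len(doc)                      # TF: term frequency within the document
    df = sum(1 for d in docs if word in d)  # number of documents containing the word
    idf = math.log(N / df) if df else 0.0   # IDF: inverse document frequency
    return {
        "YN":  "yes" if tc else "no",       # presence coding (the default)
        "YNM": "many" if tc > 1 else ("yes" if tc == 1 else "no"),
        "01":  int(tc > 0),
        "012": min(tc, 2),
        "TC":  tc,
        "TF":  tf,
        "IDF": idf,
        "TF-IDF": tf * idf,
    }

print(codings("red", docs[2]))
```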
  99. 99. STM Command Reference
-rc NAME, --realclass-column-name NAME: Column name of the real class in the dataset for the check step; default GMS_GREATER_AVG
-e, --extract: Run the first step, automatic extraction of the dictionary from the dataset; requires --dataset
-p OUTFILE, --process OUTFILE: Run the second step, processing the dataset and creating a new dataset named OUTFILE with new columns derived from the dictionary; requires --dataset and --dictionary
-g, --generate: Run the third step, generating the model folder with the command file; requires --dataset and --dictionary
-m, --model: Run the fourth step, running the Salford Predictive Miner with the generated command file; works only with --generate
-c DATASET, --check DATASET: Run the fifth step, checking the score file against the real classes (from the specified REALCLASSFILE) and printing a misclassification table; requires --scoreresult
-h, --help: Show help
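The check step (-c) boils down to joining the score file's prediction column with the real class column and tabulating a misclassification table. A minimal sketch, assuming simple CSV layouts keyed by AUCT_ID (illustrative only, not STM's exact file format):

```python
import csv
import io
from collections import Counter

# Stand-ins for score.csv and the real class file; layouts are illustrative.
score_csv = io.StringIO("AUCT_ID,PREDICTION\n1,1\n2,0\n3,1\n4,0\n")
real_csv = io.StringIO("AUCT_ID,GMS_GREATER_AVG\n1,1\n2,1\n3,1\n4,0\n")

pred = {r["AUCT_ID"]: r["PREDICTION"] for r in csv.DictReader(score_csv)}
real = {r["AUCT_ID"]: r["GMS_GREATER_AVG"] for r in csv.DictReader(real_csv)}

# Misclassification table: (actual, predicted) -> count.
table = Counter((real[k], pred[k]) for k in real)
for (actual, predicted), n in sorted(table.items()):
    print(f"actual={actual} predicted={predicted}: {n}")

accuracy = sum(n for (a, p), n in table.items() if a == p) / sum(table.values())
print(f"accuracy = {accuracy:.2f}")
```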
  100. 100. STM Configuration File
SPM_APPLICATION: Path to the Salford Predictive Miner; default spm.exe
CMD_TREES: Number of trees to build in TreeNet models; default 500
CMD_NODES: Tree size for TreeNet models; default 6
CMD_TEMPLATE: Command file template; default data/template.cmd
MODELS_DIR: Directory where model folders will be created; default models
LANGUAGES: Languages whose stop words will be used; default English, German
SPELLCHECKER_DICT: Additional spell checker dictionary with allowed words (like "ipod"); default data/spellchecker_dict.dat
SPELLCHECKER_LANGUAGE: Language for the spell checker; default de_DE
ADDITIONAL_STOPWORDS: File with additional stop words, which the user can edit; default data/stopwords.dat
REMOVED_WORDS_FILE: File where removed words are written during the "extract" step; default data/removed.dat
WORD_FREQUENCY_THRESHOLD: Lower word frequency threshold; words below it are deleted during the "extract" step; default 5
PREPROCESS_FILE: Script included to do additional processing; default dmc2006/preprocess.py
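The interaction between the stop word lists and WORD_FREQUENCY_THRESHOLD during the extract step can be sketched as follows. The boundary semantics (keep words whose count is at or above the threshold) and the tiny threshold value are our assumptions for the example; STM's actual default threshold is 5.

```python
from collections import Counter

# Sketch of the "extract" step: build a dictionary from text fields by
# removing stop words and dropping low-frequency words.
stopwords = {"the", "a", "and", "with", "for"}
THRESHOLD = 2  # lowered from STM's default of 5 for this tiny example

titles = ["the red ipod nano",
          "red ipod case",
          "a case for the ipod"]

counts = Counter(w for t in titles for w in t.lower().split()
                 if w not in stopwords)
dictionary = sorted(w for w, n in counts.items() if n >= THRESHOLD)
print(dictionary)  # words surviving both the stop word filter and the threshold
```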
  101. 101. STM Configuration File
CHECK_RESULTS_FILE: default data/score_results.csv
LOGFILE: Path to the log file; can be a mask (%s for the date); default log/stm%s.log
TARGET: Default value for the target argument, used to fill the command file template; default GMS_GREATER_AVG
EXCLUDE: Default value for the keep argument, used to fill the command file template; default AUCT_ID, LISTING_TITLE$, LISTING_SUBTITLE$, GMS, GMS_GREATER_AVG
CATEGORY: Default value for the category argument, used to fill the command file template; default GMS_GREATER_AVG
SCORE_FILE: Name of the score file to be checked; default Score.csv
TEXT_VARIABLES: Comma-separated list of text variables in the dataset; default ITEM_LEAF_CATEGORY_NAME, LISTING_TITLE, LISTING_SUBTITLE
DEFAULT_CODING: Default coding for the extract and preprocess steps; default YN
REALCLASS_COLUMN_NAME: Name of the column in the real class file used in the check step; default GMS_GREATER_AVG
SCORE_COLUMN_NAME: Name of the column in the score file used in the check step; default PREDICTION