We will talk about auto-classification and the place for machine learning. When a spend transaction is added, its position in a formal taxonomy may have to be determined dynamically, and that is not something a person can do manually in real time; we need an automated way of doing it. The spend transactions themselves have descriptions, and when a tagging activity happens or a review is written up, there is textual information. We can use UIMA to pick out all the textual tokens, break them out into attributes, and do Named Entity Recognition, and then bring in a trained SVM engine that works on a model, picks up the spend descriptions and their attributes from the classification model, tags them, and positions them appropriately in the taxonomy. Two flavors are available, a Neural Net engine and an SVM, and they have comparable performance. The bottom line is: we take in the spend taxonomy and the spend ontology that describes the entire spend model, together with the description of the spend, run them through the engine, and tag things, so that as and when a new spend transaction is introduced, it is appropriately positioned in the taxonomy, dynamically.
Semantic Spend Classification
Arivoli Tirouvingadame
Principal Member of Technical Staff, Oracle America, Inc.
Acknowledgements
Sincere thanks to Keshava Rangarajan, Chief Architect, Halliburton Corporation, for all the contribution and guidance, without which this research would not have been possible.
What is Spend Classification?
•Definition: the process of determining a purchase code for each spend record (Requisitions, Purchase Orders, Receipts, Invoices, etc.) from a hierarchical structure (Taxonomy).
Why classify spend?
•Once all spend transactions are classified with a standard code from a taxonomy, simple queries can be answered, such as:
 •What are my top 10 spend categories?
 •What is my travel spend?
 •What is my spend for a given Supplier?
 •What is my spend for a given Part?
 •What is my spend for a given Business Unit?
•If classification is done on consolidated data across all the systems in your organization, you get classified visibility across all of those systems.
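Once every transaction carries a taxonomy code, a question like "top 10 spend categories" reduces to a simple aggregation. A minimal sketch in Python; the field names and figures are invented for illustration:

```python
from collections import defaultdict

def top_categories(transactions, n=10):
    """Sum spend amounts per taxonomy code and return the n largest."""
    totals = defaultdict(float)
    for t in transactions:
        totals[t["category"]] += t["amount"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

spend = [
    {"category": "IT.Hardware.Laptop", "amount": 1200.0},
    {"category": "Travel.Air", "amount": 450.0},
    {"category": "IT.Hardware.Laptop", "amount": 800.0},
]
print(top_categories(spend, n=2))
# [('IT.Hardware.Laptop', 2000.0), ('Travel.Air', 450.0)]
```

The same grouping, keyed by supplier, part, or business unit instead of category, answers the other questions on this slide.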
What is a Taxonomy?
•A hierarchical coding structure used to classify spend at different levels, e.g. Segment > Family > Class > Commodity.
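The four-level hierarchy can be represented as nested mappings. A minimal sketch; the codes and names reuse the UNSPSC photocopier example from a later slide, and the structure is purely illustrative:

```python
# Each node maps a 2-digit level code to (name, children).
taxonomy = {
    "44": ("Office Equipment", {
        "10": ("Office machines", {
            "15": ("Duplicating machines", {
                "01": ("Photocopiers", {}),
            }),
        }),
    }),
}

def describe(path):
    """Resolve a list of level codes to their human-readable names."""
    node, names = taxonomy, []
    for code in path:
        name, node = node[code]
        names.append(name)
    return " > ".join(names)

print(describe(["44", "10", "15", "01"]))
# Office Equipment > Office machines > Duplicating machines > Photocopiers
```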
What is the Spend Classification challenge?
•Categorization at source
•Categorization itself is inconsistent or missing completely
•Multiple disparate Taxonomies may exist in a company
•Classifying into the “MISCELLANEOUS” category
•No standardization of Taxonomies
What is the “Categorization at source” challenge?
Exercise: buying a work laptop and expensing it via procurement.
✗ Category: Facility.Building.Hardware
✓ Category: IT.Hardware.Laptop
Characteristics:
•User-entered, hence error-prone
•No standardization across the supply chain: business units, customers, or suppliers.
What is the “inconsistent/missing Categorization” challenge?
•Category: IT.Hardware.Laptop
•Category: IT.Hardware.Computers.Laptop
•The same item may be coded under branches of different depth or shape, so the two codes above never aggregate together.
What is the “multiple disparate Taxonomies” challenge?
•Multiple (and disparate) taxonomies may also exist in the organization, where classification could be carried out business-unit-wise without regard to, or reference to, the taxonomies used in other business units.
•Diagram: Business Unit 1 → Taxonomy 1, Business Unit 2 → Taxonomy 2, Business Unit 3 → Taxonomy 3.
What is the “MISCELLANEOUS category” challenge?
•Spend transactions are classified into the Miscellaneous category, making it very difficult for business analysts to figure out which category an item should actually belong to.
•Spend analytics data will then show a weighted Miscellaneous category, which is incorrect and thus does not reflect a true picture of spend by categories for the organization.
•Similar popular categories: OTHERS, UNCATEGORIZED
What is the need for standardization of Taxonomies?
•An enterprise may have multiple taxonomies at different levels: corporate, strategic, business unit, and regional center.
•Multiple taxonomies at various levels create a number of issues when analyzing spend; it is therefore important to create or use standard taxonomies across the enterprise.
What are the types of Spend Classification Taxonomies?
Spend classification taxonomies fall into two types: Standard and Custom.
Standard Taxonomies
•UNSPSC: United Nations Standard Products and Services Code. A hierarchy coded as an 8-digit number, two digits per level (Segment, Family, Class, Commodity), with an optional Business Function level extending it to 10 digits.
Example:
•Segment 44: Office Equipment and Accessories and Supplies
•Family 10: Office machines and their supplies and accessories
•Class 15: Duplicating machines
•Commodity 01: Photocopiers
•Business Function 14: Retail
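Because each UNSPSC level is a two-digit pair, a code can be split mechanically. A small illustrative helper (the function name is ours, not part of any standard API):

```python
def split_unspsc(code):
    """Split an 8- or 10-digit UNSPSC code into its 2-digit levels."""
    levels = ["Segment", "Family", "Class", "Commodity", "Business Function"]
    pairs = [code[i:i + 2] for i in range(0, len(code), 2)]
    return dict(zip(levels, pairs))

print(split_unspsc("44101501"))
# {'Segment': '44', 'Family': '10', 'Class': '15', 'Commodity': '01'}
```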
Custom Taxonomies
•Use a custom taxonomy if your own coding structure is strong enough for your business, or if your business is more acquainted with your own structure than with a standard one.
Diagram: spend documents (1. Requisitions, 2. Purchase Orders, 3. Receipts, 4. Invoices), each carrying a description and attributes (item, invoice, supplier) plus an ERP category, flow through data-mining-based Spend Classification against the ERP taxonomy, UNSPSC codes, or custom taxonomies; the classified spend feeds Procurement & Spend Analysis.
What is Spend Analysis?
•The process of collecting, cleansing, classifying and analyzing expenditure data with the purpose of reducing procurement costs.
•The process of aggregating, classifying, and leveraging spend data for the purpose of gaining visibility into cost reduction, performance improvement, and contract compliance opportunities.
•Enables answering the following questions:
 •Who is buying?
 •What?
 •From whom?
 •When?
 •(optionally) Where?
 •At what price?
Who needs Spend Analysis?
•It is the process of organizing a company’s spend in such a way that one can understand it, slice it, dice it, and uncover hidden savings opportunities.
•Impacts more than just the sourcing team.
•Spend analysis/visibility serves three internal user community groups:
 •Leadership and CxOs, who need up-to-date reports to drive strategic direction
 •Managers and accountants, who need to drill down into a spend data set to explore specific areas of interest or track down payment specifics
 •Sourcing power users, who need to locate, drive, and monitor the next set of savings initiatives
What is Spend Management?
•The process in which companies control and optimize the money they spend.
•Involves cutting operating and other costs associated with doing business.
•Includes spend analysis, sourcing, procurement, receiving, payment settlement, and management of accounts payable and general ledger accounts.
•In an enterprise, spend management is managing how to spend money to best effect in order to build products and services.
•Encompasses processes such as outsourcing, procurement, e-procurement, and supply chain management.
Benefits of Spend Management
•Decreased "maverick" spend
•Increased economies of scale in spend
•Strategic sourcing (also called "supplier rationalization")
•Sourcing optimization
•Co-operative sourcing
•Increased process efficiency
•Increased procurement efficiency
Life cycle of a PO
1. Create PO
2. Add items to PO
3. Add PO to Cart *
4. Create Document for the PO in the Cart
5. Create Requisition for the Document
Note: the PO needs to be classified before it hits the Cart. After the Order hits the Cart, it is too late for classification.
Classifying Spend
•We have a set of pre-defined fields chosen for classification from a Purchase Order. All these fields are concatenated to form one giant string. (Note: this textual string could contain multi-lingual strings.)
•Lexers can be used for detecting languages (e.g. auto lexers, world lexers).
•SVM can be used for text mining.
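The concatenation step can be sketched as follows; the field names are hypothetical, and real deployments choose their own subset of PO fields:

```python
def classification_text(po, fields=("item_description", "supplier", "category_hint")):
    """Concatenate the chosen PO fields into one string for the text miner.
    Missing fields are skipped; field names here are illustrative."""
    return " ".join(str(po.get(f, "")) for f in fields).strip()

po = {"item_description": "ThinkPad X1 laptop", "supplier": "Acme Corp."}
print(classification_text(po))
# ThinkPad X1 laptop Acme Corp.
```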
Where does Machine Learning fit in? (Spend Auto-Classification)
Diagram: a spend transaction, together with the taxonomies and an ontology (including spend descriptions and other textual attributes), is fed to the Spend Auto-classifier, which combines linguistics (UIMA) with a Neural Net engine or text SVM and produces auto-classified spend.
Training data set
•To begin with, customers provide a training data set drawn from their historic data: a well-known data set covering their most common use cases, which constitutes a good representation of the problem.
•We run our logic against this training set, verify the results, and iterate for some cycles to tune the logic.
•Repeat the same over other use cases.
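The train/verify/iterate loop can be illustrated with a toy classifier. This is a stand-in only: simple bag-of-words overlap rather than a real SVM or neural net engine, and the training examples are invented:

```python
from collections import Counter

def tokens(text):
    return text.lower().split()

def train(examples):
    """Build one bag-of-words profile per category from labelled examples."""
    profiles = {}
    for text, label in examples:
        profiles.setdefault(label, Counter()).update(tokens(text))
    return profiles

def classify(model, text):
    """Pick the category whose word profile overlaps most with the text."""
    words = set(tokens(text))
    return max(model, key=lambda label: sum(model[label][w] for w in words))

training_set = [
    ("dell laptop 14 inch", "IT.Hardware.Laptop"),
    ("lenovo thinkpad laptop", "IT.Hardware.Laptop"),
    ("flight ticket new york", "Travel.Air"),
]
model = train(training_set)
print(classify(model, "hp laptop charger"))  # IT.Hardware.Laptop
```

Verifying the output on held-back records, adding the misclassified ones back into `training_set`, and retraining is the iteration cycle the slide describes.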
Data Mining Model (life cycle): create a model → model created → enrich/re-train, cleansing incorrect classifications and supporting new categories (if needed).
What is Named Entity Recognition?
•“Named-entity recognition (NER) (also known as entity identification and entity extraction) is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.” -- Wikipedia
•Most research on NER systems has been structured as taking an unannotated block of text, such as this one:
 •Jim bought 300 shares of Acme Corp. in 2006.
•And producing an annotated block of text, such as this one:
 •<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>.
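A real NER system is statistical (e.g. CRF-based, as in Stanford NER), but the input/output shape can be shown with deliberately naive regular-expression rules that emit two of the inline tag types above:

```python
import re

def tag_entities(text):
    """Wrap matches in NER-style inline tags.
    These patterns are crude stand-ins for a trained recognizer:
    any 4-digit number is treated as a date, any number before
    'shares' as a quantity. Person/organization tagging would need
    a model or gazetteer and is omitted."""
    text = re.sub(r"\b(\d{4})\b", r'<TIMEX TYPE="DATE">\1</TIMEX>', text)
    text = re.sub(r"\b(\d+)(?= shares)", r'<NUMEX TYPE="QUANTITY">\1</NUMEX>', text)
    return text

print(tag_entities("Jim bought 300 shares of Acme Corp. in 2006."))
# Jim bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of Acme Corp. in <TIMEX TYPE="DATE">2006</TIMEX>.
```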
Anatomy of a query…
Query = “Find Approved Status POs with High Amount”
Stemmed Entity Recognition & Linguistic Parsing yields…
Query: Find Approved Status POs with High Amount
•Search Verb: “Find”
•Target Entity: Attribute:Type = “PO”
 •Having Attribute: Attribute:Status = “Approved”
 •Having Attribute: Attribute:Amount = “High”
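A parse of this shape can be approximated with a small rule-based sketch; the vocabulary of verbs, entities, and attribute values is hard-coded here purely for illustration:

```python
import re

def parse_query(q):
    """Very small rule-based parse of the example query shape.
    The word lists below are invented for this sketch, not a real grammar."""
    parsed = {}
    m = re.match(r"(Find|Show|List)\b", q)
    if m:
        parsed["search_verb"] = m.group(1)
    if re.search(r"\bPOs?\b", q):
        parsed["target_entity"] = "PO"
    m = re.search(r"\b(Approved|Open|Closed)\b", q)
    if m:
        parsed["status"] = m.group(1)
    m = re.search(r"\b(High|Low)\s+Amount\b", q)
    if m:
        parsed["amount"] = m.group(1)
    return parsed

print(parse_query("Find Approved Status POs with High Amount"))
# {'search_verb': 'Find', 'target_entity': 'PO', 'status': 'Approved', 'amount': 'High'}
```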
OWL ontology (diagram, reconstructed): a Transaction (OWL class, with a string attribute Code) has many Parties and can be related to other Transactions; a Party (OWL class) has a string attribute ID and plays a Role (OWL class); Bank, Person, and Corporation are kinds of Party ("Is A"), and Finance Corporation is a kind of Corporation; a Person has First Name and Last Name (string attributes), a numeric ID, many Addresses, and an Account (with a numeric ID) in a Bank; an Address (OWL class) has Door Number, Street Name, City, State, Zip, and Country attributes.
Transaction ID: 200911071234
 has Party, with ID: SBK
  has Role: S? Bank Role, played by Bank
   Bank has Name: Bank Of Congo
   has many Addresses:
    Address has Street Name: Afrique Au Congo; has Country: RDC
Transaction ID: 200911071235
 has Party, with ID: ORP
  has Role: Ordering Party Role, played by Person
   Person has First Name: John; has Last Name: Doe
   has many Addresses:
    Address has City: Kinshasa; has Country: CD
   has Account, with Account Id: 123456, in Bank
    Bank has Name: Bank Of Congo
The two transactions are related: Transaction ID 200911071234 (Party ID: SBK, Role: S? Bank Role, played by the Bank named Bank Of Congo, with an Address whose Street Name is Afrique Au Congo, Country RDC) is related to Transaction ID 200911071235 (Party ID: ORP, Role: Ordering Party Role, played by the Person John Doe, with an Address in City Kinshasa, Country CD, and an Account with Id 123456 in the Bank named Bank Of Congo).
A possible solution: Pipelining approach
•Flow 1:
 •Machine Learning Pipeline: input data is directly fed to the machine learning piece.
•Flow 2:
 •Domain Ontology Pipeline: input data is fed to a domain ontology.
 •Standardize the output from the domain ontology.
 •Machine Learning Pipeline: feed it into the machine learning piece.
•Flow 3:
 •NER Pipeline: input data is fed to a NER.
 •Domain Ontology Pipeline: output from the NER is fed to the domain ontology.
 •Standardize the output from the domain ontology.
 •Machine Learning Pipeline: feed it into the machine learning piece.
•Note: the Domain Ontology and NER pipelines can be optionally turned on or off.
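The three flows can be sketched as a pipeline with optional stages. Each stage below is a trivial placeholder (string rewrites standing in for real NER, ontology standardization, and machine learning), so the point is the composition, not the stage logic:

```python
def ner_stage(text):
    """Placeholder NER stage: inline-tags a known product word."""
    return text.replace("laptop", "laptop[PRODUCT]")

def ontology_stage(text):
    """Placeholder domain-ontology standardization: maps a synonym to a canonical term."""
    return text.replace("notebook", "laptop")

def ml_stage(text):
    """Placeholder machine-learning classifier keyed on a single keyword."""
    return "IT.Hardware.Laptop" if "laptop" in text.lower() else "MISCELLANEOUS"

def run_pipeline(text, use_ner=False, use_ontology=False):
    """Flows 1-3 from the slide: NER and ontology stages are optional, ML always runs."""
    if use_ner:
        text = ner_stage(text)
    if use_ontology:
        text = ontology_stage(text)
    return ml_stage(text)

print(run_pipeline("lenovo notebook 14in"))                     # Flow 1: MISCELLANEOUS
print(run_pipeline("lenovo notebook 14in", use_ontology=True))  # Flow 2: IT.Hardware.Laptop
```

Turning the upstream stages on or off is just a pair of flags, which is what makes the flows easy to compare on the same data.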
SVM Steps
1. Identify the taxonomy (hierarchical or flat) to be classified against
2. Identify representative training data that has been classified to this taxonomy
3. Run the training data against a blank SVM model and the given taxonomy
4. Classify the training data as per the required taxonomy
5. Classify the data
6. Increase the training population and enrich the classification model
7. Recognize and realign the impact of the original model against fresh training data
8. Classify (manually) misclassifications into the proper taxonomy nodes
9. Run steps 6 through 8 until all the variations for a given domain have been recognized
10. Introduce live data
11. Repeat steps 4 and 5 for misclassifications
12. Store the result in a relational database
13. Insert the data into an Ontology
14. Enable analysis using RQL or SPARQL
Open source software
1. Jena
2. Pentaho, http://www.pentaho.com/
3. Stanford NER, http://nlp.stanford.edu/software/CRF-NER.shtml
4. ANNIE NER
5. GATE
6. UIMA
7. SVM, http://en.wikipedia.org/wiki/Support_vector_machine