Bionic Info Pro - Taxonomies and Machine Learning SLA 2014
Presentation for Special Libraries Association on machine assisted taxonomy creation and the human element.

Published in: Education, Technology

  • Not an expert; I am a “LEARNER,” a student
  • “automatic discovery of patterns using software to analyze vast amounts of records in a database”
    What else was going on in tech in 1996?
  • The 1996 article mentioned transactional data, “all the rage”
    Risk mitigation
    Efficiency and waste
    Allow us to formulate solutions in English
  • “Library Hand” – we’ve been doing indexing, taxonomies, classification since the beginning of our profession
    Machine-created taxonomies are not new: text mining, extraction, and indexing have been automated since the 1960s. The earliest I could find was a paper published by the RAND Corporation in 1961
  • Wider need for classification (Building Enterprise Taxonomies, Stewart)
    The pendulum – “searching” versus “browsing” paradigms
    Search = lack of context, precision versus recall, relevancy ranking, choice of terminology
    Proper syntax for each search tool, where to search? Spelling variants, bad labels
    Where do we find taxonomies and ontologies today? Here are some of their natural habitats
    Web sites
    Discipline/Domain Classification
    Machine Learning Algorithms
    Training dataset and a testing dataset.
    As Heather points out in her book, The Accidental Taxonomist, the efficacy of machine-created taxonomies improves dramatically with human quality control

    Legacy systems
    Hierarchical models
    Network models
    Diagram for a relational database is in rows and columns
    Classes, variables, attributes, qualities, fields, observations, instances, records, cases
  • NoSQL

  • Andrew Brust
    “Bigger data means weirder data” (Jeffrey Stanton, in the Intro to Data Science book)
  • Big Data: A Revolution That Will Transform How We Live, Work, and Think
    Weed out data noise

    Algorithms can be programmed, with human quality control, to account for redundancy and to catch inconsistencies and differing terms
  • Autoclassification model:
    Linguistic/lexical: gather and rank representative words and phrases that are associated with the concepts to be classified.
    Rules based: there is no common syntax for developing rules; it varies by tool. Rule syntax can range from Boolean to the more complex syntax commonly used in programming languages. Because of this lack of consistency, the people who create and maintain these rules need a more specialized skill set and more training.
    Machine learning/predictive: these systems rely on iteration for continuous validation. A traditional hierarchical taxonomy may not be needed, only reference terms or document sets to model. Maintenance of machine learning systems = repeated training, especially when you add new content. You will also help revise the larger machine-learning model as you learn more about your content.
  • Examples of Domain Knowledge
    Big Data revolution book: building inspectors needed to predict which buildings should have priority inspections
    Web design for user-generated content: the system automatically categorizes user-driven content, but the taxonomy is refined by humans
    As the taxonomy is refined, the autoclassifier improves, “gets smarter”

    We as knowledge experts fill in the gaps!
    We can be facilitators between those in the field/analysts and those programming the algorithms

  • Example of meaningless data: Google Flu Trends
    Scientific controlled experiments limit external sources; domain knowledge fills in the gaps in real-world data analysis
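The linguistic/lexical autoclassification approach in the notes above, together with the point that the autoclassifier "gets smarter" as humans refine the taxonomy, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `TAXONOMY` contents, the term weights, the threshold, and the helper names `classify` and `add_term` are hypothetical and do not come from the presentation or from any particular tool.

```python
import re
from collections import Counter

# Hypothetical starter taxonomy: concept -> weighted indicator terms.
# In practice a domain expert would supply and rank these terms.
TAXONOMY = {
    "Machine Learning": {"classifier": 2.0, "training": 1.5, "clustering": 2.0},
    "Taxonomies": {"taxonomy": 2.0, "hierarchy": 1.5, "indexing": 1.5},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify(text, taxonomy, threshold=2.0):
    """Return concepts whose weighted term score meets the threshold,
    highest score first."""
    counts = Counter(tokenize(text))
    scores = {}
    for concept, terms in taxonomy.items():
        score = sum(weight * counts[term] for term, weight in terms.items())
        if score >= threshold:
            scores[concept] = score
    return sorted(scores, key=scores.get, reverse=True)


def add_term(taxonomy, concept, term, weight=1.0):
    """Human quality control step: a reviewer adds an indicator term."""
    taxonomy.setdefault(concept, {})[term.lower()] = weight
```

In this sketch, a document such as "A controlled vocabulary for indexing" falls below the threshold until a reviewer calls `add_term(TAXONOMY, "Taxonomies", "vocabulary")`, after which it is assigned to "Taxonomies". That is the hybrid loop in miniature: the machine scores, the human fills in the gaps, and the next pass improves.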
  • Transcript

    • 1. Bionic Info Pro: New Takes on an Old Theme Machine Learning, Taxonomy Creation, Big Data, Competitive Intelligence, and the Human Element Elaine M. Lasda Bergman Annual Conference Special Libraries Association Vancouver, BC, Canada Monday, June 9, 2014
    • 2. Overview • A little bit about Machine Learning • A little bit about Taxonomies • A little bit about Big Data • A little bit about Hybrid Techniques
    • 3. NOT NEW: Machine Learning for CI Mena, Jesus. (1996). Data Mining for Competitive Intelligence, Competitive Intelligence Review, 7(4):18-25.
    • 4. Refinement of Machine Learning • Decision Trees/Classification • Clustering • Anomaly Detection
    • 5. Refinement of Machine Learning • Support Vector Machines – Predictive Classification • Association Rules – Market basket analysis • Natural Language Processing – Sentiment Analysis
    • 6. Getting up to Speed • 6 Video Tutorials and Playlists on Machine Learning (January 2014)
    • 7. NOT NEW: Taxonomies in Information Retrieval
    • 8. Need for Taxonomic Structures
    • 9. NOT NEW: Datasets (slide image: 's-Foot-ERD-Sample60.png)
    • 10. Enter BIG DATA
    • 11. Big Data Sources and Analysis
      Data Type | Qualities | Analysis Tools | Result
      Social Media | Demographics | API integration | More profiles of like-minded users, “Social Influencers”
      Social Media | User Reviews | NLP, Text Analysis | Sentiment readings
      “Internet of Things” | Logs/Sensors/Check-Ins | Parsing | Usage and behavior patterns
      SaaS | Cloud/Web-based/Subscription software | Dist. data integration/in-memory caching technology/API integration | Usage behavior patterns, customer data, etc.
      Public Data | e.g., Amazon Data Market, WorldBank, Wikipedia | All above (depends on data structure) | Depends on Dataset (and there are LOTS of them!)
      Hadoop/MapReduce | Volume! | Parallel Processing/Parsing/Reduction | Big patterns, correlations, needles in haystacks
      Data Warehouses | Internal transactional data | Likely same as above | Correlations, market basket, etc.
      NoSQL/Columnar | Volume! | Fills gaps in Parallel processing tools | Real time activity and patterns
      In-Stream Monitoring | Network traffic (streaming videos, system outages) | Packet evaluation, distributed query processing | Network/Stream usage patterns
      Legacy Data | Usually PDFs & Documents/SemiStructured | Transformation tools (e.g., Xenos d2e) + above | Depends on content (could be all)
    • 12. Why “Concept Hierarchies” in an Unstructured Environment?
    • 13. Advantages • When a term sits too low in the hierarchy to appear in frequent itemsets/rulesets • Create more interesting rules using more general, aggregated concepts [DVD, wheat bread, home electronics, electronics, food] Tan, Steinbach & Kumar (2005) Introduction to Data Mining
    • 14. Disadvantages • How low and how high in the hierarchy do you set the threshold? • Increased computation time • If threshold is too high, redundant rules for more specific terms can be summarized by rules using more general terms
    • 15. Hybrid Taxonomic Development • Understand your auto-classification model • Work with domain experts to create basic taxonomy • Test Taxonomy in the Model • Rinse, repeat Wendy Pohs, ASIS&T Bulletin 12/1/13
    • 16. Domain Knowledge and Thick Data • Thick Data analysis primarily relies on human brain power to process a small “N” while big data analysis requires computational power (of course with humans writing the algorithms) to process a large “N”. • Big Data reveals insights with a particular range of data points, while Thick Data reveals the social context of and connections between data points. Big Data delivers numbers; thick data delivers stories. Big data relies on machine learning; thick data relies on human learning. (Tricia Wang)
    • 17. Data Driven CI is Meaningless Without Human/Domain Knowledge
    • 18. Recap • Data Mining for CI is not new • Refinement and Improvement • Bigger, Weirder Data
    • 19. Recap • Where it’s at: Hybrid Schemas • Thick Data, not just Big Data • HUMAN ELEMENT IS ESSENTIAL
    • 20. Questions? Elaine Lasda Bergman University at Albany @ElaineLibrarian
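Slides 13 and 14 describe why concept hierarchies help in association analysis: a specific item may be too rare to appear in any frequent itemset, while its more general parent concept is frequent. The sketch below illustrates that idea under stated assumptions. The `PARENT` mapping, the sample transactions, and the function names are hypothetical; only the item names DVD, wheat bread, home electronics, electronics, and food come from the slide.

```python
# Hypothetical concept hierarchy, stored as child -> parent.
PARENT = {
    "DVD": "home electronics",
    "home electronics": "electronics",
    "wheat bread": "bread",
    "white bread": "bread",
    "rye bread": "bread",
    "bread": "food",
}

# Made-up market basket transactions for illustration only.
TRANSACTIONS = [
    {"wheat bread", "DVD"},
    {"white bread"},
    {"rye bread"},
    {"DVD"},
]


def ancestors(item, parent):
    """Walk up the hierarchy and collect every ancestor of an item."""
    chain = []
    while item in parent:
        item = parent[item]
        chain.append(item)
    return chain


def extend(transaction, parent):
    """Add each item's ancestors to the transaction."""
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item, parent))
    return extended


def support(item, transactions):
    """Fraction of transactions that contain the item."""
    return sum(item in t for t in transactions) / len(transactions)


EXTENDED = [extend(t, PARENT) for t in TRANSACTIONS]
```

With a minimum support of 0.5, no individual bread variety qualifies (each appears in one of four transactions), but `support("bread", EXTENDED)` and `support("food", EXTENDED)` both reach 0.75. The extended transactions are also larger than the originals, which is one source of the increased computation time listed among the disadvantages.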