Automatic and rapid generation of massive knowledge repositories from data

Presentation by Frank Guerino to the SIKM Leaders Community on June 20, 2017

Published in: Business

  1. IF4IT: Automatic and Rapid Generation of Massive Knowledge Repositories, Directly from Data
     Author/Presenter: Frank Guerino, Chairman, The International Foundation for Information Technology (IF4IT)
     Email: Frank.Guerino@if4it.com
     LinkedIn: https://www.linkedin.com/in/frankguerino/
     Follow Us on Twitter: @IF4IT
     Co-Author: Dr. Joel Kline, PhD, Board of Advisors, The International Foundation for Information Technology (IF4IT); Professor, Lebanon Valley College, PA, USA
  2. The Future is Automated Synthesis of Knowledge Repositories
     ✖ Actor: Bob. Actions: Bob is very competent. Bob outperforms other people by generating one great knowledge article per hour. Results: Few knowledge repositories, limited content, poor curation, lots of dead links, and no semantic relationships.
     ✔ Actor: Automated Content Generation Software (Bob’s replacement). Actions: Bob’s replacement generates millions of higher quality, highly curated, and semantically inter-linked knowledge articles in the time it takes Bob to create just one, at a fraction of the cost. Results: More knowledge repositories, far more content, greater curation, almost no dead links, and semantic relationships.
     Read More: https://www.if4it.com/knowledge-management-automated-content-generation-and-curation/
  3. The Wikipedia Problem
     • The Wikipedia Community is NOT like an Enterprise Work Community:
       - About 17 years to develop
       - Over 130M voluntary editors (i.e. free labor)
       - Over 6M content articles
     • People believe they can build internal knowledge repositories (like libraries and intranets) using the same manual content development paradigm as Wikipedia.
     • The end result is almost always the same: “Relatively empty and low value Knowledge/Content Repositories”. People often can’t find the answers they need.
     Read More: https://www.if4it.com/wikipedia-problem-understanding-enterprise-knowledge-repositories-fail/
  4. The Problem is Manual Labor
     • Quantity: Low quantities of artifact delivery.
     • Quality: Higher levels of human-introduced errors.
     • Time: Longer artifact delivery times.
     • Money: High costs for delivery of artifacts.
     Trend: Knowledge Repository Automation is very important because, more often than not, the teams that build repositories have very limited resources (people and finances).
     Trend: With the move to “Digital”, expectations of Knowledge Repositories are even higher.
  5. The Solution = Automation via Compilation
     • The process is called Synthesis (a.k.a. Compilation):
       - Compilation is the word used by software developers
       - Synthesis is the word used by non-software developers
     • Specifically, we use and recommend Data Driven Synthesis (DDS)
     • We use Compiler-based DDS to generate content, curate content, interlink content, and automatically build and provision Knowledge/Content repositories
     Read More: https://www.if4it.com/understanding-data-driven-synthesis/
  6. Many Decades of Successful Synthesis
     • Synthesis/Compilation of Software (since the 1970s)
     • Synthesis of Integrated Circuit Schematics (since 1992)
       - Inputs are Hardware Description Languages (HDLs) like VHDL and Verilog
       - Outputs are used for Simulation, Acceleration, Emulation, and Fabrication
     • Synthesis of APIs and software code (i.e. Scaffolding for Software Developers, such as for Java Spring and Ruby on Rails)
     • Synthesis of large volumes of test data to exercise complex systems
     • Synthesis of chemical Compounds for Drug Discovery
     • Synthesis of Health Care Pathways (Diagnosis + Treatments)
     • Synthesis of (computer generated) Music and Art
     • Synthesis of Electronic Documentation (i.e. data driven content)
     • Synthesis of Digital Libraries (massive web sites)
     • Synthesis of Semantic Data Graphs (SDGs)
  7. Who cares about DDS-based automation?
     • Internet and Intranet Web Content Managers & Developers
     • Technical Writers / Technical Communicators
     • Architects (Enterprise/Solutions/Business/Applications/Data/etc.)
       - Enterprise Models
     • Software Developers (using Compilation for about 5 decades)
       - API Documentation
       - Software Configuration Documentation
     • Engineers (using Synthesis for about 3 decades)
       - Hardware, Network, Communications, & Semiconductor Documentation
     • Anyone who documents topics, curates them, and publishes the results to web pages in some Content/Knowledge Repository
  8. Common Use Cases Driving DDS
     • Strategic Planning: Enterprise Portfolio Impact Analysis
     • Faster Domain Documentation: More inter-linked documentation, with interactive data and with fewer errors, at far lower costs
     • Better Customer Support: Rapid and more accurate Incident Impact Analysis
     • Better Operational Work: Faster Knowledge Discovery = faster & better work decisions
     • Lower Development Costs: Synthesis helps eliminate significant Software Development
     • Better Search & Discovery: Synthesis helps yield better & more accurate Search Results
     Result: Higher Levels of Customer / End-User Satisfaction
  9. Synthesis is Compiler-based
     • Software Compilation/Synthesis: Source Code Files → Software Compiler/Synthesizer → Compiled Software
     • Data Compilation/Synthesis: Baseline Input Data + Processing Rules → Data Compiler/Synthesizer → Synthesized Output(s)
       - Baseline Input Data: Flat files like *.csv sourced from spreadsheets and systems.
       - Processing Rules: Control ontologies, formatting, view controls, report generation, semantic relationship harvesting, etc.
       - Synthesized Output(s): Used by machines like computers AND by humans.
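The data-compilation flow on this slide (baseline input data plus processing rules in, synthesized outputs out) can be illustrated with a minimal sketch. This is not the NOUNZ compiler or its rule format; the CSV columns, the `RULES` dictionary, and the `compile_pages` function are all hypothetical names chosen for illustration.

```python
import csv
import html
import io

# Hypothetical baseline input: a flat CSV inventory, as might be exported from a spreadsheet.
APPLICATIONS_CSV = """Name,Owner,Description
Payroll,Finance,Pays employees
CRM,Sales,Tracks customers
"""

# Hypothetical processing rules: the noun type, the column that titles each page,
# and the columns to display. A real rule set would cover far more.
RULES = {
    "noun_type": "Application",
    "title_column": "Name",
    "show_columns": ["Owner", "Description"],
}


def compile_pages(csv_text, rules):
    """Compile CSV rows into a dict of {file name: HTML page body}."""
    pages = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        title = row[rules["title_column"]]
        items = "".join(
            f"<li>{html.escape(col)}: {html.escape(row[col])}</li>"
            for col in rules["show_columns"]
        )
        heading = f"<h1>{html.escape(rules['noun_type'])}: {html.escape(title)}</h1>"
        file_name = title.lower().replace(" ", "-") + ".html"
        pages[file_name] = f"{heading}<ul>{items}</ul>"
    return pages


pages = compile_pages(APPLICATIONS_CSV, RULES)
for file_name in sorted(pages):
    print(file_name)
```

Changing the data or the rules and re-running regenerates every page, which is the property the compiler analogy relies on.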
  10. Benefits of DDS
      • Agile: Changes can be made iteratively and in seconds/minutes
        - Simple CSV flat files can be compiled
        - No long software development cycles
      • Scalable: Hundreds of thousands or millions of content pages can be generated in minutes
      • Stable: Elimination of human errors, like dead links, leads to far higher levels of quality
      • Affordable: The cost per content page (including both Quantity and Quality) is a small fraction of that of manually generated content
  11. The Synthesis Sequence of Events
      Step 1. Synthesizer Inputs (from spreadsheets and systems), each for example a .CSV file:
        • Application Data
        • Capability Data
        • Human Resource Data
        • Product Data
        • Service Data
        • Facility Data
        • Organization Data
        • Etc. Data
      Step 2. Processing Rules for:
        • Relationship Discovery
        • Data Formatting
        • View Generation
        • Report Calculations
        • Etc.
      Step 3. Data Synthesizer / Data Compiler
      Step 4. Intranet / Digital Library, containing:
        • Node Views
        • Data Graph/Network Relationships (CI (x), CI (y), CI (z))
        • Business Intelligence: Inventories, Reports, Graphs & Charts, Glossaries, Dashboards, Visualizations, Abbreviations, Acronyms
        • Data Indexes
        • Catalogs
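The "Relationship Discovery" rule in this sequence can be sketched as follows. The slides do not describe how the compiler discovers relationships, so the matching rule here is an assumption: a cell whose value equals the `Name` of a known record is treated as a reference, and the column header is used as the predicate. The inventories, column names, and function names are hypothetical.

```python
import csv
import io

# Hypothetical inventories keyed by Noun Type; column names are illustrative only.
INVENTORIES = {
    "Application": "Name,Runs On,Owned By\nPayroll,Server-01,Finance\nCRM,Server-02,Sales\n",
    "Computing Server": "Name,Located In\nServer-01,New York\nServer-02,London\n",
    "Organization": "Name\nFinance\nSales\n",
}


def load(inventories):
    """Parse each CSV inventory into a list of row dicts."""
    return {
        noun: list(csv.DictReader(io.StringIO(text)))
        for noun, text in inventories.items()
    }


def discover_relationships(tables):
    """Return (subject, predicate, object) triples.

    Assumed rule: a cell value equal to the Name of any known record is a
    reference to that record, and the column header is the predicate.
    """
    known_names = {row["Name"] for rows in tables.values() for row in rows}
    triples = []
    for rows in tables.values():
        for row in rows:
            for column, value in row.items():
                if column != "Name" and value in known_names:
                    triples.append((row["Name"], column, value))
    return triples


triples = discover_relationships(load(INVENTORIES))
for subject, predicate, obj in triples:
    print(f"{subject} --[{predicate}]--> {obj}")
```

Values such as "New York" produce no relationship here because no inventory defines a record with that name; adding a Location inventory would turn them into links without changing the code.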
  12. Real Business Impacts
      ✖ The Traditional Way = $$$$$$$$$$$$$$$$$$$ (Complexity: many years and countless resources). Too many complex, expensive, difficult to deliver and operate systems and tools, just to get to a comprehensive view of your enterprise:
        • Intranets / Content Management Systems (Confluence, Jive, Drupal, MediaWiki, etc.)
        • Architecture Modeling Tools (AMTs) (Troux, Mega, Adaptive, System Architect, etc.)
        • Configuration Management Databases (CMDBs) (HP, BMC, ServiceNow, etc.)
        • Stand-Alone Knowledge Management Systems (Madcap, KPS, Bitrix, SalesForce, ServiceNow, etc.)
        • Library Management Systems (LMSs) (Koha, Soft Link, NGL, LibSys, Folet, etc.)
        • Semantic Data Systems (Cambridge Semantics, Protégé, Swoop, LDIF, etc.)
        Added costs: Expensive Integration, Expensive Business Intelligence & Reporting, Expensive People with Specific Skills.
      ✔ DDS Results = $ (Simplicity: minutes/hours and a small number of resources). A very simple, very quick, and very affordable “Compiler Based Approach”:
        Your Data + Your Rules → Data Synthesizer / Data Compiler (Your Compiler) → Your Branded Digital Libraries (complete with Catalogs, Indexes, Relationships, Data Views, Reports, Dashboards, Visualizations, etc.)
  13. Compiler-based DDS helps generate “Knowledge Structures”
      1. Content: High quantities, richly formatted, highly structured, and strongly inter-linked
      2. Interactive Data Visualizations: For Interactive Analytics, Data Science, and Visual Discovery
      3. Knowledge Repositories: Fully curated structures like advanced Intranets and Digital Libraries
      Read More: https://www.if4it.com/knowledge-management-understanding-knowledge-structures/
  14. 1. Content: SFN over LFN
      ✖ Raw and unstructured human narrative in the form of “content” (not “data”).
      ✔ Highly structured data, based on Name/Value pair paradigms (e.g. CSV, JSON, etc.).
  15. 2. Interactive Data Visualizations
      • Data Science and Data Scientists are VERY expensive.
      • DDS creates a common set of fully integrated Data Visualizations.
      • DDS automatically creates many more out-of-the-box and ready-to-use Data Visualizations, faster and at far lower costs.
      Example sources: VisualComplexity.com, D3js.org
  16. Interactive Data Visualization Examples
      • Geographic Maps
      • Force Directed Graphs
      • Bubbles
      • Condegram Spirals
      • Bars, Pies, Lines
      • Sankey Flows
      • Chords
      • Multivariate Grids
      See many interactive examples in the gallery at: http://www.d3js.org
  17. 3. Knowledge Repositories
      Read More: https://www.if4it.com/nounz/
      Generic Example: http://nounz.if4it.com
      Domain-Specific Example: http://km.if4it.com
  18. The Spectrum of Synthesizable Knowledge Structures
      Range of Synthesizable Knowledge Structures, from simplest to most complex:
      • Simplest Knowledge Structures: Bits and Bytes; Built-In Types and Constants; Lists, Arrays, and Hash Tables; Stacks and Heaps; For Loops, Do Loops, and While Loops; Formulas and Algorithms; Buffers, Streams and Files; Classes and Objects
      • Simple Knowledge Structures: Data Records/Nodes; Tables & Inventories; Charts (Pie, Bar, Area, Bubble, etc.); Graphs (Line, Multi-Line, etc.); Web Pages; Catalogs; Indexes; Reports; Semantic Relationships; Semantic Predicates
      • Moderately Complex Knowledge Structures: Dashboards; Data Visualizations (many different visualizations); Semantic Data Graphs (SDGs) / Semantic Data Networks (SDNs); HTML Link Networks; Navigation Taxonomies; Classification Taxonomies
      • Complex Knowledge Structures: General Web Sites; Intranets; Architecture Models; Architecture Repositories; Configuration Management Databases (CMDBs); Domain-specific Knowledge Repositories
      • Super Complex Knowledge Structures: Multi-Context/Multi-Domain Digital Libraries that include all other structures in the spectrum; Industry Specific Determinations (Automatic Claim Processing, New Viable Drugs, Healthcare Care Pathways, High Frequency Auto-Investing, etc.)
      Example Formats = TXT, CSV, TSV, JSON, XML, HTML, SVG, PDF, etc.
      Read More: https://www.if4it.com/knowledge-management-understanding-knowledge-structures/
  19. DDS Solves the Wikipedia Problem for Enterprises
      • Quantity: Much higher quantities of artifact delivery.
      • Quality: Much higher levels of quality.
      • Time: Much shorter times for artifact delivery (i.e. much higher quantities with higher quality).
      • Money: Much lower costs to deliver artifacts (especially for Data Science & Data Visualizations).
      Result: Faster & better knowledge discovery and decision making
  20. The Benefits of DDS
      • More and Better Knowledge Repositories
        - Far higher quantities of more advanced content
        - More advanced features and capabilities
        - Dynamic integration of data with content
        - Higher quality of content (e.g. far fewer dead links)
        - Far less investment of time and funds
      • Higher stakeholder satisfaction and engagement
  21. Getting Started with DDS
      1. Acquire a Data Compiler/Synthesizer
         • Contact IF4IT for a free NOUNZ Lite compiler: https://www.if4it.com/contact-us/
      2. Start with simple Spreadsheet-based Inventories (and SharePoint List Structure extracts)
      3. Incrementally customize small data sets to meet your needs and your desired look-and-feel
      4. Slowly progress to more complicated Data Extracts (from proprietary systems)
      5. Keep in mind that Time-To-Learn is “incremental” (you don’t have to start with big projects)
      Crawl, Walk, Run
  22. Questions and Discussion
      Frank Guerino, CEO & Chairman, The International Foundation for Information Technology (IF4IT)
      Email: Frank.Guerino@if4it.com
      Twitter: @IF4IT
  23. Read More:
      • Automated Content Generation & Curation: https://www.if4it.com/knowledge-management-automated-content-generation-and-curation/
      • The Wikipedia Problem: https://www.if4it.com/wikipedia-problem-understanding-enterprise-knowledge-repositories-fail/
      • Understanding Data Driven Synthesis: https://www.if4it.com/understanding-data-driven-synthesis/
      • Understanding Knowledge Structures: https://www.if4it.com/knowledge-management-understanding-knowledge-structures/
      • Learn about D3 and Interactive Visualizations: http://www.d3js.org
      • Learn about the IF4IT NOUNZ Data Compilation Platform: https://www.if4it.com/nounz/
      • See Interactive Example of DDS-generated Generic Digital Library: http://nounz.if4it.com (Less than 3 minutes to generate.)
      • See Interactive Example of DDS-generated KM Body of Knowledge: http://km.if4it.com (Only seconds to generate.)
  24. Appendix: Real Case Studies
  25. Global Biopharmaceutical
      Compiler log:
      -- TOTAL Administration Category Noun Instances = 5: Time = Wednesday June 15, 2016 at 10:04:08
      -- TOTAL Assay Noun Instances = 749: Time = Wednesday June 15, 2016 at 10:04:08
      -- TOTAL Biological Matrix Category Noun Instances = 42: Time = Wednesday June 15, 2016 at 10:04:08
      -- TOTAL Biomarker Noun Instances = 42: Time = Wednesday June 15, 2016 at 10:04:08
      -- TOTAL Company Noun Instances = 18: Time = Wednesday June 15, 2016 at 10:04:08
      -- TOTAL Disease Mechanism Noun Instances = 17: Time = Wednesday June 15, 2016 at 10:04:08
      -- TOTAL Facility Noun Instances = 3: Time = Wednesday June 15, 2016 at 10:04:08
      -- TOTAL Immunoassay Platform Noun Instances = 6: Time = Wednesday June 15, 2016 at 10:04:08
      -- TOTAL Instrument Category Noun Instances = 5: Time = Wednesday June 15, 2016 at 10:04:08
      -- TOTAL Instrument Noun Instances = 37: Time = Wednesday June 15, 2016 at 10:04:08
      -- TOTAL Offering Noun Instances = 516: Time = Wednesday June 15, 2016 at 10:04:09
      -- TOTAL Program Category Noun Instances = 5: Time = Wednesday June 15, 2016 at 10:04:09
      -- TOTAL Study Type Noun Instances = 17: Time = Wednesday June 15, 2016 at 10:04:09
      -- TOTAL White Paper Noun Instances = 28: Time = Wednesday June 15, 2016 at 10:04:09
      -- TOTAL Application Noun Instances = 1000: Time = Wednesday June 15, 2016 at 10:04:09
      -- TOTAL Business Domain Noun Instances = 9: Time = Wednesday June 15, 2016 at 10:04:09
      -- TOTAL Capability Noun Instances = 32: Time = Wednesday June 15, 2016 at 10:04:09
      -- TOTAL Computing Server Noun Instances = 100: Time = Wednesday June 15, 2016 at 10:04:09
      -- TOTAL Contract Noun Instances = 1166: Time = Wednesday June 15, 2016 at 10:04:09
      -- TOTAL Country Noun Instances = 251: Time = Wednesday June 15, 2016 at 10:04:09
      -- TOTAL Customer Noun Instances = 150: Time = Wednesday June 15, 2016 at 10:04:10
      -- TOTAL Database Noun Instances = 100: Time = Wednesday June 15, 2016 at 10:04:10
      -- TOTAL Data Transport Technology Noun Instances = 4: Time = Wednesday June 15, 2016 at 10:04:10
      -- TOTAL Environment Noun Instances = 8: Time = Wednesday June 15, 2016 at 10:04:10
      -- TOTAL Frequently Asked Question Noun Instances = 32: Time = Wednesday June 15, 2016 at 10:04:10
      -- TOTAL Information Category Noun Instances = 16: Time = Wednesday June 15, 2016 at 10:04:10
      -- TOTAL Interface Noun Instances = 99: Time = Wednesday June 15, 2016 at 10:04:10
      -- TOTAL Language Code Noun Instances = 504: Time = Wednesday June 15, 2016 at 10:04:10
      -- TOTAL Letter Noun Instances = 26: Time = Wednesday June 15, 2016 at 10:04:10
      -- TOTAL Location Noun Instances = 50: Time = Wednesday June 15, 2016 at 10:04:10
      -- TOTAL Market Sector Noun Instances = 2: Time = Wednesday June 15, 2016 at 10:04:10
      -- TOTAL Market Segment Noun Instances = 2: Time = Wednesday June 15, 2016 at 10:04:10
      -- TOTAL News Article Noun Instances = 6: Time = Wednesday June 15, 2016 at 10:04:10
      -- TOTAL Number Noun Instances = 9: Time = Wednesday June 15, 2016 at 10:04:10
      -- TOTAL Organization Noun Instances = 29: Time = Wednesday June 15, 2016 at 10:04:10
      -- TOTAL Policy Noun Instances = 100: Time = Wednesday June 15, 2016 at 10:04:10
      -- TOTAL Process Noun Instances = 26: Time = Wednesday June 15, 2016 at 10:04:10
      -- TOTAL Product Noun Instances = 25: Time = Wednesday June 15, 2016 at 10:04:10
      -- TOTAL Project Noun Instances = 1000: Time = Wednesday June 15, 2016 at 10:04:10
      -- TOTAL Resource Noun Instances = 14: Time = Wednesday June 15, 2016 at 10:04:10
      -- TOTAL Sales Transaction Noun Instances = 886: Time = Wednesday June 15, 2016 at 10:04:11
      -- TOTAL SDLC Activity Noun Instances = 353: Time = Wednesday June 15, 2016 at 10:04:11
      -- TOTAL SDLC Phase Noun Instances = 14: Time = Wednesday June 15, 2016 at 10:04:11
      -- TOTAL Service Noun Instances = 561: Time = Wednesday June 15, 2016 at 10:04:11
      -- TOTAL Software Noun Instances = 100: Time = Wednesday June 15, 2016 at 10:04:11
      -- TOTAL Glossary Term Noun Instances = 235: Time = Wednesday June 15, 2016 at 10:04:11
      -- TOTAL Vendor Noun Instances = 100: Time = Wednesday June 15, 2016 at 10:04:11
      -- TOTAL Undefined Noun Type Noun Instances = 1: Time = Wednesday June 15, 2016 at 10:04:11
      TOTAL Number of Unique Noun Types = 48: Time = Wednesday June 15, 2016 at 10:04:11
      TOTAL Noun Instances registered = 8500: Time = Wednesday June 15, 2016 at 10:04:11
      TOTAL Number of Unique Abbreviations or Acronyms = 655: Time = Wednesday June 15, 2016 at 10:04:11
      TOTAL Number of Unique Semantic Relationships = 30767: Time = Wednesday June 15, 2016 at 10:04:15
      TOTAL Number of Unique Semantic Relationship Predicates = 97: Time = Wednesday June 15, 2016 at 10:04:15
      TOTAL Minimum Number of HTML Links = 113536: Time = Wednesday June 15, 2016 at 10:07:27
      Notes:
      • Spreadsheets were used to easily and quickly collect, organize, and supply data to the NOUNZ Compiler in 1st Normal Form CSV formats.
      • Vertical industry and business data was collected from a public Biopharma web site, then organized and cleansed in about 5 hours.
      • Generic IT Data was intentionally comingled with Biopharma vertical industry and business data, in order to show the effects of mixing different data types.
      TOTALS:
      • Total unique Noun Types (Data Types) = 48
      • Total Catalogs = 50
      • Total Noun Instances (across all Noun Types) = 8500
      • Total Semantic Relationships = 30767
      • Total Semantic Predicates = 97
      • Total Abbreviations and Acronyms = 655
      • Total “minimum” # of HTML links = 113536
      • Total Compile Time = 3 Minutes and 27 Seconds
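To put the reported totals in perspective, the short calculation below derives a few ratios from them. The input numbers are taken directly from the totals on this slide; the derived metrics (relationships per instance, links per second, and so on) are not stated in the presentation and are offered only as a back-of-the-envelope illustration.

```python
# Totals as reported on the Global Biopharmaceutical slide.
noun_instances = 8500
semantic_relationships = 30767
html_links = 113536
compile_seconds = 3 * 60 + 27  # "3 Minutes and 27 Seconds"

# Derived ratios (not from the slides; simple division of the reported totals).
relationships_per_instance = semantic_relationships / noun_instances
links_per_instance = html_links / noun_instances
instances_per_second = noun_instances / compile_seconds
links_per_second = html_links / compile_seconds

print(f"Compile time in seconds:         {compile_seconds}")
print(f"Relationships per noun instance: {relationships_per_instance:.2f}")
print(f"HTML links per noun instance:    {links_per_instance:.2f}")
print(f"Noun instances per second:       {instances_per_second:.1f}")
print(f"HTML links per second:           {links_per_second:.1f}")
```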
  26. Regional Health Care Payer/Insurer
      • 47 defined Noun Types (a.k.a. Data Types)
      • Almost 49,000 Noun Instances (a.k.a. Data Instances or Records) that are sourced from the different Noun Types
      • Almost 294,000 automatically synthesized web pages with different views of data and information
      • Over 300K automatically discovered and harvested Semantic Relationships that translate directly to over 1,100,000 contextual and meaningful HTML links
      • 46 total Catalogs, including a Master Catalog, 47 Noun Domain Specific Catalogs (one for each Noun Type), an Abbreviations/Acronyms Catalog, and a Relationship Predicates Catalog
      • 288 unique Indexing Categories with 2582 unique Data Indexes
      • 869 harvested and curated Abbreviations and Acronyms
      • Over 1,600 unique semantic relationship descriptors (i.e. Predicates)
      • 47 Domain Specific Dashboards (one for each Noun Type)
      Total Compiler Time = Approximately 15 minutes
