Product data processing 30.08.2011 gg



Date: 30.08.2011
Product Overview
Owner: Jair

Introduction

The primary objective of this document is to consolidate context and interfacing for team members engaged in product development. We begin with an overview of key product systems, their components, system data flows and the roles that key components play in each flow. A set of appendices then dives into specific systems and interfaces in more detail, each appendix owned by a specific team member. The terms defined in this document should be the terms used in all other related documents. Questions and clarifications should be directed to the specific section owners so that this document can continue to improve and expand to achieve its objectives.

Kinor Spheres and Apps

The Kinor mission is to provide powerful tools for ordinary 'workers' (knowledge workers) to collaboratively harvest information (data and content) from any source (web and otherwise) in a manner that best serves the needs of each worker in a fully automated, private and personalized way. From a business perspective, the product conceptually comprises the following:

1. Spheres – The information harvested from one or more sources is maintained in 'spheres', each sphere covering a specific domain of common interest to a group of sphere workers. Each worker group typically serves a specific business, organization or community. Typical harvested sources include web-based catalogues, professional publications, news feeds, social networks, databases, Excel worksheets and PDF documents, both public and private.

2. Pipes – Each source is conceptually connected to a sphere via a pipe that pumps harvested information from the source to a specific sphere on an ongoing basis. Spheres can be fed by any number of pipes, the pipes primed (configured) by a non-technical sphere administrator or collectively (crowd-sourced) by the workers themselves.
The output of each pipe is a semantic dataset that is published to the sphere and maintained there for automated refresh via the pipe. The dataset is semantic in the sense that data within it is semantically tagged in a manner that enables subsequent data enrichment, integration and processing to be fully automated.

3. Apps – A growing spectrum of sphere applications will empower each worker to automatically view and leverage published information within the sphere in a fully personalized way. The initial apps will be horizontal, i.e. readily applied to any sphere, with each worker configuring the app to match personalized needs. Sample horizontal apps will:

a. Enable workers (interactively or via API) to easily find the information that they need within a sphere and deliver it in the most useful form.

b. Automatically mine sphere information for worker-configured events, sentiments and inferences.
c. Automatically hyperlink and annotate sphere information with additional information within a sphere as prescribed by sphere-administrator or worker-defined rules.

Horizontal apps will pave the way to an even greater number of pre-configured vertical apps drawing information from specific spheres, i.e. a specific ontology with appropriate pipes. Once the sphere core has been configured, vertical apps provide instant out-of-the-box value available to all customers. Horizontal apps, on the other hand, enable workers to independently or collaboratively develop their own spheres with unique value available only to them.

Harvested information can either be cached (replicated) within the sphere or acquired on demand from the source. When dealing with unstructured and semi-structured sources of information, e.g. the Web, the harvested information will typically be replicated within the sphere unless the volume of information is prohibitive. When dealing with fully structured sources, e.g. a database, the harvested information can be acquired on demand if this will not disrupt operations for higher-priority source access. Needless to say, the response time for on-demand acquisition will depend heavily upon the volume of information, the source availability and responsiveness, and the semantic complexity of that information, i.e. the computational resources needed to semantically tag and process it.

Semantic Intelligence

Kinor's empowerment of non-technical workers is achieved by cultivating and applying semantic intelligence to automate every possible aspect of harvesting, processing and applying information. The semantic intelligence is maintained in a knowledge base (KB) linked to a growing web of ontologies mapped to a common super-ontology. Each sphere addresses a specific domain of interest, e.g. online book stores.
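As a toy illustration, a sphere ontology for the online book store domain might conceptually contain entity schemas (frames) like the following. This is a minimal Python sketch with illustrative names only; the actual Kinor KB representation is not specified in this document.

```python
from dataclasses import dataclass

# Hypothetical, highly simplified model of sphere ontology frames.
# Entity and slot names come from the book-store example in the text.

@dataclass
class EntitySchema:
    """A frame describing one entity type in the sphere ontology."""
    name: str
    slots: list  # anticipated properties (frame slots)

book = EntitySchema("Book",
                    ["title", "author", "publisher", "ISBN", "year_of_publication"])
publisher = EntitySchema("Publisher", ["name", "city", "state", "country"])

# A sphere ontology is then a set of inter-related frames, all of which
# would be mapped to the common super-ontology.
sphere_ontology = {s.name: s for s in (book, publisher)}
```

The point of the sketch is only that each sphere carries a schema per entity type, with a slot per anticipated property, against which harvested data can later be semantically tagged.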
During the information acquisition stage, the KB associated with the sphere ontology semantically tags identified entities and their properties within that information. The sphere ontology must therefore contain a set of inter-related schemas (frames) that describe all entities in the sphere, e.g. books, authors, publishers, suppliers, prices and reviews. Each entity schema must also contain anticipated properties (frame slots), e.g. book properties might include title, author, publisher, ISBN and year of publication.

Kinor internally refers to each entity property as an atom, each atom assigned a predefined semantic tag and each atom value a predefined atom type. Thus, for example, a 'year of publication' must be a valid year and a 'book publisher' must be a valid publisher name. Atom types are typically recognized by a set of text patterns (e.g. a four-digit year) or a name thesaurus (e.g. formal publisher names and known synonyms for these names). Atom filters auto-recognize specific atom types while considering both the atom value and the atom context in which the value was found, e.g. a column name (e.g. 'home phone') or a prefix (e.g. '(H)'). Armed with adequate semantic intelligence, blocks of information piped into a sphere are automatically parsed by atom filters into records (e.g. book records) of semantically tagged properties to be associated with a specific entity (e.g. a specific book instance).

All sphere schemas are mapped to the super-ontology with its shared bank of atom types and their respective atom filters and contexts, so that semantic intelligence can be cultivated collectively (e.g. new filter patterns and thesaurus entries) by all spheres. Atom thesauri in the super-ontology also retain frequently adjoined entity properties (e.g. a given publisher's city, state and country) to facilitate the auto-acquisition of new entity names and synonyms. The super-ontology can thus readily expand automatically with relatively little supervision.

Entity properties from one pipe can subsequently be merged (joined) with entity properties from other pipes when all properties have been attributed to the same entity. Matching entities across pipes can be challenging, since each pipe record must have a unique identifier key (UIK) based upon properties available in each record. A set of properties can uniquely identify an entity with a degree of confidence that can be computed empirically. The ISBN property alone (when available) can serve as an ideal UIK for book entities, whereas a UIK based upon the book title and author properties is somewhat less reliable. Entity records from multiple pipes can only be merged if they have UIKs with adequate confidence levels, to be determined by a sphere administrator or worker.

Semantic intelligence can only operate on sphere information mapped to the sphere ontology. Schemas from public and private ontologies are acquired and retained in an ontology bank mapped to the super-ontology. Unmapped sphere data can then be schema-matched with schemas in the ontology bank to semi-automatically add or expand sphere schemas to map them. When dealing with well-structured catalogue sources, ontologies can be auto-generated from the catalogue structure and data itself.

Key Product Systems

Key product systems include the following:

1. Pipes – The pipes system schedules all tasks related to the harvesting of information from Web and additional sources and the subsequent data processing. Each task is executed by one or more agents distributed in a cloud.
The most common pipe tasks include spidering (collecting designated pages of interest from a specific web site), scraping (extracting designated blocks of information from those pages), cleansing (decomposing those blocks into semantically tagged atoms and normalizing the atom values where possible) and finally importing the semantic dataset into a sphere repository. The pipes system also includes a Wizard for priming the pipes.

2. Spheres – Each sphere retains fresh semantic datasets for each pipe in a query-ready repository (QR) capable of serving a growing number of horizontal and vertical apps. The QR must respond to app queries on demand while also enabling a growing spectrum of ongoing app tasks to process and augment the QR in the background. Each QR atom maintains the history of that atom, starting with the pipe history produced by the pipe. The origins of each atom and value are thus readily traced back to the source and the data processing tasks.

3. Ontology – The ontology system comprises a centralized ontology server (OntoServer) working in unison with any number of ontology plugs (OntoPlug) to apply semantic intelligence to every possible aspect of harvesting, processing and applying sphere information. The OntoServer cultivates and maintains the KB for all spheres while the OntoPlug caches a minimal subset of the KB to serve specific agent tasks.

4. Applications – A web-based user interface provides integrated user access to all authorized applications (apps), including administrator apps for managing the above systems.
5. Framework – A common framework that enables all of the above systems to run securely and efficiently atop any private or public cloud.

6. E-Meeting – An interactive conferencing facility, fully integrated with the product, that enables existing and potential customers to instantly connect with designated support and sales representatives for instant pilots, training, assistance and trouble-shooting.

Key Pipe Components

Within the Pipes system, key components include the following:

1. Wizard – The pipe configuration wizard enables a non-technical user to prime a pipe within minutes, i.e. to direct the pipe in how it should navigate within a Web site to harvest all pages of interest and subsequently extract from those pages all required blocks of information. Very few user clicks are needed to determine:

a. Which deep-web forms (searches and queries) to post (submit) with which input values.

b. Which hyperlinks and buttons to subsequently navigate (follow) to harvest result pages of interest. Note that some result pages may lead to additional pages with different layouts, hence each page must also be associated with a specific layout id.
c. Which blocks of information per page are to be extracted and semantically tagged for each layout id. A scraper filter is subsequently generated per layout id, hence blocks of interest need only be marked for one sample page per layout. Additional sample pages are randomly chosen to test the scraper filter for worker feedback.

d. When it should revisit the site to refresh that pipe dataset.

Throughout this process the wizard will provide feedback regarding the pipe dataset that will be produced using the current pipe configuration, as well as the anticipated price tag for acquiring and refreshing the dataset. The pipe dataset produced for Wizard feedback will use a relatively small sample of pages so that user feedback arrives within seconds, and this dataset will not be published to the sphere. The user can subsequently refine the pipe configuration to better suit user needs. Once primed via the wizard, the pipe can operate autonomously, as depicted in the above diagram by 'Map a website' followed by 'Run', which results in a 'Notification' when the pipe completes its operation ('End').

2. Spider – Any number of spider agents can then interact in parallel with the source web site to harvest all pages of interest. The harvesting is accomplished in two stages:

a. Site spidering – A multi-threaded collection of URLs and subsequent postings and navigations is produced, each entry with a unique id, an order tag (to collate scraper results in the proper order), a parent tag (the id of the page that pointed to it) and a page layout id tag. New pages are readily flagged by comparing the new collection with the previous one. The harvesting of pages can subsequently be parallelized in an appropriate order by allocating subsets of the collection to several agents.

b. Page harvesting – Either all pages or only newly flagged pages are cached in the pipe repository by any number of spider agents, with order tags so that the pages can subsequently be processed in an appropriate order. Each harvested page is recorded in a page index with sourcing tags that include the site name, an order tag, breadcrumb tags (i.e. the posted form inputs and navigations that led to this page), a layout id tag that identifies the scraper filter for extracting blocks from that page, the harvesting date and time, and the site response time for that page.

3. Scraper – The layout id tag is used to apply the appropriate scraper filter to extract the designated information blocks per page and transform them into dataset records. Any number of scraper agents can do this in parallel, each agent producing a dataset of scraped records per page. The page datasets are then merged into a pipe dataset in an appropriate page order. Key scraper filter components include the following:

a. Sequence filter – Matching tag sequences are used to mark designated page blocks. The sequences are robust by being sparse (only key tags are included) and depth sensitive (reflecting how deep they are in the element tree).

b. Block table filters – Blocks with conventional table structures use these structures to parse records, with context tags that include column numbers and headers where available. The filters are robust in that they handle nested tables for a variety of table types while tolerating missing tags. Tables can be either vertical or horizontal.
c. Record markers – New dataset records are identified by table structure or by distinct fields within the record. Thus, if the third field in the record is always a telephone number, the beginning or end of each record is readily found. Record markers take into consideration that some fields might be broken into multiple parts by multiple styles, hyperlinks, multiple lines and other special effects within a single field.

d. Context – Context is crucial to automated semantic tagging, hence all relevant column headers and value prefixes (e.g. QTY in 'QTY:50') are extracted and attached to the relevant field values (e.g. '50'). Frequent contexts are retained in the sphere ontology so that probable context can be auto-identified by structure (e.g. table position), style (e.g. color/emphasis) and content (e.g. frequent context values).

e. uFilters (micro-filters) – Block filters may contain any number of uFilters to mark records and context as the block information is being parsed. The set of uFilters is applied to every field in the block, each uFilter checking for a specific combination of field attributes that include the following:

i. Field content – A specific text (Equals, StartsWith, EndsWith) or equivalence with the page title as identified by TITLE tags.
ii. Field atom type – Contains a designated text pattern or a name found in a designated thesaurus.
iii. Field style – Has a designated set of style attributes, e.g. font color/size, emphasis, header level (e.g. H2).
iv. Field ID – Was preceded by a designated "id=" tag.
v. Field column – Appears in a specific table column.

All uFilters are applied to their designated blocks of information before the record parsing begins. uFilters can also be configured to mark nearby fields:

vi. Field displacement – The marked field is a given number of fields before/after.
vii. Field expand – Fields before/after are also marked until specific tags are matched.

f. Active atom filters and patterns – uFilters are applied to all block fields, hence the need to minimize the computational requirements of each uFilter. Atom filters can be heavy, hence only those filters that appear in the uFilters are active. Moreover, when dealing with atom filters with several patterns, only patterns that actually captured atoms during the pipe priming will be active. An active-atoms tag in each scraper filter must contain the active atoms and patterns so that only these will be armed in the OntoPlug serving the scraper.

g. Variable records – Some pages may contain more record fields than others, whereupon placing the right fields into the right columns can be challenging. Context tags are helpful only if they are available, hence the need for additional field context as defined by the uFilters that captured each field. Field context tags include TITLE, atom type, field style, field ID and field column.

h. Layout uFilters – Inconsistencies in the page templates (the actual template may depend upon the number of results) may cause the sequence filter to break. Optional layout uFilters mark block begin/end as a backup.
i. Block relationships – Each record block is ultimately parsed into a sequence of semantically tagged atoms that belong to a specific record. This specific record might be one of several records in a table block on a parent page. A table block contains several records (e.g. pricing options) that might all belong to a single record. Hence block relationships, both between pages and on the same page, are crucial. The following assumptions are made:

i. All record blocks on the same page belong to the same record.
ii. All blocks on the same page belong to a specific record on the parent page that pointed to that page.
iii. A table block belongs to the same record (as a table within a record) as the record blocks on the same page.
iv. An attachment (history) block belongs to all records in the table block on the same page.

j. Record categories – Records extracted from a single site might all belong to one or more categories, e.g. in an online catalogue in which different product categories will have different sets of columns. A page dataset must therefore be associated with a specific category. The category might appear on the page as a category block, or it might be auto-identified by the breadcrumbs that led to that page.

k. Category taxonomies – When comparing records from multiple pipes it might be necessary to compare only those that belong to the same category. In such cases there is a need to develop a standard category taxonomy for the sphere and to map the local pipe categories to that standard taxonomy. This is referred to as automated category harmonization, and it is carried out by the Ontology system.

Upon selecting a page block and indicating the nature of the block contents (table, record, category or attachment), the generation of candidate filters is fully automated, whereupon the selection of a specific candidate filter is finalized by the wizard 'by example'.
This means that the candidates are ordered by probability and a 'guess' button enables the Wizard to try each candidate filter until the user is satisfied with the results in the Collected Data pane.

4. Tables (kTable) – All datasets produced by the pipe are maintained in a kTable component that is first produced by a Scraper agent and then passed on to additional pipe agents. Key aspects include:

a. Columns, atoms and column relationships.
b. Tables within tables and block relationships.

5. Cleanser – Tables are further refined by cleanser agents that may operate on specific sets of records and columns. Any given kTable may undergo several cleanser iterations as the semantic intelligence expands to cover more table columns and specific atoms. Hence subsequent cleanser iterations will typically focus on specific columns. Specific columns might also be tagged by the Wizard to remain 'as is'. Similarly, the OntoPlug will indicate to the cleanser which columns can't be cleansed due to a missing semantic tag or a semantic tag with inadequate intelligence in the KB. Key cleanser operations include:

a. Column decomposition – Identifying atoms within a field and creating new columns for each of them. Decomposition can be recursive when atoms can be further decomposed into more refined atoms.
b. Atom normalization – Atoms that can be normalized are normalized as directed by the atom normalization tags associated with each column. All transformations are documented in the kTable history so that the origin of each atom and value is readily traced. The refinement state of each kTable (e.g. the percentage of cleansed columns and of decomposed and normalized cells) is updated so that the quality impact of each agent can also be instantly assessed.

6. Enrich – The sphere administrator is tooled with a horizontal app that analyzes QR data to identify potential rules that govern entity properties. Thus, for example, a memory chip with a given capacity, error checking capability and packaging will always have a given number of pins. Similarly, a person under age will not be married or have a driver license. The administrator may add rules that reflect domain knowledge, and the app is semi-supervised to ensure that incidental patterns are not adopted as rules. The enrich agent can then use these rules to fill in missing data and flag suspicious data. Rules may also have confidence levels that are reinforced or diminished by additional data and administrator confidence. The kTable history must document all rules that affected a transformation, and a follow-up tag is added to all suspect atoms to ensure that they are subsequently given the right attention. All tags also record when they were created and by whom, to evaluate the quality of treatment. The enrich agent may also insert additional entity properties found in the super-ontology for identified entities.

7. Import – This agent imports a kTable into the QR when it is ready. The importer must also concatenate all record snippets from multiple blocks and pages into complete records. This is also where the UIK is determined via the OntoPlug.

As implied by the above, the pipe value proposition already includes the following:

1. Semantic tagging that can also serve the auto-generation of RDFa content.
2. Field decomposition.
3. Atom normalization.
4. Enriched data, e.g. filling in missing fields using rules and additional entity properties from the super-ontology.
5. Flagged suspicious data.
6. The best available UIK for these records, with a confidence level.

To emphasize this value proposition and manage customer expectations, the Wizard will reflect the pipe output in a Collected Data pane. This means that a small sample of pages will be run through the pipe to produce a kTable that will be presented by the Wizard in that pane for additional user feedback. User feedback may be global (trying a different candidate filter) or local (selecting a specific semantic tag for a column, changing the UIK or configuring a column to be left 'as is'). By default, a column that is not semantically tagged will be parsed syntactically by the cleanser without semantics.

Key Sphere Components

The harvested pipe datasets are published and maintained in a sphere for worker access via a growing spectrum of horizontal and vertical apps. Horizontal apps empower each worker to automatically view and leverage the published datasets in a fully personalized way. Most apps have both ongoing and interactive processes:

Ongoing app processes analyze and augment sphere data to make it more useful. Thus, for example, a financial app might use NLP and additional techniques to identify relevant news and sentiments on a variety of sites. An e-commerce app might seek reviews that compare competing products so that consumers seeking one product can be offered cheaper deals for similar products.

Interactive app processes deliver personalized views to users and respond to consumer activities. Thus, for example, the financial app will deliver the news that each user subscribed to, and the e-commerce site will offer deals that match products being sought via a partnering search engine. Each user might also want to search the sphere datasets for specific information.

The sphere administrator will also want to optimize the quality and availability of the sphere information. To this end there will be a need for additional administrative processes:

Quality optimization processes seek methods to continuously improve the quality of the information, e.g. by identifying rules that will flag erroneous data and by seeking corroborating sources to increase the confidence levels of app results.

Performance optimization processes seek methods to improve app performance, e.g. by maintaining frequently queried aggregated datasets that take precedence over the datasets from which they were aggregated, to improve app response times.

To support the above, the sphere system will initially comprise the following key components:

1. QR – A vertical database containing all data collected from all pipes.
2. QBE – Enabling users to find the sphere information that they need without prior knowledge of the sphere categories, schemas and value formats.
3. Indexer – Building an index of unstructured nuggets so that they can be semantically searched.
4. Rules Builder – Tools for automatically deriving rules and their confidence levels from the QR, and for manually editing and building them, for auto-enrichment and the flagging of suspicious data.
5. kGrid – Delivering ad hoc data integration on-the-fly from disparate online database sources. This will initially serve only cloud-hosted databases until the patent-pending bio-immune security is implemented, whereupon it can also tap into remote private databases.

Conceptually, every entity in the sphere ontology has an independent QR table. When dealing with catalogues, every product entity may have different properties, resulting in a very sparse database. Upon querying for a specific product with given properties, the QBE will identify all properties that sufficiently match the property names and constraints so as to include all relevant products. Properties already covered by the ontology will match all known property names. Properties not covered will seek similar names. Routine OntoServer processes will attempt to identify equivalent properties across sites and add these properties to the ontology.
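The QBE matching idea above, i.e. matching query property names against a sparse product table via ontology synonyms and name similarity, can be sketched roughly as follows. The matching heuristics, names and thresholds here are illustrative assumptions, not the actual QBE implementation.

```python
import difflib

# Toy sphere: sparse product records, each with its own property set.
qr_table = [
    {"title": "DDR3 4GB", "pins": "240", "capacity": "4GB"},
    {"title": "DDR3 8GB", "pin count": "240", "capacity": "8GB"},
    {"title": "SSD 256GB", "interface": "SATA"},
]

# Known property synonyms, as the sphere ontology might record them.
synonyms = {"pins": {"pins", "pin count"}}

def match_property(record, wanted):
    """Find a record key matching the wanted property name:
    first via ontology synonyms, then via plain name similarity."""
    for key in record:
        if key in synonyms.get(wanted, {wanted}):
            return key
    close = difflib.get_close_matches(wanted, record.keys(), n=1, cutoff=0.8)
    return close[0] if close else None

def qbe(table, wanted, value):
    """Return records whose matched property equals the wanted value."""
    return [r for r in table
            if (k := match_property(r, wanted)) and r[k] == value]

results = qbe(qr_table, "pins", "240")  # matches both 'pins' and 'pin count'
```

A query for `pins = 240` thus covers records whose sites labeled the property differently, which is exactly the sparse-database problem the QBE must absorb on behalf of non-technical users.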
kGrid maps online databases to ontological entities so that their data can also be queried on demand and combined with data already within the sphere.

Spheres will typically fall into one of the following classes:

1. Simple – Spheres that can readily be built and applied by non-technical users without assistance.
2. Complex – Spheres in which power users use advanced features to get the job done with appropriate Kinor guidance.
3. Custom – Spheres that require Kinor customization and/or additional features to get the job done.

Progressive sphere improvements will ensure that more and more complex spheres become simple and that fewer spheres require any customization.

Key Ontology Components

The ontology system comprises the following key components:

1. OntoServer – An integrated environment for managing, maintaining and importing/exporting ontologies and super-ontologies. Must also provide an API for OntoPlug access to the KB.

2. OntoPlug – An OntoServer proxy capable of rapidly loading selected portions of the KB from the OntoServer and providing all ontology-based services to all designated clients in the Pipes, Spheres and Apps systems. Current (immediate) OntoPlug clients include the following:

a. Wizard – Determining which atom filters and patterns should be armed for uFilters. Identifying the most probable semantic tags per column. Also auto-suggesting blocks of information that match the sphere ontology.
b. Spider – Using atom types and learned properties to choose the best form inputs.
c. Scraper – Identifying atoms as needed by uFilters to scrape pages.
d. Cleanser – Decomposing and normalizing cell values.
e. Importer – Proposing the best UIKs per dataset.
f. QR – Choosing the best UIKs across datasets and using thesauri to semantically expand queries to cover all synonyms and possibly even instances.
g. Content enrich app – Finding known entities in unstructured text (tokenizing) and enriching the text with the entity properties.

3. Ontology Editor – The ability to view existing ontologies and manually model new ontologies as needed.

4. Ontology Builder – Auto-extension of a sphere ontology using the ontology bank, or via the adoption of pipe schemas as an ontology (e.g. in an online catalogue) and the auto-mapping of other pipe schemas to that ontology (column harmonization).

5. Ontology Trainer – Auto-acquisition of new thesaurus entries and their synonyms from undefined atoms collected by the pipes. New patterns must also be acquired in a semi-supervised process.
6. Category Harmonization – Analyzing pipe categories to produce a category taxonomy or enumerator (id) to which all pipe categories can readily be mapped.

Key Framework Components

The framework system seamlessly deploys the other systems on any private or public cloud. The framework is designed to optimize cloud utilization, measure performance, automate testing and readily support a growing number of apps. Key framework components already include:

1. Repository – Persistent storage for all configuration and operational data serving the pipes, including cached pages, scraper filters, kTables and recorded data.
2. Scheduler – Scheduling pipe agents to meet quality of service (QoS) and refresh requirements while also catering to high-priority agent tasks initiated by the Wizard on demand.
3. Planner – Planning a schedule for
4. Back office apps –
5. Recorder – Recording all activities for testing, performance optimization and exception handling.
6. Auto-tester – Using archived repository content to do regression testing and ascertain that quality and performance only improve with new system versions.
7. Health monitor – Analyzing recorded data to measure KPIs (key performance indicators) that reflect system health and ascertain that it is acceptable and only improving.
8. Agent emulator – To support agent debugging outside the framework.
9. Portable GUI – An integrated web-based framework for all user interactions with these systems, including all apps. It is portable in the sense that it will ultimately support several deployment modes (e.g. with and without code downloads) without modifying the apps.
10. Bio-immune security – Elements of kGrid and its patent-pending bio-immune security will be integrated into the framework to support secure import and export of enterprise data for arbitrary cloud-based applications.

Given the generic nature of this framework, it might ultimately be open-sourced to enable others to develop new pipe agents and apps.
Upon implementing the bio-immune security, the framework might be sold as an independent product for secure cloud computing.

eMeeting System

An interactive web-based conferencing facility must be fully integrated with the product to enable existing and potential customers to instantly connect with designated support and sales representatives for instant pilots, training, assistance and trouble-shooting.

System Flows

Each of the above systems has key data flows, surveyed in the following sections along with the roles played by key components. The key data flows will be reviewed in the following order:

1. Pipe Data Flow
2. Application Data Flows
3. Ontology Data Flows
4. Sphere Data Flows
5. Framework Data Flows

Data flows typically span multiple systems, but each will be addressed in the context of one system.

Pipe Data Flow

Key pipe data flows include the following:

1. Record assembly – Parallel caching and scraping can result in the loss of order and of relationships (parent-child) between records spanning multiple pages.

a. Consider, for example, a table of books (block 1) with links to book details (block 2) on separate pages that include a link to a collection of reviews (block 3) on a single page. Moreover, the table of books may contain additional category information that relates to all books in the table (block 4).

b. Each block then consists of one or more block records, each consisting of one or more data fields that the scraper will extract with their field contexts. Such a dataset could subsequently be imported into the QR in two ways:

i. One table – Data from blocks 1, 2, 3 and 4 are assembled into a single table, whereupon multiple reviews for a book will result in multiple rows per book, one row per review, with extensive duplication.
ii. Multiple tables – Independent tables for category data (block 4), book data (blocks 1 and 2) and review data (block 3), appropriately interlinked.

c. Clearly the latter approach has advantages, but a sphere administrator might prefer the former. By maintaining all scraped and cleansed data as block records, duplicate kTable storage and cleansing is avoided, and the decision as to how to collate the data in the QR can be made in the final import stage. This also simplifies the distribution of pages to agents for caching and scraping: each page can be processed independently, with the exception of cases (e.g. PDF files) in which records roll over from one page to the next.

d. As later detailed, the initial spidering phase must therefore create an index of categories {id, bread crumbs} and pages {id, category, order and relationship}.
The scraper must subsequently create an index of blocks {id, type (table, record, category or attachment), page and relationship} and records {id, block} so that all block records belonging to the same entity/schema can be appropriately collated upon import.
    e. Note that cleansing and enriching are also best applied to block records to avoid unnecessary duplication. Note also that scraper collating of block records would require that each sub-tree of pages be processed by the same scraper in a specific order, using multiple scraper filters to cater to the multiple page layouts.

The kFramework maintains a cache with Page objects for each pipe, each page retaining navigations to the previous and next pages. To support block linkage for subsequent record assembly, kFramework maintains a kRegistry with the following:
 a) Each page object must retain a parent record id; e.g. when a page with an index containing several books links to a page per specific book, each specific book page must link back to a specific record in the index page. The parent record id must consist of the page id, the parent block id containing the index and the link value itself, so that we can subsequently identify the parent index record for each specific page. Block ids are foreign to the spider, so the spider merely registers the LinkValueToPage so that the scraper can later identify the link content and register the block and record ids of the parent record id.

    Class PageRecordId { ParentPageId, ParentBlockId, LinkValueToPage }

 b) Each page object must also identify the scraper filter that must be used to scrape that page. A scraper filter is assigned to each page layout, so the page object need only identify the page layout that it belongs to, each page layout also having a list of layout blocks:

    Class RegisteredLayout { LayoutId, ScrapeFilter, List<LayoutBlock> }
    Class LayoutBlock { LayoutBlockId, BlockType }
    Enum BlockType { Record, Table, Attachment, Category }
    Class PageLayout { LayoutId, List<LayoutBlock> }
    Class RegisteredPage { PageId, PageRecordId, LayoutId, List<BlockId>, BreadCrumbs }
    Class RegisteredBlock { BlockId, PageId, LayoutBlockId }

 c) To keep track of all pages and blocks, kLibrary must retain registries of Layouts, Pages and Blocks; each page object retains its parent and sibling navigations as well as a list of its Blocks:
    Dictionary<LayoutId, RegisteredLayout> LayoutRegistry
    Dictionary<PageId, RegisteredPage> PageRegistry
    Dictionary<BlockId, RegisteredBlock> BlockRegistry

 d) The spider maintains the LayoutRegistry and PageRegistry, whereupon the scraper maintains the BlockRegistry, independently scraping records per BlockId and storing them per BlockId in the kTable.

 e) The assembly of records spanning multiple blocks can then be accomplished by the Importer as follows: Category, Record and Attachment blocks only have a single record per block, whereas a Table block may have several records. We assume any number of Table and Record blocks per page. If there are Record blocks, then all Record blocks are assembled as a single record, all Category and Attachment blocks are linked to it, and any number of Table blocks are linked to it as tables within that record. If there are no Record blocks, each Table block produces any number of records, to which all Category and Attachment blocks are linked.
If the page has a PageRecordId then all records produced are linked to that record. The appropriate record is identified by finding the record in the designated BlockId that has the appropriate LinkValueToMe. Record fields are loaded into the QR as dictated by the sphere ontology. Thus, when loading a record containing a table, if the table contents map into those of an ontological entity linked to other ontological entities in the record, then each entity will be loaded into a different table, appropriately linked. If the table contents map into entity properties with an appropriate cardinality, then it will be loaded into them.

 2. Scraper filters – Scraper filters are automatically generated and tested against sample pages by the Wizard, but they may need to be adjusted by the Wizard after they are applied to all of the pages. The auto-generation and adjustment is accomplished as follows:
    a. One or more blocks per page are marked by the worker as one of the following types: Table, Record, Category or Attachment.
    b. The default block type is determined by the block content, assisted by the OntoPlug. Thus, for example, the presence of bread crumbs suggests a Category block. The presence of a table with multiple records suggests a Table block. The presence of certain types and context may suggest a Record or Attachment block.
    c. Key elements in a scraper filter include:
       i. Tag sequences to identify the page blocks.
       ii. Layout uFilters to identify page blocks, including a potential Title uFilter to identify the beginning of a nugget block.
       iii. The above two mechanisms back each other up in case one fails (e.g. inconsistent HTML templates) or a uFilter gets confused by dynamic content.
       iv. A record/attachment block may have variable sets of fields per page – hence as many fields as possible are provided uFilters to map them to appropriate block dataset columns.
       v. A table block requires mechanisms to detect its headers and records.
If there is an obvious HTML structure, the key table tags are used – else a set of block uFilters is used to identify the headers and the beginning or end of each record.
       vi. In both of the above cases, the generation of uFilters begins with an attempt to find a strong set of uFilter attributes per field, until as many fields as possible are readily captured by a uFilter. A strong uFilter is one that captures a reasonable number of fields. A uFilter in a record block that captures half of the fields might be useful for capturing labels, whereupon it will also be included.
       vii. In a table block, the number of fields captured by each uFilter then serves as a basis for identifying the number of records on the marked page.
       viii. In a table block, each uFilter is also assessed regarding its ability to serve as the basis for breaking the table into records.
       ix. All of the generated uFilters are treated as candidate uFilters, of which the most probable ones are armed for use in the scraper filter. Less probable ones are retained so that the Wizard can show the dataset that would be produced if
they are armed, enabling the worker to choose the right combination by example.
       x. The Wizard also uses the OntoPlug to assign semantic tags and cardinality per uFilter to determine how captured fields will be mapped into block columns.
    d. Subsequent Wizard adjustments merely alter the set of armed uFilters and the semantic tags and cardinalities assigned to them.

 3. Semantic tags – Whereas the scraper can readily extract fields and insert them, with context, into appropriate columns in page block tables, the decomposition of fields with multiple atoms can only be accomplished by the cleanser, and only for columns that have been semantically tagged. The semantic tagging of columns is accomplished as follows:
    a. When dealing with site columns that have already been mapped to specific semantic tags, the semantic tagging can be accomplished automatically by the scraper upon concluding its work, or by the cleanser prior to cleansing. The former is preferred, since cleansing can be iterative and it will be easier to plan and schedule if we know in advance which columns can be cleansed.
    b. The OntoPlug attempts to determine a semantic tag …
    c. When dealing with consistent columns yet to be mapped, the Wizard can query the OntoPlug for the most probable headers, enable the user to approve its automated selection, and prompt the user to select a specific semantic tag from a short list when the confidence level associated with the most probable tag is insufficient.
    d. When the candidate tags are not sufficiently differentiated (confidence levels too close), allow the user to determine the right one. Site columns that have not yet been semantically tagged are analyzed by the OntoServer, which attempts to expand the sphere ontology to cover these columns and map them accordingly.

 4. Pipe OntoPlug directives
    a. uFilter attributes
       i. Atom Filters – Identifying the atom type and pattern that best captures the atoms in a given field.
       ii.
Label/Prefix – The probability that a uFilter captures labels and/or prefixes, based upon the text in the fields captured by that uFilter. Labels will subsequently be captured by a Text Equals attribute and prefixes by a Text StartsWith attribute.
    b. Semantic tags

 5. Atom filter training – Saving fields with unknown atoms, in decompose and elsewhere, with their contexts and semantic tags…

 6. Atom context training – Atoms will often appear in new contexts that have to be acquired by the ontology, and newly harvested data will often lack the ontology needed to cleanse it. The scraper records the context of each field in kTables so that the OntoServer can subsequently learn everything possible from harvested data, including new patterns for existing atoms as well as new atoms. It is context such as column, style, headers and ID that enables us to recognize fields that contain common atoms and provides hints as to what they might be.
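The context recording in flow 6 might look roughly like this – a hypothetical sketch in which the scraper's recorded field contexts are grouped so the OntoServer can analyze together all fields that share a context. All names here are illustrative, not the actual kTables API:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of atom context training: the scraper logs each field
// with its context (column, style, header); the OntoServer later groups values
// that share a context to learn new atom patterns for them collectively.
public class AtomContextLog {
    public record FieldContext(String column, String style, String header, String value) {}

    // Group recorded field values by (column, header) so that fields likely
    // containing the same atom type can be analyzed as one candidate set.
    public static Map<String, List<String>> groupByContext(List<FieldContext> fields) {
        return fields.stream().collect(Collectors.groupingBy(
                f -> f.column() + "/" + f.header(),
                Collectors.mapping(FieldContext::value, Collectors.toList())));
    }
}
```

A grouped set of values such as {"10.00$", "7.00$"} under the context "col2/Price" is exactly the kind of hint the document describes for recognizing a common atom.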
 7. Field decomposition –
 8. Table/record within table – The scraper uses context (style, id, etc.) that is not available to the cleanser…

Application Data Flows
Key application data flows include the following:
 1. Query auto-suggestion
 2. Aggregated data view
 3. Sourcing crowd-qualification
 4. Best-practice sphere views
 5. Best-practice content enrichment
 6. Collective sphere ontology development

Ontology Data Flows
Key ontology data flows include the following:
 1. Ontology Acquisition
 2. Contexts
 3. Entity relationships
 4. Synonyms
 5. Patterns
 6. Atom Types
 7. Rules
 8. Atom Filter training
 9. Unique Identifiers – production of UIK permutations
 10. Schema expansion

Sphere Data Flows
Key sphere data flows include the following:
 1. Publishing
 2. Merging data across multiple pipes
 3. Conflict resolution
 4. Manual refinement
 5. Rules
 6. Crowd worker data refinement
 7. Ontology refinement
 8. Quality refinement
 9. Performance refinement

Framework Data Flows
Key framework data flows include the following:
 1. Data quality measurement
 2. Key Performance Indicators (KPI)
 3. Cloud utilization optimization
 4. Auto-testing optimization
 5. Pipe caching optimization
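Before moving to the appendices, the record-assembly rules from the Pipe Data Flow above can be sketched as follows. This is an illustrative sketch only: type and method names are invented here, and linking Table rows beneath a merged Record (rather than copying category data into each produced record) is omitted for brevity:

```java
import java.util.*;

// Illustrative sketch of Importer record assembly for one page.
// Rules from the Pipe Data Flow section:
//  - if Record blocks exist, all of them merge into a single record;
//  - otherwise, each Table row produces its own record;
//  - Category (and Attachment) data is linked to every produced record.
public class RecordAssembler {
    public static List<Map<String, String>> assemble(
            Map<String, String> categoryBlock,
            List<Map<String, String>> recordBlocks,
            List<Map<String, String>> tableRows) {

        List<Map<String, String>> out = new ArrayList<>();
        if (!recordBlocks.isEmpty()) {
            // All Record blocks are assembled as a single record.
            Map<String, String> merged = new LinkedHashMap<>(categoryBlock);
            recordBlocks.forEach(merged::putAll);
            out.add(merged);
            return out;
        }
        // No Record blocks: one record per table row, category data attached.
        for (Map<String, String> row : tableRows) {
            Map<String, String> r = new LinkedHashMap<>(categoryBlock);
            r.putAll(row);
            out.add(r);
        }
        return out;
    }
}
```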
Appendix A: kTables Tags
Owner: Moshe

Each pipe may produce several tables of data, each category of entities maintained in a separate table. Each table comprises a matrix of fields, each field belonging to a specific column and row. Each field, column, row, table and pipe may have several properties maintained as key/value pairs, maintained by kTables at the pipe, table, column, row or field level that reflects their scope. The kTables keys are collectively referred to as kTable tags, maintained in a kTableTags enumerator. The property values are often objects defined in designated APIs.

Consider a site that sells books, CDs and videos with reviews. Then books, CDs, videos, prices and reviews can be treated as independent entities maintained in separate tables, with specific rows in one table (e.g. prices and reviews) linked to specific rows in other tables (e.g. book, CD and video). Each field is associated with a specific CategoryID to ensure that it is stored in the right table. Each field is also associated with a specific PageID and BlockID so that we can trace back to its origin.

kTables is created by a Scraper agent, whereupon it is refined by the Cleanser and additional agents before finally being imported by an Importer agent into the QR. Prior to kTables creation, properties pertaining to the data that are created by the Wizard App and Spider agent are maintained in the kRegistry of kFramework. This includes, for example, pipe properties such as a SphereID and SphereOntology, as well as the properties of the pages and page blocks that each table cell was extracted from. kTables therefore need only maintain a PipeId, PageId and BlockId so that all properties associated with the pipe, page and block can be obtained from the kRegistry. The scraper receives as input a list of pages that it needs to process, including the PipeId and all PageIds that it needs to access them. Given a PageId, the scraper can get all BlockIds for that page.
Similarly, when dealing with a cell containing a value with a given AtomType, kTables only needs to maintain the AtomTypeId to obtain additional properties of the AtomType from the OntoPlug.

The following table contains a list of properties per scope (pipe, table, column etc.) that are available via kTables. In many cases, the property is maintained internally – in other cases the property is prefixed with a ‘*’ or ‘#’ to indicate that it can be obtained via the kRegistry or OntoPlug respectively, using a kTable tag designated in the second column. A ‘W’ value indicates which components write the tag value and an ‘R’ indicates which components read it, for purposes described in the final column. Post-pipe processing typically includes OntoServer analysis or QA evaluation of the quality of pipe processing. Highlighted tags are for Beta.
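The ID-based indirection described above can be sketched as follows, with the kRegistry modeled as a plain map from id to properties. The real kRegistry API is not specified in this document, so all names are illustrative:

```java
import java.util.*;

// Sketch of the kTables indirection: a cell keeps only PipeId/PageId/BlockId,
// and shared properties are resolved through a registry rather than being
// duplicated per cell. The registry interface here is an assumption.
public class KTableCell {
    final String pipeId, pageId, blockId;

    public KTableCell(String pipeId, String pageId, String blockId) {
        this.pipeId = pipeId;
        this.pageId = pageId;
        this.blockId = blockId;
    }

    // Resolve a property from the narrowest scope that defines it:
    // block first, then page, then pipe.
    public String property(Map<String, Map<String, String>> registry, String key) {
        for (String id : new String[] { blockId, pageId, pipeId }) {
            Map<String, String> props = registry.get(id);
            if (props != null && props.containsKey(key)) return props.get(key);
        }
        return null;
    }
}
```

The design choice this illustrates is the one stated in the appendix: per-cell storage stays minimal because every pipe-, page- and block-level property lives once in the kRegistry.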
Property/Tag | Level/Scope | W/R (App/Wiz, Spider, Scraper, Cleanser, Importer, Post Pipe) | Description/Purpose
PipeID | Pipe | W R | To access all related properties in kRegistry
*CustomerID | *PipeID | W R | Serving QR access control
*SphereID | *PipeID | W R | Each sphere has an independent QR
*OntologyName | *PipeID | W R R R R | OntologyName for OntoPlug.SetContext
*MinPedigree | *PipeID | W R R R R | Min Pedigree for OntoPlug.SetContext
*SiteID | *PipeID | W R | For auto-navigation to the original pages
BlockID | Field | W | Entity records spanning multiple blocks & pages must be merged
*PageID | *BlockID | W R | To find other blocks on the same page
*PageLayout | *PageID | W R | To apply the correct scraper filter to each page by layout
AtomFilters | Pipe | W R R | OntoPlug.RelevantFilters(List<fields>) per block per layout for O…
*Prev/NextPage | *PageID | W R | To scrape records in the appropriate order
*ParentBlockID | *PageID | W R R | To find the linked block on the parent page
*ParentLinkValue | *PageID | W R R | To find the linked block on the parent page
*BreadCrumbs | *PageID | R W R | To auto-identify a category block & measure spider coverage
*CategoryContent | *PageID | W R R | To validate the right choice of category block and CategoryID
CategoryID | Table | W R R | OntoPlug.CategoryID(BreadCrumbs, CategoryContent)
*AgentsVersions | *PipeID | W W W W W R | List of agents that processed this kTables and their onto versions
UniqueKeySets | Table | W R | OntoPlug.UniqueKeySets(List<column.SemanticTags>) per Cate…
*PageTitle | *PageID | W R R | HTML page title to validate matching nugget title
IsNugget | Column | W R R | To apply QR indexing to these columns
ColumnContext | Column | W R | Scraped TableColumnOrder, Header, Prefix/Label, Style, AtomType
ColumnID | Column | W R | Autogenerated by Scraper based upon ColumnContext
UserColumnName | Column | (W) W R | Manually entered by user & registered in Scraper Filter (takes p…)
SemanticTag | Column | W W R R | OntoPlug.ProbableSemanticTags(List<columnValues>, ColumnContext)
FirstSonColumn | Column | W | To find the first descendent column produced by field decomposition
NextSonColumn | Column | W | Whereupon this leads to the remaining descendent columns
ParentColumn | Column | W | To recursively find all ancestor columns
SkipColumn | Column | W R R |
#CanDecompose | *SemanticTag | W | OntoPlug.CanDecompose(SemanticTag); Wizard can also configure…
FieldProperties | Field | W W R | Link, ImageLink, ImageAlt, Empty, NewWord
FieldAtomType | Field | W W R R | To get OntoPlug atom type attributes, e.g. AtomValueMin/Max
FieldValue | Field | W W R R | Object containing Amount, UnitName etc.
Quality Metrics
The following table contains a list of quality metrics per pipe, designed to monitor:
 a) Data quality improvements as the data flows through the pipe
 b) Pipe data quality and KPI (key performance indicator) improvements over time.

Over time here could mean from one test cycle to the next, due to a new code version or an improved ontology. Several of the metrics are maintained per Type*Layout, i.e. for each block type (Record, Table, Category, Attachment) and each page layout in the pipe.

KPIs are readily derived from these metrics, e.g. Suspicious/Validated characters, percentage of empty pages, percentage of Normalized/Atomic cells, etc.

Property/Tag | Level/Scope | W/R (App/Wiz, Spider, Scraper, Cleanser, Importer, Post Pipe) | Description/Purpose
Bread crumb paths | Pipe | W | Number of queries – should correlate with Category IDs
Spider Pages | Pipe | W | Total navigations
Page Layouts | Pipe | W |
Linked pages | Pipe | W | Navigations via links
Broken links | Pipe | W |
Layout Pages | Layout | W | Scraped pages per layout ID
Empty pages | Layout | W |
Category IDs | Layout | W |
Blocks | Type*Layout | W | Category/Table/Record/Attachment blocks per layout ID
Texts | Type*Layout | W | Texts per block type per layout
Fields | Type*Layout | W | Extracted fields per block type/layout
Columns | Type*Layout | W W | Number of columns
Semantic Tags | Type*Layout | W W | How many of them tagged
Atomic Columns | Type*Layout | W W |
Atomic Cells | Type*Layout | W |
Normalized Cells | Type*Layout | W |
Total Chars | Type*Layout | W |
Unknown Atoms | Type*Layout | W | Candidate new atom names/patterns
Residue Chars | Type*Layout | W |
Leave As Is Chars | Type*Layout | W |
Validated Chars | Type*Layout | W | Output characters validated via their history as matching spe…
Suspicious Chars | Type*Layout | W | Non-validated output characters
Max Response | Pipe | W | Source response times
Total Response | Pipe | W | Total response times for all pages in the spider process
Spider Time | Pipe | W | Total spider process
Scrape Time | Pipe | W |
Cleanse Time | Pipe | W |
Scrape Onto Time | Pipe | W | Onto processing time only
Cleanse Onto Time | Pipe | W | Onto processing time only
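As an illustration of how KPIs derive from these raw metrics, here is a minimal sketch of two of the examples named above (percentage of suspicious characters and of empty pages). The formulas are assumptions based on those examples, not a specified algorithm:

```java
// Hypothetical KPI derivation from the per-pipe quality metrics above.
public class PipeKpi {
    // Fraction of output characters that are suspicious (non-validated).
    public static double suspiciousRatio(long validatedChars, long suspiciousChars) {
        long total = validatedChars + suspiciousChars;
        return total == 0 ? 0.0 : (double) suspiciousChars / total;
    }

    // Fraction of scraped pages per layout that came back empty.
    public static double emptyPageRatio(long layoutPages, long emptyPages) {
        return layoutPages == 0 ? 0.0 : (double) emptyPages / layoutPages;
    }
}
```

Tracking these ratios per test cycle gives the over-time comparison the section describes (new code version vs. improved ontology).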
Appendix B: Ontology API
Owner: Naama
Appendix C: Pipe Flow with Data Example
Owner: Oksana

1. Website Mapping
Result: Sample data

Pipe 1:
 Column1  | Column 2     | Price  | Column 3
 ayn rand | fountainhead | 10.00$ | 5

Pipe 2:
 Column1  | Column 2     | Price  | Column 3
 ayn rand | fountainhead | 10.00$ | 5

2. Collect data (scraper)
Process: System columns added – what columns? Data manipulation?
Result: Scraped data

Pipe 1 (collected from 10,000 pages, 15,000 records):
 Column1  | Category   | Column 2     | Price  | Column 3
 ayn rand | Philosophy | fountainhead | 10.00$ | 5 - excellent
 Author1  | Cat1       | Book1        | 10.00$ | 3 - medium
 Author3  | Cat3       | Book3        | 7.00$  | 2 - poor
 Author3  | Cat1       | Book4        | 7.00$  | 4 - good

Pipe 2 (collected from 5,000 pages, 10,000 records):
 Column1  | Column 2    | Column 3     | Price  | Column 4
 ayn rand | Inspiration | fountainhead | 10.00€ | 10
 Author1  | Cat7        | Book1        | 5.00€  | 8
 Author2  | Cat3        | Book2        | 5.00€  | 6

3. Cleansing
3.1. Decomposition
Breakdown to atoms? Based on what rules? Should prices and measure units be decomposed?
Result – Pipe 1, Column 3 decomposed into Column 3_1 and Column 3_2:
 Column1  | Category   | Column 2     | Price  | Column 3_1 | Column 3_2
 ayn rand | Philosophy | fountainhead | 10.00$ | 5          | excellent
 Author1  | Cat1       | Book1        | 10.00$ | 3          | medium
 Author3  | Cat3       | Book3        | 7.00$  | 2          | poor
 Author3  | Cat1       | Book4        | 7.00$  | 4          | good
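The decomposition step in 3.1 can be sketched as follows. The "number - legend" delimiter rule is assumed from the sample data only; the real cleanser would derive decomposition rules from the ontology's atom filters:

```java
// Illustrative sketch of decomposing Column 3 ("5 - excellent") into a
// numeric rating (Column 3_1) and a legend (Column 3_2). The dash-delimiter
// rule is an assumption taken from the sample data above.
public class FieldDecomposer {
    public static String[] decomposeRating(String field) {
        // "5 - excellent" -> ["5", "excellent"]; a bare "4" stays undecomposed.
        String[] parts = field.split("\\s*-\\s*", 2);
        return parts.length == 2 ? parts : new String[] { field, "" };
    }
}
```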
3.2. Typing
What data types do we distinguish today? String, Number, Price, Phone, Address? Is breaking down into units and scale part of typing or of decomposition?
Result:

Pipe 1:
 Column     | Type   | Units | Scale
 Column1    | String |       |
 Category   | String |       |
 Column 2   | String |       |
 Price      | Price  | $     |
 Column 3_1 | Number |       | 1-5
 Column 3_2 | String |       |

Pipe 2:
 Column   | Type   | Units | Scale
 Column1  | String |       |
 Column 2 | String |       |
 Column 3 | String |       |
 Price    | Price  | €     |
 Column 4 | Number |       | 1-10

3.3. Column mapping & unique identifiers (entities & relationships)
Description TBD
Result:

Pipe 1:
 Column     | Mapped to          | Identifier
 Column1    | Author name        | Yes
 Category   | Book category      | Yes
 Column 2   | Book name          | No
 Price      | Book price         | No
 Column 3_1 | Book rating        | No
 Column 3_2 | Book rating legend | No

Pipe 2:
 Column   | Mapped to     | Identifier
 Column1  | Author name   | Yes
 Column 2 | Book category | Yes
 Column 3 | Book name     | No
 Price    | Book price    | No
 Column 4 | Book rating   | No

3.4. Normalization
What normalization do we perform in the cleanser? As far as I understand, normalization can be done at the sphere level. For example, price normalization comes into question when we join data (one source in $ and the other in €) – or do we bring all data to one currency already in the cleanse stage? The same holds for rating normalization.

3.5. Complete data
What completion do we perform in the cleanser? Is it not a sphere-related action as well, e.g. completing all zips, phones, etc.? I think this is also something we cannot do automatically; user input will be required, with guidelines as to what data to complete and how.
4. Merge data
4.1. Unify – simple merge
Description TBD
Result – unified data:

 Source (S) | Author name | Book category | Book name    | Book price | Book rating | Book rating legend
 Pipe 1     | Ayn rand    | Philosophy    | fountainhead | 10.00$     | 5           | excellent
 Pipe 2     | ayn rand    | Inspiration   | fountainhead | 15.00$     | 5           |
 Pipe 1     | Author1     | Cat1          | Book1        | 10.00$     | 3           | medium
 Pipe 2     | Author1     | Cat7          | Book1        | 7.00$      | 4           |
 Pipe 1     | Author3     | Cat3          | Book3        | 7.00$      | 2           | poor
 Pipe 1     | Author3     | Cat1          | Book4        | 7.00$      | 4           | good
 Pipe 2     | Author2     | Cat3          | Book2        | 7.00$      | 3           |

4.2. Merge data by unique identifier
4.2.1. Keep duplicates (default)
Enables the domain expert to decide how to resolve duplicates/conflicts – manually reconciling data from the various sources.
Result – merged data with duplicates:

 Author name | Book category        | Book name    | Book price      | Book rating | Book rating legend
 Ayn rand    | Philosophy (Pipe 1)  | fountainhead | 10.00$ (Pipe 1) | 5           | excellent
             | Inspiration (Pipe 2) |              | 15.00$ (Pipe 2) |             |

At this point the domain expert can perform a decision round and decide how to treat the duplicates. Based on the example above, I decide to always take categories by pipe rating (Pipe 1 in my case) – actually defining the categories normalization here – and to keep prices from both sources.
Result example:

 Author name | Book category | Book name    | Book price – Pipe 1 | Book price – Pipe 2 | Book rating | Book rating legend
 Ayn rand    | Philosophy    | fountainhead | 10.00$              | 15.00$              | 5           | excellent

4.2.2. Resolve all duplicates by pipe rating
Based on sphere preferences; in this case the merge should result in all duplicates automatically resolved by pipe rating.
Example – my sphere preferences = always resolve duplicates based on Pipe 1.
Result – merged data, duplicates resolved:
 Author name | Book category | Book name    | Book price | Book rating | Book rating legend
 Ayn rand    | Philosophy    | fountainhead | 10.00$     | 5           | excellent
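The "resolve all duplicates by pipe rating" merge in 4.2.2 can be sketched as follows, assuming (for illustration only) that author name plus book name form the unique identifier and that each pipe carries a numeric rating:

```java
import java.util.*;

// Illustrative sketch of merging rows from several pipes by a unique
// identifier, resolving duplicates in favor of the higher-rated pipe.
// The key choice and the "Pipe" field are assumptions for this example.
public class SphereMerger {
    public static Map<String, Map<String, String>> mergeByKey(
            List<Map<String, String>> rows,      // each row carries a "Pipe" field
            Map<String, Integer> pipeRating) {   // higher rating wins conflicts
        Map<String, Map<String, String>> merged = new LinkedHashMap<>();
        for (Map<String, String> row : rows) {
            String key = row.get("Author name") + "|" + row.get("Book name");
            Map<String, String> existing = merged.get(key);
            if (existing == null
                    || pipeRating.get(row.get("Pipe")) > pipeRating.get(existing.get("Pipe"))) {
                merged.put(key, row);
            }
        }
        return merged;
    }
}
```

With Pipe 1 rated above Pipe 2, the fountainhead conflict above resolves to the Philosophy category, matching the result table.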
Appendix D: Domain Expert Worker Flow
Owner: Oksana

1. High level domain expert flow
More can be found here.
Appendix E: Sphere Architecture and APIs
Owner: Irina
Appendix F: Framework Architecture and APIs
Owner: Aryeh
Appendix G: Spider Architecture and APIs
Owner: Yossi
Appendix H: Scraper Architecture and APIs
Owner: Hagay

Subsequent iterations:
 a) Multiple record blocks: linked records between blocks
 b) Optimize performance: replace TextList with linked texts; no need for spliceRecords
Appendix I: Cleanser Architecture and APIs
Owner: Ronen
Appendix J: kGrid Integration
Owner: Jair

kGrid was designed for cross-enterprise data integration – hence kGrid agents are fundamentally different from pipe agents in kFramework:
 1. kGrid agents run continuously, processing requests sent to their message queues, whereas kFramework agents are dispatched to do a specific task.
 2. kGrid agents typically reside at specific Agency or Gateway locations until system dynamics warrant relocation, whereas kFramework agents are dispatched per task anywhere in the cloud.
 3. kGrid agents cache the ontology that they need internally via an Ontology service, whereas kFramework agents rely upon an OntoPlug to do their ontology-related work.

Kinor envisions future cross-enterprise contexts, so the fundamental kGrid architecture must be retained. This architecture enables kGrid Agencies and Gateways anywhere to dynamically discover each other and work together without disrupting operations. A robust dynamic discovery mechanism has yet to be implemented; hence kGrid should initially be deployed internally, using a static configuration dictated by an appropriate kGridConfig object in the kFramework kRegistry. The following principles should enable immediate kGrid deployment for online database integration purposes within weeks:
 1) The designated machines will run Java 1.4.
 2) At least one of the machines will deploy MySQL for initial kGrid persistence.
 3) The kGridConfig object should initially comprise the following:
    a) A list of Agency and Gateway machines for deployment of the kGrid Agent Manager
    b) A kGrid-compatible XML file per Agent Manager to configure its agents and services
    c) A minimalistic agent and services configuration to get started
 4) The kGrid discovery service should be adjusted to access the kRegistry for connecting with other kGrid discovery services, rather than attempt dynamic discovery.
 5) kGrid uses Protégé as an ontology editor, with a plugin that knows how to engage the kGrid ontology service.
The OntoServer must incorporate this plugin and translate its OWL ontology format into the OKBC format currently serving the Ontology Server.

The following kGrid improvements can then be implemented over subsequent months:
 1) Replace the kGrid Query Builder with the Sphere Query Builder.
 2) Upgrade the code to be compatible with Java 7.
 3) Use an OntoPlug to semi-automate Wrapper schema mapping.
 4) Self-organizing kGrid deployment to match operational demand.
 5) Consider replacing the ODBC connectivity currently serving the kGrid wrappers with a standard DBMS interface that is more readily configured programmatically by kFramework when new database sources are added to a sphere.
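A minimal sketch of the static kGridConfig object described in item 3 above. The field names are assumptions not specified in this document, and the sketch uses modern Java collections for brevity even though the initial deployment targets Java 1.4:

```java
import java.util.*;

// Hypothetical shape of the static kGridConfig kept in the kFramework
// kRegistry, per items 3a-3c: machine lists for Agent Manager deployment
// plus an XML configuration file per Agent Manager.
public class KGridConfig {
    public final List<String> agencyMachines;         // deploy the kGrid Agent Manager here
    public final List<String> gatewayMachines;
    public final Map<String, String> agentManagerXml; // machine -> XML config file path

    public KGridConfig(List<String> agencies, List<String> gateways,
                       Map<String, String> xmlPerManager) {
        this.agencyMachines = agencies;
        this.gatewayMachines = gateways;
        this.agentManagerXml = xmlPerManager;
    }

    // All machines the adjusted discovery service (item 4) would connect,
    // replacing dynamic discovery for the initial internal deployment.
    public List<String> allMachines() {
        List<String> all = new ArrayList<>(agencyMachines);
        all.addAll(gatewayMachines);
        return all;
    }
}
```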