Unsupervised Extraction of Attributes and Their Values from Product Description
Keiji Shinzato and Satoshi Sekine
17th Oct. 2013
The 6th International Joint Conference on Natural Language Processing
This presentation describes an unsupervised method for extracting product attributes and their values from e-commerce product pages. Distant supervision has previously been applied to this task, but it is not applicable in domains where no reliable knowledge base (KB) is available. Instead, the proposed method automatically creates a KB from tables and itemizations embedded in product pages. This KB is used to annotate the pages automatically, and the annotated corpus is then used to train an extraction model. Because of the incompleteness of the KB, the annotated corpus is not as accurate as a manually annotated one, so our method filters out sentences that are likely to contain problematic annotations, based on statistical measures and morpheme patterns induced from the KB entries. The experimental results show that our method achieves an average F-score of approximately 58.2 points and that the filters improve performance.
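To make the pipeline concrete, here is a minimal Python sketch of the two core steps: harvesting attribute-value pairs from table cells into a KB, and projecting that KB onto free text as annotations. The function names and the simple string-matching annotator are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch: (1) build a KB from (attribute, value) table cells,
# (2) annotate sentences by matching KB values; unreliable-looking sentences
# would then be filtered out before training the extraction model.
import re
from collections import defaultdict

def build_kb(table_rows):
    """table_rows: iterable of (attribute, value) cells scraped from pages."""
    kb = defaultdict(set)
    for attr, value in table_rows:
        kb[attr.strip()].add(value.strip())
    return kb

def annotate(sentence, kb):
    """Return (attribute, value, span) triples found in the sentence."""
    annotations = []
    for attr, values in kb.items():
        for value in values:
            match = re.search(re.escape(value), sentence)
            if match:
                annotations.append((attr, value, match.span()))
    return annotations

kb = build_kb([("Color", "navy blue"), ("Material", "cotton")])
print(annotate("A navy blue shirt made of 100% cotton.", kb))
```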
Masato Hagiwara, Satoshi Sekine
Rakuten Institute of Technology, New York
NEWS 2012, July 12 2012
Transliteration has usually been recognized by spelling-based supervised models. However, a single model cannot deal with a mixture of words with different origins, such as “get” in “piaget” and “target”. Li et al. (2007) propose a class transliteration method, which explicitly models the source language origins and switches transliteration models accordingly. In contrast to their model, which requires a training corpus explicitly tagged with language origins, Hagiwara and Sekine (2011) proposed the latent class transliteration model, which models language origins as latent classes and trains the transliteration table via the EM algorithm. However, this model, which can be formulated as a unigram mixture, is prone to overfitting since it is based on maximum likelihood estimation. We propose a novel latent semantic transliteration model based on Dirichlet mixture, where a Dirichlet mixture prior is introduced to mitigate the overfitting problem. We show that the proposed method considerably outperforms the conventional transliteration models.
Latent Class Transliteration based on Source Language Origin (Rakuten Group, Inc.)
Masato Hagiwara, Satoshi Sekine
Rakuten Institute of Technology, New York
ACL-HLT 2011, June 21 2011
Transliteration, a rich source of proper noun spelling variations, is usually recognized by phonetic- or spelling-based models. However, a single model cannot deal with words from different language origins, e.g., “get” in “piaget” and “target.” Li et al. (2007) propose a method which explicitly models and classifies the source language origins and switches transliteration models accordingly. This model, however, requires an explicitly tagged training set with language origins. We propose a novel method which models language origins as latent classes. The parameters are learned from a set of transliterated word pairs via the EM algorithm. The experimental results of the transliteration task of Western names to Japanese show that the proposed model can achieve higher accuracy than the conventional models without latent classes.
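To make the latent-class idea concrete, here is a toy Python sketch of a unigram-mixture EM trainer. It assumes each word pair has already been reduced to a bag of aligned character-pair "rules" (the alignment step is omitted), and it uses simple additive smoothing where the later Dirichlet-mixture work uses a proper prior; it stands in for, rather than reproduces, the paper's model.

```python
# Toy EM for a mixture over latent language-origin classes.
import math
from collections import defaultdict

def em(pairs, n_classes=2, iters=20, smooth=1e-3):
    """pairs: list of rule bags, e.g. [["g:gu", "e:e", "t:tto"], ...]."""
    rules = sorted({r for bag in pairs for r in bag})
    pi = [1.0 / n_classes] * n_classes   # class priors
    theta = []                           # per-class rule distributions
    for z in range(n_classes):
        raw = {r: z + 1 + j for j, r in enumerate(rules)}  # arbitrary init
        total = sum(raw.values())
        theta.append({r: raw[r] / total for r in rules})
    for _ in range(iters):
        # E-step: soft class assignment for each word pair
        resp = []
        for bag in pairs:
            logp = [math.log(pi[z]) + sum(math.log(theta[z][r]) for r in bag)
                    for z in range(n_classes)]
            m = max(logp)
            w = [math.exp(lp - m) for lp in logp]
            s = sum(w)
            resp.append([x / s for x in w])
        # M-step: re-estimate priors and rule distributions
        for z in range(n_classes):
            pi[z] = sum(r[z] for r in resp) / len(pairs)
            counts = defaultdict(float)
            for bag, r in zip(pairs, resp):
                for rule in bag:
                    counts[rule] += r[z]
            total = sum(counts.values()) + smooth * len(rules)
            theta[z] = {rule: (counts[rule] + smooth) / total for rule in rules}
    return pi, theta
```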
Hundreds of queries in the time of one - Gianmario Spacagna (Spark Summit)
The document describes an Insights Engine that generates business insights for small businesses by combining hundreds of queries into a single optimized execution plan. It takes transaction and market data for businesses and calculates key performance indicators, comparing each business to similar competitors at different granularities of time and location. The engine uses composable "monoids" to allow efficient aggregation at multiple levels and a domain-specific language to define insights concisely. It ensures results are privacy-safe and relevant by filtering and ranking insights. The engine was able to run hundreds of queries for over 275,000 UK businesses in under 30 minutes on a small cluster.
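The "composable monoid" idea can be sketched in a few lines: a KPI type with an associative combine operation and an identity element can be aggregated per (time, location) bucket, and the partial results merged at any coarser granularity without re-reading raw transactions. The Python below illustrates the concept only; it is not the talk's actual code.

```python
# A KPI as a monoid: identity element RevenueKPI() and an associative combine.
from dataclasses import dataclass

@dataclass(frozen=True)
class RevenueKPI:
    count: int = 0       # number of transactions
    total: float = 0.0   # total revenue

    def combine(self, other: "RevenueKPI") -> "RevenueKPI":
        return RevenueKPI(self.count + other.count, self.total + other.total)

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else 0.0

# City-level partial aggregates roll up into a national figure for free,
# which is what lets one execution plan answer queries at many granularities.
london = RevenueKPI(count=120, total=54_000.0)
leeds = RevenueKPI(count=45, total=9_800.0)
print(london.combine(leeds).average)
```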
Oracle Endeca 101 Developer Introduction High Level Overview (Gordon Kiser)
This slide deck gives developers a high-level overview of the structure of an Oracle Commerce Experience Manager page. Business users use such pages to create scenarios and triggers that control both static pages and dynamic pages that automatically present content based on site-visitor behavior.
This document summarizes a project to create a Web Symbol Service (WSS) for remote access to symbology. The project involved developing a symbols server using Django and Pinax to allow uploading and querying of symbol metadata and files. It also involved creating a WSS client for the gvSIG application to allow loading symbols from the remote server. Future plans include encouraging other organizations to publish symbols through WSS and expanding the WSS protocol and usability.
New for 2018 MRO master data auditing and cleansing (David Thompson)
A new workshop developed in response to the large number of company inventory systems with poor data, which leads to many duplicates, a lack of focus on cost reduction, and excess downtime.
Unify Your Selling Channels in One Product Catalog Service (MongoDB)
The document discusses how MongoDB can help unify product data across retail selling channels. It describes MongoDB's capabilities like flexible schemas, real-time querying, and scaling that make it well-suited for powering modern retail systems. Several use cases are provided, including maintaining a global product catalog, consolidated customer views, innovations across channels, and personalized recommendations.
Retail Reference Architecture Part 1: Flexible, Searchable, Low-Latency Produ... (MongoDB)
MongoDB provides a flexible data model and fast querying capabilities that make it well-suited for powering retail merchandising systems. Documents can represent products, variants, pricing and other metadata in a way that maps well to the complex hierarchical and attribute-based relationships in retail. MongoDB's indexing, real-time updates and ability to handle high read and write volumes meet the performance requirements for browsing, searching and maintaining a large catalog. The document model also simplifies building faceted search and summary views of products that integrate related metadata in a single query.
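A hedged sketch of what such a product document and a summary-view query might look like with pymongo; every field name and the index are illustrative assumptions, not the deck's actual schema.

```python
from pymongo import MongoClient, ASCENDING

products = MongoClient("mongodb://localhost:27017")["retail"]["products"]

# One document carries the product, its variants, pricing, and metadata.
products.insert_one({
    "_id": "sku-12345",
    "name": "Trail Running Shoe",
    "category": ["Shoes", "Running"],   # arrays get multikey indexes
    "price": {"amount": 89.99, "currency": "USD"},
    "attributes": {"brand": "Acme", "gender": "unisex"},
    "variants": [
        {"sku": "sku-12345-9-red", "size": 9, "color": "red"},
        {"sku": "sku-12345-10-blue", "size": 10, "color": "blue"},
    ],
})

# Compound index supporting attribute-filtered browsing and faceting.
products.create_index([("category", ASCENDING), ("attributes.brand", ASCENDING)])

# A single query returns everything a product summary view needs.
doc = products.find_one({"category": "Running", "attributes.brand": "Acme"})
```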
This document discusses using MongoDB for inventory management in retail applications. Some key points:
- MongoDB allows for a single view of inventory across all channels with real-time updates and bulk writes for refresh. Its flexible schema and horizontal scaling are well-suited for inventory needs.
- Collections would include Stores, Inventory, Products, Audits, Assortments, and Shipments. Stores documents contain store-specific metadata.
- Inventory documents have embedded documents for products and variants with attributes like size and color. This embedded structure allows for efficient queries on combinations of attributes (see the sketch after this list).
- The target architecture replaces traditional batch-based ETL with real-time updates to MongoDB for improved customer experience and business operations.
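A sketch of the embedded product/variant structure from the list above; the collection layout and field names are assumptions for illustration, not the deck's schema.

```python
from pymongo import MongoClient

inventory = MongoClient()["retail"]["inventory"]

inventory.insert_one({
    "store_id": "store-042",
    "product_id": "sku-12345",
    "variants": [
        {"size": 9, "color": "red", "on_hand": 14},
        {"size": 10, "color": "blue", "on_hand": 3},
    ],
})

# $elemMatch efficiently queries a combination of embedded attributes.
in_stock = inventory.find_one({
    "store_id": "store-042",
    "variants": {"$elemMatch": {"size": 10, "color": "blue",
                                "on_hand": {"$gt": 0}}},
})
```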
FDSeminar Processen Stroomlijnen (Streamlining Processes) - Bart De Backer and Joris Vanderlinden - Bar... (FDMagazine)
Barry Callebaut, the world's leading manufacturer of chocolate and cocoa products, implemented an e-invoicing solution to streamline its invoicing process. The company worked with consulting and printing partners Anachron and Pyramid to set up electronic invoicing and outsourced printing. The project was rolled out in phases, first testing with pilot customers in key markets before expanding to all customers. The e-invoicing solution eliminated manual tasks, errors, and costs for Barry Callebaut while providing customers faster service and easier invoice access and storage. Testing with stakeholders and gradual personalized onboarding of customers were keys to the success of the transition.
Applied Machine Learning for Ranking Products in an Ecommerce Setting (Databricks)
As a leading e-commerce company in fashion in the Netherlands, Wehkamp dedicates itself to providing a better shopping experience for its customers. Using Spark, the data science team develops various machine-learning projects for this purpose based on large-scale product and customer data. A major topic for the data science team is ranking products: if a visitor enters a search phrase, which products best fit that phrase, and in what order should they be shown? Ranking products is also important when a visitor enters a product overview page, where hundreds or even thousands of products of a certain article type are displayed.
In this project, Spark is used in the whole pipeline: retrieving and processing the search phrases and their results, building click models, creating feature sets, training and evaluating ranking models, pushing the models to production using ElasticSearch, and creating Tableau dashboards. In this talk, we demonstrate how we use Spark to build the whole product-ranking pipeline and discuss the challenges we faced along the way.
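To make one stage of the pipeline concrete, here is a minimal PySpark sketch that turns logged clicks into labels and fits a simple relevance model; the column names and the choice of logistic regression are illustrative assumptions, not Wehkamp's production setup.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ranking-sketch").getOrCreate()

# Hypothetical click log: (query, product, features..., clicked?).
logs = spark.createDataFrame(
    [("red dress", "p1", 0.9, 120, 1),
     ("red dress", "p2", 0.4, 15, 0)],
    ["query", "product_id", "text_match", "popularity", "clicked"],
)

# Assemble numeric signals into a feature vector and fit the model.
assembled = VectorAssembler(
    inputCols=["text_match", "popularity"], outputCol="features"
).transform(logs)
model = LogisticRegression(labelCol="clicked").fit(assembled)

# model.transform(candidates) scores products; sort by probability to rank.
```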
Presented at JavaOne 2013, Tuesday September 24.
"Data Modeling Patterns" co-created with Ian Robinson.
"Pitfalls and Anti-Patterns" created by Ian Robinson.
Here are five possible metrics and units of measure for the customer need "The pen writes smoothly":
1. Friction coefficient - unitless
2. Ink flow rate - grams per minute
3. Starting force - grams of force
4. Line width variation - millimeters
5. Skip rate - number of skips per 100 words
Billion Goods in Few Categories: How Histograms Save a Life? (Sveta Smirnova)
We store data with the intention of using it: search, retrieve, group, sort... To do this effectively, the MySQL Optimizer uses index statistics when it compiles the query execution plan. This approach works excellently unless your data distribution is uneven.
Last year I worked on several support tickets where the data followed the same pattern: millions of popular products fit into a couple of categories, and the remaining categories shared what was left. We had a hard time finding a solution for retrieving goods fast. We offered workarounds for version 5.7. However, a newer MariaDB and MySQL 8.0 feature - histograms - works better, cleaner, and faster. That is how the idea of this talk was born.
Of course, histograms are not a panacea and do not help in all situations.
I will discuss
- how index statistics are physically stored by the storage engine
- which data is exchanged with the Optimizer
- why this is not enough to make the correct index choice
- when histograms can help and when they cannot
- differences between MySQL and MariaDB histograms
Talk for Percona Live 2019 Austin: https://www.percona.com/live/19/sessions/billion-goods-in-few-categories-how-histograms-save-a-life
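For readers who want to try the feature, a short sketch of creating and inspecting a histogram from Python follows. The table and column names are invented; the ANALYZE TABLE ... UPDATE HISTOGRAM statement and the information_schema.COLUMN_STATISTICS view are standard MySQL 8.0.

```python
import mysql.connector  # MySQL Connector/Python

conn = mysql.connector.connect(user="root", database="shop")
cur = conn.cursor()

# Build an equi-height histogram so the optimizer sees the skewed
# distribution of goods over categories.
cur.execute("ANALYZE TABLE goods UPDATE HISTOGRAM ON category_id WITH 64 BUCKETS")
print(cur.fetchall())

# Histograms live in the data dictionary; check what was collected.
cur.execute(
    "SELECT HISTOGRAM->>'$.\"histogram-type\"' "
    "FROM information_schema.COLUMN_STATISTICS "
    "WHERE TABLE_NAME = 'goods' AND COLUMN_NAME = 'category_id'"
)
print(cur.fetchone())
```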
SIMUL8 User Group - Visual8 Case Study - Plywood Manufacturing (SIMUL8 Corporation)
The document provides an agenda and overview for a SIMUL8 User Group presentation by Visual8 Corporation. Visual8 is an industrial engineering consulting firm that specializes in simulation modeling. The presentation includes an overview of Visual8, example projects applying SIMUL8 across various industries, a case study of simulating different designs for an automated plywood patching line, and a live SIMUL8 model demonstration. The case study details the project goals, simulation model development process, different layout designs tested through sensitivity analysis, and results identifying Layout 4 with a combined routing and patching robot as the final choice meeting throughput needs within space limitations.
This document summarizes a chapter about product specifications from a textbook on product design and development. It discusses the nature and purpose of specifications, the process for setting target and final specifications, and guidelines for establishing metrics and assigning values. Target specifications are set based on customer needs and benchmarks, while final specifications are refined based on the selected concept and testing. The key aspects covered are:
- Specifications represent an agreement on what the team will achieve to satisfy customer needs
- Target specs are goals and final specs reflect feasibility testing and trade-offs
- Metrics should sufficiently address customer needs and be practical to measure
- Benchmarking, models, and trade-offs inform refining specs
MongoDB .local Bengaluru 2019: A Complete Methodology to Data Modeling for Mo... (MongoDB)
Are you new to schema design for MongoDB, or are you looking for a more complete or agile process than the one you are following currently? In this talk we will guide you through the phases of a flexible methodology that you can apply to projects ranging from small to large with very demanding requirements.
MT311 Operations and Quality Management Fall 2019 Team Research.docx (roushhsiu)
MT311 Operations and Quality Management
Fall 2019
Team Research Project
I. Project overview
Students are required to participate in a team project on the comparative performance analysis of companies. After completing the project, each group shall give a 15-20 min. presentation using PowerPoint slides and turn in the Excel and PPT files (hard and soft copies). The PowerPoint material does not have to be narrative or very detailed; however, it should include sufficient information for the audience to understand your preparation and readiness. Each member is expected to make valuable and equitable contributions to the team effort, including the team presentation. For the project, select a firm (manufacturing/service) of your interest and two of its comparable competitors in the same industry. The presenting team should lead the class discussion and be prepared for questions from the class. Peer evaluation is required at the end of the project.
II. Content
1. Profile of companies (three to four companies in the same industry):
A. Company names: xxx, xxx, xxx
B. Industry analysis (competition, rivalry, attractiveness, innovation, etc.)
C. Business summary for each company: Mission/ Vision/ Business description
D. Etc.
E. References
· PPT: 4-7 pages
2. Sustainability
A. Goals/ Initiatives (Sustainability programs)
* Give a brief overview of the programs launched by the company
B. Primary emphasis (activities): Economic (Profit) – Social (People) – Environmental (Planet)
C. References
* At least four different references.
· PPT: 3-5 pages
* Generate a summary table comparing companies
3. Comparative performance analysis (Benchmarking)
A. For each company, collect the most recent data (2018)
1). Revenue, Cost of Goods Sold (Cost of Sales), Inventories, R&D expenses, Employees.
Note: use $M for financial variables and thousands (000s) for employees.
2). Rank companies for each category for each year
* Generate a summary table
Variables | Company A | Company B | Company C | Rank
Revenue   |           |           |           | e.g., A-B-C
Employees |           |           |           |
CoGS      |           |           |           |
R&D       |           |           |           |
* Generate graphs for each category (e.g. bar chart with numbers displayed).
B. For the most recent year, calculate productivity (efficiency):
1). Employee productivity (Revenue/Employees)
2). Production efficiency (Revenue/COGS)
3). R&D productivity (Revenue/ R&D)
4). Rank companies for each category
5). Compare the rank of each category with the rank from raw data in ‘A’.
* Generate a summary table (Excel and PPT)
Variables             | Company A | Company B | Company C | Rank        | Remark
Employee productivity |           |           |           | e.g., A-B-C |
Production efficiency |           |           |           |             |
R&D productivity      |           |           |           |             |
Note: the Excel spreadsheet must show the calculation formulas (do not type simple answers). An illustrative sketch of these calculations appears below.
* Generate graphs for each category (e.g. bar chart with numbers displayed).
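The sketch below illustrates the same calculations in Python with invented numbers; the assignment itself expects them in Excel.

```python
# Invented figures: revenue/CoGS/R&D in $M, employees in thousands.
companies = {
    "A": {"revenue": 500, "employees": 2.0, "cogs": 300, "rnd": 50},
    "B": {"revenue": 400, "employees": 1.0, "cogs": 280, "rnd": 20},
    "C": {"revenue": 650, "employees": 3.5, "cogs": 350, "rnd": 90},
}

ratios = {
    "Employee productivity": lambda d: d["revenue"] / d["employees"],
    "Production efficiency": lambda d: d["revenue"] / d["cogs"],
    "R&D productivity": lambda d: d["revenue"] / d["rnd"],
}

for name, ratio in ratios.items():
    values = {c: round(ratio(d), 2) for c, d in companies.items()}
    rank = "-".join(sorted(values, key=values.get, reverse=True))
    print(f"{name}: {values} rank: {rank}")
```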
C. Conduct ‘best practice’ benchmarking to improve employee productivity of the lowest performer.
* Generate a summary table (Excel and PPT)
Option
Company:
(Lowest perfo ...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Rodney Joyce
Number 2 in the Data Science for Dummies series - we'll predict Titanic survival with Databricks, Python and Spark ML.
These are the slides only (excuse the PowerPoint animation issues) - check out the actual tech talk on YouTube: https://rodneyjoyce.home.blog/2019/05/03/data-science-for-dummies-machine-learning-with-databricks-python-sparkml-tech-talk-1-of-7/
If you have not used Databricks before check out the first talk - Databricks for Dummies.
Here's the rest of the series: https://rodneyjoyce.home.blog/tag/data-science-for-dummies/
1) Data Science overview with Databricks
2) Titanic survival prediction with Azure Machine Learning Studio + Kaggle
3) Data Engineering with Titanic dataset + Databricks + Python
4) Titanic with Databricks + Spark ML
5) Titanic with Databricks + Azure Machine Learning Service
6) Titanic with Databricks + MLS + AutoML
7) Titanic with Databricks + MLFlow
8) Titanic with .NET Core + ML.NET
9) Deployment, DevOps/MLOps and Productionisation
Niko Neugebauer gave a presentation on the columnstore improvements in SQL Server 2016. Some of the key improvements discussed include hybrid transactional/analytical processing (HTAP), new T-SQL syntax for defining columnstore indexes, high availability features like readable secondaries, improved data loading and batch processing performance, new maintenance features, and expanded monitoring capabilities. The presentation provided examples and demonstrations of many of these new columnstore features in SQL Server 2016.
Prepare for Peak Holiday Season with MongoDB (MongoDB)
This document discusses preparing for the holiday season by providing a seamless customer experience. It covers expected trends for the 2014 holiday season including increased spending and an extended shopping window. The opportunity is to provide personalized and relevant experiences for customers. The document then provides an overview of how MongoDB can be used to power various retail functions like product catalogs, real-time inventory and orders, and consolidated customer views to enable a modern seamless retail experience. Technical details are discussed for implementing product catalogs and real-time inventory using MongoDB.
How to Build a ML Platform Efficiently Using Open-Source (Databricks)
Fast-growing startups usually face a common set of challenges when employing machine learning. Data scientists are expected to work on new products and develop new models as well as iterate on existing ones. Once in production, models should be continuously monitored and regularly maintained as the infrastructure evolves. Before too long, data scientists end up spending most of their time doing maintenance and firefighting of existing models instead of creating new ones.
At GetYourGuide, we faced these challenges and decided to think about machine learning development holistically, which led us to our machine learning platform. Our platform uses MLflow to keep track of our machine learning life-cycle and ease the development experience. To integrate our models into our production environment, we also need to deal with additional requirements like API specification, SLOs and monitoring. To empower our data scientists, we have built a templating system that takes care of the heavy lifting of going to production, leveraging software engineering tools and ML-specific ones like BentoML.
In this talk we will present:
– Our previous approaches for deploying models and their tradeoffs
– Our data science and platform principles
– The main functionalities of our platform
– A live demo to create a new service
– Our learnings in the process
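As a small companion to the MLflow-based life-cycle tracking mentioned above, here is a minimal tracking sketch; the experiment name, metric, and scikit-learn model are placeholders, not GetYourGuide's setup.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

mlflow.set_experiment("ranking-model-sketch")
with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")  # artifact for later serving
```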
Has your app taken off? Are you thinking about scaling? MongoDB makes it easy to horizontally scale out with built-in automatic sharding, but did you know that sharding isn't the only way to achieve scale with MongoDB?
In this webinar, we'll review three different ways to achieve scale with MongoDB. We'll cover how you can optimize your application design and configure your storage to achieve scale, as well as the basics of horizontal scaling. You'll walk away with a thorough understanding of options to scale your MongoDB application.
Topics covered include:
- Scaling Vertically
- Hardware Considerations
- Index Optimization
- Schema Design
- Sharding
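Picking up two of the topics above, index optimization and sharding, here is a brief pymongo sketch; the database, collection, and key names are illustrative.

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Index optimization: a compound index matching the hot query shape.
orders.create_index([("customer_id", ASCENDING), ("created_at", ASCENDING)])

# Horizontal scaling (requires a sharded cluster reached via mongos,
# not a standalone mongod): enable sharding and choose a shard key.
client.admin.command("enableSharding", "shop")
client.admin.command("shardCollection", "shop.orders",
                     key={"customer_id": "hashed"})
```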
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16 (BigMine)
Apache Spark has become the most active open source Big Data project, and its Machine Learning library MLlib has seen rapid growth in usage. A critical aspect of MLlib and Spark is the ability to scale: the same code used on a laptop can scale to 100’s or 1000’s of machines. This talk will describe ongoing and future efforts to make MLlib even faster and more scalable by integrating with two key initiatives in Spark. The first is Catalyst, the query optimizer underlying DataFrames and Datasets. The second is Tungsten, the project for approaching bare-metal speeds in Spark via memory management, cache-awareness, and code generation. This talk will discuss the goals, the challenges, and the benefits for MLlib users and developers. More generally, we will reflect on the importance of integrating ML with the many other aspects of big data analysis.
About MLlib: MLlib is a general Machine Learning library providing many ML algorithms, feature transformers, and tools for model tuning and building workflows. The library benefits from integration with the rest of Apache Spark (SQL, streaming, Graph, core), which facilitates ETL, streaming, and deployment. It is used in both ad hoc analysis and production deployments throughout academia and industry.
Let's build an adoption centre in Office 365 (Joanne Klein)
Using a modern communication site, build an adoption centre for Office 365 using Site Pages and Page Properties. Slides from Modern Workplace Conference in Paris - October 2018
Similar to Unsupervised Extraction of Attributes and Their Values from Product Description (20)
This document discusses how to make software greener and more environmentally friendly. It defines green software as software that is carbon efficient, energy efficient, hardware efficient, and carbon aware. It provides recommendations on driving green initiatives for various roles within an organization, including CxOs, architects, infrastructure engineers, and developers, with a focus on efficiency. Examples include optimizing resource usage, using public clouds effectively, prioritizing equipment standardization, and developing applications that can run more efficiently.
Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At... (Rakuten Group, Inc.)
The document proposes a knowledge-driven query expansion approach for question answering (QA)-based product attribute extraction. It trains QA models using attribute-value pairs from training data as knowledge, while mimicking imperfect knowledge at test time through techniques like knowledge dropout and token mixing. This helps induce better query representations, especially for rare and ambiguous attributes. Experiments on a cleaned product attribute dataset show the proposed approach with all techniques outperforms baseline methods in both macro and micro F1 scores.
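A loose illustration of the knowledge-dropout idea, under the assumption that known attribute values are appended to the query and randomly dropped during training so the model learns to cope with imperfect knowledge at test time; this is a sketch of the concept, not the paper's implementation.

```python
import random

def expand_query(attribute, known_values, training, p_drop=0.3):
    """Append KB values to the attribute query, dropping some in training."""
    kept = [v for v in known_values
            if not (training and random.random() < p_drop)]
    return attribute + " [SEP] " + " ".join(kept)

print(expand_query("color", ["navy", "crimson", "beige"], training=True))
```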
This document summarizes Andrew Hajinikitas' work developing Rakuten's private cloud infrastructure. It describes the key components of Rakuten's infrastructure including metal instances, microservers, and GPU servers. It provides details on Rakuten's software stack and their goals to expand managed services. Currently, Rakuten operates 9 data centers in Japan and overseas providing around 30,000 servers to support their ecosystem. Their future plans include extending network self-service, making GPU resources available as a platform service, and improving efficiency through optimized hardware selection.
The document discusses the Travel & Leisure Platform Dept and its responsibilities related to data and platform management. It provides an overview of the technical stack including private/public clouds, databases, containers, and automation/monitoring tools. It then discusses recent projects involving business continuity, containerization, alert integration, and automation. Finally, it describes open roles for a DBA and DevOps position and their responsibilities related to database provisioning, backup/recovery, infrastructure as code, and providing platforms and tools for developers.
This presentation introduces the OWASP Top 10:2021.
It explains how to look at the data related to OWASP Top 10:2021, and provides detailed explanations of items with distinctive data. It also introduces the OWASP Project related to each item.
Gora API Group technology provides a microservices architecture and APIs for Rakuten's golf course reservation system, improving the user experience and increasing customer loyalty and annual golf rounds. The architecture migrates the monolithic reservation system to microservices using Kotlin, Spring Boot, and other technologies, exposing APIs for the frontend and new products while sustaining the legacy system through services, queues, continuous delivery, and operations monitoring.
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc... (DanBrown980551)
This LF Energy webinar took place June 20, 2024. It featured:
-Alex Thornton, LF Energy
-Hallie Cramer, Google
-Daniel Roesler, UtilityAPI
-Henry Richardson, WattTime
In response to the urgency and scale required to effectively address climate change, open source solutions offer significant potential for driving innovation and progress. Currently, there is a growing demand for standardization and interoperability in energy data and modeling. Open source standards and specifications within the energy sector can also alleviate challenges associated with data fragmentation, transparency, and accessibility. At the same time, it is crucial to consider privacy and security concerns throughout the development of open source platforms.
This webinar will delve into the motivations behind establishing LF Energy’s Carbon Data Specification Consortium. It will provide an overview of the draft specifications and the ongoing progress made by the respective working groups.
Three primary specifications will be discussed:
-Discovery and client registration, emphasizing transparent processes and secure and private access
-Customer data, centering around customer tariffs, bills, energy usage, and full consumption disclosure
-Power systems data, focusing on grid data, inclusive of transmission and distribution networks, generation, intergrid power flows, and market settlement data
The Department of Veteran Affairs (VA) invited Taylor Paschal, Knowledge & Information Management Consultant at Enterprise Knowledge, to speak at a Knowledge Management Lunch and Learn hosted on June 12, 2024. All Office of Administration staff were invited to attend and received professional development credit for participating in the voluntary event.
The objectives of the Lunch and Learn presentation were to:
- Review what KM ‘is’ and ‘isn’t’
- Understand the value of KM and the benefits of engaging
- Define and reflect on your “what’s in it for me?”
- Share actionable ways you can participate in Knowledge Capture & Transfer
"NATO Hackathon Winner: AI-Powered Drug Search", Taras KlobaFwdays
This is a session that details how PostgreSQL's features and Azure AI Services can be effectively used to significantly enhance the search functionality in any application.
In this session, we'll share insights on how we used PostgreSQL to facilitate precise searches across multiple fields in our mobile application. The techniques include using LIKE and ILIKE operators and integrating a trigram-based search to handle potential misspellings, thereby increasing the search accuracy.
We'll also discuss how the azure_ai extension on PostgreSQL databases in Azure and Azure AI Services were utilized to create vectors from user input, a feature beneficial when users wish to find specific items based on text prompts. While our application's case study involves a drug search, the techniques and principles shared in this session can be adapted to improve search functionality in a wide range of applications. Join us to learn how PostgreSQL and Azure AI can be harnessed to enhance your application's search capability.
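A hedged sketch of the trigram-based fuzzy match described above, via psycopg2; the table and column names are invented, while pg_trgm's % operator, similarity() function, and gin_trgm_ops index class are standard PostgreSQL.

```python
import psycopg2

conn = psycopg2.connect("dbname=pharmacy")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")
# A GIN trigram index keeps similarity search fast at scale.
cur.execute("CREATE INDEX IF NOT EXISTS drugs_name_trgm "
            "ON drugs USING gin (name gin_trgm_ops)")
conn.commit()

# %% escapes pg_trgm's % similarity operator inside a psycopg2 query,
# so the misspelling 'ibuprofin' still finds ibuprofen.
cur.execute(
    "SELECT name, similarity(name, %s) AS score "
    "FROM drugs WHERE name %% %s ORDER BY score DESC LIMIT 10",
    ("ibuprofin", "ibuprofin"),
)
print(cur.fetchall())
```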
From Natural Language to Structured Solr Queries using LLMs (Sease)
This talk draws on experimentation to enable AI applications with Solr. One important use case is to use AI for better accessibility and discoverability of the data: while User eXperience techniques, lexical search improvements, and data harmonization can take organizations to a good level of accessibility, a structural (or "cognitive") gap remains between what data users need and what data producers can provide.
That is where AI, and most importantly Natural Language Processing and Large Language Model techniques, could make a difference. Such a natural-language, conversational engine could facilitate access to and usage of the data by leveraging the semantics of any data source.
The objective of the presentation is to propose a technical approach and a way forward to achieve this goal.
The key concept is to enable users to express their search queries in natural language, which the LLM then enriches, interprets, and translates into structured queries based on the Solr index’s metadata.
This approach leverages the LLM’s ability to understand the nuances of natural language and the structure of documents within Apache Solr.
The LLM acts as an intermediary agent, offering a transparent experience to users automatically and potentially uncovering relevant documents that conventional search methods might overlook. The presentation will include the results of this experimental work, lessons learned, best practices, and the scope of future work that should improve the approach and make it production-ready.
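A conceptual sketch of that flow: an LLM (stubbed out here) maps natural language to structured parameters for Solr's standard /select handler. The llm() stub and field names are hypothetical; the q, fq, and rows parameters are ordinary Solr.

```python
import json
import requests

def llm(prompt):
    """Placeholder for a real LLM call returning JSON Solr parameters."""
    return json.dumps({"q": "title:shoes", "fq": ["price:[0 TO 100]"], "rows": 10})

def search(natural_language_query,
           solr_url="http://localhost:8983/solr/products"):
    params = json.loads(llm(
        "Translate to Solr query parameters using the index schema: "
        + natural_language_query
    ))
    return requests.get(f"{solr_url}/select", params=params).json()

results = search("cheap running shoes under 100 euros")
```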
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill (LizaNolte)
HERE IS YOUR WEBINAR CONTENT! 'Mastering Customer Journey Management with Dr. Graham Hill'. We hope you find the webinar recording both insightful and enjoyable.
In this webinar, we explored essential aspects of Customer Journey Management and personalization. Here’s a summary of the key insights and topics discussed:
Key Takeaways:
Understanding the Customer Journey: Dr. Hill emphasized the importance of mapping and understanding the complete customer journey to identify touchpoints and opportunities for improvement.
Personalization Strategies: We discussed how to leverage data and insights to create personalized experiences that resonate with customers.
Technology Integration: Insights were shared on how inQuba’s advanced technology can streamline customer interactions and drive operational efficiency.
High performance Serverless Java on AWS - GoTo Amsterdam 2024 (Vadym Kazulkin)
Java has for many years been one of the most popular programming languages, but it used to have a hard time in the Serverless community. Java is known for its high cold-start times and high memory footprint compared to other programming languages like Node.js and Python. In this talk I'll look at the general best practices and techniques we can use to decrease memory consumption and cold-start times for Java Serverless development on AWS, including GraalVM (Native Image) and AWS's own offering SnapStart, which is based on Firecracker microVM snapshot and restore and CRaC (Coordinated Restore at Checkpoint) runtime hooks. I'll also provide a lot of benchmarking of Lambda functions, trying out various deployment package sizes, Lambda memory settings, Java compilation options, and HTTP (a)synchronous clients, and measuring their impact on cold and warm start times.
"Choosing proper type of scaling", Olena SyrotaFwdays
Imagine an IoT processing system that is already quite mature and production-ready, whose client coverage is growing, and for which scaling and performance are life-and-death questions. The system has Redis, MongoDB, and stream processing based on ksqldb. In this talk, we will first analyze scaling approaches and then select the proper ones for our system.
ScyllaDB is making a major architecture shift. We’re moving from vNode replication to tablets – fragments of tables that are distributed independently, enabling dynamic data distribution and extreme elasticity. In this keynote, ScyllaDB co-founder and CTO Avi Kivity explains the reason for this shift, provides a look at the implementation and roadmap, and shares how this shift benefits ScyllaDB users.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/temporal-event-neural-networks-a-more-efficient-alternative-to-the-transformer-a-presentation-from-brainchip/
Chris Jones, Director of Product Management at BrainChip , presents the “Temporal Event Neural Networks: A More Efficient Alternative to the Transformer” tutorial at the May 2024 Embedded Vision Summit.
The expansion of AI services necessitates enhanced computational capabilities on edge devices. Temporal Event Neural Networks (TENNs), developed by BrainChip, represent a novel and highly efficient state-space network. TENNs demonstrate exceptional proficiency in handling multi-dimensional streaming data, facilitating advancements in object detection, action recognition, speech enhancement and language model/sequence generation. Through the utilization of polynomial-based continuous convolutions, TENNs streamline models, expedite training processes and significantly diminish memory requirements, achieving notable reductions of up to 50x in parameters and 5,000x in energy consumption compared to prevailing methodologies like transformers.
Integration with BrainChip’s Akida neuromorphic hardware IP further enhances TENNs’ capabilities, enabling the realization of highly capable, portable and passively cooled edge devices. This presentation delves into the technical innovations underlying TENNs, presents real-world benchmarks, and elucidates how this cutting-edge approach is positioned to revolutionize edge AI across diverse applications.
What is an RPA CoE? Session 1 – CoE VisionDianaGray10
In the first session, we will review the organization's vision and how it impacts the CoE structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...Fwdays
Direct losses from one minute of downtime = $5–10 thousand. Reputation is priceless.
As part of the talk, we will consider the architectural strategies necessary for developing highly loaded fintech solutions. We will focus on using queues and streaming to work with and manage large amounts of data efficiently in real time and to minimize latency.
We will pay special attention to the architectural patterns used in the design of the fintech system, microservices and event-driven architecture, which ensure scalability, fault tolerance, and consistency of the entire system.
What is an RPA CoE? Session 2 – CoE RolesDianaGray10
In this session, we will review the players involved in the CoE and how each role impacts opportunities.
Topics covered:
• What roles are essential?
• What place in the automation journey does each role play?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
"Scaling RAG Applications to serve millions of users", Kevin GoedeckeFwdays
How we managed to grow and scale a RAG application from zero to thousands of users in 7 months. Lessons from technical challenges around managing high load for LLMs, RAGs and Vector databases.
The Microsoft 365 Migration Tutorial For Beginner.pptxoperationspcvita
This presentation will help you understand the power of Microsoft 365. We cover every productivity app included in Office 365, outline common Office 365 migration scenarios, and explain how we can help you.
You can also read: https://www.systoolsgroup.com/updates/office-365-tenant-to-tenant-migration-step-by-step-complete-guide/
This talk will cover ScyllaDB Architecture from the cluster-level view and zoom in on data distribution and internal node architecture. In the process, we will learn the secret sauce used to get ScyllaDB's high availability and superior performance. We will also touch on the upcoming changes to ScyllaDB architecture, moving to strongly consistent metadata and tablets.
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...Alex Pruden
Folding is a recent technique for building efficient recursive SNARKs. Several elegant folding protocols have been proposed, such as Nova, Supernova, Hypernova, Protostar, and others. However, all of them rely on an additively homomorphic commitment scheme based on discrete log, and are therefore not post-quantum secure. In this work we present LatticeFold, the first lattice-based folding protocol, based on the Module SIS problem. This folding protocol naturally leads to an efficient recursive lattice-based SNARK and an efficient PCD scheme. LatticeFold supports folding low-degree relations, such as R1CS, as well as high-degree relations, such as CCS. The key challenge is to construct a secure folding protocol that works with the Ajtai commitment scheme. The difficulty is ensuring that extracted witnesses are low norm through many rounds of folding. We present a novel technique using the sumcheck protocol to ensure that extracted witnesses are always low norm no matter how many rounds of folding are used. Our evaluation of the final proof system suggests that it is as performant as Hypernova, while providing post-quantum security.
Paper Link: https://eprint.iacr.org/2024/257
Unsupervised Extraction of Attributes and Their Values from Product Description
1. Unsupervised Extraction of Attributes and
Their Values from Product Description
Keiji Shinzato and Satoshi Sekine
Rakuten Institute of Technology
17th Oct. 2013
The 6th International Joint Conference on Natural Language Processing
2. 2
What is Rakuten?
• Biggest e-commerce company in Japan.
• B2B2C model.
• Statistics:
– # of merchants: 40K+
– # of products: 100M+
– # of product categories: 40K+
• Each product page is assigned to a single product
category by the merchant.
• Product information offered by merchants is
described in various ways.
– Not well organized :-(
3. 3
Examples of product pages (wine category)
Table
Itemizations
Product data is offered by merchants using various methods.
4. 4
Examples of product pages (wine category)
Product data is offered by merchants using various methods.
Full texts
5. 5
Goal
• Develop an unsupervised methodology for
constructing structured data from full texts.
Attribute        Value
Color            Red
Production area  Italy, Tuscany
Grape variety    Merlot, Cabernet sauvignon, Petit verdot, Cabernet franc
Vintage          2010
Volume           750ml

Full texts (unstructured data) ⇒ structured data
6. 6
Unsupervised information extraction
• Distant supervision [Mintz+ 2009]
– Construct an annotated corpus using an existing
Knowledge Base (KB).
– Train a model from the constructed corpus.
Hiroshi Mikitani is founder and CEO of the online marketing company Rakuten.
⇒ Training data for founder-company information extraction (Founder: Hiroshi Mikitani)
⇒ Machine learning ⇒ Extraction model
7. 7
Problem of existing KBs
• Wikipedia
– Infobox is not tailored towards e-commerce.
• Freebase
– Only available in English.
– Attributes and values are limited even in English.
Attributes in the infobox for the wine article in Wikipedia: Production area, Grape variety, Winery.
Attributes for users seeking their favorite wines additionally include: Vintage.
⇒ Gap
1. Construct a KB for product information extraction.
2. Remove false-positive and false-negative annotations from
the automatically constructed corpus.
8. 8
Agenda
• Background
• Overview of our approach
– Knowledge base induction
– Training data construction
– Extraction model training
– Product page structuring
• Experiments
• Conclusion and future work
9. 9
Overview of our approach
Input: Product pages in the category C
– Pages for model construction (pages including tables or itemizations), e.g.:
  Winery: Bodegas Carchelo / Type: Medium body / Grape: Monastrell 40%, Syrah 40%, Cabernet Sauvignon 20%
  Type: Red / Country: Italy / Tuscany / Grape: Sangiovese / Year: 2011
– Pages that we want to structure (unstructured pages)
10. 10
Overview of our approach
Input: Product pages in the category C
– Pages for model construction (pages including tables or itemizations)
– Pages that we want to structure (unstructured pages)
1. Knowledge base induction ⇒ Knowledge base (KB): <attr1, value1>, <attr2, value2>, <attr1, value3>, …
2. Training data construction ⇒ Annotated pages
3. Extraction model training ⇒ Extraction model
4. Product page structuring ⇒ Output: Structured data
11. 11
KB induction – Extraction of attributes and their values –
• Attribute acquisition:
– Assumption: Expressions that are often located in
table headers can be considered as attributes.
– Extract expressions enclosed by <TH> tags.
• Attribute value extraction:
– Extract attribute-value using regular expression
patterns [Yoshinaga and Torisawa 2006].
– Store <attr., val.> in the KB along with the number of
merchants that use it in tables or itemizations.
Merchant frequency (MF):
<Production area, France> (29), <Region, Italy> (13)
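A minimal sketch of this induction step, with hypothetical HTML handling and merchant bookkeeping (the page format, function names and data shapes below are assumptions for illustration, not the authors' implementation):

```python
import re
from collections import defaultdict

# Attributes are expressions found in table headers (<TH> tags); each
# <attr, value> pair is stored with the number of distinct merchants
# whose tables contain it (merchant frequency, MF).
TH_TD = re.compile(r"<th[^>]*>(.*?)</th>\s*<td[^>]*>(.*?)</td>",
                   re.IGNORECASE | re.DOTALL)

def induce_kb(pages):
    """pages: iterable of (merchant_id, html) tuples."""
    merchants = defaultdict(set)          # (attr, value) -> merchant ids
    for merchant_id, html in pages:
        for attr, value in TH_TD.findall(html):
            merchants[(attr.strip(), value.strip())].add(merchant_id)
    return {pair: len(ids) for pair, ids in merchants.items()}   # pair -> MF

kb = induce_kb([
    ("shop-a", "<tr><th>Production area</th><td>France</td></tr>"),
    ("shop-b", "<tr><th>Production area</th><td>France</td></tr>"),
])
print(kb)   # {('Production area', 'France'): 2}
```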
12. 12
KB induction – Attribute synonym discovery –
• Assumption: Attributes can be seen as
synonyms of one another if
– they are not included in the same structured data, and
– they share an identical popular value.
• Regard attribute pairs satisfying the conditions as
synonyms.
• Aggregate similar pairs of attribute synonyms by
computing the cosine measure.
Synonym: <Production area, France>, <Region, France>
Non-synonym: <Alcohol, 15 degree>, <Temperature, 15 degree>
(Production area, Region), (Country, Production area)
⇒ (Country, Region, Production area)
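A rough sketch of the synonym discovery conditions, under simplifying assumptions (toy record format, an invented cosine threshold, and the value-popularity condition reduced to sharing at least one value):

```python
import math
from collections import defaultdict

# Two attributes are synonym candidates if they never co-occur in the same
# structured record but share a value; candidates are kept when the cosine
# of their value-frequency vectors is high enough.

def cosine(vec_a, vec_b):
    dot = sum(vec_a[v] * vec_b[v] for v in set(vec_a) & set(vec_b))
    norm = (math.sqrt(sum(c * c for c in vec_a.values()))
            * math.sqrt(sum(c * c for c in vec_b.values())))
    return dot / norm if norm else 0.0

def synonym_pairs(records, threshold=0.5):
    """records: list of {attribute: value} dicts, one per table/itemization."""
    freq = defaultdict(lambda: defaultdict(int))   # attr -> value -> count
    for rec in records:
        for attr, val in rec.items():
            freq[attr][val] += 1
    attrs = sorted(freq)
    pairs = []
    for i, a in enumerate(attrs):
        for b in attrs[i + 1:]:
            if any(a in r and b in r for r in records):
                continue                      # co-occur: not synonyms
            if not set(freq[a]) & set(freq[b]):
                continue                      # no shared value
            if cosine(freq[a], freq[b]) >= threshold:
                pairs.append((a, b))
    return pairs

records = [{"Production area": "France"}, {"Region": "France"},
           {"Alcohol": "15 degree", "Temperature": "15 degree"}]
print(synonym_pairs(records))   # [('Production area', 'Region')]
```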
14. 14
Overview of our approach
(Pipeline diagram as on slide 10.)
15. 15
Training data construction
• Simple longest string matching between full texts
and attribute-values in KB.
• Problems in automatic annotation:
– Incorrect annotation (false-positive)
• The flavor of the <grape_variety> grape </grape_variety> is quite
a little.
– Missing annotation (false-negative)
• Chateau Talbot is a famous winery in <production_area> France
</production_area>.
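A minimal sketch of the longest-match annotation step, assuming a toy KB keyed by value strings (the function and data shapes are illustrative only):

```python
# Greedy longest-match of KB values against a sentence, emitting inline
# tags for the matched spans. Trying longer values first makes the
# longest match win.

def annotate(sentence, kb):
    """kb: dict mapping value string -> attribute name."""
    out, i = [], 0
    values = sorted(kb, key=len, reverse=True)
    while i < len(sentence):
        for v in values:
            if sentence.startswith(v, i):
                out.append(f"<{kb[v]}>{v}</{kb[v]}>")
                i += len(v)
                break
        else:
            out.append(sentence[i])
            i += 1
    return "".join(out)

kb = {"France": "production_area", "Merlot": "grape_variety"}
print(annotate("Chateau Talbot is a famous winery in France.", kb))
```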
16. 16
Incorrect annotation filtering
• Assumption: Attribute values with low MFs in
structured data and high MFs in unstructured
data are likely to be incorrect.
N_M … # of merchants offering a product in a category.
M_S … # of merchants offering structured data in a category.
MF_D(v) … # of merchants describing the value v in full texts.
MF_S(v) … # of merchants describing the value v in structured data.

Score(v) = (MF_D(v) / N_M) / (MF_S(v) / M_S)

The numerator is the likelihood that the value v occurs in full texts; the denominator is the likelihood that it occurs in structured data.
We regard attribute values with scores greater than 30 as incorrect,
and remove sentences including such values from the corpus.
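To make the threshold concrete, here is the score computed on invented counts (all numbers below are hypothetical, chosen only to illustrate the formula):

```python
# Worked example of the incorrect-annotation score on hypothetical counts.

def score(mf_d, mf_s, n_m, m_s):
    """Score(v) = (MF_D(v) / N_M) / (MF_S(v) / M_S)."""
    return (mf_d / n_m) / (mf_s / m_s)

# Assume 1,000 merchants in the category (N_M), 400 of which offer
# structured data (M_S). A value appearing in the full texts of 300
# merchants but in the tables of only 3 scores well above the cutoff.
print(score(mf_d=300, mf_s=3, n_m=1000, m_s=400))   # 40.0 > 30 -> incorrect
```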
17. 17
Missing annotation filtering
• Induce frequently occurring token sequences in
attribute values with PrefixSpan [Pei+ 2001].
• Remove sentences containing a string that is not
annotated but matches an induced pattern.
– Chateau Talbot is a famous winery in <production_area>
France </production_area>.
Pattern:
[chateau] [ANY_TOKEN]
<Winery, Chateau Lanessan>
<Winery, Chateau Fontareche>
<Winery, Chateau Latour>
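A rough sketch of how one induced pattern could flag a missing annotation; the regular expression stands in for the PrefixSpan-induced sequence [chateau] [ANY_TOKEN], and the preceding-tag check is a deliberately crude assumption:

```python
import re

# A pattern induced from winery values flags unannotated matches; the
# containing sentence is then dropped from the training corpus.
PATTERN = re.compile(r"\bchateau\s+\S+", re.IGNORECASE)

def has_missing_annotation(sentence):
    for match in PATTERN.finditer(sentence):
        # Crude check: keep the span only if it is directly preceded by a
        # closing bracket of an opening tag; otherwise it is unannotated.
        before = sentence[:match.start()]
        if not before.rstrip().endswith(">"):
            return True
    return False

s = "Chateau Talbot is a famous winery in <production_area>France</production_area>."
print(has_missing_annotation(s))   # True -> the sentence would be removed
```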
18. 18
Overview of our approach
(Pipeline diagram as on slide 10.)
19. 19
Extraction model training
• Algorithm: Conditional random fields [Lafferty+ 2001]
• Chunk tag: Start/End (IOBES) model [Sekine+ 1998]
• Features:
– Token: Surface form of the token.
– Base: Base form of the token.
– PoS: Part-of-Speech tag of the token.
– Char. type: Types of characters in the token.
– Prefix: First two characters of the token.
– Suffix: Last two characters of the token.
– The above features for the ±3 tokens surrounding the token.
These features are frequently employed in Japanese NER.
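A sketch of this feature template in Python (the token dictionary layout and the character-type heuristic are assumptions; in practice the features, with IOBES tags, would be fed to a CRF toolkit such as CRF++ or sklearn-crfsuite):

```python
# tokens is a list of dicts with 'surface', 'base' and 'pos' keys, as a
# Japanese morphological analyzer such as MeCab might produce.

def token_features(tokens, i):
    feats = {}
    for offset in range(-3, 4):          # the token itself and +/-3 context
        j = i + offset
        if not 0 <= j < len(tokens):
            continue
        t = tokens[j]
        surface = t["surface"]
        feats[f"token[{offset}]"] = surface
        feats[f"base[{offset}]"] = t["base"]
        feats[f"pos[{offset}]"] = t["pos"]
        # Crude character-type feature: digit / ascii / other.
        feats[f"ctype[{offset}]"] = ("digit" if surface.isdigit()
                                     else "ascii" if surface.isascii()
                                     else "other")
        feats[f"prefix[{offset}]"] = surface[:2]   # two-character prefix
        feats[f"suffix[{offset}]"] = surface[-2:]  # two-character suffix
    return feats

tokens = [{"surface": "750", "base": "750", "pos": "noun"},
          {"surface": "ml", "base": "ml", "pos": "noun"}]
print(token_features(tokens, 0)["ctype[0]"])   # digit
```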
20. 20
Overview of our approach
(Pipeline diagram as on slide 10.)
21. 21
Agenda
• Background
• Overview of our approach
– Knowledge base induction
– Training data construction
– Extraction model training
– Product page structuring
• Experiments
• Conclusion and future work
22. 22
Experiments
• Evaluation of KB
– Extracted attributes
– Aggregated attribute synonyms
– Extracted attribute-values
• Evaluation of the quality of annotated corpora
• Evaluation of extraction models
23. 23
Experimental setting
• Category:
– Selected eight major categories in Rakuten.
• Wine, T-shirts, Printer ink, Shampoo, Golf ball, and others.
• Attribute:
– Selected the top eight attributes in each category
according to the merchant frequencies of the attributes.
• Training dataset:
– Randomly sampled 100K sentences for each category.
• Evaluation dataset:
– Built an annotated corpus comprising 1,776 product
pages gathered from the categories.
24. 24
Compared models
• KB match:
– Matching attribute values in KB, and then filtering out
problematic annotations.
• Model w/o filters:
– Training models based on a corpus to which neither
filter is applied.
• Model w/ incorrect annotation filter:
– Training models based on a corpus where only the
filter for incorrect annotations is applied.
• Model w/ missing annotation filter:
– Training models based on a corpus where only the
filter for missing annotations is applied.
25. 25
Evaluation of extraction models
Model                                  P (%)   R (%)   F score
KB match                               57.14   29.29   37.21
Model w/o filters                      52.60   54.49   53.14
Model w/ incorrect annotation filter   60.46   54.23   56.84
Model w/ missing annotation filter     50.47   59.71   54.43
Model of the proposed method           57.05   59.66   58.15
26. 26
Evaluation of extraction models (results table as on slide 25)
+30.4 pt.
Recall was dramatically improved.
⇒ Contexts surrounding a value and patterns of tokens in a value are successfully captured.
27. 27
Evaluation of extraction models (results table as on slide 25)
+7.9 pt.
The incorrect annotation filter improved precision.
28. 28
Evaluation of extraction models (results table as on slide 25)
+5.2 pt.
The incorrect annotation filter improved precision.
The missing annotation filter improved recall.
29. 29
Evaluation of extraction models (results table as on slide 25)
+5.1 pt.
The incorrect annotation filter improved precision.
The missing annotation filter improved recall.
⇒ The precision and recall of the proposed method are enhanced by employing both filters.
30. 30
Error trend
• Randomly selected 50 attribute values judged as
incorrect in wine and shampoo categories.
Type                                 # of err.
Automatic annotation                 36
Incorrect KB entry                   23
Over-generation by learned patterns  15
Extraction from unrelated regions    12
Others                               14
31. 31
Automatic annotation error
• 土壌が<産地>ボルドー</産地>のポムロールと非常に似ている。
(The soil is very similar to that of the Pomerol region in <production_area>Bordeaux</production_area>.)
• <成分>ヒアルロン酸</成分>以上の保水力がある。
(It has a water-holding ability higher than that of <constituent>hyaluronan</constituent>.)
• <タイプ>白</タイプ>カビチーズに合わせるとより楽しめます。
(<type>White</type> mold cheese will enhance the taste of the wine.)
• 輸出は全体の<アルコール>10%</アルコール>程度。
(Exports are approximately <alcohol>10%</alcohol> of the total.)
32. 32
Related work
• Product information extraction
– (Semi-) Supervised methodology [Ghani+ 2006, Probst+
2007, Davidov+ 2010, Bakalov+ 2011, Putthividhya+ 2011]
⇒ Training data or initial seeds are required.
– Unsupervised methodology [Yoshinaga+ 2006, Dalvi+ 2009,
Gulhane+ 2010, Mauge+ 2012, Bing+ 2012]
⇒ Not applicable to full texts, or limited in the size of texts handled.
• Unsupervised NER / Unsupervised IE
– Many attempts based on distant supervision [Nadeau+
2006, Whitelaw+ 2008, Nothman+ 2008, Mintz+ 2009, Ritter+
2011]
⇒ Wikipedia and Freebase are used as resources.
33. 33
Conclusion and future work
• A distant-supervision-based approach for extracting
attributes and their values from product pages.
– Construction of a knowledge base.
– Removal of false-positive and false-negative annotations
from the automatically constructed corpus.
• Evaluated the performance of KB induction,
automatic annotation, and extraction models
under multiple categories.
• Future work
– Improve the annotation quality by considering contexts.
– Construct KB with wide coverage and high quality.