The document discusses a data enrichment framework that uses data mining and semantic techniques to automatically select and enrich data from web APIs. The framework aims to address issues with static data enrichment approaches by dynamically selecting sources based on data availability and source quality. It assesses attribute importance, selects sources contextually based on input data and source performance, monitors source quality over time, and adjusts source selection accordingly. The framework provides granular, adaptive data enrichment to integrate diverse data sources for tasks like customer profiling, competitive intelligence, and fraud detection.
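As a rough illustration of the dynamic selection idea described above (not the framework's actual algorithm), the sketch below scores each candidate API by an observed success rate and routes an enrichment request to the best-performing source that can supply the needed attribute; all names and structures are invented for the example.

```python
class SourceStats:
    """Rolling quality tracker for one hypothetical web API source."""
    def __init__(self):
        self.calls = 0
        self.successes = 0

    def record(self, ok):
        self.calls += 1
        self.successes += int(ok)

    @property
    def quality(self):
        # Laplace-smoothed success rate, so unseen sources are not ruled out.
        return (self.successes + 1) / (self.calls + 2)


def pick_source(sources, attribute, stats):
    """Pick the best-performing source that can supply the given attribute."""
    candidates = [s for s in sources if attribute in s["provides"]]
    return max(candidates, key=lambda s: stats[s["name"]].quality, default=None)


# Toy usage: two sources able to enrich a 'company_size' attribute.
sources = [
    {"name": "api_a", "provides": {"company_size", "industry"}},
    {"name": "api_b", "provides": {"company_size"}},
]
stats = {"api_a": SourceStats(), "api_b": SourceStats()}
stats["api_a"].record(True)
stats["api_b"].record(False)
print(pick_source(sources, "company_size", stats)["name"])  # api_a
```

Monitoring is just the `record` calls over time: a source whose quality decays falls out of the `max`, which is the adaptive behavior the summary describes.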
Odam an optimized distributed association rule mining algorithm (synopsis) - Mumbai Academisc
This document proposes ODAM, an optimized distributed association rule mining algorithm. It aims to discover rules based on higher-order associations between items in distributed textual documents that are neither vertically nor horizontally distributed, but rather a hybrid of the two. Modern organizations have geographically distributed data stored locally at each site, making centralized data mining infeasible due to high communication costs. Distributed data mining emerged to address this challenge. ODAM reduces communication costs compared to previous distributed ARM algorithms by mining patterns across distributed databases without requiring data consolidation.
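The count-exchange idea behind distributed ARM can be sketched in a few lines of Python: each site counts itemsets locally and only those small summaries travel to the merging site, never the raw records. This is a generic illustration of the principle, not ODAM's optimized protocol.

```python
from collections import Counter
from itertools import combinations

def local_counts(transactions, k=1):
    """Run at each site: count k-itemsets locally; only these counts travel."""
    counts = Counter()
    for t in transactions:
        for itemset in combinations(sorted(set(t)), k):
            counts[itemset] += 1
    return counts

def global_frequent(sites, min_support, k=1):
    """Merging site: combine per-site counts instead of shipping raw records."""
    total, n = Counter(), 0
    for transactions in sites:
        total.update(local_counts(transactions, k))
        n += len(transactions)
    return {items: c for items, c in total.items() if c / n >= min_support}

site1 = [["bread", "milk"], ["bread", "beer"]]
site2 = [["milk", "beer"], ["bread", "milk"]]
print(global_frequent([site1, site2], min_support=0.5))
```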
This document provides an introduction to data mining and text mining. It defines data mining as the non-trivial extraction of implicit information from data for exploration and analysis. Text mining is described as discovering useful information from large text collections. Common data mining tasks like classification, clustering, and association rule discovery are explained. Challenges of data mining like scalability, dimensionality, and heterogeneous data are also discussed.
Lessons and Challenges from Mining Retail E-Commerce Data - Kun Le
This document discusses lessons learned from mining retail e-commerce data over 4 years working with over 20 clients. Key lessons include:
1. Clients often don't know what specific business questions to ask, so presenting preliminary findings gets them to provide a long list of questions.
2. Pushing clients to ask deeper "characterization" and "strategic" questions rather than just basic reporting questions.
3. The software architecture automatically collects useful data like clickstreams, searches, and form failures without additional work, solving problems that make e-commerce data mining difficult.
4. Collecting the right data up front is important, and changes later are difficult. Integrating external events like marketing is also
The document discusses setting up a database to track author contracts for a small publishing company. It identifies seven key fields that would be needed for the contract database, including author name, book title, and payment details. It also notes that some of these fields, like author name and book title, could be reused in other existing databases at the company, such as an author database or a book title database.
Datalicious was founded in late 2007 and has since grown to become a 360 data agency with specialist teams combining analysts and developers. It has a short but successful history in web analytics and a carefully selected group of best-in-breed partners. Datalicious provides a wide range of data services across the data, insights, and action spectrum, including platforms, analytics, and marketing campaigns. It serves clients across all industries and aims to help them progress along the data journey from basic reporting to advanced predictive modeling and trigger-based marketing.
Architecting a-big-data-platform-for-analytics 24606569 - Kun Le
This document discusses the growth of big data and the need for businesses to analyze new and complex data sources. It describes how data has become more varied in type, larger in volume, and generated faster. It also outlines different types of big data analytics workloads and technology options for building an end-to-end big data analytics platform. Finally, it provides an example of IBM's solution for analyzing both data in motion and at rest across the entire big data analytics lifecycle.
A survey on various architectures, models and methodologies for information r... - IAEME Publication
This document discusses various architectures, models, and methodologies used in information retrieval. It describes query models, ranking models, and feedback models used by researchers. It also highlights the importance of using context-based queries to better understand a user's search intent. The document provides an extensive survey of different approaches used in information retrieval systems and how adding context can help improve search results.
Discovering diamonds under coal piles: Revealing exclusive business intellige... - IJERA Editor
Web Mining has gained prominence over the last decade. This rise is concomitant with the upsurge of pure players, the multiple challenges of the data deluge, the trend toward automation and integration within organizations, and a desire for hyper-segmentation. Confronted, partly or wholly, with these issues, companies increasingly seek to replicate the data mining toolbox on web data. Although much is known about the technical aspects of WM, little is known about the extent to which WM actually fits within a customer relationship management system designed to attract and retain the maximum number of customers. An exploratory study involving twelve senior professionals and scholars indicated that WM is well suited to achieving most customer relationship management objectives with regard to profiling existing web customers. The results of this study suggest that engineering WM processes into analytic customer relationship management systems may yield highly beneficial returns, provided that some guidelines are scrupulously followed.
DATA MINING WITH CLUSTERING ON BIG DATA FOR SHOPPING MALL’S DATASET - AM Publications
Big Data refers to data sets so large that capturing, managing, processing, and storing them is beyond the ability of most software tools and people, and their size keeps growing day by day. In most enterprise scenarios the data is too big, or moves too fast, for current processing capacity. Vendors also use the term to refer to the technology, tools, and processes an organization requires to handle large amounts of data and the facilities to store them. These advances make relationship marketing a reality in today's competitive world, yet such volumes cannot be analyzed in the traditional manner, by manual data analysis. Technologies such as data warehousing and data mining have therefore made customer relationship management an area where business firms can gain a competitive advantage by identifying customer behaviors and needs. This paper focuses on data mining techniques that extract hidden predictive information from large databases so that organizations can identify valuable customers and predict future user behavior, enabling proactive, knowledge-driven decisions. Data mining tools answer business questions that in the past were too time-consuming to address, and this makes customer relationship management possible. The paper explains how data mining can be used to accomplish the goals of today's customer relationship management and decision making for companies that deal with big data.
Information Extraction and Aggregation from Unstructured Web Data for Busines... - Alexander Michels
Team Praedicat's final presentation at UCLA's Research in Industrial Projects for Students. We discuss our project for Praedicat, Inc., which algorithmically profiled companies to help assess their actuarial risk.
GitHub: https://github.com/alexandermichels/pcatxcore
Customer segmentation is a machine learning project developed using clustering, a technique from the unsupervised branch of machine learning.
Segmentation groups prospects by their wants and needs. It identifies the most valuable customer segments, so vendors can improve their return on marketing investment by targeting only those likely to become their best customers.
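A minimal segmentation sketch along these lines, assuming scikit-learn and an invented two-column customer table (annual spend, visit frequency):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented customer table: [annual spend ($), visits per year].
X = np.array([[200, 2], [220, 3], [5000, 40], [4800, 38], [1500, 12], [1600, 11]])

# Scale features so that spend does not dominate the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# Three segments; in practice k is chosen with the elbow or silhouette method.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels)  # e.g. [0 0 1 1 2 2]: low-, high-, and mid-value segments
```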
A few days ago I presented a webinar on Insight as a Service. In the presentation I expanded on the concept, which I first introduced and later elaborated on my blog: http://blog.tridentcap.com
I am including the presentation and the notes because they provide further detail on the concept and some examples.
This document provides a summary of a group project report on big data analytics. It discusses how big data and analytics can help companies optimize supply chains by improving decision making and handling risks. It defines big data as large, diverse, and rapidly growing datasets that are difficult to manage with traditional tools. It also discusses data sources, management, quality dimensions, and using statistical process control methods to monitor and control data quality throughout the production process.
Best practices for building and deploying predictive models over big data pre... - Kun Le
The tutorial is divided into 12 modules that cover best practices for building and deploying predictive models over big data. It introduces key concepts like predictive analytics, building predictive models, and deploying models. The life cycle of a predictive model is also described, from exploratory data analysis to deployment and operations.
The Comparison of Big Data Strategies in Corporate Environment - IRJET Journal
The document discusses and compares different big data strategies that corporations can use to handle large volumes of data. It analyzes traditional relational database management systems (RDBMS), MapReduce techniques, and a hybrid approach. While each strategy has benefits, the hybrid approach that combines traditional databases and MapReduce is identified as being most valuable for companies pursuing business analytics, as it allows for efficiently handling both structured and unstructured data at large scales. The document provides an overview of these strategies and their suitability based on different corporate needs and environments.
[This is work presented at SIGMOD'13.]
The use of large-scale data mining and machine learning has proliferated through the adoption of technologies such as Hadoop, with its simple programming semantics and rich and active ecosystem. This paper presents LinkedIn's Hadoop-based analytics stack, which allows data scientists and machine learning researchers to extract insights and build product features from massive amounts of data. In particular, we present our solutions to the "last mile" issues in providing a rich developer ecosystem. This includes easy ingress from and egress to online systems, and managing workflows as production processes. A key characteristic of our solution is that these distributed system concerns are completely abstracted away from researchers. For example, deploying data back into the online system is simply a 1-line Pig command that a data scientist can add to the end of their script. We also present case studies on how this ecosystem is used to solve problems ranging from recommendations to news feed updates to email digesting to descriptive analytical dashboards for our members.
The document describes a proposed data warehouse for a collection agency. It would help the agency identify profitable clients and account types, measure employee performance, and make better strategic decisions. The data warehouse would collect, categorize, and analyze data on accounts, clients, collection methods, employees, and more. Key performance indicators would measure costs and revenues by category, compare employee collections to targets, and analyze revenues by geography to guide business growth.
Causata is a big data architecture that stores customer interaction data from digital sources in HBase for scalability. It constructs an identity graph to link customer identifiers and assemble a timeline of events. Causata then computes predictive profiles with variables and model scores to provide a structured view of customers. Analysts can query customer data and profiles using Causata SQL to retrieve predictive records for analysis and power real-time decisions.
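The identity-graph step can be illustrated with a tiny union-find structure that links identifiers observed together; this is a generic sketch, not Causata's implementation:

```python
class IdentityGraph:
    """Minimal union-find over customer identifiers (cookie, email, device)."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def link(self, a, b):
        """Record that two identifiers were seen together (e.g. a login event)."""
        self.parent[self.find(a)] = self.find(b)


g = IdentityGraph()
g.link("cookie:123", "email:ann@example.com")
g.link("email:ann@example.com", "device:abc")
# Both identifiers now resolve to the same customer root.
print(g.find("cookie:123") == g.find("device:abc"))  # True
```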
The document provides an overview of Lean Six Sigma (LSS) and its key principles and tools. It discusses how Lean focuses on eliminating waste and ensuring smooth workflow, while Six Sigma aims to reduce variation and improve quality. The combination of Lean and Six Sigma in LSS provides a balanced approach that can drive process improvements in any organization. Case studies like Toyota demonstrate how LSS principles like standard work and data-driven decision making can significantly enhance production quality and efficiency. Key takeaways emphasize measuring results from LSS projects and implementing them as a team through defined roles and documented progress.
The document introduces the 797 mining truck, which has a 360 ton nominal payload capacity. It is larger than the 793 model, with dimensions of 30 feet wide, 24 feet high, and 48 feet long. The 797 has improvements like a cast frame, updated powertrain components, and an operator station with additional features. The objective of the 797 truck is to improve haulage costs per ton through a longer component life and durable, purpose-built design.
This document is Karthik Gomadam's dissertation submitted in partial fulfillment of the requirements for a Doctor of Philosophy degree from Wright State University. The dissertation addresses problems related to semantics enriched service environments, including service description, discovery, data mediation, and dynamic configuration. It proposes techniques to add semantic metadata to RESTful services and resources on the web, an algorithm for service discovery and ranking, and methods for aiding data mediation and dynamic configuration. The dissertation also examines applying service-oriented principles to social and human computation.
The document summarizes a development plan for the Eastern Urban Center at Millenia, which will include up to 3,000 multifamily homes and 300,000 square feet of retail across 80 city blocks. The development is designed as a walkable, mixed-use district centered around sustainability with public parks, trails, and transit connections. It aims to achieve LEED for Neighborhood Development certification by integrating energy efficient buildings, transportation, and urban design.
This document provides an overview of APA citation style. It discusses why APA style is used, how to cross-reference sources, and how to establish credibility as a writer. It also covers creating a reference page, using parenthetical citations, and where to find additional APA style resources.
The document proposes a platform that matches patients from online health communities to relevant medical research projects, by developing rich semantic profiles of both patients and projects. It analyzes patient conversations to extract medical conditions, medications, and demographics to create patient profiles. It also analyzes research project descriptions to create profiles. These profiles are then matched using semantic similarity algorithms to find relevant patients for projects. The platform was prototyped and shown to accurately match patients to projects with similar medical conditions.
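A toy version of the matching step, using simple concept-set overlap (Jaccard) as a stand-in for the platform's semantic similarity algorithms; the profiles here are invented:

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two concept profiles (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical profiles: sets of concepts extracted from conversations/abstracts.
patients = {
    "p1": {"type 2 diabetes", "metformin", "hypertension"},
    "p2": {"asthma", "albuterol"},
}
project = {"type 2 diabetes", "metformin"}

ranked = sorted(patients, key=lambda p: jaccard(patients[p], project), reverse=True)
print(ranked[0])  # p1 has the highest concept overlap with the project profile
```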
The document summarizes the plans for developing the Eastern Urban Center, an 80 city block urban core in Chula Vista. It will include up to 3,000 multifamily homes, 300,000 square feet of retail and commercial space, and acres of parks and public spaces. The development aims to be a walkable, mixed-use district with sustainable design and public transportation integration.
This document summarizes the sustainable development practices of the Corky McMillin Companies, a premier builder of mixed-use master planned communities. It discusses how the company has preserved thousands of acres of open space and wetlands. It also outlines current practices like LEED certification and highlighting green building options for homebuyers. The document concludes by describing plans for a future development called the Eastern Urban Center, which will utilize sustainable features like transit integration, urban parks, and compact neighborhoods to create an energy efficient community.
The document discusses the importance of attitude and how much of it is visible to others. It states that just like an iceberg where only 10% is visible above water, only a small part of a person's attitudes, knowledge, and skills are visible to others. The rest, including their values, motives, and beliefs, lie below the surface unknown to others. It emphasizes that attitude is everything and determines a person's success more than their aptitude. It provides several quotes about the power of positive thinking and having a positive attitude.
Homer wants to drive from Athens, GA to NYC and have his favorite burger along the way. He downloads an app to find restaurants but realizes it is too complicated. The document then discusses creating a simpler app to help Homer find McDonald's restaurants on his route. It outlines a 4-step process: 1) finding relevant services, 2) integrating the services, 3) ensuring they work together, and 4) identifying issues when used. The rest of the document describes semantic techniques like RDF that could be used to build such an app.
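For a flavor of the RDF modeling such an app might use, here is a small sketch with the rdflib Python library; the vocabulary (http://example.org/) and the data are made up for illustration:

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")  # hypothetical vocabulary
g = Graph()

# Describe a restaurant as RDF triples: (subject, predicate, object).
g.add((EX.mcd_atlanta, RDF.type, EX.Restaurant))
g.add((EX.mcd_atlanta, EX.chain, Literal("McDonald's")))
g.add((EX.mcd_atlanta, EX.onRoute, Literal("I-85 N")))

# A route-planning app could then query by chain and by road segment.
q = """
SELECT ?r WHERE {
  ?r <http://example.org/chain> "McDonald's" ;
     <http://example.org/onRoute> "I-85 N" .
}
"""
for row in g.query(q):
    print(row.r)  # http://example.org/mcd_atlanta
```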
The document discusses semantic web services and proposes approaches to help describe, discover, compose and mediate between services in a semantic way. It presents technologies like OWL-S, WSMO that model service semantics and proposes semantic templates to describe service aspects like inputs, outputs and properties in a declarative way. It also discusses facilitating service discovery, composition and mediation through semantic annotations and using techniques like faceted search and semantic association querying. The document argues that taking a semantic approach helps address challenges in service interoperability, discovery and mediation.
The document discusses the benefits of exercise for mental health. Regular physical activity can help reduce anxiety and depression and improve mood and cognitive functioning. Exercise causes chemical changes in the brain that may help boost feelings of calmness, happiness and focus.
The document discusses dynamic and agile service-oriented architectures using semantic web services. It describes how semantic web services can enable description, discovery, and data mediation of web services to support more dynamic integration. Semantic annotations of web services using ontologies is proposed to achieve agility through increased reuse, easier integration, and the ability to span domains.
This document is the copyright page and dedication for the book "100 Ways to Boost Your Self-Confidence" by Barton Goldsmith. It lists the publisher, copyright date, and thanks those who supported the author in writing the book, including friends, colleagues, and organizations. The book is dedicated to the author's clients, readers, and listeners who have shared personal stories that gave him confidence to write this book.
The document introduces the 797 mining truck, which has a 360 ton nominal payload capacity. It is larger than the 793 model, with dimensions of 30 feet wide, 24 feet high, and 48 feet long. The 797 has improvements like a cast frame, updated powertrain components like a new torque converter and transmission, a larger operator station, and purpose-built components for longer life, with the goal of improving haulage cost per ton.
Data has become a key asset for modern enterprises, as they use larger datasets to gain competitive advantages through better decision making and customer insights. However, data differs from traditional assets as it is intangible, can have multiple users, and utilizing it often creates more data. To understand the value of their data assets, organizations need to assign them a monetary value. While there is no standard approach, common methods include calculating the intrinsic, business, and performance value of data assets or conducting financial valuations based on potential revenue opportunities and replacement costs. Data valuation provides clarity on how data assets can drive growth and supports establishing formal data management practices.
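As a loose illustration only (the document describes no standard formula), a toy valuation might discount the revenue attributable to a dataset and floor it at the cost of rebuilding the data; every input below is an invented assumption:

```python
def data_asset_value(annual_uplift, replacement_cost, attribution,
                     horizon_years, discount_rate=0.10):
    """Toy model: discounted cash flow of the revenue credited to a dataset,
    floored at the cost of rebuilding it. All inputs are assumptions."""
    dcf = sum(annual_uplift * attribution / (1 + discount_rate) ** t
              for t in range(1, horizon_years + 1))
    return max(dcf, replacement_cost)

# E.g. $500k/yr of uplift, 60% attributable to the data, over three years:
print(round(data_asset_value(500_000, 400_000, 0.6, 3)))  # ~746056
```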
Semantic 'Radar' Steers Users to Insights in the Data Lake - Cognizant
The document discusses how a semantic "data lake" can help organizations extract meaning and insights from large amounts of digital data. A data lake combines data from different sources and uses semantic models, tagging, and algorithms to help users more quickly find relevant data relationships and insights. It describes how semantic technology plays a key role in data ingestion, management, modeling of different views, querying, and exposing analytics as web services to create personalized customer experiences.
Semantic 'Radar' Steers Users to Insights in the Data Lake - Thomas Kelly, PMP
By infusing information with intelligence, users can discover meaning in the digital data that envelops people, organizations, processes, products and things.
1. What are the business costs or risks of poor data quality Sup.docx - SONU61709
1. What are the business costs or risks of poor data quality? Support your discussion with at least 3 references.
Data are utilized in most corporate activities and form the basis for decisions at both the operational and strategic levels. Poor-quality data can therefore have significantly negative impacts on the efficiency of a company, whereas good-quality data are often crucial to a company's success. The development of information technology throughout the last decades has enabled organizations to collect and store huge amounts of data. However, as data volumes increase, so does the complexity of managing them. Since larger and more complicated information resources are being collected and managed in organizations today, the risk of poor data quality increases. Poor data quality can have significant negative economic and social impacts on an organization. Its consequences reach business users as lower customer satisfaction, increased running costs, inefficient decision-making processes, lower performance, and low employee job satisfaction.
References:
1. Haug, A., Zachariassen, F., & van Liempd, D. (2011). The cost of poor data quality. Journal of Industrial Engineering and Management, 4(2), 168-193
2. https://www.edq.com/blog/the-consequences-of-poor-data-quality-for-a-business/
3. Knowledge Engineering and Management by the Masses: 17th International Conference, EKAW 2010, Lisbon, Portugal, October 11-15, 2010, Proceedings.
2. Data Mining: Data mining is an analytic process designed to explore data (usually massive amounts, generally business or market related, also called "big data") in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction, and predictive data mining is the most common type and the one with the most direct business applications. The process of data mining consists of three stages: (1) the initial exploration, (2) model building or pattern identification with validation/verification, and (3) deployment.
Reference:
1. Zhi-Hua Zhou, Three perspectives of data mining.
2. http://www.statsoft.com/Textbook/Data-Mining-Techniques
3. https://paginas.fe.up.pt/~ec/files_0506/slides/04_AssociationRules.pdf
3. Text Mining: Text mining and text analytics are broad umbrella terms describing a variety of technologies for analyzing and processing semi-structured and unstructured text data. The unifying theme behind these technologies is the need to "turn text into numbers" so that powerful algorithms can be applied to large document databases. Converting text into a structured, numerical format and applying analytical algorithms require knowing how to both use and combine techniq ...
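The "turn text into numbers" step commonly means vectorization such as TF-IDF; a minimal scikit-learn example (documents invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "poor data quality raises operating costs",
    "data mining finds patterns in large databases",
    "text mining turns unstructured text into numbers",
]

# Each document becomes a weighted term vector: rare, distinctive words get
# higher weights than words that appear across the whole collection.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)       # sparse matrix, shape (3, n_terms)
print(X.shape)
print(vectorizer.get_feature_names_out()[:5])
```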
Developing A Universal Approach to Cleansing Customer and Product Data - FindWhitePapers
Take a look at this review of current industry problems concerning data quality, and learn more about how companies are addressing quality problems with customer, product, and other types of corporate data. Read about products and use cases from SAP to see how vendors are supporting data cleansing.
The presentation includes an introduction to the topic, the various dimensions of big data, its evolution from big data 1.0 to big data 3.0, and its impact on various industries, its uses, and the challenges it faces. The concluding slide gives a brief view of the future of big data.
The document outlines a new data project plan for DERAK, an analytical support team for a young credit card company. It describes the company background and goals of consolidating various data sources. A SWOT analysis is presented identifying strengths, weaknesses, opportunities and threats. Key issues of understanding customers, credit risk, and fraud are discussed. Plans are outlined for acquiring, storing, maintaining, and accessing data from various sources like customer databases, social media, and mobile phones. Issues around data integration, storage, governance and access are addressed.
How telecommunication companies can leverage the power of Hadoop and Big Data to derive use cases.
Based on Cloudera Whitepaper - Big Data Use Cases for Telcos
Technology & Innovation - User Experience in Business Database Systems - Cris Ong
Formulate a business solution for a potential customer or supplier that is currently facing problems with a manual or inadequate system. Critically discuss the issues raised by your study with regard to a concept or theory related to the design of information systems.
Aziksa hadoop for buisness users2 santosh jha - Data Con LA
This document discusses big data, including its drivers, characteristics, use cases across different industries, and lessons learned. It provides examples of companies like Etsy, Macy's, Canadian Pacific, and Salesforce that are using big data to gain insights, increase revenues, reduce costs and improve customer experiences. Big data is being used across industries like financial services, healthcare, manufacturing, and media/entertainment for applications such as customer profiling, fraud detection, operations optimization, and dynamic pricing. While big data projects show strong financial benefits, the document cautions that not all projects are well-structured and Hadoop alone is not sufficient to meet all business analysis needs.
Welcome to the big data use case course. In this course we will talk about what big data is and who is using it, and at the end we will share the lessons learned from the early adopters. Big Data is an umbrella term for the technology behind collecting and analyzing large volumes of data at high speed. In the last few years, the number of devices and services customers use has increased manyfold. As customers use more of everything, they create more data. By interconnecting these data, you can know your customers better and provide a better service; Big Data helps you store and connect these data.
This document discusses data warehousing and data mining. It defines data warehousing as the process of centralizing data from different sources for analysis. Data mining is described as the process of analyzing data to uncover hidden patterns and relationships. The document provides examples of how data mining and data warehousing can be used together, with data warehousing collecting and organizing data that is then analyzed using data mining techniques to generate useful insights. Applications of data mining and data warehousing discussed include medicine, finance, marketing, and scientific discovery.
Data mining allows companies to analyze large amounts of customer data to discover patterns and trends that can help target new customers and increase profits. It involves extracting, transforming, and storing transaction data, then analyzing it to find useful business insights. Popular data mining algorithms include statistical analysis, neural networks, and nearest neighbor methods. While data mining provides benefits, privacy is a concern as customer information may be shared with third parties without consent.
Mohanbir Sawhney, Robert R. McCormick Tribune Foundation Clinical Professor of Technology, Kellogg School of Management, Northwestern University, presents at the 2012 Big Analytics Roadshow.
Companies are drinking from a fire hydrant of data that is too big, moving too fast and is too diverse to be analyzed by conventional database systems. Big Data is like a giant gold mine with large quantities of ore that is difficult to extract. To get value out of Big Data, enterprises need a new mindset and a new set of tools. They also need to know how to extract actionable insights from Big Data that can lead to competitive advantage. The Big Story of Big Data is not what Big Data is, but what it means for business value and competitive advantage.... read more: http://www.biganalytics2012.com/sessions.html#mohan_sawhney
Data Mining Presentation for College Harsh.pptx - hp41112004
This document provides an overview of data mining. It defines data mining as the process of exploring large amounts of data to identify patterns and extract useful information. The document then describes the typical data mining process of data gathering, preparation, analysis and interpretation. It also outlines several common data mining techniques like association rules, classification, clustering, decision trees and neural networks. Finally, the document discusses applications of data mining in industries like retail, financial services, manufacturing and healthcare.
How Can You Calculate the Cost of Your Data? - DATAVERSITY
Today, self-service, Cloud and big data technologies make new data preparation capabilities necessary…and possible. But, we've all been through the hype cycle and know the trough of disillusionment can come on hard and fast.
Organizations have been trying to solve the data quality problem and democratize insights for years spending millions of dollars and dedicating an increasing amount of resources to manage and govern the data. The result? Everyone is still looking to solve the problem.
Data preparation offers a new paradigm, but how can you avoid another round of minimal business impact? We’ll review a true data ROI model that helps organizations understand the value of existing versus modern data management architectures.
Top Data Mining Techniques and Their Applications - PromptCloud
In this presentation we have covered why data mining is important and various techniques used for data mining. Apart from that, examples of applications have been given for each technique. This presentation also explains how an enterprise can source web data via crawling services to bolster data mining models.
To Become a Data-Driven Enterprise, Data Democratization is Essential - Cognizant
The document discusses how data democratization through an insights marketplace is essential for organizations to become truly data-driven. It defines data democratization as making data accessible across business lines through self-service analytics and predictive platforms. An insights marketplace allows internal users and partners to search, access, and subscribe to shared data assets like reports, models, and raw data. This facilitates collaboration, reduces duplication of efforts, and can help organizations monetize their data internally through improved products and efficiency or externally through partnerships. Examples of Transport for London and educational institutions successfully applying these approaches are provided.
Overview of major factors in big data, analytics and data science. Illustrates the growing changes from data capture and the way it is changing business beyond technology industries.
Driving Business Innovation: Latest Generative AI Advancements & Success Story - Safe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Infrastructure Challenges in Scaling RAG with Custom AI models - Zilliz
Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance, response synthesis, and evaluation. We’ll discuss how to leverage open-source models like text embeddings, language models, and custom fine-tuned models to enhance RAG performance. Additionally, we’ll cover how BentoML can help orchestrate and scale these AI components efficiently, ensuring seamless deployment and management of RAG systems in the cloud.
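The retrieval core of a RAG system can be sketched without any particular vector database or serving framework; the embedding function below is a deterministic stand-in for a real text-embedding model, so the similarities are not meaningful, only the mechanics:

```python
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    """Deterministic stand-in for a real text-embedding model."""
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "little")
    v = np.random.default_rng(seed).normal(size=64)
    return v / np.linalg.norm(v)

corpus = [
    "vector databases store and index embeddings",
    "RAG augments prompts with retrieved passages",
    "model serving frameworks scale inference endpoints",
]
index = np.stack([embed(d) for d in corpus])  # (3, 64) matrix of unit vectors

def retrieve(query: str, k: int = 2) -> list:
    """Core RAG retrieval step: cosine scores, then top-k passages."""
    scores = index @ embed(query)
    return [corpus[i] for i in np.argsort(-scores)[:k]]

# Response synthesis would pass the retrieved context to a language model.
context = retrieve("how does retrieval improve generation?")
prompt = "Answer using this context:\n" + "\n".join(context) + "\nQ: ..."
print(prompt)
```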
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Best 20 SEO Techniques To Improve Website Visibility In SERP - Pixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... - Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 - Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... - SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
What do a Lego brick and the XZ backdoor have in common? - Speck&Tech
ABSTRACT: At first glance, a Lego brick and the XZ backdoor might seem to have in common only the fact that both are building blocks, or dependencies, of creative and software projects. In reality, a Lego brick and the XZ backdoor case share much more than that.
Join the presentation to dive into a story of interoperability, standards, and open formats, and then discuss the important role that contributors play in a sustainable open source community.
BIO: An advocate of free software and of standard, open formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she has been involved in several events, migrations, and training activities related to LibreOffice. She previously worked on LibreOffice migrations and training courses for several public administrations and private companies. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager, and when she is not following her passion for computers and for Geeko, she cultivates her curiosity about astronomy (hence her nickname, deneb_alpha).
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD with in UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceIndexBug
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
20240609 QFM020 Irresponsible AI Reading List May 2024
Data Enrichment using Web APIs
Karthik Gomadam, Peter Z. Yeh, Kunal Verma
Accenture Technology Labs
50 W. San Fernando St, San Jose, CA
{karthik.gomadam, peter.z.yeh, k.verma}@accenture.com
Abstract
As businesses seek to monetize their data, they are leveraging Web-based delivery mechanisms to provide publicly available data sources. Also, as analytics becomes a central part of many business functions such as customer segmentation, competitive intelligence, and fraud detection, many businesses are seeking to enrich their internal data records with data from these data sources. As the number of sources with varying degrees of accuracy and quality proliferates, it is a non-trivial task to effectively select which sources to use for a particular enrichment task. The old model of statically buying data from one or two providers becomes inefficient because of the rapid growth of new forms of useful data, such as social media, and the lack of dynamism to plug sources in and out. In this paper, we present the data enrichment framework, a tool that uses data mining and other semantic techniques to automatically guide the selection of sources. The enrichment framework also monitors the quality of the data sources and automatically penalizes sources that continue to return low quality results.

Introduction
As enterprises become more data and analytics driven, many businesses are seeking to enrich their internal data records with data from data sources available on the Web. Consider the example where a company's consumer database might have the name and address of its consumers. Being able to use publicly available data sources such as LinkedIn, White Pages, and Facebook to find information such as employment details and interests can help the company collect features for tasks such as customer segmentation. Currently, this is done in a static fashion where a business buys data from one or two sources and statically integrates them with its internal data. However, this approach has the following shortcomings:
1. Inconsistencies in input data: It is not uncommon to have a different set of missing attributes across records in the input data. For example, for some consumers the street address might be missing, and for others information about the city might be missing. To address this issue, one must be able to select the data sources at a granular level, based on the input data that is available and the data sources that can be used.
2. Quality of a data service may vary, depending on the input data: Calling a data source with some missing values (even though they may not be mandatory) can result in poor quality results. In such a scenario, one must be able to find the missing values before calling the source, or use an alternative.
In this paper, we present an overview of a data enrichment framework which attempts to automate many sub-tasks of data enrichment. The framework uses a combination of data mining and semantic technologies to automate various tasks such as calculating which attributes are more important than others for source selection, selecting sources based on the information available about a data record and the past performance of the sources, using multiple sources to reinforce low confidence values, monitoring the quality of sources, and adapting the choice of sources based on past performance. The data enrichment framework makes the following contributions:
1. Granular, context dependent ordering of data sources: We have developed a novel approach to the order in which data sources are called, based on the data that is available at that point.
2. Automated assessment of data source quality: Our data enrichment algorithm measures the quality of the output of a data source and its overall utility to the enrichment process.
3. Dynamic selection adaptation of data sources: Using the data availability and utility scores from prior invocations, the enrichment framework penalizes or rewards data sources, which affects how often they are selected.
The rest of the paper is organized as follows: we first motivate the need for data enrichment, then present the enrichment algorithm and the system architecture, and close with our experiences and a discussion of related work.
Copyright © 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Motivation
We motivate the need for data enrichment through three real-world examples gathered from large Fortune 500 companies that are clients of Accenture.
Creating comprehensive customer models: Creating comprehensive customer models has become a holy grail in the consumer business, especially in verticals like retail and healthcare. The information that is collected impacts key decisions like inventory management, promotions and rewards, and delivering a more personal experience.
1. Personalized and targeted promotions: The more information a business has about its customers, the better it can personalize deals and promotions. Further, a business could also use aggregated information (such as the interests of people living in a neighborhood) to manage inventory and run local deals and promotions.
2. Better segmentation and analytics: The provider may need more information about their customers, beyond what they have in the profile, for better segmentation and analytics. For example, a certain e-commerce site may know a person's browsing and buying history on its site and have some limited information about the person, such as their address and credit card information. However, understanding the person's professional activities and hobbies may yield more features for customer segmentation that the site can use for suggestions or promotions.
3. Fraud detection: The provider may need more information about their customers for detecting fraud. Providers typically create detailed customer profiles to predict their behaviors and detect anomalies. Having demographic and other attributes such as interests and hobbies helps build more accurate customer behavior profiles. Most e-commerce providers are under a lot of pressure to detect fraudulent activities on their site as early as possible, so that they can limit their exposure to lawsuits, compliance penalties, or even loss of reputation.
Often businesses engage customers by asking them to register for programs like reward cards, or by connecting with them over social media. This limited initial engagement gives them access to basic information about a customer, such as name, email, address, and social media handles. However, in a vast majority of cases such information is incomplete, and the gaps are not uniform. For example, for a customer John Doe, a business might have the name, street address, and a phone number, whereas for Jane Doe the available information might be name, email, and a Twitter handle. Leveraging the basic information and completing the gaps, also called creating a 360 degree customer view, is a significant challenge. Current approaches to addressing this challenge largely revolve around subscribing to data sources like Experian. This approach has the following shortcomings:
1. The enrichment task is restricted to the attributes provided by the one or two data sources that the business buys from. If it needs other attributes about the customers, they are hard to get.
2. The selected data sources may have high quality information about some attributes, but poor quality about others. Even if the e-commerce provider knows about other sources that have those attributes, it is hard to manually integrate more sources.
3. There is no good way to monitor whether there is any degradation in the quality of the data sources.
Using the enrichment framework in this context would allow the e-commerce provider to dynamically select the best set of sources for a particular attribute in a particular data enrichment task. The proposed framework can switch sources across customer records if the most preferred source does not have information about some attributes for a particular record. For low confidence values, the proposed system uses reconciliation across sources to increase the confidence in the value. The framework also continuously monitors and downgrades sources if there is any loss of quality.
Capital Equipment Maintenance: Companies within the energy and resources industry have significant investments in capital equipment (e.g. drills, oil pumps, etc.). Accurate data about this equipment (e.g. manufacturer, model, etc.) is paramount to operational efficiency and proper maintenance.
The current process for capturing this data begins with manual entry, followed by manual, periodic “walk-downs” to confirm and validate the information. However, this process is error-prone, and often results in incomplete and inaccurate data about the equipment.
This does not have to be the case. A wealth of structured data sources (e.g. from manufacturers) exists that provides much of the incomplete, missing information. Hence, a solution that can automatically leverage these sources to enrich existing, internal capital equipment data can significantly improve the quality of the data, which in turn can improve operational efficiency and enable proper maintenance.
Competitive Intelligence: The explosive growth of external data (i.e. data outside the business, such as Web data, data providers, etc.) can enable businesses to gather rich intelligence about their competitors. For example, companies in the energy and resources industry are very interested in competitive insights such as where a competitor is drilling (or planning to drill); disruptions to drilling due to accidents, weather, etc.; and more.
To gather these insights, companies currently purchase relevant data from third party sources – IHS and Dodson are just two examples of third party data sources that aggregate and sell drilling data – to manually enrich existing internal data and generate a comprehensive view of the competitive environment. However, this process is manual, which makes it difficult to scale beyond a small handful of data sources. Many useful data sources that are open (public access), e.g. sources that provide weather data based on GIS information, are omitted, resulting in gaps in the intelligence gathered.
A solution that can automatically perform this enrichment across a broad range of data sources can provide more in-depth, comprehensive competitive insight.
Overview of Data Enrichment Algorithm
Our Data Enrichment Framework (DEF) takes two inputs – 1) an instance of a data object to be enriched and 2) a set of data sources to use for the enrichment – and outputs an enriched version of the input instance.
DEF enriches the input instance through the following steps. DEF first assesses the importance of each attribute in the input instance. This information is then used by DEF to guide the selection of appropriate data sources to use. Finally, DEF determines the utility of the sources used, so it can adapt its usage of these sources (either in a favorable or unfavorable manner) going forward.

Preliminaries
A data object D is a collection of attributes describing a real-world object of interest. We formally define D as {a_1, a_2, ..., a_n}, where a_i is an attribute.
An instance d of a data object D is a partial instantiation of D – i.e. some attributes a_i may not have an instantiated value. We formally define d as having two elements, d_k and d_u. d_k consists of attributes whose values are known (i.e. instantiated), which we define formally as d_k = {⟨a, v(a), k_a, k_{v(a)}⟩, ...}, where v(a) is the value of attribute a, k_a is the importance of a to the data object D that d is an instance of (ranging from 0.0 to 1.0), and k_{v(a)} is the confidence in the correctness of v(a) (ranging from 0.0 to 1.0). d_u consists of attributes whose values are unknown and hence are the targets for enrichment. We define d_u formally as d_u = {⟨a, k_a⟩, ...}.
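To make these definitions concrete, the minimal sketch below models a data object instance with its known and unknown attribute sets. It is our illustrative reading of the formalism, not code from the framework; all class and field names are ours.

```python
from dataclasses import dataclass, field

@dataclass
class KnownAttribute:
    name: str          # attribute a
    value: str         # v(a)
    importance: float  # k_a, in [0.0, 1.0]
    confidence: float  # k_v(a), in [0.0, 1.0]

@dataclass
class UnknownAttribute:
    name: str          # attribute a, value not yet instantiated
    importance: float  # k_a, in [0.0, 1.0]

@dataclass
class Instance:
    """An instance d of a data object D: d = (d_k, d_u)."""
    known: list[KnownAttribute] = field(default_factory=list)      # d_k
    unknown: list[UnknownAttribute] = field(default_factory=list)  # d_u

# Example: a Customer instance with an unknown Occupation
# (the importance values here are illustrative).
d = Instance(
    known=[KnownAttribute("Name", "John Smith", 0.9, 1.0),
           KnownAttribute("City", "San Jose", 0.3, 1.0)],
    unknown=[UnknownAttribute("Occupation", 0.7)],
)
```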
Attribute Importance Assessment
Given an instance d of a data object, DEF first assesses (and sets) the importance k_a of each attribute a to the data object that d is an instance of. DEF uses the importance to guide the subsequent selection of appropriate data sources for enrichment (see next subsection).
Our definition of importance is based on the intuition that an attribute a has high importance to a data object D if its values are highly unique across all instances of D. For example, the attribute e-mail contact should have high importance to the Customer data object because it satisfies this intuition. However, the attribute Zip should have lower importance to the Customer object because it does not – i.e. many instances of the Customer object have the same zipcode.
DEF captures the above intuition formally with the following equation:

k_a = \frac{X^2}{1 + X^2}   (1)

where

X = H_{N(D)}(a) \cdot \frac{U(a, D)}{|N(D)|}   (2)

and

H_{N(D)}(a) = -\sum_{v \in a} P_v \log P_v   (3)

U(a, D) is the number of unique values of a across all instances of the data object D observed by DEF so far, and N(D) is the set of all instances of D observed by DEF so far. H_{N(D)}(a) is the entropy of the values of a across N(D), and serves as a proxy for the distribution of the values of a.
We note that DEF recomputes k_a as new instances of the data object containing a are observed. Hence, the importance of an attribute to a data object will change over time.
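A minimal sketch of equations (1)–(3) in Python, assuming the observed instances are given as a list of attribute values (one per observed instance of D); the function name is ours.

```python
import math
from collections import Counter

def attribute_importance(values: list[str]) -> float:
    """k_a per equations (1)-(3): entropy of the observed values,
    scaled by the ratio of unique values to observed instances,
    then squashed into [0, 1)."""
    n = len(values)  # |N(D)|
    if n == 0:
        return 0.0
    counts = Counter(values)
    # Equation (3): entropy H over the observed value distribution.
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    # Equation (2): X = H * U(a, D) / |N(D)|.
    x = entropy * len(counts) / n
    # Equation (1).
    return x**2 / (1 + x**2)

# Emails are nearly unique per customer -> high importance;
# zip codes repeat heavily -> much lower importance.
emails = [f"user{i}@example.com" for i in range(100)]
zips = ["95113"] * 60 + ["95112"] * 40
print(attribute_importance(emails))  # close to 1
print(attribute_importance(zips))    # close to 0
```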
Data Source Selection
DEF selects data sources to enrich the attributes of a data object instance d whose values are unknown. DEF will repeat this step until either there are no attributes in d whose values are unknown or there are no more sources to select.
DEF considers two important factors when selecting the next best source to use: 1) whether the source will be able to provide values if called, and 2) whether the source targets unknown attributes in d_u (esp. attributes with high importance). DEF satisfies the first factor by measuring how well the known values of d match the inputs required by the source. If there is a good match, then the source is more likely to return values when it is called. DEF also considers the number of times a source was called previously (while enriching d) to prevent “starvation” of other sources.
DEF satisfies the second factor by measuring how many high-importance, unknown attributes the source claims to provide. If a source claims to provide a large number of these attributes, then DEF should select the source over others. This second factor serves as the selection bias.
DEF formally captures these two considerations with the following equation:

F_s = \frac{1}{2^{M-1}} B_s + \frac{\sum_{a \in d_k \cap I_s} k_{v(a)}}{|I_s|} \cdot \frac{\sum_{a \in d_u \cap O_s} k_a}{|d_u|}   (4)

where B_s is the base fitness score of a data source s being considered (this value is randomly set between 0.5 and 0.75 when DEF is initialized), I_s is the set of input attributes to the data source, O_s is the set of output attributes from the data source, and M is the number of times the data source has been selected in the context of enriching the current data object instance.
The data source with the highest score F_s that also exceeds a predefined minimum threshold R is selected as the next source to use for enrichment.
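The sketch below shows one way to read equation (4) in Python. The dictionary layout for the record and the value m = 1 on a source's first consideration are our assumptions.

```python
def fitness(inputs, outputs, d_known, d_unknown, base_score, m):
    """F_s per equation (4). d_known maps attr -> (value, confidence);
    d_unknown maps attr -> importance; m is the number of times this
    source has been selected for the current instance (assumed m = 1
    on first consideration, so the base score is undecayed)."""
    # Factor 1: how well known, high-confidence values cover the source's inputs.
    input_match = sum(c for a, (_, c) in d_known.items() if a in inputs) / len(inputs)
    # Factor 2 (selection bias): importance-weighted coverage of unknown attributes.
    output_bias = sum(k for a, k in d_unknown.items() if a in outputs) / len(d_unknown)
    return base_score / 2 ** (m - 1) + input_match * output_bias

# Illustrative: a source that takes Name + City and claims Occupation.
d_known = {"Name": ("John Smith", 1.0), "City": ("San Jose", 0.9)}
d_unknown = {"Occupation": 0.7}
score = fitness({"Name", "City"}, {"Occupation"}, d_known, d_unknown, 0.6, 1)
# The source is used only if score exceeds the threshold R.
```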
For each unknown attribute a′ enriched by the selected data source, DEF moves it from d_u to d_k, and computes the confidence k_{v(a′)} in the value provided for a′ by the selected source. This confidence is used in subsequent iterations of the enrichment process, and is computed using the following formula:

k_{v(a′)} = \begin{cases} W \cdot e^{\frac{1}{|V_{a′}|} - 1} & \text{if } k_{v(a′)} = \text{Null} \\ e^{\lambda (k_{v(a′)} - 1)} & \text{if } k_{v(a′)} \neq \text{Null} \end{cases}   (5)

where

W = \frac{\sum_{a \in d_k \cap I_s} k_{v(a)}}{|I_s|}   (6)

W is the confidence over all input attributes to the source, and V_{a′} is the set of output values returned by a data source for an unknown attribute a′.
This formula captures two important factors. First, if multiple values are returned, then there is ambiguity and hence the confidence in the output should be discounted. Second, if an output value is corroborated by output values given by previously selected data sources, then the confidence should be further increased. The λ factor is the corroboration factor (≤ 1.0) and defaults to 1.0.
In addition to selecting appropriate data sources to use, DEF must also resolve ambiguities that occur during the enrichment process. For example, given the following instance of the Customer data object:
(Name: John Smith, City: San Jose, Occupation: NULL)
a data source may return multiple values for the unknown attribute Occupation (e.g. Programmer, Artist, etc.). To resolve this ambiguity, DEF will branch the original instance – one branch for each returned value – and each branched instance will be subsequently enriched using the same steps above. Hence, a single data object instance may result in multiple instances at the end of the enrichment process.
DEF will repeat the above process until either d_u is empty or there are no sources whose score F_s exceeds R. Once this process terminates, DEF computes the fitness for each resulting instance using the following equation:

\frac{\sum_{a \in d_k \cap d_U} k_{v(a)} \, k_a}{|d_k \cup d_u|}   (7)

and returns the top K instances.
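A sketch of the branching step and the instance-fitness score of equation (7); the dictionary representation of instances is our assumption.

```python
def branch(instance, attr, values):
    """One branch per returned value; each branch is then
    enriched independently through the same selection loop."""
    return [{**instance, attr: v} for v in values]

def instance_fitness(enriched, still_unknown):
    """Equation (7): confidence-weighted importance summed over the
    enriched attributes, normalized by the total attribute count.
    enriched maps attr -> (importance, confidence); still_unknown
    maps attr -> importance."""
    total = len(enriched) + len(still_unknown)
    return sum(imp * conf for imp, conf in enriched.values()) / total

branches = branch({"Name": "John Smith", "City": "San Jose"},
                  "Occupation", ["Programmer", "Artist"])
# -> two candidate instances; each is enriched further, scored with
#    instance_fitness, and only the top K are returned.
```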
Data Source Utility Adaptation
Once a data source has been called, DEF determines the utility of the source in enriching the data object instance of interest. Intuitively, DEF models the utility of a data source as a “contract” – i.e. if DEF provides a data source with high confidence input values, then it is reasonable to expect the data source to provide values for all the output attributes that it claims to target. Moreover, these values should not be generic and should have low ambiguity. If these expectations are violated, then the data source should be penalized heavily.
On the other hand, if DEF did not provide a data source with good inputs, then the source should be penalized minimally (if at all) if it fails to provide any useful outputs. Alternatively, if a data source is able to provide unambiguous values for unknown attributes in the data object instance (esp. high importance attributes), then DEF should reward the source and give it more preference going forward.
DEF captures this notion formally with the following equation:

U_s = \frac{W}{|O_s|} \left( \sum_{a \in O_s^+} e^{\frac{1}{|V_a|} - 1} \, P_T(v(a)) \, k_a \; - \sum_{a \in O_s^-} k_a \right)   (8)

where

P_T(v(a)) = \begin{cases} P_T(v(a)) & \text{if } |V_a| = 1 \\ \min_{v(a) \in V_a} P_T(v(a)) & \text{if } |V_a| > 1 \end{cases}   (9)

O_s^+ are the output attributes from a data source for which values were returned, O_s^- are the output attributes from the same source for which values were not returned, and P_T(v(a)) is the relative frequency of the value v(a) over the past T values returned by the data source for the attribute a. W is the confidence over all input attributes to the source, and is defined in the previous subsection.
The utility U_s of a data source over the past n calls is then used to adjust the base fitness score of the data source. This adjustment is captured with the following equation:

B_s = B_s + \gamma \cdot \frac{1}{n} \sum_{i=1}^{n} U_s(T - i)   (10)

where B_s is the base fitness score of a data source s, U_s(T − i) is the utility of the data source i time steps back, and γ is the adjustment rate.
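The utility computation of equations (8)–(10), sketched under the same dictionary conventions as above; the bookkeeping of past utilities and the γ value are our assumptions.

```python
import math

def source_utility(W, provided, missing):
    """U_s per equation (8). provided maps attr -> (importance,
    n_values, rel_freq) for output attributes that returned values,
    where rel_freq is P_T(v(a)) -- per equation (9), the least
    frequent value's relative frequency when several values came
    back. missing maps attr -> importance for claimed outputs that
    returned nothing."""
    n_outputs = len(provided) + len(missing)
    reward = sum(math.exp(1 / n_vals - 1) * freq * imp
                 for imp, n_vals, freq in provided.values())
    penalty = sum(missing.values())
    return W * (reward - penalty) / n_outputs

def adjust_base_score(base, past_utilities, gamma=0.1):
    """Equation (10): shift the base fitness score by the mean
    utility over the past n calls, scaled by the adjustment rate."""
    return base + gamma * sum(past_utilities) / len(past_utilities)

# A source that returned an ambiguous value and skipped one claimed
# output ends up with negative utility, dragging its base score down.
u = source_utility(0.9, {"Occupation": (0.7, 2, 0.4)}, {"Phone": 0.5})
print(adjust_base_score(0.6, [u, u]))
```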
System Architecture
[Figure 1: Design overview of the enrichment framework]
The main components of DEF are illustrated in Figure 1. The task manager starts a new enrichment project by instantiating and executing the enrichment engine. The enrichment engine uses the attribute computation module to calculate the attribute relevance. The relevance scores are used in source selection. Using the HTTP helper module, the engine then invokes the connector for the selected data source. A connector is a proxy that communicates with the actual data source and is a RESTful Web service in itself. The enrichment framework requires every data source to have a connector, and each connector must have two operations: 1) a “return schema” GET operation that returns the input and output schema, and 2) a “get data” POST operation that takes the input for the data source as POST parameters and returns the response as JSON. For internal databases, we have special connectors that wrap queries as RESTful end points. Once a response is obtained from the connector, the enrichment engine computes the output value confidence, applies the necessary mapping rules, and integrates the response with the existing data object. In addition to this, the source degradation factor is also computed. The mapping, invocation, confidence, and source degradation computation steps are repeated until either all values for all attributes are computed or all sources have been invoked. The result is then written into the enrichment database.
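As an illustration of the connector contract just described (a schema-returning GET and a data-returning POST), here is a minimal Flask sketch. The route names, schema fields, and backing lookup are placeholders of ours, not part of the framework.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.get("/schema")
def return_schema():
    # "return schema" operation: advertise input and output attributes.
    return jsonify({"inputs": ["Name", "City"],
                    "outputs": ["Phone", "Occupation"]})

@app.post("/data")
def get_data():
    # "get data" operation: input attributes arrive as POST parameters.
    name = request.form.get("Name")
    city = request.form.get("City")
    return jsonify(lookup_backing_source(name, city))

def lookup_backing_source(name, city):
    # Placeholder for the proxied call to the actual data source;
    # multiple returned values signal ambiguity to the engine.
    return {"Phone": ["555-0100"], "Occupation": ["Programmer", "Artist"]}
```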
In designing the enrichment framework, we have adopted a service oriented approach, with the goal of exposing the enrichment framework as a “platform as a service”. The core tasks in the framework are exposed as RESTful end points. These include end points for creating data objects, importing datasets, adding data sources, and starting an enrichment task. When the “start enrichment task” resource is invoked and a task is successfully started, the framework responds with a JSON that contains the enrichment task identifier. This identifier can then be used to GET the enriched data from the database. The framework supports both batch GET and streaming GET using the comet (?) pattern.
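A sketch of how a client might drive these end points using the requests library; the URL paths and the JSON field name are illustrative guesses of ours, since the exact resource names are not given.

```python
import requests

BASE = "http://def.example.com/api"  # placeholder host

# Start an enrichment task; the framework responds with a task identifier.
resp = requests.post(f"{BASE}/enrichment-tasks", json={"dataset": "customers"})
task_id = resp.json()["taskId"]  # field name is illustrative

# Batch GET of the enriched records once the task has run.
enriched = requests.get(f"{BASE}/enrichment-tasks/{task_id}/results").json()
```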
Data mapping is one of the key challenges in any data integration system. While extensive research literature exists on automated and semi-automated approaches to mapping (?; ?; ?; ?; ?), it is our observation that these techniques do not guarantee the high level of accuracy required in enterprise solutions. So, we currently adopt a manual approach, aided by a graphical interface for data mapping. The source and the target schemas are shown to the user as two trees, one to the left and one to the right. Users can select attributes from the source schema and draw a line between them and attributes of the target schema. Currently, our mapping system supports assignment, merge, split, numerical operations, and unit conversions. When the user saves the mappings, the maps are stored as mapping rules. Each mapping rule is represented as a tuple containing the source attributes, target attributes, mapping operations, and conditions. Conditions include merge and split delimiters and conversion factors.
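The mapping-rule tuple described above might be applied as in the sketch below; the rule encoding and the apply_rule helper are ours.

```python
def apply_rule(record, rule):
    """Apply one mapping rule: assignment, merge, or split, with
    conditions carrying delimiters (conversion factors would be
    handled analogously for numerical operations)."""
    op = rule["op"]
    if op == "assign":
        record[rule["target"][0]] = record[rule["source"][0]]
    elif op == "merge":  # e.g. first name + last name -> full name
        delim = rule["conditions"].get("delimiter", " ")
        record[rule["target"][0]] = delim.join(record[a] for a in rule["source"])
    elif op == "split":  # e.g. "City, State" -> two attributes
        parts = record[rule["source"][0]].split(rule["conditions"]["delimiter"])
        for target, part in zip(rule["target"], parts):
            record[target] = part.strip()
    return record

rule = {"source": ["FirstName", "LastName"], "target": ["Name"],
        "op": "merge", "conditions": {"delimiter": " "}}
print(apply_rule({"FirstName": "Jane", "LastName": "Doe"}, rule))
```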
Challenges and Experiences
• Data services and APIs are an integral part of the system.
• They have been around for a while, but have gained minimal traction in the enterprise.
• API rate limiting, and the lack of SLA-driven API contracts.
• Poor API documentation, often leading the developer in circles; minimalism in APIs might not be a good idea.
• Changes are not “pushed” to developers, who often find out only when an application breaks.
• Inconsistent auth mechanisms across different APIs; mandatory auth makes it expensive (even when pulling information that is open on the Web).
• It is easy to develop prototypes, but the APIs do not instill enough confidence to develop a deployable client solution.

Related Work
Web APIs and data services have helped establish the Web as a platform. Web APIs have enabled the deployment and use of services on the Web using standardized communication protocols and message formats. Leveraging Web APIs, data services have allowed access, in a standardized manner, to vast amounts of data that were hitherto hidden in proprietary silos. A notable outcome is the development of mashups, or Web application hybrids. A mashup is created by integrating data from various services on the Web using their APIs. Although early mashups were consumer centric applications, their adoption within the enterprise has been increasing, especially in addressing data intensive problems such as data filtering, data transformation, and data enrichment.