Data Enrichment using Web APIs


Karthik Gomadam, Peter Z. Yeh, Kunal Verma
Accenture Technology Labs
50 W. San Fernando St, San Jose, CA
{karthik.gomadam, peter.z.yeh, k.verma}@accenture.com

Copyright © 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

As businesses seek to monetize their data, they are leveraging Web-based delivery mechanisms to provide publicly available data sources. Also, as analytics becomes a central part of many business functions such as customer segmentation, competitive intelligence, and fraud detection, many businesses are seeking to enrich their internal data records with data from these data sources. As the number of sources with varying degrees of accuracy and quality proliferates, it is a non-trivial task to effectively select which sources to use for a particular enrichment task. The old model of statically buying data from one or two providers becomes inefficient because of the rapid growth of new forms of useful data, such as social media, and the lack of dynamism to plug sources in and out. In this paper, we present the data enrichment framework, a tool that uses data mining and other semantic techniques to automatically guide the selection of sources. The enrichment framework also monitors the quality of the data sources and automatically penalizes sources that continue to return low quality results.

Introduction

As enterprises become more data and analytics driven, many businesses are seeking to enrich their internal data records with data from data sources available on the Web. Consider the example where a company's consumer database might have the name and address of its consumers. Being able to use publicly available data sources such as LinkedIn, White Pages, and Facebook to find information such as employment details and interests can help the company collect features for tasks such as customer segmentation. Currently, this is done in a static fashion where a business buys data from one or two sources and statically integrates it with their internal data. However, this approach has the following shortcomings:

1. Inconsistencies in input data: It is not uncommon to have different sets of missing attributes across records in the input data. For example, for some consumers the street address might be missing, and for others information about the city might be missing. To address this issue, one must be able to select the data sources at a granular level, based on the input data that is available and the data sources that can be used.

2. Quality of a data service may vary depending on the input data: Calling a data source with some missing values (even though they may not be mandatory) can result in poor quality results. In such a scenario, one must be able to find the missing values before calling the source, or use an alternative.

In this paper, we present an overview of a data enrichment framework which attempts to automate many sub-tasks of data enrichment. The framework uses a combination of data mining and semantic technologies to automate tasks such as calculating which attributes are more important than others for source selection, selecting sources based on the information available about a data record and the past performance of the sources, using multiple sources to reinforce low confidence values, monitoring the quality of sources, and adapting the use of sources based on past performance. The data enrichment framework makes the following contributions:

1. Granular, context-dependent ordering of data sources: We have developed a novel approach to the order in which data sources are called, based on the data that is available at that point.

2. Automated assessment of data source quality: Our data enrichment algorithm measures the quality of the output of a data source and its overall utility to the enrichment process.

3. Dynamic selection and adaptation of data sources: Using the data availability and utility scores from prior invocations, the enrichment framework penalizes or rewards data sources, which affects how often they are selected.

The rest of the paper is organized as follows: we first motivate the need for data enrichment, then present the enrichment algorithm and the system architecture, discuss the challenges and experiences from building the framework, and close with related work.
Motivation

We motivate the need for data enrichment through three real-world examples gathered from large Fortune 500 companies that are clients of Accenture.

Creating comprehensive customer models: Creating comprehensive customer models has become a holy grail in the consumer business, especially in verticals like retail and healthcare. The information that is collected impacts key decisions like inventory management, promotions and rewards, and delivering a more personal experience.
1. Personalized and targeted promotions: The more information a business has about its customers, the better it can personalize deals and promotions. Further, a business could also use aggregated information (such as the interests of people living in a neighborhood) to manage inventory and run local deals and promotions.

2. Better segmentation and analytics: The provider may need more information about their customers, beyond what they have in the profile, for better segmentation and analytics. For example, a certain e-commerce site may know a person's browsing and buying history on their site and have some limited information about the person such as their address and credit card information. However, understanding their professional activities and hobbies may help them get more features for customer segmentation that they can use for suggestions or promotions.

3. Fraud detection: The provider may need more information about their customers for detecting fraud. Providers typically create detailed customer profiles to predict their behaviors and detect anomalies. Having demographic and other attributes such as interests and hobbies helps build more accurate customer behavior profiles. Most e-commerce providers are under a lot of pressure to detect fraudulent activities on their site as early as possible, so that they can limit their exposure to lawsuits, compliance penalties, or even loss of reputation.

Often businesses engage customers to register for programs like reward cards or connect with the customers using social media. This limited initial engagement gives them access to basic information about a customer, such as name, email, address, and social media handles. However, in a vast majority of cases, such information is incomplete and the gaps are not uniform. For example, for a customer John Doe, a business might have the name, street address, and a phone number, whereas for Jane Doe, the available information will be name, email, and a Twitter handle. Leveraging the basic information and completing the gaps, also called creating a 360-degree customer view, is a significant challenge. Current approaches to addressing this challenge largely revolve around subscribing to data sources like Experian. This approach has the following shortcomings:

1. The enrichment task is restricted to the attributes provided by the one or two data sources that they buy from. If they need some other attributes about the customers, it is hard to get them.

2. The selected data sources may have high quality information about some attributes, but poor quality about others. Even if the e-commerce provider knows about other sources which have those attributes, it is hard to manually integrate more sources.

3. There is no good way to monitor if there is any degradation in the quality of the data sources.

Using the enrichment framework in this context would allow the e-commerce provider to dynamically select the best set of sources for a particular attribute in a particular data enrichment task. The proposed framework can switch sources across customer records if the most preferred source does not have information about some attributes for a particular record. For low confidence values, the proposed system uses reconciliation across sources to increase the confidence of the value. The framework also continuously monitors and downgrades sources if there is any loss of quality.
Capital Equipment Maintenance: Companies within the energy and resources industry have significant investments in capital equipment (i.e., drills, oil pumps, etc.). Accurate data about this equipment (e.g., manufacturer, model, etc.) is paramount to operational efficiency, proper maintenance, etc.

The current process for capturing this data begins with manual entry, followed by manual, periodic "walk-downs" to confirm and validate the information. However, this process is error-prone, and often results in incomplete and inaccurate equipment data.

This does not have to be the case. A wealth of structured data sources (e.g., from manufacturers) exists that provides much of the incomplete, missing information. Hence, a solution that can automatically leverage these sources to enrich existing, internal capital equipment data can significantly improve the quality of the data, which in turn can improve operational efficiency and enable proper maintenance.

Competitive Intelligence: The explosive growth of external data (i.e., data outside the business such as Web data, data providers, etc.) can enable businesses to gather rich intelligence about their competitors. For example, companies in the energy and resources industry are very interested in competitive insights such as where a competitor is drilling (or planning to drill); disruptions to drilling due to accidents, weather, etc.; and more.

To gather these insights, companies currently purchase relevant data from third party sources (IHS and Dodson are just two examples of third party sources that aggregate and sell drilling data) to manually enrich existing internal data and generate a comprehensive view of the competitive environment. However, this process is a manual one, which makes it difficult to scale beyond a small handful of data sources. Many useful data sources that are open (public access), such as sources that provide weather data based on GIS information, are omitted, resulting in gaps in the intelligence gathered.

A solution that can automatically perform this enrichment across a broad range of data sources can provide more in-depth, comprehensive competitive insight.

Overview of Data Enrichment Algorithm

Our Data Enrichment Framework (DEF) takes two inputs, 1) an instance of a data object to be enriched and 2) a set of data sources to use for the enrichment, and outputs an enriched version of the input instance.

DEF enriches the input instance through the following steps. DEF first assesses the importance of each attribute in the input instance. This information is then used by DEF to guide the selection of appropriate data sources to use. Finally, DEF determines the utility of the sources used, so it can adapt its usage of these sources (either in a favorable or unfavorable manner) going forward.
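To make the overall flow concrete, the following is a minimal, self-contained Python sketch of the loop just described (select a source, call it, fold the results back in, adjust the source's standing). The scoring here is deliberately simplified, attribute importance is assumed to be precomputed, and all names (enrich, fetch, base_fitness) are illustrative rather than part of the framework; the actual formulas appear in the following sections.

def enrich(instance, sources, threshold=0.5, max_iters=10):
    """instance: {'known': {attr: value}, 'unknown': {attr: importance}}
    sources: objects with .outputs, .base_fitness, and .fetch(known) -> {attr: value}."""
    for _ in range(max_iters):
        if not instance["unknown"]:
            break
        # Score each source by how many important unknown attributes it targets (simplified)
        def score(s):
            return s.base_fitness + sum(instance["unknown"].get(a, 0.0) for a in s.outputs)
        best = max(sources, key=score, default=None)
        if best is None or score(best) < threshold:
            break
        returned = best.fetch(instance["known"])          # call the selected source
        for attr, value in returned.items():
            if attr in instance["unknown"]:
                instance["known"][attr] = value           # move the attribute from d_u to d_k
                del instance["unknown"][attr]
        # Reward sources that delivered values, penalize ones that did not (simplified)
        best.base_fitness += 0.1 if returned else -0.1
    return instance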
Preliminaries

A data object D is a collection of attributes describing a real-world object of interest. We formally define D as {a1, a2, ..., an}, where ai is an attribute.

An instance d of a data object D is a partial instantiation of D, i.e., some attributes ai may not have an instantiated value. We formally define d as having two elements, dk and du. dk consists of attributes whose values are known (i.e., instantiated), which we define formally as dk = {<a, v(a), k_a, k_v(a)>, ...}, where v(a) is the value of attribute a, k_a is the importance of a to the data object D that d is an instance of (ranging from 0.0 to 1.0), and k_v(a) is the confidence in the correctness of v(a) (ranging from 0.0 to 1.0). du consists of attributes whose values are unknown and hence are the targets for enrichment. We define du formally as du = {<a, k_a>, ...}.

Attribute Importance Assessment

Given an instance d of a data object, DEF first assesses (and sets) the importance k_a of each attribute a to the data object that d is an instance of. DEF uses the importance to guide the subsequent selection of appropriate data sources for enrichment (see the next subsection).

Our definition of importance is based on the intuition that an attribute a has high importance to a data object D if its values are highly unique across all instances of D. For example, the attribute e-mail contact should have high importance to the Customer data object because it satisfies this intuition. However, the attribute Zip should have lower importance to the Customer object because it does not, i.e., many instances of the Customer object have the same zipcode.

DEF captures the above intuition formally with the following equations:

    k_a = \frac{X^2}{1 + X^2}    (1)

where

    X = \frac{U(a, D)}{|N(D)|} H_{N(D)}(a)    (2)

and

    H_{N(D)}(a) = -\sum_{v \in a} P_v \log P_v    (3)

U(a, D) is the number of unique values of a across all instances of the data object D observed by DEF so far, and N(D) is the set of all instances of D observed by DEF so far. H_{N(D)}(a) is the entropy of the values of a across N(D), and serves as a proxy for the distribution of the values of a.

We note that DEF recomputes k_a as new instances of the data object containing a are observed. Hence, the importance of an attribute to a data object will change over time.
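The following Python sketch shows one way to compute this importance measure from the values observed so far for an attribute; the function name and data layout are illustrative, not part of DEF.

import math
from collections import Counter

def attribute_importance(values):
    """Importance k_a of an attribute, following Equations 1-3: the entropy of the
    attribute's values over observed instances, scaled by the ratio of unique values,
    squashed into [0, 1) via X^2 / (1 + X^2)."""
    observed = [v for v in values if v is not None]
    if not observed:
        return 0.0
    counts = Counter(observed)
    n = len(observed)
    # Equation 3: entropy H over the value distribution
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    # Equation 2: X = (unique values / observed instances) * entropy
    x = (len(counts) / n) * entropy
    # Equation 1: k_a = X^2 / (1 + X^2)
    return x ** 2 / (1 + x ** 2)

# Example: e-mail is nearly unique per customer (high importance),
# while zip code repeats across customers (low importance).
emails = ["user%d@example.com" % i for i in range(100)]
zips = ["95113"] * 60 + ["95112"] * 40
print(attribute_importance(emails))   # close to 1
print(attribute_importance(zips))     # close to 0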
Data Source Selection

DEF selects data sources to enrich attributes of a data object instance d whose values are unknown. DEF will repeat this step until either there are no attributes in d whose values are unknown or there are no more sources to select.

DEF considers two important factors when selecting the next best source to use: 1) whether the source will be able to provide values if called, and 2) whether the source targets unknown attributes in du (especially attributes with high importance). DEF satisfies the first factor by measuring how well the known values of d match the inputs required by the source. If there is a good match, then the source is more likely to return values when it is called. DEF also considers the number of times a source was called previously (while enriching d) to prevent "starvation" of other sources.

DEF satisfies the second factor by measuring how many high-importance, unknown attributes the source claims to provide. If a source claims to provide a large number of these attributes, then DEF should select the source over others. This second factor serves as the selection bias.

DEF formally captures these two considerations with the following equation:

    F_s = \frac{1}{2^{M-1}} B_s + \frac{\sum_{a \in d_k \cap I_s} k_{v(a)}}{|I_s|} + \frac{\sum_{a \in d_u \cap O_s} k_a}{|d_u|}    (4)

where B_s is the base fitness score of a data source s being considered (this value is randomly set between 0.5 and 0.75 when DEF is initialized), I_s is the set of input attributes to the data source, O_s is the set of output attributes from the data source, and M is the number of times the data source has been selected in the context of enriching the current data object instance.

The data source with the highest score F_s that also exceeds a predefined minimum threshold R is selected as the next source to use for enrichment.

For each unknown attribute a' enriched by the selected data source, DEF moves it from du to dk, and computes the confidence k_v(a') in the value provided for a' by the selected source. This confidence is used in subsequent iterations of the enrichment process, and is computed using the following formula:

    k_{v(a')} = \begin{cases} e^{\frac{1}{|V_{a'}|} - 1} \cdot W, & \text{if } k_{v(a')} = \text{Null} \\ e^{\lambda (k_{v(a')} - 1)}, & \text{if } k_{v(a')} \neq \text{Null} \end{cases}    (5)

where

    W = \frac{\sum_{a \in d_k \cap I_s} k_{v(a)}}{|I_s|}    (6)

W is the confidence over all input attributes to the source, and V_{a'} is the set of output values returned by a data source for an unknown attribute a'.

This formula captures two important factors. First, if multiple values are returned, then there is ambiguity and hence the confidence in the output should be discounted. Second, if an output value is corroborated by output values given by previously selected data sources, then the confidence should be further increased. The λ factor is the corroboration factor (< 1.0), and defaults to 1.0.
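A Python sketch of Equations 4 to 6 as reconstructed above follows; the function signatures and dictionary-based inputs are assumptions made for illustration, not DEF's actual interfaces.

import math

def fitness_score(base_fitness, times_selected, known_conf, input_attrs,
                  unknown_importance, output_attrs):
    """Fitness F_s of a candidate source (Equation 4).
    known_conf: {attr: k_v(a)} for attributes in d_k
    input_attrs: I_s, the inputs the source expects
    unknown_importance: {attr: k_a} for attributes in d_u
    output_attrs: O_s, the attributes the source claims to provide"""
    # How well the known values of d cover the inputs the source requires
    input_match = sum(known_conf.get(a, 0.0) for a in input_attrs) / max(len(input_attrs), 1)
    # How many (important) unknown attributes the source claims to provide
    coverage = sum(k for a, k in unknown_importance.items() if a in output_attrs) \
        / max(len(unknown_importance), 1)
    # The base fitness is halved with each repeated selection (M >= 1) to avoid starvation
    return base_fitness / 2 ** (times_selected - 1) + input_match + coverage

def value_confidence(num_values, input_match, prior_conf=None, corroboration=1.0):
    """Confidence k_v(a') in a value returned for an unknown attribute (Equations 5 and 6).
    num_values is |V_a'|; input_match is W; prior_conf is the existing confidence if an
    earlier source already supplied a value for a'."""
    if prior_conf is None:
        # Ambiguous (multi-valued) answers are discounted, weighted by the input confidence W
        return math.exp(1.0 / num_values - 1.0) * input_match
    # Corroborated by a previously selected source: boost the existing confidence
    return math.exp(corroboration * (prior_conf - 1.0))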
In addition to selecting appropriate data sources to use, DEF must also resolve ambiguities that occur during the enrichment process. For example, given the following instance of the Customer data object:

    (Name: John Smith, City: San Jose, Occupation: NULL)

a data source may return multiple values for the unknown attribute Occupation (e.g., Programmer, Artist, etc.). To resolve this ambiguity, DEF will branch the original instance, one branch for each returned value, and each branched instance will be subsequently enriched using the same steps above. Hence, a single data object instance may result in multiple instances at the end of the enrichment process.

DEF will repeat the above process until either du is empty or there are no sources whose score F_s exceeds R. Once this process terminates, DEF computes the fitness of each resulting instance using the following equation:

    \frac{\sum_{a \in d_k \cap d_U} k_{v(a)} \, k_a}{|d_k \cup d_u|}    (7)

and returns the top K instances.

Data Source Utility Adaptation

Once a data source has been called, DEF determines the utility of the source in enriching the data object instance of interest. Intuitively, DEF models the utility of a data source as a "contract": if DEF provides a data source with high confidence input values, then it is reasonable to expect the data source to provide values for all the output attributes that it claims to target. Moreover, these values should not be generic and should have low ambiguity. If these expectations are violated, then the data source should be penalized heavily.

On the other hand, if DEF did not provide a data source with good inputs, then the source should be penalized minimally (if at all) if it fails to provide any useful outputs. Alternatively, if a data source is able to provide unambiguous values for unknown attributes in the data object instance (especially high importance attributes), then DEF should reward the source and give it more preference going forward.

DEF captures this notion formally with the following equation:

    U_s = W \cdot \frac{1}{|O_s|} \left( \sum_{a \in O_s^+} e^{\frac{1}{|V_a|} - 1} P_T(v(a)) \, k_a - \sum_{a \in O_s^-} k_a \right)    (8)

where

    P_T(v(a)) = \begin{cases} P_T(v(a)), & \text{if } |V_a| = 1 \\ \operatorname{argmin}_{v(a) \in V_a} P_T(v(a)), & \text{if } |V_a| > 1 \end{cases}    (9)

O_s^+ are the output attributes from a data source for which values were returned, O_s^- are the output attributes from the same source for which values were not returned, and P_T(v(a)) is the relative frequency of the value v(a) over the past T values returned by the data source for the attribute a. W is the confidence over all input attributes to the source, and is defined in the previous subsection.

The utilities of a data source U_s from the past n calls are then used to adjust the base fitness score of the data source. This adjustment is captured with the following equation:

    B_s = B_s + \gamma \, \frac{1}{n} \sum_{i=1}^{n} U_s(T - i)    (10)

where B_s is the base fitness score of a data source s, U_s(T - i) is the utility of the data source i time steps back, and γ is the adjustment rate.
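The sketch below implements Equations 8 to 10 as reconstructed above; the dictionary-based arguments, the helper names, and the default adjustment rate are illustrative assumptions rather than DEF's actual interfaces.

import math

def source_utility(input_match, returned, missing, importance, past_frequency):
    """Utility U_s of a source after a call (Equations 8 and 9).
    input_match: W, the confidence over the inputs supplied to the source
    returned: {attr: [values]} for output attributes the source did return (O_s+)
    missing: list of output attributes the source failed to return (O_s-)
    importance: {attr: k_a}
    past_frequency: {attr: {value: relative frequency over the past T calls}}"""
    reward = 0.0
    for attr, values in returned.items():
        freqs = [past_frequency.get(attr, {}).get(v, 1.0) for v in values]
        p_t = freqs[0] if len(values) == 1 else min(freqs)   # Equation 9
        # Multi-valued (ambiguous) answers are discounted via exp(1/|V_a| - 1)
        reward += math.exp(1.0 / len(values) - 1.0) * p_t * importance.get(attr, 0.0)
    penalty = sum(importance.get(attr, 0.0) for attr in missing)
    total_outputs = len(returned) + len(missing)              # |O_s|
    return input_match * (reward - penalty) / max(total_outputs, 1)

def adjust_base_fitness(base_fitness, recent_utilities, rate=0.1):
    """Fold the average utility of the last n calls back into B_s (Equation 10)."""
    if not recent_utilities:
        return base_fitness
    return base_fitness + rate * sum(recent_utilities) / len(recent_utilities)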
System Architecture

[Figure 1: Design overview of the enrichment framework]

The main components of DEF are illustrated in Figure 1. The task manager starts a new enrichment project by instantiating and executing the enrichment engine. The enrichment engine uses the attribute computation module to calculate the attribute relevance. The relevance scores are used in source selection. Using the HTTP helper module, the engine then invokes the connector for the selected data source. A connector is a proxy that communicates with the actual data source and is a RESTful Web service in itself. The enrichment framework requires every data source to have a connector, and each connector must have two operations: 1) a return schema GET operation that returns the input and output schema, and 2) a get data POST operation that takes the input for the data source as POST parameters and returns the response as JSON. For internal databases, we have special connectors that wrap queries as RESTful end points.

Once a response is obtained from the connector, the enrichment engine computes the output value confidence, applies the necessary mapping rules, and integrates the response with the existing data object. In addition, the source degradation factor is also computed. The mapping, invocation, confidence, and source degradation computation steps are repeated until either all values for all attributes are computed or all sources have been invoked. The result is then written into the enrichment database.
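To illustrate the connector contract (a schema GET plus a get-data POST that returns JSON), the snippet below shows how an engine might call such a connector from Python; the connector URL, endpoint paths, and JSON field names are hypothetical, not part of the framework's specification.

import requests

CONNECTOR_URL = "http://connectors.example.com/whitepages"   # hypothetical connector

def fetch_from_connector(known_values):
    # 1) Return schema GET operation: discover which inputs the connector accepts
    schema = requests.get(CONNECTOR_URL + "/schema").json()
    inputs = {a: known_values[a] for a in schema["input"] if a in known_values}
    # 2) Get data POST operation: send the available inputs, read the enrichment as JSON
    response = requests.post(CONNECTOR_URL + "/data", data=inputs)
    response.raise_for_status()
    return response.json()

enriched = fetch_from_connector({"name": "John Smith", "city": "San Jose"})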
In designing the enrichment framework, we have adopted a service oriented approach, with the goal of exposing the enrichment framework as a "platform as a service". The core tasks in the framework are exposed as RESTful end points. These include end points for creating data objects, importing datasets, adding data sources, and starting an enrichment task. When the "start enrichment task" resource is invoked and a task is successfully started, the framework responds with a JSON that contains the enrichment task identifier. This identifier can then be used to GET the enriched data from the database. The framework supports both batch GET and streaming GET using the comet (?) pattern.

Data mapping is one of the key challenges in any data integration system. While extensive research literature exists on automated and semi-automated approaches to mapping (?; ?; ?; ?; ?), it is our observation that these techniques do not guarantee the high level of accuracy required in enterprise solutions. So, we currently adopt a manual approach, aided by a graphical interface for data mapping. The source and the target schemas are shown to the users as two trees, one to the left and one to the right. Users can select attributes from the source schema and draw a line between them and attributes of the target schema. Currently, our mapping system supports assignment, merge, split, numerical operations, and unit conversions. When the user saves the mappings, the maps are stored as mapping rules. Each mapping rule is represented as a tuple containing the source attributes, target attributes, mapping operations, and conditions. Conditions include merge and split delimiters and conversion factors.
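As an illustration of how such mapping rules might be represented and applied, here is a small Python sketch; the class name, field names, and operation strings are assumptions for illustration, not the framework's actual rule format.

from dataclasses import dataclass, field

@dataclass
class MappingRule:
    source_attrs: list
    target_attrs: list
    operation: str                                   # e.g. "assign", "merge", "split"
    conditions: dict = field(default_factory=dict)   # e.g. delimiters, conversion factors

def apply_rule(rule, record):
    """Apply one mapping rule to a source record and return the mapped target attributes."""
    if rule.operation == "assign":
        return {rule.target_attrs[0]: record[rule.source_attrs[0]]}
    if rule.operation == "merge":
        delim = rule.conditions.get("delimiter", " ")
        return {rule.target_attrs[0]: delim.join(record[a] for a in rule.source_attrs)}
    if rule.operation == "split":
        delim = rule.conditions.get("delimiter", " ")
        parts = record[rule.source_attrs[0]].split(delim)
        return dict(zip(rule.target_attrs, parts))
    raise ValueError("unsupported operation: " + rule.operation)

# Example: merge "first_name" and "last_name" from a source schema into "name"
rule = MappingRule(["first_name", "last_name"], ["name"], "merge", {"delimiter": " "})
print(apply_rule(rule, {"first_name": "John", "last_name": "Smith"}))  # {'name': 'John Smith'}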
Challenges and Experiences

• Data services and APIs are an integral part of the system.
• They have been around for a while, but have minimal traction in the enterprise.
• API rate limiting, and the lack of SLA-driven API contracts.
• Poor API documentation, often leading the developer in circles; minimalism in APIs might not be a good idea.
• Changes are not "pushed" to developers, who often find out only when an application breaks.
• Inconsistent auth mechanisms across different APIs, and mandatory auth makes it expensive (even when pulling information that is open on the Web).
• Easy to develop prototypes, but not enough confidence to develop a deployable client solution.

Related Work

Web APIs and data services have helped establish the Web as a platform. Web APIs have enabled the deployment and use of services on the Web using standardized communication protocols and message formats. Leveraging Web APIs, data services have allowed access to vast amounts of data that were hitherto hidden in proprietary silos in a standardized manner. A notable outcome is the development of mashups, or Web application hybrids. A mashup is created by integrating data from various services on the Web using their APIs. Although early mashups were consumer centric applications, their adoption within the enterprise has been increasing, especially in addressing data intensive problems such as data filtering, data transformation, and data enrichment.