Extend UDF Technology for Integrated Analytics

Qiming Chen, Meichun Hsu, Rui Liu*
HP Labs, Palo Alto, California, USA
*HP Labs, Beijing, China

Motivations: Running data-intensive analytics outside the database causes significant overhead. Round-trip data transfer between the database platform and the computation platform is very costly. The analytics layer is burdened with many generic data management issues. The opportunity to balance resource utilization between data management and analytic processing is lost. UDFs have been extensively investigated for pushing down computation.

Challenges & Problems (1): UDFs lack formal support for relational input and output. As a result, they are unable to model complex applications, and execution is inefficient: the tuple-wise pipeline prohibits in-function batch and parallel processing.

Challenges & Problems (2): There is a conflict between UDF execution efficiency and ease of coding. UDFs are hard to code: analytics users have to deal with hard-to-follow system details, whereas MapReduce isolates system details from the developer. Encoding arguments into strings simplifies argument passing but incurs a performance penalty.

Solution (1): Relation Valued Function (RVF). An invocation pattern is the mechanism for dealing with inputs and return values, e.g. tuple by tuple or as a whole set. Relation Valued Functions are classified by their invocation patterns. This yields deterministic steps of system interaction and singles out application logic from system utilities.

Solution (2): Simple Relation Object Mapping (SROM). Separate an RVF into an RVF shell and a ‘user-function’. Automate RVF shell generation.

Example: Corner kick scene rank. Tables: CKImages, a large set of images of corner kicks in soccer games, and CKSamples, a collection of sample images of ‘typical’ corner kick scenes.

Calculate Image Similarity: For each image, extract SIFT features, with each point represented as a 128-dimensional vector, and generate a composite feature vector. The closeness of two images is determined by the similarity of their composite feature vectors.

8/31/2009
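The slides do not say which measure `sim` uses to compare composite feature vectors. As a minimal sketch, assuming cosine similarity and plain Python lists as feature vectors (both are assumptions made for illustration):

```python
import math


def sim(u, v):
    """Cosine similarity of two equal-length feature vectors (assumed measure)."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)
```

Any other measure with the same call shape could be substituted without changing the queries that follow.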

Rank Sample Images

    SELECT Sid, COUNT(Neighbor) AS n
    FROM (SELECT P.ID AS Neighbor,
                 (SELECT S.ID
                  FROM CKSamples S
                  WHERE sim(P.feature, S.feature) =
                        (SELECT MAX(sim(P2.feature, S2.feature))
                         FROM CKSamples S2, CKImages P2
                         WHERE P2.ID = P.ID)) AS Sid
          FROM CKImages P)
    GROUP BY Sid
    ORDER BY n;

The query works in three steps: (1) derive the closest sample image of each corner kick image (by maximal similarity); (2) for each sample image s, count the images having s as their closest sample; (3) rank the sample images by that count.
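The three steps of the ranking query can be sketched outside SQL as follows. This is an illustration only: the relations are modeled as dicts from ID to feature, and `toy_sim` with 1-D features is a hypothetical stand-in for the real similarity function.

```python
from collections import Counter


def rank_samples(images, samples, sim):
    """images, samples: dicts mapping ID -> feature; sim: similarity function."""
    counts = Counter()
    for image_id, feature in images.items():
        # Step 1: the closest sample is the one with maximal similarity.
        closest = max(samples, key=lambda sid: sim(feature, samples[sid]))
        # Step 2: count this image as a neighbor of that sample.
        counts[closest] += 1
    # Step 3: rank samples by neighbor count (ascending, like ORDER BY n).
    return sorted(counts.items(), key=lambda item: item[1])


def toy_sim(a, b):
    """Hypothetical similarity on 1-D features: closer values score higher."""
    return -abs(a - b)


images = {"p1": 0.1, "p2": 0.2, "p3": 0.9}
samples = {"s1": 0.0, "s2": 1.0}
ranking = rank_samples(images, samples, toy_sim)
print(ranking)  # [('s2', 1), ('s1', 2)]
```

As with the GROUP BY in the query, samples that are nobody's closest sample do not appear in the result.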

Inefficiency of execution: In the ranking query above, the CKSamples relation is not cached. It is retrieved in a nested query for each tuple instance p of CKImages.
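The cost difference can be made visible with a small sketch that counts how often the samples relation is scanned. `CountingTable`, `toy_sim`, and the 1-D toy data are hypothetical and exist only for this illustration; they do not model any particular DBMS.

```python
class CountingTable:
    """Stand-in for a relation that counts how often it is scanned."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.scans = 0

    def scan(self):
        self.scans += 1
        return list(self._rows)


def toy_sim(a, b):
    """Hypothetical similarity on 1-D features."""
    return -abs(a - b)


def closest_nested(images, samples_table):
    """Re-scan the samples for every image tuple, like the nested query."""
    result = {}
    for pid, feature in images:
        rows = samples_table.scan()
        result[pid] = max(rows, key=lambda s: toy_sim(feature, s[1]))[0]
    return result


def closest_cached(images, samples_table):
    """Scan the samples once and reuse them for every image tuple."""
    rows = samples_table.scan()
    result = {}
    for pid, feature in images:
        result[pid] = max(rows, key=lambda s: toy_sim(feature, s[1]))[0]
    return result


images = [("p1", 0.1), ("p2", 0.2), ("p3", 0.9)]
nested_table = CountingTable([("s1", 0.0), ("s2", 1.0)])
cached_table = CountingTable([("s1", 0.0), ("s2", 1.0)])
nested = closest_nested(images, nested_table)
cached = closest_cached(images, cached_table)
print(nested_table.scans, cached_table.scans)  # 3 1
```

Both variants return the same answer; only the number of scans of the samples differs, growing with the number of image tuples in the nested case.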

Relation Valued Function

An RVF is specified as

    DEFINE RVF f (x, y, R1, R2) RETURN R3 {
        float a, b;
        Relation R1 (/*schema1*/);
        Relation R2 (/*schema2*/);
        Relation R3 (/*schema3*/);
        PROCEDURE fn(/*dll name*/);
        RETURN MODE SET_MODE;
        INVOCATION PATTERN BLOCK
    }

RVFs can be naturally composed with other relational operators or sub-queries:

    SELECT * FROM rvf1(Q4, rvf2(Q1, Q2, Q3));
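To illustrate what composition means here, the sketch below models relations as Python lists of dicts and RVFs as ordinary functions from relations to a relation. The bodies of `rvf1` and `rvf2` are invented for the example; the slides give only their call shape.

```python
def rvf2(r1, r2, r3):
    """Hypothetical RVF: concatenate three relations that share a schema."""
    return r1 + r2 + r3


def rvf1(r4, r):
    """Hypothetical RVF: keep the rows of r whose id also appears in r4."""
    wanted = {row["id"] for row in r4}
    return [row for row in r if row["id"] in wanted]


q1 = [{"id": 1}]
q2 = [{"id": 2}]
q3 = [{"id": 3}]
q4 = [{"id": 1}, {"id": 3}]

# Mirrors: SELECT * FROM rvf1(Q4, rvf2(Q1, Q2, Q3));
result = rvf1(q4, rvf2(q1, q2, q3))
print(result)  # [{'id': 1}, {'id': 3}]
```

Because each function both consumes and produces relations, the output of one can be fed directly to another, just as with built-in relational operators.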

Invocation Patterns: Input modes: PerTuple, Block, and PerTuple/Block. Return modes: Tuple and Set.

PerTuple Input Mode

    SELECT ID, Summary
    FROM per_image_summery_rvf ("SELECT feature FROM CKSamples");

Block Input Mode

    SELECT r.sid, COUNT(r.neighbor) AS n
    FROM ck_rvf1 ("SELECT * FROM CKImages", "SELECT * FROM CKSamples") r
    GROUP BY r.sid
    ORDER BY n;
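A rough sketch of the block input shape: the function receives both relations as whole sets and returns a whole relation of (sid, neighbor) rows, which the outer query then groups and counts. The Python `ck_rvf1` below is a hypothetical stand-in that borrows the name from the query; `toy_sim` and the data are assumptions.

```python
from collections import Counter


def toy_sim(a, b):
    """Hypothetical similarity on 1-D features."""
    return -abs(a - b)


def ck_rvf1(ck_images, ck_samples):
    """Block input, set return: consume both relations whole, return all rows."""
    rows = []
    for pid, feature in ck_images:
        sid = max(ck_samples, key=lambda s: toy_sim(feature, s[1]))[0]
        rows.append((sid, pid))  # (sid, neighbor)
    return rows


ck_images = [("p1", 0.1), ("p2", 0.2), ("p3", 0.9)]
ck_samples = [("s1", 0.0), ("s2", 1.0)]

r = ck_rvf1(ck_images, ck_samples)
# Outer query: GROUP BY r.sid, COUNT(r.neighbor) AS n, ORDER BY n
n = Counter(sid for sid, _neighbor in r)
ranked = sorted(n.items(), key=lambda item: item[1])
print(ranked)  # [('s2', 1), ('s1', 2)]
```

Since the function sees each relation once and in full, it is free to batch or parallelize its work internally.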

PerTuple/Block Input Mode

    SELECT Sid, COUNT(Neighbor) AS n
    FROM (SELECT P.ID AS Neighbor,
                 ck_rvf2 (P.ID, P.feature, "SELECT * FROM CKSamples") AS Sid
          FROM CKImages P)
    GROUP BY Sid
    ORDER BY n;

Separating RVF Shell and User-Function: Separate an RVF into an RVF shell and a user-function. Provide high-level RVF shell APIs for building the shell, shielding RVF developers from DBMS internal details. Generate RVF shells from RVF specifications and their input and output modes.

RVF-Shell and APIs

    SQLUDR_INT32 ck_rvf2(RVF_ARGS)
    {
        int rv;
        RVFCallContext *h;
        ck_rvf2_args *hARGS;
        CKSamples *samples;

        if (RVF_IS_FIRST_CALL()) {
            ….
        }
        if (RVF_IS_NORMAL_CALL()) {
            ….
            /*user-function*/
            int sid = find_closest_sample (ID, feature, samples);
            RVF_RETURN_NEXT(sid);
            RVF_NORMAL_CALL_END(h);
        }
        if (RVF_IS_LAST_CALL()) {
            ….
        }
        return rv;
    }
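The split between shell and user-function can be sketched in Python as follows. The class `CkRvf2Shell` and its method names are hypothetical and only mirror the three call phases of the C shell. The slides elide what happens in the first and last calls, so loading the samples once on the first call and releasing them on the last is an assumption.

```python
class CkRvf2Shell:
    """Hypothetical stand-in for the ck_rvf2 shell with its three call phases."""

    def __init__(self, load_samples, user_function):
        self._load_samples = load_samples
        self._user_function = user_function
        self._samples = None
        self.loads = 0

    def first_call(self):
        # Assumed behavior: fetch the block input once and keep it.
        self._samples = self._load_samples()
        self.loads += 1

    def normal_call(self, image_id, feature):
        # Hand the per-tuple arguments and the cached relation to user code.
        return self._user_function(image_id, feature, self._samples)

    def last_call(self):
        # Assumed behavior: release the cached relation.
        self._samples = None


def find_closest_sample(image_id, feature, samples):
    """User-function: application logic only, no system details."""
    return max(samples, key=lambda s: -abs(feature - s[1]))[0]


shell = CkRvf2Shell(lambda: [("s1", 0.0), ("s2", 1.0)], find_closest_sample)
shell.first_call()
sids = [shell.normal_call(pid, f)
        for pid, f in [("p1", 0.1), ("p2", 0.2), ("p3", 0.9)]]
shell.last_call()
print(sids, shell.loads)  # ['s1', 's1', 's2'] 1
```

The user-function never touches the call phases, which is what makes it plausible to generate the shell automatically from a specification.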
