Extending Word2Vec for Performance and Semi-Supervised Learning-(Michael Malak, Oracle)

Presentation at Spark Summit 2015

Published in: Data & Analytics

  1. Extending Word2Vec for Performance and Semi-Supervised Learning. Michael Malak, June 15, 2015. Overview: References (Word2Vec, YAGO, Trigram Statistical Matching, Jaccard, Other Set Similarity Metrics); Engineering "Discover" in Knowledge Service.
  2. Extending Word2Vec for Performance and Semi-Supervised Learning. Michael Malak, June 15, 2015. Spark Summit 2015.
  3. [Screenshot: Oracle Data Preparation Cloud Service, Transform Script view. Columns with type and sample values: date_time (timestamp), Col_0002 (string, browser user-agent strings), date_time_02 (date), url (string), a string column of product categories (shoes, clothing, movies, handbags), tv_channel (string: WABC, KLKN, WCJB, WJBF, WFTS), a string column of ISP domains (comcast.net, rr.com, verizon.net, windstream.net), geographical_fea… (string: hawthorne, hendersonville, seminole, adel). Side panels: Profile Results (metric, count, percent) and Recommendations.]
  4. ORACLE®
  5. Application UX (web endpoints); Knowledge Service; Oracle Storage Cloud Service (persistence of source data and result data).
  6. Discover, Enrich
  7. Application UX (web endpoints); Knowledge Service; Oracle Storage Cloud Service (persistence of source data and result data).
  8. Engineering "Discover" in Knowledge Service. [Diagram labeled "Tire Manufacturer".]
  9. Machine Learning Types and Tools. Supervised Learning: YAGO, crowdsourced labeled data (example label: "Cat"). Unsupervised Learning. Semi-Supervised Learning.
  10. Machine Learning Types. Supervised Learning (labeled examples: Cat, Dog). Unsupervised Learning. Semi-Supervised Learning.
  11. Tools. YAGO (crowdsourced labeled data): categories such as Tire Manufacturers (Michelin, Bridgestone) and Tire Industry People (Harvey Samuel Firestone); sources: 5 million Wikipedia pages, WordNet, GeoNames, et al.; DBPedia; one-time processing of sources by the YAGO paper authors; entities extracted and placed in a graph (of nodes and edges). Word2Vec (unsupervised learning): 300 columns; 2,000,000 rows from the training corpus; words such as Bridgestone and Firestone placed in a 300-dimensional vector space.
  12. Naive Combining of Word2Vec and YAGO. [Diagram: categories Grand Prix, Tire Manufacturer, Tom Brady; terms Bridgestone, Firestone, Michelin, Ford, Grand Prix, Formula 1; labels "Jaccard" and a score of 0.638.]
  13. Jaccard Similarity Metric: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
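
The Jaccard metric on this slide is the size of the intersection of two sets divided by the size of their union. A minimal sketch in Python follows; the function name and the convention of returning 1.0 for two empty sets are assumptions made for illustration and are not taken from the talk.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two iterables treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Two of the three names are shared; the union holds four distinct names.
left = {"michelin", "bridgestone", "firestone"}
right = {"michelin", "bridgestone", "ford"}
print(jaccard(left, right))  # 2 / 4 = 0.5
```

The same function works on sets of character trigrams, which is how the later slides apply the metric to text strings.
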
  14. Naive Combining of Word2Vec and YAGO. [Diagram: categories Grand Prix, Tire Manufacturer, Tom Brady; terms Bridgestone, Firestone, Michelin, Ford, Grand Prix, Formula 1; labels "Jaccard" and a score of 0.638.]
  15. Engineering "Discover" in Knowledge Service. [Diagram labeled "Tire Manufacturer".]
  16. Tire Manufacturer. [Diagram; remaining text not legible.]
  17. YAGO categories, e.g. Tire Manufacturers (Michelin, Bridgestone) and Tire Industry People (Harvey Samuel Firestone): 371,000 categories from YAGO, each with an average of 100 text strings; one-time processing to index trigrams. Word2Vec model: 2 million words from the Google News corpus. The input words are augmented with synonyms. Each word from the augmented list is split into trigrams and run through the trigram index to produce a list of potentially matching categories. The potentially matching category names are scored according to the Jaccard similarity metric. Diagram labels: "invention", 0.93, "Communications Equipment".
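
The trigram-index step described on this slide can be sketched as follows. This is an illustration under stated assumptions, not the speaker's implementation: it assumes character trigrams without padding, an index built over each category's member strings, and a score equal to the Jaccard similarity between the query's trigram set and the category's trigram set. All function names are hypothetical, and the two categories are the examples named on the slide.

```python
from collections import defaultdict


def trigrams(text):
    """Character trigrams of a lowercased string (no padding)."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def build_index(categories):
    """One-time step: map each trigram to the categories whose strings contain it."""
    index = defaultdict(set)
    category_trigrams = {}
    for name, strings in categories.items():
        grams = set()
        for s in strings:
            grams |= trigrams(s)
        category_trigrams[name] = grams
        for g in grams:
            index[g].add(name)
    return index, category_trigrams


def score_candidates(words, index, category_trigrams):
    """Find candidate categories through the index, then rank them by Jaccard."""
    query = set()
    for w in words:
        query |= trigrams(w)
    candidates = set()
    for g in query:
        candidates |= index.get(g, set())
    scored = []
    for name in candidates:
        grams = category_trigrams[name]
        scored.append((len(query & grams) / len(query | grams), name))
    return sorted(scored, reverse=True)


categories = {
    "Tire Manufacturers": ["Michelin", "Bridgestone"],
    "Tire Industry People": ["Harvey Samuel Firestone"],
}
index, category_trigrams = build_index(categories)
ranked = score_candidates(["Bridgestone", "Firestone"], index, category_trigrams)
for score, name in ranked:
    print(f"{score:.3f}  {name}")
```

Only categories that share at least one trigram with the query are ever scored, which is the point of building the index once up front.
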
  18. YAGO. Key: <Aristotle>, <Plato>. Alternative spellings: "Aristotle"@eng, "Aristotel"@eng. Alternative classifications (relations): <Aristotle> is a <wordnet_philosopher_110423589>, <wordnet_scholar_110557854>, <wordnet_intellectual_109621545>, <wordnet_person_100007846>. Resulting lists: philosophers (Aristotle, Plato, Socrates); intellectuals (Aristotle, Plato, Socrates, Eiffel); people (Aristotle, Plato, Socrates, Eiffel, Kim Kardashian). Notes: YAGO3 released March; multiple languages; projected 400,000 lists.
  19. [Image slide; no legible text.]
  20. YAGO. Key: <Aristotle>, <Plato>. Alternative spellings: "Aristotle"@eng, "Aristotel"@eng. Alternative classifications (relations): <Aristotle> is a <wordnet_philosopher_110423589>, <wordnet_scholar_110557854>, <wordnet_intellectual_109621545>, <wordnet_person_100007846>. Resulting lists: philosophers (Aristotle, Plato, Socrates); intellectuals (Aristotle, Plato, Socrates, Eiffel); people (Aristotle, Plato, Socrates, Eiffel, Kim Kardashian). Notes: YAGO3 released March; multiple languages; projected 400,000 lists.
  21. [Screenshot: the English Wikipedia article "Aristotle" (en.wikipedia.org), showing the opening paragraphs and the infobox.]
  22. [Screenshot: bottom of the Wikipedia article "Aristotle", showing the authority control box and the category list, e.g. Ancient Stagirites, Attic Greek writers, Cosmologists, Empiricists, Ancient Greek biologists, Metaphysicians, Meteorologists, Natural philosophers, Peripatetic philosophers, Philosophers of language, Philosophers of law, Philosophers of mind, Political philosophers, Virtue ethicists, Zoologists.]
  23. YAGO categories, e.g. Tire Manufacturers (Michelin, Bridgestone) and Tire Industry People (Harvey Samuel Firestone): 371,000 categories from YAGO, each with an average of 100 text strings; one-time processing to index trigrams. Word2Vec model: 2 million words from the Google News corpus. The input words are augmented with synonyms. Each word from the augmented list is split into trigrams and run through the trigram index to produce a list of potentially matching categories. The potentially matching category names are scored according to the Jaccard similarity metric. Diagram labels: "invention", 0.93, "Communications Equipment".
  24. Word2Vec. Example sentence: "Aristotle and Socrates stipulated a model of citizenship." Highlighted word: Socrates. Architecture: Input Layer, Hidden Layer, Output Layer.
  25. [Word-vector diagram: MAN, WOMAN, UNCLE, AUNT, KING, QUEEN; KINGS, QUEENS.]
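
The word pairs on this slide are the usual illustration of vector arithmetic on word embeddings: the offset from "man" to "woman" is roughly the same as the offset from "king" to "queen". The sketch below uses hand-made two-dimensional vectors that are invented purely for illustration (real Word2Vec vectors are learned and, per the earlier slide, 300-dimensional); the `analogy` helper is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for this example: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "uncle": (0.5, 1.0),
    "aunt": (0.5, -1.0),
}


def analogy(a, b, c):
    """Word closest to vec(b) - vec(a) + vec(c), excluding the three inputs."""
    target = tuple(vb - va + vc
                   for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    others = [w for w in vectors if w not in (a, b, c)]
    return max(others, key=lambda w: cosine(target, vectors[w]))


print(analogy("man", "king", "woman"))   # queen
print(analogy("man", "uncle", "woman"))  # aunt
```
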
  26. Spark MLlib Word2Vec: class Word2VecModel private[mllib] (private val model: Map[String, Array[Float]]). Labels: Spark 1.2; Broadcast; as of Spark 1.3; IndexedRDD; mllib.linalg.distributed.
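
The slide shows the model held as a plain map from word to float array. With that representation, finding the nearest words to a query means scanning every entry, which is one way to see why the talk is concerned with performance. The sketch below mirrors that map with a Python dict; it is not Spark's code, the vectors are invented toy values, and the function name is chosen for illustration.

```python
import heapq
import math


def find_synonyms(model, word, num):
    """Top `num` words by cosine similarity to `word`, scanning the whole map."""
    query = model[word]
    query_norm = math.sqrt(sum(x * x for x in query))
    scored = []
    for other, vec in model.items():
        if other == word:
            continue
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0.0 or query_norm == 0.0:
            continue  # skip zero vectors, for which cosine is undefined
        dot = sum(x * y for x, y in zip(query, vec))
        scored.append((dot / (query_norm * norm), other))
    return heapq.nlargest(num, scored)


# Toy 3-D vectors invented for this example.
model = {
    "bridgestone": [0.9, 0.1, 0.0],
    "firestone": [0.8, 0.2, 0.0],
    "michelin": [0.7, 0.3, 0.1],
    "aristotle": [0.0, 0.1, 0.9],
}
for score, other in find_synonyms(model, "bridgestone", 2):
    print(f"{score:.3f}  {other}")
```

Each lookup costs time proportional to the vocabulary size times the vector length, so with 2,000,000 rows and 300 columns a single query touches about 600 million floats.
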
  27. Word2Vec. Example sentence: "Aristotle and Socrates stipulated a model of citizenship." Highlighted word: Socrates. Architecture: Input Layer, Hidden Layer, Output Layer.
  28. YAGO categories, e.g. Tire Manufacturers (Michelin, Bridgestone) and Tire Industry People (Harvey Samuel Firestone): 371,000 categories from YAGO, each with an average of 100 text strings; one-time processing to index trigrams. Word2Vec model: 2 million words from the Google News corpus. The input words are augmented with synonyms. Each word from the augmented list is split into trigrams and run through the trigram index to produce a list of potentially matching categories. The potentially matching category names are scored according to the Jaccard similarity metric. Diagram labels: "invention", 0.93, "Communications Equipment".
  29. Each word from the augmented list is split into trigrams and run through the trigram index to produce a list of potentially matching categories.
  30. Tire Manufacturer. [Diagram; remaining text not legible.]
  31. Engineering "Discover" in Knowledge Service. [Diagram labeled "Tire Manufacturer".]
  32. [Screenshot: Oracle Data Preparation Cloud Service, Transform Script view. Columns with type and sample values: date_time (timestamp), Col_0002 (string, browser user-agent strings), date_time_02 (date), url (string), a string column of product categories (shoes, clothing, movies, handbags), tv_channel (string: WABC, KLKN, WCJB, WJBF, WFTS), a string column of ISP domains (comcast.net, rr.com, verizon.net, windstream.net), geographical_fea… (string: hawthorne, hendersonville, seminole, adel). Side panels: Profile Results (metric, count, percent) and Recommendations.]
  33. References. Word2Vec: "Exploiting Similarities among Languages for Machine Translation", 2013, Mikolov et al. YAGO: "YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia", 2007, Suchanek et al. Trigram Statistical Matching: "Automatic spelling correction using a trigram similarity measure", 1983, Angell et al. Jaccard: "The distribution of the flora in the alpine zone", 1912, Jaccard. Other Set Similarity Metrics.
  34. Extending Word2Vec for Performance and Semi-Supervised Learning. Michael Malak, June 15, 2015. Spark Summit 2015.
