Twitter for Research

Slides prepared for an April 23, 2015 "Twitter for Research" conference at Emlyon University in Lyon, France.

  1. 1. "...a wealth of information creates a poverty of attention." - Herbert Simon, 197… Text Mining on Twitter Data with DiscoverText. "Twitter for Research" Workshop, Emlyon University, April 23, 2015. Dr. Stuart Shulman, Founder & CEO, Texifter
  2. 2. Texifter: pronounced "tech-sifter"; the metaphor is of a sifter
  3. 3. Text Classification: a 2,500-year-old problem. Plato argued it would be frustrating and … [Diagram labels: machine learning algorithm, feature extractor, features, classifier model]
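The diagram labels on this slide (feature extractor, machine learning algorithm, classifier model) describe a standard supervised pipeline. The following is a minimal Python sketch of such a pipeline, assuming a bag-of-words feature extractor and a naive Bayes learner. Neither choice is stated on the slide, and the training tweets and the function names are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def extract_features(text):
    """Feature extractor (assumed): lowercase word tokens, keeping # and @ prefixes."""
    return re.findall(r"[#@]?\w+", text.lower())


def train(labeled_texts):
    """Learn per-label word counts from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_texts:
        label_counts[label] += 1
        for word in extract_features(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in extract_features(text):
            # Add-one smoothing so unseen words do not zero out a label.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical human-coded examples standing in for a real training set.
training = [
    ("Lyon wins the match tonight", "sports"),
    ("great goal in the football match", "sports"),
    ("best restaurant in Lyon for dinner", "food"),
    ("this bakery has great bread", "food"),
]
model = train(training)
print(classify(model, "what a goal in the match"))
print(classify(model, "dinner at a restaurant"))
```

The same `extract_features` function is used for training and for prediction, which is the point of the two feature-extractor boxes in the diagram.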
  4. 4. Grimmer & Stewart, "Text as Data," Political Analysis (2013): Volume is a problem for scholars. Coders are expensive. Groups struggle to accurately label text at scale. Validation of both humans and machines is "essential." Some models are easier to validate than others. All models are wrong. Automated models enhance/amplify, but don't replace, humans. There is no one right way to do this. "Validate, validate, validate." "What should be avoided then, is the blind use of any method without a validation step."
  5. 5. Our free, open-source, web-based text ana… [Screenshot of the Coding Analysis Toolkit (CAT) home page.] Welcome to the Coding Analysis Toolkit (CAT). CAT is a free service of the Qualitative Data Analysis Program (QDAP), hosted by the University Center for Social and Urban Research at the University of Pittsburgh, and QDAP-UMass, in the College of Social and Behavioral Sciences at the University of Massachusetts Amherst. CAT was the 2008 winner of the "Best Research Software" award from the organized section on Information Technology & Politics in the American Political Science Association. May 5, 2010: CAT is now an open source project! You can host your own version of CAT from the project source code at http://sourceforge.net/projects/catoolkit/. CAT Statistics: there are currently 7,984 primary CAT accounts, and CAT users have uploaded 11,104 raw datasets; the other counts on the page are not legible in the capture. If you like CAT, you'll love DiscoverText. DiscoverText is a cloud-based, collaborative text analytics solution. Generate valuable insights about customers, products, employees, news, citizens, and more. Sign up for a 30-day free trial. What file types can CAT import? Plain text, HTML, CAT XML, … CAT Resources: Raw Data Preparation Guide; ATLAS.ti Upload Preparation. CAT is maintained by Texifter, LLC and powered by Microsoft ASP.NET, with hosting in a FISMA-compliant environment.
  6. 6. The original software kernel: tools for … What can you do in CAT? Efficiently code raw text data sets; annotate coding with shared memos; manage team coding permissions via the Web; create unlimited collaborator sub-accounts; assign multiple coders to specific tasks; easily measure inter-rater reliability; adjudicate valid & invalid coder decisions; report validity by dataset, code, or coder; export coding in RTF, CSV, or XML format; archive or share completed projects.
  7. 7. A mission to avoid tennis elbow. Higher efficiency and less wrist pain: "We have implemented an efficient workflow to help you avoid those aching joints."
  8. 8. Keystroke human coding: alone or in groups. [Screenshot: DiscoverText coding screen with keystroke codes (1) News or Current Events, (2) Sports, (3) Food or Restaurants, (5) Other. Sample item from Europafoot: "Mercato Lyon : Un grand d'Angleterre rêve de Lacazette … dans #Football" ("Lyon transfer window: an English giant dreams of Lacazette … in #Football").] Human coding can be distributed to individuals; …
  9. 9. Twitter data should be coded only in the … [Screenshot: a tweet as displayed on Twitter: "VIDEO. PSG-ASSE: Les quatre derniers bons souvenirs stéphanois au Parc des Princes: FOOTBALL Les déplacement... bit.ly/1aKyP6s", with the linked 20 Minutes article preview: "Les déplacements dans la capitale n'ont pas toujours été aussi douloureux pour les Verts que le 5-0 subi en août dernier. Retour sur quatre belles performances depuis 2005."] The rush to CSV is a mistake; the data is severely …
  10. 10. Computer science & NSF influences: measure everything. How accurate?
  11. 11. Inter-rater reliability is one critical measurement. Standard comparison table (per code: counts for Coders 5981, 9953, 9955, 2668, 2279, 4341; exact match; partial match; kappa). MH-17: 11, 13, 27, 20, 8, 11; 7; 12; 0.69. Not MH-17: 189, 187, 173, 180, 192, 189; 172; 17; 0.96. Totals: 200, 200, 200, 200, 200, 200; 179; 29; 0.94.
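The slide reports kappa values for six coders but does not say which kappa statistic the tool computes. As an illustration only, here is a sketch of Fleiss' kappa, one common agreement measure for more than two raters. The rating matrix below is hypothetical, since the slide shows per-coder totals and not item-level decisions.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a matrix where counts[i][j] is the number of raters
    who assigned item i to category j. Every item needs the same number of
    raters. Undefined (division by zero) if all ratings fall in one category."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_categories = len(counts[0])

    # Share of all ratings that went to each category.
    category_share = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items
    # Agreement expected by chance.
    expected = sum(share * share for share in category_share)
    return (observed - expected) / (1 - expected)


# Hypothetical ratings: 4 items, 6 coders, categories (MH-17, Not MH-17).
ratings = [
    [6, 0],
    [0, 6],
    [5, 1],
    [0, 6],
]
kappa = fleiss_kappa(ratings)
print(round(kappa, 3))
```

Values near 1 indicate agreement well above chance; values near 0 indicate agreement no better than chance.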
  12. 12. This is our mantra as seen in the new t-shirts: "Humans and machines learning together." Humans are good at some things while computers are good at others. A consistent back and forth between human coders and machines increases the ability of both to learn, and makes for accurate results.
  13. 13. Plugged in to APIs. PREMIUM SOCIAL DATA via the Gnip PowerTrack for Twitter, Tumblr, Disqus, and WordPress. IMPORT RESULTS directly from your existing SurveyMonkey account.
  14. 14. Import data directly via APIs or from your … [Screenshot: Import Data screen. Surveys: SurveyMonkey. PowerTrack Historical via Sifter (53,913 credits, purchase more). Blogs and Social: Disqus, WordPress Export, WordPress, Facebook, Google+, YouTube, Tumblr. Other formats: FDMS, DT XML.]
  15. 15. Import data directly from SurveyMonkey. Authorize DiscoverText to use your SurveyMonkey account. This application will be able to: access surveys and survey data in your account; access information on the respondents of your surveys; access collector data associated with your surveys; create surveys on your behalf. This application will not be able to: see your SurveyMonkey password. By authorizing an application you continue to operate under SurveyMonkey's Terms of Service. In particular, some usage information will be shared back with SurveyMonkey. For more, see our Privacy Policy.
  16. 16. "The 5 Best Integrations to Help Analyze Your SurveyMonkey Data," by Tony M, Survey Tips, 08/19/14: "DiscoverText provides the ability to take mountains of raw open-ended textual data and process it into useful information. With its roots in automated processing of public comments for the US Government, DiscoverText has developed the tools to process text at scale. Their workflow tools and ActiveLearning classification technology feature allow people to be working and classifying the text at the same time. The results are text filtering, sifting, sorting, and clustering functionalities, which pale in comparison to many of its competitors. Add to that the ability to redact sensitive data, and DiscoverText has created a tool that is a must-have for users who collect a large number of responses to open-ended questions."
  17. 17. Twitter is quite big and can feel overwhelming
  18. 18. Full historical Twitter access with free estimating. Sifter: search and retrieve data from the complete undeleted history of Twitter.
  19. 19. Stack operators for more precise queries. Rule text: Lyon has:links bio_contains:Twitter is:verified has:geo. [Screenshot: rule builder with one row per operator (text match: Lyon; has:links; bio_contains: Twitter; is:verified; has:geo), each with a negation toggle and add/remove buttons.] A single rule can support up to 29 positive and 49 negative operators, subject to some caveats.
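Stacking operators into one rule amounts to joining clauses into a single rule string. The sketch below assembles such a string and enforces the 29 positive / 49 negative limits quoted on the slide. The `build_rule` helper is hypothetical, and the leading `-` for negated clauses is an assumption about the rule syntax; this is not Sifter's own code.

```python
MAX_POSITIVE = 29  # limits as quoted on the slide
MAX_NEGATIVE = 49


def build_rule(positive, negative=()):
    """Join positive clauses, then negated clauses, into one rule string."""
    positive = list(positive)
    negative = list(negative)
    if not positive:
        raise ValueError("a rule needs at least one positive operator")
    if len(positive) > MAX_POSITIVE:
        raise ValueError("too many positive operators")
    if len(negative) > MAX_NEGATIVE:
        raise ValueError("too many negative operators")
    # Assumed syntax: a leading "-" negates a clause.
    return " ".join(positive + ["-" + clause for clause in negative])


rule = build_rule(
    ["Lyon", "has:links", "bio_contains:Twitter", "is:verified", "has:geo"]
)
print(rule)
```

Checking the limits before submitting a rule avoids paying for an estimate on a rule that would be rejected anyway.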
  20. 20. [Automated email from Sifter Administrator.] Hi Stuart, The estimate has completed for Job: 20150423084801-1025. Rule Text: nike has:links has:geo. Start Date: 04/01/2015. End Date: 04/07/2015. Estimated Activities: 4,000. The total cost of accepting this job is $170.00. To accept or reject this estimate, log onto http://sifter.texifter.com and choose to accept or reject this item from your dashboard. If you choose to accept the estimate, an invoice will be emailed to you and once paid, the job will run and compile the data files for download. This estimate will expire on Thursday, April 30, 2015 8:49:14 AM.
  21. 21. New beta feature: Bulk Twitter handle upload. Top meta values (handle: total): nytimes: 554; Cocacola: 206; Starbucks: 206; FoxNews: 204; Nike: 33; IBM: 15; MaryKay: 16; Lexus: 13; Microsoft: 8; Dell: 3.
  22. 22. VB Insight asked us to build it, so of course we did. John Koetsier, April 21, 2015 10:31 AM: "To answer those questions and find the best social media management tools for big, medium, and growing brands, we surveyed 1,133 social media managers who build their companies' brands on Facebook, Twitter, LinkedIn, Google+, and emerging mobile social networks. With help from Texifter's DiscoverText technology, we analyzed over 250,000 tweets. We tracked the 1,600 brands with the most followers and the most engagement in their use of social media." SOURCE: SOCIAL MEDIA MANAGEMENT: TOOLS, TACTICS AND HOW TO WIN. SEE MORE AT INSIGHT.VENTUREBEAT.COM
  23. 23. Big brands: SMM tools used [chart]. SOURCE: SOCIAL MEDIA MANAGEMENT: TOOLS, TACTICS AND HOW TO WIN. SEE MORE AT INSIGHT.VENTUREBEAT.COM
  24. 24. If you ask us to build it, we will try to get it in … "Data partners and contributors: We would like to thank our data partners and contributors for providing us with invaluable information in the creation of this report. Texifter, via its DiscoverText tool, provided us with the ability to monitor tweets sent across a nine-day period, within which we gathered over 250,000 status updates being sent by 1,600 brands. DiscoverText also provided us the analysis tools capable of gathering insights from such a large data set, and the team developed a new product feature (social ID list uploads) to enable the monitoring of such a large number of brands. We would like to extend additional thanks to Texifter for that fast feature turnaround." SOURCE: SOCIAL MEDIA MANAGEMENT: TOOLS, TACTICS AND HOW TO WIN. SEE MORE AT INSIGHT.VENTUREBEAT.COM
  25. 25. Twitter data exports limited to 100,000 per day. Please note that per Twitter's API Terms of Service, we are not allowed to export any results that come from Twitter sources to XML. If you export a dataset in XML format that contains Twitter-based as well as other data, only the non-Twitter data will be exported. For CSV or ZIP exports, you can export up to 100,000 Twitter-related items per day. You currently have 100,000 Twitter Credits remaining for this day. This is global to your account; any Twitter-related items will not be exported beyond that until the credits reset. Export functions for dataset: Lyon All Items No Duplicates French >90 31204 Uncoded. Options: include unit id (if available); include unit text; include metadata; include codes and coders; include classification; include reference text (if available); include adjudications; include annotations; include attachments (only available for .zip file export). Output format: CSV File.
  26. 26. Another new beta feature is our bulk Twitter ID export. If you wish to export a list of all Twitter IDs in this archive for rehydration at a later time: Export TwitterIDs. Replication datasets are increasingly required by federal funding agencies and publishers of … Sample exported IDs: 585988595191648256, 585988589164572673, 585988565286440960, 585988561075216384, 585988532705058817, 585988530943479808, 585988497439227906, 585988399414149120, 585988398533324860, …
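Rehydration means looking tweets up again by ID at a later date, which is typically done in batches. The sketch below reads an exported ID list and splits it into batches. The helper names are hypothetical, and the batch size of 100 is an assumption about a lookup endpoint's per-request limit, not something stated on the slide.

```python
def parse_ids(lines):
    """Keep only lines that are purely numeric tweet IDs; skip blanks and debris."""
    ids = []
    for line in lines:
        candidate = line.strip()
        if candidate.isdigit():
            ids.append(candidate)
    return ids


def batches(ids, size=100):
    """Yield consecutive slices of at most `size` IDs (size is an assumed limit)."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Two IDs from the slide plus a blank line and a damaged entry.
exported = ["585988595191648256\n", "", "58598854033G217472", " 585988589164572673 "]
tweet_ids = parse_ids(exported)
for batch in batches(tweet_ids):
    print(len(batch), "IDs ready for lookup")
```

IDs are kept as strings so that 18-digit values are never rounded by tools that read them as floating-point numbers.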
  27. 27. The Five Pillars of Text Analytics: Search; Filtering; De-duplication and Clustering; Human Coding; Machine-Learning
  28. 28. Pillar #1: Search [screenshot: search results listing Arabic-language tweets with t.co links; the tweet text did not survive text extraction]
  29. 29. Pillar #1: Defined multi-term search. Search and Browse Archive [screenshot: Chinese-language search results mentioning MH17; the item text did not survive text extraction]
  30. 30. Pillar #1: Positive search term matches for airlines. Sample matches: "...his is the craziest airline experience, over an hr late to leave, 1 flight attendant"; "days & still no my luggage. No customer service. HELP a premier customer out"; "...official, I have no airline status, 77th on the upgrade list. #united #ord"; "...official, I have no airline status, 77th on the upgrade list. #united #ord"; "Piss off drilling outside my house #united utilities doing my head in..."; "...ird strike problem. Plane landed safe @O'Hare. Know anyone on this flight I ca"; "#United Airlines took away Premier flyers most important benefit of seating in E"; "not to mention the delay #United time for #HotTubTimeMachine"; "@tomclevz23 what time is kick-off? is it only on MUTV? #united"; "...ht being held up by passenger who has to change clothes at Dulles won't check"; "...y layover flight 80 gates apart in the B terminal. OMG, how big is DIA?"; "...y tore 2 zippers on luggage & put duct tape on to keep it close. Poor service. To"
  31. 31. Pillar #1: Negative airline search term matches. Sample matches: "...last 15 min): from #United States"; "...rika #Cup & Chelsea-Manchester #United http://t.co/3LXyYSuw"; "of their game with Manchester #United."; "Manchester #United defender Rio Ferdinand does not want to be"; "Manchester #United legend Sir Bobby Charlton would love to see"; "...rsenal are rivaling Manchester #United for Benfica midfielder Jav"; "Manchester #United boss Sir Alex Ferguson has confirmed Tomasz"; "Need High PR website #Backlinks ? Incredible #Marketing 1hr #Vid"; "...last 15 min): from #United States"
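The positive and negative match slides suggest a simple relevance rule: keep an item when it contains a positive term and none of the negative terms. A minimal sketch of that idea follows. The term lists are hypothetical (the slides do not show the actual search definitions), and the sample texts are snippets from the two slides.

```python
import re


def tokens(text):
    """Lowercased word tokens, keeping # and @ prefixes."""
    return set(re.findall(r"[#@]?\w+", text.lower()))


def is_relevant(text, positive, negative):
    """True when at least one positive term and no negative term occurs."""
    words = tokens(text)
    return bool(words & positive) and not (words & negative)


# Hypothetical term lists for the airline sense of "#united".
positive_terms = {"#united"}
negative_terms = {"manchester", "states"}

samples = [
    "#United Airlines took away Premier flyers most important benefit",
    "Manchester #United boss Sir Alex Ferguson has confirmed",
    "...last 15 min): from #United States",
]
for sample in samples:
    print(is_relevant(sample, positive_terms, negative_terms), sample)
```

Term lists like these never catch everything (the "#united utilities" snippet above would still pass), which is one reason the deck pairs search with human coding and machine classification.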
  32. 32. Pillar #2: Filters. [Screenshot: metadata for one tweet.] id: tag:search.twitter.com,2005:508322744757276672; link: http://twitter.com/Sahmhf_sn/statuses/508322744757276672; posted time: 9/6/2014 6:36:01 PM; source: Twitter for Android; statuses count: 3778; user twitter page: http://www.twitter.com/Sahmhf_sn. The other fields shown (real name, rule match, user bio summary, user mentions) contain Arabic text that did not survive text extraction.
  33. 33. Search and Browse Dataset: Lyon All Items No Duplicates French >90 31204 Uncoded (Dataset), 31,204 items total. Advanced filters: Search query; Defined search (use predefined search); Filter by date; Filter by meta (e.g., created_at, Date); Filter by annotations (Has File Annotations); Filter by classification (News or Current Events, Equals); Set Classification Bounds. Selected filters: None. Sample items: "Dimanche à 16h, Rachmaninoff 1 à l'Auditorium de Lyon, avec l'Orchestre national de Russie et Lawren..."; "@audrey_noursi24 Paris a aussi perdu des points bêtement tout comme Lyon, no regrets"; "RT @SNESFSU: RT @leSNUtwitte: #CSE texte sur les classes préparatoires. Le snuipp et le Snes toujour..."; "@lequipedusoir Super match de martial, encore un joueur formé à Lyon."; "Sa me manque mes delires avec elle a Lyon .."
  34. 34. Pillar #2: Filters [screenshot: Top Meta Discovery & Advanced filters, listing the top values for a metadata field; the values did not survive text extraction]
  35. 35. Pillar #3: Deduplication & clustering (ActiveClustering, powered by Sifter). Exact Duplicates view: 4966 total groups and 52085 single items. Group 1 (1242 items): RT @PSG_inside: 100e but d'@Ibra_official sous les couleurs du @PSG_inside ! ! #FierDeZlatan #PSG. Group 2 (1092): RT @Histoire_du_PSG: Joyeux anniversaire Blaise MATUIDI ! Notre milieu aux 3 poumons fête aujou... Group 3 (884): RT @PSG_inside: 90+3' Fin du match le @psg_inside s'impose 4 buts à 1 face à Saint-Etienne !!! #P... Group 4 (625): RT @FootNews_Fr: Le PSG en finale ! PSG 4-1 ASSE #CDF La belle saison du PSG Finaliste CDF Final... Group 5 (578): RT @BixeLizarazu: Zlatan a zlatané, Pastore a pastoré, @PSG_inside a survolé #PSGASSE. Group 6 (538): RT @UberFootball: Zlatan Ibrahimovic's record at PSG. Games: 124 Goals: 102 Assists: 41 #DareToZ... Group 7 (532): RT @Histoire_du_PSG: 1/4 de LDC Leader de L1 Finaliste des 2 coupes Vainqueur des 2 classiques... Group 8 (516): RT @ActuFoot_: OFFICIEL ! Zlatan Ibrahimovic est suspendu 4 matches suite à ses propos tenus ap... Group 9 (516): RT @GeniusFootball: Zlatan Ibrahimovic's record at PSG. Games: 124 Goals: 102 Assists: 41 #DareT... Group 10 (448): RT @Juezcentral: Llegó hace 3 temporadas, y está a 7 goles de Pauleta (109), máximo goleador de...
  36. 36. Pillar #3: Deduplication & clustering (ActiveClustering, powered by Sifter). Near Duplicate Clusters view: 7532 clusters and 28111 single items. Cluster 1 (552 items): Pique: PSG Lawan yang Berbahaya: Gerard Pique menilai PSG akan jadi lawan yang sangat... Cluster 2 (540): RT @MCPSG: @Donital RT My official BDay Bash @ Konnect LDN formally Pacha 24th april... Cluster 3 (431): Pique: PSG Lawan yang Berbahaya http://t.co/g1EsSk308h. Cluster 4 (380): Con gol de Lavezzi y hat-trick de Zlatan, PSG goleó a St. Etienne http://t.co/mB4rSoEXPf. Cluster 5 (353): PSG Pantau Perkembangan 'Ronaldo Lazio': Felipe Anderson sedang mengkilap di Lazio... Cluster 6 (327): Ibrahimovic trigol, PSG jejaki partai puncak Piala Prancis: Penyerang Paris Saint-Germain, Zlatan Ibrahimovi... Cluster 7 (324): PSG Pantau Perkembangan 'Ronaldo Lazio'. Cluster 8 (271): Lolos ke Final Coupe de France, Asa PSG Raih Empat Gelar Makin Terbuka: Niat Paris Saint-Germain untuk m... Cluster 9 (246): Enrique Anggap duel PSG vs Barca Paling Menarik http://t.co/7oW9kQAg7B. Cluster 10 (192): #SegueSigoDeVolta Ibrahimovic é suspenso por quatro jogos por ofender a França: O atacante Ibrahimovic...
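The two clustering slides distinguish exact duplicates from near duplicates. The slides do not describe how the clustering is computed, so the sketch below uses a stand-in technique: Jaccard similarity over word sets with URLs removed, and a greedy single pass that assigns each text to the first sufficiently similar cluster. The 0.7 threshold and the sample texts are assumptions for illustration.

```python
import re

URL = re.compile(r"https?://\S+")


def word_set(text):
    """Lowercased word set with URLs removed, so differing short links still match."""
    return frozenset(re.findall(r"\w+", URL.sub(" ", text.lower())))


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster(texts, threshold=0.7):
    """Greedy clustering: join the first cluster whose representative is similar enough."""
    clusters = []  # list of (representative word set, member texts)
    for text in texts:
        words = word_set(text)
        for representative, members in clusters:
            if jaccard(words, representative) >= threshold:
                members.append(text)
                break
        else:
            clusters.append((words, [text]))
    return [members for _, members in clusters]


# Hypothetical tweets: two differ only by link, one adds a word, one is unrelated.
tweets = [
    "RT @user: Zlatan scores again for PSG http://t.co/abc",
    "RT @user: Zlatan scores again for PSG http://t.co/xyz",
    "RT @user: Zlatan scores again for PSG tonight",
    "Lyon wins at home",
]
for members in cluster(tweets):
    print(len(members), members[0])
```

Coding one representative per cluster instead of every member is what makes this step a time saver before human coding begins.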
  37. 37. Pillar #4: Human coding (a. k.a. labeling or taciciinci or annotation) : ¢c:1C'_'J: ::>: Z jcgj 'l , u
  38. 38. Pillar #4: Human coding options: random samples. Dataset name: Lyon 10738 >95. Sample dataset? Yes. Sample size: 200 (1.9% of 10,738 from Lyon 10738 >95 (bucket)). Select code set: Create new; Select existing; Grounded. Advanced options: turn on a verification step for cod…; allow user-defined codes; allow coders to select more than one code; use hierarchical code set. Coding type: standard coding; assign users a range of items to code; automatically load the next uncoded item for users to code.
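The sample settings on this slide (200 items drawn from a bucket of 10,738, shown as 1.9%) can be reproduced with a simple random sample without replacement. This is a sketch with a hypothetical `draw_sample` helper and stand-in item IDs; how DiscoverText itself draws samples is not described on the slide.

```python
import random


def draw_sample(item_ids, sample_size, seed=None):
    """Simple random sample without replacement; a seed makes it repeatable."""
    rng = random.Random(seed)
    return rng.sample(item_ids, sample_size)


bucket = list(range(10738))  # stand-in IDs for the 10,738 items in the bucket
sample = draw_sample(bucket, 200, seed=42)
share = 100 * len(sample) / len(bucket)
print(len(sample), "items,", round(share, 1), "percent of the bucket")
```

Fixing the seed lets several coders, or a later replication, work from the identical sample.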
  39. 39. Pillar #4: Human coding should be fast and accurate. Coder Stats: Total Coding Time: 02:04:30; Avg. Coding Time: 6s. Per coder (units coded; avg. coding time; total coding time): …iao Feng: 200 (100.00%); 6.52s; 00:21:44. Cheng Feng: 200 (100.00%); 4.65s; 00:15:30. Amy Wu: 200 (100.00%); 10.32s; 00:34:24. Yunkang Yang: 200 (100.00%); 4.30s; 00:14:19. Wei Si: 200 (100.00%); 4.30s; 00:14:19. Fei Liu: 200 (100.00%); 7.27s; 00:24:14.
  40. 40. Pillar #4: Coding off a list is a project … [Screenshot: Search and Browse Dataset, "Lyon All Items No Duplicates French >90", with the advanced filter PSG applied; the highlighted term is missing from the captured snippets and is shown here as [PSG].] Sample items: "...hampionnat, avec un [PSG] qui a bcp de rencontre et lyon devra tenir le rythme"; "...anique a l'OM et au [PSG] Voir une équipe de Gones championne de France, ca les fait flipper !"; "...n classement rêvé [PSG] Monaco Lyon OM"; "Foot: le [PSG] et l'OM boycottent Canal+ jusqu'au 30 mai « La République des Pyrénées..."; "Le [PSG] et L'OM qui boycottent Canal Plus.. j'ai plus de reproches a faire a la Ligue qui fait n'impo..."; "c'est l'#OM et le [PSG] qui vous font vivre. L'audience c'est pas lyon bordeaux nice lille ou evian"; "...n repasse devant le [PSG] mais Marseill..."; "...Lavezzi + le bus du [PSG] caillassé, les incidents contre Lyon il y a rien eu donc"; "...de quoi que le [PSG] prend rien? A l'époque c'était Lyon avec aulas qui tronait la ca marche plus..."
  41. 41. Pillar #4: Human coding (adjudication) [screenshot: adjudication view for a Chinese-language post about MH17, with document metadata (language: Chinese - Simplified) and the note that 20.00% of users coded this item as Not MH-17; the item text did not survive text extraction]
  42. 42. Pillar #5: Machine-learning. Search and Browse Dataset [screenshot: Classification Boundary Filter with a histogram of classification scores; the remaining text did not survive text extraction]
  43. 43. Pillar #5: Machine-learning. Search and Browse Dataset: Ukraine-Tencent All Data, showing 1 to 100 of 28,074 total [screenshot: Chinese-language items mentioning MH17; the item text did not survive text extraction]
  44. 44. DiscoverText: Our ActiveLearning engine and coding tools combine what humans do … with what … Keep humans … accurate results and better insights
  45. 45. Word sense disambiguation (relevance)
  46. 46. Word sense disambiguation (relevance)
  47. 47. Word sense disambiguation (relevance)
  48. 48. AVON: the company for women [image text, partly legible: "Town Hall & Par…"]
  49. 49. Human coding can be converted into machine classifiers. [Screenshot: ActiveLearning classification report for "001 - Amazon Product Sentiment", 161/161 (100%), comparing classification percentages with coding percentages for POSITIVE, NEUTRAL, and NEGATIVE.] Accumulated human coding becomes training data …
  50. 50. Users can drill into interactive reporting. [Chart: classification percentage (POSITIVE, NEUTRAL, NEGATIVE) by metadata value, with item counts: 5 (101), 4 (24), 3 (11), 2 (9), 1 (16).] Use metadata to examine sub-sets of responses
  51. 51. Ultimately all text analytics are filtering techniques. Advanced filters: Search query; Defined search (use pre-defined search); Filter by date; Filter by meta (e.g., country code, Contains); Filter by annotations (Has File Annotations); Filter by coding (all coded items); Filter by classification (Equals); Set Classification Bounds. Selected filters: followers_count (num) Greater Than 100000; tobacco Greater Than …; (items not coded). Slicing big piles of text into smaller, more focused …
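The selected filters on this slide (a follower-count threshold, a classification score threshold for a "tobacco" classifier, and items not yet coded) can be read as predicates combined with a logical AND. A minimal sketch of that reading follows. The field names, the score threshold of 90, and the items are all hypothetical.

```python
def followers_over(minimum):
    """Predicate: the author has more than `minimum` followers."""
    return lambda item: item.get("followers_count", 0) > minimum


def score_over(label, minimum):
    """Predicate: the classifier score for `label` exceeds `minimum`."""
    return lambda item: item.get("scores", {}).get(label, 0) > minimum


def not_coded(item):
    """Predicate: no human code has been applied yet."""
    return not item.get("codes")


def apply_filters(items, filters):
    """Keep the items that satisfy every filter."""
    return [item for item in items if all(f(item) for f in filters)]


# Hypothetical items with assumed field names.
items = [
    {"id": 1, "followers_count": 250000, "scores": {"tobacco": 97}, "codes": []},
    {"id": 2, "followers_count": 1200, "scores": {"tobacco": 99}, "codes": []},
    {"id": 3, "followers_count": 500000, "scores": {"tobacco": 40}, "codes": []},
    {"id": 4, "followers_count": 300000, "scores": {"tobacco": 95}, "codes": ["relevant"]},
]
selected = apply_filters(
    items, [followers_over(100000), score_over("tobacco", 90), not_coded]
)
print([item["id"] for item in selected])
```

Because each filter is an independent predicate, filters can be added or removed without touching the others, which mirrors the add-filter buttons in the screenshot.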
  52. 52. Distributed for synchronous & asynchronous collaboration. Crowdsourcing accelerates the insight generation process through machine-learning
  53. 53. Dataset: MH17 Test. Total Valid Answers: 1010/1209 (83.54%). Valid answers by coder: Coder 3510: 4/4 (100.00%); Coder 9817: 197/200 (98.50%); Coder 5611: 196/200 (98.00%); Coder 0043: 195/200 (97.50%); Coder 9783: 169/200 (84.50%); Coder 5931: 124/200 (62.00%); Coder 2279: 122/200 (61.00%); Coder 6549: 3/5 (60.00%). Validations by code: MH-17: 672/681 (98.68%); Not MH-17: 338/528 (64.02%). Dataset: MH17 Test 2. Total Valid Answers: 734/800 (91.75%). Valid answers by coder: Coder 5611: 197/200 (98.50%); Coder 9783: 184/200 (92.00%); Coder 0043: 179/200 (89.50%); Coder 5981: 174/200 (87.00%). Validations by code: MH-17: 564/567 (99.47%); Not MH-17: 170/233 (72.96%). CoderRank (patent pending) for enhanced machine-learning is our key innovation
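The per-coder figures on this slide are validity rates: the share of a coder's answers that adjudicators marked valid. The sketch below ranks coders by that rate, using the four coder results that appear to belong to the MH17 Test 2 dataset. It is a plain sort on validity rate for illustration and makes no claim about how the patent-pending CoderRank method actually works.

```python
def validity_rate(valid, total):
    """Percentage of a coder's answers that were adjudicated as valid."""
    return 100.0 * valid / total


# (valid answers, total answers) per coder, as read from the slide.
results = {
    "5611": (197, 200),
    "9783": (184, 200),
    "0043": (179, 200),
    "5981": (174, 200),
}

ranking = sorted(results, key=lambda coder: validity_rate(*results[coder]), reverse=True)
total_valid = sum(valid for valid, _ in results.values())
total_answers = sum(total for _, total in results.values())
overall = validity_rate(total_valid, total_answers)

for coder in ranking:
    print(coder, round(validity_rate(*results[coder]), 2))
print("overall", round(overall, 2))
```

A ranking like this could, for example, be used to weight which coders' labels are trusted most as training data, though the slide does not say how the scores are applied.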
  54. 54. Getting from #mediumdata to #bigdata
  55. 55. Another beta: http://elasticsearch.texifter.com. TWEETS OVER TIME: query "CIA OR FBI" (69,685 hits), count per 10m. [Chart: y-axis from 0 to 500; x-axis dates from 09-08 to 10-16.]
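The chart on this slide plots tweet counts per 10-minute interval. A count like that can be produced by rounding each timestamp down to the start of its interval and tallying. The sketch below does this with the standard library; the timestamps are invented, and the bucketing assumes an interval length that divides 60 evenly.

```python
from collections import Counter
from datetime import datetime, timedelta


def bucket_start(timestamp, minutes=10):
    """Round a timestamp down to the start of its interval (minutes must divide 60)."""
    return timestamp - timedelta(
        minutes=timestamp.minute % minutes,
        seconds=timestamp.second,
        microseconds=timestamp.microsecond,
    )


def counts_per_bucket(timestamps, minutes=10):
    """Number of timestamps falling in each interval."""
    return Counter(bucket_start(t, minutes) for t in timestamps)


# Hypothetical tweet timestamps.
stamps = [
    datetime(2014, 9, 8, 0, 3, 10),
    datetime(2014, 9, 8, 0, 9, 59),
    datetime(2014, 9, 8, 0, 10, 0),
]
counts = counts_per_bucket(stamps)
for start in sorted(counts):
    print(start.isoformat(), counts[start])
```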
  56. 56. A Nascent Social Data Industry Association
  57. 57. DRAFT: CODE OF ETHICS & STANDARDS FOR SOCIAL DATA. Posted on November 14, 2014 by Jason Gowans under Big Boulder Initiative. [PDF version: FINAL DRAFT Code of Ethics for Social Data] PREAMBLE: Social media offers an unprecedented set of opportunities and responsibilities for individuals and organizations. For individuals, social media offers new routes to self-expression mixed with shifting, sometimes unfamiliar expectations regarding ownership and privacy of that data. For organizations, social data offers new ways to glean insight into customer and consumer attitudes, emotions and behaviors down to the individual level, and therefore also raises ethical dilemmas with regard to direct use.
  58. 58. Big Boulder, June 15th & 16th, 2015. CREATING THE FUTURE OF SOCIAL DATA. The annual Big Boulder conference features two jam-packed days of educational sessions, networking events, and outdoor activities all centered around social data. This year's conference will feature sessions from top social media sources, industry leaders, and premier consumers of publicly-available social data who will discuss trends, best practices, use cases, and the future of the industry.
  59. 59. DiscoverText. For more information: discovertext.com, @discovertext. Thank you for listening! Stuart Shulman, Texifter
