Analyzing User Patterns with the Command Line

With csvkit, you can do simple data analysis from the command line.

  1. The User Pattern Analysis with [illegible]
     cojette@gmail.com, 2015-08-27, http://bit.ly/1NiNdeF
  2. - Situation
     - What is User Pattern?
     - Case with Simple Command-line tools
     - Conclusion
  3. [Korean slide text; not recoverable from the transcript]
  4. - Who (user id, ...)
     - Why (purpose, target, ...)
     - When (timestamp)
     - Where (page_id, action_id, category)
     - What (login, buy, action, ...)
     - How (purchase method, device, reference, ...)
  5. [Korean slide text; not recoverable from the transcript]
  6. [Korean slide text; not recoverable from the transcript, apart from the term "flat file"]
  7. - Data Collection
     - Data Cleansing
     - Data Exploration
     - Data Modeling
     - Data Interpretation
  8. - Data Collection
     - Data Cleansing
     - Data Exploration
     > Data Wrangling
  9. - csv (https://ko.wikipedia.org/wiki/CSV_(파일_형식))
       - original: comma-separated values
       - usually: character-separated values (including tab and other delimiters)
       - basic text log file format
     - csvkit (https://github.com/onyxfish/csvkit)
       - suite of utilities for working with the CSV format
       - install: sudo apt-get install
 10. - Internet: curl

         $ curl -L -O https://github.com/cojette/UserPatternwithCommandLine/log_sample.xlsx

     - in2csv

         $ in2csv data/log_sample.xlsx > data/log_sample.csv

     - Database: sql2csv

         $ sql2csv --db 'sqlite:///data/iris.db' --query 'SELECT * FROM iris limit 5'
 11. $ head -n 5 tr_sample.csv | csvlook

       | id      | type | user_id | description | del_item_id | gem | bean | recip       | recdate                    |
       | 1883178 | 13   | 13485   |             | 200002      | 0   |      | 113.32..08  | 2015-07-01 00:00:01.000000 |
       | 1883180 | 8    | 2810    |             |             | 2   |      | 211.187..28 | 2015-07-01 00:00:01.000000 |
       | 1883181 | 8    | 12341   |             |             | 0   |      | 202.43..42  | 2015-07-01 00:00:01.000000 |
       | 1883182 | 8    | 8288    |             |             | 2   |      |             | 2015-07-01 00:00:04.000000 |

     Syntax error check:

       $ csvclean tr_sample.csv
       No errors.
 12. $ csvcut -c 1,3,5,6,9 tr_sample.csv | csvlook | head

       | id      | user_id | del_item_id | gem | recdate                    |
       | 1883178 | 13485   | 200002      | 0   | 2015-07-01 00:00:01.000000 |
       | 1883180 | 2810    |             | 2   | 2015-07-01 00:00:01.000000 |
       | 1883181 | 12341   |             | 0   | 2015-07-01 00:00:01.000000 |
       ...

     A second table on the slide (its command is not legible in the transcript):

       | id      | type | user_id | description | del_item_id | gem | recdate                    |
       | 1883185 | 13   | 7410    |             | 200003      |     | 2015-07-01 00:00:05.000000 |
       | 1883188 | 13   | 7410    |             | 200003      | 2   | 2015-07-01 00:00:08.000000 |
 13. - Simple statistics: csvstat
 14. - Exploration with SQL: csvsql

         $ csvsql --query "select substr(recdate, 15,2) as hour, count(1)
         > from tr_sample group by substr(recdate, 15,2)" tr_sample.csv | csvlook

         | hour | count(1) |
         | 00   | 108      |
         | 01   | 122      |
         | 02   | 118      |
         | 03   | 88       |
         | 04   | 53       |

     - Exploration with Python: csvpy

         $ csvpy tr_sample.csv
         Welcome! "tr_sample.csv" has been loaded in a CSVKitReader object named "reader".
         In [1]: reader.next()
         Out[1]: [u'1883178', u'13', u'13485', ...
 15. $ csvsql --query "select user_id from (select user_id, count(1) as cnt from tr_sample group by user_id)
     > where cnt >=2" tr_sample.csv > tr_user.csv

     $ head -n 5 tr_up.csv | csvlook

       | user_id | id      | type | user_id | description | del_item_id | gem | bean | recip      | recdate                    |
       | 231     | 1883188 | 8    | 231     | SNS...      |             |     |      | 180.70..18 | 2015-07-01 00:00:08.000000 |
       | 231     | 1883204 | 13   | 231     |             |             |     |      | 180.70..18 | 2015-07-01 00:00:18.000000 |
       | 231     | 1883235 | 8    | 231     |             |             |     |      | 180.70..18 | 2015-07-01 00:00:38.000000 |
       ...
 16. $ csvstat tr_up.csv

       1. user_id
          <type 'int'>
          Nulls: False
          Min: 231
          Max: 13482
          Sum: 4303878
          Mean: 8855.71804838

     $ csvsql --query "select substr(recdate, 15, 2) as hour, del_item_id, count(1) as cnt from tr_up
     > where length(del_item_id) > 2 group by substr(recdate, 15, 2), del_item_id" tr_up.csv | csvlook

       | hour | del_item_id | cnt |
       |      | 200001      |     |
       |      | 200002      |     |
       |      | 200003      |     |
       |      | 200004      |     |
 17. - Other command-line tools: BigML ...
     - Programming pipeline: modeling code (Python, R, ...)
     - Pandashells: shell pipeline with the statistical and visualization tools of the Python data stack (https://github.com/robdmc/pandashells)
 18. - Data analysis of user patterns: useful
     - csvkit: handy command-line tools for simple data analysis
     - Back to the basics: important and easy
       - A handful of options covers 80% of use cases
       - Useful for ad-hoc analysis
       - Speeds up ordinary pipelines and code
 19. - Author: Jeroen Janssens
     - Published by O'Reilly in October 2014
     - Overview of all tools on http://datascienceatthecommandline.com
     - csv*** --help

     Thank you for listening.
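
The two csvsql queries on slides 14 and 15 (counts per time bucket, and users with at least two records) can be tried without csvkit. The following is a minimal sketch using only the Python standard library. The sample rows, the reduced column set, and the function names are assumptions made for illustration; they are not the author's data or code. Note that SQLite's substr() is 1-indexed, so for a timestamp laid out as `YYYY-MM-DD HH:MM:SS.ffffff` the hour starts at character 12, while offset 15 (as transcribed on the slides) selects the minutes field. The sketch uses 12.

```python
import csv
import io
import sqlite3

# Hypothetical sample in the shape of tr_sample.csv (only the columns used here).
SAMPLE_CSV = """id,user_id,recdate
1,231,2015-07-01 00:00:01.000000
2,231,2015-07-01 00:59:59.000000
3,2810,2015-07-01 01:15:00.000000
4,231,2015-07-01 01:20:00.000000
5,12341,2015-07-01 02:00:00.000000
"""


def load_rows(text):
    """Parse CSV text into (id, user_id, recdate) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(int(r["id"]), int(r["user_id"]), r["recdate"]) for r in reader]


def build_db(rows):
    """Load the rows into an in-memory SQLite table named tr_sample."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tr_sample (id INTEGER, user_id INTEGER, recdate TEXT)"
    )
    conn.executemany("INSERT INTO tr_sample VALUES (?, ?, ?)", rows)
    return conn


def hourly_counts(conn):
    """Count records per hour; characters 12-13 of recdate hold the hour."""
    query = """
        SELECT substr(recdate, 12, 2) AS hour, count(1) AS cnt
        FROM tr_sample
        GROUP BY substr(recdate, 12, 2)
        ORDER BY hour
    """
    return conn.execute(query).fetchall()


def repeat_users(conn):
    """Return user_ids that appear in at least two records."""
    query = """
        SELECT user_id
        FROM (SELECT user_id, count(1) AS cnt FROM tr_sample GROUP BY user_id)
        WHERE cnt >= 2
        ORDER BY user_id
    """
    return [row[0] for row in conn.execute(query)]


conn = build_db(load_rows(SAMPLE_CSV))
print(hourly_counts(conn))  # one (hour, count) pair per hour present in the data
print(repeat_users(conn))   # user_ids with two or more records
```

An in-memory SQLite table is used here so the SQL stays close to the slide text.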
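
The column selection on slide 12 (csvcut) and the summary on slide 16 (csvstat) can be approximated in a few lines as well. This is a rough sketch under stated assumptions: the sample data and the helper names `cut_columns` and `summarize` are hypothetical, and the output covers only the Min, Max, Sum and Mean fields shown on the slide, not everything csvstat reports.

```python
import csv
import io
import statistics

# Hypothetical sample; the column names follow the slide, the values are made up.
SAMPLE_CSV = """id,type,user_id,gem
1,8,231,0
2,13,231,2
3,8,2810,2
4,8,13482,0
"""


def cut_columns(text, names):
    """Keep only the named columns, in the given order."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: row[name] for name in names} for row in reader]


def summarize(rows, name):
    """Compute min, max, sum and mean of one integer column."""
    values = [int(row[name]) for row in rows]
    return {
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
        "mean": statistics.mean(values),
    }


rows = cut_columns(SAMPLE_CSV, ["id", "user_id"])
print(rows[0])
print(summarize(rows, "user_id"))
```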