
The Bash Dashboard (Or: How to Use Bash for Data Analysis)

Tutorial on how to use basic Bash concepts and commands to analyze CSV files. It uses a real-life data set and structures the content along concrete analysis questions. Feel free to contact me with questions or suggestions!

Published in: Data & Analytics

  1. 1. The Bash Dashboard (or: How to Use Bash for Data Analytics). Bram Adams, Polytechnique Montreal, MCIS
  2. 2. Yes, this kind of stuff :-)
  3. 3. Last time I checked, every PC on earth had Excel installed, so what gives? (quote by random grad student)
  4. 4. One word: automation!
  5. 5. Let me rephrase: Why Bash if one has Python or R? (fictitious quote)
  6. 6. To better understand and prepare your data before deeper analysis!
  7. 7. Basic Constructs
      echo "Bram" > file.txt        # > replaces the file content
      echo "Michel" >> file.txt     # >> appends to the file content
      echo "Giovanni" >> file.txt
      cat file.txt | head -n 2      # pipe: send the output of the first command to the input of the second
      Output:
      Bram
      Michel
  12. 12. Example data 1: http://www.cs.wm.edu/semeru/data/tse-android/files/apps.csv
  14. 14. apps.csv
      package_app,name,category,version,rating_average,votes,star1,star2,star3,star4,star5
      a8.kv.chilly,a8 chili slot machine lite,CARDS,1.2,3.7,42,10,2,4,2,24
      [censored apps]
      accessline.spy_camera,Hidden camera free version,MEDIA_AND_VIDEO,1.41,2.4,67,34,7,5,5,16
      acciones.chile,Acciones Chile,FINANCE,1.0,4.2,24,1,1,0,11,11
      acgs.topanime.evawp.photos,Evangelion HD Live Wallpaper,SPORTS,1.1,3.7,70,16,2,7,10,35
      Adam.androiddev,Anti Dog Repellent / Whistle,TOOLS,2.4,3.1,748,288,30,51,77,302
      […]
      A typical CSV file has a comma-separated list of attribute names on line 1, followed by one line per observation, each of which has a value for each attribute.
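The "one value per attribute" property can be checked before any analysis. The sketch below is not from the slides: it uses awk (which the deck does not cover) on a made-up sample in the apps.csv layout, and flags lines whose field count differs from the header. Such lines typically contain a comma inside a value, which would also shift the columns that cut sees.

```shell
# Made-up sample in the apps.csv layout; line 4 has a comma inside the app name.
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
package_app,name,category,version,rating_average,votes,star1,star2,star3,star4,star5
pkg.alpha,Alpha,TOOLS,1.0,4.0,10,1,1,1,2,5
pkg.beta,Beta,FINANCE,2.1,3.5,8,1,1,2,2,2
pkg.gamma,Gamma, the Game,CARDS,1.0,4.5,4,0,0,0,2,2
EOF

# Number of attributes named in the header (line 1).
expected=$(head -n 1 "$workdir/sample.csv" | awk -F, '{ print NF }')

# Line numbers of observations whose field count differs from the header.
bad_lines=$(awk -F, -v n="$expected" 'NR > 1 && NF != n { print NR }' "$workdir/sample.csv")

echo "header fields: $expected"
echo "suspicious lines: $bad_lines"
```

If this prints any suspicious lines, a plain comma split is not safe for that file and the columns should be inspected by hand first.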
  17. 17. Example data 2: http://www.cs.wm.edu/semeru/data/MSR14-android-reuse/files/apps_labels.csv
  19. 19. apps_labels.csv
      App package,Category,Type
      air.com.huale.Basketball,ARCADE,Obfuscated
      air.com.smch.climatekiten,BOOKS_AND_REFERENCE,Obfuscated
      air.comicc.app9019,BOOKS_AND_REFERENCE,Obfuscated
      ait.podka,MEDIA_AND_VIDEO,Obfuscated
      ak.alizandro.smartaudiobookplayer,MUSIC_AND_AUDIO,Obfuscated
      amor.developer.android,LIFESTYLE,Obfuscated
      […]
  20. 20. What Kind of Data does apps.csv Contain?
      head -n 1 apps.csv    # show the first line
  23. 23. Oh, does the File Contain the birthdayChocolate package?
      grep -e "birthdayChocolate" apps.csv    # search for a string (-e takes a pattern; add -F to force a literal match)
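A small sketch of the same search used as a yes/no check. The sample rows and the package name com.example.birthdayChocolate are invented for illustration; -F, -c and -q are standard grep options.

```shell
# Invented three-column sample; only the first data row contains the search string.
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
package_app,name,category
com.example.birthdayChocolate,Birthday Chocolate,LIFESTYLE
com.example.notes,Notes,TOOLS
EOF

# -F treats the pattern as a fixed string rather than a regular expression;
# -c prints the number of matching lines instead of the lines themselves.
matches=$(grep -c -F -e "birthdayChocolate" "$workdir/sample.csv")
echo "matching lines: $matches"

# grep exits with status 0 when at least one line matches, so it works in conditions.
if grep -q -F -e "birthdayChocolate" "$workdir/sample.csv"; then
    found=yes
else
    found=no
fi
echo "found: $found"
```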
  26. 26. How Many Apps are There?
      wc -l apps.csv    # number of lines in the file
  29. 29. Wait a Minute, What about the First Line?
      tail -n +2 apps.csv | wc -l    # all the lines of the file starting with line 2 (i.e., removing line 1)
  32. 32. … and what about Apps with >1 Version?
      tail -n +2 apps.csv |
        cut -f 2 -d , |       # only keep the second column of the comma-delimited file
        sort -u |             # sort alphabetically and remove duplicate lines
        wc -l
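To see the effect of skipping the header and of removing duplicates, here is a sketch on an invented four-column sample in which one app appears with two versions. The counts differ in exactly the way the two questions above suggest.

```shell
# Invented sample: Alpha appears twice (two versions), Beta once.
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
package_app,name,category,version
pkg.alpha,Alpha,TOOLS,1.0
pkg.alpha,Alpha,TOOLS,1.1
pkg.beta,Beta,FINANCE,2.0
EOF

# Reading from standard input keeps the file name out of wc's output.
lines=$(wc -l < "$workdir/sample.csv")
# Observations only: everything from line 2 onward.
rows=$(tail -n +2 "$workdir/sample.csv" | wc -l)
# Distinct app names: second column, duplicates removed.
unique_names=$(tail -n +2 "$workdir/sample.csv" | cut -d , -f 2 | sort -u | wc -l)

# Some wc implementations pad the number with spaces; arithmetic expansion strips them.
lines=$((lines)); rows=$((rows)); unique_names=$((unique_names))
echo "$lines lines, $rows observations, $unique_names distinct app names"
```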
  36. 36. What is the Maximum #Versions of an App?
      tail -n +2 apps.csv |
        cut -f 2 -d , |
        sort |                # sort, but keep all the lines
        uniq -c |             # count #occurrences of each unique line, i.e., group per line and give #occurrences of each group
        sort -n               # sort numerically
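Because sort -n puts the largest count last, appending tail -n 1 isolates the maximum. The sketch below does this on an invented sample; the awk and sed steps that split the count from the name are additions, not part of the slides.

```shell
# Invented sample: Alpha has 3 versions, Gamma 2, Beta 1.
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
package_app,name,category,version
pkg.alpha,Alpha,TOOLS,1.0
pkg.alpha,Alpha,TOOLS,1.1
pkg.alpha,Alpha,TOOLS,1.2
pkg.beta,Beta,FINANCE,2.0
pkg.gamma,Gamma,CARDS,1.0
pkg.gamma,Gamma,CARDS,1.1
EOF

# Same pipeline as on the slide, plus tail -n 1 to keep only the last (largest) line.
top=$(tail -n +2 "$workdir/sample.csv" | cut -d , -f 2 | sort | uniq -c | sort -n | tail -n 1)

# uniq -c prints "<count> <line>" with leading padding; split it into its two parts.
top_count=$(printf '%s\n' "$top" | awk '{ print $1 }')
top_name=$(printf '%s\n' "$top" | sed 's/^ *[0-9]* //')
echo "$top_name has $top_count versions"
```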
  41. 41. Which App Category Contains Most of the Apps?
      tail -n +2 apps.csv |
        cut -f 2,3 -d , |     # only keep app name and category
        sort -u |             # keep one version per app name
        cut -f 2 -d , |       # throw away the app name
        sort | uniq -c |      # group and count per category
        sort -n               # sort categories by count
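The sort -u step in the middle matters: without it, every version of an app would be counted towards its category. A sketch on invented data, where TOOLS holds two apps but three rows:

```shell
# Invented sample: Alpha (2 versions) and Gamma are TOOLS, Beta is FINANCE, Delta is CARDS.
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
package_app,name,category,version
pkg.alpha,Alpha,TOOLS,1.0
pkg.alpha,Alpha,TOOLS,1.1
pkg.beta,Beta,FINANCE,2.0
pkg.gamma,Gamma,TOOLS,1.0
pkg.delta,Delta,CARDS,1.0
EOF

# Count apps (not rows) per category, smallest count first.
counts=$(tail -n +2 "$workdir/sample.csv" \
    | cut -d , -f 2,3 \
    | sort -u \
    | cut -d , -f 2 \
    | sort | uniq -c | sort -n)
echo "$counts"

# The last line holds the largest category; its second field is the category name.
largest=$(printf '%s\n' "$counts" | tail -n 1 | awk '{ print $2 }')
tools_count=$(printf '%s\n' "$counts" | awk '$2 == "TOOLS" { print $1 }')
echo "largest category: $largest ($tools_count apps)"
```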
  48. 48. Let’s Take a Look at the Obfuscation Data
      less apps_labels.csv    # buffer the file to scroll up and down (vs. more)
  51. 51. What a Mess?!
      tr '\r' '\n' < apps_labels.csv > apps_obfus.csv    # fix end-of-line issues by replacing each \r (carriage return) character with \n (newline)
      This suits files whose lines end in a bare \r; for Windows-style \r\n endings, tr -d '\r' avoids introducing empty lines.
      More on line endings: http://www.cyberciti.biz/faq/howto-unix-linux-convert-dos-newlines-cr-lf-unix-text-format/
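A sketch of why the file looks like a mess and what tr changes. It assumes a file whose lines end in a bare carriage return, which is a guess about apps_labels.csv based on the slide; the rows are invented.

```shell
workdir=$(mktemp -d)

# Write three records that end in a bare \r, with no \n anywhere in the file.
printf 'App package,Category,Type\rpkg.alpha,TOOLS,Obfuscated\rpkg.beta,FINANCE,Obfuscated\r' > "$workdir/cr.csv"

# wc -l counts \n characters, so line-based tools see this file as having no complete line.
before=$(wc -l < "$workdir/cr.csv")
tr '\r' '\n' < "$workdir/cr.csv" > "$workdir/fixed.csv"
after=$(wc -l < "$workdir/fixed.csv")

# For \r\n endings, deleting \r keeps one line per record instead of adding empty lines.
printf 'a,b\r\nc,d\r\n' | tr -d '\r' > "$workdir/crlf_fixed.csv"
crlf_lines=$(wc -l < "$workdir/crlf_fixed.csv")

before=$((before)); after=$((after)); crlf_lines=$((crlf_lines))
echo "bare CR file: $before lines before, $after after; CRLF file: $crlf_lines lines"
```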
  54. 54. How to Merge the App Data with Obfuscation Results? (1)
      TMP=$(head -n 1 apps.csv)                               # store the result of a command in a variable
      echo "${TMP},obfuscated" > apps_join.csv                # store the column names first
      tail -n +2 apps.csv | sort > sorted_apps.csv            # merging requires sorted files
      tail -n +2 apps_obfus.csv | sort > sorted_apps_obfus.csv
  59. 59. How to Merge the App Data with Obfuscation Results? (2)
      join -t , -1 1 -2 1 sorted_apps.csv sorted_apps_obfus.csv | cut -f -11,13 -d , >> apps_join.csv
        -t ,         the files are comma-separated
        -1 1 -2 1    lines with the same value in the first column of file 1 and of file 2 are merged
        cut          join prints the join column only once, but keeps the remaining columns of file 2; here we only want the last column of file 2, so we remove the 12th column (keeping only the first 11 columns and the 13th)
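The two merge steps can be tried end to end on a few invented rows. Two choices in this sketch go beyond the slides and are assumptions: LC_ALL=C is exported so that sort and join agree on the ordering, and the files are sorted on the join column only (sort -t , -k 1,1) rather than on the whole line. Note that join keeps only packages present in both files.

```shell
# Use one fixed collation so that sort and join agree on the order of keys.
export LC_ALL=C
workdir=$(mktemp -d)

# Invented rows in the apps.csv layout (11 columns).
cat > "$workdir/apps.csv" <<'EOF'
package_app,name,category,version,rating_average,votes,star1,star2,star3,star4,star5
pkg.beta,Beta,FINANCE,2.0,3.5,8,1,1,2,2,2
pkg.alpha,Alpha,TOOLS,1.0,4.0,10,1,1,1,2,5
pkg.gamma,Gamma,CARDS,1.0,4.5,4,0,0,0,2,2
EOF

# Invented rows in the apps_labels.csv layout (3 columns); pkg.beta has no label.
cat > "$workdir/apps_obfus.csv" <<'EOF'
App package,Category,Type
pkg.gamma,CARDS,Obfuscated
pkg.alpha,TOOLS,Obfuscated
EOF

# Step 1: write the new header, then sort both bodies on the join column.
header=$(head -n 1 "$workdir/apps.csv")
echo "${header},obfuscated" > "$workdir/apps_join.csv"
tail -n +2 "$workdir/apps.csv" | sort -t , -k 1,1 > "$workdir/sorted_apps.csv"
tail -n +2 "$workdir/apps_obfus.csv" | sort -t , -k 1,1 > "$workdir/sorted_apps_obfus.csv"

# Step 2: join on column 1 of both files, then drop column 12 (the duplicate category).
join -t , -1 1 -2 1 "$workdir/sorted_apps.csv" "$workdir/sorted_apps_obfus.csv" \
    | cut -d , -f -11,13 >> "$workdir/apps_join.csv"

cat "$workdir/apps_join.csv"
```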
  64. 64. Which Category has Most of the Obfuscated Code?
      tail -n +2 apps_join.csv |
        grep -e ",Obfuscated" |    # only consider lines that are obfuscated
        cut -f 2,3 -d , |
        sort -u |
        cut -f 2 -d , |
        sort | uniq -c |
        sort -n
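The grep pattern matches ",Obfuscated" anywhere in a line. A slightly stricter variant, sketched below, anchors it to the end of the line with $ so that only the last column is tested. The reduced five-column layout and the label "Plain" are invented for this example; the slides only show the value "Obfuscated".

```shell
# Invented sample in a reduced layout: name is column 2, category column 3, label last.
workdir=$(mktemp -d)
cat > "$workdir/apps_join.csv" <<'EOF'
package_app,name,category,version,obfuscated
pkg.alpha,Alpha,TOOLS,1.0,Obfuscated
pkg.alpha,Alpha,TOOLS,1.1,Obfuscated
pkg.beta,Beta,FINANCE,2.0,Plain
pkg.gamma,Gamma,TOOLS,1.0,Obfuscated
pkg.delta,Delta,CARDS,1.0,Obfuscated
EOF

# Keep rows whose last column is Obfuscated, then count distinct apps per category.
counts=$(tail -n +2 "$workdir/apps_join.csv" \
    | grep -e ',Obfuscated$' \
    | cut -d , -f 2,3 \
    | sort -u \
    | cut -d , -f 2 \
    | sort | uniq -c | sort -n)
echo "$counts"

# The last line is the category with the most obfuscated apps.
top_category=$(printf '%s\n' "$counts" | tail -n 1 | awk '{ print $2 }')
echo "most obfuscated apps: $top_category"
```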
  67. 67. Bonus: How to Create a Comma-Separated List from a List of Words?
      cut -f 3 -d , apps.csv | sort -u | paste -d , -s -
        -       take input from the pipe
        -s      concatenate all lines
        -d ,    … and put commas between them
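One detail worth checking when trying this: the command reads apps.csv from line 1, so the header word "category" ends up in the list as well. The sketch below, on an invented two-column sample, shows both variants; skipping the header with tail -n +2 is an addition to the slide's command.

```shell
# Invented two-column sample; the category is column 2 here.
workdir=$(mktemp -d)
printf 'name,category\nAlpha,TOOLS\nBeta,FINANCE\nGamma,TOOLS\n' > "$workdir/sample.csv"

# As on the slide: the header line is processed like any other line.
with_header=$(cut -d , -f 2 "$workdir/sample.csv" | sort -u | paste -s -d , -)

# Skipping line 1 first keeps the column name out of the list.
categories=$(tail -n +2 "$workdir/sample.csv" | cut -d , -f 2 | sort -u | paste -s -d , -)

echo "with header:    $with_header"
echo "without header: $categories"
```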
  72. 72. If you’re Interested, Check Out these Books for More (and less ;-))
