A data scientist's daily life involves collecting and storing large amounts of data from various sources, preprocessing and analyzing the data using tools like Linux, SQL databases, Python and R, and applying machine learning algorithms like clustering, classification, and regression to derive insights. The data scientist must effectively manage terabytes of data and choose the appropriate machine learning techniques and algorithms to gain knowledge from big data in an efficient and intelligent manner. Visualization tools are then used to showcase the findings and insights discovered.
11. A WEB SERVICE RECEIVES MORE THAN 50G OF LOG DATA PER DAY
TOTAL SPACE USED IN THE LAST THREE MONTHS: 4,500G
TOTAL SPACE USED IN THE LAST YEAR: 18,000G (17.6T)
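The totals on this slide follow from simple multiplication. A minimal sketch, assuming 90 days for three months, a 360-day year (the value that reproduces the 18,000G figure), and 1T = 1024G; the names `DAILY_LOG_GB` and `total_gb` are illustrative only:

```python
DAILY_LOG_GB = 50  # from the slide: roughly 50G of log data per day


def total_gb(days, daily_gb=DAILY_LOG_GB):
    """Storage consumed after `days` days at a constant daily volume."""
    return days * daily_gb


three_months = total_gb(90)    # assumption: three months = 90 days
one_year = total_gb(360)       # assumption: one year = 360 days
one_year_tb = one_year / 1024  # assumption: 1T = 1024G

print(three_months, one_year, round(one_year_tb, 1))
```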
12. • Data Storage/Backup
• 2T per HDD
• How to save MORE than 2T of data?
• 0.3 USD per gigabyte
• Pay 900 USD just for KEEPING the data, without doing anything else.
• Read/Write Speed
• Read: 131.6 MB/s / Write: 131.4 MB/s
• Spend 393 s (over 6 min) reading just ONE day of data.
• A large number of transactions at once
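The storage and read-speed figures above can be checked with a rough back-of-the-envelope sketch. It assumes 1G = 1024 MB and 1T = 1024G, purely sequential reads, and the per-gigabyte price from the slide. The function names are hypothetical, and the result for one day of data comes out close to, not exactly at, the 393 s quoted above.

```python
import math


def read_seconds(size_gb, read_mb_per_s=131.6):
    """Seconds to read `size_gb` sequentially (assumes 1G = 1024 MB)."""
    return size_gb * 1024 / read_mb_per_s


def storage_cost_usd(size_gb, usd_per_gb=0.3):
    """Cost of simply keeping `size_gb` at a flat per-gigabyte price."""
    return size_gb * usd_per_gb


def disks_needed(size_gb, disk_gb=2 * 1024):
    """Number of 2T disks required to hold `size_gb` (no redundancy)."""
    return math.ceil(size_gb / disk_gb)


one_day_gb = 50  # daily log volume from the earlier slide
print(round(read_seconds(one_day_gb)), "seconds to read one day of logs")
print(disks_needed(4500), "disks for three months of logs")
print(storage_cost_usd(3000), "USD to keep 3000G")
```

The slide does not say which volume the 900 USD refers to; 3000G is simply the volume that costs 900 USD at 0.3 USD per gigabyte.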
21. PERFORMANCE OF HADOOP?
• Not good, but at least it runs.
• Counts 86,389,084 rows (one day of data) in 39 sec.
(64G RAM, E5 8-core * 2 per node, * 10 nodes)
• How about 39 sec * 30 days?
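The closing question can be answered with a quick extrapolation. This is only a sketch: it assumes the job scales linearly with the number of days, which the slide does not confirm, and the names are illustrative.

```python
ROWS_PER_DAY = 86_389_084  # from the slide
SECONDS_PER_DAY_OF_DATA = 39  # from the slide


def estimated_seconds(days, seconds_per_day=SECONDS_PER_DAY_OF_DATA):
    """Naive linear estimate of the time to count `days` days of rows."""
    return days * seconds_per_day


month_seconds = estimated_seconds(30)
rows_per_second = ROWS_PER_DAY / SECONDS_PER_DAY_OF_DATA

print(month_seconds, "seconds, i.e.", month_seconds / 60, "minutes")
print(round(rows_per_second), "rows per second across the cluster")
```

Under that assumption, a simple count over one month takes about 19.5 minutes.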
39. ETC…
• Excel
• Google Analytics
• Visualisation tools (Tableau)
• Web crawler
• Version control management (Git)
• ETL and job scheduling tools (Jenkins)
• …
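43. WHY DO WE NEED MACHINE LEARNING?
• Clustering
How many groups can these people be divided into?
• Classification
Which group does each person belong to?
• Regression
What is the probability that an event occurs, or that a person belongs to a given group?
• Dimensionality reduction
Reducing the number of dimensions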
44. CLUSTERING
http://simplystatistics.org/2014/02/18/k-means-clustering-in-a-gif/
source http://humble-developer.blogspot.tw/2011/01/kmeans-clustering-algorithm-part-1.html
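The linked pages illustrate k-means clustering. A minimal pure-Python sketch of the idea follows. The sample points are made up, and the deterministic farthest-point initialisation is an assumption chosen to keep the example reproducible (random initialisation is more common).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Basic k-means; returns (centroids, labels)."""
    # Initialisation (assumption of this sketch): start from the first
    # point, then repeatedly add the point farthest from chosen centroids.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels


# Made-up data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The two steps inside the loop, assignment and update, are what the animation at the first link shows frame by frame.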
45. CLASSIFICATION
http://letsmakerobots.com/content/tcs3200-color-sensor-with-k-nearest-neighbor-classification-algorithm
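The linked example classifies colour-sensor readings with k-nearest neighbours. A minimal sketch of that technique, using invented RGB readings and labels rather than data from the linked project:

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Return the majority label among the k training items nearest to query.

    `train` is a list of (features, label) pairs; distance is squared
    Euclidean distance over the feature tuple.
    """
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labelled RGB readings (not taken from the linked article).
train = [
    ((250, 10, 10), "red"), ((230, 30, 20), "red"), ((240, 20, 40), "red"),
    ((10, 240, 20), "green"), ((30, 250, 10), "green"), ((20, 230, 40), "green"),
]

print(knn_predict(train, (220, 40, 30)))  # the three nearest readings are all red
```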
51. MACHINE LEARNING ALGORITHMS
http://amueller.github.io/sklearn_tutorial/
52. STATISTICS VS ML
• STATISTICS
FOCUS ON UNDERSTANDING DATA IN TERMS OF MODELS
INTERPRETABILITY, HYPOTHESIS TESTING
• MACHINE LEARNING
FOCUS ON THE ANALYSIS OF LEARNING ALGORITHMS
GREATER FOCUS ON PREDICTION
53. SYSTEMATICS AND AUTOMATION
http://www.slideshare.net/CetasAnalytics/cetas-e-baymeetupprezofinal
62. • Codecademy http://www.codecademy.com/
Covers many programming languages, e.g. Python,
JavaScript, and even shell script and SQL.
• Coursera https://www.coursera.org/
A well-known MOOC website for self-learning.