
Hadoop Con2015 - The Data Scientist’s Toolbox

722 views

Published on

The Data Scientist’s Toolbox

Published in: Technology

  1. 1. LEN CHANG • MACHINE LEARNING & DATA MINING • DISTRIBUTED SYSTEMS & NOSQL • CRAWLER & CHINESE MINING • Communication Engineering, General Study - CCU • Software Engineering, Master Study - NCU • Pixnet Hackathon 2014 – EXIT MINING • Pixnet Hackathon 2015 – Spam User Detection • Taipei Open Data Hackathon 2015 – The relation between Religion and Taipei City • BI SYSTEM & DATA VISUALIZATION • FINANCE & EDUCATION & ART & SPORT • A PLAYER OF BLIZZARD GAMES
  2. 2. AGENDA • A GOOD STORY • TOOL 1: DATABASE • TOOL 2: COLLECTION AND REPLICATION • TOOL 3: VISUALIZATION • TOOL 4: MACHINE LEARNING • SAMPLE • SUMMARY
  3. 3. A GOOD STORY DIGITAL CUSTOMER EXPERIENCE
  4. 4. How much are you willing to pay?
  5. 5. 45 NT / Latte vs. 95 NT / Latte: WHY?
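  6. 6. "If the home is the 'first good place' for human connection and the workplace is the 'second good place', then a public space such as a café (Starbucks, for example) is what I often call the 'third good place'. A café sits somewhere between home and the office: you can socialize there or be alone, connect with other people or reconnect with yourself. Starbucks was founded to give ordinary people this valuable opportunity." ~Howard Schultz • the loyalty card • pay in advance on mobile • wireless device charging Digital customer experience Chief Digital Officer: Adam Brotman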
  7. 7. Location Mobile pay loyalty card A Good Digital Customer Experience Social network BI System, Data warehousing…etc
  8. 8. A GOOD STORY TELLS US… • FIND YOUR “UNIQUE CUSTOMER DATA”. • USE “CUSTOMER DATA” TO IMPROVE THE “DIGITAL CUSTOMER EXPERIENCE”. • USE THE “DIGITAL CUSTOMER EXPERIENCE” TO HELP THE ORGANIZATION “MAKE MONEY”.
  9. 9. TOOL 1: DATABASE OLAP AND NOSQL
  10. 10. Location Mobile pay loyalty card A Good Digital Customer Experience Social network BI System, Data warehousing…etc
  11. 11. BI System, Data warehousing…etc Relational DB NOSQL How to choose?
  12. 12. THE PURPOSE IS IMPORTANT CDC ETL SQL 100 % accurate answer when I see the report
  13. 13. THE PURPOSE IS IMPORTANT Machine Learning Real-time feedback Real-time dashboard Less accurate, faster response when I need a rough answer
  14. 14. THE PURPOSE IS IMPORTANT Machine Learning Powerful at full-text search, weak at numeric computation.
  15. 15. THE PURPOSE IS IMPORTANT High frequency Real-time dashboard Accuracy and speed both matter; cost is secondary.
  16. 16. DATABASE • 100 % ACCURATE • RELATIONAL DATABASE • LESS ACCURATE, FASTER • HBASE, SPARK, CASSANDRA, MONGODB, OTHERS.. • SPECIAL CASES • FULL-TEXT SEARCH: ELASTICSEARCH • ACCURACY AND SPEED: REDIS OR ANOTHER IN-MEMORY DB.
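The selection rule from the database slides can be written down as a small lookup. This is only a sketch of the slide's summary: the `Need` labels and the `suggest_stores` function are hypothetical names introduced here, and the lists simply repeat the products named on the slide.

```python
from enum import Enum
from typing import Dict, List


class Need(Enum):
    """Hypothetical labels for the purposes described on the slides."""
    EXACT_REPORT = "100% accurate answer for a report"
    ROUGH_FAST = "less accurate, faster response"
    FULL_TEXT = "full-text search"
    EXACT_AND_FAST = "accuracy and speed, cost is secondary"


# Mapping copied from the summary slide; nothing here is a benchmark result.
STORES_FOR_NEED: Dict[Need, List[str]] = {
    Need.EXACT_REPORT: ["relational database"],
    Need.ROUGH_FAST: ["HBase", "Spark", "Cassandra", "MongoDB"],
    Need.FULL_TEXT: ["Elasticsearch"],
    Need.EXACT_AND_FAST: ["Redis", "other in-memory DB"],
}


def suggest_stores(need: Need) -> List[str]:
    """Return the store families the slide lists for the given purpose."""
    return STORES_FOR_NEED[need]
```

Calling `suggest_stores(Need.FULL_TEXT)` returns the single entry the slide gives for full-text search; real projects often combine several of these stores rather than picking one.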
  17. 17. COLLECTION AND REPLICATION: LOGSTASH, FLUENTD, AND REPLICATION TOOLS
  18. 18. Location Mobile pay loyalty card A Good Digital Customer Experience Social network BI System, Data warehousing…etc Collection: Any Data in, Any Data out
  19. 19. Location Mobile pay loyalty card Social network BI System, Data warehousing…etc Collection: Any Data in, Any Data out
  20. 20. FLUENTD: BUILD YOUR UNIFIED LOGGING LAYER
  21. 21. LOGSTASH: COLLECT, ENRICH & TRANSPORT DATA
  22. 22. COMPARISON FLUENTD • LANG: RUBY WITH C EXTENSIONS • PLATFORM: LINUX • MAJOR OUTPUT DB: MONGODB LOGSTASH • LANG: JRUBY (RUNS ON THE JVM) • PLATFORM: LINUX AND WINDOWS • MAJOR OUTPUT DB: ELASTICSEARCH • ELK ARCH.
  23. 23. Location Mobile pay loyalty card Social network BI System, Data warehouse…etc Replicate: replicate data from DB_A to DB_B RDB RDB Case 1 NOSQL RDB Case 3 Transaction DB NOSQL NOSQL Case 2 ETL: Extract-Transform-Load
  24. 24. RDB RDB Case 1
  25. 25. NOSQL NOSQL Case 2
  26. 26. NOSQL RDB Case 3 Node PostgresNode Node Node Node mongo
  27. 27. COMPARISON RDB TO RDB • TRADITIONAL MECHANISM • ENSURES “DATA CONSISTENCY” • FINANCIAL INDUSTRY NOSQL TO NOSQL • HUGE DATA ANALYSIS • LOW-COST HARDWARE, POWERFUL AND FAST COMPUTATION • NEEDS PROGRAMMING SKILL, NOT ONLY SQL NOSQL TO RDB • MAKE AN RDB A NODE OF THE NOSQL CLUSTER • MAYBE A BALANCE BETWEEN NOSQL AND RDB
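As a rough illustration of the NoSQL-to-RDB case, the sketch below flattens nested documents into rows of a relational table. It is not the tool used in the talk: plain Python dicts stand in for MongoDB documents, an in-memory SQLite database stands in for Postgres, and the `events` table, `COLUMN_MAP`, and field names are all hypothetical.

```python
import sqlite3
from typing import Any, Dict, Iterable

# Hypothetical mapping: dotted document path -> SQL column.
COLUMN_MAP = {
    "_id": "id",
    "user.name": "user_name",
    "event": "event",
}


def get_path(doc: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path through nested dicts; missing keys give None."""
    current: Any = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def replicate(docs: Iterable[Dict[str, Any]], conn: sqlite3.Connection) -> int:
    """Upsert each document as one row; returns the number of rows written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(id TEXT PRIMARY KEY, user_name TEXT, event TEXT)"
    )
    rows = [tuple(get_path(d, p) for p in COLUMN_MAP) for d in docs]
    # INSERT OR REPLACE keeps the latest version of a document with the same id.
    conn.executemany(
        "INSERT OR REPLACE INTO events (id, user_name, event) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)


docs = [
    {"_id": "a1", "user": {"name": "amy"}, "event": "login"},
    {"_id": "b2", "user": {"name": "bob"}, "event": "purchase"},
    {"_id": "a1", "user": {"name": "amy"}, "event": "logout"},  # later update of a1
]
conn = sqlite3.connect(":memory:")
written = replicate(docs, conn)
result = conn.execute(
    "SELECT id, user_name, event FROM events ORDER BY id"
).fetchall()
```

Keying rows on the document id makes the copy idempotent, so replaying the same documents does not create duplicates. A production replicator would also need to handle deletes and schema changes, which this sketch ignores.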
  28. 28. VISUALIZATION VISUALIZE YOUR DATA
  29. 29. 1,999 USD
  30. 30. MACHINE LEARNING GENETIC ALGORITHM
  31. 31. Genetic algorithm
  32. 32. Travelling salesman problem Self-help tourism Scheduling Genetic Algorithm System
  33. 33. Linear algebra and Probability are important Bayesian probability Decision Tree Regression Support Vector Machine
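To make the genetic algorithm slides more concrete, here is a minimal sketch of a genetic algorithm applied to the travelling salesman problem. The operators chosen (order crossover, swap mutation, elitism) and all parameter values are assumptions for illustration; the slides do not say which variant the speaker used.

```python
import math
import random
from typing import List, Sequence, Tuple

City = Tuple[float, float]


def tour_length(tour: List[int], cities: Sequence[City]) -> float:
    """Total length of the closed tour visiting cities in the given order."""
    return sum(
        math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def order_crossover(a: List[int], b: List[int], rng: random.Random) -> List[int]:
    """Copy a slice from parent a and fill the rest in parent b's order."""
    i, j = sorted(rng.sample(range(len(a)), 2))
    middle = a[i:j]
    rest = [c for c in b if c not in middle]
    return rest[:i] + middle + rest[i:]


def mutate(tour: List[int], rng: random.Random, rate: float) -> None:
    """Swap two cities in place with probability `rate`."""
    if rng.random() < rate:
        i, j = rng.sample(range(len(tour)), 2)
        tour[i], tour[j] = tour[j], tour[i]


def solve_tsp(cities: Sequence[City], pop_size: int = 60,
              generations: int = 200, rate: float = 0.2, seed: int = 0):
    """Return (best tour, best length recorded per generation)."""
    rng = random.Random(seed)
    n = len(cities)
    population = [rng.sample(range(n), n) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda t: tour_length(t, cities))
        history.append(tour_length(population[0], cities))
        elite = population[: pop_size // 5]  # kept unchanged (elitism)
        children = []
        while len(elite) + len(children) < pop_size:
            parent_a, parent_b = rng.sample(elite, 2)
            child = order_crossover(parent_a, parent_b, rng)
            mutate(child, rng, rate)
            children.append(child)
        population = elite + children
    best = min(population, key=lambda t: tour_length(t, cities))
    history.append(tour_length(best, cities))
    return best, history


cities = [(0, 0), (1, 5), (4, 1), (6, 6), (2, 3), (7, 2), (5, 4), (3, 7)]
best, history = solve_tsp(cities)
```

Because the elite tours survive each generation untouched, the best length in `history` never gets worse. The result is a good tour, not a guaranteed optimum.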
  34. 34. SAMPLE: SOME INTERESTING SAMPLE APPLICATIONS…
  35. 35. FINANCIAL DISTRESS PREDICTION SYSTEM
  36. 36. financial index Company Share price Genetic Algorithm 3000 financial indices 20 financial indices Support Vector Machine Matlab & C# & ASP.NET
  37. 37. GAME TREND MONITOR SYSTEM
  38. 38. Crawler System Crawler System Crawler System Crawler System DB Text Mining System Article => Emotional Value C# & MSSQL & SSRS
  39. 39. DB C# & MSSQL & SSRS
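The "Article => Emotional Value" step of the game trend monitor can be pictured as a lexicon-based score. The sketch below is an assumption about one simple way to do it, not the speaker's C# implementation: the `LEXICON` entries are made up, and the tokenizer only handles English words, whereas Chinese articles would first need a word segmenter.

```python
import re
from typing import Dict

# Hypothetical mini-lexicon; a real system would use a curated sentiment dictionary.
LEXICON: Dict[str, int] = {
    "fun": 1, "great": 2, "love": 2,
    "boring": -1, "lag": -1, "broken": -2,
}


def emotional_value(article: str, lexicon: Dict[str, int] = LEXICON) -> float:
    """Average lexicon score over matched words; 0.0 when nothing matches."""
    words = re.findall(r"[a-z']+", article.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0
```

Averaging per matched word keeps long and short articles comparable. Tracking this value per game over time is one way to obtain the trend the system's name suggests.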
  40. 40. APP BEHAVIOR ANALYSIS SYSTEM
  41. 41. RDB s3fs Node PostgresNode Node Node Node mongo Pentaho R R & RUBY & MONGODB & POSTGRES & Pentaho & MOSQL & FLUENTD & s3fs
  42. 42. SUMMARY FOR THE SAME PROBLEM, YOU WILL BUILD A BETTER SOLUTION OR MECHANISM WHEN YOU ARE A MULTI-DOMAIN EXPERT.
  43. 43. Crawler System Text Mining System Article => Emotional Value 8 years up… Shortcut?
  44. 44. What’s the fastest way to understand zombies?
