Hadoop dev 01

Published in: Education

  1. NYC Data Science Academy: Hadoop Application Development with Real Cases
  2. Multi-layer Model
  3. Data Pyramid and Roles
       • Business personnel
       • ETL engineer
       • Data warehouse engineer
       • Analyst
       • Data visualization engineer
       • IT support: operations and maintenance, programmer
  4. Data Analysis
       • Analyze collected data with statistical methods for a defined purpose, then interpret the results and act on them
  5. Data Mining
       • Data mining is a technique for retrieving information hidden in data. It is a process that applies knowledge-discovery algorithms to large databases and shows the resulting associations to users.
       • Origins: hypothesis testing, pattern recognition, artificial intelligence, machine learning
       • Common data mining projects: association rules, clustering, outlier analysis
       • Case: beer and diapers
       • Science: Detecting Novel Associations in Large Data Sets
  6. Business Intelligence
       • BI = data warehouses (storage) + data analysis and data mining (analysis) + reports (presentation)
       • Our course
  7. Data Analysis Algorithms
       • Popular algorithms
  8. Regression
  9. Time Series Analysis
  10. Classifier
  11. Clustering
  12. Association Rules
  13. Data Analysis
       • Data analysis tools
  14. Ranking of Popular Data Analysis Tools
  15. Data Analysis Stages
       • Stage 1: led by business personnel
       • Stage 2: led jointly by business personnel and analysts
       • Stage 3: led by analysts
  16. Data Analysis in Stage 1
       • Business staff set all the requirements and most of the analysis plan.
       • Drawing on experience, business staff select the features and set the thresholds, IT staff find and integrate the data, and analysts produce the reports.
       • Feature selection and the choice of thresholds rest on experience and personal knowledge.
       • Suitable for simple cases; the analysis technique is equivalent to the simplest decision tree.
       • Business staff hold valuable experience and are hard to replace; analysts only produce charts and are easily replaced.
       • This is common in traditional industries.
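The stage 1 workflow, in which a business expert picks one feature and one threshold, amounts to a one-level decision tree. A minimal sketch in Python follows; the feature `monthly_spend`, the threshold of 100, and the labels are hypothetical and do not come from the slides.

```python
def flag_customer(monthly_spend, threshold=100.0):
    """Label a customer the way a one-level decision tree would.

    The threshold stands in for a value chosen by business staff from
    experience (hypothetical, not taken from the slides).
    """
    return "high value" if monthly_spend >= threshold else "regular"


# Hypothetical records: (customer id, monthly spend).
records = [("A", 250.0), ("B", 40.0), ("C", 100.0)]
labels = {name: flag_customer(spend) for name, spend in records}
print(labels)
```

The whole "model" is a single comparison, which is why this stage depends on the expert who chose the threshold rather than on the analyst.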
  17. Data Analysis in Stage 2
       • More complex. Business staff can analyze a small number of data records but cannot identify all the features or the relationships among them. They have no experience with large samples.
       • Analysts clean the data, select features, and build a suitable model to solve the problem.
       • Business staff and analysts can evaluate the result together, so the work is very likely to succeed. Analysts prefer this stage because their ability and value are recognized.
  18. Spammers in WordPress
  19. Data Analysis in Stage 3
       • Business staff have no experience with the case and cannot offer any useful prior knowledge.
       • Analysts use various tools and models to mine the data, trying to make interesting discoveries.
       • This is the analyst's ideal world, but it is likely to fail.
       • Business staff cannot get involved, and they dislike this stage.
  20. Step Forward
       • The first stage (gold on the ground) -> the second stage (gold beneath the ground) -> the third stage (gold deeply buried)
       • If analysts are reckless, business staff will be reluctant to help.
       • Data analysis is rooted in the business context. The goal of analysis is to increase profit. Successful analysis cannot be separated from the business.
       • An interesting topic is more important than the model.
  21. What is Big Data
  22. Features of Big Data
  23. Challenges for Analysts
       • Both insertion and queries become bottlenecks as the amount of data grows.
       • The trend toward integrating analysis results into user applications demands faster real-time computation and shorter response times.
       • More complex models require more expensive computation.
  24. Dilemma of Traditional Data Analysis Tools
       • R, SAS, and SPSS are experimental tools.
       • The data size they can handle is limited by memory size.
       • An Oracle database can hold large volumes of data, but it lacks specialized, fast analysis capabilities.
       • Sampling is a limited solution; it is not useful for clustering or recommendation systems.
       • Solution: a Hadoop cluster and Map-Reduce parallel computing
  25. Case 1: Analysis and monitoring for a telecommunications company
  26. Case 1: Analysis and monitoring for a telecommunications company
       • Configuration of the original database server: HP minicomputer, 128 GB memory, 48-core CPU, RAC with two nodes, one node for insertion and the other for queries
       • Storage: HP virtual storage, over 1000 disks
       • Architecture: Oracle RAC with two nodes
       • Bottlenecks: 1. Insertion 2. Query
  27. Case 2: DNA database
  28. Case 3: Social analysis, activity fingerprint detection

       Voice mail intersect   IMSI 1   IMSI 2   ……   IMSI n   Total call duration
       User A IMSI            20%      12%      ……   5%       365
       User B IMSI            15%      13%      ……   2%       310

       SMS intersect          IMSI 1   IMSI 2   ……   IMSI n   Monthly SMS count
       User A IMSI            50%      10%      ……   5%       200
       User B IMSI            20%      13%      ……   2%       260

       Base station           CGI 1    CGI 2    ……   CGI n    Shutdown
       User A IMSI            20%      12%      ……   5%       20%
       User B IMSI            15%      13%      ……   2%       5%

       Fingerprint (feature vector)   User A IMSI                 User B IMSI
       Voice mail                     (0.2, 0.12, …, 0.05)        (0.15, 0.13, …, 0.02)
       SMS                            (0.5, 0.1, …, 0.05)         (0.2, 0.13, …, 0.02)
       Base station                   (0.2, 0.12, …, 0.05, 0.2)   (0.15, 0.13, …, 0.02, 0.05)
  29. Case 3: Social analysis, activity fingerprint detection
       • Let θ be the angle between two fingerprint vectors.
       • When θ equals 90°, the two vectors are independent.
       • When θ equals 0°, the two vectors are perfectly dependent.
       • The closer θ is to 0°, the more dependent the vectors are.
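The angle between two fingerprint vectors u and v can be obtained from their dot product, since cos(angle) = u·v / (|u| |v|). The sketch below is one plain-Python way to compute it. The three-component vectors are an assumption: they reuse only the components visible on the previous slide and ignore the elided ones.

```python
import math


def angle_degrees(u, v):
    """Return the angle between vectors u and v, in degrees."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


# Assumed, shortened fingerprints (visible components only).
user_a = [0.2, 0.12, 0.05]
user_b = [0.15, 0.13, 0.02]
angle = angle_degrees(user_a, user_b)
print(f"angle between fingerprints: {angle:.1f} degrees")
```

A small angle suggests the two IMSIs share an activity pattern and may belong to the same person; where to draw the cut-off is a business decision that the slides do not specify.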
  30. Case 3: Social analysis, VIP detection
  31. The Solution Analysts Hope For
       • Eliminates the bottleneck completely for the foreseeable future
       • Carries over existing skills smoothly, for example SQL and R
       • Cost of the new platform: hardware and software, re-development, skills training, maintenance
  32. Path to Big Data
  33. Idea of Hadoop
  34. Map-Reduce Programming
  35. Map-Reduce program for meteorological data analysis
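The program shown on this slide is not part of the transcript. As a rough illustration of the Map-Reduce pattern on weather data, the sketch below finds the maximum temperature per year in plain Python, simulating the map, shuffle, and reduce phases in one process. The `year,temperature` input format and all function names are assumptions made for this example; a real Hadoop job would be written against the Hadoop API.

```python
from collections import defaultdict


def map_record(line):
    """Map phase: emit a (year, temperature) pair for one input line."""
    year, temperature = line.strip().split(",")
    yield year, int(temperature)


def shuffle(pairs):
    """Shuffle phase: group values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_max(year, temperatures):
    """Reduce phase: keep the highest temperature seen for one year."""
    return year, max(temperatures)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    pairs = (pair for line in lines for pair in map_record(line))
    grouped = shuffle(pairs)
    return dict(reduce_max(year, temps) for year, temps in sorted(grouped.items()))


# Hypothetical sample input, one reading per line.
sample = ["1949,111", "1949,78", "1950,0", "1950,22", "1950,-11"]
result = run_job(sample)
print(result)
```

Because each line is mapped independently and each year is reduced independently, both phases can be spread across many machines, which is the property Hadoop exploits.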
  36. Map-Reduce implementation for popular algorithms
  37. Map-Reduce implementation for popular algorithms
  38. Why not Hadoop?
       • Java?
       • Hard to control?
       • Hard to integrate data?
       • Hadoop vs Oracle
  39. Analysis under the Hadoop System
       • Mainstream: Java programs
       • Lightweight scripting language: Pig
       • Smooth migration from SQL: Hive
       • NoSQL: HBase
  40. The Hadoop Family
  41. Pig
       • Pig can be thought of as a client for Hadoop: it connects to Hadoop and runs analyses.
       • Pig is convenient for users unfamiliar with Java. It uses Pig Latin, a SQL-like language for describing data flows.
       • Pig Latin can sort, filter, sum, group, and join data, and supports user-defined functions. It is a lightweight scripting language for data manipulation and analysis.
       • Pig can be thought of as a mapping from Pig Latin to Map-Reduce.
  42. Hive
       • A data warehouse tool that maps the raw data structures in Hadoop to tables in Hive
       • Supports HiveQL, a language almost the same as SQL; its functionality matches SQL except for features such as updating and indexing
       • Can be thought of as a mapping from SQL to Map-Reduce
       • Offers shell, JDBC/ODBC, Thrift, and Web interfaces
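To make "a mapping from SQL to Map-Reduce" concrete, the sketch below hand-translates one query, `SELECT city, COUNT(*) FROM visits WHERE status = 200 GROUP BY city`, into map and reduce steps in plain Python. The table, its columns, and the translation itself are illustrative assumptions; this is not the execution plan Hive actually generates.

```python
from collections import defaultdict

# Hypothetical rows of a "visits" table.
visits = [
    {"city": "NYC", "status": 200},
    {"city": "NYC", "status": 404},
    {"city": "Boston", "status": 200},
    {"city": "NYC", "status": 200},
]


def map_row(row):
    """WHERE becomes a filter in the map step; GROUP BY picks the key."""
    if row["status"] == 200:
        yield row["city"], 1


def reduce_count(city, ones):
    """COUNT(*) becomes a sum over the values grouped under one key."""
    return city, sum(ones)


# Group the mapped pairs by key, standing in for the shuffle phase.
grouped = defaultdict(list)
for row in visits:
    for key, value in map_row(row):
        grouped[key].append(value)

counts = dict(reduce_count(city, ones) for city, ones in grouped.items())
print(counts)
```

The correspondence is the point: row-level clauses run in the map step, the grouping column becomes the shuffle key, and the aggregate function becomes the reducer.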
  43. Features of Mahout
       • Mahout provides scalable machine learning algorithms (Map-Reduce implementations), but a Hadoop platform is not required. The core library also has efficient single-machine algorithms.
       • Mature and popular algorithms: 1. Frequent itemset mining 2. Clustering 3. Classification 4. Recommendation systems 5. Frequent subgraph mining
  44. Reference Textbooks
  45. Reference Textbooks
  46. Reference Textbooks
  47. Reference Textbooks
  48. Typical Experiment Environment (with server)
       • Server: ESXi, able to host multiple virtual machines and run 3 machines at the same time
       • PC: Linux or Windows + Cygwin; Linux can be standalone or a virtual machine
       • SSH: use the ssh command under Linux, or SecureCRT or PuTTY under Windows, to connect to the remote Linux server
       • VMware client: management of ESXi
       • Hadoop: use version 1.x or 2.x
  49. Typical Experiment Environment (with only a PC or laptop running Windows)
       • At least 4 GB of memory; 64-bit Windows is preferred, because a 32-bit machine can use only a little more than 3 GB of memory.
       • Install VMware Workstation or VirtualBox
       • Deploy 3 virtual machines and run them at the same time. If only two VMs can run, treat the host as a node (via Cygwin) and use bridged networking for the virtual network
       • Install Linux and Java
       • On older computers, consider a pseudo-distributed environment
  50. Experiment Environment
       • Deploy Pig
       • Deploy Hive
       • Deploy Mahout
  51. List of Cases in the Course
       • Analysis of a high-volume website log system; extracting KPI data (Map-Reduce)
       • LBS application for a telecommunications company; analysis of users' mobile phone traces (Map-Reduce)
       • User analysis for a telecommunications company; labeling duplicate users by call fingerprint (Map-Reduce)
       • Recommendation system for an e-commerce company (Map-Reduce)
       • Complex recommendation system application (Mahout)
       • Social networks; distance between users; community detection (Pig)
       • Importance of nodes in a social network (Map-Reduce)
       • Application of clustering algorithms; VIP analysis (Map-Reduce, Mahout)
       • Financial data analysis; retrieving reverse repurchase information from historical data (Hive)
       • Setting stock strategies with data analysis (Map-Reduce, Hive)
       • GPS application; sign-in data analysis (Pig)
       • Implementation and optimization of sorting on Map-Reduce
       • Middleware development; cooperation among multiple Hadoop clusters
