Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Big-data analytics: challenges and opportunities

52,188 views

Published on

林智仁 (Chih-Jen Lin)
國立臺灣大學資訊工程學系特聘教授

林智仁教授畢業於台大數學系,之後前往美國密西根大學獲得碩士與博士學位。他的研究團隊多年來致力於機器學習之相關研究,所發展的資料分類軟體在世界上被廣泛使用,是台灣計算機科學界突出的重要成果。

Published in: Technology
  • Very Nice, If you want more good Presentations Please visit www.ThesisScientist.com, Its a wonderful website for latest Presentations and Research
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking --- http://amzn.to/1VqU3t9
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Big Data, Analytics, and the Future of Marketing & Sales --- http://amzn.to/1T3lSHE
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Data Analytics Made Accessible --- http://amzn.to/22qCqz8
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Excellent presentation that cuts through a lot of the hype, by going back to CS history.
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Big-data analytics: challenges and opportunities

  1. 1. Big-data Analytics: Challenges and Opportunities Chih-Jen Lin Department of Computer Science National Taiwan University Talk at ðcǙÑx}t, August 30, 2014 Chih-Jen Lin (National Taiwan Univ.) 1 / 54
  2. 2. Everybody talks about big data now, but it's not easy to have an overall picture of this subject In this talk, I will give some personal thoughts on technical developments of big-data analytics. Some are very pre-mature, so your comments are very welcome Chih-Jen Lin (National Taiwan Univ.) 2 / 54
  3. 3. Outline 1 From data mining to big data 2 Challenges 3 Opportunities 4 Discussion and conclusions Chih-Jen Lin (National Taiwan Univ.) 3 / 54
  4. 4. From data mining to big data Outline 1 From data mining to big data 2 Challenges 3 Opportunities 4 Discussion and conclusions Chih-Jen Lin (National Taiwan Univ.) 4 / 54
  5. 5. From data mining to big data From Data Mining to Big Data In early 90's, a buzzword called data mining appeared Many years after, we have another one called big data Well, what's the dierence? Chih-Jen Lin (National Taiwan Univ.) 5 / 54
  6. 6. From data mining to big data Status of Data Mining and Machine Learning Over the years, we have all kinds of eective methods for classi
  7. 7. cation, clustering, and regression We also have good integrated tools for data mining (e.g., Weka, R, Scikit-learn) However, mining useful information remains dicult for some real-world applications Chih-Jen Lin (National Taiwan Univ.) 6 / 54
  8. 8. From data mining to big data What's Big Data? Though many de
  9. 9. nitions are available, I am considering the situation that data are larger than the capacity of a computer I think this is a main dierence between data mining and big data So in a sense we are talking about distributed data mining or machine learning (a), (b): distributed systems Image from Wikimedia Chih-Jen Lin (National Taiwan Univ.) 7 / 54
  10. 10. From data mining to big data From Small to Big Data Two important dierences: Negative side: Methods for big data analytics are not quite ready, not even mentioned to integrated tools Positive side: Some (Halevy et al., 2009) argue that the almost unlimited data make us easier to mine information I will discuss the
  11. 11. rst dierence Chih-Jen Lin (National Taiwan Univ.) 8 / 54
  12. 12. Challenges Outline 1 From data mining to big data 2 Challenges 3 Opportunities 4 Discussion and conclusions Chih-Jen Lin (National Taiwan Univ.) 9 / 54
  13. 13. Challenges Possible Advantages of Distributed Data Analytics Parallel data loading Reading several TB data from disk is slow Using 100 machines, each has 1/100 data in its local disk ) 1/100 loading time But having data ready in these 100 machines is another issue Fault tolerance Some data replicated across machines: if one fails, others are still available Chih-Jen Lin (National Taiwan Univ.) 10 / 54
  14. 14. Challenges Possible Advantages of Distributed Data Analytics (Cont'd) Work ow not interrupted If data are already distributedly stored, it's not convenient to reduce some to one machine for analysis Chih-Jen Lin (National Taiwan Univ.) 11 / 54
  15. 15. Challenges Possible Disadvantages of Distributed Data Analytics More complicated (of course) Communication and synchronization Everybody says moving computation to data, but this isn't that easy Chih-Jen Lin (National Taiwan Univ.) 12 / 54
  16. 16. Challenges Going Distributed or Not Isn't Easy to Decide Quote from Yann LeCun (KDnuggets News 14:n05) I have seen people insisting on using Hadoop for datasets that could easily
  17. 17. t on a ash drive and could easily be processed on a laptop. Now disk and RAM are large. You may load several TB of data once and conveniently conduct all analysis The decision is application dependent We will discuss this issue again later Chih-Jen Lin (National Taiwan Univ.) 13 / 54
  18. 18. Challenges Distributed Environments Many easy tasks on one computer become dicult in a distributed environment For example, subsampling is easy on one machine, but may not be in a distributed system Usually we attribute the problem to slow communication between machines Chih-Jen Lin (National Taiwan Univ.) 14 / 54
  19. 19. Challenges Challenges Big data, small analysis versus Big data, big analysis If you need a single record from a huge set, it's reasonably easy For example, accessing your high-speed rail reservation is fast However, if you want to analyze the whole set by accessing data several time, it can be much harder Chih-Jen Lin (National Taiwan Univ.) 15 / 54
  20. 20. Challenges Challenges (Cont'd) Most existing data mining/machine learning methods were designed without considering data access and communication of intermediate results They iteratively use data by assuming they are readily available Example: doing least-square regression isn't easy in a distributed environment Chih-Jen Lin (National Taiwan Univ.) 16 / 54
  21. 21. Challenges Challenges (Cont'd) So we are facing many challenges methods not ready no convenient tools rapid change on the system side and many others What should we do? Chih-Jen Lin (National Taiwan Univ.) 17 / 54
  22. 22. Opportunities Outline 1 From data mining to big data 2 Challenges 3 Opportunities 4 Discussion and conclusions Chih-Jen Lin (National Taiwan Univ.) 18 / 54
  23. 23. Opportunities Opportunities Looks like we are in the early stage of a research topic But what is our chance? Chih-Jen Lin (National Taiwan Univ.) 19 / 54
  24. 24. Opportunities Lessons from past developments in one machine Outline 3 Opportunities Lessons from past developments in one machine Successful examples? Design of big-data algorithms Chih-Jen Lin (National Taiwan Univ.) 20 / 54
  25. 25. Opportunities Lessons from past developments in one machine Algorithms for Distributed Data Analytics This is an on-going research topic. Roughly there are two types of approaches 1 Parallelize existing (single-machine) algorithms 2 Design new algorithms particularly for distributed settings Of course there are things in between Chih-Jen Lin (National Taiwan Univ.) 21 / 54
  26. 26. Opportunities Lessons from past developments in one machine Algorithms for Distributed Data Analytics (Cont'd) Given the complicated distributed setting, we wonder if easy-to-use big-data analytics tools can ever be available? I don't know either. Let's try to think about the situation on one computer
  27. 27. rst Indeed those easy-to-use analytics tools on one computer were not there at the
  28. 28. rst day Chih-Jen Lin (National Taiwan Univ.) 22 / 54
  29. 29. Opportunities Lessons from past developments in one machine Past Development on One Computer The problem now is we take many things for granted on one computer On one computer, have you ever worried about calculating the average of some numbers? Probably not. You can use Excel, statistical software (e.g., R and SAS), and many things else We seldom care internally how these tools work Can we go back to see the early development on one computer and learn some lessons/experiences? Chih-Jen Lin (National Taiwan Univ.) 23 / 54
  30. 30. Opportunities Lessons from past developments in one machine Example: Matrix-matrix Product Consider the example of matrix-matrix products C = A B; A 2 Rnd ; B 2 Rdm where Cij = Xd k=1 AikBkj This is a simple operation. You can easily write your own code Chih-Jen Lin (National Taiwan Univ.) 24 / 54
  31. 31. Opportunities Lessons from past developments in one machine Example: Matrix-matrix Product (Cont'd) A segment of C code (assume n = m here) for (i=0;in;i++) for (j=0;jn;j++) { c[i][j]=0; for (k=0;kn;k++) c[i][j] += a[i][k]*b[k][j]; } For 3; 000 3; 000 matrices $ gcc -O3 mat.c $ time ./a.out 3m24.843s Chih-Jen Lin (National Taiwan Univ.) 25 / 54
  32. 32. Opportunities Lessons from past developments in one machine Example: Matrix-matrix Product (Cont'd) But on Matlab (single-thread mode) $ matlab -singleCompThread tic; c = a*b; toc Elapsed time is 4.095059 seconds. Chih-Jen Lin (National Taiwan Univ.) 26 / 54
  33. 33. Opportunities Lessons from past developments in one machine Example: Matrix-matrix Product (Cont'd) How can Matlab be much faster than ours? The fast implementation comes from some deep research and development Matlab calls optimized BLAS (Basic Linear Algebra Subroutines) that was developed in 80's-90's Our implementation is slow because data are not available for computation Chih-Jen Lin (National Taiwan Univ.) 27 / 54
  34. 34. Opportunities Lessons from past developments in one machine Example: Matrix-matrix Product (Cont'd) CPU # Registers # Cache # Main Memory # Secondary storage (Disk) : increasing in speed #: increasing in capacity Optimized BLAS: try to make data available in a higher level of memory You don't waste time to frequently move data Chih-Jen Lin (National Taiwan Univ.) 28 / 54
  35. 35. Opportunities Lessons from past developments in one machine Example: Matrix-matrix Product (Cont'd) Optimized BLAS uses block algorithms A B = 2 4 A11 A14 ... A41 A44 3 5 2 4 B11 B14 ... B41 B44 3 5 = A11B11 + + A14B41 ... . . . If we compare the number of page faults (cache misses) Ours: much larger Block: much smaller Chih-Jen Lin (National Taiwan Univ.) 29 / 54
  36. 36. Opportunities Lessons from past developments in one machine Example: Matrix-matrix Product (Cont'd) I like this example because it involves both mathematical operations (matrix products), and computer architecture (memory hierarchy) Only if knowing both, you can make breakthroughs Chih-Jen Lin (National Taiwan Univ.) 30 / 54
  37. 37. Opportunities Lessons from past developments in one machine Example: Matrix-matrix Product (Cont'd) For big-data analytics, we are in a similar situation We want to run mathematical algorithms (classi
  38. 38. cation and clustering) in a complicated architecture (distributed system) But we are like at the time point before optimized BLAS was developed Chih-Jen Lin (National Taiwan Univ.) 31 / 54
  39. 39. Opportunities Lessons from past developments in one machine Algorithms and Systems To have technical breakthroughs for big-data analytics, we should know both algorithms and systems well, and consider them together Indeed, if you are an expert on both topics, everybody wants you now Many machine learning Ph.D. students don't know much about systems. But this isn't the case in the early days of computer science Chih-Jen Lin (National Taiwan Univ.) 32 / 54
  40. 40. Opportunities Lessons from past developments in one machine Algorithms and Systems (Cont'd) At that time, every numerical analyst knows computer architecture well. That's how they successfully developed oating-point systems and IEEE 754/854 standard Chih-Jen Lin (National Taiwan Univ.) 33 / 54
  41. 41. Opportunities Lessons from past developments in one machine Example: Machine Learning Using Spark Recently we developed a classi
  42. 42. er on Spark Spark is an in-memory cluster-computing platform Beyond algorithms we must take details of Spark Scala into account For example, you want to know the dierence between mapPartitions and map in Spark, and the slower for loop than while loop in Scala Chih-Jen Lin (National Taiwan Univ.) 34 / 54
  43. 43. Opportunities Lessons from past developments in one machine Example: Machine Learning Using Spark (Cont'd) During our development, Spark was signi
  44. 44. cantly upgraded from version 0.9 to 1.0. We must learn their changes It's like when you write a code on a computer, but the compiler or OS is actively changed. We are in a stage just like that. Chih-Jen Lin (National Taiwan Univ.) 35 / 54
  45. 45. Opportunities Successful examples? Outline 3 Opportunities Lessons from past developments in one machine Successful examples? Design of big-data algorithms Chih-Jen Lin (National Taiwan Univ.) 36 / 54
  46. 46. Opportunities Successful examples? Example of Distributed Machine Learning I don't think we have many successful examples yet Here I will show one: CTR (Click Through Rate) prediction for computational advertising Many companies now run distributed classi
  47. 47. cation for CTR problems Chih-Jen Lin (National Taiwan Univ.) 37 / 54
  48. 48. Opportunities Successful examples? Example: CTR Prediction De
  49. 49. nition of CTR: CTR = # clicks # impressions . A sequence of events Not clicked Features of user Clicked Features of user Not clicked Features of user A binary classi
  50. 50. cation problem. Chih-Jen Lin (National Taiwan Univ.) 38 / 54
  51. 51. Opportunities Successful examples? Example: CTR Prediction (Cont'd) Chih-Jen Lin (National Taiwan Univ.) 39 / 54
  52. 52. Opportunities Design of big-data algorithms Outline 3 Opportunities Lessons from past developments in one machine Successful examples? Design of big-data algorithms Chih-Jen Lin (National Taiwan Univ.) 40 / 54
  53. 53. Opportunities Design of big-data algorithms Design Considerations Generally you want to minimize the data access and communication in a distributed environment It's possible that method A better than B on one computer but method A worse than B in distributed environments Chih-Jen Lin (National Taiwan Univ.) 41 / 54
  54. 54. Opportunities Design of big-data algorithms Design Considerations (Cont'd) Example: on one computer, often we do batch rather than online learning Online and streaming learning may be more useful for big-data applications Example: very often we design synchronous parallel algorithms Maybe asynchronous ones are better for big data? Chih-Jen Lin (National Taiwan Univ.) 42 / 54
  55. 55. Opportunities Design of big-data algorithms Work ow Issues Data analytics is often only part of the work ow of a big-data application By work ow, I mean things from raw data to
  56. 56. nal use of the results Other steps may be more complicated than the analytics step In one-computer situation, the focus is often on the analytics step Chih-Jen Lin (National Taiwan Univ.) 43 / 54
  57. 57. Opportunities Design of big-data algorithms How to Get Started? In my opinion, we should start from applications Applications ! programming frameworks and algorithms ! general tools Now almost every big-data application requires special settings of algorithms, but I believe general tools will be possible Chih-Jen Lin (National Taiwan Univ.) 44 / 54
  58. 58. Discussion and conclusions Outline 1 From data mining to big data 2 Challenges 3 Opportunities 4 Discussion and conclusions Chih-Jen Lin (National Taiwan Univ.) 45 / 54
  59. 59. Discussion and conclusions Risk of This Topic It's unclear how successful we can be Two problems: Technology limits Applicability limits Chih-Jen Lin (National Taiwan Univ.) 46 / 54
  60. 60. Discussion and conclusions Risk: Technology limits It's possible that we cannot get satisfactory results because of the distributed con
  61. 61. guration Recall that parallel programming or HPC (high performance computing) wasn't very successful in early 90's. But there are two dierences this time 1 We are using commodity machines 2 Data become the focus Well, every area has its limitation. The degree of success varies Chih-Jen Lin (National Taiwan Univ.) 47 / 54
  62. 62. Discussion and conclusions Risk: Technology Limits (Cont'd) Let's compare two matrix products: Dense matrix products: very successful as the
  63. 63. nal outcome (optimized BLAS) is much better than what ordinary users wrote Sparse matrix products: not as successful. My code is about as good as those provided by Matlab For big data analytics, it's too early to tell We never know until we try Chih-Jen Lin (National Taiwan Univ.) 48 / 54
  64. 64. Discussion and conclusions Risk: Applicability Limits What's the percentage of applications that need big-data analytics? Not clear. Indeed some think the percentage is small (so they think big-data analytics is a hype) One main reason is that you can always analyze a random subest on one machine But you may say this is a chicken and egg problem { because of no available tools, so no applications?? Chih-Jen Lin (National Taiwan Univ.) 49 / 54
  65. 65. Discussion and conclusions Risk: Applicability Limits (Cont'd) Another problem is the mis-understanding Until recently, few universities or companies can access data center environments. They therefore think those big ones (e.g., Google) are doing big-data analytics for everything In fact, the situation isn't like that Chih-Jen Lin (National Taiwan Univ.) 50 / 54
  66. 66. Discussion and conclusions Risk: Applicability Limits (Cont'd) A quote from Dan Ariely, Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it ... In my recent visit to a large company, their people did say that most analytics works are still done on one machine Chih-Jen Lin (National Taiwan Univ.) 51 / 54
  67. 67. Discussion and conclusions Risk: Applicability Limits (Cont'd) A quote from Dan Ariely, Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it ... In my recent visit to a large company, their people did say that most analytics works are still done on one machine Chih-Jen Lin (National Taiwan Univ.) 51 / 54
  68. 68. Discussion and conclusions Risk: Applicability Limits (Cont'd) A quote from Dan Ariely, Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it ... In my recent visit to a large company, their people did say that most analytics works are still done on one machine Chih-Jen Lin (National Taiwan Univ.) 51 / 54
  69. 69. Discussion and conclusions Risk: Applicability Limits (Cont'd) A quote from Dan Ariely, Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it ... In my recent visit to a large company, their people did say that most analytics works are still done on one machine Chih-Jen Lin (National Taiwan Univ.) 51 / 54
  70. 70. Discussion and conclusions Open-source Developments Open-source developments are very important for big data analytics How it works: The company must do an application X. They consider an open-source tool Y. But Y is not enough for X. Then their engineers improve Y and submit pull requests Through this process, core developers of a project are formed. They are from various companies Chih-Jen Lin (National Taiwan Univ.) 52 / 54
  71. 71. Discussion and conclusions Open-source Developments (Cont'd) For Taiwanese data-science companies, I think we should actively participate in such developments Indeed industry rather than schools are in a better position to do this Chih-Jen Lin (National Taiwan Univ.) 53 / 54
  72. 72. Discussion and conclusions Conclusions Big-data analytics is in its infancy It's challenging to development algorithms and tools in a distributed environment To start, we should take both algorithms and systems into consideration Hopefully we will get some breakthroughs in the near future Chih-Jen Lin (National Taiwan Univ.) 54 / 54

×