Big Data Analytics (ML, DL, AI) hands-on


This is a supplementary slide deck to the Introduction to Big Data Analytics material (in the next file). It invites us to start hands-on work with several topics related to Machine/Deep Learning, Big Data (batch/streaming), and AI using TensorFlow.

Published in: Engineering

  1. Machine Learning, Deep Learning, Big Data Hands-On, by Dony Riyanto. Prepared and presented to Panin Asset Management, January 2019.
  2. Hands-on Agenda
     • Machine Learning Re-Visited
     • Python Example of Machine Learning
     • Introduction to Deep Learning
     • Implementation of 'Big Data' (Hadoop Ecosystem)
     • Hadoop File System
     • Hadoop Map Reduce
     • Case Study
     • More advanced implementation of Big Data
  3. The Learning Problem
     The essence of ML:
     1. We have data
     2. Patterns exist in the data
     3. We cannot pin the pattern down as a mathematical formula (the formula is not yet known)
     Examples:
     • Movie rating
     • Credit approval
     • Handwriting recognition
     Domain areas:
     • Computer vision
     • Natural language processing
     • Business intelligence
  4. Components of Learning
     Example in banking: credit card approval
     • Input: x (customer application)
     • Output: y (good/bad customer)
     • Unknown target function f: X → Y
     • Data set {x, y} (customer records database)
     • Hypothesis set H of candidate functions h: X → Y
     • Final hypothesis g, chosen from H to approximate f
     • Learning model = hypothesis set + learning algorithm
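
The components above can be made concrete with a small sketch. The applicant records, the two features (income and debt, in arbitrary units), and the choice of the perceptron as the learning algorithm are all assumptions made for illustration; none of them come from the slides. Here the hypothesis set H is the family of linear rules on (income, debt), and the learning algorithm searches H for a final hypothesis g that agrees with the data set.

```python
# Toy data set {x, y}: (income, debt) -> +1 good customer, -1 bad customer.
# These records are invented for illustration.
DATASET = [
    ((5.0, 1.0), +1), ((6.0, 2.0), +1), ((7.0, 1.0), +1),
    ((1.0, 4.0), -1), ((2.0, 5.0), -1), ((1.0, 6.0), -1),
]


def predict(weights, x):
    """One hypothesis h from H: the sign of a weighted sum plus a bias."""
    bias, w_income, w_debt = weights
    score = bias + w_income * x[0] + w_debt * x[1]
    return 1 if score > 0 else -1


def learn(dataset, max_epochs=100):
    """Perceptron learning algorithm: adjust the weights on every mistake."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in dataset:
            if predict(weights, x) != y:
                weights[0] += y
                weights[1] += y * x[0]
                weights[2] += y * x[1]
                mistakes += 1
        if mistakes == 0:
            break  # every record is classified correctly
    return weights


g = learn(DATASET)  # final hypothesis g, picked from H
print([predict(g, x) for x, _ in DATASET])  # prints [1, 1, 1, -1, -1, -1]
```

The unknown target function f never appears in the code: the algorithm only sees the records, which is the point of the learning setup.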
  5. Machine Learning Model
     • Input: spatial data (text, image) {x, y}, or sequence / time-series data {x, t}
     • Classifier: outputs a class score
     • Regression: outputs continuous values
  6. Main Paradigms
     Machine learning is the automatic discovery of patterns in data through computer algorithms, and the use of those patterns to take actions such as classifying or clustering the data into categories.
     • Supervised learning: learning by labeled example. E.g. an email spam detector. We have (input, correct output) pairs and can predict (new input, predicted output). Amazingly effective if you have lots of data.
     • Unsupervised learning: discovering patterns. E.g. data clustering. Instead of (input, correct output), we get (input, ?). Difficult in practice, but useful if we lack labeled data.
     • Reinforcement learning: feedback and error. E.g. learning to play chess. Instead of (input, correct output), we get (input, some output, grade for this output). Works well in some domains and is becoming more important.
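
The difference between the first two paradigms shows up in what the code is given. The sketch below uses invented one-dimensional data (think of a single "spamminess" score per email); the nearest-centroid classifier and the two-cluster k-means loop are stand-ins chosen for brevity, not methods named in the slides.

```python
from statistics import mean

# Supervised: each input comes with its correct output (a label).
labeled = [(1.0, "ham"), (1.2, "ham"), (0.8, "ham"),
           (5.0, "spam"), (5.3, "spam"), (4.9, "spam")]


def fit_centroids(examples):
    """Average the inputs of each label (a nearest-centroid classifier)."""
    groups = {}
    for x, label in examples:
        groups.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in groups.items()}


def classify(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


# Unsupervised: the same inputs without labels; we can only discover groups.
def two_means(xs, rounds=10):
    """Split 1-D points into two clusters (needs at least two distinct values)."""
    low, high = min(xs), max(xs)  # deterministic starting centroids
    for _ in range(rounds):
        left = [x for x in xs if abs(x - low) <= abs(x - high)]
        right = [x for x in xs if abs(x - low) > abs(x - high)]
        low, high = mean(left), mean(right)
    return left, right


centroids = fit_centroids(labeled)
print(classify(centroids, 4.5))              # a label for a new input
print(two_means([x for x, _ in labeled]))    # two unlabeled groups
```

The supervised path returns a named class for a new input, while the unsupervised path returns two groups that a human still has to interpret.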
  7. What/why is Python
     Python is an interpreted, high-level, general-purpose programming language. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together. Python's simple, easy-to-learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.
  8. Machine Learning with Python
     • We need Python 2.7.x or 3.7.x
     • Libraries, e.g.:
       • numpy (fundamental package for scientific computing with Python)
       • matplotlib (plotting library for Python and its numerical mathematics extension NumPy)
       • pandas (library for data manipulation and analysis)
       • seaborn (data visualization library based on matplotlib)
       • sklearn (scikit-learn, a machine learning library for Python)
     • IDE, e.g. PyCharm
     • Alternatively, install Anaconda (a distribution of the Python and R programming languages for scientific computing: data science, machine learning applications, large-scale data processing, predictive analytics, etc.)
  9. Machine Learning with Python
     • Python 3 installation
     • Introduction to pip (the Python package installer)
     • Install PyCharm
     • or install Anaconda
  10. Lesson 1: data preprocessing
  11. Lesson 1 (with Anaconda)
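
The lesson's own code is not reproduced in this transcript. As a rough idea of what data preprocessing typically involves, here is a minimal sketch using only the standard library. The `ages` column and both helper functions are hypothetical; in scikit-learn the comparable tools would be `SimpleImputer` and `StandardScaler`.

```python
from statistics import mean, pstdev

# Hypothetical raw column with one missing value (None).
ages = [25, 32, None, 47, 51]


def impute_mean(values):
    """Replace missing entries with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Rescale to zero mean and unit variance (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


clean = impute_mean(ages)
scaled = standardize(clean)
print(clean)
print([round(z, 2) for z in scaled])
```

Standardizing after imputation keeps features on a comparable scale, which matters for many learning algorithms that are sensitive to feature magnitude.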
  12. Lesson 2: class labeling with preprocessing
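
Class labeling usually means turning category names into integers that a learning algorithm can consume. The sketch below is an assumption about what such a step looks like, not the lesson's code; the `risk` column is invented. Codes are assigned in sorted order of the class names, which is also how scikit-learn's `LabelEncoder` behaves.

```python
def fit_label_encoder(labels):
    """Assign each distinct class name an integer, in sorted order."""
    return {name: index for index, name in enumerate(sorted(set(labels)))}


def encode(labels, mapping):
    """Translate class names into their integer codes."""
    return [mapping[name] for name in labels]


risk = ["low", "high", "medium", "low", "high"]  # hypothetical column
mapping = fit_label_encoder(risk)
codes = encode(risk, mapping)
print(mapping)  # {'high': 0, 'low': 1, 'medium': 2}
print(codes)    # [1, 0, 2, 1, 0]
```

Note that the integer codes carry no order here ("medium" is not greater than "low" in any meaningful sense), so they suit target labels better than input features.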
  13. Lesson 3: load CSV and data observation
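
Loading a CSV file and taking a first look at it can be sketched with the standard library alone. The column names and values below are invented, and an in-memory string stands in for a file on disk; with pandas, `read_csv` and `DataFrame.describe` cover the same ground.

```python
import csv
import io
from statistics import mean

# Stand-in for a CSV file on disk; the columns are hypothetical.
CSV_TEXT = """customer,age,balance
A,25,1200.50
B,32,300.00
C,47,5600.25
"""


def load_rows(handle):
    """Read CSV rows and convert the numeric columns from text."""
    reader = csv.DictReader(handle)
    return [
        {"customer": r["customer"], "age": int(r["age"]), "balance": float(r["balance"])}
        for r in reader
    ]


rows = load_rows(io.StringIO(CSV_TEXT))
print(len(rows), "rows; columns:", list(rows[0]))
print("mean age:", mean(r["age"] for r in rows))
print("max balance:", max(r["balance"] for r in rows))
```

For a real file, pass `open(path, newline="")` to `load_rows` instead of the `StringIO` object.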
  14. Lesson 4
  15. Introduction to Deep Learning
      • Deep learning has produced good results in applications such as computer vision, language translation, image captioning, audio transcription, molecular biology, speech recognition, natural language processing, self-driving cars, brain tumour detection, real-time speech translation, music composition, automatic game playing and so on.
      • Deep learning is the next big leap after machine learning, with a more advanced implementation. It is heading towards becoming an industry standard, with a strong promise of being a game changer when dealing with raw unstructured data.
  16. Introduction to Deep Learning
      • Deep learning is currently one of the best solution providers for a wide range of real-world problems. Developers are building AI programs that, instead of using previously given rules, learn from examples to solve complicated tasks. With deep learning being used by many data scientists, deeper neural networks are delivering results that are ever more accurate.
      • The idea is to build deep neural networks by increasing the number of layers in each network, so that the machine learns more about the data until it is as accurate as possible. Developers can use deep learning techniques to implement complex machine learning tasks, and train AI networks to have high levels of perceptual recognition.
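
One way to see why adding layers helps: the XOR pattern cannot be produced by a single linear layer, but one hidden layer is enough. The sketch below uses hand-picked weights rather than trained ones, so it illustrates what an extra layer can represent, not how a network learns. The weights and the function names are assumptions for illustration.

```python
def relu(value):
    """Rectified linear unit: pass positive values, clamp the rest to zero."""
    return max(0.0, value)


def forward(x1, x2):
    """Forward pass through one hidden layer and one output unit."""
    # Hidden layer: two units with hand-picked (not learned) weights.
    h1 = relu(x1 + x2)
    h2 = relu(x1 + x2 - 1.0)
    # Output layer: a linear combination of the hidden units.
    return h1 - 2.0 * h2


for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((a, b), forward(a, b))
```

The output is 1 when exactly one input is 1 and 0 otherwise; removing the hidden layer leaves only a weighted sum, which cannot produce that pattern.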
  17. Introduction to Deep Learning
      • Deep learning is especially popular in computer vision. One of the tasks it handles is image classification, where input images are assigned a class or label (cat, dog, etc.) that best describes the image. We as humans learn to do this very early in our lives, and have the skills of quickly recognizing patterns, generalizing from prior knowledge, and adapting to different image environments.
  18. Deep Learning Performance
  19. Deep Learning with TensorFlow
      • Google's TensorFlow is a Python library. It is a great choice for building commercial-grade deep learning applications.
      • TensorFlow grew out of DistBelief, an earlier library that was part of the Google Brain project. TensorFlow aims to extend the portability of machine learning so that research models can be applied to commercial-grade applications.
      • Much like the Theano library, TensorFlow is based on computational graphs: a node represents persistent data or a math operation, and an edge represents the flow of data between nodes. That data is a multidimensional array, or tensor, hence the name TensorFlow.
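
The computational-graph idea can be sketched in a few lines of plain Python. This is a toy model written for illustration and is not TensorFlow's API; the `Node` class and the `constant`, `add`, and `scale` helpers are hypothetical names. Nodes hold either data or an operation, and edges are the references along which values flow.

```python
class Node:
    """A graph node: either a constant value or an operation on inputs."""

    def __init__(self, op=None, inputs=(), value=None):
        self.op = op          # function applied to the input values
        self.inputs = inputs  # edges: upstream nodes whose output flows in
        self.value = value    # set for constant (data) nodes only

    def evaluate(self):
        if self.op is None:
            return self.value
        return self.op(*(node.evaluate() for node in self.inputs))


def constant(value):
    return Node(value=value)


def add(a, b):
    return Node(op=lambda x, y: [p + q for p, q in zip(x, y)], inputs=(a, b))


def scale(a, factor):
    return Node(op=lambda x: [factor * p for p in x], inputs=(a,))


# Build the graph first; nothing is computed until evaluate() is called.
x = constant([1.0, 2.0, 3.0])
w = constant([0.5, 0.5, 0.5])
y = scale(add(x, w), 2.0)
print(y.evaluate())  # [3.0, 5.0, 7.0]
```

Separating graph construction from evaluation is what lets a real framework optimize, distribute, or differentiate the computation before running it.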
  20. Deep Learning Implementation with TensorFlow and Python
      • Preparation (Python + libraries)
      • Installing TensorFlow
      • Running several built-in TensorFlow examples, e.g.:
        • Regression
        • Image classification
  21. Introduction to Hadoop
  22. Hadoop
      Hadoop is:
      • scalable.
      • a “Framework”.
      • not a drop-in replacement for an RDBMS.
      • great for pipelining massive amounts of data to achieve the end result.
  23. Hadoop
  24. Hadoop
  25. Hadoop
  26. Hadoop
  27. Hadoop
  28. Hadoop
  29. Hadoop
  30. Hadoop
  31. Hadoop
  32. Hadoop
      • Example of file/text search
  33. Hadoop
      • Planning
      • Installation steps
      • Using HDFS
      • Using Map Reduce
  34. Hadoop Map Reduce
      • MapReduce is a processing technique and a programming model for distributed computing; Hadoop's implementation is based on Java. The MapReduce algorithm contains two important tasks, Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Reduce then takes the output of a map as its input and combines those tuples into a smaller set of tuples. As the order of the name MapReduce implies, the reduce task is always performed after the map task.
  35. Hadoop Map Reduce
      • The MapReduce framework operates on <key, value> pairs: it views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
      • The key and value classes must be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
      • Input and output types of a MapReduce job: (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output).
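
The <k1, v1> → map → <k2, v2> → reduce → <k3, v3> flow can be imitated in a single process. The sketch below is not Hadoop code: `map_reduce` is a hypothetical helper that stands in for the framework, and the "year,temperature" records are invented. Here k1 is a line number, v1 a text line, k2 and k3 a year, v2 a reading, and v3 the maximum reading.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run map, then group by key (shuffle), then reduce, in one process."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):  # the framework sorts keys before reducing
        output.extend(reducer(k2, groups[k2]))
    return output


def mapper(line_number, line):
    """<k1, v1> = (line number, text) -> <k2, v2> = (year, temperature)."""
    year, temperature = line.split(",")
    yield year, int(temperature)


def reducer(year, temperatures):
    """<k2, [v2]> = (year, readings) -> <k3, v3> = (year, maximum)."""
    yield year, max(temperatures)


records = list(enumerate(["2017,31", "2018,35", "2017,33", "2018,29"]))
print(map_reduce(records, mapper, reducer))  # [('2017', 33), ('2018', 35)]
```

The sorting step is why Hadoop requires keys to be comparable, while values only need to be serializable.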
  36. Hadoop Map Reduce
      • Word count (without map-reduce)
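
The slide's code is not part of this transcript. A plain, single-machine word count might look like the sketch below; the sample text and the lower-casing rule are assumptions.

```python
from collections import Counter


def word_count(lines):
    """Count words across all lines in one pass, in a single process."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


text = ["big data is big", "data is everywhere"]
counts = word_count(text)
print(counts["big"], counts["everywhere"])  # 2 1
```

This works as long as the whole input and the counter fit on one machine, which is the limit that map-reduce is meant to lift.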
  37. Hadoop Map Reduce
      • Word count (mapper)
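
A mapper for word count, in the style used when mappers are written as scripts that read lines and write tab-separated records, could be sketched as follows. The function name and the "word, tab, 1" record format are assumptions, not the code from the slide.

```python
def map_words(lines):
    """Emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


# In a real job the lines would come from standard input and each record
# would be printed; a small list is used here so the sketch runs on its own.
for record in map_words(["big data is big"]):
    print(record)
```

The mapper does no counting at all: it only tags each word with a 1 and leaves aggregation to the reducer.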
  38. Hadoop Map Reduce
      • Word count (reducer)
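
The matching reducer receives the mapper's records sorted by key, so equal words arrive next to each other and can be summed in one pass. As before, this is a sketch under assumed names and record format, not the slide's code.

```python
from itertools import groupby


def reduce_counts(sorted_records):
    """Sum the counts of consecutive records that share the same word."""
    pairs = (record.rstrip("\n").split("\t") for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# The framework delivers mapper output sorted by key; sorted() stands in.
mapped = ["big\t1", "data\t1", "is\t1", "big\t1"]
for record in reduce_counts(sorted(mapped)):
    print(record)
```

Because the input is sorted, the reducer never needs to hold more than one word's running total in memory.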
  39. Hadoop Map Reduce
      • Run on Hadoop MR:
        • input file from local storage or HDFS
        • mapper application (see previous slides)
        • reducer application (see previous slide)
      • Note: mapper and reducer apps can be written in Python, R, Java, Scala, etc.
  40. Hadoop Map Reduce
  41. Hadoop Map Reduce
      • Map Reduce is not magic. It is a method.
      • Map Reduce is not always about big data (e.g. finding the value of pi).
      • Map Reduce is not a silver bullet (e.g. batch vs. streaming data).
      • Map Reduce is usually suited to:
        • Batch processing flows
        • Unstructured/semi-structured data
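
The pi example shows map-reduce applied to a compute problem rather than a data problem. A minimal single-process sketch, assuming a Monte Carlo estimate (the slide does not say which method it has in mind): each map task throws random points at the unit square and counts those inside the quarter circle, and the reduce task merges the counts. The seeds and sample sizes are arbitrary.

```python
import random


def map_count_hits(seed, samples):
    """Map task: count random points that land inside the quarter circle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits, samples


def reduce_to_pi(partials):
    """Reduce task: combine the partial counts into one estimate."""
    total_hits = sum(hits for hits, _ in partials)
    total_samples = sum(samples for _, samples in partials)
    return 4.0 * total_hits / total_samples


partials = [map_count_hits(seed, 25_000) for seed in range(4)]
pi_estimate = reduce_to_pi(partials)
print(pi_estimate)
```

Each map task is independent, so on a cluster the four calls could run on four machines and only the small (hits, samples) pairs would travel to the reducer.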
  42. The Bigger Picture of Hadoop (Hadoop Ecosystem)
  43. Data Stream: Why Stream Processing?
      • Processing unbounded data sets, or "stream processing", is a new way of looking at what has always been done as batch in the past. Whilst intra-day ETL and frequent batch executions have brought latencies down, they are still independent executions, with optional bespoke code in place to handle intra-batch accumulations. With a platform such as Spark Streaming we have a framework that natively supports processing both within a batch and across batches (windowing).
      • By taking a stream processing approach we can benefit in several ways. The most obvious is reduced latency between an event occurring and an action driven by it, whether automatic or via analytics presented to a human. Other benefits include a smoother resource consumption profile.
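
Windowing can be illustrated without any streaming framework. The sketch below is plain Python, not Spark Streaming: a generator consumes a (possibly unbounded) sequence of events and emits counts each time a fixed-size window closes. The event tuples, the 10-second window, and the function name are assumptions for illustration.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, counts) each time a window closes.

    `events` is an iterable of (timestamp, key) pairs in time order.
    """
    current_start, counts = None, Counter()
    for timestamp, key in events:
        start = timestamp - (timestamp % window_seconds)
        if current_start is None:
            current_start = start
        if start != current_start:
            yield current_start, dict(counts)
            current_start, counts = start, Counter()
        counts[key] += 1
    if counts:
        yield current_start, dict(counts)  # flush the last, partial window


events = [(0, "login"), (3, "click"), (9, "click"), (12, "login"), (27, "click")]
for window in tumbling_window_counts(events, 10):
    print(window)
```

Results appear as soon as each window closes rather than after the whole input has been read, which is where the latency benefit over batch processing comes from.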
  44. Introducing Spark
      • Better speed compared to Hadoop MR
      • Minimized disk reads and writes (in-memory processing)
      • Comes with Spark Streaming (later, Hadoop also created Hadoop Stream)
      • Still part of the Hadoop ecosystem
  45. Data Stream with Spark Streaming
  46. Simple Spark Streaming Implementation Example
      • Multiple channels/types of data
      • Data stream processing and analytics (bigger/more reliable capabilities)
      • Near-real-time dashboard
  47. Different programming style
      • Spark libraries included in the app
      • Returned data of processing/analytics
      • Infinite run
  48. Spark Streaming Implementation
      • Review some Spark Streaming examples
      • Review some Spark Streaming architectures
  49. Example of Bukalapak
      • Saves all data from 2014 until now
      • >1.5 PB of data, including:
        • Product images
        • Product data
        • Messaging
  50. Bukalapak 'Big Data' Implementation
  51. Example: Application Health Monitoring
  52. Example: Recommender Engine source:
  53. Example: Recommender Engine
  54. Example: Gojek Data Visualization
  55. Example: Gojek Problem
  56. Example: Gojek Problem
  57. Example: Gojek Problem
  58. Example: Gojek Problem