In-Hadoop, In-Database and In-Memory Processing for Predictive Analytics


Published in: Technology, Education

  1. In-Hadoop, In-Database, and In-Memory Processing for Predictive Analytics. Predict to Act.
  2. Stop Looking at the Rear View Mirror. From BI… Business Intelligence can only show you what has already happened. This is like driving a car while looking only in the rear-view mirror. Do you really want to drive your business like that?
  3. Stop Looking at the Rear View Mirror (continued). …to… Data Discovery and Real-time Analytics offer a view through the windscreen. You can see what is happening right now, but you still cannot identify upcoming opportunities or threats.
  4. Stop Looking at the Rear View Mirror (continued). Predictions. Predictive Analytics delivers future outcomes and provides a look-ahead view. This is like projecting what lies beyond the next curve onto your windscreen. Predictive Insights will show you that there is an accident ahead. Prediction-based Actions will automatically slow the car down. Warning: Accident Ahead!
  5. Value is Higher for Prediction-based Actions
     - Predictive Insights: Inform.
       - Aggregation of micro-predictions will show you what can be expected
       - Allow for better decision making
       - Useful for supporting strategic decisions
       - Will not disrupt business processes
       - Limited & unspecified total value
     - Predactions*: Operationalize.
       - Millions of micro-predictions
       - Each Predictive Action is embedded into your business process
       - You will know how often you will be right and what your total gain will be
       - Brings your business processes to a new, pro-active level
       - Huge total value
     - *Predactions are Prediction-based Actions. You will predict what is going to happen. And then you will predact on this.
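The claim on slide 5 that you "will know how often you will be right and what your total gain will be" can be read as a simple expected-value calculation. The sketch below is only an illustration of that reading: the function name and all numbers (precision, gain per hit, cost per miss) are hypothetical and do not come from the slides.

```python
def expected_total_gain(n_predictions, precision, gain_if_right, cost_if_wrong):
    """Expected total value of acting on n_predictions micro-predictions.

    precision is the fraction of acted-on predictions that turn out right.
    All inputs are assumptions supplied by the caller.
    """
    if not 0.0 <= precision <= 1.0:
        raise ValueError("precision must be between 0 and 1")
    # Expected value of a single automated action.
    per_action = precision * gain_if_right - (1.0 - precision) * cost_if_wrong
    return n_predictions * per_action


# Hypothetical figures: one million automated actions, right 80% of the time,
# each correct action gains 5.0 and each wrong one costs 2.0.
total = expected_total_gain(1_000_000, 0.8, 5.0, 2.0)
print(f"expected total gain: {total:,.0f}")
```

The point of the calculation is that a small per-action value becomes large once it is multiplied across millions of embedded decisions.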
  6. Science?
     - Predictive analytics is complex
     - Hadoop is complex
     - Proposed solution: Let’s create more Data Scientists!
     - But there are flaws with this approach:
       - Scientists are supposed to create new things. Data scientists spend 95% of their time on integrating and transforming data.
       - Shortage of data scientists predicted (McKinsey report)
       - Being a hardcore programmer, having a PhD in Statistics, and being able to understand business problems is a rare skill mix…
  7. What else can we do?
  8. Radoop: RapidMiner on Hadoop
     - We do this with RapidMiner + Hadoop = Radoop
       - Hadoop is primarily used for batch analytics workloads (ad-hoc reporting, machine learning, etc.)
       - Hadoop only provides programming APIs and command-line tools
       - Radoop is a RapidMiner partner that brought the simplicity of RapidMiner for advanced analytics to Hadoop clusters
       - Radoop has been in development since 2010
     - We need to empower collaborative teams with different backgrounds to analyze data in Hadoop. One team member might be the data scientist.
  9. RapidMiner for Prediction-based Actions
     - Empower business users: Easy-to-use GUI for the design of processes. Predictive insights are shown to improve decision making.
     - Business analysts in the driver’s seat: Let your analysts transform business problems into Prediction-based Actions. Create millions of micro-predictions and automate everyday decision making.
     - Facilitates collaboration among business users, business analysts, data scientists, and IT professionals.
  10. Radoop: RapidMiner on Hadoop
      - RapidMiner Data Flow Interface: Simple design, execution and maintenance of analytics processes
        - Focus: ad-hoc reporting and machine learning
        - Also supports data import/export, data transformations, ETL workloads, visualization
      - Combines distributed and in-memory analytics
  11. Supported Hadoop Distributions
  12. Client- or Server-based Architecture [diagrams: Client-based Architecture; Server-based Architecture]
  13. Segment Users based on Service Usage (example)
      - Task: Define K user segments and assign users to segments
      - Solution with Hadoop + Mahout:
        - CREATE TABLE: define a schema for the service usage log file by manually listing columns, types, defining separator character, etc.
        - Write HiveQL queries (or Pig scripts or…) to aggregate service logs for each user and calculate user attributes describing them
        - Implement and execute a custom MapReduce job to convert data to Mahout’s input format
        - Run the Mahout K-Means algorithm with proper parameters
        - Implement and execute a custom MapReduce job to convert the result back into a delimited format
        - Export the result from HDFS and import it into an RDBMS (or whatever system makes use of the “predactions”…)
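The logic behind the steps on slide 13 (aggregate the service logs per user, cluster the users with K-Means, assign each user to a segment) can be sketched on a single machine in plain Python. This is not Mahout, Hive, or Radoop code, and it does not run on Hadoop. The log layout `(user, service, amount)`, the function names, and the sample data are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def aggregate_usage(log_rows):
    """Aggregate raw (user, service, amount) rows into one vector per user."""
    totals = defaultdict(lambda: defaultdict(float))
    services = set()
    for user, service, amount in log_rows:
        totals[user][service] += amount
        services.add(service)
    columns = sorted(services)
    features = {u: [per.get(s, 0.0) for s in columns] for u, per in totals.items()}
    return features, columns


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, cluster index per point)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: nearest centroid for every point.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed its cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Hypothetical usage log: two heavy video users and two heavy mail users.
log = [
    ("alice", "video", 120.0), ("alice", "mail", 5.0),
    ("bob", "video", 110.0), ("bob", "mail", 3.0),
    ("carol", "mail", 90.0), ("carol", "video", 2.0),
    ("dave", "mail", 95.0),
]
features, columns = aggregate_usage(log)
users = sorted(features)
centroids, assignment = k_means([features[u] for u in users], k=2)
segments = dict(zip(users, assignment))
print(columns, segments)
```

On a cluster, each of these functions corresponds to a separate job plus the format conversions listed on the slide, which is where most of the manual effort goes.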
  14. Segment Users based on Service Usage (example)
      - Task: Define K user segments and assign users to segments
      - Solution with Radoop:
  15. Radoop: Data Management
  16. Radoop: Process Management
  17. Radoop: Supported Functions
      - Import/Export data to/from Hadoop
        - Read CSV
        - Read Database
        - Write CSV
        - Write Database
        - Retrieve/Store/Append to Hive
      - Data Transformations
        - Select Attributes
        - Filter Examples
        - Generate Attributes
        - Generate ID
        - Aggregate
        - Join
        - Sort
        - Normalize
        - Replace
        - Replace/Declare Missing Values
        - Hive/Pig Script
      - Machine learning & Statistical modeling
        - Clustering: K-Means, Fuzzy K-Means, Dirichlet, Canopy
        - Model learning: Naive Bayes
        - Model scoring: Naive Bayes, Decision Tree, Logistic Regression, Linear Regression
        - Evaluation: Performance
        - …and more…
  18. Production Use at…
  19. Engine Comparison
      - In-Memory:
        - In-memory analytics is always the fastest way to build analytical models
        - Data set size is restricted by hardware (memory)
        - Data set size: On decent hardware, up to approx. 100 million data points
      - In-Database:
        - Not applicable for all analysis tasks
        - Runtime depends on the power of the database server
        - Data set size: Unlimited (limit is the external storage capacity)
      - In-Hadoop:
        - Not applicable for all analysis tasks
        - Runtime depends on the power of the Hadoop cluster
        - Due to the massive overhead introduced by Hadoop, using Hadoop is not recommended for smaller data sets
        - Data set size: Unlimited (limit is the external storage capacity)
  20. Runtime Comparison for Naïve Bayes (20 nodes)
  21. Runtime Comparison for Number of Nodes
  22. Conclusion
      - Predictive Analytics on Hadoop for Everyone:
        - RapidMiner + Radoop is an easy-to-use & efficient alternative supporting the collaboration process between different team members
        - Not only Predictive Intelligence but also Prediction-based Actions can be created on top of Hadoop clusters by everyone
      - Runtimes:
        - Looking at the runtimes for analytical algorithms, it is easy to see that limitations on data set size have vanished today, but at the price of longer runtimes
        - Running predictive analytics on Hadoop clusters is prohibitively slow for small data sets and in many cases also for interactive real-time reports
        - Depending on the data itself, the number of nodes, and the selected predictive analytics algorithm, Hadoop clusters can beat the other engines starting at approx. 10M to 25M data points
        - In general we recommend staying in-memory for up to 100M data points and investing in hardware before switching to in-database (up to 500M data points), and then to Hadoop clusters for data sets beyond this size
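The general recommendation on slide 22 can be written down as a small rule of thumb. The two thresholds below are taken from the slide; the function name and the idea of encoding the rule in code are illustrative only, and the slide itself notes that the real crossover depends on the data, the number of nodes, and the algorithm.

```python
IN_MEMORY_LIMIT = 100_000_000    # slide: stay in-memory for up to 100M data points
IN_DATABASE_LIMIT = 500_000_000  # slide: in-database for up to 500M data points


def recommend_engine(data_points):
    """Return the engine suggested by the slide's general rule of thumb."""
    if data_points < 0:
        raise ValueError("data_points must be non-negative")
    if data_points <= IN_MEMORY_LIMIT:
        return "in-memory"
    if data_points <= IN_DATABASE_LIMIT:
        return "in-database"
    return "in-hadoop"


for size in (5_000_000, 250_000_000, 2_000_000_000):
    print(f"{size:>13,} data points -> {recommend_engine(size)}")
```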
  23. Contact Us
      - RapidMiner USA: RapidMiner, Inc. (Headquarters), 10 Fawcett St, Cambridge, MA 02138, United States. Phone +1 617 401 7708, Fax +1 617 401 7709
      - RapidMiner Germany: RapidMiner GmbH, Stockumer Str. 475, 44227 Dortmund, Germany. Phone +49 231 425 786 9-0, Fax +49 231 425 786 9-9
      - RapidMiner UK: RapidMiner Ltd., Quatro House, Frimley Road, Camberley GU16 7ER, United Kingdom. Phone +44 1276 804 426