SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?

Think big, act small, start now

The Digital Era is characterized not only by a seemingly endless flow of data but also by the variety and complexity of that data. This evolution offers companies the opportunity to gain new and valuable insights. Some examples of analytics:

- A customer segmentation analysis divides customers into several groups based on specific characteristics. This allows us to target them better, offer them tailor-made products and services, or make better use of cross-selling and up-selling opportunities.

- Churn prediction makes it possible to predict, even in real time, which customers are about to leave us. This insight enables us to take proactive action to prevent them from leaving.

At the same time we are confronted with new challenges, and we need to change the way we handle data.

Big data and analytics are the key to gaining new insights, which organizations can incorporate into their strategic decisions as well as into their operational way of working. The key question is: how do you start? The answer is simple: start by building up the basic competences, start today and keep it simple, prove the added value and add complexity along the way.

During this AE foyer, two open source solutions (and market standards), R and Hadoop, will be discussed. We will present their characteristics in detail and illustrate, in an accessible way, how to use them and which quick results you can expect. Furthermore, a realistic reference architecture will be shown to help you make the right choices based on your needs and ambitions.

Don’t miss out: discover how you can take advantage of the opportunities of the Digital Era in an innovative and pragmatic manner!

  • Steamrolling the audience with information: an educational training session, which we do make concrete with examples and cases.
  • Resources = people with the right competences (analysis gap).
  • The way we interact with the world has changed.
    Web: social media, web shops, online services,…
    Beyond: mobile, devices, sensors,…
  • Introduction of the 3 V’s: Velocity, Variety, Volume.
    The way we interact with the world has changed: social media, mobile, devices,…
  • R is a nice entry-level model, but it can also offer an alternative to the “big players”.
  • SAI Avondsessie 19/06: R and Hadoop, the perfect marriage for your analytics?

    1. Bram Vanschoenwinkel, Senior Data Scientist, AE (@bvschoen, @AE_NV). R & Hadoop: The perfect marriage for your analytics? Evening conference, 19/06/2014. ©AE 2012
    2. Agenda: (1) It’s a (R)evolution; (2) Intelligent Decision Support in the Digital Age; (3) The R Project for Statistical Computing; (4) The World of Hadoop; (5) Case: A Customer Intelligence Platform; (6) Conclusions.
    3. It’s a (R)evolution. [Chart: data volume over time (2000, 2010, 2015); the majority is unstructured data.]
    4. Abundance of Data. [Diagram with the categories CRM, ERP, Web and Beyond. Labels: purchase detail, production, payment detail, planning, contact information, leads, offers, segmentation, prospects, click stream data, web shops, social media, video, images, text, online services, audio, open data, mobile, devices, Internet of Things, RFID, GPS, sensors, user generated content, smart devices, remote monitoring, cloud, medical, wearables.]
    5. Opportunities: operational excellence; innovative business models; insights, strategy and policy.
    6. Challenges: short lifespan of the data; fast-moving data; fast data processing; high variety of data.
    7. Intelligent decision support in the digital age. What we see: abundance of heterogeneous data; the way we interact with the world has changed. Opportunities: operational excellence; better decision support; innovating business models. Challenges: analysis gap; volume, variety, velocity; competences.
    8. Decision Support in the Digital Age. Facing the challenges and realizing the opportunities: Business Analytics, Big Data.
    9. Elements of a Holistic Information Management Framework. Data sources: internal & external. From data to information: improving data quality, integrality of data. From information to knowledge: intelligent decision support (reporting, business analytics). From knowledge to intelligence. [Diagram: Data, Information, Knowledge, Intelligence, Wisdom/Insight.]
    10. Decision Support in the Digital Age. “Business Analytics is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data.”
    11. Business Analytics vs Business Intelligence
    12. New Insights. [Figure labels, number who stop (“stoppen”) per group: 8, 132, 10, 53, 64, 14, 4, 11.]
    13. Innovating Business Models. [Architecture diagram: Front-end Application(s); Security; Analytics (on Hadoop); Web Click Streaming; Social Media Connectivity; External Application Integration; Operational Data Processing on Hadoop.]
    14. From Analytics… [Diagram: Statistics, Algorithms, Biology, Psychology, Databases.]
    15. …to Business Analytics
    16. Analytics Approach. Analytics: incremental and iterative; think big, act small; proof-of-concept; open source tools. Architecture & Deployment: (non-)functional requirements; information architecture; technology; embedded into operations. [Diagram: Two Phase Approach: Analytics, then Architecture and Deployment.]
    17. Analytics: Churn Prediction Example. Sample records: John Doe, 43 years, Antwerp, man, 7 calls, 3 weeks, invoicing 30% down; Jane Dan, 32 years, Brussels, woman, 2 calls, 12 weeks, invoicing 10% up; … [Diagram labels: Operations (Invoicing, CRM, Call Center Application); operations data dump; Data Warehouse; Analytics Engine; churn scores; region, product, churn scores; management dashboard.]
    18. Big Data. “Big data is high-volume, high-velocity, high-complexity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” (Gartner)
    19. Four V’s and a C. Not only volume makes big data big, it’s all about the three V’s: high Volume, Variety, Velocity. High Value! In addition, the data is very complex in nature and often unstructured: text documents, emails, images and videos, etc.; click stream data, social media feed data, etc.
    20. Innovative Forms of Information Processing. Traditional methods don’t suffice anymore. New forms of information processing have emerged. [Diagram labels: distributed data storage and computation; NoSQL data stores.]
    21. Innovative Forms of Information Processing
    22. The R Project for Statistical Computing. R is a dialect of the S language. S was developed by John Chambers and others at Bell Labs. S was initiated in 1976. Now owned by TIBCO and sold under the name S-PLUS. [Diagram: interactive (not programming), gradually moving into programming when system aspects become important.]
    23. Advantages of R. Most widely used data analysis software: created and used by 2M+ data scientists, statisticians and analysts. Most powerful statistical programming language: flexible, extensible & comprehensive for productivity, 4800+ packages. Create beautiful and unique data visualizations: as seen in New York Times, Twitter and Flowing Data. Thriving open-source community: leading edge of analytics research. Fills the talent gap: new graduates prefer R.
    24. Drawbacks of R (with the mitigating factor in parentheses). Steep learning curve (vibrant community to help you). Objects must be stored in physical memory, little thought to memory management (recent advancements to deal with this). Functionality is based on consumer demand and user contributions (if a package is useful to many people, it will quickly evolve into a robust product). Documentation is sometimes patchy and terse, and impenetrable to the non-statistician (vibrant community to help you).
    25. Exploding Growth and Demand for R. R is the highest paid IT skill (Dice.com, Jan 2014). R is the most-used data science language after SQL (O’Reilly, Jan 2014). R is used by 70% of data miners (Rexer, Sep 2013). R is #15 of all programming languages (RedMonk, Jan 2014). R is growing faster than any other data science language (KDnuggets, Aug 2013). More than 2 million users worldwide.
    26. Great Adoption of R by Many Companies. Commercial vendors offering general support and developing specific R-based products, e.g. Oracle, Revolution Analytics. Companies using R for advanced statistics and analytics, e.g. Thomas Cook, Google, Twitter. In the AE customer base, too, we see different companies looking into R as an alternative or complement to the traditional tools.
    27. Example Packages. twitteR: provides an interface to the Twitter web API. tm: provides text mining functionality such as word stemming, stopword removal, etc. wordcloud: provides methods for producing word clouds in different forms, shapes and colors.
    28. Apache Hadoop. Open-source software framework. Storage and large-scale processing of data on clusters of commodity hardware. Apache top-level project built and used by a global community. Two core components: (1) Hadoop Distributed File System (HDFS); (2) MapReduce.
    29. Apache Hadoop. MapReduce/HDFS are based on Google's MapReduce and Google File System. Other components are: Hadoop Common (libraries and utilities needed by other Hadoop modules) and Hadoop YARN (a resource-management platform). The entire Apache Hadoop “platform” is now commonly considered to consist of a number of related projects as well: Pig, Hive, HBase,… Created by Doug Cutting and Mike Cafarella in 2005, originally to support distribution for the Apache Nutch search engine project. All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework.
    30. The World of Hadoop
    31. Key Properties of Apache Hadoop. Transforms commodity hardware into a service that stores petabytes of data reliably and allows huge distributed computations. Key properties: designed for batch processing; write-once-read-many access model for files; extremely powerful. Scalability: scales linearly with cores and disks; machines can be added and removed from the cluster; write code once, same program runs on 1, 1000, 4000 machines. Reliable and fault-tolerant: failed tasks/data transfers are automatically retried; data replication, redundancy.
    32. A Typical Hadoop Cluster. (1) Client. (2) Master Node: Name Node, Job Tracker. (3) Slave Nodes: Data Nodes, Task Trackers, Map / Reduce. [Diagram: Racks 1 to 3 of slave nodes, each node with a Data Node and a Task Tracker running Map and Reduce; arrows labelled data assignment to nodes, data read, data write, metadata for block info, job assignment, task assignment.]
    33. Hadoop File System (HDFS). (1) Client consults Name Node. (2) Client writes block to Data Node. (3) Data Node replicates block. (4) Cycle repeats for next blocks. [Diagram: Client with a file, Name Node, and Data Nodes 1 to 9 spread over Racks 1 to 3; the Name Node holds metadata for block info (Rack 1: Data Node 1, Data Node 2, …; Rack 2: Data Node 3, …); arrows labelled data assignment to nodes, data read, data write.]
    34. MapReduce. Stages: Input, Splitting, Map, Shuffle, Sort, Reduce, Output. Map output per split: (the, 1; quick, 1; brown, 1; fox, 1), (the, 1; fox, 1; ate, 1; the, 1; mouse, 1), (how, 1; now, 1; brown, 1; cow, 1). After shuffle and sort: the, 1, 1, 1; fox, 1, 1; quick, 1; brown, 1, 1; ate, 1; mouse, 1; how, 1; now, 1; cow, 1. Reduce output: the, 3; fox, 2; quick, 1; brown, 2; ate, 1; mouse, 1; how, 1; now, 1; cow, 1. The Map function processes one line at a time, splits it into tokens separated by whitespace and emits a key-value pair <word, 1> for each token. The Reduce function just sums up the values, which are the occurrence counts for each key (i.e. the words in this example).
    35. Hadoop Distributions. Fully equipped, scalable and flexible cloud solutions. Different on-premise solutions are also offered. The choice depends on specific requirements: data privacy, scalability, security, data mastership, configuration, flexibility, price-performance ratio, automation,… How to get started? Free to download! The business model is based on training, consulting, support and additional “tooling” (Enterprise Editions). Many free trial cloud versions are available to play around with. Many tutorials, trainings, blogs, user groups etc.
    36. RHadoop: combining the advantages of R with the power of Hadoop. A collection of four R packages that allow users to manage and analyze data with Hadoop. rmr: Hadoop MapReduce functionality in R. rhdfs: file management of the HDFS from within R. rhbase: database management for the HBase distributed database. Recently a new package, plyrmr, was released, providing a familiar interface while hiding many of the MapReduce details (like Hive, Pig and Mahout). R and all RHadoop packages should be installed on all nodes in the Hadoop cluster.
    37. MapReduce Wordcount Example in R. [Code screenshot with callouts: Map function; Reduce function; reading the input from HDFS with from.dfs(); writing the results back to HDFS with to.dfs().]
    38. Case: A Customer Intelligence Platform. * Non Disclosure Agreement: contact AE via www.ae.be/contact for more information.
    39. Conclusions. The Digital Age brings many opportunities but also challenges. Big Data and Analytics can face the challenges and realize the opportunities. It is within anyone’s grasp; do it incrementally and iteratively. R and Hadoop: open source software, active user groups and support; a great way to start exploring; combined power gives you the advantage of 1 + 1 = 3; sometimes alternatives are better.
    40. Conclusions. You don’t always need Big Data to do Analytics; it depends on the requirements. Hadoop cloud solutions are scalable, flexible and cost-efficient, but sometimes limited in functionality (or not standardized). There are many differences between Hadoop distributions, which are constantly evolving (and getting better). There is a need for good Data Scientists in a team with mixed competences to make the right choices.
    41. What’s next? Ask yourself the following questions: What opportunities do I see for myself? What strategic and competitive advantages can I realize? Is Analytics the right solution for me? Do I need Big Data? What about my Data Warehouse environment? And what about the quality of my operational data? Do I have the right infrastructure in place? Do I have the right competences in house? Now you should know what’s in it for you, but also the challenges you will most probably be facing.
    42. What’s next? You have a case you would like to discuss…? You have any questions…? Please feel free to contact me: Bram Vanschoenwinkel, Bram.Vanschoenwinkel@ae.be, +32 478 741738, @bvschoen, be.linkedin.com/in/bramvanschoenwinkel/
    43. @bvschoen / @ae_nv, www.ae.be, blog.ae.be
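
The word count walkthrough on slide 34 can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle and reduce steps described there: it runs in a single process, uses neither Hadoop nor the rmr package, and is not the R code shown on slide 37. The function names are hypothetical, and the three input lines are an assumption reconstructed from the key-value pairs on the slide.

```python
from collections import defaultdict


def map_line(line):
    """Map step: split one line on whitespace and emit a (word, 1) pair per token."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle/sort step: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the occurrence counts collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    mapped = [pair for line in lines for pair in map_line(line)]
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Input lines inferred from the slide's key-value pairs (an assumption).
lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
counts = word_count(lines)
print(counts)
```

With these three lines the result matches the reduce output on the slide: "the" occurs 3 times, "fox" and "brown" twice, and every other word once. On a real cluster the map and reduce functions would run in parallel on different nodes, and the framework would take care of the shuffle.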