Café da manhã - São Paulo - Use-cases and opportunities in BigData with Hadoop


Use-cases and opportunities in BigData
Return on experience with Hadoop

* Introduction to BigData & Hadoop Technology
* Market Insights and Typical use-cases
* NetApp technology for Hadoop
* Best practices for your first project with Hadoop



  1. 1. Use-cases and opportunities in BigData Return on experience with Hadoop 28 Nov. 2013 © OCTO 2013 Rua Funchal, 411 5º andar Vila Olimpia São Paulo - BRASIL Tel: +55.11.3468.01.03 1
  2. 2. Octo and Big Data Octo Technology has been investing in the big data market since 2010: R&D, training, partnership development. We provide consulting services to our customers: use-case and opportunity/feasibility studies, solution choice for Big Data projects, architecture design of Big Data solutions, Big Data/NoSQL solution deployment, training. The Octo Technology Big Data unit today comprises a team of 12 dedicated people: technical experts and data analysts. So far we have carried out some 20 Big Data projects, mainly big data studies and PoCs, plus deployments of NoSQL solutions, in very different sectors: insurance, banking, logistics, energy. Technical partnerships with the biggest players on the market (see next slide) 2
  3. 3. Octo expertise & partners on Big Data Hadoop ecosystem, Complex Event Processing, High Performance Computing, NoSQL, Cloud, DevOps. OCTO has expertise in most of the solutions on the market. Our multiple partnerships allow us to remain completely independent of solution vendors 3
  4. 4. Big Data @ OCTO: some data 20: number of conferences on Big Data organized by Octo so far. 850 nodes: largest Hadoop cluster deployed by Octo. 250 TB: biggest volume of data analyzed by Octo. 800 TB: largest storage volume used by Octo during a Big Data project. 16: number of partnerships of Octo with major players in the Big Data market. 80: number of Octo consultants trained on at least one Big Data solution 4
  5. 5. Speakers Clement ROUQUIE Director BRAZIL OCTO Diego Flaborea System Engineer NetApp Mathieu DESPRIEE Senior Architect OCTO Wagner Roberto DOS SANTOS Architect OCTO 5
  6. 6. Agenda Introduction to BigData & Hadoop Technology Market Insights and Typical use-cases NetApp technology for Hadoop Best practices for your first project with Hadoop 6
  7. 7. Introduction to BigData and Hadoop © OCTO 2012 2013 7
  8. 8. Big-data is like teenage sex: everyone talks about it, nobody knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it! 8
  9. 9. Origins of Big Data Management consulting firms (McKinsey, BCG, Gartner, …) predicted a big economic change, and Big Data is part of it. Web giants (Google, Amazon, Facebook, Twitter, …) implemented BigData solutions for their own needs. IT vendors (NetApp, IBM, VMware, …) now follow this movement, trying to get a hold on this very promising business 9
  10. 10. data deluge ! 10
  11. 11. Data and Innovation Data we traditionally manipulate (customers, product catalog…) Innovation is here ! 11
  13. 13. The three Vs: Velocity (real time, second, hour, day), Variety (file, API, web, social networks; structured, text, audio, video), and Volume (MB, GB, TB, PB) 13
  15. 15. Is there a clear definition? Super datawarehouse? Big databases? NoSQL? Low-cost storage? Unstructured data? Cloud? Real-time analysis? Internet intelligence? Open Data? There is no clear definition of Big Data: it is at once a business ambition and a set of technological opportunities 15
  16. 16. Big Data: proposed definition Big Data aims at getting an economic advantage from the quantitative analysis of internal and external data 16
  17. 17. Technology 17
  18. 18. Exponential growth of capacities CPU, memory, network bandwidth, storage… all of them have followed Moore’s law 18
  19. 19. Disk throughput has not kept up: IBM DTTA 35010 (1990): 0.7 MB/s; Seagate Barracuda ATA IV: ~30 MB/s; Seagate Barracuda 7200.10 (2010): 64 MB/s. Storage capacity grew ×100,000 over the same period, while throughput grew only ×91. We can store 100,000 times more data, but it takes 1,000 times longer to read it! 19
  20. 20. Traditional architectures are limited. « Traditional » architectures (RDBMS, application server, ETL, ESB) hit limits in four application families. Storage-oriented applications (IO bound): over 10 TB, « classical » architectures require huge software and hardware adaptations. Event-flow-oriented applications (streaming): over 1,000 events/second, the same holds. Transaction-oriented applications (TPS): over 1,000 transactions/second, the same holds. Computation-oriented applications (CPU bound): over 10 threads per CPU core, sequential programming reaches its limits (IO). The answers are, respectively: distributed storage (share nothing), Event Stream Processing, XTP, and parallel processing 20
  21. 21. Emerging families. Storage-oriented applications (IO bound): the Hadoop ecosystem offers distributed storage, but also distributed computing using MapReduce; NoSQL: distributed non-relational stores; NewSQL: SQL-compliant distributed stores. Event-flow-oriented applications (streaming): CEP (Complex Event Processing) and ESP (Event Stream Processing). Transaction-oriented applications (TPS): NoSQL and NewSQL stores. Computation-oriented applications (CPU bound): grid computing on CPU or on GPU; in-memory analytics solutions distribute the data in the memory of several nodes to obtain a low processing time 21
  23. 23. Hadoop : a reference in the Big Data landscape Open Source • Apache Hadoop Main distributions • Cloudera CDH • Hortonworks HDP • MapR Commercial • Greenplum (EMC) • IBM InfoSphere BigInsights (CDH) • Oracle Big data appliance (CDH) • NetApp Analytics (CDH) •… Cloud • Amazon EMR (MapR) • RackSpace (HDP) • VirtualScale (CDH) •… 23
  24. 24. Hadoop Distributed File System (HDFS) Key principles: storage of files larger than a single disk; data distributed across several nodes; data replication to ensure fail-over, with « rack awareness »; use of commodity disks instead of a SAN 24
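The rack-awareness principle above can be illustrated with a toy placement simulation. This is a sketch of the idea only, not the actual HDFS placement code; the function names and topology structure are made up for illustration:

```python
import random

def place_replicas(block_id, writer_node, nodes_by_rack, n_replicas=3):
    """Toy sketch of HDFS-style rack-aware placement: first replica on the
    writer's node, second on a different rack (to survive a rack failure),
    third on that same remote rack but on a different node."""
    writer_rack = next(r for r, ns in nodes_by_rack.items() if writer_node in ns)
    replicas = [writer_node]
    # Second replica: pick any node on a different rack
    remote_rack = random.choice([r for r in nodes_by_rack if r != writer_rack])
    second = random.choice(nodes_by_rack[remote_rack])
    replicas.append(second)
    # Third replica: same remote rack, different node (cheap intra-rack copy)
    third_candidates = [n for n in nodes_by_rack[remote_rack] if n != second]
    if third_candidates:
        replicas.append(random.choice(third_candidates))
    return replicas
```

With a two-rack topology, a block written from `n1` always keeps one replica locally and two on the other rack, so losing either rack leaves at least one copy readable.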
  25. 25. Hadoop distributed processing: MapReduce Key principles: parallelize and distribute processing; each task works on a smaller data volume, so each unit of work is quicker; co-location of processing and data 25
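The MapReduce principle can be sketched in a few lines of plain Python, simulating in a single process the map, shuffle and reduce phases that Hadoop distributes across nodes (a pedagogical sketch, not the Hadoop API):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit a (word, 1) pair for each word in the input line
    return [(w.lower(), 1) for w in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate all values emitted for one key
    return key, sum(values)

def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())

# word_count(["big data big"]) → {"big": 2, "data": 1}
```

In a real cluster the map tasks run on the datanodes holding each block (co-location), and the shuffle moves only the intermediate key/value pairs over the network.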
  26. 26. Overview of Hadoop architecture: distributed storage; distributed processing; orchestration; advanced processing; querying; integration with the information system; monitoring and management 26
  27. 27. Available tools in a typical distribution (CDH). Data import: Sqoop, Flume, Scribe, Chukwa. Processing: MapReduce, YARN (v2), Impala. Languages and libraries: Pig, Cascading, Hive, Mahout, HAMA, Giraph. Orchestration: Oozie, Azkaban. Storage: HDFS, HBase. Interfaces and management: Web console, Hue, CLI, Cloudera Manager 27
  28. 28. Hadoop ecosystem today. Compute (on YARN, MR/Tez): Spark, Impala, Stinger, Hawq, Drill, Storm (streaming), Solr (search), Oozie (orchestration). Analytics and machine learning: Mahout, HAMA, Giraph, RHadoop, sklearn, nltk, pandas, Python, R, SAS tools. Query and ETL APIs: MR Java API, Cascading, Pig, Hive, Talend. Usages: batch, analytical queries, ETL, interactive, transactional, search, streaming, scientific computing. Storage systems (behind a storage API): distributed FS (HDFS, GlusterFS, S3, Isilon, MapRFS, local FS) and NoSQL-based (HBase, Cassandra, DynamoDB, Ceph, Ring, OpenStack Swift). Import/export: CLI, Sqoop, Flume, Storm, ETL (Talend, Pentaho) 28
  30. 30. Limits of traditional BI architectures. ETL tools become bottlenecks: ETL does not scale well, and too much time is spent moving the data from operational stores into the ODS and DWH. Traditional DWH are not adapted to new sources of data: changing schemas, semi-structured or unstructured data. And feeding the datamarts means moving the data again! 30
  31. 31. Hadoop can help improve the BI architecture. Data from operational stores can be stored fast in HDFS, and transformed “in-place” using MapReduce, processing languages like Pig, or streaming. This approach is called E-L-T: Extract, Load, then Transform. BI reporting tools (SAS, Tableau Software, Qliktech, …) can also query the data stored in Hadoop using Hive, or other libraries, more or less interactively 31
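An E-L-T transform step can be as small as a Hadoop Streaming-style mapper: raw lines are loaded into HDFS untouched, then cleaned in place into the tab-separated layout a Hive table expects. The pipe-delimited log format below (timestamp|user|url) is a hypothetical example, not the format of any project mentioned here:

```python
import sys

def transform(line):
    """Transform one raw, pipe-delimited log line (hypothetical format:
    timestamp|user|url) into a tab-separated record for Hive."""
    ts, user, url = line.rstrip("\n").split("|")
    return "\t".join([ts.strip(), user.strip().lower(), url.strip()])

if __name__ == "__main__":
    # When used as a Hadoop Streaming mapper: raw lines in, clean records out
    for raw in sys.stdin:
        if raw.strip():
            print(transform(raw))
```

Because the transform runs as MapReduce tasks next to the data, it scales with the cluster instead of being bottlenecked in a central ETL server.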
  32. 32. Summary of Hadoop What Hadoop is: a distributed storage system, combined with a framework for distributed batch processing; a platform with linear scalability, designed for commodity hardware; complementary to traditional BI systems, with lower price/performance ratios. What Hadoop is not, as of today: not a database with random access to data; not yet mature for real-time or interactive querying; not enough on its own: you need to add visualization tools, processing libraries, and other elements related to your project 32
  33. 33. Q/A 33
  34. 34. Market Trends in Europe 34
  35. 35. Types of projects launched in 2012-2013 Data Science = data mining and learning on business signals: innovation projects, launched directly by a business department, with or without the IT department; exploration of new data sources (clickstream, logs, social…); iterative projects, average budget around 100-200 k€, roughly 50-100 k€ per step. IT Optimization = data-warehouse offloading: streamlining of BI appliances (Teradata, Oracle, …) with Hadoop; IT projects, with objectives of cost-cutting and technical improvement; building a hybrid architecture with Hadoop as raw storage and ETL to offload a massive data warehouse (over 40 TB); project budgets around 1 M€ CAPEX and 300 k€ OPEX, with a clear ROI 35
  36. 36. Main use cases by sector (projects launched in 2012-2013). Retail Banking (Data Science): behavioral marketing, savings market trends. Corporate & Investment Banking (Data Science): trade analytics, risk computation; (IT Optimization): market data repository. Insurance (Data Science): behavioral marketing, health and savings market trends, proactive customer care, behavioral churn. Telecoms (Data Science): fail prediction, capacity prediction, QoS data labs; (IT Optimization): mobile data log repository. E-Commerce & media (Data Science): marketing data labs, behavioral marketing. Utilities (IT Optimization): smart metering repository 36
  37. 37. Perspectives for 2014 Q3-2013 seems to have been a turning point in the Big Data Analytics market in Europe: executive committees are supporting Data Science projects as strategic projects; Big Data Analytics projects are included in the 2014 budget plans, with budgets over 500 k€; open positions for Data Scientists. Sectors where this topic seems to be of highest interest: retail banks, telecom, e-commerce, plus insurance and energy (distribution) 37
  38. 38. Use-cases 38
  40. 40. Behavioral analysis of churners on channels: web, mobile, call-center. Objective: anticipate churn. The marketing dept wanted to analyze new data sources (mobile internet logs), previously ignored because of their size (250 TB for 6 months of data). “Data Lab” project: IT and Marketing joined in the same team; elaboration of a platform to store, process, analyze and discover the behavior of churners, using machine-learning algorithms; identification of patterns; marketing rules to make proposals. Duration: 7 months 40
  41. 41. Architecture Mobile internet logs: 250 TB of data to analyze for churn patterns. Cluster of 8 datanodes + 2 master/support nodes, for a total of 96 × 3 TB disks and 128 CPUs. Cloudera CDH 4. Tools: Hive, Pig, Mahout, R… Behavior analysis: identification of patterns, and marketing rules. Web portal: proposals in real time. It is planned to scale up the cluster to 40 nodes 41
  43. 43. Analysis of social data to identify correlations with health-insurance claims. Objective: anticipation of health claims, to improve internal prediction models; introduction of statistical variables computed from the analysis of social data (medical forums). Realization: collection of text from forums and other social data; natural language processing (text cleaning and analysis); semantic learning (medical concepts), to identify trends; identification of correlations (keyword correlations) in datasets with more than 10 million variables; data visualization to evaluate results with business experts. Technology: Hadoop on Amazon EC2; machine learning: Python, CloudSearch, NLTK, sklearn. Duration: 6 months 43
  45. 45. Customer interaction timeline in a cross-channel context (web + call-center). Objective: improve the knowledge of customer behavior, and improve the quality of customer care. Realization: collection of data from web, CRM and call-center; analyses using a timeline approach; determination of typical behaviors; creation of real-time rules and alerting for web and call-centers. (The slide shows a mock customer record with its event timeline: case created, incoming and outgoing calls with durations and wait times, web-portal and FAQ sessions with pages visited.) 45
  46. 46. (bank, confidential) Analyses. Axes of analysis: Axis 1: collect existing data, to search for correlations with customer behavior; Axis 2: use data from credit-card expenses; Axis 3: search for social data (Twitter, Facebook) related to customers; Axis 4: fraud analysis. Business usage: timelines (a database allowing one to visualize and navigate a customer’s events, in the form of a timeline); personalized direct marketing; customer care and call-center rules; typical customer behaviors (machine learning), identifying purchase, churn, claims, default and fraud; real-time alerts in e-banking; digital banking trends; remarketing and digital marketing; centers of interest in communities; evaluation of competitors; community management 46
  47. 47. (bank, confidential) Architecture. Data collection and storage (Hadoop / Spark, Python scripts): data preparation, feature extraction, feature engineering, feature qualification, large-scale machine learning (Mahout or mixture of experts), NLP (NLTK), MapReduce scripting (Python), SQL (Hive), ELT (Pig). Data mining (R / Python sklearn / SAS): sample qualification, statistical dataviz, machine learning, statistics. ElasticSearch: drill-down, interactive analysis, search. Reporting: custom Python, D3.js, Highcharts.js, Tableau Software, serving marketing & analysts as well as IT 47
  49. 49. Real-Time Analytics. Inputs, both data in motion and data at rest: smart metering data stream, weather forecast, static or dynamic prices, network data, customer data. Processing: Storm, distributed complex event processing on Hadoop, combined with machine learning and storage. Outputs: aggregates and forecasts 49
  50. 50. Input Data Forecast ~1.5 M smart-meter measurements processed per second to compute forecasts (6-node cluster) 50
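The per-meter aggregates such a stream feeds can be sketched as a rolling-window average over the latest measurements. This is an illustrative single-process sketch of the kind of state a Storm bolt might keep per meter, not the project's actual topology:

```python
from collections import deque

class WindowAverage:
    """Rolling mean over the last `size` measurements of one smart meter.
    Each push() is O(1): a running total is updated instead of re-summing."""
    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0.0

    def push(self, value):
        self.window.append(value)
        self.total += value
        if len(self.window) > self.size:
            # Evict the oldest measurement once the window is full
            self.total -= self.window.popleft()
        return self.total / len(self.window)
```

At ~1.5 M measurements per second, the constant-time update per event is what matters; the same idea generalizes to sums, counts and min/max per time window.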
  52. 52. NetApp Confidential Wireless provider leverages Hadoop. Telco industry: provides wireless voice and data services globally. Business challenge: consolidate large amounts of raw customer log data from multiple data centers into one data center; run analytical queries on the consolidated data, which currently can’t be done with existing tools. Solution: NetApp Open Solution for Hadoop, an eight-node cluster for ingesting, storing and compressing data; Solr and Lucene for indexing, HBase for querying the indexed data. Benefits: PoC with 660 GB of data consolidated and indexed, 1.125 billion records processed in six hours; Hadoop storage failover without service interruption; new data processing and analytics capabilities 52
  53. 53. Q/A 53
  54. 54. NetApp technology for Hadoop 54
  55. 55. Q/A 55
  56. 56. Best practices for your first analytics project with Hadoop 56
  57. 57. Check that Hadoop is a good choice Hadoop is not a replacement for a database technology. Hadoop is easy to SCALE, but it is a complex technology. Hadoop is batch-oriented: real-time processing and interactive querying tools are emerging, but they are still young. If you have less than a few TB of data, you don’t need Hadoop 57
  58. 58. Project Framing Cluster setup Project Team setup Data collect / Data quality Analytics Iterations 58
  59. 59. Project Framing Identify a data source you want to explore, with a potential business value. Short-list and choose one business question to evaluate, related to this data. Define your analytics needs at a macro level: “classical analytics” (aggregates and reports); exploratory, with data visualization; statistical, data mining, machine learning. This will help you choose the tools in the ecosystem. Determine the technological constraints: volume, latency (batch, or not), data quality, integration with the rest of IT. Size your cluster 59
  60. 60. Cluster setup This step requires your attention! Hadoop uses commodity hardware, but it is probably not the kind of machine you are used to in your datacenter: 2U, internal storage, high memory… Consider using the solution of a provider like NetApp, or using Hadoop in the cloud. Benchmark your brand-new cluster before actually starting the project: lots of configuration parameters are involved… Set up all the tools around Hadoop 60
  61. 61. Project Team setup This is innovative technology, and a data-science project is an innovative project: you need an adapted project management. Co-locate the people: business and data analysts, architects, developers, infrastructure/ops. Use Agile practices: work iteratively, with short cycles or sprints (1 or 2 weeks); choose small and achievable objectives for each sprint; use Agile rituals (stand-ups, retrospectives…). Train your team: a Hadoop project requires people skilled in Hadoop infrastructure, Hadoop development and data science. Hire experts, and organize the knowledge transfer from them to your team 61
  62. 62. Data collect / Data quality As in a classical “data project” (like in BI), an important part of the effort will be related to data quality: preparation of the data, clean-up, data transformation. Don’t under-estimate this 62
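A minimal sketch of the kind of clean-up pass involved, assuming a hypothetical record schema (customer_id, signup_date, amount) that is not from any of the projects above:

```python
def clean_record(raw):
    """One data-quality pass over a raw record: trim whitespace, normalize
    date separators, parse decimal commas, and reject unusable rows (None)."""
    cid = raw.get("customer_id", "").strip()
    date = raw.get("signup_date", "").strip().replace("/", "-")
    amount_str = raw.get("amount", "").strip().replace(",", ".")
    try:
        amount = float(amount_str)
    except ValueError:
        return None  # non-numeric amount: unusable row
    if not cid or amount < 0:
        return None  # missing key or impossible value
    return {"customer_id": cid, "signup_date": date, "amount": amount}
```

In a Hadoop project this logic would typically live in a Pig script or a streaming mapper, but the substance is the same: most of the effort is deciding these rules, not writing them.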
  63. 63. Analytics iterations A typical iteration of data analytics: select a subset of data; select a machine-learning algorithm to use on it; prepare the data (explore, filter, enrich…); split it into a training dataset and a test dataset; execute the algorithm; measure the prediction error; visualize the results; draw a conclusion from the test with this algorithm, and adapt for the next iteration: other data? another algorithm? a technical issue to solve? And start again! 63
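The split/train/measure loop of one iteration can be sketched with a deliberately trivial one-dimensional model. This is purely illustrative (a real iteration would use Mahout, sklearn or R); the point is the shape of the loop, not the model:

```python
import random

def train_test_split(data, test_ratio=0.3, seed=42):
    """Shuffle deterministically, then hold out `test_ratio` of the data."""
    rnd = random.Random(seed)
    shuffled = data[:]
    rnd.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

def fit_threshold(train):
    """Trivial model: pick the threshold on x that best separates the labels
    on the training set (predicting label = x > threshold)."""
    best_t, best_err = 0.0, float("inf")
    for x, _ in train:
        err = sum((xi > x) != yi for xi, yi in train)
        if err < best_err:
            best_t, best_err = x, err
    return best_t

def error_rate(threshold, test):
    """Measure prediction error on the held-out test set."""
    return sum((x > threshold) != y for x, y in test) / len(test)
```

Each iteration then compares this error against the previous one before deciding what to change: other data, another algorithm, or a technical fix.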
  64. 64. Q/A 64