Big Data : Manage, Refine, Analyze


This session offers a pragmatic tour of the Big Data landscape. We first put the Big Data question back into its business and technology contexts, then zoom in on the Hadoop technologies and their different implementation options.

Published in: Technology, Business
  • As the volume of data has exploded, we increasingly see organizations acknowledge that not all data belongs in a traditional database. The drivers are both cost (as volumes grow, database licensing costs can become prohibitive) and technology (databases are not optimized for very large datasets). Instead, we increasingly see Hadoop, and HDP in particular, being introduced as a complement to the traditional approaches. It is not replacing the database but complementing it, and as such it must integrate easily with existing tools and approaches. This means it must interoperate with: existing applications such as Tableau, SAS, and Business Objects; existing databases and data warehouses, for loading data to and from the data warehouse; development tools used for building custom applications; and operational tools for managing and monitoring.
  • Across our entire user base, we have identified just three separate usage patterns; sometimes more than one is used in concert during a complex project, but the patterns are distinct nonetheless. These are Refine, Explore, and Enrich. The first of these, the Refine case, is probably the most common today. It is about taking very large quantities of data and using Hadoop to distill the information down into a more manageable data set that can then be loaded into a traditional data warehouse for use with existing tools. This is relatively straightforward and allows an organization to harness a much larger data set for its analytics applications while leveraging its existing data warehousing and analytics tools. Using the graphic here: in step 1 data is pulled from a variety of sources, into the Hadoop platform in step 2, and then in step 3 loaded into a data warehouse for analysis by existing BI tools.
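The Refine flow described above (capture raw records, parse and cleanse them, aggregate, then hand the result to the warehouse) can be sketched as a Hadoop-Streaming-style job in plain Python. This is a minimal illustration only: the tab-separated log format and the page-view aggregation are hypothetical, not taken from the deck.

```python
from itertools import groupby

def mapper(lines):
    """Map step: parse and cleanse raw log lines, emitting (key, 1) pairs.
    Hypothetical record format: timestamp<TAB>user<TAB>page."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:          # cleanse: drop malformed records
            continue
        _, _, page = parts
        yield page, 1

def reducer(pairs):
    """Reduce step: aggregate counts per key. The distilled output is what
    would be pushed to the data warehouse (e.g. via Sqoop) in step 3."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(v for _, v in group)

if __name__ == "__main__":
    raw = [
        "2013-02-12T10:00\tu1\t/home",
        "2013-02-12T10:01\tu2\t/home",
        "corrupt line",
        "2013-02-12T10:02\tu1\t/cart",
    ]
    print(dict(reducer(mapper(raw))))  # → {'/cart': 1, '/home': 2}
```

In a real Hadoop Streaming deployment the mapper and reducer would read from stdin and write to stdout on separate processes; the in-process pipeline above only shows the data flow.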
  • A second use case is what we would refer to as Data Exploration; this is the use case most commonly in question when people talk about "Data Science". In simplest terms, it is about using Hadoop as the primary data store rather than performing the secondary step of moving data into a data warehouse. To support this use case, the BI tool vendors have rallied to add support for Hadoop, most commonly HDP, as a peer to the database, allowing rich analytics on extremely large datasets that would be both unwieldy and costly in a traditional data warehouse. Hadoop allows for interaction with a much richer dataset and has spawned a whole new generation of analytics tools that rely on Hadoop (HDP) as the data store. To use the graphic: in step 1 data is pulled into HDP, it is stored and processed in step 2, before being surfaced directly into the analytics tools for the end user in step 3.
  • The final use case is called Application Enrichment. This is about incorporating data stored in HDP to enrich an existing application. This could be an online application in which we want to surface custom information to a user based on their particular profile. For example: if a user has been searching the web for information on home renovations, in the context of your application you may want to use that knowledge to surface a custom offer for a related product that you sell. Large web companies such as Facebook are very sophisticated in the use of this approach. In the diagram, this is about pulling data from disparate sources into HDP in step 1, storing and processing it in step 2, and then interacting with it directly from your applications in step 3, typically in a bi-directional manner (e.g. request data, return data, store response).
  • In the current developer preview on www.hadooponazure.com, data stored in ASV can be accessed directly from the Interactive JavaScript Console by prefixing the protocol scheme of the URI for the assets you are accessing with asv://. To use this feature in the current release, you will need HDInsight and Windows Azure Blob Storage accounts. To access your storage account from HDInsight, go to the Cluster and click on the Manage Cluster tile.
  • Azure Vault Storage (ASV) and the Hadoop Distributed File System (HDFS) implemented by HDInsight on Azure are distinct file systems, optimized respectively for the storage of data and for computations on that data. ASV provides a highly scalable and available, low-cost, long-term, shareable storage option for data that is to be processed using HDInsight, and with ASV you still process across all nodes in the cluster. The use case for using Azure Blob Storage as the backing store for your data is that you can scale compute independently of data (e.g., you can spin up a Hadoop cluster only when you need it and keep your data in blob storage). When data is stored in ASV, your map/reduce jobs still run across multiple nodes. The Hadoop clusters deployed by HDInsight on HDFS are optimized for running Map/Reduce (M/R) computational tasks on the data: HDInsight clusters are deployed in Azure on compute nodes to execute M/R tasks and are dropped once those tasks have completed. Keeping the data in the HDFS clusters after computations have completed would be an expensive way to store it. ASV provides a full-featured HDFS file system over Azure Blob Storage (ABS). ABS is a robust, general-purpose Azure storage solution, so storing data in ABS enables the clusters used for computation to be safely deleted without losing user data. ASV is not only low cost: it has been designed as an HDFS extension to provide a seamless experience to customers by enabling the full set of components in the Hadoop ecosystem to operate directly on the data it manages. Storage is located remotely from the worker nodes (no data locality optimization); we have re-architected the networking infrastructure in our datacenters to accommodate the Hadoop scenario. All up, we have an incredibly low overhead/subscription ratio for networking, which means we can have a lot of throughput between Hadoop and Blob.
With the right storage account placement and settings, a Medium VM can read from Azure Blob just as fast as it can read from the local disk. However, a single storage account is limited in size and overall transfer rate, so in order to scale out beyond these limitations you will have to add storage accounts to your cluster. We are working to improve these numbers all the time. Regarding cluster VM placement, you decide at which data center the cluster will be deployed; as long as your storage account is placed at the same data center, you will get good throughput. Regarding copying data between ASV and HDFS, you can use 'hadoop fs -cp hdfs://... asv://...' to copy files from HDFS to ASV (and vice versa). In the upcoming release of HDInsight on Azure, ASV will be the default file system.
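The map/combiner/reducer flow shown on the "Data locality optimization" slide (gender counts over Ip.csv split across three data nodes) can be simulated in plain Python. This is a sketch only: it mimics the slide's per-node partial counts with fabricated record lists, not real Hadoop APIs, and the actual row ranges of Ip.csv are not modeled.

```python
from collections import Counter

# Fabricated per-node records chosen to reproduce the partial counts
# shown on the slide for each data node's split of Ip.csv.
splits = {
    "DataNode1": ["M"] * 100,                # combiner → (F;0,   M;100)
    "DataNode2": ["F"] * 42 + ["M"] * 41,    # combiner → (F;42,  M;41)
    "DataNode3": ["F"] * 100 + ["M"] * 300,  # combiner → (F;100, M;300)
}

def map_task(records):
    """Map: emit (gender, 1) for each valid record in the local split."""
    return [(g, 1) for g in records if g in ("F", "M")]

def combiner(pairs):
    """Combiner: pre-aggregate on the data node to cut shuffle traffic."""
    return Counter(g for g, _ in pairs)

def reducer(partials):
    """Reducer: merge the per-node partial counts into the final totals."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

partials = [combiner(map_task(records)) for records in splits.values()]
print(reducer(partials))  # → Counter({'M': 441, 'F': 142})
```

The combiner is the key point of the slide: only three small partial counts cross the network to the reducer, instead of 583 individual (gender, 1) pairs.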
  • Camille
  • Storage: HDFS is the distributed file system; ASV is Azure Storage Vault. Task scheduling and execution: Map Reduce is the batch job framework. ETL: PIG is a high-level language that describes job execution and flow. SQL-like: HIVE provides HiveQL, a SQL-like language on top of Map Reduce; SQOOP enables data exchange between relational databases and Hadoop. BI: the Hive ODBC driver is used to move data out of Hadoop from a HIVE table. Programmability: the .NET HDInsight SDK and LINQ to Hive.
  • Camille
  • http://www.editions-eyrolles.com/livres/Windows-8-pour-les-professionnels/
  • Big Data : Manage, Refine, Analyze

    1. Big Data : Manage, Refine, Recycle orenault@hortonworks.com blaisev@microsoft.com
    2. Sign up for the trial offer or activate your Azure MSDN access. Visit the Azure stand (Services & Tools zone). Take part in the prize draw at 6:30 pm on February 12 or 13.
    3. Hadoop: a use-case study. Introduction: motivation; Hadoop in a Microsoft environment; Microsoft scenarios.
    4. Terabytes Gigabytes Megabytes Data Complexity: Variety and Velocity
    5. Volume Velocity Relational Data Variety. Source: IDC's 2012 Vertical IT and Communications Survey
    6. Source: IDC's 2012 Vertical IT and Communications Survey, N=4117
    7. Big Data Challenges. Source: IDC's 2012 Vertical IT and Communications Survey, N=4117
    8. Impact
    9. 010101010101010101 1010101010101010 01010101010101 101010101010
    10. Discover Refine Combine
    11. 0101010101010101011010101010101010 01010101010101 101010101010
    12. OPERATIONAL SERVICES (AMBARI, OOZIE) and DATA SERVICES (FLUME, PIG, HIVE, HBASE, SQOOP, HCATALOG, WEBHDFS) on the Hortonworks Data Platform (HDP), Enterprise Hadoop. HADOOP CORE: MAP REDUCE, HDFS, YARN (in 2.0). PLATFORM SERVICES: enterprise readiness, i.e. high availability, disaster recovery, snapshots, security, etc. HDP is the only 100% open source and complete distribution: enterprise grade, proven and tested at scale, deployable on OS, Cloud, VM, or Appliance, with an ecosystem endorsed to ensure interoperability.
    13. Next-Generation Data Architecture. APPLICATIONS: Business Analytics, Custom Applications, Enterprise Applications. DEV & DATA TOOLS: build & test. OPERATIONAL TOOLS: manage & monitor. DATA SYSTEMS: Hortonworks Data Platform alongside traditional repos (RDBMS, EDW, MPP). DATA SOURCES: traditional sources (RDBMS, OLTP, OLAP, POS systems) and new sources (web logs, email, sensor data, social media). © Hortonworks Inc. 2013
    14. Business Cases: Refine (batch), Explore (interactive), Enrich (online), all on the Hortonworks Data Platform. Big Data: transactions, interactions, observations.
    15. Refine. Collect data and apply a known algorithm to it in a trusted operational process. 1 Capture: capture all data from traditional sources (RDBMS, OLTP, OLAP) and new sources (web logs, email, sensor data, social media). 2 Process: parse, cleanse, apply structure & transform on the Hortonworks Data Platform. 3 Exchange: push to the existing data warehouse (RDBMS, EDW, MPP) for use with existing analytic tools (business analytics, custom applications, enterprise applications).
    16. Explore. Collect data and perform iterative investigation for value. 1 Capture: capture all data. 2 Process: parse, cleanse, apply structure & transform. 3 Exchange: explore and visualize with analytics tools supporting Hadoop.
    17. Enrich. Collect data, analyze, and present salient results for online apps. 1 Capture: capture all data. 2 Process: parse, cleanse, apply structure & transform (data systems: RDBMS, EDW, MPP, NoSQL). 3 Exchange: incorporate data directly into custom and enterprise applications.
    18. Vertical use cases across Refine, Explore, and Enrich:
       Retail & Web: dynamic pricing; log analysis/site optimization; session & content optimization; brand and sentiment analysis; loyalty program optimization; market basket analysis; product recommendation.
       Telco: customer profiling; equipment failure prediction; location-based advertising.
       Government: threat identification; person-of-interest discovery; cross-jurisdiction queries.
       Finance: risk modeling & fraud identification; trade performance analytics; surveillance and fraud detection; customer risk analysis; real-time upsell and cross-sales marketing offers.
       Energy: smart grid production optimization; grid failure prevention; individual power grid; smart meters; dynamic delivery.
       Manufacturing: supply chain optimization; customer churn analysis; replacement parts.
       Healthcare: electronic medical records (EMPI); clinical decision support; insurance premium determination; clinical trials analysis.
    19. Hosting the cluster in Azure: DEPLOYING A HORTONWORKS CLUSTER
    20. https://www.hadooponazure.com/
       http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx
       http://gettingstarted.hadooponazure.com/
       http://gettingstarted.hadooponazure.com/gettingStartedHw.html
    21. Discovering the Hadoop on Azure service: AZURE HD INSIGHT SERVER
    22. [bullet-list slide; text not captured in the transcript]
    23. Data locality optimization. Metadata for Ip.csv (583 rows): DataNode1 holds rows 1-193, DataNode2 rows 194-387, DataNode3 rows 388-583. Each data node runs a Map task over its local split and a Combiner that emits partial gender counts: DataNode1 (F;0, M;100), DataNode2 (F;42, M;41), DataNode3 (F;100, M;300). A single Reducer merges them into the final totals (F;142, M;441).
    24. The same job reading from ASV (ASV://mycontainer/myfolder/Ip.csv): each Map task skips to its assigned range (Skip(0), Skip(194), Skip(388)), its Combiner emits the same partial counts, and the Reducer merges them.
    25. Loading data from ASV to HDFS, running queries, aggregating results: AZURE HD INSIGHT SERVER
    26. Case Study: Data Services Firm Uses Microsoft BI and Hadoop to Boost Insight into Big Data. Klout architecture: Klout.com (Node.js), Mobile (Objective-C), Klout API (Scala), Partner API (Mashery); Registrations DB (MySQL), Profile DB (HBase), Signal Collectors (Java/Scala), Data Enhancement Engine; Warehouse (PIG/Hive), Search Index (Elastic Search), Streams (MongoDB), serving stores; Dashboards (Tableau), Analytics Cubes (SSAS), Perks Analytics (Scala), Event Tracker (Scala); monitoring with Nagios.
    27. Reference architecture: data sources (RDBMS, files, file system) feed data acquisition, storage, and processing (bulk load and connectors into ASV and HDFS; a Name Node with Data Nodes; Map/Reduce with PIG, HIVE, MAHOUT, and Pegasus; CEP; an application server), surfaced to Business Intelligence (reporting, OLAP) and supervised with System Center.
    28. The same architecture mapped onto Microsoft products across Cloud Services, Virtual Machine, and on-premise: data sources (SQL Database, files, file system) flow through StreamInsight and SQOOP into HDInsight Services (ASV, HDFS, Name Node and Data Nodes, Map/Reduce, PIG, HIVE, MAHOUT, Pegasus, Plume); Business Intelligence via SQL Reporting, SSRS, SSAS, and SharePoint; supervision with System Center; hosted on Microsoft Windows Azure.
    29. Aggregating data from multiple sources: AZURE HD INSIGHT SERVER, SQL 2012, POWERPIVOT, POWERVIEW
    30. • Submit changes back to the Apache Foundation • 'Just works' on Windows Azure and Server • Integration with Visual Studio, JavaScript, Excel, etc. • Performance, scale, high availability • Management, ease of use • Security, data governance • Integration with AD and SC • Integrate as part of our overall data platform
    31. https://www.hadooponazure.com/
       http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx
       http://gettingstarted.hadooponazure.com/
       http://gettingstarted.hadooponazure.com/gettingStartedHw.html
       http://weatherservice.cloudapp.net
       http://www.srh.noaa.gov/rfcshare/ffg_download/ffg_download.php
       http://social.technet.microsoft.com/wiki/contents/articles/14320.processing-noaa-flash-flood-guidance-data-in-sql-server.aspx
       http://blogs.msdn.com/b/sqlcat/archive/2013/02/01/mash-up-hive-sql-server-data-in-powerpivot-amp-power-view-hurricane-sandy-2012.aspx
    32. Four books written by 13 Microsoft employees: http://www.editions-eyrolles.com/livres/Windows-8-pour-les-professionnels
    33. © 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, and the other product names are registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The information contained in this document is provided for information purposes only and represents Microsoft Corporation's current opinion on the points discussed as of the date of this presentation. Because Microsoft must respond to changing market conditions, this document should not be interpreted as a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of the presentation. MICROSOFT MAKES NO WARRANTY, EXPRESS, IMPLIED, OR STATUTORY, AS TO THIS PRESENTATION.