Petit-Déjeuner OCTO / Cloudera "Tout pour réussir votre premier projet Hadoop et passer à l’échelle industrielle"

The promises of Big Data are enticing, but delivering on them requires mastering the Hadoop ecosystem, its architecture, and the configuration of a cluster sized for the business needs. This breakfast session offers no theory, only feedback from real projects, in France with OCTO and in the USA with Cloudera.

Topics covered:

Which Hadoop pilot projects to launch in 2013? YARN, Impala, MapReduce, HCatalog, ...
Which software components complete the Hadoop puzzle to deliver a Big Data solution the business can actually use?
How to size and configure a Hadoop cluster that matches the needs?
How to benchmark a cluster's performance?
Best practices and pitfalls to avoid in development
Project feedback from France and the USA


By the end of this session:

You will have a clear picture of what Hadoop and its ecosystem are in 2013
You will know the best practices for cluster sizing
You will be able to select the ecosystem tools that match your needs
You will know, through field experience reports, how to make your Big Data project with Hadoop a success

Statistics

Views
Total Views: 3,081
Views on SlideShare: 2,797
Embed Views: 284

Actions
Likes: 4
Downloads: 1
Comments: 0

5 Embeds (284 views)
http://www.scoop.it: 183
http://www.informatiquenews.fr: 70
https://twitter.com: 23
http://www.linkedin.com: 7
https://www.linkedin.com: 1

Upload Details

Uploaded as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

  • Link to opportunity record in SFDC (valid for SFDC employees only): https://na6.salesforce.com/0068000000eoHgT

    A multinational bank saves millions by optimizing its EDW for analytics and reducing data storage costs by 99%.

    Background: The bank has traditionally relied on a Teradata enterprise data warehouse for most of its data storage, processing and analytics. With the movement from in-person to online banking, the number of transactions and the data each transaction generates have ballooned. The Teradata system was supporting over 330,000 applications that run monthly and 6,000 databases.

    Challenge: The bank wanted to make effective use of all the data being generated, but the Teradata system quickly maxed out. It could no longer handle current workloads, and the bank's business-critical applications were hitting performance issues. ETL processing took 7 days to complete, so the Teradata environment could only be used for analysis during brief periods each month, and the bank was spending millions every year just to back up its data. Regulatory compliance requires 7 years of data to be retained, and it took 5 weeks just to make historical data available for analysis. The bank was forced to either expand the Teradata system (very expensive), restrict user access to lessen the workload, or offload raw data to tape backup and rely on small data samples and aggregations for analytics. IBM and EMC had attempted to alleviate this pain but failed. The strategic data warehouse group within the bank initiated a research project with Georgia Tech students to look into data warehousing projects, which led a student to reach out to Cloudera and ultimately triggered an in-depth POC. During the POC, the bank examined several operational systems and the transformations needed to prepare that data for the warehouse. It had scaled past what its traditional ETL tools could deliver, so those tools were only used to move data into the warehouse, with transformations done inside the warehouse (ELT). The system was spending 44% of its resources on everyday operations such as canned BI reports and 42% on ETL processing (ELT in this case), leaving only 11% for the advanced analytics and data discovery that drive ROI from new opportunities: a very costly use of a platform not meant for that work. The bank quantified how much space and compute power each ELT process consumed in a warehouse supporting hundreds of applications, which in turn quantified the effort (man-hours) needed to implement those processes in Hadoop and which applications would benefit most, in financial and time-related ROI, from migrating. The bank decided to start with SQL-based transformations and implemented 2 applications end to end as part of the POC.

    Solution: After a very in-depth POC involving 30+ representatives from the bank, it deployed Cloudera to offload data processing, storage and some analytics from the Teradata system, freeing up space on the EDW so it could focus on its real purpose: high-value operational and data-discovery analytics. The bank did not migrate the entire system at once; it started with the applications that would deliver the most value and save the most Teradata resources. It initially deployed a small cluster, demonstrating that it could match Teradata's performance at a fraction of the cost.

    Results: Cloudera delivers value through low cost per terabyte, low cost of implementation, compute savings, and the flexibility of Hadoop. With Teradata as the incumbent, the ROI was easy to justify on cost alone: the bank was spending over $180,000 per terabyte on Teradata (unusually high; most Teradata customers probably pay closer to $40,000 per TB), while Cloudera offers $1,000 per terabyte. By offloading data processing and storage onto Cloudera, the bank avoided spending millions to expand its Teradata infrastructure, while reclaiming the 7 days every month that Teradata was spending on data transformations. Expensive CPU is no longer consumed by data processing, and storage costs are a mere 1% of what they were before. Data processing is 42% faster, data center power consumption has been reduced by 25%, and the bank can now process 10 TB of data every day. Cloudera also delivered technical value through flexible scalability: the bank could deploy and test on a small cluster of 15 nodes and watch performance scale linearly with growth, versus having to buy in large chunks as with Teradata.
  • The quant risk LOB within a multinational bank saves millions through better risk exposure and fraud prevention analysis, while avoiding expanding its data warehouse footprint.

    Background: With the movement from in-person to online banking, the bank processes ever more transactions: 2 billion per month. More transactions mean growing data volumes and greater potential to use that data for better, more data-driven fraud prevention.

    Challenge: While opening the door to better fraud prevention, today's frequent banking transactions also necessitate constant revisions to risk profiles, which is data-processing intensive. Detecting fraud is a complex, difficult process requiring a continuous cycle of sampling a subset of data, building a model, finding an outlier that breaks the model, rebuilding the model, and so on. The bank's existing Teradata warehouse was optimized for logical analysis and reporting and had reached its capacity. Expanding the current environment would be very costly, but continuing to operate within it would require more sampling, more aggregation, or moving data to offline tape backup, and would mean ignoring the opportunity, presented by growing digital data volumes, to build better risk and fraud detection models.

    Solution: The bank deployed Cloudera Enterprise as its data factory for fraud detection and prevention and for risk analysis across home loans, insurance and online banking.

    Results: With the new environment, the bank has avoided expanding its expensive Teradata footprint while eliminating data sampling and improving its fraud detection and risk analysis models. It can now look at every incidence of fraud for each person over a 5-year history, and data processing has been offloaded to Hadoop, conserving the expensive Teradata CPU for analytical tasks.
  • A large semiconductor manufacturer has improved the accuracy of its yield predictions by running models on a larger data set: 10 years of data instead of 9 months.

    Background: The manufacturer uses yield models to predict which chips are likely to fail. Those predictions let the company adjust designs and thus minimize failures. The predictive yield models were run on Oracle, based on 9 months of historical data.

    Challenge: The company wanted to improve model accuracy by using a larger data set with a longer history and more granular information, but it could not afford to store more than 9 months of data on Oracle.

    Solution: The manufacturer deployed the Dell | Cloudera solution for Apache Hadoop with HBase, which gives it unlimited scale and more flexible data capture and analysis at 10x lower TCO than traditional data warehouse environments. The company runs a 53-node cluster today and expects to store up to 10 years of data on CDH, amounting to about 10 PB. It can now collect and process data from every phase of the manufacturing process.

    Results: Since deploying the Dell | Cloudera solution, the manufacturer has met its goal of improving the accuracy of its predictive yield models so it can optimize operations. When problems occur with chips, it can answer questions such as: Where and why did the problem occur? Which manufacturing plant did this chip come from? Which components were used? Ultimately, the manufacturer is improving its operational efficiency with the Dell | Cloudera solution for Apache Hadoop.
  • Link to account record in SFDC (valid for Cloudera employees only): https://na6.salesforce.com/0018000000l7Xji

    BlackBerry realized ROI on its Cloudera investment through storage savings alone, while reducing ETL code by 90%.

    Background: BlackBerry transformed the mobile devices market in 1999 with the introduction of the BlackBerry smartphone. Since then, other industry innovators have introduced competing devices, and the company must leverage all the data it can collect to understand its customers, what they need and want in mobile devices, and how to remain an industry leader.

    Challenge: BlackBerry Services generate half a petabyte of data every single day, or 50-60 TB compressed. The company could not afford to store all of it in its relational database, so analytics were limited to a 1% data sample, which reduced the accuracy of the resulting insights, and accessing archived data took a long time. The incumbent system could not cope with the multiplying data volumes or constant access requests; BlackBerry had to pipeline its data flows to prevent the data from hitting disk.

    Solution: BlackBerry deployed Cloudera Enterprise to provide a queryable data storage environment that puts all of its data to use. Today, BlackBerry has a global dataset of ~100 PB stored on Cloudera. The platform collects device content, machine-generated log data, audit details and more. BlackBerry has also converted ETL processes to run in Cloudera, and Cloudera feeds data into the data warehouse. Hadoop components in use include Flume, Hive, Hue, MapReduce, Pig and ZooKeeper.

    Results: BlackBerry's investment in Cloudera was justified through data storage cost savings alone, and by moving data processing over to Hadoop, the ETL code base has been reduced by 90%. The company no longer relies on a 1% data sample for analytics; it can query all of its data, faster, on a much larger data set, and with greater flexibility than before. One ad hoc query that used to take 4 days to run now finishes in 53 minutes on Cloudera. The new environment allowed BlackBerry to do things like predict the impact the London Olympics would have on its network, so it could take proactive measures and prevent a negative customer experience.
  • Link to account record in SFDC (valid for Cloudera employees only): https://na6.salesforce.com/0018000000y2z1Y?srPos=0&srKp=001

    A leading manufacturer of mobile devices and technology identified a hidden software bug that was causing a spike in mobile phone returns.

    Background: The company develops products that connect seamlessly so consumers have the best content at their fingertips 24x7. Its engineering department is responsible for manufacturing mobile phones and for developing a popular mobile platform. In recent years, consumers' use of mobile phones has evolved from making calls to checking email, taking photos and videos, buying things online and more; mobile devices today account for more than 20% of all web traffic in the US.

    Challenge: The volumes of data that need to be collected, stored, explored and analyzed are exploding. Every device generates a massive stream of unstructured data from texts, photos, videos, web browsing, and so on. Today's competitive market requires the company not only to capture more data than ever before, but also to process and act on it rapidly in order to stay innovative. The company's Oracle RAC enterprise data warehouse couldn't keep up.

    Solution: The company today leverages Cloudera Enterprise Core with RTD in conjunction with Oracle RAC; the two platforms work together in a closed-loop analytical process. The company offloads data processing and historical storage from Oracle to CDH, moves data as needed back into Oracle for reporting and analysis, and processes 1 TB of data every day. Oracle houses a few months of recent data for immediate ad hoc and canned reporting by business analysts, whereas CDH is used for historical trend analysis (via Hive) over up to 25 years of history. Oracle contains aggregated data; CDH captures all of the detailed data.

    Results: Hadoop's ability to run large-scale, complex analysis is helping the company gain insights that would otherwise stay hidden. In one case, a carrier selling a popular phone noticed a sudden spike in returns. The carrier raised the issue, and the manufacturer's R&D team started investigating. After collecting data spread across numerous systems and conducting intensive research in CDH, the team found a correlation between the switch to a new hardware supplier for one component in the device and the moment returns of that device started to spike. The new component had the same specs and was actually a better-quality product, with a narrower standard deviation for error. It turned out that the larger deviation in the original component had allowed the software to work properly; when the component's tolerances became stricter, a software bug manifested itself. By using Hadoop to combine carrier data with manufacturing data, the company identified the problem and fixed the software bug.
  • YP (YellowPages.com, previously AT&T Interactive) offloads data processing to Cloudera, which in turn enables new services that are valuable to publishers.

    Background: With the movement from print (publishing the Yellow Pages books) to predominant usage of the web (YellowPages.com), YP's business relies on display ads purchased by publishers and vendors. To keep publishers buying ads, YP needs to offer near real-time analytics so publishers can monitor how their campaigns are doing and make adjustments on the fly.

    Challenge: YP's incumbent SQL Server data warehouse was not a scalable solution, and with increasing data volumes, performance was poor. YP generates 260 million billable web traffic events and 600 million non-billable events every day, and the business demanded that 13 months of billable history and 90 days of non-billable history be kept in the data warehouse so the data would be available for analysis.

    Solution: YP replaced its SQL Server data warehouse with HP Vertica and Cloudera Enterprise. Cloudera serves as the core production traffic processing system that helps the company understand its network quality and traffic, and Vertica is used for reporting and analysis. YP currently runs 315 CDH nodes and holds about 30 TB on Vertica.

    Results: With the new system, YP's data processing completes in hours instead of days. This has enabled YP to launch several new business functions that increase the value it offers publishers, including real-time publisher portals, faster behavioral targeting, real-time traffic analysis, and network quality analytics. With the faster data processing enabled by Cloudera, YP is better equipped to identify the areas of the business most likely to drive revenue.

Petit-Déjeuner OCTO / Cloudera "Tout pour réussir votre premier projet Hadoop et passer à l’échelle industrielle": Presentation Transcript

  • Slide 1: Making your first Hadoop project a success and scaling it to industrial level. In partnership with Cloudera.
  • Slide 2: OCTO and Big Data: a coherent offering spanning technology and predictive analytics, addressed to both IT departments (DIRECTION SI) and business departments (DIRECTION MÉTIER). Big Data IS consulting: study and positioning of solutions for your context, transformation of decision-support systems towards Big Data, Big Data project scoping. Big Data systems architecture: POCs on Hadoop and NoSQL, design and build of systems on Hadoop and NoSQL, Hadoop training. Advanced data analysis consulting: benchmarks of Big Data projects by sector, training of data-mining teams in Big Data techniques, support for business pilot projects. External data collection: identification of data sources, collection and processing of unstructured data, search for economic correlations.
  • Slide 3: The OCTO Big Data Analytics team: a dedicated team of experts and architects in storage and compute clusters, plus statisticians and machine-learning consultants; dedicated R&D on Hadoop, NoSQL and machine learning; very close relationships with the R&D teams of our partners: Cloudera, 10gen (MongoDB), DataStax (Cassandra).
  • Slide 4: Speakers: Julien Cabot, Director Big Data Analytics, OCTO (jcabot@octo.com); Graham Gear, Systems Engineer, Cloudera (graham@cloudera.com); Rémy Saissy, Architect and Hadoop expert, OCTO (rsaissy@octo.com).
  • Slide 5: Agenda: introduction to Big Data and Hadoop; how to deliver an end-to-end business solution with Hadoop; questions and answers; 10 best practices for sizing and configuring a Hadoop cluster; Hadoop CDH4 under YARN in telecoms, project feedback; questions and answers; what's new in Cloudera CDH in 2013; feedback from the US; questions and answers.
  • Slide 6: Section title: Big Data and Hadoop.
  • Slide 7: A concept becoming a reality for companies: discussions and prototypes are under way in French companies. Big Data is a many-sided ecosystem: web players (Google, Amazon, Facebook, Twitter, ...), IT vendors (IBM, Teradata, VMware, EMC, ...) and management consultancies (McKinsey, BCG, Deloitte, ...).
  • Slide 8: There is no clear definition of Big Data today; it is both a business ambition and a technology opportunity. Defining Big Data: super data warehouse? Low-cost storage? NoSQL? Cloud? Internet intelligence? Real-time analysis? Unstructured data? Open Data?
  • Slide 9: Big Data, a strategic ambition: Big Data is the ambition to derive an economic advantage from the quantitative analysis of a company's internal and external data.
  • Slide 10: Some Big Data use cases in companies. Behavioural marketing for retail banking customers: analysing banking transaction records (CRE) to build a marketing segmentation based on retail customers' behaviour rather than on household tax brackets, and recommending financial products. Predictive analysis for P&C insurance exploiting web community trends: identifying correlations between community topics of interest (patients, car, home, savings, ...) and claims, and enriching data-mining models with exogenous indicators reflecting psycho-social factors. Data warehouse offloading: cutting data warehouse storage costs by a factor of 100 by partially offloading Oracle or Teradata systems to Hadoop, and taking advantage of an elastic, on-demand private or hybrid cloud architecture.
  • Slide 11: Big Data, a technology universe for building high-performance systems. Beyond 10 TB online, "classic" architectures (RDBMS, application server, ETL, ESB) require very significant logical and hardware adaptations, hence distributed, share-nothing storage. Beyond 1,000 transactions per second, the same applies, hence XTP. Beyond 10 threads per CPU core, classic sequential programming hits its I/O limits, hence parallel programming. Beyond 1,000 events per second, classic architectures again need major adaptations, hence event stream processing. Four application profiles: event-stream oriented, transaction oriented, storage oriented and compute oriented.
  • Slide 12: Non-uniform evolution of disk capacity and throughput. Between 1990 (IBM DTTA 35010, 0.7 MB/s) and 2010 (Seagate Barracuda ATA IV, then Barracuda 7200.10, 64 MB/s), throughput gained roughly x91 while capacity gained roughly x100,000: throughput growth remains far below capacity growth.
  • Slide 13: A structural limit to Moore's law: the latencies of technology components. The traditional client-server architecture must evolve to keep following Moore's law.
  • Slide 14: Architecture evolutions to get past this structural limit. In-memory architecture: reduce latency by using faster media (DRAM, SSD) and benefit from growing component capacities; the structural limit is only pushed back, and to scale the architecture must become an in-memory grid. Grid architecture: parallelise I/O by splitting volumes (sharding) and benefit from the cost differential between commodity and high-end hardware; the grid network becomes a key component, requiring co-location of data and processing; it allows near-unlimited scaling, i.e. warehouse-scale computing.
  • Slide 15: Hadoop in the Big Data universe, alongside parallel databases, NoSQL, NewSQL, in-memory systems and CEP/ESP engines. Hadoop covers HDFS and MapReduce, with associated projects such as Cassandra, Pig, Hive, Chukwa, HBase, Mahout and ZooKeeper.
  • Slide 16: Hadoop is becoming a reference architecture on the market. Open source: Apache Hadoop. Commercial distributions: Cloudera CDH, Hortonworks, MapR, DataStax (Brisk). Vendors: Greenplum (EMC), IBM InfoSphere BigInsights (CDH), Oracle Big Data Appliance (CDH), NetApp Analytics (CDH), ... Cloud: Amazon EMR (MapR), VirtualScale (CDH).
  • Slide 17: Section title: How to deliver an end-to-end business solution with Hadoop?
  • Slide 18: Hadoop, a rich and complex ecosystem.
  • Slide 19: Hadoop Distributed File System (HDFS): stores files larger than a single disk; spreads data across several machines; replicates data for fail-over, with rack awareness.
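To make the HDFS points above concrete, here is a minimal Java sketch (not taken from the deck) that writes a file through the HDFS client API; the namenode URI and file path are placeholders, and block splitting, placement and replication happen transparently behind the create() call.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed namenode address; in practice this usually comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        // The file is cut into blocks, spread across datanodes and replicated
        // (3 copies by default, placed with rack awareness).
        try (FSDataOutputStream out = fs.create(new Path("/data/events/sample.txt"))) {
            out.writeBytes("first line stored on the distributed file system\n");
        }
        fs.close();
    }
}
```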
  • Slide 20: MapReduce, the processing system: parallelise and distribute processing; handle smaller units of data to process them faster; co-locate processing with data.
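As an illustration of this model, a minimal word-count job in Java, the classic MapReduce example rather than code from the presentation; input and output paths are passed as arguments and are assumptions of the sketch.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map tasks run next to the HDFS blocks they read (data locality).
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
    }

    // Reduce tasks aggregate the partial counts shuffled from the mappers.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```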
  • Slide 21: Myths and realities about Hadoop. Hadoop is both a distributed storage system for large files (N x 64 MB) and an on-demand, batch-mode aggregation and parallel processing system running on top of the storage grid. Hadoop is not, today, a random-access system for individual records, a real-time system (it is batch on demand), a graphical data-visualisation tool, or a finished library of statistical and text-mining routines (Mahout and Hama provide parallel algorithms). Hadoop needs external components to complete the puzzle.
  • Slide 22: Which components, and for what? Data lab; offloading of data warehouses and appliances; processing of information streams (Hadoop as an ELT); compute grid; real-time machine learning (online learning).
  • Slide 23: The complete puzzle (one vision): HDFS and MapReduce at the core, surrounded by Hive, Pig, Mahout, HBase and Cassandra; batch and stream collection feeding in from operational systems, infrastructure events and the web; a data catalogue and web services exposing results to operational and decision-support systems; data-mining and data-visualization tools (and GPUs) for business users and data miners.
  • Slide 24: Data collection tools. Batch collection: native HDFS PUT; Sqoop for RDBMSs; Talend (ELT for Hadoop); Syncsort for loading large volumes; ETL tools through Hive connectors. Stream collection: Flume or Kafka for logs; Cassandra; Storm for real-time collection and processing of large volumes; ESBs through Hive connectors.
  • Slide 25: Hadoop and the BI and data-mining tools.
  • Slide 26: Designing a complete Hadoop architecture. The hardware and software architecture of a Hadoop project depends on how the cluster will be used: there is no single reference architecture for every usage, but rather architectures per class of use. The architecture and configuration of the cluster are the most critical points and require experience and sharp expertise; there are nevertheless best practices and pitfalls to avoid.
  • Slide 27: Discussion.
  • Slide 28: Section title: 10 best practices for sizing and configuring a Hadoop cluster.
  • Slide 29: Outline. Trap 1: the temptation of "war machine" servers. Trap 2: for the network, 10 Gb/s is safer. Trap 3: my current tools are enough for monitoring. Trap 4: an SCM? No time, SSH will do. Trap 5: logs matter, collect them all. Trap 6: keep the default memory settings. Trap 7: keep the default HDFS configuration. Trap 8: keep the default MapReduce configuration. Trap 9: use the default file formats. Trap 10: benchmark the cluster with TeraSort.
  • Slide 30: Trap 1: the temptation of "war machine" servers. The trap: unused resources, insufficient parallelism, and extra cost with no guaranteed performance gain. Best practice: think in terms of parallelisation, reason in containers (one physical CPU, x GB of RAM and one HDFS disk per container) and size for processing time.
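A rough, hypothetical sizing sketch in Java illustrating the "think in containers" rule; the node specifications and the 4 GB-per-container figure are invented for the example and are not a recommendation from the deck.

```java
public class ContainerSizing {
    public static void main(String[] args) {
        // Hypothetical worker node; adjust to your own hardware.
        int physicalCores = 12;
        int ramGb = 64;
        int hdfsDisks = 12;
        int gbPerContainer = 4;   // RAM granted to one map/reduce container
        int reservedGb = 8;       // OS + datanode + nodemanager daemons

        // One container per core, bounded by available RAM; ideally close to
        // one spindle per concurrent task so disk I/O stays parallel too.
        int byCpu = physicalCores;
        int byRam = (ramGb - reservedGb) / gbPerContainer;
        int containers = Math.min(byCpu, byRam);

        System.out.printf("Concurrent containers: %d (CPU bound: %d, RAM bound: %d, HDFS disks: %d)%n",
                containers, byCpu, byRam, hdfsDisks);
    }
}
```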
  • Slide 31: Trap 2: for the network, 10 Gb/s is safer. The trap: to keep good performance you must avoid oversubscription, which means bigger and more expensive rack switches (10 Gb/s is about 1 GB/s per node, adding up to some 40 GB/s at the rack switch) and an even bigger, even more expensive backbone (40 GB/s x number of racks = ?). Best practice: use two bonded 1 Gb/s full-duplex NICs, put fewer disks in each server, and monitor.
  • Slide 32: Trap 3: my current tools are enough for monitoring. The trap: no detail on Hadoop's internal metrics, such as HDFS reads and writes per node or memory consumption during the stages of a job. Best practice: think of the developers and use Ganglia for fine-grained metrics.
  • Slide 33: Trap 4: an SCM? No time, SSH will do. The trap: a small Hadoop cluster is already 10 machines, and configuring and maintaining them by hand is difficult and time-consuming. Best practice: use an SCM (software configuration management tool).
  • Slide 34: Trap 5: logs matter, collect them all. The trap: 500 mappers and 20 reducers mean 520 log files to collect across the cluster, with little information that is useful in the long run. Best practice: no collection on the slaves; collect on the masters.
  • Slide 35: Trap 6: keep the default memory settings. The trap: the defaults are not optimised for your cluster, leading to under-used resources and possible job failures. Best practice: 2 GB for the tasktracker and datanode daemons, 4 GB for the JobTracker daemon, 4 GB plus 1 GB per million blocks for the namenode, 4 GB or even 8 GB per map and reduce task, and monitor.
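Daemon heaps are normally set in hadoop-env.sh, but per-task memory can be illustrated with a hedged Java sketch using the standard MRv2/YARN property names; the values shown are examples only and must come from your own sizing, not from the deck.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MemoryTunedJob {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Container sizes requested for map and reduce tasks (MRv2/YARN property names).
        conf.setInt("mapreduce.map.memory.mb", 4096);
        conf.setInt("mapreduce.reduce.memory.mb", 8192);
        // JVM heap must stay below the container size to leave room for off-heap usage.
        conf.set("mapreduce.map.java.opts", "-Xmx3276m");
        conf.set("mapreduce.reduce.java.opts", "-Xmx6553m");
        return Job.getInstance(conf, "memory-tuned job");
    }
}
```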
  • Slide 36: Trap 7: keep the default HDFS configuration. The trap: it is not optimised for a cluster, and the right parameters depend on your data, your network, and so on. Best practice: configure with the I/O versus memory versus network trade-offs in mind; each use case has its own optimised configuration; monitor.
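As a small example of not keeping the defaults, the sketch below (an assumption, not a recipe from the deck) overrides block size and replication from the client side for one specific dataset; the property names are the standard Hadoop 2 names and the values are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsTuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Larger blocks favour long sequential scans (fewer map tasks, less namenode metadata).
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
        // Replication is a trade-off between resilience, disk space and write traffic.
        conf.setInt("dfs.replication", 2);

        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("/tmp/local-export.csv"),
                             new Path("/data/staging/export.csv"));
        fs.close();
    }
}
```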
  • Slide 37: Trap 8: keep the default MapReduce configuration. The trap: it is not optimised for a cluster, and the parameters depend on how you use it. Best practice: use the CapacityScheduler, derive the configuration from calculation rules, and audit actual usage to refine it.
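For instance, once the CapacityScheduler is configured on the cluster, each job can be routed to a dedicated queue; the queue name below is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueAwareJob {
    public static Job submitToQueue() throws Exception {
        Configuration conf = new Configuration();
        // Route the job to a dedicated CapacityScheduler queue (the name is an example).
        conf.set("mapreduce.job.queuename", "analytics");
        return Job.getInstance(conf, "queue-aware job");
    }
}
```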
  • Slide 38: Trap 9: use the default file formats. The trap: slow jobs due to inefficient storage and more space used than necessary. Best practice: choose the storage format by usage (database-like data versus binary data) and the compression by access frequency (frequently used data versus archives).
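As one way to move away from the plain-text default, a sketch that makes a job write block-compressed SequenceFiles; the choice of the Snappy codec is an assumption (any codec installed on the cluster would work).

```java
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressedOutputJob {
    public static void configureOutput(Job job) {
        // Binary, splittable container format instead of plain text files.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        // Block compression groups records together, which compresses better than per-record.
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
    }
}
```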
  • Slide 39: Trap 10: benchmark the cluster with TeraSort. The trap: it is not representative of the real usage of the cluster. Best practice: benchmark with production code.
  • Slide 40: Discussion.
  • Slide 41: Section title: Hadoop CDH4 under YARN in telecoms: project feedback.
  • Slide 42: Outline: context; cluster characteristics; project progress; deploying Hadoop; deploying the supporting tools; data ingestion; data analysis; cluster migration; cluster benchmark; the cluster at the end of the assignment; conclusion.
  • Slide 43: Context: a 3-month assignment with an operational team of 8 people, working co-located. Three major goals: build an operational Big Data platform, raise the teams' skills, and produce recommendations for an industrial-grade platform.
  • Slide 44: Cluster characteristics: 1 rack, 12 servers; 1 node for the tools and 1 for anonymisation; 2 master nodes (namenode / resourcemanager, and secondary namenode); 8 slave nodes running datanode and nodemanager. (Diagram: slaves, masters, tools, and access to the masters and tools.)
  • Slide 45: Project progress (timeline diagram).
  • Slide 46: Deploying Hadoop: on a production network, use a local package mirror; OS configuration requires system and network skills; use an SCM to deploy; versatile profiles are a necessity.
  • Slide 47: Deploying the supporting tools: relatively easy once Hadoop is properly installed; little impact on the cluster itself; deploy only what is necessary.
  • Slide 48: Data ingestion: KISS (Keep It Simple, Stupid); do not neglect the work upstream of the analysis!
  • Slide 49: Data analysis: a lot of upstream work; a cluster gets optimised through contact with reality; tool limitations; scheduler tuning; memory configuration; HDFS configuration.
  • Slide 50: Cluster migration: upgrade from CDH 4.0.1 to CDH 4.1.2. Lessons learned: prepare upstream; the SCM would have saved time; follow the recommendations!
  • Slide 51: Cluster benchmark: planned at the start of the project... TeraSort? Rather HiBench. In the end, the work done during the project turned out to be the best benchmark.
  • Slide 52: The cluster at the end of the assignment: an operational YARN cluster; several tools tried out during the exploration; HDFS 70% full, with 1,427,251 files and 280 TB; the jobs do not completely saturate the cluster.
  • Slide 53: Conclusion. On the positive side: YARN is stable and opens the door to frameworks other than MapReduce, and the tools are versatile. To be improved: the maturity of the tools and of their working environment, the complexity of configuring Hadoop and its tools, and the availability of documentation and sizing guides. To set up your own cluster you need a multi-disciplinary team and technical versatility.
  • Slide 54: Discussion.
  • Slide 55: Cloudera presentation.
  • Slide 56: Section title: Conclusion.
  • Slide 57: Conclusion: the Hadoop ecosystem is rich, complex and in constant motion; the expected gains are unprecedented; usage has a strong impact on the architecture and configuration.
  • Slide 58: How to get started this afternoon? Identify the business use cases applicable in your context by benchmarking the projects launched in other sectors, in France and beyond. Launch a business POC to explore the data with the most early-adopter business lines: marketing, distribution, industrial infrastructure, trading, risk. Express the results of the POC in business terms. Define an industrial-grade target architecture to generalise the approach while reducing costs.
  • Slide 59: OCTO and Big Data: a coherent offering spanning technology and predictive analytics (same content as slide 2).
  • Slide 60: Petit Déjeuner Hadoop - Cloudera. Graham Gear | graham@cloudera.com. April 2013.
  • Slide 61: Cloudera timeline. 2008: Cloudera founded by Mike Olson, Amr Awadallah and Jeff Hammerbacher. 2009: Hadoop creator Doug Cutting joins Cloudera. 2009: CDH, the first commercial Apache Hadoop distribution. 2010: Cloudera Manager, the first management application for Hadoop. 2011: Cloudera reaches 100 production customers. 2011: Cloudera University expands to 140 countries. 2012: Cloudera Enterprise 4, the standard for Hadoop in the enterprise. 2012: Cloudera Connect reaches 300 partners. Beyond: transforming how companies think about data, changing the world one petabyte at a time.
  • Slide 62: Pervasive in the enterprise: 20+ billion online events per day are ingested by Cloudera; 70% of all smartphones in the U.S. are powered by Cloudera; 250 million tweets per day are filtered for actionable business insights by Cloudera; 4 of the top 5 commercial banks rely on Cloudera; 20 million households lower their power bill using Cloudera; 3 of the top 5 organizations in telecoms, defense, media, banking and retail run Cloudera.
  • Slide 63: The Cloudera approach: meet enterprise demands with a new way to think about data. The old way (complex, fragmented, costly): data silos by department or LOB; lots of data stored in expensive specialized systems; analysts pull select data into the EDW; no one has a complete view; multiple platforms for multiple workloads. The Cloudera way (simplified, unified, efficient): the bulk of the data is stored on a scalable, low-cost platform; end-to-end workflows; specialized systems reserved for specialized workloads; data access across departments and LOBs; a single data platform supporting BI, reporting and application serving.
  • Slide 64: A complete solution. Cloudera University: developer, administrator and data science training, plus certification programs. Professional services: use case discovery, proof-of-concept deployment, new Hadoop deployment, certification, process and team development, production pilots. The platform itself: CDH (ingest, store, explore, process, analyze, serve), Cloudera Manager, Cloudera Navigator and Cloudera Support.
  • Slide 65: Cloudera Enterprise Core: support and management for all the core components of CDH across ingest, store, explore, process, analyze and serve: HDFS, YARN, MapReduce / MapReduce2, Hive, Pig, Mahout, DataFu, Hue, Oozie, Whirr, Sqoop, Flume, ZooKeeper and the metastore, plus certified connectors (file access via FUSE-DFS, REST via WebHDFS and HttpFS, SQL via ODBC/JDBC, and BI / ETL / RDBMS integration). Cloudera Manager is included and required.
  • Slide 66: Cloudera Enterprise RTD: adds support and management for Apache HBase (real-time data serving) on top of Core.
  • Slide 67: Cloudera Enterprise RTQ: adds support and management for Cloudera Impala (real-time queries) on top of Core.
  • Slide 68: Cloudera Enterprise BDR: the backup and disaster recovery module for Cloudera Enterprise.
  • Slide 69: Cloudera Navigator: data audit and access control for Cloudera Enterprise (audit and access in v1.0; lineage, lifecycle and exploration to follow).
  • Slide 70: Customer case studies.
  • Slide 71: Ask bigger questions: how can we optimize our data warehouse investment? A multinational bank saves millions by optimizing its DW for analytics and reducing data storage costs by 99%.
  • Slide 72: Cloudera optimizes the EDW and saves millions. The challenge: the Teradata EDW was at capacity, ETL processes consumed 7 days and it took 5 weeks to make historical data available for analysis; business-critical applications hit performance issues, leaving little room for discovery, analytics or ROI from new opportunities. The solution: Cloudera Enterprise offloads data storage, processing and some analytics from the EDW, so Teradata can focus on operational functions and analytics. The multinational bank saves millions by optimizing its existing DW for analytics and reducing data storage costs by 99%.
  • Slide 73: Ask bigger questions: how can we prevent fraud? The quant risk LOB within a multinational bank saves millions through better risk exposure analysis and fraud prevention.
  • Slide 74: Cloudera delivers savings through fraud prevention. The challenge: fraud detection is a cumbersome, multi-step analytic process requiring data sampling; 2 billion transactions per month necessitate constant revisions to risk profiles; a highly tuned 100 TB Teradata DW drives over-budget capital reserves and lower investment returns. The solution: a Cloudera Enterprise data factory for fraud prevention and for credit and operational risk analysis; every incidence of fraud can be examined over 5 years for each person; costs are reduced because expensive CPU is no longer consumed by data processing.
  • Slide 75: Ask bigger questions: which semiconductor chips will fail? A semiconductor manufacturer uses predictive analytics to take preventative action on chips likely to fail.
  • Slide 76: Cloudera enables better predictions. The challenge: the manufacturer wanted to capture more granular and historical data for more accurate predictive yield modeling, and storing even 9 months of data on Oracle was expensive. The solution: the Dell | Cloudera solution for Apache Hadoop; 53 nodes, with plans to store up to 10 years of data (~10 PB); data captured and processed from each phase of the manufacturing process. The manufacturer can now prevent chip failures with more accurate predictive yield models.
  • Slide 77: Ask bigger questions: how do we retain customers in a competitive market? BlackBerry eliminates data sampling and simplifies data processing for better, more comprehensive analysis.
  • Slide 78: Cloudera delivers ROI through storage alone. The challenge: BlackBerry Services generates 0.5 PB (50-60 TB compressed) of data per day, and the RDBMS was so expensive that analytics were limited to a 1% data sample. The solution: Cloudera Enterprise manages a global data set of ~100 PB, collecting device content, machine-generated log data and audit details, with a 90% reduction of the ETL code base. BlackBerry can now analyze all its data, instead of relying on a 1% sample, for better network capacity trending and management.
  • Slide 79: Ask bigger questions: how do we prevent mobile device returns? A leading manufacturer of mobile devices gleans new insights and delivers instant software bug fixes.
  • Slide 80: Cloudera complements the data warehouse. The challenge: a fast-growing Oracle DW that is difficult and expensive to keep performing at scale, and the need to ingest massive volumes of unstructured data very quickly. The solution: Cloudera Enterprise with RTD for data processing, storage and analysis over 25 years of data, integrated with Oracle in a closed-loop analytical process; device data is collected every minute and 1 TB per day is loaded into Cloudera. The mobile technology leader identified a hidden software bug that caused a sudden spike in returns. Read the case study: http://www.cloudera.com/content/cloudera/en/resources/library/casestudy/driving-innovation-in-mobile-devices-with-cloudera-and-oracle.html
  • Slide 81: Ask bigger questions: how can we increase the value we deliver to publishers? YellowPages enables new publisher services through faster data processing.
  • Slide 82: Cloudera expedites data processing from days to hours. The challenge: keeping 260 million billable daily events for 13 months plus 600 million non-billable daily events for 90 days, with performance and scale challenges on SQL Server. The solution: Cloudera Enterprise as the core production traffic processing system, integrated with HP Vertica (315 CDH nodes and 30 TB on Vertica). YP deploys Cloudera to offload the data warehouse, enabling new business functions.