Hadoop in the Enterprise Eco System
Hadoop is designed to solve Big Data problems encountered by Web and social companies. In doing so, a lot of the features enterprises need or want are put on the back seat. For example, HDFS does not offer native support for security and authentication.

Hadoop is NOT cheap
• Hardware cost: let's say a Hadoop node is $5,000. A 100-node cluster would then be $500,000 for hardware.
• IT and operations costs: teams such as network admins, IT, security admins and system admins. One also needs to account for operational costs such as data center expenses (cooling, electricity, etc.).

Hadoop is still rough around the edges
The development and admin tools for Hadoop are still pretty new. Companies like Cloudera, Hortonworks, MapR and Karmasphere have been working on this issue. However, the tooling may not be as mature as enterprises are used to (say, Oracle admin tools).
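The hardware estimate quoted above is plain multiplication, and a minimal sketch makes it easy to rerun with other cluster sizes. The $5,000 per-node price and the 100-node count come from the text; the function name `hardware_cost` is hypothetical, and the result deliberately leaves out the IT, operations and data center costs, for which the text gives no figures.

```python
# Hardware-only cost estimate using the figures quoted in the text.
NODE_COST_USD = 5_000   # per-node price from the text
NODE_COUNT = 100        # cluster size from the text


def hardware_cost(nodes: int, cost_per_node: int) -> int:
    """Return total hardware cost in USD.

    Excludes IT, operations and data center costs (hypothetical helper).
    """
    return nodes * cost_per_node


total = hardware_cost(NODE_COUNT, NODE_COST_USD)
print(f"{NODE_COUNT} nodes x ${NODE_COST_USD:,} = ${total:,}")
```

Changing `NODE_COUNT` shows how quickly the hardware line alone grows as a cluster expands.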
How to Implement Hadoop Successfully!
Based on: Avinash Kaushik
By Adir Sharabi
Only 24% of Hadoop projects are actually in production.
By Rainstor
About Conduit
• Over 250 million active end users
• More than 260,000 publishers
• Over 3 billion monthly user interactions
• Deployed in 120 countries
• Founded in 2005
• Acquired Wibiya in 2011
Conduit’s Big Data Growth (5TB to 500TB)
[Chart: data size (TB) and # of nodes over time]
• Jan 2009: DWH launched
• Mar 2010: Hadoop launched on cloud (8 nodes)
• Feb 2011: Hadoop deployed on Conduit’s data center (72 nodes)
• Jan & Oct 2012: Procurement (105/120 nodes)
• Sep 2013: Procurement - DR
Conduit’s Data Platform in Numbers
• Hardware: 125 Nodes (+70 after DR) on 6 racks; TB Used/1.2 PB Total
• Daily processed data: 50,000 files; 500,000,000 records; 700 GB
• Daily jobs submitted: Over 5,000
• Data freshness: 60 minutes
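The daily figures on this slide imply some average sizes that are useful for capacity planning. The sketch below derives them; it assumes 1 GB = 10^9 bytes and treats "over 5,000" jobs as exactly 5,000, so the job rate is a lower bound. The function name `daily_averages` and its keys are hypothetical, and these are plain averages, not measured distributions.

```python
# Daily figures as quoted on the slide.
FILES_PER_DAY = 50_000
RECORDS_PER_DAY = 500_000_000
BYTES_PER_DAY = 700 * 10**9   # assumption: 1 GB = 10^9 bytes
JOBS_PER_DAY = 5_000          # slide says "over 5,000", so a lower bound


def daily_averages() -> dict:
    """Derive average sizes and rates from the daily totals (hypothetical helper)."""
    return {
        "bytes_per_record": BYTES_PER_DAY / RECORDS_PER_DAY,
        "records_per_file": RECORDS_PER_DAY / FILES_PER_DAY,
        "mb_per_file": BYTES_PER_DAY / FILES_PER_DAY / 10**6,
        "jobs_per_hour": JOBS_PER_DAY / 24,
    }


for name, value in daily_averages().items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions the averages come to roughly 1.4 KB per record, 10,000 records and 14 MB per file, and a little over 200 jobs per hour.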
Tip #2
Data is turning challenges into business opportunities.
Use Cases
[Bar chart, axis 0% to 20%]
Bar values, in chart order: 8%, 8%, 9%, 9%, 10%, 11%, 13%, 15%, 19%
Bar labels, in chart order:
• Analyze complete rather than partial data sets
• Other
• Customer intelligence for more targeted marketing
• Include more semi-structured/unstructured info into decision making
• Improve scientific research
• ETL
• Log analysis
• Reduce cost of data analysis
• Mine data for business intelligence
But…
• Hadoop in the Enterprise Eco System: a lot of the features enterprises need or want are put on the back seat
• Hadoop is NOT cheap (H/W & operations cost): make sure the company's decision makers are on board
• Hadoop is still rough around the edges: tooling may not be as mature as enterprises are used to
• Data access is batch oriented
Tip #3
The 10/90 rule for magnificent data success.
Tip #3
• Nurture your "big brains"
• Hadoop is cutting-edge technology: investment in related skills and training is crucial
• Good Data Scientists are "unicorns"
• Embrace the Open Source culture; it will pay off
• The BI team is essential for connecting the dots
Data Roles @ Conduit
[Org chart]
• Teams: Data Infra Team, Data BI Team, Data Science Team
• Product: Mobile, Wibiya, Quick Launch, Toolbar, Other
• Roles shown across the product lines: BI, Scientist
Tip #4
Shoot for right time data, not real time data.
Tip #4
• Complex decision making is time consuming and therefore cannot react in real time
• Real time is expensive!
• Tailor the right solution to accommodate the required data freshness
• Focus on big things!
Data Maturity vs. Freshness @Conduit
[Chart: Freshness (ticks at 10 and 60) against Data Maturity (Low, Medium, High; structured, cleansed & completed)]
• Platforms: Kafka, Hadoop, DWH
• Items plotted: Real Time Monitoring, Hue/Hive, Reporting Service, Advanced Analytics Models, Business Objective
Tip #5
• Data will be dirty and schema-less, with no foreign keys
• And yet, we are standing on a mountain of gold!
• Do your best, and know when to shift to data analysis
• Tune your algorithms to tolerate data deficiencies, then hunt for insights
• Big data is not a data warehouse
Tip #6
• Break down the barriers preventing our users/applications from using their valuable data in more effective ways to glean meaningful insights
• Provide your users with advanced self-service tools to access the data
• The Hadoop ecosystem is evolving as we speak
• Your performance is measured by the tools' effectiveness and ease of use
To Summarize…
• Start small
• Identify the opportunities
• Invest in people & related skills
• Adjust processes to the organization's needs
• Know your data limits
• Self-service tools are extremely important