Conduit’s Big Data Growth (5 TB to 500 TB)
[Chart: data size (TB) and # of nodes over time]
• Jan 2009: DWH launched
• Mar 2010: Hadoop launched on cloud (8 nodes)
• Feb 2011: Hadoop deployed on Conduit’s data center (72 nodes)
• Jan & Oct 2012: Procurement (105/120 nodes)
• Sep 2013: Procurement (DR)
Conduit’s Data Platform in Numbers
• Hardware: 125 nodes (+70 after DR) on 6 racks; TB Used/1.2 PB Total
• Daily processed data: 50,000 files; 500,000,000 records; 700 GB
• Daily jobs submitted: over 5,000
• Data freshness: 60 minutes
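As a rough sanity check, the daily figures above imply some averages. The sketch below derives them, assuming decimal units (1 GB = 10^9 bytes) and an even spread across files, records and nodes, neither of which the slide states. The variable names are illustrative only.

```python
# Figures taken from the slide.
DAILY_BYTES = 700 * 10**9       # 700 GB per day (assumes 1 GB = 10**9 bytes)
DAILY_RECORDS = 500_000_000
DAILY_FILES = 50_000
DAILY_JOBS = 5_000              # slide says "over 5,000", so this is a lower bound
NODES = 125

# Derived averages (assume an even spread, which real workloads rarely have).
avg_record_bytes = DAILY_BYTES / DAILY_RECORDS        # bytes per record
records_per_file = DAILY_RECORDS / DAILY_FILES        # records per file
avg_file_mb = DAILY_BYTES / DAILY_FILES / 10**6       # MB per file
gb_per_node_per_day = 700 / NODES                     # GB ingested per node per day
jobs_per_minute = DAILY_JOBS / (24 * 60)              # minimum average job rate

print(f"average record size : {avg_record_bytes:.0f} bytes")
print(f"records per file    : {records_per_file:.0f}")
print(f"average file size   : {avg_file_mb:.0f} MB")
print(f"ingest per node/day : {gb_per_node_per_day:.1f} GB")
print(f"jobs per minute     : at least {jobs_per_minute:.1f}")
```

These are averages only; they say nothing about peak load or about how skewed the file sizes are.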
Tip #2: Data is turning challenges into business opportunities.
Use Cases
[Bar chart, horizontal axis 0% to 20%. Values as extracted: 8%, 8%, 9%, 9%, 10%, 11%, 13%, 15%, 19%. Categories as extracted: analyze complete rather than partial data sets; other; customer intelligence for more targeted marketing; include more semi-structured/unstructured info in decision making; improve scientific research; ETL; log analysis; reduce cost of data analysis; mine data for business intelligence.]
But…
• Hadoop in the enterprise ecosystem: a lot of the features enterprises need or want take a back seat
• Hadoop is NOT cheap (H/W & operations cost): make sure the company's decision makers are on board
• Hadoop is still rough around the edges: tooling may not be as mature as enterprises are used to
• Data access is batch oriented
Tip #3: The 10/90 rule for magnificent data success.
Tip #3
• Nurture your "big brains"
• Hadoop is cutting-edge technology: investment in related skills and training is crucial
• Good data scientists are "unicorns"
• Embrace the open source culture; it will pay off
• The BI team is essential for connecting the dots
Data Roles @ Conduit
[Org chart. Teams: Data Infra Team, Data BI Team, Data Science Team. Labels as extracted: Product, Mobile, Wibiya, Quick Launch, Toolbar, Other. BI and Scientist roles are repeated across the chart.]
Tip #4: Shoot for right time data, not real time data.
Tip #4
• Complex decision making is time consuming, so it cannot react in real time
• Real time is expensive!
• Tailor the solution to the data freshness each use case actually requires
• Focus on big things!
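One way to read "right time, not real time" is as a selection problem: for each use case, pick the cheapest pipeline that is still fresh enough. The sketch below illustrates that idea only. The tier names, freshness values and cost units are hypothetical and do not describe Conduit's actual setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    freshness_minutes: int  # worst-case staleness of the data this tier serves
    relative_cost: int      # made-up cost units, for illustration only


# Hypothetical tiers: fresher data costs more to deliver.
TIERS = [
    Tier("streaming", 1, 10),
    Tier("micro-batch", 10, 4),
    Tier("hourly batch", 60, 1),
]


def cheapest_tier(required_freshness_minutes: int) -> Tier:
    """Return the cheapest tier whose staleness is within the requirement."""
    candidates = [t for t in TIERS if t.freshness_minutes <= required_freshness_minutes]
    if not candidates:
        raise ValueError("no tier is fresh enough for this requirement")
    return min(candidates, key=lambda t: t.relative_cost)


# A report that tolerates hour-old data does not need a streaming pipeline.
print(cheapest_tier(60).name)
# A monitoring view that needs data within 5 minutes does.
print(cheapest_tier(5).name)
```

The design choice here is to start from the freshness requirement of the decision being made, rather than from the fastest technology available.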
Data Maturity vs. Freshness @ Conduit
[Diagram. Freshness axis with marks at 10 and 60; data maturity axis (structured, cleansed & completed): low, medium, high. Items shown: Real Time Monitoring, Hue/Hive, Reporting Service, Advanced Analytics Models, Business Objective. Systems shown: Hadoop, DWH, Kafka.]
Tip #5: Data quality sucks, just get over it!
Tip #5
• Data will be dirty and schema-less, with no foreign keys
• And yet, we are standing on a mountain of gold!
• Do your best, and know when to shift to data analysis
• Tune your algorithms to tolerate data deficiencies, then hunt for insights
• Big data is not a data warehouse
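Tolerating data deficiencies can be as simple as keeping every usable record and counting the rest instead of failing the whole job. The following is a minimal sketch, assuming JSON-lines input; the field names (`user_id`, `country`) and the function `parse_tolerant` are hypothetical and chosen only for illustration.

```python
import json
from collections import Counter


def parse_tolerant(lines, required=("user_id",), defaults=None):
    """Parse JSON lines, keeping usable records and counting the rejects."""
    defaults = defaults if defaults is not None else {"country": "unknown"}
    good, stats = [], Counter()
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            stats["malformed"] += 1      # not valid JSON: skip, do not crash
            continue
        if not isinstance(rec, dict) or any(rec.get(k) is None for k in required):
            stats["missing_required"] += 1
            continue
        for key, value in defaults.items():
            if rec.get(key) is None:     # fill optional gaps with a default
                rec[key] = value
                stats["defaulted_" + key] += 1
        good.append(rec)
        stats["kept"] += 1
    return good, stats


lines = [
    '{"user_id": 1, "country": "IL"}',
    '{"user_id": 2}',
    '{"country": "US"}',
    'not json',
]
good, stats = parse_tolerant(lines)
print(len(good), dict(stats))
```

Tracking the reject counts matters as much as the parsing: they show when data quality has dropped far enough that the insights themselves should be questioned.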
Tip #6: Democratize the data.
Tip #6
• Break down the barriers that prevent our users/applications from using their valuable data in more effective ways to glean meaningful insights
• Provide your users with advanced self-service tools to access the data
• The Hadoop ecosystem is evolving as we speak
• Your performance is measured by the tools' effectiveness and ease of use
To Summarize…
• Start small
• Identify the opportunities
• Invest in people & related skills
• Adjust processes to the organization's needs
• Know your data limits
• Self-service tools are extremely important