Big Data
Shankar Radhakrishnan
July 2011
Big Data in the News
Savings:
- American health care: $300 billion/year
- European public sector: €250 billion/year
- Productivity margins: 60% increase
Source: McKinsey Global Institute
Topics
- What do we collect today?
- DBMS Landscape
- The Disconnect
- The Need
- What is Big Data?
  - Characteristics
  - Approach
  - Architectural Requirements
  - Techniques
  - Challenges
  - Solutions
  - Issues
- Deep Dive: Practical Approaches to Big Data
  - Hadoop
  - Aster Data
What do we collect?
- In 2010, people stored enough data to fill 60,000 Libraries of Congress (the LoC had collected 235 TB as of April 2011)
- YouTube receives 24 hours of video every minute
- 5 billion mobile phones were in use in 2010
- Tesco (British retailer) collects 1.5 billion pieces of information to adjust prices and promotions
- Amazon.com: 30% of sales comes from its recommendation engine
- Planecast, Mobclix: track-and-target systems that deliver contextual promotions
- A Boeing jet engine produces 20 TB/hour for engineers to examine in real time to make improvements
Sources: Forrester, The Economist, McKinsey Global Institute
Collect More
- Business Operations: Transactions, Registers, Gateways
- Customer Information: CRM
- Product Information: Barcodes, RFID
- Web: Web Pages, Web Repositories
- Unstructured Information: Social Media
- Signals: Mobile, GPS, GeoSpatial
DBMS Solutions
Legacy strengths:
- Faster retrieval
- Efficient storage (divide and access)
- Data consolidation (broader tables, access all as a row)
- Fine-grain access
- Security (rules and policies)
Problems:
- Data growth, even when storage cost is not an issue
- Scalability issues
- Performance issues
- New types of requirements: deciding what to analyze, when, and how
- Cost of a change in the subject area to analyze
The Disconnect
- Old DBMS vs. new data types/structures
- Old DBMS vs. new volume
- Old DBMS vs. new analysis
- Old DBMS vs. data retention
- Old DBMS vs. data element striping
- Old DBMS vs. data infrastructure
- Old DBMS vs. one DB platform for all
The Need
A system that can handle high-volume data and perform complex operations, and that is:
- Scalable
- Robust
- Highly available
- Fault tolerant
- Economical
In other words: a new approach.
Big Data
“Tools and techniques to manage different types of data, in high volume, at high velocity, with varied requirements to mine them.”

Characteristics:
- Size: scale up and scale out (terabytes, petabytes, ...)
- Structure: structured; unstructured (audio, video, text, geospatial); schema-less structures
- Stream: a torrent of real-time information
- Operation: Massively Parallel Processing (MPP)
Approach
Hardware:
- Commodity hardware
- Appliances: dynamic scaling, fault tolerant, highly available, no constraints on storage
- Cloud: virtual environments and storage
Processing models:
- In-memory
- In-database
- Interfaces/adapters
- Workload management
- Distributed data processing
Software:
- Frameworks: Hadoop, MapReduce, Vrije, BOOM, Bloom
- Open source and proprietary
Architectural Requirements
- Integration framework
- Development framework
- Management framework
- Modeling framework
- Processing framework
- Data management framework
Challenges
- Volumetric analysis
- Complexity
- Streaming data / real-time data
- Network topology
- Infrastructure
- Pattern-based strategy
Techniques
- Controlled and variate testing
- Mining
- Machine learning
- Natural Language Processing (NLP)
- Cohort analysis
- Network or path analysis
- Predictive models
- Crowdsourcing
- Regression models
- Sentiment analysis
- Signal processing
- Spatial analytics
- Visualization
- Time-series analysis
Solutions
- IBM: InfoSphere BigInsights, Streams
- Teradata/Aster Data: nCluster, SQL-MR
- Frameworks: Hadoop, MapReduce
- Infobright*
- Splunk
- Cloudera*
- Cassandra
- NoSQL, NewSQL
- Google’s Bigtable
- Appliances: Teradata, Netezza (IBM)
- Columnar databases: Vertica (HP), ParAccel
* Managed services available
Issues
- Latency
- Faultiness
- Accuracy
- ACID: Atomicity, Consistency, Isolation, Durability
- Setup cost
- Development cost
- Cost-to-fly
Deep Dive: Hadoop
- Top-level Apache project
- Open source
- Software framework (Java)
- Inspired by Google’s white papers on:
  - Map/Reduce (MR)
  - Google File System (GFS)
  - Bigtable
- Originally developed to support Apache Nutch
- Designed for:
  - Large-scale data processing
  - Batch processing
  - Sophisticated analysis
  - Structured and unstructured data
- The DB architect’s Hadoop: "Heck, Another Darn Obscure Open-source Project"
Why Hadoop?
- Runs on commodity hardware
- Portable across heterogeneous hardware and software platforms
- Shared-nothing architecture
- Scale hardware whenever you want; the system compensates for hardware scaling and issues (if any)
- Runs large-scale, high-volume data processes
- Scales well with complex analysis jobs
- (Hardware) “Failure is an option”
- Ideal for consolidating data from both new and legacy data sources
- Highly integrable
- Value to the business
Hadoop Ecosystem
- HDFS: Hadoop Distributed File System
- Map/Reduce: software framework for clustered, distributed data processing
- ZooKeeper: coordination service
- Avro: data serialization
- Chukwa: data collection system to monitor distributed systems
- HBase: data storage for distributed large tables (see the sketch below)
- Hive: data warehouse
- Pig: high-level query language
- Scribe: log collection
- UDF: user-defined functions
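To make one ecosystem piece concrete, here is a minimal sketch of writing and reading a cell through the HBase Java client API of this era (HTable, Put, Get). The table name "clicks", the column family "cf", and the row key are made-up illustrations; the sketch assumes a running HBase cluster whose settings are on the classpath in hbase-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    HTable table = new HTable(conf, "clicks");        // hypothetical table

    // Write one cell: row key "user42", column family "cf", qualifier "page"
    Put put = new Put(Bytes.toBytes("user42"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("page"), Bytes.toBytes("/home"));
    table.put(put);

    // Read the cell back by row key
    Result result = table.get(new Get(Bytes.toBytes("user42")));
    byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("page"));
    System.out.println(Bytes.toString(value)); // prints "/home"
    table.close();
  }
}
```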
Hadoop Flow (Example)
[Diagram] Web Servers → Scribe → Network Storage → Hadoop/Hive DWH; MySQL and Oracle sources also feed the DWH; results flow back out to MySQL/Oracle for Apps and Feeds
HDFS
- Hadoop Distributed File System
- Master/slave architecture
- Runs on commodity hardware
- Fault tolerant
- Handles large volumes of data
- Provides high throughput
- Streaming data access
- Simple file-coherency model
- Portable across heterogeneous hardware and software
- Robust: handles disk failures and replication (and re-replication); performs cluster rebalancing and data-integrity checks
A minimal client-side usage sketch follows.
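As a sketch of how a client uses HDFS, the snippet below writes one file and streams it back through the Java FileSystem API. The path /user/demo/hello.txt is hypothetical; the empty Configuration is assumed to pick up the cluster's core-site.xml and hdfs-site.xml, and block placement, replication, and re-replication happen inside HDFS, not in client code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();       // loads *-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);           // the configured file system (HDFS here)

    Path file = new Path("/user/demo/hello.txt");   // hypothetical path
    FSDataOutputStream out = fs.create(file, true); // true = overwrite if it exists
    out.writeUTF("Hello, HDFS");
    out.close();

    FSDataInputStream in = fs.open(file);           // streaming read, as on the slide
    System.out.println(in.readUTF());
    in.close();
    fs.close();
  }
}
```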
HDFS Architecture
- Name node: performs file system operations; maps data nodes
- Data node: processes reads/writes; handles data blocks; replication

Hadoop M/R
- Work is tagged by a job
- Splits the input data set into separate chunks
- Chunks are processed by map tasks, in parallel
- Sorts the output of the maps
- Sorted map output is processed by reduce tasks, in parallel
- Input and output are typically stored in a file system
- The framework takes care of:
  - Scheduling tasks
  - Monitoring
  - Re-executing failed tasks
  - Infrastructure issues: load balancing, load redistribution, replication, failover
(A complete word-count sketch follows the Mapper and Reduce Function slides below.)
Mapper Function
cat *  |  grep  |  sort     |  uniq -c  |  cat > file
input  |  map   |  shuffle  |  reduce   |  output
Reduce Function
cat *  |  grep  |  sort     |  uniq -c  |  cat > file
input  |  map   |  shuffle  |  reduce   |  output
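Putting the two slides above together: the mapper plays the role of grep, the framework's shuffle/sort plays sort, and the reducer plays uniq -c. Below is a minimal word-count job against the Hadoop 0.20-era Java MapReduce API; the class names and the "word count" job name are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in this task's input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: the framework has already grouped and sorted by word (the shuffle);
  // sum the counts for each word, like uniq -c
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");        // Hadoop 0.20/1.x-style job setup
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);    // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Run it with something like hadoop jar wordcount.jar WordCount <input dir> <output dir>; the output directory must not already exist, and the framework supplies the splitting, scheduling, monitoring, and re-execution described on the Hadoop M/R slide.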
Who uses Hadoop?
Deep Dive: Aster Data
