Big Data
Shankar Radhakrishnan
July 2011
Big Data in the News
Savings:
- American health care: $300 billion/year
- European public sector: €250 billion/year
- Productivity margins: 60% increase
Source: McKinsey Global Institute
Topics
- What do we collect today?
- DBMS Landscape
- The Disconnect
- The Need
- What is Big Data?
  - Characteristics
  - Approach
  - Architectural Requirements
  - Techniques
  - Challenges
  - Solutions
  - Issues
- Deep Dive: Practical Approaches to Big Data
  - Hadoop
  - Aster Data
What do we collect?
- In 2010, people stored enough data to fill 60,000 Libraries of Congress (the LoC had collected 235 TB as of April 2011)
- YouTube receives 24 hours of video every minute
- 5 billion mobile phones were in use in 2010
- Tesco (British retailer) collects 1.5 billion pieces of information to adjust prices and promotions
- Amazon.com: 30% of sales comes from its recommendation engine
- Planecast, Mobclix: track-and-target systems that deliver contextual promotions
- A Boeing jet engine produces 20 TB/hour for engineers to examine in real time to make improvements
Sources: Forrester, The Economist, McKinsey Global Institute
Collect More
- Business Operations: Transactions, Registers, Gateways
- Customer Information: CRM
- Product Information: Barcodes, RFID
- Web: Web Pages, Web Repositories
- Unstructured Information: Social Media
- Signals: Mobile, GPS, GeoSpatial
DBMS Solutions
Legacy strengths:
- Faster retrieval
- Efficient storage (divide and access)
- Data consolidation (broader tables, access all as a row)
- Fine-grain access
- Security (rules and policies)
Problems:
- Data growth, even when storage cost is not an issue
- Scalability issues
- Performance issues
- New types of requirements: deciding what to analyze, when, and how
- Cost of a change in the subject area to analyze
The Disconnect
- Old DBMS vs. new data types/structures
- Old DBMS vs. new volume
- Old DBMS vs. new analysis
- Old DBMS vs. data retention
- Old DBMS vs. data element striping
- Old DBMS vs. data infrastructure
- Old DBMS vs. one DB platform for all
The Need
A system that can handle high-volume data and perform complex operations, and that is:
- Scalable
- Robust
- Highly available
- Fault tolerant
- Economical
In other words: a new approach.
Big Data
“Tools and techniques to manage different types of data, in high volume, at high velocity, with varied requirements to mine them.”

Characteristics:
- Size: scale up and scale out (terabytes, petabytes, ...)
- Structure: structured; unstructured (audio, video, text, geospatial); schema-less structures
- Stream: a torrent of real-time information
- Operation: Massively Parallel Processing (MPP)
Approach
Hardware:
- Commodity hardware
- Appliances: dynamic scaling, fault tolerant, highly available, no constraints on storage
- Cloud: virtual environments and storage
Processing models:
- In-memory
- In-database
- Interfaces/adapters
- Workload management
- Distributed data processing
Software:
- Frameworks: Hadoop, MapReduce, Vrije, BOOM, Bloom
- Open source and proprietary
Architectural Requirements
- Integration framework
- Development framework
- Management framework
- Modeling framework
- Processing framework
- Data management framework
Challenges
- Volumetric analysis
- Complexity
- Streaming data / real-time data
- Network topology
- Infrastructure
- Pattern-based strategy
Techniques
- Controlled and variate testing
- Mining
- Machine learning
- Natural Language Processing (NLP)
- Cohort analysis
- Network or path analysis
- Predictive models
- Crowdsourcing
- Regression models
- Sentiment analysis
- Signal processing
- Spatial analytics
- Visualization
- Time-series analysis
Solutions
- IBM: InfoSphere BigInsights, Streams
- Teradata/Aster Data: nCluster, SQL-MR
- Frameworks: Hadoop, MapReduce
- Infobright*
- Splunk
- Cloudera*
- Cassandra
- NoSQL, NewSQL
- Google’s Bigtable
- Appliances: Teradata, Netezza (IBM)
- Columnar databases: Vertica (HP), ParAccel
* Managed services available
Issues
- Latency
- Faultiness
- Accuracy
- ACID: Atomicity, Consistency, Isolation, Durability
- Setup cost
- Development cost
- Cost-to-fly
Deep Dive: Hadoop
- Top-level Apache project
- Open source
- Software framework (Java)
- Inspired by Google’s white papers on:
  - Map/Reduce (MR)
  - Google File System (GFS)
  - Bigtable
- Originally developed to support Apache Nutch
- Designed for:
  - Large-scale data processing
  - Batch processing
  - Sophisticated analysis
  - Structured and unstructured data
- The DB architect’s Hadoop: "Heck, Another Darn Obscure Open-source Project"
Why Hadoop?
- Runs on commodity hardware
- Portable across heterogeneous hardware and software platforms
- Shared-nothing architecture
- Scale hardware whenever you want; the system compensates for hardware scaling and issues (if any)
- Runs large-scale, high-volume data processes
- Scales well with complex analysis jobs
- (Hardware) “Failure is an option”
- Ideal for consolidating data from both new and legacy data sources
- Highly integrable
- Value to the business
Hadoop Ecosystem
- HDFS: Hadoop Distributed File System
- Map/Reduce: software framework for clustered, distributed data processing
- ZooKeeper: coordination service
- Avro: data serialization
- Chukwa: data collection system to monitor distributed systems
- HBase: data storage for distributed large tables (see the sketch below)
- Hive: data warehouse
- Pig: high-level query language
- Scribe: log collection
- UDF: user-defined functions
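To make one ecosystem piece concrete, here is a minimal sketch of writing and reading a cell through the HBase Java client API of this era (HTable, Put, Get). The table name "clicks", the column family "cf", and the row key are made-up illustrations; the sketch assumes a running HBase cluster whose settings are on the classpath in hbase-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    HTable table = new HTable(conf, "clicks");        // hypothetical table

    // Write one cell: row key "user42", column family "cf", qualifier "page"
    Put put = new Put(Bytes.toBytes("user42"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("page"), Bytes.toBytes("/home"));
    table.put(put);

    // Read the cell back by row key
    Result result = table.get(new Get(Bytes.toBytes("user42")));
    byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("page"));
    System.out.println(Bytes.toString(value)); // prints "/home"
    table.close();
  }
}
```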
Hadoop Flow (Example)
[Diagram] Web Servers → Scribe → Network Storage → Hadoop/Hive DWH; MySQL and Oracle sources also feed the DWH; results flow back out to MySQL/Oracle for Apps and Feeds
HDFS
- Hadoop Distributed File System
- Master/slave architecture
- Runs on commodity hardware
- Fault tolerant
- Handles large volumes of data
- Provides high throughput
- Streaming data access
- Simple file-coherency model
- Portable across heterogeneous hardware and software
- Robust: handles disk failures and replication (and re-replication); performs cluster rebalancing and data-integrity checks
A minimal client-side usage sketch follows.
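As a sketch of how a client uses HDFS, the snippet below writes one file and streams it back through the Java FileSystem API. The path /user/demo/hello.txt is hypothetical; the empty Configuration is assumed to pick up the cluster's core-site.xml and hdfs-site.xml, and block placement, replication, and re-replication happen inside HDFS, not in client code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();       // loads *-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);           // the configured file system (HDFS here)

    Path file = new Path("/user/demo/hello.txt");   // hypothetical path
    FSDataOutputStream out = fs.create(file, true); // true = overwrite if it exists
    out.writeUTF("Hello, HDFS");
    out.close();

    FSDataInputStream in = fs.open(file);           // streaming read, as on the slide
    System.out.println(in.readUTF());
    in.close();
    fs.close();
  }
}
```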
HDFS Architecture
- Name node: performs file system operations; maps data nodes
- Data node: processes reads/writes; handles data blocks; replication

Hadoop M/R
- Work is tagged by a job
- Splits the input data set into separate chunks
- Chunks are processed by map tasks, in parallel
- Sorts the output of the maps
- Sorted map output is processed by reduce tasks, in parallel
- Input and output are typically stored in a file system
- The framework takes care of:
  - Scheduling tasks
  - Monitoring
  - Re-executing failed tasks
  - Infrastructure issues: load balancing, load redistribution, replication, failover
(A complete word-count sketch follows the Mapper and Reduce Function slides below.)
Mapper Function
cat *  |  grep  |  sort     |  uniq -c  |  cat > file
input  |  map   |  shuffle  |  reduce   |  output
Reduce Function
cat *  |  grep  |  sort     |  uniq -c  |  cat > file
input  |  map   |  shuffle  |  reduce   |  output
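Putting the two slides above together: the mapper plays the role of grep, the framework's shuffle/sort plays sort, and the reducer plays uniq -c. Below is a minimal word-count job against the Hadoop 0.20-era Java MapReduce API; the class names and the "word count" job name are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in this task's input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: the framework has already grouped and sorted by word (the shuffle);
  // sum the counts for each word, like uniq -c
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");        // Hadoop 0.20/1.x-style job setup
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);    // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Run it with something like hadoop jar wordcount.jar WordCount <input dir> <output dir>; the output directory must not already exist, and the framework supplies the splitting, scheduling, monitoring, and re-execution described on the Hadoop M/R slide.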
Who uses Hadoop?
Deep Dive: Aster Data
