Big data


Published in: Technology

  1. The Next Frontier for Innovation, Competition and Productivity
  2. What is Big Data?
     Big Data refers to rapidly growing structured and unstructured datasets whose size is beyond the ability of conventional database tools to store, manage, and analyze. Beyond size and complexity, the term also covers the ability of such data to support “evidence-based” decision-making with a high impact on business operations.
     [Figure: size of data (gigabytes, terabytes, petabytes, zettabytes) against speed, accuracy and complexity of intelligence. Labels: small data sets, traditional analytics, advanced analytics, Big Data, Big Data analytics.]
     Source: CRISIL GR&A analysis
  3. Structured Data
     • Resides in formal data stores (RDBMS and data warehouse), grouped in the form of rows or columns
     • Accounts for ~10% of the total data existing currently
     Semi-Structured Data
     • A form of structured data that does not conform with the formal structure of data models
     • Accounts for ~10% of the total data existing currently
     Unstructured Data
     • Comprises data formats that cannot be stored in row/column format, such as audio files, video and clickstream data
     • Accounts for ~80% of the total data existing currently
     [Figure: example data sources. Labels: RDBMS (e.g., ERP and CRM), data warehousing, Microsoft Project Plan File, audio, video, weather patterns, blogs, location co-ordinates, text messages, web logs & clickstreams, sensor data/M2M, email, social media, geospatial data.]
     Source: Industry reporting; CRISIL GR&A analysis
  5. • Volume: data quantity
     • Velocity: data speed
     • Variety: data types
     • Veracity: data quality
  6. Evolution of analytics
     • Basic analytics: What happened? When did it happen? What was its impact?
     • Advanced analytics: Why did it happen? When will it happen again? What caused it to happen? What can be done to avoid it?
     • Big Data analytics is where advanced analytic techniques are applied to Big Data sets.
     • The term came into play in late 2011 to early 2012.
     [Figure: level of complexity against time (late 1990s; 2000 onwards). Stage labels: analytics as a separate value chain function, in-database analytics, descriptive analytics, predictive analytics, prescriptive analytics. Technique labels: standard reports, ad hoc reports, alerts, query drill down, statistical analysis, forecasting, predictive modeling, optimization, stochastic optimization, natural language processing, complex event processing, multivariate statistical analysis, time series analysis, behavioral analytics, data mining, constraint-based BI, social network analytics, semantic analytics, online analytical processing (OLAP), extreme SQL, visualization, analytic database functions, Big Data analytics.]
     Source: CRISIL GR&A analysis
  7. What does the Big Data Ecosystem Constitute?
     Four key elements:
     1. Big Data management & storage: data storage infrastructure and technologies
     2. Big Data analytics: the technologies and tools to analyze the data and generate insight from it
     3. Big Data’s application & use: enabling the Big Data insights to work in BI and end-user applications
     4. IT services, including system integration, consulting, and project management and customization
     [Figure: Components of Big Data Ecosystem. Layer labels: data management & storage; data analytics & its application and use; IT services (SI, customization, consulting, system design). Component labels: data sources; input data; operational data; unstructured data (text, web pages, social media content, video, etc.); structured data (stored in MPP, RDBMS and DW*); Big Data data architecture; Hadoop/Big Data technology framework (MapReduce, etc.); NoSQL, MPP, RDBMS, DW and Hadoop-based stores; data administration tools; ETL & data integration products; system tools; workflow/scheduler products; developer environments (languages such as Java, environments such as Eclipse & NetBeans, programming interfaces such as MapReduce); analytics products (Avro, Apache Thrift); BI & visualization tools; applications (mobile, search, web); end users; business analysts.]
     *MPP: massively parallel processing; RDBMS: relational database management systems; DW: data warehouse
     Source: CRISIL GR&A analysis
  10. Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data. It includes:
      – MapReduce: offline computing engine
      – HDFS: Hadoop distributed file system
      – HBase (pre-alpha): online data access
      • Yahoo! is the biggest contributor.
      • Here's what makes it especially useful:
        ◦ Scalable: It can reliably store and process petabytes.
        ◦ Economical: It distributes the data and processing across clusters of commonly available computers, numbering in the thousands.
        ◦ Efficient: By distributing the data, it can process it in parallel on the nodes where the data is located.
        ◦ Reliable: It automatically maintains multiple copies of data and automatically redeploys computing tasks after failures.
  11. It is written with large clusters of computers in mind and is built around the following assumptions:
      ◦ Hardware will fail.
      ◦ Processing will be run in batches. Thus there is an emphasis on high throughput as opposed to low latency.
      ◦ Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size.
      ◦ It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
      ◦ Applications need a write-once-read-many access model.
      ◦ Moving computation is cheaper than moving data.
      ◦ Portability is important.
  12. MapReduce
      • Programming model developed at Google
      • Sort/merge-based distributed computing
      • It was initially intended for Google's internal search/indexing application, but is now used extensively by other organizations (e.g., Yahoo, IBM, etc.).
      • It is functional-style programming that is naturally parallelizable across a large cluster of workstations or PCs.
      • The underlying system takes care of partitioning the input data, scheduling the program’s execution across several machines, handling machine failures, and managing the required inter-machine communication. (This is the key to Hadoop’s success.)
  13. • Hadoop implements Google’s MapReduce, using HDFS.
      • MapReduce divides applications into many small blocks of work.
      • HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster.
      • MapReduce can then process the data where it is located.
      • Hadoop’s target is to run on clusters on the order of 10,000 nodes.
  14. • The runtime partitions the input and provides it to different Map instances.
      • Map (key, value) → (key’, value’)
      • The runtime collects the (key’, value’) pairs and distributes them to several Reduce functions so that each Reduce function gets the pairs with the same key’.
      • Each Reduce produces a single output file (or none).
      • Map and Reduce are user-written functions.
  15. map(String key, String value):
        // key: document name; value: document contents
        // map (k1, v1) → list(k2, v2)
        for each word w in value:
          EmitIntermediate(w, "1");
      Example: if the input string is “God is God. I am I”, Map produces {<“God”, 1>, <“is”, 1>, <“God”, 1>, <“I”, 1>, <“am”, 1>, <“I”, 1>}.
      reduce(String key, Iterator values):
        // key: a word; values: a list of counts
        // reduce (k2, list(v2)) → list(v2)
        int result = 0;
        for each v in values:
          result += ParseInt(v);
        Emit(AsString(result));
      Example: reduce(“I”, <1, 1>) → 2
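The word-count pseudocode above can be tried out in a single process. The following is a minimal Python sketch, not Hadoop's API: the names `map_fn`, `reduce_fn` and `run_word_count` are hypothetical, the word-splitting regular expression is an assumption, and the in-memory dictionary stands in for the shuffle that a real cluster performs over the network.

```python
import re
from collections import defaultdict


def map_fn(doc_name, contents):
    """Emit a (word, 1) pair for every word in the document contents."""
    for word in re.findall(r"[A-Za-z']+", contents):
        yield word, 1


def reduce_fn(word, counts):
    """Sum all the counts that were emitted for one word."""
    return word, sum(counts)


def run_word_count(documents):
    """Run map, group values by key (the shuffle), then reduce, in one process."""
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            grouped[word].append(count)
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


counts = run_word_count({"doc1": "God is God. I am I"})
print(counts)  # {'God': 2, 'is': 1, 'I': 2, 'am': 1}
```

Because each call to `map_fn` and `reduce_fn` depends only on its own arguments, the calls could be spread across machines without changing the result, which is the property the programming model relies on.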
  16. Can we do word count in parallel?
      Input 1: “see bob throw” → see 1, bob 1, throw 1
      Input 2: “see spot run” → see 1, spot 1, run 1
      Merged result: bob 1, run 1, see 2, spot 1, throw 1
  17. • InputFormat
      • Map function
      • Partitioner
      • Sorting & Merging
      • Combiner
      • Shuffling
      • Merging
      • Reduce function
      • OutputFormat
      • 1:many
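Three of the stages listed above (combiner, partitioner, and sorting before the reduce) can be illustrated with a small Python sketch. It assumes hash partitioning and a sum combiner for word counts; the function names `partition`, `combine` and `shuffle` are hypothetical and are not Hadoop classes.

```python
import zlib
from collections import Counter


def partition(key, num_reducers):
    """Pick a reduce task for a key; a stable hash keeps equal keys together."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def combine(pairs):
    """Pre-aggregate one map task's output locally, before the shuffle."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def shuffle(map_outputs, num_reducers):
    """Route combined pairs to reducers and sort each reducer's input by key."""
    buckets = [[] for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in combine(pairs):
            buckets[partition(key, num_reducers)].append((key, value))
    return [sorted(bucket) for bucket in buckets]


# One list of (word, 1) pairs per map task.
map_outputs = [
    [("see", 1), ("bob", 1), ("throw", 1)],
    [("see", 1), ("spot", 1), ("run", 1)],
]
for reducer_id, bucket in enumerate(shuffle(map_outputs, 2)):
    print(reducer_id, bucket)
```

The combiner reduces the volume of data sent over the network, and the partitioner guarantees that every pair for a given word reaches the same reducer, so each reducer can total its words independently.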
  18. • Worker failure: The master pings every worker periodically. If no response is received from a worker within a certain amount of time, the master marks the worker as failed. Any map tasks completed by that worker are reset to their initial idle state and become eligible for scheduling on other workers, because their output was stored on the failed machine's local disk. Similarly, any map or reduce task in progress on a failed worker is reset to idle and becomes eligible for rescheduling.
      • Master failure: It is easy to make the master write periodic checkpoints of its data structures. If the master task dies, a new copy can be started from the last checkpointed state. However, in most cases, the user restarts the job.
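The worker-failure rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the scheduler of any real system: the `Task` record, the heartbeat dictionary and the 30-second timeout are all hypothetical. Completed reduce tasks are left alone, on the assumption that their output already sits in the shared file system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = "idle"           # "idle", "in_progress" or "completed"
    worker: Optional[str] = None  # worker the task was assigned to


def failed_workers(last_response, now, timeout):
    """Workers whose last ping response is older than the timeout."""
    return {w for w, seen in last_response.items() if now - seen > timeout}


def reschedule(tasks, dead):
    """Reset tasks owned by failed workers so they can be scheduled again."""
    for task in tasks:
        if task.worker not in dead:
            continue
        in_progress = task.state == "in_progress"
        # Completed map output lived on the failed worker's local disk.
        lost_map_output = task.kind == "map" and task.state == "completed"
        if in_progress or lost_map_output:
            task.state, task.worker = "idle", None


tasks = [
    Task("map", "completed", "w1"),
    Task("map", "in_progress", "w2"),
    Task("reduce", "in_progress", "w1"),
    Task("reduce", "completed", "w1"),
]
dead = failed_workers({"w1": 100.0, "w2": 158.0}, now=160.0, timeout=30.0)
reschedule(tasks, dead)
print(dead, [t.state for t in tasks])
# {'w1'} ['idle', 'in_progress', 'idle', 'completed']
```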
  19. • The input data (on HDFS) is stored on the local disks of the machines in the cluster. HDFS divides each file into 64 MB blocks and stores several copies of each block (typically 3 copies) on different machines.
      • The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule a map task near a replica of that task's input data. When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.
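The block arithmetic and the locality preference described above can be sketched as follows. The 64 MB block size and the 3 replicas come from the slide; everything else is an assumption made for illustration: the round-robin `place_replicas` policy is hypothetical (real HDFS placement is rack-aware), and "near a replica" is modeled here as "on the same rack".

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB blocks, as stated on the slide
REPLICATION = 3                # typical number of copies per block


def num_blocks(file_size_bytes):
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)


def place_replicas(block_id, nodes, replication=REPLICATION):
    """Hypothetical round-robin placement of one block's replicas."""
    return [nodes[(block_id + i) % len(nodes)] for i in range(replication)]


def pick_worker(replica_nodes, idle_workers, rack_of):
    """Prefer a worker holding a replica, then one on the same rack, then any."""
    for worker in idle_workers:
        if worker in replica_nodes:
            return worker, "node-local"
    replica_racks = {rack_of[node] for node in replica_nodes}
    for worker in idle_workers:
        if rack_of[worker] in replica_racks:
            return worker, "rack-local"
    return idle_workers[0], "remote"


nodes = ["n1", "n2", "n3", "n4", "n5", "n6"]
rack_of = {"n1": "A", "n2": "A", "n3": "A", "n4": "B", "n5": "B", "n6": "B"}

print(num_blocks(200 * 1024 * 1024))                          # 4
print(place_replicas(0, nodes))                               # ['n1', 'n2', 'n3']
print(pick_worker(["n1", "n2", "n3"], ["n5", "n2"], rack_of))  # ('n2', 'node-local')
print(pick_worker(["n3", "n4", "n5"], ["n6", "n1"], rack_of))  # ('n6', 'rack-local')
print(pick_worker(["n1", "n2", "n3"], ["n4", "n5"], rack_of))  # ('n4', 'remote')
```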
  20. • The map phase has M pieces and the reduce phase has R pieces.
      • M and R should be much larger than the number of worker machines.
      • Having each worker perform many different tasks improves dynamic load balancing and also speeds up recovery when a worker fails.
      • The larger M and R are, the more scheduling decisions the master must make.
      • R is often constrained by users because the output of each reduce task ends up in a separate output file.
      • Typically (at Google), M = 200,000 and R = 5,000, using 2,000 worker machines.
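A short calculation makes the trade-off above concrete for the figures quoted from Google. The variable names are hypothetical; the cost model (on the order of M + R scheduling decisions and M × R pieces of state held by the master) is the one given in the original MapReduce paper, not something stated on the slide.

```python
M, R, WORKERS = 200_000, 5_000, 2_000

map_tasks_per_worker = M / WORKERS     # average map tasks each worker handles
reduce_tasks_per_worker = R / WORKERS  # average reduce tasks each worker handles
scheduling_decisions = M + R           # grows linearly with M and R
state_entries = M * R                  # map/reduce task pairs tracked by the master

print(map_tasks_per_worker)     # 100.0
print(reduce_tasks_per_worker)  # 2.5
print(scheduling_decisions)     # 205000
print(state_entries)            # 1000000000
```

With about 100 map tasks per worker, the work of a failed machine can be spread in small pieces over the remaining workers, which is why recovery is fast; the price is a master that must track a large number of task pairs.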
  21. • The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are significant. HDFS:
        ◦ is highly fault-tolerant and is designed to be deployed on low-cost hardware.
        ◦ provides high-throughput access to application data and is suitable for applications that have large data sets.
        ◦ relaxes a few POSIX requirements to enable streaming access to file system data.
        ◦ is part of the Apache Hadoop Core project. The project URL is
  25. Demand-supply gap for data scientists* in US, 2018
      2018E supply: 300K; 2018E demand: 440K-490K; gap: 140K-190K (50%-60% gap relative to supply)
      Demand-supply gap for data-savvy managers** in US, 2018
      2018E supply: 2.5 million; 2018E demand: 4.0 million; gap: 1.5 million (60% gap relative to supply)
      Data Scientists
      • Role in ecosystem: Big Data analytics; business intelligence; visualization
      • Requisite educational qualifications: advanced degree such as M.S. or Ph.D. in mathematics, statistics, economics, computer science or any decision science
      • Other expertise: data analytics skills to extract data and use modeling & simulations; multi-disciplinary knowledge of business to find insights
      Data-savvy Managers
      • Role in ecosystem: project management across the Big Data ecosystem (consulting services, implementation, infrastructure management, analytics)
      • Requisite educational qualifications: advanced business degree such as MBA, M.S. or managerial diplomas
      • Other expertise: knowledge of statistics and/or machine learning to frame key questions and analyze answers; conceptual knowledge of business to interpret and challenge the insights; ability to make decisions using Big Data insights
      Technical Engineers
      • Role in ecosystem: technical support in hardware & software across the Big Data ecosystem for data architecture, data administration, developer environment and applications
      • Requisite educational qualifications: degree in computer science, information technology, systems engineering or related disciplines
      • Other expertise: data management knowledge; IT skills to develop, implement, and maintain hardware and software
      *Analysts with deep analytical training; **Managers to analyze Big Data and make decisions based on their findings
      Source: McKinsey Global Institute; CRISIL GR&A analysis
  26. Global Big Data Market Size, 2011 – 2015E (US$ billion)
      2011E: 5.3-5.6
      2012E: 8.0-8.5
      2015F: 25.0-26.0
      • The global Big Data market is expected to grow at a CAGR of about 46% over 2012-2015.
      • IT & ITES, including analytics, is expected to grow the fastest, at a rate of more than 60%.
        – Its share in the total Big Data market is expected to increase to ~45% in 2015 from ~31% in 2011.
      • The USD 25 billion figure represents the initial wave of the opportunity, which is set to expand even more rapidly after 2015 given the pace at which data is being generated.
      Global Big Data Market Size, 2015F (~US$25 billion): segments of US$ 6-6.5 billion, US$ 7-7.5 billion and US$ 10-11 billion across Big Data analytics & IT & IT-enabled services, software, and hardware.
      • The lion’s share of the Big Data hardware and software market is expected to be occupied by IT giants like IBM, HP, Microsoft, SAP, SAS, Oracle, etc.
      • The opportunity for India lies in capturing the slice of IT services that includes Big Data analytics and IT & IT-enabled services.
      Source: Industry reporting; CRISIL GR&A analysis
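The quoted growth rate can be checked against the market-size ranges on the slide. The sketch below assumes the standard compound annual growth rate formula over the three years from 2012 to 2015; the function name `cagr` and the choice of range midpoints are illustrative assumptions, not part of the source analysis.

```python
def cagr(start, end, years):
    """Compound annual growth rate between two values over a number of years."""
    return (end / start) ** (1 / years) - 1


# US$ billion, taken from the 2012E and 2015F ranges on the slide.
low = cagr(8.5, 25.0, 3)    # highest 2012 estimate to lowest 2015 forecast
mid = cagr(8.25, 25.5, 3)   # midpoint to midpoint
high = cagr(8.0, 26.0, 3)   # lowest 2012 estimate to highest 2015 forecast

print(f"{low:.1%} to {high:.1%}, midpoint {mid:.1%}")
```

The midpoint figure rounds to 46%, consistent with the CAGR stated on the slide.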