Big Data & Hadoop Introduction


Disclaimer:
The images, company, product, and service names used in this presentation are for illustration purposes only. All trademarks and registered trademarks are the property of their respective owners. Data and images were collected from various Internet sources.

The intention is to present the big picture of Big Data & Hadoop.

  • @kprincehp Hey Keith, I have already mentioned MPP on slide 12, which is another way of handling Big Data; my target was to give a Big Data & Hadoop introduction, in which Hadoop is the new way to handle the scale. No way can one deny RDBMS technology; it's inevitable.
  • Oh dear, a completely misrepresented view of RDBMS technology that totally misses the explosive scalability MPP systems have been delivering for 25 years - excluding the likes of Teradata, Netezza, & Vertica. It illustrates the problem with the Hadoopla being peddled by those who don't know the breadth of technologies that exist.

Slide notes:
  • Veracity is defined as "conformity with truth or fact" - in short, accuracy or certainty. Things that can cause us to question the data are inconsistencies, model approximations, ambiguities, deception, fraud, duplication, spam, and latency. Variability: say you go to an ice cream parlor that has 20 flavors of ice cream - that is variety. Now say you go there three days in a row and order strawberry, but each time it looks and tastes different - that is variability: the different meanings/contexts associated with a given piece of data. Value: how fast and accurately you analyze the data and provide analytics that make sense (business sense) out of it.
  • India alone generated about 40,000 PB of data in 2010 (EMC & IDC data). Volume: whether they deal with incoming or outgoing requests, companies with exceptionally large amounts of data always look for faster, more efficient, and lower-cost solutions for data storage and access requirements.
  • 90% of the world's data was generated in the last 2 years. Velocity: a high rate of data arriving from multiple, disparate sources in various formats requires solutions that rapidly process query requests for large data, and that also support the acquisition and retention of data just as quickly.
  • Variety: traditionally, companies have only analyzed data in structured formats, and have either fought to generate value from unstructured data or have confined their analysis to a structured part of the overall picture. Today's technology, such as "Not Only SQL" (NoSQL) platforms, lets businesses combine structured data with unstructured and semi-structured data to answer questions spanning all of their managed data.
  • "Data is the new oil!" - Clive Humby, ANA Senior Marketer's Summit, 2006. "Data is the new oil? No: data is the new soil." - David McCandless, TEDGlobal, 2010. Value: IT departments have had to make tough decisions about which data to keep and how long to keep it, and the processing power required to perform large and complex ad hoc analysis has often been beyond the department's capacity and budget. Big-data solutions can provide value through insights gained by combining larger sets of data than were previously possible to manage. Now companies can harvest more external data on market conditions, customer satisfaction, and competitive analysis, performing what-if scenarios for new insights.
  • Variability: the variability in data structure, and how users want to interpret that data in the short and long term, are considerations that may help a solution provider steer an organization toward a big data solution. Often the initial structure and content of data change over time, and similar data from different sources can exhibit wide variability in structure and format. Big data solutions allow data to be stored in its original form and transformed for in-depth analysis only when a user queries the data - see the schema-on-read sketch below. (Slide example: China introduced a two-child policy around 1970 and the one-child policy in 1979.)
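To make "store raw, apply structure at query time" concrete, here is a minimal, hedged schema-on-read sketch in Java. The record layout, field names, and tab-separated format are assumptions for illustration only, not anything from the slides: the raw line is stored untouched, and a structure is imposed only when the data is read for analysis.

    // Schema-on-read sketch: raw lines are stored as-is; a model is applied at read time.
    public class SchemaOnRead {
        // Assumed shape for this example only.
        static final class PageView {
            final String user;
            final String url;
            PageView(String user, String url) { this.user = user; this.url = url; }
        }

        // Parsing is deferred until the data is actually read for analysis.
        static PageView parse(String rawLine) {
            String[] fields = rawLine.split("\t"); // assumed tab-separated raw log
            return new PageView(fields[0], fields[1]);
        }

        public static void main(String[] args) {
            PageView pv = parse("alice\t/products/42"); // illustrative raw record
            System.out.println(pv.user + " viewed " + pv.url);
        }
    }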
  • SMP systems are limited by the capacity of the OS to manage the architecture, which restricts solutions to roughly 16 to 32 processors. MPP systems often contain 50 to 200 processors or more, and can grow horizontally simply by adding more processors.
  • Challenges with distributed computing:
    - Cheap nodes fail, especially when you have many. If the mean time between failures for 1 node is 3 years, the mean time between failures for 1000 nodes is about 1 day. Solution: build fault tolerance into the system.
    - A commodity network means low bandwidth. Solution: push computation to the data.
    - Programming distributed systems is hard. Solution: a data-parallel programming model - users write "map" & "reduce" functions, and the system distributes the work and handles faults. (A minimal local sketch follows this list.)
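As a minimal, single-JVM illustration of the "map & reduce" idea (plain Java, no Hadoop; the data and class name are made up for the example): the flatMap step plays the role of "map", emitting one record per token, and the grouping collector plays the role of "reduce", summing counts per key.

    import java.util.*;
    import java.util.stream.*;

    public class LocalWordCount {
        public static void main(String[] args) {
            // Two "input splits" standing in for blocks of a large file.
            List<String> lines = List.of("big data big hadoop", "hadoop handles big data");
            Map<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))              // "map": emit tokens
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));  // "reduce": sum per word
            System.out.println(counts); // e.g. {big=3, data=2, hadoop=2}
        }
    }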
  • Confronted with a data explosion, Google engineers Jeff Dean and Sanjay Ghemawat architected (and published!) two seminal systems: the Google File System (GFS) and Google MapReduce (GMR). GFS was a brilliantly pragmatic solution to exabyte-scale data management using commodity hardware. GMR was an equally brilliant implementation of a long-standing design pattern applied to massively parallel processing of said data on said commodity machines. GFS and GMR became the core of the processing engine used to crawl, analyze, and rank web pages into the giant inverted index that we all use daily at Google. The open-source world then reverse-engineered these systems, and, voila, Apache Hadoop - comprising the Hadoop Distributed File System and Hadoop MapReduce - was born in the image of GFS and GMR. Doug Cutting, who was working at Yahoo at the time, named it after his son's toy elephant.
  • Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. The example here shows what happens with a replication factor of 3: each data block is present on at least 3 separate data nodes. A typical Hadoop node has eight cores with 16 GB RAM and four 1 TB SATA disks. The default block size is 64 MB, though most folks now set it to 128 MB. (A small HDFS API sketch follows.)
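As a concrete illustration of those two knobs, here is a minimal sketch using the Hadoop Java FileSystem API to create a file with an explicit replication factor and block size. The path, buffer size, and payload are made up for the example, and a configured cluster (core-site.xml / hdfs-site.xml on the classpath) is assumed.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up cluster config from the classpath
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/data/example.log");  // illustrative path
            // create(path, overwrite, bufferSize, replication, blockSize)
            FSDataOutputStream out = fs.create(file, true, 4096,
                    (short) 3,              // replication factor = 3, as on the slide
                    128L * 1024 * 1024);    // block size = 128 MB
            out.writeBytes("hello hdfs\n");
            out.close();
        }
    }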
  • Differentiate between MapReduce the platform and MapReduce the programming model. The analogy is to an RDBMS, which executes the queries, versus SQL, which is the language for the queries. MapReduce can run on top of HDFS or a selection of other storage systems, with intelligent scheduling algorithms for locality, sharing, and resource optimization.
  • RDBMS and Hadoop: apples and oranges? Choose the right tool for the right job.
  • Facebook, Yahoo, eBay, GE (Sentiment Analysis), Orbitz, Infochimps
  • 1. Variety refers to the many different data and file types that are important to manage and analyze more thoroughly, but for which traditional relational databases are poorly suited. Examples of this variety include sound and movie files, images, documents, geo-location data, web logs, and text strings. Velocity is about the rate of change in the data and how quickly it must be used to create real value. Traditional technologies are especially poorly suited to storing and using high-velocity data, so new approaches are needed. The faster the data in question is created and aggregated, and the more swiftly it must be used to uncover patterns and problems, the more likely you have a Big Data opportunity.
    2. Although Hadoop has captured the greatest name recognition, it is just one of three classes of technologies well suited to storing and managing Big Data. The other two classes are NoSQL and Massively Parallel Processing (MPP) data stores; examples of MPP data stores include EMC's Greenplum, IBM's Netezza, and HP's Vertica.
    3. The consistent trait of these varied data types is that the data schema isn't known or defined when the data is captured and stored; rather, a data model is often applied at the time the data is used.
    4. Now, thanks to rapidly increasing computing power (often cloud-based), open-source software (e.g., the Apache Hadoop distribution), and a modern onslaught of data that could generate economic value if properly utilized, there is an endless stream of Big Data uses and applications.
    5. The specific native access methods to stored data provide a rich, low-latency approach, typically through a proprietary interface. SQL access has the advantage of familiarity and compatibility with many existing tools, although usually at some expense of latency, driven by interpretation of the query into the native "language" of the underlying system.
  • “President Obama’s campaign ran an extremely sophisticated and relentless digital operation that threw out the rule book and took no assumption for granted, it was masterminded by data analysts who left nothing to chance.”
  • Human profiling. Related video: "Presidential Election: Powered by Hadoop"
Slide transcript:

    1. BIG DATA: Enlightening Big Data - Jayant
    2. What is BIG Data? How BIG is BIG Data?
    3. How to define BIG Data? Gartner's Doug Laney, in a 2001 research report.
    4. Volume
    5. Velocity
    • Facebook: 300m photos uploaded / day; 2.5b content items shared / day; 70K queries executed / day; 500+ TB of new data / day
    • Twitter: 340m tweets / day; 140m active users
    • Google: 4.7b search queries / day; processing 20 PB of data / day
    • Walmart: 1m transactions / hour, feeding databases estimated at more than 2.5 petabytes
    6. Variety
    • Structured analysis: responses to the pledge; responses to multiple-choice questions
    • Unstructured analysis: responses to the following open questions: Share your story; Ask a question to Aamir; Send a message of hope; Share your solution
    • Content Filtering Rating Tagging System (CFRTS): L0, L1, L2 phased analytics
    • Impact analysis: crawling the general internet to measure the before & after scenario on a particular topic
    7. Value. "It is a capital mistake to theorize before one has data." - Sherlock Holmes
    8. Variability. Who enjoys the fastest internet? Where does our energy come from? Living longer with fewer children.
    9. Veracity
    10. Other Effects - Geo, Event, …
    11. 3 I's for Big Data
    • Ill-Defined: "data that's an order of magnitude greater than data you're accustomed to." - Gartner analyst Doug Laney. "Data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it." - Ed Dumbill, program chair for the O'Reilly Strata Conference
    • Intimidating: How do you make Big Data approachable? There are lots of challenges in leveraging Big Data, from managing the data to having the right tools to get you the insights that matter. Companies like Splunk and Sumo Logic are Big Data apps for machine data; marketing-relevance company BloomReach processes more than 100 million web pages, generating 94% average annual incremental traffic as a result.
    • Immediate: What's actionable about Big Data? "The analytic value of data decays rapidly." - Andrew Rogers, founder and CTO of SpaceCurve. That means being able to analyze your data as fast as possible is critical to gaining competitive advantage: strike while the iron is hot.
    12. Managing BIG Data
    • Distributed computing, multiprocessing units, parallel processing
    • SMP (Symmetric MultiProcessing): SMP systems use multiple processors that share a common operating system (OS) and memory. e.g. the Microsoft SQL Server 2008 R2 Fast Track Data Warehouse platform
    • MPP (Massively Parallel Processing): MPP systems harness numerous processors, each with its own OS and memory, working on different parts of an operation in a coordinated way. e.g. Microsoft's Parallel Data Warehouse solution
    • NoSQL platforms: increase performance at a lower cost, with linear scalability, true commodity hardware, a schema-free structure, and more relaxed data-consistency validation. e.g. Hadoop
    13. Evolution - Distributed Systems
    • ACID: Atomicity, Consistency, Isolation, Durability. For internet workloads on distributed systems, the ACID properties are too strong.
    • BASE: Basic Availability, Soft-state, Eventual consistency. Rather than requiring consistency after every transaction, it is enough for the database to eventually be in a consistent state.
    Brewer's CAP Theorem for distributed systems:
    • Consistent - reads always pick up the latest write.
    • Available - can always read and write.
    • Partition tolerant - the system can be split across multiple machines and datacenters.
    A system can achieve at most two of these three.
    14. Path to Data Stack 3.0 (must support Variety, Volume, and Velocity)
    • Data Stack 1.0: relational database systems; recording business events; highly normalized data; GBs of data; end-user access through enterprise apps; structured data
    • Data Stack 2.0: enterprise data warehouse; support for decision making; denormalized dimensional model; TBs of data; end-user access through reports; structured data
    • Data Stack 3.0: dynamic data platform; uncovering key insights; schema-less approach; PBs of data; direct end-user access; structured + semi-structured data
    15. Hadoop
    • A scalable, fault-tolerant grid operating system for data storage and processing
    • Its scalability comes from the marriage of: HDFS (self-healing, high-bandwidth clustered storage) and MapReduce (fault-tolerant distributed processing)
    • Operates on unstructured and structured data
    • A large and active ecosystem (many developers and additions like HBase, Hive, Pig, …)
    • Open source under the friendly Apache License
    • Design axioms: the system shall manage and heal itself; performance shall scale linearly; compute should move to data; simple core, modular and extensible
    16. Hadoop
    • Hadoop's inspiration: Google's GFS & GMR became Hadoop's HDFS & Hadoop MapReduce.
    • Hadoop was created by Doug Cutting and Michael J. Cafarella.
    • Hadoop is written in the Java programming language and is a top-level Apache project, built and used by a global community of contributors.
    Timeline:
    • 2002-2004: Doug Cutting and Mike Cafarella start working on Nutch.
    • 2003-2004: Google publishes the GFS and MapReduce papers.
    • 2004: Cutting adds DFS & MapReduce support to Nutch.
    • 2006: Yahoo! hires Cutting; Hadoop spins out of Nutch.
    • 2007: NY Times converts 4 TB of archives over 100 EC2 instances.
    • 2008: Web-scale deployments at Yahoo! and Facebook. April 2008: Yahoo does the fastest sort of a TB, 3.5 minutes over 910 nodes.
    • May 2009: Yahoo does the fastest sort of a TB, 62 seconds over 1460 nodes, and sorts a PB in 16.25 hours over 3658 nodes.
    • June 2009, Oct 2009: Hadoop Summit (750 attendees), Hadoop World (500 attendees).
    17. HDFS - Hadoop Distributed File System. Block size = 64 MB, replication factor = 3. Cost/GB is a few ¢/month vs. $/month.
    18. MapReduce - Distributed Processing
    19. Working of Hadoop - I (Map Reduce)
    20. Working of Hadoop - I (Map Reduce)
    21. Working of Hadoop - I (MR Code)

        // WordCount mapper: tokenize each input line and emit (word, 1).
        // 'word', 'one', and 'result' are Text/IntWritable fields of the
        // enclosing Mapper/Reducer classes in the standard WordCount example.
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }

        // WordCount reducer: sum the counts for each word and emit (word, total).
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) { sum += val.get(); }
            result.set(sum);
            context.write(key, result);
        }
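For completeness, a minimal driver that wires such a mapper and reducer into a job is sketched below. It follows the standard Hadoop WordCount pattern; the class names WordCount, TokenizerMapper, and IntSumReducer are assumed wrappers for the methods above, not names taken from the slides.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);  // assumed class holding map() above
            job.setCombinerClass(IntSumReducer.class);  // reducer doubles as a combiner here
            job.setReducerClass(IntSumReducer.class);   // assumed class holding reduce() above
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }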
    22. Working of Hadoop - II
    23. Working of Hadoop - III
    24. Hadoop Layout
    25. Hadoop - Economics
    • Typical hardware: two quad-core Nehalems, 24 GB RAM, 12 x 1 TB SATA disks (JBOD mode, no need for RAID), 1 Gigabit Ethernet card
    • Cost/node: $5K
    • Effective HDFS space: ¼ of raw disk is reserved for temp shuffle space, which leaves 9 TB/node; 3-way replication leads to 3 TB effective HDFS space/node; assuming 7x compression, that becomes ~20 TB/node (see the worked arithmetic below)
    • Effective cost per user TB: $250/TB; other solutions cost in the range of $5K to $100K per user TB
    Powered by Hadoop:
    • Facebook: 1100-node cluster with 8800 cores; stores copies of internal log and dimension data sources, used as a source for reporting/analytics and machine learning
    • Yahoo: biggest cluster has 4000 nodes; Search Marketing, People You May Know, Search Assist, and many more
    • eBay: 532-node cluster (8 x 532 cores, 5.3 PB); used for search optimization and research
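Spelling out the per-node arithmetic behind those figures, using the slide's own numbers:

    12 disks x 1 TB               = 12 TB raw per node
    minus 1/4 for shuffle space   = 9 TB usable per node
    divided by 3-way replication  = 3 TB effective HDFS space per node
    times ~7x compression         = ~21 TB (call it ~20 TB) of user data per node
    $5K per node / ~20 TB         = ~$250 per user TB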
    26. RDBMS and Hadoop
    Dimension    RDBMS                   MapReduce
    Data size    Gigabytes               Petabytes
    Access       Interactive and batch   Batch
    Structure    Fixed schema            No fixed schema
    Language     SQL                     Procedural (Java, C++, Ruby, etc.)
    Integrity    High                    Low
    Scaling      Nonlinear               Linear
    Updates      Read and write          Write once, read many times
    Latency      Low                     High
    27. Choose the Right Tool
    28. BIG Data Landscape
    29. Hadoopable Problem Types
    1. Batchable: they are batchable into the two-phase map/reduce sequence(s).
    2. Massive volume: there is a need to analyze massive data volumes, which precludes solving them on more traditional platforms.
    3. No data dependency: they exhibit little or no data dependence, meaning the work being done by one computational node is largely done on data locally accessible to that node.
    4. No process dependency: they are amenable to massive parallelism in that there is little process dependence across computations. The tasks do not have to be "sequentialized": they can execute at the same time without waiting on each other for interim results, except during the transition between the map and reduce phases.
    5. Unstructured++: they are not limited to data managed within a structured environment; unstructured data analysis, and analyzing combinations of structured and unstructured data, are suitable.
    6. Limited inter-process communication: individually assigned tasks require limited inter-process communication, reducing latency delays associated with injecting data into and pulling data out of the network.
    Super-scale Hadoop deployments
    30. Myths
    1. Big Data is only about massive data volume: volume is just one key element in defining Big Data, and arguably the least important of three elements; the other two are variety and velocity. Experts consider PBs of data volume the starting point for Big Data, although this volume indicator is a moving target.
    2. Big Data means Hadoop: Hadoop is the Apache open-source software framework for working with Big Data, derived from Google technology and put to practice by Yahoo and others; but Big Data is too varied and complex for a one-size-fits-all solution.
    3. Big Data means unstructured data: the term "unstructured" is imprecise and doesn't account for the many varying and subtle structures typically associated with Big Data types. Big Data is probably better termed "multi-structured," as it can include text strings, documents of all types, audio and video files, metadata, web pages, email messages, social media feeds, form data, etc.
    4. Big Data is for social media feeds and sentiment analysis: the early pioneers of Big Data were the largest web-based social media companies (Google, Yahoo, Facebook), but it was the volume, variety, and velocity of data generated by their services that required a radically new solution, not the need to analyze social feeds or gauge audience sentiment.
    5. NoSQL means "no SQL": NoSQL means "not only SQL," because these types of data stores also offer domain-specific access. Technologies in this category include key-value stores, document-oriented databases, graph databases, big-table structures, and caching data stores.
    31. Where/How It's Used
    • Business: behavioral analysis; targeted marketing offers; analyzing marketing effectiveness; root-cause analysis; sentiment analysis; fraud analysis; risk mitigation
    • Technical: staging area for data warehouse/analytics; analytics sandbox; unstructured/semi-structured content storage and analysis; total data analysis; commodity-based storage
    32. Applications
    33. Case Study: a rigorous weekly operation cycle producing instant analytics - a killer combo of human + software to analyze the data efficiently.
    • Topic opens on Sunday.
    • Data is captured from SMS, phone calls, social media, and the website all throughout the week.
    • The system runs L0 analysis; L1 and L2 analysts continue.
    • Episode tags are refined and messages are re-ingested for another pass.
    • JSONs are created for the external and internal dashboards.
    • Featured content is delivered three times a day.
    • A live analytics report is sent during the show.
    34. Road Ahead…
    35. "With too little data, you won't be able to make any conclusions that you trust. With loads of data you will find relationships that aren't real… Big data isn't about bits, it's about talent." - Douglas Merrill. Q&A
    36. "Torture the data, and it will confess to anything." - Ronald Coase, Nobel Prize Laureate in Economics. Thank You