Compegence: Nagaraj Kulkarni - Hadoop and No SQL_TDWI_2011Jul23_Preso


COMPEGENCE: Nagaraj Kulkarni - Hadoop and No SQL.
Presented in TDWI (The Data Warehousing Institute) India, Hyderabad (July 2011)

Published in: Technology

  1. Co-Creating COMPetitive intelliGENCE through Process, Data and Domain driven Information Excellence
     Hadoop and No SQL
     Presented in TDWI India, Hyderabad (2011 July)
     Nagaraj Kulkarni
     Hadoop and No SQL, Slide 1, 2011 Jul
  2. Process, Data and Domain Integrated Approach
     [Diagram labels: Market Actions; Systemic Changes; Decision Excellence; Business Process Landscape; Business Intent; Business Usage; Systems; Domain; Data; Information that is Actionable, Usable, Timely, Flexible, Scalable and Sustainable; Cost; Effort; Skills & Competency]
     Competitive Advantage lies in the exploitation of:
     • More detailed and specific information
     • More comprehensive external data & dependencies
     • Fuller integration
     • More in depth analysis
     • More insightful plans and strategies
     • More rapid response to business events
     • More precise and apt response to customer events
     Ram Charan’s Book: What The CEO Wants You To Know: How Your Company Really Works
  3. Touching Upon
     • Context For Big Data Challenges
     • Data Base Systems – Pre Hadoop: Strengths and Limitations
     • What is Scale, Why No SQL
     • Think Hadoop, Hadoop Eco system
     • Think Map Reduce
     • Nail Down Map Reduce
     • Think GRID (Distributed Architecture)
     • Deployment Options
     • Map Reduce Not and Map Reduce Usages
     • Nail Down HDFS and GRID Architecture
  4. Big Data Context
  5. Systemic Changes
     • Boundarylessness
     • Connected Globe
     • Best Sourcing
     • Interlinked Culture
     • Demand Side Focus
     • Customer Centric
     • Bottom Up Innovation
     • Empowered employees
     • Leading Trends
     • Agility and Responsiveness
     • Response Time
     • Speed, Agility, Flexibility
  6. Landscape To Address
     • Data Explosion: Manageability, Scalability, Performance, Agility
     • Information Overload: Decision Making, Time to Action
     • Interlinked Processes & Systems: Boundaryless, Systemic Understanding, Collaborate and Synergize, Simplify and Scale
  7. Information Overload
     "A wealth of information creates a poverty of attention." Herbert Simon, Nobel Laureate Economist
  8. More Touch points, More Channels (BACKUP)
     Source: JupiterResearch (7/08) © 2008 JupiterResearch, LLC
  9. Scale – What is it?
  10. How do we scale
      Traditional Systems - How they achieve Scalability:
      • Multi Threading
      • Multiple CPU – Parallel Processing
      • Distributed Programming – SMP & MPP
      • ETL Load Distribution – Assigning jobs to different nodes
      • Improved Throughput
  11. Scale – What is it about?
      Internet:
      • 1.73 Billion Internet Users
      • 247 Billion emails per day
      • 126 Million Blogs
      • 5 Billion Facebook Content per week
      • 50 Million Tweets per day
      • 80% of this data is unstructured
      • Estimated 800 GB of data per user (million Petabyte!)
      Facebook:
      • 500 Million Active Users per Month
      • 500 Billion+ Page Views per month
      • 25 Billion+ Content per month
      • 15 TB New Data / day
      • 1200 m/cs, 21 PB Cluster
      eBay:
      • 90 Million Active Users
      • 10 Billion Requests per day
      • 220 million+ items on sale
      • 40 TB + / day
      • 40 PB of Data
      Yahoo:
      • 82 PB of Data
      • 25000+ nodes
      Twitter:
      • 1 TB plus / day
      • 80+ nodes
  12. How do we scale – Think Numbers
      Thinking of Scale - Need for Grid
      [Diagram labels: Web server, Log, Data Highway, Datamart, Log storage & processing, lb, n1 - nn, bp, dc 1, dc 2]
      Think Numbers:
      • 1000 Nodes / DC
      • 10 DC
      • 100 mps/pipe
      • 1K byte webserver log record
      • 1 second / row
      In one day: 1000 * 10 * 1K * 60 * 60 * 24 = 864 GB
      Storage for a year: 864 GB * 365 = 315 TB
      To store 1 PB: 40K * 1000 = Millions $
      To process 1 TB: 1000 minutes ~ 17 hrs
      Think Agility and Flexibility
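The slide's storage arithmetic can be reproduced in a few lines of Python. This is a minimal sketch that assumes decimal units (1 KB = 1,000 bytes) and one 1 KB log record per node per second, which is how the slide's formula reads; the variable names are illustrative and not from the deck.

```python
# Back-of-envelope log volume, following the slide's formula.
# Assumptions: decimal units, one 1 KB record per node per second.
NODES_PER_DC = 1000
DATA_CENTERS = 10
RECORD_BYTES = 1_000            # "1K byte" web server log record
SECONDS_PER_DAY = 60 * 60 * 24

bytes_per_day = NODES_PER_DC * DATA_CENTERS * RECORD_BYTES * SECONDS_PER_DAY
gb_per_day = bytes_per_day / 1e9        # gigabytes written per day
tb_per_year = gb_per_day * 365 / 1e3    # terabytes accumulated per year

print(f"{gb_per_day:.0f} GB per day")    # 864 GB per day
print(f"{tb_per_year:.0f} TB per year")  # 315 TB per year
```

Changing any one constant (for example the record size) shows how quickly the yearly total moves, which is the point of "Think Numbers".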
  13. Scale – What is it about?
      • Volume
      • Speed
      • Integration level
      • more…
      Does it scale linearly with data size and analysis complexity?
  14. We would have no issues… if the following assumptions hold good:
      • The network is reliable.
      • Latency is zero.
      • Bandwidth is infinite.
      • The network is secure.
      • Topology doesn't change.
      • There is one administrator.
      • Transport cost is zero.
      • The network is homogeneous.
  15. Think Hadoop
  16. New Paradigm: Go Back to Basics
      • Divide and Conquer (Divide and Delegate and Get Done)
      • Move Work or Workers?
      • Relax Constraints (Pre defined data models)
      • Expect and Plan for Failures (avoid and address failures)
      • Community backup
      • Assembly Line Processing (Scale, Speed, Efficiency, Commodity Worker)
      • The "For loop"
      • Parallelization (trivially parallelizable)
      • Infrastructure and Supervision (Grid Architecture)
      • Manage Dependencies
      • Ignore the Trivia (Trivia is relative!)
      Joel Spolsky; Charlie Munger’s Mental Models
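The "for loop" item is easiest to see with a loop whose iterations do not depend on each other. The sketch below is an illustration, not code from the deck: `line_length` and the sample lines are hypothetical, and a local thread pool stands in for a grid of commodity workers.

```python
from concurrent.futures import ThreadPoolExecutor


def line_length(line: str) -> int:
    """Independent unit of work: the result depends only on its own input."""
    return len(line)


lines = ["divide and conquer", "move work or workers", "plan for failures"]

# The sequential "for loop".
sequential = [line_length(line) for line in lines]

# The same loop delegated to a pool of workers; map() preserves input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = list(pool.map(line_length, lines))

assert parallel == sequential
print(parallel)
```

Because no iteration reads another iteration's result, the work can be split across any number of workers without coordination, which is what "trivially parallelizable" means here.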
  17. New Paradigm: Go Back to Basics
      Map Reduce Paradigm:
      • Divide and Conquer
      • The "for loop"
      • Sort and Shuffle
      • Parallelization (trivially parallelizable)
      • Relax Data Constraints
      • Assembly Line Processing (Scale, Speed, Efficiency, Commodity Worker)
      Grid Architecture:
      • Split and Delegate
      • Move Work or Workers
      • Expect and Plan for Failures
      • Assembly Line Processing (Scale, Speed, Efficiency, Commodity Worker)
      • Manage Dependencies and Failures
      • Ignore the Trivia (Trivia is relative!)
      Map Reduce History: Lisp, Unix, Google FS
      Grid services: Replication, Redundancy, Heart Beat Check, Cluster rebalancing, Fault Tolerance, Task Restart, Chaining of jobs (Dependencies), Graceful Restart, Look Ahead or Speculative execution
  18. No SQL Options
      • HBase/Cassandra for huge data volumes (PBs).
        HBase fits in well where Hadoop is already being used.
        Cassandra is less cumbersome to install and manage.
      • MongoDB/CouchDB: document oriented databases for easy use and GB-TB volumes. Might be problematic at PB scales.
      • Neo4j-like graph databases for managing relationship oriented applications (nodes and edges).
      • Riak, Redis, Membase-like simple key-value databases for huge distributed in-memory hash maps.
  19. Let us Think Hadoop
  20. RDBMS and Hadoop
                    RDBMS                       MapReduce
      Data size     Gigabytes                   Petabytes
      Access        Interactive and batch       Batch
      Structure     Fixed schema                Unstructured schema
      Language      SQL                         Procedural (Java, C++, Ruby, etc)
      Integrity     High                        Low
      Scaling       Nonlinear                   Linear
      Updates       Read and write many times   Write once, read many times
      Latency       Low                         High
  21. Apache Hadoop Ecosystem
      • Hadoop Common: The common utilities that support the other Hadoop subprojects.
      • HDFS: A distributed file system that provides high throughput access to application data.
      • MapReduce: A software framework for distributed processing of large data sets on compute clusters.
      • Pig: A high-level data-flow language and execution framework for parallel computation.
      • HBase: A scalable, distributed database that supports structured data storage for large tables.
      • Flume / Scribe: Distributed log collection and message queue processing.
      • Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
      • ZooKeeper: A high-performance coordination service for distributed applications.
      • Mahout: Scalable Machine Learning algorithms using Hadoop.
      • Chukwa: A data collection system for managing large distributed systems.
  22. Apache Hadoop Ecosystem
      [Stack diagram, top to bottom]
      • ETL Tools, BI Reporting, RDBMS
      • Pig (Data Flow), Hive (SQL), Sqoop
      • MapReduce (Job Scheduling/Execution System)
      • HBase (Key-Value store)
      • HDFS (Hadoop Distributed File System)
      Alongside the stack: ZooKeeper (Coordination), Avro (Serialization)
  23. HDFS – The BackBone
      Hadoop Distributed File System
  24. Map Reduce – The New Paradigm
      Transforming Large Data
      MapReduce Basics:
      • Functional Programming
      • List Processing
      • Mapping Lists
      Mappers, Reducers
  25. PIG – Help the Business User Query
      Pig: Data-aggregation functions over semi-structured data (log files).
      Pig Latin Programs -> Query Parser -> Logical Plan -> Semantic Checking -> Logical Plan -> Logical Optimizer -> Optimized Logical Plan -> Logical to Physical Translator -> Physical Plan -> Physical To M/R Translator -> MapReduce Plan -> Map Reduce Launcher -> Create a job jar to be submitted to Hadoop cluster
  26. PIG Latin Example
  27. HBASE – Scalable Columnar
      • Scalable, Reliable, Distributed DB
      • Columnar Structure
      • Built on top of HDFS
      • Map Reduceable
      • Not a SQL Database!
        No joins
        No sophisticated query engine
        No transactions
        No column typing
        No SQL, no ODBC/JDBC, etc.
      • Not a replacement for your RDBMS...
  28. HIVE – SQL Like
      • A high level interface on Hadoop for managing and querying structured data
      • Interpreted as Map-Reduce jobs for execution
      • Uses HDFS for storage
      • Uses Metadata representation over HDFS files
      • Key Building Principles:
        Familiarity with SQL
        Performance with help of built-in optimizers
        Enable Extensibility – Types, Functions, Formats, Scripts
  29. FLUME – Distributed Data Collection
      • Distributed Data / Log Collection Service
      • Scalable, Configurable, Extensible
      • Centrally Manageable
      • Agents fetch data from apps, Collectors save it
      • Abstractions: Source -> Decorator(s) -> Sink
  30. Oozie – Workflow Management
      An Oozie Workflow
      [Workflow diagram. Control nodes: start, fork, join, decision, kill, end. Action nodes: M/R streaming job, SSH, HOD Alloc, Pig job, M/R job, Java Main, FS job. Transitions are labelled OK or ERROR; the decision node branches on MORE or ENOUGH.]
  31. Think Map n Reduce
  32. Understanding Map Reduce Paradigm
      Logical Architecture
  33. Understanding Map Reduce Paradigm
  34. Map Reduce Paradigm
      • Job: Configure the Hadoop Job to run.
      • Mapper: map(LongWritable key, Text value, Context context)
      • Reducer: reduce(Text key, Iterable<IntWritable> values, Context context)
  35. Programming model
      Map-Reduce Definition
      MapReduce is a functional programming model and an associated implementation model for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model.
      CONCEPTS
  36. Programming model
      Input & Output: each a set of key/value pairs
      Programmer specifies two functions:
      map (in_key, in_value) -> list(out_key, intermediate_value)
      • Processes input key/value pair
      • Produces set of intermediate pairs
      reduce (out_key, list(intermediate_value)) -> list(out_value)
      • Combines all intermediate values for a particular key
      • Produces a set of merged output values (usually just one)
      Inspired by similar primitives in LISP and other languages
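The two signatures above can be exercised with a small single-process driver. This is a sketch under stated assumptions, not Hadoop's API: `run_map_reduce`, `map_reading`, `reduce_max` and the sensor records are hypothetical names chosen for illustration, and the grouping dictionary stands in for the distributed shuffle.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the map -> group by key -> reduce flow."""
    groups = defaultdict(list)
    # Map phase: each input pair yields zero or more intermediate pairs.
    for in_key, in_value in records:
        for out_key, intermediate_value in map_fn(in_key, in_value):
            groups[out_key].append(intermediate_value)  # group values by key
    # Reduce phase: one call per distinct key, with all of its values.
    return {key: list(reduce_fn(key, values))
            for key, values in sorted(groups.items())}


def map_reading(line_no, line):
    """map(in_key, in_value) -> list(out_key, intermediate_value)"""
    sensor, reading = line.split(",")
    yield sensor, int(reading)


def reduce_max(sensor, readings):
    """reduce(out_key, list(intermediate_value)) -> list(out_value)"""
    yield max(readings)


records = [(0, "a,3"), (1, "b,7"), (2, "a,9")]
result = run_map_reduce(records, map_reading, reduce_max)
print(result)
```

Swapping in a different pair of functions changes the job without touching the driver, which is the separation the programming model is built on.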
  37. Map Reduce Paradigm
      Word Count Example
      A simple MapReduce program can be written to determine how many times different words appear in a set of files. What do the Mapper and Reducer do?
      Pseudo Code:
      mapper (filename, file-contents):
          for each word in file-contents:
              emit (word, 1)
      reducer (word, values):
          sum = 0
          for each value in values:
              sum = sum + value
          emit (word, sum)
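The word count pseudocode translates almost line for line into runnable Python. The sketch below runs in one process and uses a sort followed by `itertools.groupby` to imitate the sort-and-shuffle step between map and reduce; the file names and contents are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def mapper(filename, file_contents):
    """Emit (word, 1) for every word in the file."""
    for word in file_contents.split():
        yield word, 1


def reducer(word, values):
    """Sum the counts emitted for one word."""
    total = 0
    for value in values:
        total += value
    yield word, total


# Hypothetical input files.
files = {"a.txt": "the quick fox", "b.txt": "the lazy dog the end"}

# Map phase: run the mapper over every file.
pairs = [pair for name, text in files.items() for pair in mapper(name, text)]

# Sort and shuffle: bring equal keys next to each other.
pairs.sort(key=itemgetter(0))

# Reduce phase: one reducer call per distinct word.
counts = {}
for word, group in groupby(pairs, key=itemgetter(0)):
    for key, total in reducer(word, (value for _, value in group)):
        counts[key] = total

print(counts)
```

On a real cluster the map calls run on many nodes and the sort and grouping happen across the network, but the mapper and reducer bodies keep this shape.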
  38. Programming model
      Example: Count word occurrences
      map(String input_key, String input_value):
          // input_key: document name
          // input_value: document contents
          for each word w in input_value:
              EmitIntermediate(w, "1");
      reduce(String output_key, Iterator intermediate_values):
          // output_key: a word
          // intermediate_values: a list of counts
          int result = 0;
          for each v in intermediate_values:
              result += ParseInt(v);
          Emit(AsString(result));
      Pseudocode: See appendix in paper for real code
  39. Understanding Map Reduce Paradigm
      Map-Reduce Execution Recap
      • Master-Slave architecture
      • Master: JobTracker
        Accepts MR jobs submitted by users
        Assigns Map and Reduce tasks to TaskTrackers (slaves)
        Monitors task and TaskTracker status, re-executes tasks upon failure
      • Worker: TaskTrackers
        Run Map and Reduce tasks upon instruction from the JobTracker
        Manage storage and transmission of intermediate output
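The "re-executes tasks upon failure" behaviour can be illustrated with a toy master loop. This is only a sketch of the idea and not how the JobTracker is implemented: `run_with_retries`, `flaky_worker`, the task names and the attempt limit are all hypothetical, and everything runs in one process.

```python
def run_with_retries(tasks, worker, max_attempts=3):
    """Toy master: hand each task to a worker and re-execute it on failure."""
    results, attempts = {}, {}
    for task_id, payload in tasks.items():
        for attempt in range(1, max_attempts + 1):
            attempts[task_id] = attempt
            try:
                results[task_id] = worker(task_id, payload, attempt)
                break  # task succeeded, move on to the next one
            except RuntimeError:
                continue  # task failed, so the master schedules it again
        else:
            raise RuntimeError(f"task {task_id} failed {max_attempts} times")
    return results, attempts


def flaky_worker(task_id, payload, attempt):
    """Hypothetical worker: task "map-1" fails on its first attempt only."""
    if task_id == "map-1" and attempt == 1:
        raise RuntimeError("simulated worker failure")
    return payload.upper()


tasks = {"map-0": "alpha", "map-1": "beta"}
results, attempts = run_with_retries(tasks, flaky_worker)
print(results, attempts)
```

Re-execution is safe here because each task is a pure function of its input, which is the same property that lets a map or reduce task be restarted on another node.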
  40. Understanding Map Reduce Paradigm
      Map-Reduce Paradigm Recap
      • Examples of map functions: Individual Count, Filter, Transformation, Sort, Pig load
      • Examples of reduce functions: Group Count, Sum, Aggregator
      A job can have many map and reduce functions.
  41. How are we doing on the Objective
  42. Process, Data and Domain driven Information Excellence
      ABOUT COMPEGENCE
  43. Process, Data and Domain Integrated Approach
      [Diagram labels: Market Actions; Systemic Changes; Decision Excellence; Business Process Landscape; Business Intent; Business Usage; Systems; Domain; Data; Information that is Actionable, Usable, Timely, Flexible, Scalable and Sustainable; Cost; Effort; Skills & Competency]
      Competitive Advantage lies in the exploitation of:
      • More detailed and specific information
      • More comprehensive external data & dependencies
      • Fuller integration
      • More in depth analysis
      • More insightful plans and strategies
      • More rapid response to business events
      • More precise and apt response to customer events
      We complement your "COMPETING WITH ANALYTICS JOURNEY"
  44. Value Proposition
      [Diagram labels: Constraints; Alternatives; Assumptions; Dependencies; Concerns / Risks; Cost of Ownership; Technology Evolution; Decisions?; Actions?; Results?; Returns?; Tools; Technologies; Trends; Platforms; Processes; People; Partners; Data; TeraBytes; Trade Offs; Cost; Time; Ease of Use; Current State; Reports; Dashboards; Drill Down, Up, Across; Trusted; Repeatable; Reusable; Leverage. Inset architecture figure: Metadata Layer for Consistent Business Understanding; Source Data; Extract; Staging; Transform; Load; Business Rules; Reference Data (Branch, Product); Data Quality and Process Audit; Foundation with DW Interface; Reports; Dashboards; Excel]
      • Jump Start the "Process and Information Excellence" journey
      • Focus on your business goals and "Competing with Analytics Journey"
      • Overcome multiple and diverse expertise / skill-set paucity
      • Preserve current investments in people and technology
      • Manage Data complexities and the resultant challenges
      • Manage Scalability to address data explosion with Terabytes of Data
      • Helps you focus on the business and business processes
      • Helps you harvest the benefits of your data investments faster
      • Consultative Work-thru Workshops that help and mature your team
  45. Our Expertise and Focus Areas
      Process + Data + Domain => Decision
      • Analytics; Data Mining; Big Data; DWH & BI
      • Architecture and Methodology
      • Partnered Product Development
      • Consulting, Competency Building, Advisory, Mentoring
      • Executive Briefing Sessions and Deep Dive Workshops
  46. Partners in Co-Creating Success
      Process, Data and Domain driven Information Excellence
      Process, Data and Domain driven Business Decision Life Cycle
      info@compegence.com