Introduction to Big Data


Published in: Technology, Business


  1. Introduction to Big Data
  2. What is Big Data?
     "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." – Gartner, 2012
     Impetus Proprietary
  3. Evolution of Big Data
     - Data explosion!
       - 48 hours of equities market data ~ 5 TB
       - 3.3 months of OPRA feeds ~ 5 PB
     - Semi-structured/unstructured and real-time data
       - Google processes PB/hour
       - Bioinformatics – large datasets of genetics and drug formulations
       - Money laundering / terror funding, spatial data
     - "By 2015, more than 85 percent of Fortune 500 organizations will fail to effectively exploit big data for competitive advantage." – Gartner
     - Already producing more than 1.9 zettabytes of data – analyst firm IDC Australia
  4. 4 Vs of Big Data
     - Volume: Big Data comes in one size – large; terabytes/petabytes/exabytes
     - Velocity: streaming data, time-sensitive data; batch, near-real-time, and real-time streams
     - Variety: structured, semi-structured, and unstructured data; text, audio/video, click streams, log files, etc.
     - Veracity: data in doubt – the truthfulness, authenticity, or correctness of the data
     Each dimension carries value to the business.
  5. Value for Business
     - Finance: deeper analysis to avoid credit risk; predict the future and make lower-risk investments
     - Healthcare: targeted medicines with fewer complications and side effects
     - Telecommunication: predict failures and build reliable networks
     - Retail: focus on what the customer wants
     - Media: targeted and focused content
     - Government: services based on facts, not on fiction
  6. Big Data – Use Cases
  7. Potential Use Cases for Big Data
  8. Big Data – Use Cases: Telecommunication
     Telecom vendors ask:
     - Why am I losing customers?
     - Can I predict failures and offer 24x7 service? Is the network reliable?
     - What plans should I offer to my customers?
     Subscribers ask:
     - Can I get something better – better/best deals?
     - Which plans are good for me?
  9. Big Data – Use Cases: Financial Services
     Customers wanting to buy products ask:
     - Are there any offers? Are these offers good for me?
     - Which stores are providing these offers?
     - How do I make the best use of them?
     - Ah! There are so many offers… hard to find the relevant ones.
     Merchants launching new offers ask:
     - How do I attract the relevant/interested customers?
  10. Big Data – Use Cases: Web & Digital Media
      A merchant portal paying $$$ for ad promotions asks:
      - Am I getting any value?
      - Are customers really coming to my site?
      - How many visitors are turned into buyers?
      - Are there any returning buyers?
      - What are my market and customer segments?
      Input: click-stream data. Output: analytical reports / web analytics.
  11. Big Data Challenges
      Data processing:
      - Processing and analyzing large data – terabytes and beyond
      - Must be massively scalable and parallel
      - Moving computation is easier than moving data
      - Must support partial failure
      Data storage:
      - Data doesn't fit on one node; it requires a cluster
      - Flexible, schema-less structure
      - Data replication, partitioning, and sharding
  12. What do we need? Solution: Data Processing
      - Distributed/grid/parallel computing
      - Distributed data processing that scales linearly and is fault tolerant
      - Leverage NoSQL storage systems as well
      - Take computation near the data to reduce I/O
      - Merge processed results and serve them
      - Store results for further analytics
  13. What do we need? Solution: Data Storage
      NoSQL:
      - NoSQL starts where an RDBMS becomes dysfunctional
      - Flexible data structure
      - Linear scaling on commodity boxes – no single point of failure (SPOF)
      - Real-time and bi-directional replication
      Sharding and partitioning of data:
      - Partition the data into smaller chunks
      - Store the chunks on a distributed file system
      - Replicate them to enable recovery from a node failure
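The partition-and-replicate idea above can be sketched in a few lines. This is a toy illustration with assumed names (NODES, NUM_SHARDS, REPLICATION_FACTOR are invented here, not from the slides): keys are hashed into shards, and each shard is placed on several nodes so a single node failure loses nothing.

```python
import hashlib

# Toy cluster layout – all names here are illustrative assumptions.
NODES = ["node-a", "node-b", "node-c", "node-d"]
NUM_SHARDS = 8
REPLICATION_FACTOR = 3

def shard_for(key: str) -> int:
    """Deterministically map a key to a shard via a stable hash."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def replicas_for(shard: int) -> list:
    """Place a shard on REPLICATION_FACTOR consecutive nodes, ring-style."""
    return [NODES[(shard + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

shard = shard_for("user:42")
print(shard, replicas_for(shard))
```

Real stores use more robust placement (consistent hashing, rack awareness), but the shape – hash to a partition, then fan out to replicas – is the same.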
  14. Technology Landscape
  15. Typical Hadoop-Based Solution
      Big-data sources – call data records, web clickstreams, network logs, satellite feeds, GPS data, sensor readings, sales data, emails – arrive in many formats (XML, CSV, JSON, binary, log files).
      They flow into HDFS on commodity servers, where MapReduce jobs turn big data into information.
  16. Deep Dive into Big Data Challenges
  17. Data Storage: NoSQL
  18. Background
      RDBMS:
      - Ruled the world for the last three decades
      - The Internet changed the world and the technology around us
      - Scaling up does not work beyond a certain limit
      - Scaling out is not much more charming either
      - Sharding scales, but you lose all the useful features of an RDBMS
      - Sharding is operationally difficult
      - Web 2.0 apps have different requirements than enterprise apps
  19. Today's Requirements – Data
      - Data does not fit on one node
      - Data may not fit in one rack
      - SANs are too expensive
      - Data partitioning across multiple nodes / racks / datacenters
      - Evolving schema
  20. Today's Requirements – Reliability
      - Must be highly available
      - Commodity nodes may crash
      - Data must survive disk/node failure
      - Data must survive datacenter failure
  21. Introduction to NoSQL
      A different thought process – RDBMS vs. NoSQL:
      - How do we store vs. how do we use
      - Referencing vs. embedding
      - Fixed schema vs. evolving schema
      - Depth of functionality vs. scalability + performance
      - Compute on read vs. compute on write
  22. Introduction to NoSQL – Data Models
      Document:
      {
        "_id": "some unique string that is assigned to the contact",
        "type": "contact",
        "name": "contact's name",
        "birth_day": "a date in string form",
        "address": "the address in string form",
        "phone_number": "phone number in string form"
      }
      Graph: entities as nodes connected by edges (node diagram in the original slide).
      Column-based:
      [ key = "name"; value = "vinod"; timestamp = Friday July 22, 2011 ]
  23. Data Model (Column-Based): Column (Cell)
      A NoSQL column corresponds loosely to an RDBMS cell, but carries extra attributes:
      - name: emailAddress
      - value: vinod@vinodsingh.com
      - timestamp: 1311150988226
      - time to live: 3600
  24. Data Model (Column-Based): Table / Column Family
      A NoSQL column family corresponds to an RDBMS table, but each row may hold a different set of columns:
      - rowKey 1: Column 1, Column 2, …, Column n
      - rowKey 2: Column 1, …, Column m
      - rowKey n: Column 1, Column 2, Column 3, …, Column z
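The column and column-family model above can be mimicked with nested dictionaries. This is an illustrative sketch, not any real client's API: each column carries its own name, value, timestamp, and optional TTL, and rows in the same family need not share a schema.

```python
import time

def make_column(name, value, ttl=None):
    """A column/cell: name + value + its own timestamp and optional TTL."""
    return {"name": name, "value": value,
            "timestamp": int(time.time() * 1000), "ttl": ttl}

# A column family: row keys map to per-row sets of columns.
column_family = {
    "user:1": {
        "emailAddress": make_column("emailAddress", "vinod@vinodsingh.com", ttl=3600),
        "name": make_column("name", "vinod"),
    },
    # A different row may legitimately hold fewer (or other) columns.
    "user:2": {"name": make_column("name", "alice")},
}

print(column_family["user:1"]["emailAddress"]["value"])
```

The per-column timestamp is what lets stores like Cassandra resolve concurrent writes (last write wins) and expire data via TTL.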
  25. Data Model (Column-Based): Super Column
      A super column is a named grouping of columns (e.g. name = "address", value = Column 1 … Column n).
      There is no matching concept in an RDBMS.
  26. Data Model (Column-Based): Super Column Family
      Rows map row keys to super columns rather than plain columns:
      - rowKey 1: Super Column 1, Super Column 2, …, Super Column n
      - rowKey 2: Super Column 1, …, Super Column m
      - rowKey n: Super Column 1, Super Column 2, Super Column 3, …, Super Column z
      There is no matching concept in an RDBMS.
  27. Use Cases – When to Use NoSQL?
      - Huge amount of data – distributed across the network
      - High query load – results must come back quickly
      - Evolving schema
        - Changes should happen without a restart
        - Migration is not an option with a large amount of data
  28. Use Cases – When to Avoid NoSQL?
      - Complex transactions, such as in financial and accounting systems
      - ACID transactions are a must
      - Small data size
  29. NoSQL Pros
      - Massive scalability
      - High availability
      - Lower cost with predictable elasticity
      - Flexible data structure
  30. NoSQL Cons
      - Limited data query possibilities
      - Lower level of consistency, a.k.a. eventual consistency
      - No support for multi-object transactions
      - No standardization
      - Ad hoc data fixing and reporting – no common query language available
  31. Curious Case of NoSQL: How To…?
      - Scale when data growth is ~50%?
      - Migrate massive relational schema data to NoSQL?
      - Integrate existing application(s) with NoSQL?
      - Reduce the effort of learning the NoSQL arena?
  32. Kundera
      - An open-source project, available at https://github.com/impetus-opensource/Kundera
      - An OGM (Object – Grid / NoSQL Datastore) mapping tool
      - JPA 2.0 compliant (zero diversion) – developers don't need to unlearn (and relearn)
      - Easy to use, less boilerplate code; drop-dead simple and fun
      - Relieves developers from the diversity and complexity that come with NoSQL datastores
      - Up and running in 5 minutes – for Cassandra, MongoDB, HBase… and for any RDBMS
  33. Kundera Architecture
  34. Setting up Kundera
      Download the jar – the latest executable Kundera jar is available at:
      https://github.com/downloads/impetus-opensource/Kundera/kundera-cassandra-2.0.7-jar-with-dependencies.jar
      Using Kundera with any Maven project:
      <repository>
        <id>sonatype-nexus</id>
        <name>Kundera Public Repository</name>
        <url>https://oss.sonatype.org/content/repositories/releases</url>
        <releases>
          <enabled>true</enabled>
        </releases>
        <snapshots>
          <enabled>false</enabled>
        </snapshots>
      </repository>
      <dependency>
        <groupId>com.impetus</groupId>
        <artifactId>kundera</artifactId>
        <version>2.0.7</version>
      </dependency>
      Building Kundera from source:
      git clone git@github.com:impetus-opensource/Kundera.git
      mvn clean install
  35. Tweet-Store App
      Problem statement: migrate tweets to NoSQL.
      User information (master data) and the corresponding tweets are stored in an RDBMS (Oracle/MySQL). Tweet data growth (in the hundreds of TBs) does not scale with an RDBMS. How do we scale and perform with this big-data problem?
      Solution: migrate the tweets to NoSQL while keeping the user master data in the RDBMS – polyglot persistence.
  36. Tweet-Store App: Entity Definition and Configuration (code shown on the original slide)
  37. Tweet-Store App (contd.)
      - Initialize the entity manager factory
      - Create a User object
      - Find by key
      - Find by query
  38. Kundera Client Extension Framework
      Each NoSQL store plugs into the Kundera engine and core through four extension points: a ClientFactory, a Client, an EntityReader, and a QueryImplementor.
  39. Features
      - Supports Cassandra, HBase, MongoDB, and any RDBMS
      - Stronger query support (e.g. super-column-based search)
      - CRUD / query support across datastores
      - Object relationship handling across datastores
      - Caching and connection pooling
      - Datastore-optimized persistence and query approach
      - Pluggable architecture – developers can create a library specific to a particular datastore, plug it into Kundera, and persist data into that datastore
      - Flexibility to choose Lucene-based or datastore-provided secondary indexing
      - Auto schema generation for Cassandra, MongoDB, HBase, and RDBMS
  40. Data Processing: Hadoop
  41. Beyond Multithreading
      - Ever-increasing computing requirements
      - Scaling – horizontal vs. vertical
      - Parallel vs. distributed
      - Fault tolerance
      - Grids / loosely coupled systems
        - Built using commodity systems
        - Aggregation of distributed systems
        - Centralized or decentralized management
  42. Challenges of Distributed Processing
      - Production deployments need to be carefully planned
      - Unavailability of one node should not impact the system
      - High-speed networks are needed
      - Data replication involves data conflicts
      - Troubleshooting and diagnosing are hard
      - Deployments may be geographically distributed
      - Consistency and reliability
  43. What is Hadoop?
      - A batch-processing framework for distributed processing of large data sets on a network of commodity hardware
      - Designed to scale out
      - Fault tolerant – at the application level
      - Open source + commodity hardware = reduction in cost
  44. Components of Hadoop
      - NameNode
      - SNN – Secondary NameNode
      - JobTracker
      - TaskTracker
      - DataNode
      - HDFS
  45. Architecture of Hadoop
      A typical Hadoop cluster uses a master/slave architecture: the NameNode, Secondary NameNode, and JobTracker run on the master, while each slave machine runs a TaskTracker and a DataNode on top of HDFS.
  46. Hadoop Distributed File System
      - A large distributed file system: 10K nodes, 100 million files, 10 PB
      - Assumes commodity hardware – failure is expected rather than exceptional
      - Streaming data access: a write-once, read-many pattern suited to batch processing
      - Node failure is handled through replication
  47. HDFS: NameNode Metadata
      - Metadata is kept in memory
        - The entire metadata lives in main memory; there is no demand paging of metadata
      - Types of metadata
        - List of files
        - List of blocks for each file
        - List of DataNodes for each block
        - File attributes, e.g. creation time, replication factor
      - A transaction log records file creations, file deletions, etc.
  48. HDFS: DataNode
      - A block server
        - Stores data in the local file system
        - Stores metadata of a block
        - Serves data and metadata to clients
      - Block report: periodically sends a report of all existing blocks to the NameNode
      - Facilitates pipelining of data by forwarding data to other specified DataNodes
  49. HDFS Architecture
      The NameNode (the master) holds the metadata, e.g.:
      - name: /users/joeYahoo/myFile – copies: 2, blocks: {1, 3}
      - name: /users/bobYahoo/someData.gzip – copies: 3, blocks: {2, 4, 5}
      Clients perform block I/O directly against the DataNodes (the slaves), each of which stores a subset of the numbered blocks.
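The split of responsibilities above can be sketched as a toy in-memory model (the paths come from the slide; the DataNode names and block contents are invented for illustration): the "NameNode" maps files to block IDs and block IDs to replica locations, and a read fetches each block from one of its replicas.

```python
# "NameNode" metadata: file -> replication factor and ordered block IDs.
namenode_meta = {
    "/users/joeYahoo/myFile": {"copies": 2, "blocks": [1, 3]},
    "/users/bobYahoo/someData.gzip": {"copies": 3, "blocks": [2, 4, 5]},
}
# Which "DataNodes" hold a replica of each block.
block_locations = {1: ["dn1", "dn2"], 3: ["dn2", "dn3"],
                   2: ["dn1", "dn3", "dn4"], 4: ["dn2", "dn3", "dn4"],
                   5: ["dn1", "dn2", "dn4"]}
# Block contents as stored on each DataNode, keyed by (node, block).
datanode_storage = {("dn1", 1): b"he", ("dn2", 1): b"he",
                    ("dn2", 3): b"llo", ("dn3", 3): b"llo"}

def read_file(path: str) -> bytes:
    """Ask the 'NameNode' for block IDs, then pull each block from a replica."""
    data = b""
    for block in namenode_meta[path]["blocks"]:
        dn = block_locations[block][0]          # pick the first live replica
        data += datanode_storage[(dn, block)]   # client reads from the DataNode
    return data

print(read_file("/users/joeYahoo/myFile"))  # prints b'hello'
```

The key point the sketch captures: file data never flows through the NameNode; it only serves metadata, and clients stream blocks from DataNodes directly.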
  50. Data Integrity
      - Checksums (CRC32) are used to validate data
      - File creation
        - The client computes a checksum per 512 bytes
        - The DataNode stores the checksums
      - File access
        - The client retrieves the data and checksums from the DataNode
        - If validation fails, the client tries other replicas
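The per-512-byte CRC32 scheme can be demonstrated with Python's standard library (this illustrates the idea only; it is not the HDFS on-disk or wire format):

```python
import zlib

CHUNK = 512  # HDFS-style checksum granularity: one CRC per 512-byte chunk

def checksums(data: bytes) -> list:
    """One CRC32 per 512-byte chunk, as the writing client would compute."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, stored: list) -> bool:
    """Re-compute the chunk checksums on read and compare to the stored ones."""
    return checksums(data) == stored

payload = b"x" * 1300                  # spans 3 chunks: 512 + 512 + 276 bytes
stored = checksums(payload)
print(verify(payload, stored))         # True – the data is intact
corrupted = payload[:100] + b"?" + payload[101:]
print(verify(corrupted, stored))       # False – a client would try another replica
```

Checksumming per chunk (rather than per file) means a single flipped bit is localized to one 512-byte chunk, so only that chunk needs re-reading from another replica.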
  51. Data Compression
      - Reduces the number of bytes written to / read from HDFS
      - Improves efficiency of network bandwidth and disk space
      - Reduces the size of the data that needs to be read
  52. Data Compression Codecs
      - LZO compression – https://github.com/toddlipcon/hadoop-lzo
      - Hadoop Snappy – http://code.google.com/p/snappy/
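The payoff is easy to see with any codec; here gzip from the Python standard library stands in for LZO/Snappy (which trade some compression ratio for much faster speed), on the kind of repetitive log data Hadoop jobs often read:

```python
import gzip

# Repetitive machine-generated data compresses extremely well.
log_lines = b"2012-01-01 INFO request served in 12ms\n" * 1000
compressed = gzip.compress(log_lines)
print(len(log_lines), len(compressed))  # the compressed form is far smaller
```

Fewer bytes on disk means fewer bytes read per map task, which is usually a bigger win than the CPU cost of decompressing – hence the slide's emphasis on splittable, fast codecs like LZO and Snappy.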
  53. MapReduce: Definition
      - Map
        - Takes the input, divides it into smaller sub-problems, and distributes them to worker nodes
        - A worker node may do this again in turn, leading to a multi-level tree structure
        - The worker node processes the smaller problem and passes the answer back to its master node
      - Reduce
        - The master node collects the answers to all the sub-problems
        - Combines them in some way to form the output
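The definition above can be made concrete with the classic word-count example, run here as a single-process sketch (no Hadoop involved) that keeps the map → shuffle/group-by-key → reduce shape of the model:

```python
from collections import defaultdict

def map_phase(line: str):
    """Map: break a record into (key, value) pairs – here, (word, 1)."""
    for word in line.split():
        yield word, 1

def reduce_phase(word: str, counts: list):
    """Reduce: combine all partial answers for one key into a final result."""
    return word, sum(counts)

lines = ["big data big wins", "data beats opinions"]

# Shuffle: group the mapped pairs by key, as the framework would.
groups = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):
        groups[word].append(one)

result = dict(reduce_phase(w, c) for w, c in groups.items())
print(result)  # {'big': 2, 'data': 2, 'wins': 1, 'beats': 1, 'opinions': 1}
```

In real Hadoop the same two functions run on many machines, with the framework handling the splitting, shuffling, and fault tolerance between the two phases.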
  54. MapReduce in Hadoop
      - Master/slave architecture
      - Master: JobTracker
        - Accepts MR jobs submitted by users
        - Assigns map and reduce tasks to TaskTrackers
        - Monitors task and TaskTracker status; re-executes tasks upon failure
      - Slaves: TaskTrackers
        - Run map and reduce tasks upon instruction from the JobTracker
        - Manage storage and transmission of intermediate output
  55. MapReduce: Job Submission
      1. The client copies the input files to the DFS
      2. The user submits the job to the client
      3. The client reads the input files
      4. The client creates/gets the input splits
      5. The client uploads the job information (job.xml, job.jar) to the DFS
      6. The client submits the job to the JobTracker
  56. MapReduce: Job Initialization
      7. The JobTracker initializes the job and places it in the job queue
      8. The JobTracker reads the job files (job.xml, job.jar) from the DFS
      9. It creates the map and reduce tasks – as many map tasks as there are input splits
  57. MapReduce: Job Scheduling
      10. TaskTrackers (H1–H4) send periodic heartbeats to the JobTracker
      11. The JobTracker picks a task from the job queue – a data-local one if possible
      12. The JobTracker assigns the task to the TaskTracker
      13. The TaskTracker launches the task
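Step 11's "data-local if possible" rule is the heart of the scheduler. A minimal sketch (the task names, host names, and data structures here are invented for illustration): on a heartbeat, prefer a pending task whose input split already lives on the heartbeating host, and fall back to any pending task otherwise.

```python
# Pending map tasks, each annotated with the hosts storing its input split.
pending_tasks = [
    {"task": "map-1", "split_hosts": ["h1", "h2"]},
    {"task": "map-2", "split_hosts": ["h3"]},
    {"task": "map-3", "split_hosts": ["h2", "h4"]},
]

def assign_task(heartbeat_host: str):
    """Return (and remove) a data-local task if one exists, else any task."""
    for i, t in enumerate(pending_tasks):
        if heartbeat_host in t["split_hosts"]:
            return pending_tasks.pop(i)       # data-local: no network read needed
    return pending_tasks.pop(0) if pending_tasks else None

print(assign_task("h3")["task"])  # prints map-2 (its split is stored on h3)
```

Moving the computation to the data this way is what makes the earlier claim – "moving computation is easier than moving data" – operational: the map task reads its split from the local disk instead of across the network.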
  58. MapReduce: Task Execution
      - The JobTracker assigns a task to the TaskTracker for execution
      - The TaskTracker reads job.xml and job.jar from the DFS into its local disk
      - It runs up to MAX_MAP_SLOTS map tasks and up to MAX_REDUCE_SLOTS reduce tasks concurrently
  59. MapReduce: Map Task
      - The JobTracker assigns the task; the TaskTracker launches it
      - The task executes the user code, which calls output.collect
      - Intermediate output is buffered and spilled to a file on the local disk
      - On completion, a map-completion event is reported back to the JobTracker
  60. MapReduce: Reduce Task
      - The JobTracker assigns the task; the TaskTracker launches it
      - The task fetches the map outputs (M1…M4), sorting and merging them as they arrive (based on some criteria)
      - Once all map outputs are available, it executes the user code
      - The output file is written to the DFS
  61. How Does it Scale?
      - Software – Apache Hadoop
        - Designed for scaling and failures
        - Scale out: add nodes at any time
      - Hardware – commodity boxes
        - DataNode/TaskTracker: dual-processor/dual-core, 4-8 GB RAM with ECC memory, 4 x 500 GB SATA drives
        - NameNode: do not compromise – a server-class box with 32-48 GB RAM and 4 x 1 TB SATA drives with RAID
  62. How Does it Scale?
      - Yahoo!: 100,000 CPUs in more than 40,000 computers running Hadoop; biggest cluster: 4,500 nodes (2 x 4-core CPUs, 4 x 1 TB disk, 16 GB RAM each)
      - eBay: a 532-node cluster (532 x 8 cores, 5.3 PB)
      - Facebook: an 1,100-machine cluster with 8,800 cores and about 12 PB of raw storage
      - LinkedIn: 1,200 nodes with 2 x 6 cores, 24 GB RAM, and 6 x 2 TB SATA drives each
  63. Strategies for Handling the SPOF
      - Run the primary NameNode and Secondary NameNode on different servers
      - The Secondary NameNode periodically creates a checkpoint:
        - Download the FSImage and EditLog from the NameNode and merge them
        - Upload the new image back to the NameNode
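The checkpoint step above – merging the FSImage with the EditLog – amounts to replaying logged operations on top of the last saved namespace. A toy in-memory sketch (the paths and log format are invented; real HDFS images and edit logs are binary files):

```python
# Last checkpointed namespace image: path -> attributes.
fs_image = {"/a.txt": {"replication": 3}}

# Edit log: operations applied to the namespace since that checkpoint.
edit_log = [("create", "/b.txt", {"replication": 2}),
            ("delete", "/a.txt", None),
            ("create", "/c.txt", {"replication": 3})]

def checkpoint(image: dict, log: list) -> dict:
    """Replay the edit log onto the image, as the Secondary NameNode would."""
    merged = dict(image)
    for op, path, attrs in log:
        if op == "create":
            merged[path] = attrs
        elif op == "delete":
            merged.pop(path, None)
    return merged

print(sorted(checkpoint(fs_image, edit_log)))  # prints ['/b.txt', '/c.txt']
```

Offloading this merge to the secondary node keeps the NameNode's edit log short (so restarts are fast) without pausing the primary – though note the Secondary NameNode is a checkpointer, not a hot standby.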
  64. Strategies for Handling the SPOF (contd.)
      - AvatarNode at Facebook
      - Commercial versions: MapR, Hortonworks
      - Some geek solutions: replace HDFS with a MySQL Cluster for the NameNode
  65. Hadoop Ecosystem
      Backup & recovery, deployment, security, management, and monitoring (ecosystem diagram in the original slide).
  66. Thank You – Q&A
