Big Data Architectures and Ecosystem + NoSQL


Big Data, Hadoop Ecosystem, NoSQL, Big Data Architectures

Published in: Technology


  1. Overview of Big Data Architecture, Hadoop Ecosystem & NoSQL Databases
     Khanderao Kand, CTO, GloMantra Inc. Entrepreneur and Technologist. Twitter: @khanderao
  2. Big Data Use Cases
     • Predictive analytics, recommendations, brand/product management
     • Social CRM: brand analytics, consumer sentiment analysis, competition analysis
     • Risk and fraud reduction: financial, intrusion, anti-money laundering
     • Text analytics: patent search
     • Network and log analysis: intrusion analysis
     • Health analysis: epidemics, communicable diseases
     • Intelligence analysis: CIA, Homeland Security
     • Societal: social movement analysis, political campaign analysis
  3. Big Data Characteristics: The 3 Vs (Volume, Velocity, Variety)
     • Variety: text, images, videos, social web, web logs, ERPs, CRM
     • Volume: petabytes; millions of people; billions/trillions of records
     • Velocity: speed of data coming in (likes, mobile, RFID, …)
     • Loosely structured and distributed data
     • Often involves time-stamped events
     • Incomplete / imperfect data
  4. Big Data Is Not Just Hadoop: Processing Algorithms
     • Log processing for fraud / intrusion / anomaly detection
     • Behavioral analysis of consumers, for ads / targeting
     • Pattern recognition, e.g. stock trades, weather
     • Machine learning / correlating events
     • Text processing / text mining / sentiment analysis
     • Search
     • Predictive analytics
  5. Typical Tools
     • Statistical processing (e.g. R)
     • Machine learning (Apache Mahout, UIMA)
     • Text processing (WEKA, Mallet)
     • Complex event processing (S4, Esper)
     • Data mining / warehousing (JDM)
  6. Big Data vs. Traditional Architecture
     Three-tier architecture:
     • 1. User requests a report
     • 2. App tier requests data from the data tier
     • 3. Data tier sends data to the app tier
     • 4. App tier sends the report
     Big Data architecture (combined application & data tier):
     • 1. User launches a batch job
     • 2. Master distributes the application
     • 3. Master launches the app on the nodes
     • 4. User downloads the results
  7. Examples
     • SELECT TOP 10 cor(SP500) FROM STOCKS FROM US
     • SELECT SUM(mentions) FROM twitter WHERE hashtag = 'coke' WHEN ad = 'coke' IN PERIOD 1 day OVER 5 years
     • If a buyer is age 45, male, past purchases: Nike, sports: 49ers, drinks: Budweiser, currently searching: Harley-Davidson, what would he buy?
  8. Ads Correlation
     Goal: cluster users based on past responses to ads (not on any known / learned attributes) and use that knowledge to serve new ads to users in the clusters.
     Approach:
     • AdClicked events are processed by a CF engine: (userId, adId, click) -> logged
     • The CF engine does batch processing to cluster users with similar responses to past ads
     • A CF-based optimization algorithm produces a predicted score per user for a given ad
     Issues:
     • User click data is very sparse
     • Ads may be short-lived, so frequent CF batches (like indexing) are needed
     Mitigation: is there any way to correlate user demographics with click response (currently the correlation is low)? Can we infer a user's cluster from a demographic-based cluster?
  9. Collaborative Filtering
     • Basic concept: leverage information from users' interactions to predict items of interest for a user
     • Motivation: what to recommend to a user
     • Based on: the user's past actions / feedback (clicks) and users who acted similarly to this user
     • Advantages: very good results; content / language agnostic
  10. CF Recommendation
      [Diagram: a user x ad click matrix (U1..Uj x Ad1..Adn) feeds a CF algorithm, which recommends the top ads for each user.]
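The matrix-to-recommendation flow on this slide can be sketched as a tiny user-based CF in plain Python. This is a toy in-memory version (the user/ad names and click matrix are invented); a real system would run a batch CF engine such as Mahout over logged AdClicked events.

```python
import math

# Hypothetical user -> {ad_id: click_score} matrix; real click data
# would be far sparser and come from logged AdClicked events.
clicks = {
    "u1": {"ad1": 1, "ad2": 1, "ad3": 0},
    "u2": {"ad1": 1, "ad2": 1, "ad4": 1},
    "u3": {"ad3": 1, "ad4": 0},
}

def cosine(a, b):
    """Cosine similarity between two sparse click vectors."""
    common = set(a) & set(b)
    num = sum(a[k] * b[k] for k in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def recommend(user, top_n=2):
    """Score ads the user has not seen by similarity-weighted votes."""
    scores = {}
    for other, vec in clicks.items():
        if other == user:
            continue
        sim = cosine(clicks[user], vec)
        for ad, val in vec.items():
            if ad not in clicks[user]:
                scores[ad] = scores.get(ad, 0.0) + sim * val
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("u1"))  # → ['ad4'] (u2 is similar to u1 and clicked ad4)
```

The content-agnostic nature of CF is visible here: nothing in the code looks at what the ads contain, only at who clicked what.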
  11. Serving the Best Ad a User May Click or View
      [Architecture diagram: user clicks and site content flow through MapReduce jobs (MR1-MR3) into Cassandra / MongoDB stores. A site-content analysis and classifier stage (text analytics + SVM / Bayes classifiers, enriched via Freebase and OpenCalais behind a DMZ) produces content and user-interest data; a user-cluster-based ad recommender runs the CF algorithm; supporting data lives in MySQL / Apache Jena.]
  12. Types of Big Data Platforms
      • In-memory databases
        Concepts: specialized I/O and flash memory for faster I/O; specialized hardware; vendor lock-in
        Size: order of TBs
        Vendors: Oracle Exalytics, SAP HANA, ScaleOut, Kognitio
      • Massively parallel processing (MPP)
        Concepts: massive number of nodes; organized data; distributed query; special hardware
        Size: order of tens of TBs
        Vendors: Greenplum, Netezza, Teradata Aster, Sybase IQ
      • Map Reduce
        Concepts: map and reduce; horizontally scalable; commodity hardware
        Size: hundreds of TBs to petabytes
        Vendors: Hadoop
  13. [Image: taken from the Atacama Desert in western South America by Yuri Beletsky (Las Campanas Observatory, Carnegie Institution for Science) on July 11, 2012. Copyright Yuri Beletsky.]
  14. Alignment…
      • Explosion of data from site logs, search engines, social media…
      • Google published papers on MapReduce and the Google File System, inspiring Doug Cutting, then working on Apache Lucene / Nutch; Hadoop was born
      • Yahoo took it further with 1,000 nodes in 2007-2008
      • Made it possible to process very large data sets on commodity hardware
      • Apache open source
  15. Main Stars
      • Availability: explosion of data
      • Technology: Hadoop; cheaper storage and hardware; scalability with the cloud
      • Requirement: the business need for intelligence from the data
  16. Hadoop
      • Apache Java open source project
      • Google's idea, Yahoo's original implementation, later open sourced
      • Two components: the HDFS distributed file system and the MapReduce engine
      • Runs on commodity hardware
      • Very high scalability
  17. HDFS
      • For large data sets; write once, read many
      • Fault-tolerant distributed file system
      • NameNode (metadata) and DataNodes (storage)
      • Fixed-size data blocks with checksums; files are sequences of blocks
      • Blocks replicated over a balanced cluster
      • Heartbeat reports from DataNodes to the NameNode
      [Diagram: clients read and write through the NameNode to DataNodes, with replicas spread across racks 1..N.]
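The block/checksum/replication behavior above can be simulated in a few lines of Python. This is a toy sketch (the tiny block size and node names are invented; real HDFS blocks default to 64/128 MB and placement is rack-aware):

```python
import zlib
from itertools import cycle

BLOCK_SIZE = 8                     # toy value; HDFS uses 64/128 MB
REPLICATION = 3                    # HDFS default replication factor
DATA_NODES = ["node1", "node2", "node3", "node4"]  # hypothetical nodes

def split_into_blocks(data: bytes):
    """A file is stored as a sequence of fixed-size, checksummed blocks."""
    blocks = []
    for i in range(0, len(data), BLOCK_SIZE):
        chunk = data[i:i + BLOCK_SIZE]
        blocks.append({"id": len(blocks), "data": chunk,
                       "crc": zlib.crc32(chunk)})
    return blocks

def place_replicas(blocks):
    """NameNode-style bookkeeping: each block lands on REPLICATION nodes."""
    nodes = cycle(DATA_NODES)
    return {b["id"]: [next(nodes) for _ in range(REPLICATION)]
            for b in blocks}

blocks = split_into_blocks(b"write once, read many times")
placement = place_replicas(blocks)
print(len(blocks), placement[0])   # 4 blocks; block 0 on 3 nodes
```

Reading a file is then just fetching each block from any live replica and verifying its CRC, which is why the loss of a node (detected via missed heartbeats) only triggers re-replication, not data loss.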
  18. Hadoop Jobs and Tasks
      • Move the processing (code) to the data instead of the data to the code
      • The JobTracker distributes and tracks tasks
      • TaskTrackers on the processing nodes communicate task status to the JobTracker
      • If a task does not respond, it is marked as failed and relaunched on another node
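The failed-task relaunch behavior can be sketched as follows (node names are hypothetical; real Hadoop detects failure via missed TaskTracker heartbeats rather than a health callback):

```python
def schedule_with_retry(tasks, nodes, healthy):
    """JobTracker-style sketch: a task that would land on an unresponsive
    node is marked failed and relaunched on another node."""
    placement = {}
    for task in tasks:
        for node in nodes:          # try nodes until one succeeds
            if healthy(node):
                placement[task] = node
                break
        else:
            placement[task] = None  # no healthy node left: job fails
    return placement

# node2 is down, so its tasks get relaunched on node3
place = schedule_with_retry(["t1", "t2"], ["node2", "node3"],
                            lambda n: n != "node2")
print(place)  # {'t1': 'node3', 't2': 'node3'}
```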
  19. Map Reduce
      • A two-step (map and reduce) approach to solving problems
      • Moves the code to the data
      • The map step processes data on the nodes
      • The reduce step aggregates results from all map nodes with a reduce algorithm
      [Flow: Input -> Map -> Sort / Shuffle -> Reduce -> Output]
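The map, sort/shuffle, and reduce steps above can be simulated in plain Python with the classic word count. This is an in-process sketch of the model, not Hadoop's actual API (in real Hadoop the map calls run in parallel on the nodes holding the input blocks):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map step: runs where the data lives; emits (word, 1) pairs."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Sort/shuffle: group all values by key before reducing."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: aggregate all values for one key."""
    return key, sum(values)

lines = ["big data big", "data big"]          # stand-in for HDFS blocks
pairs = chain.from_iterable(map_phase(l) for l in lines)
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'big': 3, 'data': 2}
```

Because each reduce call only sees one key's values, the reducers can also run in parallel, which is where the horizontal scalability comes from.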
  20. Big Data Process: MR Job Train
      [Diagram: a chain of MapReduce jobs, where the output of each Map -> Reduce stage becomes the input of the next.]
  21. Big Data Stack
      [Diagram, bottom to top, with speed and scale as the axes:]
      • Infrastructure / technology layer: Hadoop, HBase, MongoDB, MySQL, kdb, Esper, S4
      • Processing algorithms: Mahout, MATLAB, R, SciPy, SAS, SPSS, patents
      • Applications
  22. Big Data Logical Architecture
      [Diagram: unstructured data (data logs, streams; Lucene / Nutch) and structured data (RDBMS) are brought in via ETL / data integration into Hadoop MapReduce; results land in NoSQL (Hadoop-based), RDBMS, and SOLR stores. Workflow & scheduler, system admin, and monitoring run alongside; on top sit apps, BI visualization, analytics products, and BI / dev tools.]
  23. Hadoop Ecosystem (Basic)
      • Storage: HDFS (plus the network and HBase)
      • Processing: MapReduce
      • Data access: Avro / Thrift, Pig, Hive, Sqoop, HCatalog
      • Coordination / security: ZooKeeper, HCatalog, Knox
      • Log collection: Chukwa / Flume
      • Workflow orchestration: Oozie
      • Monitoring: Ambari, Nagios, Ganglia
      • On top: BI / analytics apps; alongside: RDBMS
  24. Apache Avro
      • RPC and serialization framework
      • Programming-language independent; schemas are defined in JSON
      • Primary use in Hadoop: communication between nodes, and data in/out of Hadoop
  25. Apache Thrift
      • Interface definition language (IDL) for RPC
      • Language-independent binary communication format
      • Layered stack enabling debugging and monitoring
      • No config / no centralization
      • Developed by Facebook
      • The IDL requires code regeneration on schema change
      [Stack: code -> service client read/write -> TProtocol -> TTransport]
  26. Apache Hive
      • SQL-like language (HiveQL) for warehousing apps
      • Compiles to MapReduce tasks
      • Used by Facebook, Netflix, etc.
      hive> CREATE TABLE ADLOG (adtime TIMESTAMP, id INT, action STRING);
      hive> SHOW TABLES;
      hive> DESCRIBE ADLOG;
      hive> ALTER TABLE …
      hive> FROM rawlog r
            INSERT OVERWRITE TABLE ADLOG
            SELECT TRANSFORM(r.time, r.id, r.input)
            USING '/bin/log' AS (adtime, id, action)
            WHERE r.time > '2008-08-09';
  27. Apache Pig
      • Pig Latin: higher-level scripting above MapReduce
      • Procedural (unlike SQL), but easy like SQL
      • Constructs like FOREACH, GROUP
      • Supports user-defined functions
      • From Yahoo
      • Good for integrating and writing Hadoop jobs
      A = LOAD 'WordcountInput.txt';
      B = MAPREDUCE 'wordcount.jar'
          STORE A INTO 'inputDir'
          LOAD 'outputDir' AS (word:chararray, count:int)
          `my.outputDir`;
  28. Sqoop
      • Bulk data load: import / export between RDBMS / NoSQL and HDFS / HBase
      • Data is sliced and transferred via map-only jobs
  29. Chukwa
      • Hadoop subproject for large-scale log processing
      • Collection and analysis, in/out of HDFS; batch oriented
      • Components: agents, collectors, MR jobs for parsing & archiving, and HICC (Hadoop Infrastructure Care Center), a web app
  30. Flume
      • Apache project for large-scale log processing, supported by Cloudera
      • Log streaming
      • Components: agents, channels, clients (Log4J appender, HTTP, …)
      • Compared with Chukwa: near real time (seconds vs. minutes); no central config
      [Diagram: client -> source agent -> Flume channel -> sink agents]
  31. Big "Fast" Data
      • Real-time ad hoc queries, inspired by Google's Percolator and Dremel
      • Cloudera Impala: SQL-like queries on HDFS with lower latency, bypassing MapReduce
      • Apache Drill
  32. Apache Storm
      • High-volume stream processing
      • From Twitter (which acquired BackType)
      • Uses ZeroMQ
      • Concepts: spouts, bolts (like map or reduce), topologies
      [Diagram: spouts feed transform bolts, which feed reduce bolts.]
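The spout-to-bolt topology above can be approximated with Python generators. This is an in-process sketch of the dataflow idea, not Storm's actual API (real bolts run distributed and process an unbounded stream; the event names here are invented):

```python
def spout(events):
    """Spout: the source of a tuple stream (here a finite list)."""
    yield from events

def transform_bolt(stream):
    """Bolt (map-like): normalize each event as it flows through."""
    for event in stream:
        yield event.lower()

def count_bolt(stream):
    """Bolt (reduce-like): maintain rolling counts per event type."""
    counts = {}
    for event in stream:
        counts[event] = counts.get(event, 0) + 1
    return counts

# Topology: spout -> transform bolt -> counting bolt
counts = count_bolt(transform_bolt(spout(["Click", "View", "click"])))
print(counts)  # {'click': 2, 'view': 1}
```

The key contrast with MapReduce is that the bolts process tuples as they arrive instead of waiting for a batch to complete, which is what makes Storm suitable for the high-velocity side of the 3 Vs.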
  33. Storm + Fusion Convergence – the Twitter Model
  34. NoSQL & Map Reduce
      NoSQL databases provide:
      • Schema flexibility and aligned programming models
      • High volume and scalability on commodity hardware
      • Eventual consistency
      • The ability to interact with real-time applications and high-velocity data
      Hadoop / HDFS caters more to batch processing; its gap with operational apps can be bridged by using NoSQL, avoiding duplication and latency of data. Such integration powers NoSQL with high-performing MapReduce functionality.
      • HBase is natively Hadoop based
      • Cassandra was augmented to work with Hadoop
      • MongoDB had MapReduce functionality but is not HDFS based; MongoDB added a Hadoop bridge
  35. Q & A