Published on

Big Data.

Published in: Technology, Business


  1. 1. BigData<br />Shankar Radhakrishnan<br />July, 2011<br />
  2. 2. Big Data in the News<br />Savings<br />American Health-Care: $300 Billion/Year<br />European Public Sector: €250 Billion/Year<br />Productivity Margins: 60% increase<br />Sources: McKinsey Global Institute<br />
  3. 3. Topics<br />What do we collect today?<br />DBMS Landscape<br />The Disconnect<br />The Need<br />What is BigData?<br />Characteristics<br />Approach<br />Architectural Requirements<br />Techniques<br />Challenges<br />Solutions<br />Issues<br />Deep Dive – Practical Approaches to Big Data<br />Hadoop<br />Aster Data<br />
  4. 4. What do we collect?<br />In 2010, people stored enough data to fill 60,000 Libraries of Congress (the LoC had collected 235 TB as of April 2011)<br />YouTube receives 24 hours of video every minute<br />5 billion mobile phones were in use in 2010<br />Tesco (British retailer) collects 1.5 billion pieces of information to adjust prices and promotions<br />30% of sales come from its recommendation engine<br />Placecast, Mobclix: track-and-target systems deliver contextual promotions<br />A Boeing jet engine produces 20 TB/hour for engineers to examine in real time to make improvements<br />Sources: Forrester, The Economist, McKinsey Global Institute<br />
  5. 5. Collect More<br />Business Operations<br />Transactions<br />Registers<br />Gateways<br />Customer Information<br />CRM<br />Product Information<br />Barcodes<br />RFID<br />Web<br />Pages<br />Web Repositories<br />Unstructured Information<br />Social Media<br />Signals<br />Mobile<br />GPS, GeoSpatial<br />
  6. 6. DBMS Solutions<br />Legacy<br />Faster Retrieval<br />Efficient Storage<br />Divide and Access<br />Data Consolidation<br />Broader Tables<br />Access all as a row<br />Fine Grain<br />Access<br />Security<br />Rules and Policies<br />Problems<br />Data Growth<br />When storage cost is not an issue<br />Scalability Issues<br />Performance Issues<br />New types of requirements<br />Deciding what to analyze, when and how?<br />Cost of a change in the subject-area to analyze<br />
  7. 7. The Disconnect<br />Old DBMS vs. New Data Types/Structures<br />Old DBMS vs. New volume<br />Old DBMS vs. New Analysis<br />Old DBMS vs. Data Retention<br />Old DBMS vs. Data Element Striping<br />Old DBMS vs. Data Infrastructure<br />Old DBMS vs. One DB Platform for all<br />
  8. 8. The Need<br />System that can handle high volume data<br />Perform complex operations<br />Scalable<br />Robust<br />Highly Available<br />Fault Tolerant<br />Economic<br />New Approach<br />
  9. 9. Big Data<br />“Tools and techniques to manage different types of data, in high volume, at high velocity, with varied requirements to mine them”<br />Characteristics<br />Size<br />Scale up and scale out: Terabyte, Petabyte …<br />Structure<br />Structured<br />Unstructured: Audio, Video, Text, GeoSpatial<br />Schema-less Structures<br />Stream<br />Torrent of real-time information<br />Operation<br />Massively Parallel Processing (MPP)<br />
  10. 10. Approach<br />Hardware<br />Commodity Hardware<br />Appliance<br />Dynamic Scaling<br />Fault Tolerant<br />Highly Available<br />No constraints on Storage<br />Cloud<br />Virtual Environment, Storage<br />Processing Models<br />In-memory<br />In-database<br />Interfaces/Adapters<br />Workload Management<br />Distributed Data Processing<br />Software<br />Frameworks – Hadoop, MapReduce, Vrije, BOOM, Bloom<br />Open Source<br />Proprietary<br />
  11. 11. Architectural Requirements<br />Integration Framework<br />Development Framework<br />Management Framework<br />Modeling Framework<br />Processing Framework<br />Data Management Framework<br />
  12. 12. Challenges<br />Volumetric Analysis<br />Complexity<br />Streaming Data/Real Time Data<br />Network Topology<br />Infrastructure<br />Pattern-based Strategy<br />
  13. 13. Techniques<br />Controlled and Variate Testing<br />Mining<br />Machine Learning<br />Natural Language Processing (NLP)<br />Cohort Analysis<br />Network or Path Analysis<br />Predictive Models<br />Crowd Sourcing<br />Regression Models<br />Sentiment Analysis<br />Processing Signals<br />Spatial Analytics<br />Visualization<br />Time-series Analysis<br />
  14. 14. Solutions<br />IBM: Infosphere BigInsights, Streams<br />Teradata/Aster Data: nCluster, SQL-MR<br />Frameworks<br />Hadoop<br />MapReduce<br />Infobright*<br />Splunk<br />Cloudera*<br />Cassandra<br />NoSQL, NewSQL<br />Google’s Big Table<br />Appliance<br />Teradata<br />Netezza (IBM)<br />Columnar Databases<br />Vertica (HP)<br />ParAccel<br />Managed Services Available<br />
  15. 15. Issues<br />Latency<br />Faultiness<br />Accuracy<br />ACID<br />Atomicity<br />Consistency<br />Isolation<br />Durability<br />Setup Cost<br />Development Cost<br />Cost-to-fly<br />
  16. 16. Deep Dive<br />Hadoop<br />
  17. 17. Top-level Apache project<br />Open source<br />Software framework written in Java<br />Inspired by Google’s white papers on Map/Reduce (MR), Google File System (GFS), and Big Table<br />Originally developed to support Apache Nutch<br />Designed<br />For large-scale data processing<br />For batch processing<br />For sophisticated analysis<br />To deal with structured and unstructured data<br />DB Architect’s Hadoop: "Heck Another Darn Obscure Open-source Project"<br />
  18. 18. Why Hadoop?<br />Runs on commodity hardware<br />Portability across heterogeneous hardware and software platforms<br />Shared-nothing architecture<br />Scale hardware whenever you want<br />System compensates for hardware scaling and issues (if any)<br />Run large-scale, high-volume data processes<br />Scales well with complex analysis jobs<br />(Hardware) “Failure is an option”<br />Ideal to consolidate data from both new and legacy data sources<br />Highly integrable<br />Value to the business<br />
  19. 19. Hadoop Ecosystem<br />HDFS: Hadoop Distributed File System<br />Map/Reduce: Software framework for clustered, distributed data processing<br />ZooKeeper: Coordination service for distributed applications<br />Avro: Data serialization<br />Chukwa: Data collection system to monitor distributed systems<br />HBase: Data storage for large distributed tables<br />Hive: Data warehouse<br />Pig: High-level query language<br />Scribe: Log collection<br />UDF: User-defined functions<br />
  20. 20. Hadoop Flow (Example)<br />Network Storage<br />Web Servers<br />Scribe<br />Oracle<br />MySQL<br />Hadoop Hive DWH<br />MySQL<br />Oracle<br />Apps<br />Feeds<br />
  21. 21. HDFS<br />Hadoop Distributed File System<br />Master/Slave Architecture<br />Runs on commodity hardware<br />Fault Tolerant<br />Handle large volumes of data<br />Provides High Throughput<br />Streaming data-access<br />Simple file coherency model<br />Portable to heterogeneous hardware and software<br />Robust<br />Handles disk failures, replication (& re-replication)<br />Performs cluster rebalancing, data integrity checks<br />
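The replication and re-replication behaviour listed on this slide can be illustrated with a toy model. This is a minimal sketch and not HDFS code: the function names (`split_into_blocks`, `place_blocks`, `re_replicate`), the round-robin placement, the tiny block size, and the replication factor of 3 are all assumptions chosen for illustration.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: list[str], replication: int) -> dict[int, set[str]]:
    """Assign each block id to `replication` distinct nodes, round-robin.

    Assumes len(nodes) >= replication so the replicas land on distinct nodes.
    """
    return {
        block: {nodes[(block + r) % len(nodes)] for r in range(replication)}
        for block in range(num_blocks)
    }


def re_replicate(placement: dict[int, set[str]], failed: str,
                 live_nodes: list[str], replication: int) -> dict[int, set[str]]:
    """Forget a failed node, then copy under-replicated blocks to live nodes."""
    for holders in placement.values():
        holders.discard(failed)
        for node in live_nodes:
            if len(holders) >= replication:
                break
            holders.add(node)  # no-op if this node already holds the block
    return placement


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical data-node names
    blocks = split_into_blocks(b"0123456789", block_size=4)
    placement = place_blocks(len(blocks), nodes, replication=3)
    live = [n for n in nodes if n != "dn3"]
    print(re_replicate(placement, "dn3", live, replication=3))
```

In the sketch the placement map plays the role the slides assign to the name node (it "maps data-nodes"), while the node names stand in for data nodes that hold the blocks.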
  22. 22. HDFS Architecture<br />Name node<br />File system operations<br />Maps data-nodes<br />Data node<br />Processes reads/writes<br />Handles data blocks<br />Replication<br />
  23. 23. Hadoop M/R<br />Tagged by a job<br />Splits the input data set into separate chunks<br />Chunks are processed by map tasks, in parallel<br />Framework sorts the output of the maps<br />Sorted output is processed by reduce tasks, in parallel<br />Input and output are typically stored in a file system<br />Framework takes care of<br />Scheduling tasks<br />Monitoring<br />Re-executing failed tasks<br />Infrastructure issues<br />Load-balancing, Load-redistribution<br />Replication, Failover<br />
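Three of the framework duties named above (splitting the input, running map tasks in parallel, and re-executing failed tasks) can be sketched in a few lines. This is an illustration under stated assumptions, not Hadoop's scheduler: the helper names `split_input`, `run_with_retry`, and `run_map_phase`, the thread pool, and the limit of three attempts are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_input(records: list, num_chunks: int) -> list[list]:
    """Split the input into `num_chunks` separate, roughly equal chunks."""
    return [records[i::num_chunks] for i in range(num_chunks)]


def run_with_retry(task, chunk, max_attempts: int = 3):
    """Run one task on one chunk, re-executing it if it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(chunk)
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt


def run_map_phase(task, chunks: list[list], workers: int = 4) -> list:
    """Run the task over all chunks in parallel; results keep chunk order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda chunk: run_with_retry(task, chunk), chunks))


if __name__ == "__main__":
    chunks = split_input(list(range(10)), num_chunks=3)
    print(run_map_phase(sum, chunks))
```

A real cluster would reschedule a failed task on a different machine; the retry loop here only mimics that on a single host.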
  26. 26. Mapper Function<br />cat * | grep | sort | uniq -c | cat > file<br />input | map | shuffle | reduce | output<br />
  27. 27. Reduce Function<br />cat * | grep | sort | uniq -c | cat > file<br />input | map | shuffle | reduce | output<br />
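The shell analogy on the two slides above (grep as map, sort as shuffle, uniq -c as reduce) can be written out as three small functions. This is a single-process sketch for illustration only and does not use the Hadoop API; the function names and the `keyword` parameter (the slide's `grep` has no pattern) are assumptions.

```python
from collections import defaultdict


def map_phase(lines, keyword):
    """Like `grep`: emit a (line, 1) pair for every line containing keyword."""
    for line in lines:
        if keyword in line:
            yield line, 1


def shuffle_phase(pairs):
    """Like `sort`: group the emitted values by key, in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Like `uniq -c`: collapse each group of values to a single count."""
    return [(key, sum(values)) for key, values in grouped]


def run(lines, keyword):
    """input | map | shuffle | reduce | output"""
    return reduce_phase(shuffle_phase(map_phase(lines, keyword)))


if __name__ == "__main__":
    log = ["error disk", "ok", "error net", "error disk"]
    print(run(log, "error"))
```

In Hadoop the map and reduce steps are the parts a developer writes, while the shuffle between them is handled by the framework.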
  28. 28. Who uses Hadoop?<br />
  29. 29. Deep Dive<br />Aster Data<br />
  30. 30. Aster Data<br />Now part of Teradata<br />Massively Parallel<br />SQL Layer on MR (MapReduce)<br />In-Database Analytics<br />Appliance vs. Software Stack Model<br />Cloud Options<br />nPath and Statistical Options<br />Data Integration<br />
  31. 31. nCluster<br />
  32. 32. Thank You<br />"You either scale to where your customer base takes you or you die"<br />Jim Starkey – Founder and CTO NimbusDB<br />"Our philosophy is to build infrastructure using the best tools available for the job and we are constantly evaluating better ways to do things when and where it matters." Facebook<br />"In any year we probably generate more data than the Walt Disney Co. did in the first 80 years of existence" Bud Albers - Disney<br />