Big data



  1. 1. Big Data: The frontier for innovation. Gong, Zhihong, Data Warehouse Consultant. September 2012
  2. 2. Agenda• Big Data Overview • Hadoop Theory and Practice • MapReduce in Action • NoSQL • MPP Database • What's hot?
  3. 3. Big Five IT Trends• Mobile • Social Media • Cloud Computing • Consumerization of IT • Big Data
  4. 4. Big Data Era• The coming of the Big Data era is a chance for everyone in the technology world to decide into which camp they fall, as this era will bring the biggest opportunity for companies and individuals in technology since the dawn of the Internet. − Rob Thomas, IBM Vice President, Business Development
  5. 5. Big Data job trend
  6. 6. Big Data – a growing torrent• 2 billion internet users • 5 billion mobile phones in use in 2010 • 30 billion pieces of content shared on Facebook every month • 7 TB of data processed by Twitter every day • 10 TB of data processed by Facebook every day • 40% projected annual growth in global data generated • 235 TB of data collected by the US Library of Congress in April 2011 • 15 out of 17 sectors in the US have more data stored per company than the US Library of Congress • 90% of the data in the world today has been created in the last two years alone
  7. 7. Data Rich World• Data capture and collection − Sensor data, mobile devices, social networks, web clickstreams, traffic monitoring, multimedia content, smart energy meters, DNA analysis, industrial machines in the age of the Internet of Things; consumer activities – communicating, browsing, buying, sharing, searching – create enormous trails of data • Data storage − Cost of storage has fallen tremendously − Seagate 3 TB Barracuda @ $149.99 (4.9¢/GB)
  8. 8. Technology world has changed• Users: 2,000 users vs. a potential user base of 2 billion • Applications: online transaction systems vs. web applications • Application architecture: centralized vs. scale-up • Infrastructure: a commodity box has more computational power than a supercomputer a decade ago • 80% of the world's information is unstructured • Unstructured information is growing at 15 times the rate of structured information • Database architecture has not kept pace
  9. 9. A Sample Case – Big Data• ShopSavvy5 – mobile shopping app − 40,000+ retailers − Millions of shoppers − Millions of retail store locations − 240M+ product pictures and user action shots − 3,040M+ product attributes (color, size, features, etc.) − 14,720M+ prices from retailers − 100+ price requests per second − delivering real-time inventory and price information
  10. 10. A Sample Case – Big Data (Cont)• ShopSavvy Architecture − An entirely new platform, ProductCloud, leverages the latest Big Data tools like Cassandra, Hadoop, and Mahout, and maintains HUGE histories of prices, products, scans and locations that number in the hundreds of billions of items − An open architecture layers tools like Mahout on top of the platform to enable new features like price prediction, user recommendations, product categorization and product resolution
  11. 11. Visualization I• Retweet network related to Egyptian Revolution
  12. 12. Visualization II
  13. 13. What is "Big Data"• The term Big Data applies to information that can't be processed or analyzed using traditional processes or tools • Big Data creates value in several ways − Creating transparency − Enabling experimentation to discover needs, expose variability, and improve performance − Segmenting populations to customize actions − Replacing/supporting human decision making with machine algorithms − Innovating new business models, products, and services, e.g. risk estimation
  14. 14. Big Data = Big Value• $300 billion potential annual value to US health care – more than double the total annual health care spending in Spain • $350 billion potential annual value to Europe's public sector administration – more than the GDP of Greece • $600 billion potential annual consumer surplus from using personal location data globally • 60% potential increase in retailers' operating margins possible with big data • 140,000 to 190,000 more deep analytic talent positions, and 1.5 million data-savvy managers, needed to take full advantage of big data in the United States • Gartner predicts that "Big Data will deliver transformational benefits to enterprises within 2 to 5 years"
  15. 15. Characteristics of Big Data• Volume – terabytes → zettabytes • Variety – structured, semi-structured, unstructured data • Velocity – batch → streaming data, real-time
  16. 16. Traditional Data vs. Big Data
  17. 17. Traditional Data Warehouse vs. Big Data• Traditional warehouses − mostly ideal for analyzing structured data and producing insights with known and relatively stable measurements • Big Data solutions − ideal for analyzing not only raw structured data, but also semi-structured and unstructured data from a wide variety of sources − ideal when all of the data needs to be analyzed versus a sample of the data − ideal for iterative and exploratory analysis when business measures are not predetermined
  18. 18. CAP Theorem• CAP− Consistency− Availability− Tolerance to network Partitions• Consistency models− Strong consistency− Weak consistency− Eventual consistency• Architectures− CA: traditional relational database− AP: NoSQL database
  19. 19. ACID vs. BASE• ACID− Atomicity− Consistency− Isolation− Durability• BASE− Basically available− Soft-state− Eventual consistency
  20. 20. Lower Priorities• No complex querying functionality − No support for SQL − CRUD operations through database-specific APIs • No support for joins − Materialize simple join results in the relevant row − Give up normalization of data? • No support for transactions − Most data stores support single-row transactions − Tunable consistency and availability (e.g., Dynamo) → achieve high scalability
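The "materialize simple join results in the relevant row" idea above can be sketched in a few lines of Python. This is a toy illustration with made-up data, not any particular store's API: the order row carries a copy of the customer fields it needs, so a single-row read answers the query with no join.

```python
# Sketch: denormalizing a join into the row itself, as many NoSQL stores do.
# Hypothetical data; customers/orders are illustrative names only.

customers = {"c1": {"name": "Alice", "city": "Paris"}}

# Normalized (relational) style: the order references the customer by key,
# so reading the city requires a join at query time.
order_normalized = {"order_id": "o1", "customer_id": "c1", "total": 42.0}

# Denormalized (NoSQL) style: the join result is materialized in the row.
order_denormalized = {
    "order_id": "o1",
    "total": 42.0,
    "customer_name": customers["c1"]["name"],
    "customer_city": customers["c1"]["city"],
}

print(order_denormalized["customer_city"])  # single-key lookup, no join
```

The trade-off is the one the slide hints at: updates to a customer must now touch every row that copied its fields, which is why normalization is "given up" rather than free.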
  21. 21. Why sacrifice Consistency?• It is a simple solution− nobody understands what sacrificing P means− sacrificing A is unacceptable in the Web− possible to push the problem to app developer• C not needed in many applications− Banks do not implement ACID (classic example wrong)− Airline reservation only transacts reads (Huh?)− MySQL et al. ship by default in lower isolation level• Data is noisy and inconsistent anyway− making it, say, 1% worse does not matter
  22. 22. Important Design Goals• Scale out: designed for scale− Commodity hardware− Low latency updates− Sustain high update/insert throughput• Elasticity – scale up and down with load• High availability – downtime implies lost revenue− Replication (with multi-mastering)− Geographic replication− Automated failure recovery
  23. 23. A Brief History of Hadoop• Hadoop is an open source project of the Apache Foundation • Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project • In 2003, Google published a paper that described the architecture of Google's distributed filesystem, called GFS • In 2004, Google published the paper that introduced MapReduce • It is a framework written in Java, originally developed by Doug Cutting, the creator of Apache Lucene, who named it after his son's toy elephant • 2004 – Initial versions of what are now the Hadoop Distributed Filesystem and MapReduce implemented • January 2006 – Doug Cutting joins Yahoo! • February 2006 – Adoption of Hadoop by the Yahoo! Grid team • April 2006 – Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours
  24. 24. A Brief History of Hadoop (Cont)• January 2007 – Research cluster reaches 900 nodes • In January 2008, Hadoop was made its own top-level project at Apache; by this time, Hadoop was being used by many other companies such as Facebook and the New York Times • In February 2008, Yahoo! announced that its production search index was being generated by a 10,000-node Hadoop cluster • In April 2008, Hadoop broke a world record to become the fastest system to sort a terabyte of data • March 2009 – 17 clusters with a total of 24,000 nodes • April 2009 – Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100-terabyte sort in 173 minutes (on 3,400 nodes)
  25. 25. Hadoop Ecosystem• Common – a set of components for distributed filesystems and general I/O • Avro – a serialization system for efficient data storage • MapReduce – a distributed data processing model and execution environment that runs on large clusters of commodity machines • HDFS – a distributed filesystem • Pig – a data flow language for exploring very large datasets • Hive – a distributed data warehouse system • HBase – a distributed, column-oriented database • ZooKeeper – a distributed, highly available coordination service • Sqoop – a tool for efficiently moving data between relational databases and HDFS
  26. 26. Hadoop Distributed File System - HDFS• A Hadoop filesystem that runs on top of the existing file system • Designed to handle very large files with streaming data access patterns • Uses blocks to store a file or parts of a file − 64 MB (default), 128 MB (recommended) – compare to 4 KB in UNIX • 1 HDFS block is backed by multiple operating system blocks • Advantages of blocks − High throughput − Fixed size – easy to calculate how many fit on a disk − A file can be larger than any single disk in the network − Fits well with replication to provide fault tolerance and availability
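The "easy to calculate how many fit" point above is just integer arithmetic on the block size; a quick sketch (file sizes here are made up for illustration):

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # HDFS default block size from the slide: 64 MB

def blocks_needed(file_size_bytes):
    """Number of HDFS blocks a file occupies (the last block may be partial)."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

# A 1 GB file spans 16 blocks of 64 MB, spread (and replicated) across DataNodes.
print(blocks_needed(1024 * 1024 * 1024))  # 16
# A 1 KB file occupies one block entry, but unlike a fixed disk partition it
# only consumes 1 KB of underlying storage, since blocks are not pre-allocated.
print(blocks_needed(1024))                # 1
```

This is also why a file can exceed any single disk: its 64 MB blocks need not live on the same machine.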
  27. 27. Hadoop Cluster
  28. 28. Hadoop Architecture
  29. 29. Hadoop Node Types• HDFS nodes − NameNode: one per cluster; manages the filesystem namespace and metadata; large memory requirements; keeps the entire filesystem metadata in memory − DataNode: many per cluster; manages blocks of data and serves them to clients • MapReduce nodes − JobTracker: one per cluster; receives job requests, schedules and monitors MapReduce jobs on task trackers − TaskTracker: many per cluster; each TaskTracker spawns Java Virtual Machines to run your map or reduce tasks
  30. 30. Write File to HDFS
  31. 31. Run MapReduce jobs
  32. 32. Before MapReduce…• Large scale data processing was difficult!− Managing hundreds or thousands of processors− Managing parallelization and distribution− I/O Scheduling− Status and monitoring− Fault/crash tolerance• MapReduce provides all of these, easily!
  33. 33. MapReduce Overview• What is it? − A programming model used by Google − A combination of the Map and Reduce models with an associated implementation − Used for processing and generating large data sets • How does it solve our previously mentioned problems? − MapReduce is highly scalable and can be used across many computers − Many small machines can be used to process jobs that normally could not be processed by a large machine
  34. 34. Simple Data Flow Example
  35. 35. Another Data Flow Example
  36. 36. Map, Shuffle and Reduce
  37. 37. Map Abstraction• Inputs a key/value pair– Key is a reference to the input value– Value is the data set on which to operate• Evaluation– Function defined by user– Applies to every value in value input– Might need to parse input• Produces a new list of key/value pairs– Can be different type from input pair
  38. 38. Map Example
  39. 39. Reduce Abstraction• Starts with intermediate key/value pairs • Ends with finalized key/value pairs • Starting pairs are sorted by key • An iterator supplies the values for a given key to the Reduce function • Typically a function that: − Starts with a large number of key/value pairs – one key/value for each word in all files being grepped (including multiple entries for the same word) − Ends with very few key/value pairs – one key/value for each unique word across all the files, with the number of instances summed into this entry • Broken up so a given worker works with input of the same key
  40. 40. Reduce Example
  41. 41. MapReduce Data Flow
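The map → shuffle → reduce flow sketched on the preceding slides can be simulated in a few lines of Python. This is the canonical word-count example as a toy, single-process simulation, not the Hadoop API; in a real cluster the shuffle step is what the framework performs between the map and reduce phases.

```python
from collections import defaultdict

def map_phase(document):
    # Mapper: emit one (word, 1) pair per word in the input value.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the list of values supplied for each key.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big value", "big data era"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])  # 3
```

Each mapper sees only its own document and each reducer only its own key's values, which is what makes the real version trivially parallelizable across machines.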
  42. 42. Why is this approach better?• Creates an abstraction for dealing with complex overhead − The computations are simple, the overhead is messy • Removing the overhead makes programs much smaller and thus easier to use − Less testing is required as well: the MapReduce libraries can be assumed to work properly, so only user code needs to be tested • Division of labor is also handled by the MapReduce libraries, so programmers only need to focus on the actual computation
  43. 43. MapReduce Advantages• Automatic parallelization: − Depending on the size of the RAW INPUT DATA → instantiate multiple MAP tasks − Similarly, depending upon the number of intermediate <key, value> partitions → instantiate multiple REDUCE tasks • Run-time: − Data partitioning − Task scheduling − Handling machine failures − Managing inter-machine communication • Completely transparent to the programmer/analyst/user
  44. 44. MapReduce: A step backwards?• Don’t need 1000 nodes to process petabytes:− Parallel DBs do it in fewer than 100 nodes• No support for schema:− Sharing across multiple MR programs difficult• No indexing:− Wasteful access to unnecessary data• Non-declarative programming model:− Requires highly-skilled programmers• No support for JOINs:− Requires multiple MR phases for the analysis
  45. 45. MapReduce VS Parallel DB• Web application data is inherently distributed on a large number of sites: − Funneling data to DB nodes is a failed strategy • Distributed and parallel programs are difficult to develop: − Failures and dynamics in the cloud • Indexing: − Sequential disk access is 10 times faster than random access − Not clear if indexing is the right strategy • Complex queries: − The DB community needs to JOIN hands with MR
  46. 46. NoSQL Movement• Initially used for: "open-source relational databases that did not expose a SQL interface" • Popularly used for: "non-relational, distributed data stores that often did not attempt to provide ACID guarantees" • Gained widespread popularity through a number of open source projects − HBase, Cassandra, MongoDB, Redis, … • Scale-out, elasticity, flexible data model, high availability
  47. 47. Data in Real World• There are real data sets that don't make sense in the relational model, nor in modern ACID databases • Fit what into where? − Trees − Semi-structured data − Web content − Multi-dimensional cubes − Graphs
  48. 48. NoSQL Database Technology• Not only SQL − No schema, more dynamic data model − Denormalizing, no joins − CAP theorem − Auto-sharding (elasticity) − Distributed query support − Integrated caching
  49. 49. NoSQL Databases• Key-Value store− Redis (in memory), Riak• Column oriented− Cassandra, HBase, Dynamo, BigTable• Document oriented− MongoDB (JSON), CouchBase• Graph
  50. 50. Key Value Stores• Key-value data model − Key is the unique identifier − Key is the granularity for consistent access − Value can be structured or unstructured • Gained widespread popularity − In house: Bigtable (Google), PNUTS (Yahoo!), Dynamo (Amazon) − Open source: HBase, Hypertable, Cassandra, Voldemort • Popular choice for the modern breed of web applications
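The key-value model above is small enough to sketch directly. This toy in-memory store (not any listed product's API) shows the two points on the slide: all access is by unique key, and the value is opaque to the store, so it may be structured or unstructured.

```python
class ToyKVStore:
    """Minimal in-memory key-value store: get/put/delete by unique key.
    The store never inspects the value, so it can be any object."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = ToyKVStore()
# The value here is a structured record, but bytes or a string work equally well.
store.put("user:42", {"name": "jim", "car": "camaro"})
print(store.get("user:42")["name"])  # jim
```

Because the key is the unit of access, consistency guarantees in real stores are also typically scoped to a single key, which is what "key is the granularity for consistent access" means.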
  51. 51. Cassandra – A NoSQL Database• An open source, distributed store for structured data that scales out on cheap, commodity hardware • Simplicity of operations • Transparency • Very high availability • Painless scale-out • Solid, predictable performance on commodity and cloud servers
  52. 52. Column Oriented
  53. 53. Column Oriented – Data Structure• Tuples: { "row key": { "column name": ("value", "timestamp") } }• insert("carol", { "car": "daewoo", 2011/11/15 15:00 })• Sample rows (each column carries its own timestamp; all below written at 2011/01/01 12:35): − jim: age: 36, car: camaro, gender: M − carol: age: 37, car: subaru, gender: F − johnny: age: 12, gender: M − suzy: age: 10, gender: F
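The timestamped-column layout above can be mimicked with nested dicts. This is a sketch of the data model only, not Cassandra's storage engine or API; rows, columns, and timestamps follow the slide's example.

```python
# Sketch of a wide-column row: each row key maps to a set of named columns,
# and each column stores both a value and the timestamp of its last write.

table = {}

def insert(row_key, columns, timestamp):
    row = table.setdefault(row_key, {})
    for name, value in columns.items():
        row[name] = {"value": value, "timestamp": timestamp}

insert("carol", {"age": 37, "car": "subaru", "gender": "F"}, "2011/01/01 12:35")
# A later insert overwrites only the named column and records its own
# timestamp, mirroring the insert("carol", {"car": "daewoo", ...}) example.
insert("carol", {"car": "daewoo"}, "2011/11/15 15:00")

print(table["carol"]["car"]["value"])  # daewoo
```

Note that rows need not share a schema: johnny and suzy on the slide simply have no "car" column, rather than a NULL in a fixed table layout.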
  54. 54. Massively Parallel Processing (MPP) DB• Vertica (HP) • Greenplum (EMC) • Netezza (IBM) • Teradata (NCR) • Kognitio − In-memory analytics − No need for data partitioning or indexing − Scans data in excess of 650 million rows per second per server; linear scalability means 100 nodes can scan over 65 billion rows per second!
  55. 55. Vertica• Supports logical relational models, SQL, ACID transactions, JDBC • Columnar store architecture − 50x-1000x faster by eliminating costly disk I/O − offers aggressive data compression to reduce storage costs by up to 90% • 20x-100x faster than a traditional RDBMS data warehouse; runs on commodity hardware • Scale-out MPP architecture • Real-time loading and querying • In-database analytics • Automatic high availability • Natively supports grid computing • Natively supports MapReduce and Hadoop
  56. 56. Machine Learning• Machine learning systems automate decision making on data, automatically producing outputs like product recommendations or groupings • WEKA – a Java-based framework and GUI for machine learning algorithms • Mahout – an open source framework that can run common machine learning algorithms on massive datasets
  57. 57. Popular Technologies• Databases − HBase, Cassandra, MongoDB, Redis, CouchDB, Vertica, Greenplum, Netezza • Programming languages − Java; Python, Perl; Hive, Pig, JAQL • ETL tools − Talend, Pentaho • BI tools − Pentaho, Tableau • Analytics − R, Mahout, BigInsights • Methodology − Agile • Other − Hadoop, MapReduce, Lucene, Solr, JSON, UIMA, ZooKeeper
  58. 58. Questions
  59. 59. References• Big Data: The next frontier for innovation, competition, and productivity, McKinsey Global Institute, May 2011 • Understanding Big Data, IBM, 2012 • NoSQL Database Technology whitepaper, CouchBase • Big Data and Cloud Computing: Current State and Future Opportunities, 2011 • Hadoop: The Definitive Guide • How Do I Cassandra?, Nov 2011