Big Data: The frontier for innovation
Gong, Zhihong, Data Warehouse Consultant
September 2012
Agenda
• Big Data Overview
• Hadoop Theory and Practice
• MapReduce in Action
• NoSQL
• MPP Database
• What’s hot?
Big Five IT Trends
• Mobile
• Social Media
• Cloud Computing
• Consumerization of IT
• Big Data
Big Data Era
• “The coming of the Big Data Era is a chance for everyone in the technology world to decide into which camp they fall, as this era will bring the biggest opportunity for companies and individuals in technology since the dawn of the Internet.”
− Rob Thomas, IBM Vice President, Business Development
Big Data – a growing torrent
• 2 billion internet users.
• 5 billion mobile phones in use in 2010.
• 30 billion pieces of content shared on Facebook every month.
• 7 TB of data processed by Twitter every day.
• 10 TB of data processed by Facebook every day.
• 40% projected growth in global data generated per year.
• 235 TB of data collected by the US Library of Congress by April 2011.
• 15 out of 17 sectors in the US have more data stored per company than the US Library of Congress.
• 90% of the data in the world today was created in the last two years alone.
Data Rich World
• Data capture and collection
− Sensor data, mobile devices, social networks, web clickstreams, traffic monitoring, multimedia content, smart energy meters, DNA analysis, and industrial machines in the age of the Internet of Things. Consumer activities – communicating, browsing, buying, sharing, searching – create enormous trails of data.
• Data storage
− The cost of storage has been reduced tremendously
− Seagate 3 TB Barracuda @ $149.99 from Amazon.com (4.9¢/GB)
Technology world has changed
• Users: 2,000 users vs. a potential user base of 2 billion
• Applications: online transaction systems vs. web applications
• Application architecture: centralized vs. scale-out
• Infrastructure: a commodity box has more computational power than a supercomputer a decade ago
• 80% of the world’s information is unstructured.
• Unstructured information is growing at 15 times the rate of structured information.
• Database architecture has not kept pace.
A Sample Case – Big Data
• ShopSavvy 5 – mobile shopping app
− 40,000+ retailers
− Millions of shoppers
− Millions of retail store locations
− 240M+ product pictures and user action shots
− 3,040M+ product attributes (color, size, features, etc.)
− 14,720M+ prices from retailers
− 100+ price requests per second
− Delivers real-time inventory and price information
A Sample Case – Big Data (Cont.)
• ShopSavvy Architecture
− An entirely new platform, ProductCloud, leverages the latest Big Data tools like Cassandra, Hadoop, and Mahout, and maintains HUGE histories of prices, products, scans, and locations that number in the hundreds of billions of items.
− An open architecture layers tools like Mahout on top of the platform to enable new features like price prediction, user recommendations, product categorization, and product resolution.
Visualization I
• Retweet network related to the Egyptian Revolution
What is “Big Data”?
• The term Big Data applies to information that can’t be processed or analyzed using traditional processes or tools.
• Big Data creates value in several ways:
− Creating transparency
− Enabling experimentation to discover needs, expose variability, and improve performance
− Segmenting populations to customize actions
− Replacing/supporting human decision making with machine algorithms
− Innovating new business models, products, and services, e.g. risk estimation
Big Data = Big Value
• $300 billion potential annual value to US health care – more than double the total annual health care spending in Spain.
• $350 billion potential annual value to Europe’s public sector administration – more than the GDP of Greece.
• $600 billion potential annual consumer surplus from using personal location data globally.
• 60% potential increase in retailers’ operating margins possible with big data.
• 140,000 to 190,000 more deep analytical talent positions, and 1.5 million data-savvy managers, needed to take full advantage of big data in the United States.
• Gartner predicts that “Big Data will deliver transformational benefits to enterprises within 2 to 5 years.”
Traditional Data Warehouse vs. Big Data
• Traditional warehouses
− Mostly ideal for analyzing structured data and producing insights with known and relatively stable measurements.
• Big Data solutions
− Ideal for analyzing not only raw structured data, but also semi-structured and unstructured data from a wide variety of sources.
− Ideal when all of the data needs to be analyzed, versus a sample of the data.
− Ideal for iterative and exploratory analysis when business measures are not predetermined.
CAP Theorem
• CAP
− Consistency
− Availability
− Tolerance to network Partitions
• Consistency models
− Strong consistency
− Weak consistency
− Eventual consistency
• Architectures
− CA: traditional relational databases
− AP: NoSQL databases
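The AP trade-off above can be made concrete with a minimal sketch (an illustration only, not any specific product's protocol): during a partition each replica keeps accepting operations, so a read can return stale data; once the partition heals, an anti-entropy pass with last-writer-wins reconciliation makes the replicas converge, which is eventual consistency.

```python
# Minimal sketch of eventual consistency between two replicas of a
# key-value store (illustration only; names are hypothetical).

class Replica:
    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def put(self, key, value, ts):
        self.data[key] = (ts, value)

    def get(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

def anti_entropy(a, b):
    """Merge two replicas using last-writer-wins on timestamps."""
    for key in set(a.data) | set(b.data):
        ea = a.data.get(key, (-1, None))
        eb = b.data.get(key, (-1, None))
        winner = ea if ea[0] >= eb[0] else eb
        a.data[key] = b.data[key] = winner

r1, r2 = Replica(), Replica()
r1.put("cart", ["book"], ts=1)  # during a partition, the write lands on r1 only
stale = r2.get("cart")          # r2 has not seen the write yet: None
anti_entropy(r1, r2)            # partition heals; replicas reconcile
fresh = r2.get("cart")          # now both replicas agree
```

The system stayed available on both sides of the partition at the price of the temporarily stale read.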
Lower Priorities
• No complex querying functionality
− No support for SQL
− CRUD operations through a database-specific API
• No support for joins
− Materialize simple join results in the relevant row
− Give up normalization of data?
• No support for transactions
− Most data stores support single-row transactions
− Tunable consistency and availability (e.g., Dynamo) to achieve high scalability
Why sacrifice Consistency?
• It is a simple solution
− Nobody understands what sacrificing P means
− Sacrificing A is unacceptable on the Web
− Possible to push the problem to the app developer
• C is not needed in many applications
− Banks do not implement ACID (the classic example is wrong)
− Airline reservations only transact reads (Huh?)
− MySQL et al. ship by default at a lower isolation level
• Data is noisy and inconsistent anyway
− Making it, say, 1% worse does not matter
Important Design Goals
• Scale out: designed for scale
− Commodity hardware
− Low-latency updates
− Sustain high update/insert throughput
• Elasticity – scale up and down with load
• High availability – downtime implies lost revenue
− Replication (with multi-mastering)
− Geographic replication
− Automated failure recovery
A Brief History of Hadoop
• Hadoop is an open source project of the Apache Foundation.
• Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.
• In 2003, Google published a paper that described the architecture of Google’s distributed filesystem, called GFS.
• In 2004, Google published the paper that introduced MapReduce.
• It is a framework written in Java, originally developed by Doug Cutting, the creator of Apache Lucene, who named it after his son’s toy elephant.
• 2004 – Initial versions of what are now the Hadoop Distributed Filesystem and MapReduce implemented.
• January 2006 – Doug Cutting joins Yahoo!.
• February 2006 – Adoption of Hadoop by the Yahoo! Grid team.
• April 2006 – Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
A Brief History of Hadoop (Cont.)
• January 2007 – Research cluster reaches 900 nodes.
• In January 2008, Hadoop was made its own top-level project at Apache. By this time, Hadoop was being used by many other companies, such as Facebook and the New York Times.
• In February 2008, Yahoo! announced that its production search index was being generated by a 10,000-node Hadoop cluster.
• In April 2008, Hadoop broke a world record to become the fastest system to sort a terabyte of data.
• March 2009 – 17 clusters with a total of 24,000 nodes.
• April 2009 – Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100-terabyte sort in 173 minutes (on 3,400 nodes).
Hadoop Ecosystem
• Common – A set of components for distributed filesystems and general I/O.
• Avro – A serialization system for efficient data storage.
• MapReduce – A distributed data processing model and execution environment that runs on large clusters of commodity machines.
• HDFS – A distributed filesystem.
• Pig – A data flow language for exploring very large datasets.
• Hive – A distributed data warehouse system.
• HBase – A distributed, column-oriented database.
• ZooKeeper – A distributed, highly available coordination service.
• Sqoop – A tool for efficiently moving data between relational databases and HDFS.
Hadoop Distributed File System – HDFS
• A Hadoop filesystem that runs on top of the existing file system
• Designed to handle very large files with streaming data access patterns
• Uses blocks to store a file or parts of a file
− 64 MB (default), 128 MB (recommended) – compare to 4 KB in UNIX
− 1 HDFS block is backed by multiple operating system blocks
• Advantages of blocks
− High throughput
− Fixed size – easy to calculate how many fit on a disk
− A file can be larger than any single disk in the network
− Fits well with replication to provide fault tolerance and availability
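The "fixed size, easy to calculate" point can be sketched with simple arithmetic (a minimal illustration, assuming the 64 MB default block size; the function name is made up for this example). Note that, unlike disk blocks, a file smaller than one HDFS block does not occupy a full block's worth of storage.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default HDFS block size

def num_blocks(file_size_bytes):
    """How many HDFS blocks a file occupies (the last block may be partial)."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

one_gb = 1024 * 1024 * 1024
blocks_1gb = num_blocks(one_gb)        # a 1 GB file spans 16 blocks of 64 MB
blocks_1mb = num_blocks(1024 * 1024)   # a 1 MB file fits in a single partial block
```

With replication factor 3, each of those blocks is stored on three DataNodes, which is how block-structured storage combines with replication for fault tolerance.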
Hadoop Node Types
• HDFS nodes
− NameNode: one per cluster; manages the filesystem namespace and metadata; has large memory requirements, as it keeps the entire filesystem metadata in memory
− DataNode: many per cluster; manages blocks of data and serves them to clients
• MapReduce nodes
− JobTracker: one per cluster; receives job requests, schedules and monitors MapReduce jobs on TaskTrackers
− TaskTracker: many per cluster; each TaskTracker spawns Java Virtual Machines to run your map or reduce tasks
Before MapReduce…
• Large-scale data processing was difficult!
− Managing hundreds or thousands of processors
− Managing parallelization and distribution
− I/O scheduling
− Status and monitoring
− Fault/crash tolerance
• MapReduce provides all of these, easily!
MapReduce Overview
• What is it?
− A programming model used by Google
− A combination of the Map and Reduce models with an associated implementation
− Used for processing and generating large data sets
• How does it solve the problems mentioned previously?
− MapReduce is highly scalable and can be used across many computers.
− Many small machines can be used to process jobs that normally could not be processed by a large machine.
Map Abstraction
• Inputs a key/value pair
− Key is a reference to the input value
− Value is the data set on which to operate
• Evaluation
− Function defined by the user
− Applies to every value in the input
− Might need to parse the input
• Produces a new list of key/value pairs
− Can be a different type from the input pair
Reduce Abstraction
• Starts with intermediate key/value pairs
• Ends with finalized key/value pairs
• Starting pairs are sorted by key
• An iterator supplies the values for a given key to the Reduce function
• Typically a function that:
− Starts with a large number of key/value pairs – one pair for each word in all files being processed (including multiple entries for the same word)
− Ends with very few key/value pairs – one pair for each unique word across all the files, with the number of instances summed into this entry
• Work is broken up so a given worker operates on input with the same key
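The Map and Reduce abstractions above can be sketched with the classic word-count example, run in memory (a minimal single-process simulation of the flow, not the Hadoop API; `map_fn` and `reduce_fn` are illustrative names):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_key, line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: sum all counts for one unique word."""
    return (word, sum(counts))

lines = ["big data is big", "data is data"]

# Map phase: apply map_fn to every input record.
intermediate = [pair for i, line in enumerate(lines)
                for pair in map_fn(i, line)]

# Shuffle/sort phase: the framework sorts intermediate pairs by key
# so all values for one key reach the same reducer.
intermediate.sort(key=itemgetter(0))

# Reduce phase: one call per unique key, fed by an iterator of values.
result = dict(reduce_fn(k, [v for _, v in grp])
              for k, grp in groupby(intermediate, key=itemgetter(0)))
# result == {"big": 2, "data": 3, "is": 2}
```

In a real Hadoop job the three phases run on different machines and the "iterator supplies the values" step streams from disk, but the dataflow is exactly this.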
Why is this approach better?
• Creates an abstraction for dealing with complex overhead
− The computations are simple; the overhead is messy
• Removing the overhead makes programs much smaller and thus easier to use
− Less testing is required as well. The MapReduce libraries can be assumed to work properly, so only user code needs to be tested.
• Division of labor is also handled by the MapReduce libraries, so programmers only need to focus on the actual computation
MapReduce Advantages
• Automatic parallelization:
− Depending on the size of the raw input data, multiple map tasks are instantiated
− Similarly, depending on the number of intermediate <key, value> partitions, multiple reduce tasks are instantiated
• The run-time handles:
− Data partitioning
− Task scheduling
− Machine failures
− Inter-machine communication
• Completely transparent to the programmer/analyst/user
MapReduce: A step backwards?
• Don’t need 1,000 nodes to process petabytes:
− Parallel DBs do it in fewer than 100 nodes
• No support for schemas:
− Sharing across multiple MR programs is difficult
• No indexing:
− Wasteful access to unnecessary data
• Non-declarative programming model:
− Requires highly skilled programmers
• No support for JOINs:
− Requires multiple MR phases for the analysis
MapReduce vs. Parallel DB
• Web application data is inherently distributed across a large number of sites:
− Funneling data to DB nodes is a failed strategy
• Distributed and parallel programs are difficult to develop:
− Failures and dynamics in the cloud
• Indexing:
− Sequential disk access is 10 times faster than random access
− Not clear if indexing is the right strategy
• Complex queries:
− The DB community needs to JOIN hands with MR
NoSQL Movement
• Initially used for: “open-source relational databases that did not expose a SQL interface”
• Popularly used for: “non-relational, distributed data stores that often did not attempt to provide ACID guarantees”
• Gained widespread popularity through a number of open source projects
− HBase, Cassandra, MongoDB, Redis, …
• Scale-out, elasticity, flexible data model, high availability
Data in the Real World
• There are real data sets that don’t fit the relational model, nor modern ACID databases.
• Fit what into where?
− Trees
− Semi-structured data
− Web content
− Multi-dimensional cubes
− Graphs
NoSQL Database Technology
• Not only SQL
− No schema; a more dynamic data model
− Denormalization; no joins
− CAP theorem
− Auto-sharding (elasticity)
− Distributed query support
− Integrated caching
Key-Value Stores
• Key-value data model
− Key is the unique identifier
− Key is the granularity for consistent access
− Value can be structured or unstructured
• Gained widespread popularity
− In house: Bigtable (Google), PNUTS (Yahoo!), Dynamo (Amazon)
− Open source: HBase, Hypertable, Cassandra, Voldemort
• Popular choice for the modern breed of web applications
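The key-value data model above amounts to a CRUD interface keyed on a single identifier, with the value opaque to the store. A minimal in-memory sketch (an illustration only, not any listed product's API; the class and key names are made up):

```python
# Minimal key-value store sketch: all access goes through the key,
# and the store does not interpret the value -- it can be structured
# (a dict) or unstructured (raw bytes).

class KeyValueStore:
    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        """Create or update: the key is the unit of consistent access."""
        self._rows[key] = value

    def get(self, key):
        """Read by key; there is no query language or secondary index."""
        return self._rows.get(key)

    def delete(self, key):
        self._rows.pop(key, None)

store = KeyValueStore()
store.put("user:42", {"name": "Ada", "cart": ["book"]})  # structured value
store.put("img:7", b"\x89PNG...")                        # unstructured value
user = store.get("user:42")
```

Real systems add persistence, partitioning of the key space across nodes (sharding), and per-key replication, but the programming interface stays this narrow, which is what makes horizontal scaling straightforward.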
Cassandra – A NoSQL Database
• An open source, distributed store for structured data that scales out on cheap, commodity hardware
• Simplicity of operations
• Transparency
• Very high availability
• Painless scale-out
• Solid, predictable performance on commodity and cloud servers
Massively Parallel Processing (MPP) DBs
• Vertica (HP)
• Greenplum (EMC)
• Netezza (IBM)
• Teradata (NCR)
• Kognitio
− In-memory analytics
− No need for data partitioning or indexing
− Scans data in excess of 650 million rows per second per server. Linear scalability means 100 nodes can scan over 65 billion rows per second!
Vertica
• Supports logical relational models, SQL, ACID transactions, and JDBC
• Columnar store architecture
− 50x–1000x faster by eliminating costly disk I/O
− Offers aggressive data compression to reduce storage costs by up to 90%
• 20x–100x faster than a traditional RDBMS data warehouse; runs on commodity hardware
• Scale-out MPP architecture
• Real-time loading and querying
• In-database analytics
• Automatic high availability
• Natively supports grid computing
• Natively supports MapReduce and Hadoop
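Why does a columnar layout enable both less disk I/O and aggressive compression? A column store keeps each attribute contiguous, so a query touching one column skips the others entirely, and repetitive column values compress extremely well. A minimal sketch (illustration only; run-length encoding is just one of several encodings columnar engines use, and the data here is made up):

```python
from itertools import groupby

# The same table, row-oriented vs. column-oriented.
rows = [("US", 10), ("US", 12), ("US", 9), ("EU", 7), ("EU", 11)]

# Column layout: each attribute is stored contiguously, so a query on
# 'region' never reads the 'amount' bytes at all.
region_col = [r[0] for r in rows]   # ['US', 'US', 'US', 'EU', 'EU']
amount_col = [r[1] for r in rows]

def rle(values):
    """Run-length encode a column into (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

encoded = rle(region_col)  # 5 stored values collapse to 2 runs
```

In a row layout the `US`/`EU` values are interleaved with the amounts, so the same run-length trick cannot apply; storing columns separately (ideally sorted) is what makes the large compression ratios plausible.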
Machine Learning
• Machine learning systems automate decision making on data, automatically producing outputs like product recommendations or groupings.
• WEKA – a Java-based framework and GUI for machine learning algorithms.
• Mahout – an open source framework that can run common machine learning algorithms on massive datasets.
References
• Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute, May 2011.
• Understanding Big Data. IBM, 2012.
• NoSQL Database Technology whitepaper. Couchbase.
• Big Data and Cloud Computing: Current State and Future Opportunities. 2011.
• Hadoop: The Definitive Guide.
• How Do I Cassandra? Nov 2011.
• BigDataUniversity.com
• youtube.com/ibmetinfo