This document provides an overview of big data concepts and technologies. It discusses the growth of data, characteristics of big data including volume, variety and velocity. Popular big data technologies like Hadoop, MapReduce, HDFS, Pig and Hive are explained. NoSQL databases like Cassandra, HBase and MongoDB are introduced. The document also covers massively parallel processing databases and column-oriented databases like Vertica. Overall, the document aims to give the reader a high-level understanding of the big data landscape and popular associated technologies.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It addresses limitations in traditional RDBMS for big data by allowing scaling to large clusters of commodity servers, high fault tolerance, and distributed processing. The core components of Hadoop are HDFS for distributed storage and MapReduce for distributed processing. Hadoop has an ecosystem of additional tools like Pig, Hive, HBase and more. Major companies use Hadoop to process and gain insights from massive amounts of structured and unstructured data.
Having trouble distinguishing Big Data, Hadoop, and NoSQL, or seeing how they connect? This slide deck from the Savvycom team can help.
Enjoy reading!
The document discusses big data and its applications. It defines big data as large and complex data sets that are difficult to process using traditional data management tools. It outlines the three V's of big data - volume, variety, and velocity. Various types of structured, semi-structured, and unstructured data are described. Examples are given of how big data is used in various industries like automotive, finance, manufacturing, policing, and utilities to improve products, detect fraud, perform simulations, track suspects, and monitor assets. Popular big data software like Hadoop and MongoDB are also mentioned.
Introduction to Big Data Technologies & Applications (Nguyen Cao)
Big Data Myths, Current Mainstream Technologies related to Collecting, Storing, Computing & Stream Processing Data. Real-life experience with E-commerce businesses.
Hadoop Master Class: A Concise Overview (Abhishek Roy)
Abhishek Roy will teach a master class on Big Data and Hadoop. The class will cover what Big Data is, the history and background of Hadoop, how to set up and use Hadoop, and tools like HDFS, MapReduce, Pig, Hive, Mahout, Sqoop, Flume, Hue, ZooKeeper, and Impala. The class will also discuss real-world use cases and the growing market for Big Data tools and skills.
Big data refers to datasets that are too large to be managed by traditional database tools. It is characterized by volume, velocity, and variety. Hadoop is an open-source software framework that allows distributed processing of large datasets across clusters of computers. It works by distributing storage across nodes as blocks and distributing computation via a MapReduce programming paradigm where nodes process data in parallel. Common uses of big data include analyzing social media, sensor data, and using machine learning on large datasets.
The document provides an overview of big data and Hadoop, discussing what big data is, current trends and challenges, approaches to solving big data problems including distributed computing, NoSQL, and Hadoop, and introduces HDFS and the MapReduce framework in Hadoop for distributed storage and processing of large datasets.
The 3 V's of Big Data: Variety, Velocity, and Volume, from Structure:Data 2012 (Gigaom)
The document discusses the 3 V's of big data: volume, velocity, and variety. It provides examples of how each V impacts data analysis and storage. It also discusses how text data has been a major driver of big data growth and challenges. The key challenges are processing large and diverse datasets quickly enough to keep up with real-time data streams and demands.
This document provides an overview of big data and Hadoop. It defines big data as large volumes of structured, semi-structured and unstructured data that is growing exponentially and is too large for traditional databases to handle. It discusses the 4 V's of big data - volume, velocity, variety and veracity. The document then describes Hadoop as an open-source framework for distributed storage and processing of big data across clusters of commodity hardware. It outlines the key components of Hadoop including HDFS, MapReduce, YARN and related modules. The document also discusses challenges of big data, use cases for Hadoop and provides a demo of configuring an HDInsight Hadoop cluster on Azure.
The document discusses big data and NoSQL technologies. It defines big data, discusses its key characteristics of volume, velocity, and variety. It then discusses NoSQL databases as an alternative to traditional SQL databases for handling big data workloads. Specific NoSQL technologies and how they provide more scalability and flexibility for big data are covered. The document also addresses whether NoSQL is replacing SQL databases and argues it depends on the specific use case.
Big Data Processing Using Hadoop Infrastructure (Dmitry Buzdin)
The document discusses using Hadoop infrastructure for big data processing. It describes Intrum Justitia SDC, which has data across 20 countries in various formats and a high number of data objects. Hadoop provides solutions like MapReduce and HDFS for distributed storage and processing at scale. The Hadoop ecosystem includes tools like Hive, Pig, HBase, Impala and Oozie that help process and analyze large datasets. Examples of using Hadoop with Java and integrating it into development environments are also included.
The document discusses big data, including what it is, sources of big data like social media and stock exchange data, and the three Vs of big data - volume, velocity, and variety. It then discusses Hadoop, the open-source framework for distributed storage and processing of large datasets across clusters of computers. Key components of Hadoop include HDFS for distributed storage, MapReduce for distributed computation, and YARN which manages computing resources. The document also provides overviews of Pig and Jaql, programming languages used for analyzing data in Hadoop.
This document is a presentation on big data and Hadoop. It introduces big data, how it is growing exponentially, and the challenges of storing and analyzing unstructured data. It discusses how Sears moved to Hadoop to gain insights from all of its customer data. The presentation explains why Hadoop is in high demand, as it allows distributed processing of large datasets across commodity hardware. It provides an overview of the Hadoop ecosystem including HDFS, MapReduce, Hive, HBase and more. Finally, it discusses job opportunities and salaries in big data which are high and growing significantly.
Hadoop has shown itself to be a great tool for resolving problems with data velocity, variety, and volume that cause trouble for relational database storage. This presentation covers the data problems occurring today and how Hadoop can solve them, along with Hadoop's basic components and the principles that make it such a great tool.
Core Concepts and Key Technologies - Big Data Analytics (Kaniska Mandal)
Big data analytics has evolved beyond batch processing with Hadoop to extract intelligence from data streams in real time. New technologies preserve data locality, allow real-time processing and streaming, support complex analytics functions, provide rich data models and queries, optimize data flow and queries, and leverage CPU caches and distributed memory for speed. Frameworks like Spark and Shark improve on MapReduce with in-memory computation and dynamic resource allocation.
- Hadoop is a framework for managing and processing big data distributed across clusters of computers. It allows for parallel processing of large datasets.
- Big data comes from various sources like customer behavior, machine data from sensors, etc. It is used by companies to better understand customers and target ads.
- Hadoop uses a master-slave architecture with a NameNode master and DataNode slaves. Files are divided into blocks and replicated across DataNodes for reliability. The NameNode tracks where data blocks are stored.
This document provides an introduction to big data and NoSQL databases. It begins with an introduction of the presenter. It then discusses how the era of big data came to be due to limitations of traditional relational databases and scaling approaches. The document introduces different NoSQL data models including document, key-value, graph and column-oriented databases. It provides examples of NoSQL databases that use each data model. The document discusses how NoSQL databases are better suited than relational databases for big data problems and provides a real-world example of Twitter's use of FlockDB. It concludes by discussing approaches for working with big data using MapReduce and provides examples of using MongoDB and Azure for big data.
The document discusses big data analytics and related topics. It provides definitions of big data, describes the increasing volume, velocity and variety of data. It also discusses challenges in data representation, storage, analytical mechanisms and other aspects of working with large datasets. Approaches for extracting value from big data are examined, along with applications in various domains.
Big data is characterized by 3Vs - volume, velocity, and variety. Hadoop is a framework for distributed processing of large datasets across clusters of computers. It provides HDFS for storage, MapReduce for batch processing, and YARN for resource management. Additional tools like Spark, Mahout, and Zeppelin can be used for real-time processing, machine learning, and data visualization respectively on Hadoop. Benefits of Hadoop include ease of scaling to large data, high performance via parallel processing, reliability through data protection and failover.
This document provides an overview of Hadoop and how it can be used for data consolidation, schema flexibility, and query flexibility compared to a relational database. It describes the key components of Hadoop including HDFS for storage and MapReduce for distributed processing. Examples of industry use cases are also presented, showing how Hadoop enables affordable long-term storage and scalable processing of large amounts of structured and unstructured data.
This document discusses big data analytics using Hadoop. It provides an overview of loading clickstream data from websites into Hadoop using Flume and refining the data with MapReduce. It also describes how Hive and HCatalog can be used to query and manage the data, presenting it in a SQL-like interface. Key components and processes discussed include loading data into a sandbox, Flume's architecture and data flow, using MapReduce for parallel processing, how HCatalog exposes Hive metadata, and how Hive allows querying data using SQL queries.
This was presented at NHN on Jan. 27, 2009.
It introduces Big Data, its storage systems, and its analysis methods.
In particular, it covers the MapReduce debates and hybrid systems combining RDBMS and MapReduce.
It also explains various schema-free, non-relational data stores.
The document provides information about Hadoop, its core components, and MapReduce programming model. It defines Hadoop as an open source software framework used for distributed storage and processing of large datasets. It describes the main Hadoop components like HDFS, NameNode, DataNode, JobTracker and Secondary NameNode. It also explains MapReduce as a programming model used for distributed processing of big data across clusters.
This document provides an overview of big data and Hadoop. It defines big data as high-volume, high-velocity, and high-variety data that requires new techniques to capture value. Hadoop is introduced as an open-source framework for distributed storage and processing of large datasets across clusters of computers. Key components of Hadoop include HDFS for storage and MapReduce for parallel processing. Benefits of Hadoop are its ability to handle large amounts of structured and unstructured data quickly and cost-effectively at large scales.
Big Data Analytics & Trends Presentation discusses what big data is, why it's important, definitions of big data, data types and landscape, characteristics of big data like volume, velocity and variety. It covers data generation points, big data analytics, example scenarios, challenges of big data like storage and processing speed, and Hadoop as a framework to solve these challenges. The presentation differentiates between big data and data science, discusses salary trends in Hadoop/big data, and future growth of the big data market.
Disclaimer:
The images, company, product, and service names used in this presentation are for illustration purposes only. All trademarks and registered trademarks are the property of their respective owners.
Data and images were collected from various sources on the Internet.
The intention was to present the big picture of Big Data & Hadoop.
Introduction to Big Data & Hadoop Architecture - Module 1 (Rohit Agrawal)
Learning Objectives - In this module, you will understand what Big Data is, the limitations of existing solutions for the Big Data problem, how Hadoop solves it, the common Hadoop ecosystem components, Hadoop architecture, HDFS and the MapReduce framework, and the anatomy of a file write and read.
Introduction to Cloud Computing and Big Data - Hadoop (Nagarjuna D.N)
Cloud Computing Evolution
Why is Cloud Computing needed?
Cloud Computing Models
Cloud Solutions
Cloud Jobs opportunities
Criteria for Big Data
Big Data challenges
Technologies to process Big Data- Hadoop
Hadoop History and Architecture
Hadoop Eco-System
Hadoop Real-time Use cases
Hadoop Job opportunities
Hadoop and SAP HANA integration
Summary
Content1. Introduction2. What is Big Data3. Characte.docx (dickonsondorris)
Content
1. Introduction
2. What is Big Data
3. Characteristics of Big Data
4. Storing, selecting and processing of Big Data
5. Why Big Data
6. How it is Different
7. Big Data sources
8. Tools used in Big Data
9. Application of Big Data
10. Risks of Big Data
11. Benefits of Big Data
12. How Big Data Impact on IT
13. Future of Big Data
Introduction
• Big Data may well be the Next Big Thing in the IT world.
• Big data burst upon the scene in the first decade of the 21st century.
• The first organizations to embrace it were online and startup firms. Firms like Google, eBay, LinkedIn, and Facebook were built around big data from the beginning.
• Like many new information technologies, big data can bring about dramatic cost reductions, substantial improvements in the time required to perform a computing task, or new product and service offerings.
• ‘Big Data’ is similar to ‘small data’, but bigger in size.
• Because the data is bigger, it requires different approaches: techniques, tools, and architecture.
• The aim is to solve new problems, or old problems in a better way.
• Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques.
What is BIG DATA?
• Walmart handles more than 1 million customer transactions every hour.
• Facebook handles 40 billion photos from its user base.
• Decoding the human genome originally took 10 years to process; now it can be achieved in one week.
Three Characteristics of Big Data: the 3 V's
• Volume: data quantity
• Velocity: data speed
• Variety: data types
1st Characteristic of Big Data: Volume
• A typical PC might have had 10 gigabytes of storage in 2000.
• Today, Facebook ingests 500 terabytes of new data every day.
• A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.
• Smartphones, the data they create and consume, and the sensors embedded into everyday objects will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.
2nd Characteristic of Big Data: Velocity
• Clickstreams and ad impressions capture user behavior at millions of events per second.
• High-frequency stock trading algorithms reflect market changes within microseconds.
• Machine-to-machine processes exchange data between billions of devices.
• Infrastructure and sensors generate massive log data in real time.
• Online gaming systems support millions of concurrent users, each producing multiple inputs per second.
3rd Characteristic of Big Data: Variety
• Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media.
• Traditional database systems were designed to address smaller volumes of structured data, fewer updates, and a predictable, consistent data structure.
This document provides an overview of big data, including its definition, characteristics, sources, tools, applications, risks, benefits and future. Big data is characterized by large volumes of data in various formats that are difficult to process using traditional data management and analysis systems. It is generated from sources like user interactions, sensors and systems logs. Tools like Hadoop and NoSQL databases enable storing, processing and analyzing big data. Organizations apply big data analytics to areas such as healthcare, retail and security. While big data poses privacy and management challenges, it also provides opportunities to gain insights and make improved decisions. The big data industry is growing rapidly and expected to be worth over $100 billion.
This document provides an overview of big data including:
- It defines big data and discusses its key characteristics of volume, velocity, and variety.
- It describes sources of big data like social media, sensors, and user clickstreams. Tools for big data include Hadoop, MongoDB, and cloud computing.
- Applications of big data analytics include smarter healthcare, traffic control, and personalized marketing. Risks include privacy and high costs. Benefits include better decisions, opportunities for new businesses, and improved customer experiences.
- The future of big data is strong, with worldwide revenues projected to grow from $5 billion in 2012 to over $50 billion in 2017, creating millions of new jobs for data scientists and analysts.
Big Data brings big promise and also big challenges, the primary and most important one being the ability to deliver value to business stakeholders who are not data scientists.
This document provides an overview of big data. It defines big data as large volumes of diverse data that are growing rapidly and require new techniques to capture, store, distribute, manage, and analyze. The key characteristics of big data are volume, velocity, and variety. Common sources of big data include sensors, mobile devices, social media, and business transactions. Tools like Hadoop and MapReduce are used to store and process big data across distributed systems. Applications of big data include smarter healthcare, traffic control, and personalized marketing. The future of big data is promising with the market expected to grow substantially in the coming years.
This document provides an introduction to a course on big data and analytics. It outlines the following key points:
- The instructor and TA contact information and course homepage.
- The course will cover foundational data analytics, Hadoop/MapReduce programming, graph databases, and other big data topics.
- Big data is defined as data that is too large or complex for traditional database tools to process. It is characterized by high volume, velocity, and variety.
- Examples of big data sources and the exponential growth of data volumes are provided. Real-time analytics and fast data processing are also discussed.
Big data is being collected from many sources like the web, social networks, and businesses. Hadoop is an open source software framework that can process large datasets across clusters of computers. It uses a programming model called MapReduce that allows automatic parallelization and fault tolerance. Hadoop uses commodity hardware and can handle various data formats and large volumes of data distributed across clusters. Companies like Cloudera provide tools and services to help users manage and analyze big data with Hadoop.
This document provides an introduction to a course on big data. It outlines the instructor and TA contact information. The topics that will be covered include data analytics, Hadoop/MapReduce programming, graph databases and analytics. Big data is defined as data sets that are too large and complex for traditional database tools to handle. The challenges of big data include capturing, storing, analyzing and visualizing large, complex data from many sources. Key aspects of big data are the volume, variety and velocity of data. Cloud computing, virtualization, and service-oriented architectures are important enabling technologies for big data. The course will use Hadoop and related tools for distributed data processing and analytics. Assessment will include homework, a group project, and class participation.
This document discusses big data, including what it is, common data sources, its volume, velocity and variety characteristics, solutions like Hadoop and its HDFS and MapReduce components, and the impact and future of big data. It explains that big data refers to large and complex datasets that are difficult to process using traditional tools. Hadoop provides a framework to store and process big data across clusters of commodity hardware.
This document provides an overview of big data including:
- It defines big data and describes its three key characteristics: volume, velocity, and variety.
- It explains how big data is stored, selected, and processed using techniques like Hadoop and NoSQL databases.
- It discusses some common sources of big data, tools used to analyze it, and applications of big data analytics across different industries.
Introduction to Big Data and NoSQL.
This presentation was given to the Master DBA course at John Bryce Education in Israel.
Work is based on presentations by Michael Naumov, Baruch Osoveskiy, Bill Graham and Ronen Fidel.
What exactly is big data? The definition of big data is data that contains greater variety, arriving in increasing volumes and with more velocity. This is also known as the three Vs. Put simply, big data is larger, more complex data sets, especially from new data sources.
This document provides an introduction and overview of big data technologies. It begins with defining big data and its key characteristics of volume, variety and velocity. It discusses how data has exploded in recent years and examples of large scale data sources. It then covers popular big data tools and technologies like Hadoop and MapReduce. The document discusses how to get started with big data and learning related skills. Finally, it provides examples of big data projects and discusses the objectives and benefits of working with big data.
2. Agenda
• Big Data Overview
• Hadoop Theory and Practice
• MapReduce in Action
• NoSQL
• MPP Database
• What’s hot?
3. Big Five IT Trends
• Mobile
• Social Media
• Cloud Computing
• Consumerization of IT
• Big Data
4. Big Data Era
• The coming of the Big Data Era is a chance for everyone in the technology world to decide into which camp they fall, as this era will bring the biggest opportunity for companies and individuals in technology since the dawn of the Internet.
− Rob Thomas, IBM Vice President, Business Development
6. Big Data – a growing torrent
• 2 billion internet users.
• 5 billion mobile phones in use in 2010.
• 30 billion pieces of content shared on Facebook every month.
• 7 TB of data are processed by Twitter every day.
• 10 TB of data are processed by Facebook every day.
• 40% projected growth in global data generated per year.
• 235 TB of data collected by the US Library of Congress as of April 2011.
• 15 out of 17 sectors in the US have more data stored per company than the US Library of Congress.
• 90% of the data in the world today has been created in the last two years alone.
7. Data Rich World
• Data capture and collection
− Sensor data, mobile devices, social networks, web clickstreams, traffic monitoring, multimedia content, smart energy meters, DNA analysis, and industrial machines in the age of the Internet of Things; consumer activities – communicating, browsing, buying, sharing, searching – create enormous trails of data.
• Data storage
− The cost of storage has been reduced tremendously.
− Seagate 3 TB Barracuda @ $149.99 from Amazon.com (4.9¢/GB)
8. The technology world has changed
• Users: 2,000 users vs. a potential user base of 2 billion.
• Applications: online transaction systems vs. web applications.
• Application architecture: centralized vs. scale-up.
• Infrastructure: a commodity box has more computational power than a supercomputer from a decade ago.
• 80% of the world's information is unstructured.
• Unstructured information is growing at 15 times the rate of structured information.
• Database architecture has not kept pace.
9. A Sample Case – Big Data
• ShopSavvy5 – mobile shopping app
− 40,000+ retailers
− Millions of shoppers
− Millions of retail store locations
− 240M+ product pictures and user action shots
− 3,040M+ product attributes (color, size, features, etc.)
− 14,720M+ prices from retailers
− 100+ price requests per second
− Delivering real-time inventory and price information
10. A Sample Case – Big Data (Cont)
• ShopSavvy Architecture
− An entirely new platform, ProductCloud, leverages the latest Big Data tools like Cassandra, Hadoop, and Mahout, and maintains huge histories of prices, products, scans, and locations that number in the hundreds of billions of items.
− An open architecture layers tools like Mahout on top of the platform to enable new features like price prediction, user recommendations, product categorization, and product resolution.
13. What is “Big Data”?
• The term Big Data applies to information that can't be processed or analyzed using traditional processes or tools.
• Big Data creates value in several ways:
− Creating transparency
− Enabling experimentation to discover needs, expose variability, and improve performance
− Segmenting populations to customize actions
− Replacing or supporting human decision making with machine algorithms
− Innovating new business models, products, and services, e.g. risk estimation
14. Big Data = Big Value
• $300 billion potential annual value to US health care – more than double the total annual health care spending in Spain.
• $350 billion potential annual value to Europe's public sector administration – more than the GDP of Greece.
• $600 billion potential annual consumer surplus from using personal location data globally.
• 60% potential increase in retailers' operating margins possible with big data.
• 140,000 to 190,000 more deep analytic talent positions, and 1.5 million data-savvy managers, needed to take full advantage of big data in the United States.
• Gartner predicts that "Big Data will deliver transformational benefits to enterprises within 2 to 5 years".
15. Characteristics of Big Data
• Volume – terabytes -> zettabytes
• Variety – structured, semi-structured, unstructured data
• Velocity – batch -> streaming data, real-time
17. Traditional Data Warehouse vs. Big Data
• Traditional warehouses
− Mostly ideal for analyzing structured data and producing insights with known and relatively stable measurements.
• Big Data solutions
− Ideal for analyzing not only raw structured data, but also semi-structured and unstructured data from a wide variety of sources.
− Ideal when all of the data needs to be analyzed, versus a sample of the data.
− Ideal for iterative and exploratory analysis when business measures are not predetermined.
18. CAP Theorem
• CAP
− Consistency
− Availability
− Tolerance to network Partitions
• Consistency models
− Strong consistency
− Weak consistency
− Eventual consistency
• Architectures
− CA: traditional relational database
− AP: NoSQL database
19. ACID vs. BASE
• ACID
− Atomicity
− Consistency
− Isolation
− Durability
• BASE
− Basically available
− Soft-state
− Eventual consistency
20. Lower Priorities
• No Complex querying functionality
− No support for SQL
− CRUD operations through database specific API
• No support for joins
− Materialize simple join results in the relevant row
− Give up normalization of data?
• No support for transactions
− Most data stores support single row transactions
− Tunable consistency and availability (e.g., Dynamo)
=> Achieve high scalability
21. Why sacrifice Consistency?
• It is a simple solution
− nobody understands what sacrificing P means
− sacrificing A is unacceptable in the Web
− possible to push the problem to app developer
• C not needed in many applications
− Banks do not implement ACID (classic example wrong)
− Airline reservation only transacts reads (Huh?)
− MySQL et al. ship by default in lower isolation level
• Data is noisy and inconsistent anyway
− making it, say, 1% worse does not matter
22. Important Design Goals
• Scale out: designed for scale
− Commodity hardware
− Low latency updates
− Sustain high update/insert throughput
• Elasticity – scale up and down with load
• High availability – downtime implies lost revenue
− Replication (with multi-mastering)
− Geographic replication
− Automated failure recovery
23. A Brief History of Hadoop
• Hadoop is an open source project of the Apache Foundation.
• Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.
• In 2003, Google published a paper that described the architecture of Google's distributed filesystem, called GFS.
• In 2004, Google published the paper that introduced MapReduce.
• It is a framework written in Java, originally developed by Doug Cutting, the creator of Apache Lucene, who named it after his son's toy elephant.
• 2004 – Initial versions of what is now the Hadoop Distributed Filesystem and MapReduce implemented.
• January 2006 – Doug Cutting joins Yahoo!.
• February 2006 – Adoption of Hadoop by the Yahoo! Grid team.
• April 2006 – Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
24. A Brief History of Hadoop (Cont)
• January 2007 – Research cluster reaches 900 nodes.
• In January 2008, Hadoop was made its own top-level project at Apache. By this time, Hadoop was being used by many other companies such as Facebook and the New York Times.
• In February 2008, Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.
• In April 2008, Hadoop broke a world record to become the fastest system to sort a terabyte of data.
• March 2009 – 17 clusters with a total of 24,000 nodes.
• April 2009 – Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100 terabyte sort in 173 minutes (on 3,400 nodes).
25. Hadoop Ecosystem
• Common – a set of components for distributed filesystems and general I/O.
• Avro – a serialization system for efficient data storage.
• MapReduce – a distributed data processing model and execution environment that runs on large clusters of commodity machines.
• HDFS – a distributed filesystem.
• Pig – a data flow language for exploring very large datasets.
• Hive – a distributed data warehouse system.
• HBase – a distributed, column-oriented database.
• ZooKeeper – a distributed, highly available coordination service.
• Sqoop – a tool for efficiently moving data between relational databases and HDFS.
26. Hadoop Distributed File System – HDFS
• A Hadoop filesystem that runs on top of the existing file system.
• Designed to handle very large files with streaming data access patterns.
• Uses blocks to store a file or parts of a file:
− 64 MB (default), 128 MB (recommended) – compare to 4 KB in UNIX.
− One HDFS block is backed by multiple operating system blocks.
• Advantages of blocks:
− High throughput.
− Fixed size – easy to calculate how many fit on a disk.
− A file can be larger than any single disk in the network.
− Fits well with replication to provide fault tolerance and availability.
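As a concrete illustration of blocks and replication, the following is a minimal Java sketch against the standard HDFS FileSystem API. It assumes a running cluster reachable through the usual core-site.xml configuration; the path, replication factor, and buffer size are illustrative values (only the 128 MB block size comes from the slide above).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/sample.txt");       // hypothetical path
        short replication = 3;                          // 3 copies of each block
        long blockSize = 128L * 1024 * 1024;            // 128 MB, as recommended
        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out =
                 fs.create(file, true, 4096, replication, blockSize)) {
            out.writeUTF("hello hdfs");
        }

        // Streaming read back; large files are read block by block.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}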
29. Hadoop Node Types
• HDFS nodes
− NameNode: one per cluster; manages the filesystem namespace and metadata; has large memory requirements because it keeps the entire filesystem metadata in memory.
− DataNode: many per cluster; manages blocks of data and serves them to clients.
• MapReduce nodes
− JobTracker: one per cluster; receives job requests, schedules and monitors MapReduce jobs on TaskTrackers.
− TaskTracker: many per cluster; each TaskTracker spawns Java Virtual Machines to run your map or reduce tasks.
32. Before MapReduce…
• Large scale data processing was difficult!
− Managing hundreds or thousands of processors
− Managing parallelization and distribution
− I/O Scheduling
− Status and monitoring
− Fault/crash tolerance
• MapReduce provides all of these, easily!
33. MapReduce Overview
• What is it?
− A programming model used by Google.
− A combination of the Map and Reduce models with an associated implementation.
− Used for processing and generating large data sets.
• How does it solve our previously mentioned problems?
− MapReduce is highly scalable and can be used across many computers.
− Many small machines can be used to process jobs that normally could not be processed by a large machine.
37. Map Abstraction
• Inputs a key/value pair
– Key is a reference to the input value
– Value is the data set on which to operate
• Evaluation
– Function defined by user
– Applies to every value in value input
– Might need to parse input
• Produces a new list of key/value pairs
– Can be different type from input pair
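To ground the abstraction, here is the canonical word-count Mapper as a sketch against the Hadoop Java API (illustrative code, not part of the original deck): the input key is a line's byte offset, the value is the line text, and the emitted key type differs from the input key type, as the slide notes.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count Mapper: the input key (byte offset) is ignored, the value is
// one line of text, and one (word, 1) pair is emitted per token.
public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);  // output key type (Text) differs from input key type
        }
    }
}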
39. Reduce Abstraction
• Starts with intermediate key/value pairs.
• Ends with finalized key/value pairs.
• Starting pairs are sorted by key.
• An iterator supplies the values for a given key to the Reduce function.
• Typically a function that:
− Starts with a large number of key/value pairs: one key/value for each word in all files being grepped (including multiple entries for the same word).
− Ends with very few key/value pairs: one key/value for each unique word across all the files, with the number of instances summed into this entry.
• Work is broken up so a given worker works with input of the same key.
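Continuing that sketch, the matching Reducer below sums the 1s emitted for each word, and a small driver wires the pair into a job. Again this is a minimal, assumed setup rather than anything from the original deck; it reuses the hypothetical TokenizerMapper shown earlier, and the input/output paths come from the command line.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Word-count Reducer: the framework has already sorted and grouped the map
// output, so each call sees one word plus an iterator over its counts.
public class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);  // one final (word, total) pair per unique word
    }

    // Driver: wires the mapper and reducer together and runs the job.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(IntSumReducer.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Reusing the reducer as the combiner is safe here because addition is associative and commutative, so partial sums can be computed on the map side before the shuffle.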
42. Why is this approach better?
• It creates an abstraction for dealing with complex overhead: the computations are simple, the overhead is messy.
• Removing the overhead makes programs much smaller and thus easier to use. Less testing is required as well: the MapReduce libraries can be assumed to work properly, so only user code needs to be tested.
• Division of labor is also handled by the MapReduce libraries, so programmers only need to focus on the actual computation.
43. MapReduce Advantages
• Automatic parallelization:
− Depending on the size of the raw input data, instantiate multiple Map tasks.
− Similarly, depending on the number of intermediate <key, value> partitions, instantiate multiple Reduce tasks.
• Run-time handling of:
− Data partitioning
− Task scheduling
− Machine failures
− Inter-machine communication
• Completely transparent to the programmer/analyst/user.
44. MapReduce: A step backwards?
• Don’t need 1000 nodes to process petabytes:
− Parallel DBs do it in fewer than 100 nodes
• No support for schema:
− Sharing across multiple MR programs difficult
• No indexing:
− Wasteful access to unnecessary data
• Non-declarative programming model:
− Requires highly-skilled programmers
• No support for JOINs:
− Requires multiple MR phases for the analysis
45. MapReduce vs. Parallel DB
• Web application data is inherently distributed on a large number of sites:
− Funneling data to DB nodes is a failed strategy.
• Distributed and parallel programs are difficult to develop:
− Failures and dynamics in the cloud.
• Indexing:
− Sequential disk access is 10 times faster than random access.
− Not clear if indexing is the right strategy.
• Complex queries:
− The DB community needs to JOIN hands with MR.
46. NoSQL Movement
• Initially used for: "an open-source relational database that did not expose a SQL interface".
• Popularly used for: "non-relational, distributed data stores that often did not attempt to provide ACID guarantees".
• Gained widespread popularity through a number of open source projects:
− HBase, Cassandra, MongoDB, Redis, ...
• Scale-out, elasticity, flexible data model, high availability.
47. Data in the Real World
• There are real data sets that don't make sense in the relational model or in modern ACID databases.
• Fit what into where?
− Trees
− Semi-structured data
− Web content
− Multi-dimensional cubes
− Graphs
48. NoSQL Database Technology
• Not only SQL
− No schema, more dynamic data model
− Denormalizing, no join
− CAP theory
− Auto-sharding (elasticity)
− Distributed query support
− Integrated caching
50. Key Value Stores
• Key-value data model:
− The key is the unique identifier.
− The key is the granularity for consistent access.
− The value can be structured or unstructured.
• Gained widespread popularity:
− In house: Bigtable (Google), PNUTS (Yahoo!), Dynamo (Amazon).
− Open source: HBase, Hypertable, Cassandra, Voldemort.
• A popular choice for the modern breed of web applications.
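The sketch below is a hypothetical Java interface, not the API of any store named above, that captures the access model these slides describe: single-key reads and writes, with the key as the unit of lookup and of atomicity, and the value treated as opaque bytes.

import java.util.Optional;

// A hypothetical, minimal key-value store interface; the names here are
// illustrative only, not any particular product's API.
public interface KeyValueStore {
    void put(String key, byte[] value);      // upsert a single row
    Optional<byte[]> get(String key);        // point lookup by key
    void delete(String key);
    // A single-key compare-and-set is typically the strongest atomic
    // primitive such stores guarantee; multi-key transactions are rare.
    boolean compareAndSet(String key, byte[] expected, byte[] update);
}

Real stores differ mainly in how they distribute keys across nodes and what consistency they attach to these operations.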
51. Cassandra – A NoSQL Database
• An open source, distributed store for structured data that scales out on cheap, commodity hardware.
• Simplicity of operations.
• Transparency.
• Very high availability.
• Painless scale-out.
• Solid, predictable performance on commodity and cloud servers.
53. Column Oriented – Data Structure
• Tuples: { "row key": { "column name": ("value", "timestamp") } }
• Example: insert("carol", { "car": "daewoo", 2011/11/15 15:00 })

Row key | Columns (name: value @ timestamp)
jim     | age: 36 @ 2011/01/01 12:35 | car: camaro @ 2011/01/01 12:35 | gender: M @ 2011/01/01 12:35
carol   | age: 37 @ 2011/01/01 12:35 | car: subaru @ 2011/01/01 12:35 | gender: F @ 2011/01/01 12:35
johnny  | age: 12 @ 2011/01/01 12:35 | gender: M @ 2011/01/01 12:35
suzy    | age: 10 @ 2011/01/01 12:35 | gender: F @ 2011/01/01 12:35
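To make the layout concrete, here is a toy in-memory model in Java (deliberately not Cassandra's actual client API): each row key maps to a sorted map of column name to a timestamped value, and concurrent writes reconcile by last-write-wins on the timestamp, mirroring Cassandra's model.

import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of a wide row: rowKey -> sorted (column -> (value, timestamp)).
// Requires Java 16+ for records; illustrative only.
public class WideRowModel {
    public record Cell(String value, long timestampMillis) {}

    private final Map<String, TreeMap<String, Cell>> rows = new ConcurrentHashMap<>();

    // insert("carol", "car", "daewoo", ts) mirrors the slide's example.
    public void insert(String rowKey, String column, String value, long ts) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .merge(column, new Cell(value, ts),
                   // last-write-wins by timestamp, as in Cassandra's model
                   (old, neu) -> neu.timestampMillis() >= old.timestampMillis() ? neu : old);
    }

    public Cell get(String rowKey, String column) {
        TreeMap<String, Cell> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }
}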
54. Massively Parallel Processing (MPP) DB
• Vertica (HP)
• Greenplum (EMC)
• Netezza (IBM)
• Teradata (NCR)
• Kognitio
− In-memory analytics.
− No need for data partitioning or indexing.
− Scans data in excess of 650 million rows per second per server; linear scalability means 100 nodes can scan over 65 billion rows per second.
55. Vertica
• Supports logical relational models, SQL, ACID transactions, JDBC.
• Columnar store architecture:
− 50x-1000x faster by eliminating costly disk I/O.
− Offers aggressive data compression to reduce storage costs by up to 90%.
• 20x-100x faster than a traditional RDBMS data warehouse; runs on commodity hardware.
• Scale-out MPP architecture.
• Real-time loading and querying.
• In-database analytics.
• Automatic high availability.
• Natively supports grid computing.
• Natively supports MapReduce and Hadoop.
56. Machine Learning
• Machine learning systems automate decision making on data, automatically producing outputs like product recommendations or groupings.
• WEKA – a Java-based framework and GUI for machine learning algorithms.
• Mahout – an open source framework that can run common machine learning algorithms on massive datasets.
59. References
• Big Data: The Next Frontier for Innovation, Competition, and Productivity, McKinsey Global Institute, May 2011.
• Understanding Big Data, IBM, 2012.
• NoSQL Database Technology whitepaper, Couchbase.
• Big Data and Cloud Computing: Current State and Future Opportunities, 2011.
• Hadoop: The Definitive Guide.
• How Do I Cassandra, Nov 2011.
• BigDataUniversity.com
• youtube.com/ibmetinfo
• ...