What is Big Data?
Big Data Laws
Why Big Data?
Industries using Big Data
Current process/SW in SCM
Challenges in the SCM industry
How can Big Data solve these problems?
Migration to Big Data for an SCM industry
Big Data Analytics and its Application in E-Commerce (Uyoyo Edosio)
Abstract: This era, unlike any other, faces explosive growth in the size of data generated and captured. Data growth has undergone a renaissance, driven primarily by ever-cheaper computing power and the ubiquity of the internet. This has led to a paradigm shift in the E-commerce sector: data is no longer seen as a byproduct of business activities, but as a company's biggest asset, providing key insights into the needs of customers, predicting trends in customer behavior, democratizing advertisement to suit consumers' varied tastes, and providing a performance metric to assess effectiveness in meeting customers' needs.
This paper presents an overview of the unique features that differentiate big data from traditional datasets. In addition, the application of big data analytics in E-commerce and the various technologies that make analytics of consumer data possible are discussed.
Further, this paper presents some case studies of how leading E-commerce vendors such as Amazon.com, Walmart Inc., and Adidas apply Big Data analytics in their business strategies and activities to improve their competitive advantage. Lastly, we identify some challenges these E-commerce vendors face while implementing big data analytics.
This presentation is an introduction to the importance of Data Analytics in Product Management. During this talk, Etugo Nwokah, former Chief Product Officer for WellMatch, covered how to define Data Analytics and why it should be a first-class citizen in any software organization.
With many organisations considering getting on the Hadoop bandwagon, this document provides an overview of the planned use cases for Hadoop, an illustration of some of the common technology components, suggestions on when Hadoop is worth considering, some of the challenges organisations are experiencing, cost considerations and, finally, how an organisation should position itself for a Big Data initiative. Any organisation considering a Big Data initiative with Hadoop should thoroughly consider each of these areas before embarking on a course of action.
This document is the first deliverable of the Lean Big Data work package 7 (WP7). The main goal of work package 7 is to provide the use-case applications that will be used to validate the Lean Big Data platform. To this end, an analysis of the requirements of each use case is provided. This analysis will be used as the basis for the description of the evaluation, benchmarking and validation of the Lean Big Data platform.
This deliverable comprises the analysis of requirements for the following case studies provided in the context of Lean Big Data: the Data Centre Monitoring case study, the Electronic Alignment of Direct Debit Transactions case study, the Social Network-based Area Surveillance case study and the Targeted Advertisement case study.
Use of big data technologies in capital markets (Infosys)
What concerns capital market firms today is not the increase in data, but the volume of overall unstructured data. Capital market firms invest heavily in Big Data technologies despite the implementation costs involved. This article discusses the key transformations that capital market firms are undergoing to handle big data, drivers for use of big data technology in capital markets and relevant use cases.
Big Data Impact on Purchasing and SCM - PASIA World Conference Discussion (Bill Kohnen)
The volume, velocity and variety of data available is almost unthinkable. 90% of the world’s data is less than 2 years old, we are able to analyze less than 5% of it, and 80% of what people generally look at is less than 6 weeks old. Harnessing this data for effective decision making is a goal for organizations worldwide and has created a $50 billion industry providing tools and consulting.
Even before “Big Data,” purchasing groups were swimming in data and struggled to put it to effective use. The success of the Strategic Sourcing methodology had the effect of also identifying and standardizing the types and formats of information that can be used to drive improvement.
This discussion will connect how big data sources and methodology can be used to develop specific and relevant spend analytics. Also presented will be an illustration of how you can use the data and tools you already have to get immediate results and be better prepared to evaluate the need for more powerful analytic tools.
Finally, it will conclude with comments on how Big Data, along with other disruptive digital trends, will create new required skill sets for Purchasing and Supply Chain professionals and is already transforming how they operate.
Big Data Analytics | What Is Big Data Analytics? | Big Data Analytics For Beg... (Simplilearn)
This presentation on Big Data Analytics will help you understand why Big Data analytics is required, what Big Data analytics is, the lifecycle of Big Data analytics, types of Big Data analytics, tools used in Big Data analytics, and a few Big Data application domains. Also, we'll see a use case on how Spotify uses Big Data analytics. Big Data analytics is a process to extract meaningful insights from Big Data, such as hidden patterns, unknown correlations, market trends, and customer preferences. One of the essential benefits of Big Data analytics is its use in product development and innovation. Now, let us get started and understand Big Data Analytics in detail.
Below are explained in this Big Data analytics tutorial:
1. Why Big Data analytics?
2. What is Big Data analytics?
3. Lifecycle of Big Data analytics
4. Types of Big Data analytics
5. Tools used in Big Data analytics
6. Big Data application domains
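The tutorial above describes analytics as extracting "hidden patterns" and "unknown correlations" from data. As a minimal, hypothetical sketch of that idea (the transaction data and item names here are invented for illustration), the snippet below counts which pairs of items are most often bought together:

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction log: each entry is one customer's basket.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "jam"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pair is a simple example of a "hidden pattern".
top_pair, count = pair_counts.most_common(1)[0]
print(top_pair, count)  # ('bread', 'milk') 3
```

Real Big Data analytics runs this kind of co-occurrence counting over millions of baskets on a cluster, but the logic is the same.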
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
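Objective 3 above centers on the MapReduce model. As a hedged, plain-Python sketch (not Hadoop itself, just the pattern it implements), the classic word count can be expressed as a map phase emitting (word, 1) pairs, a shuffle phase grouping by key, and a reduce phase summing the counts:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def shuffle_phase(mapped):
    # Shuffle: group all emitted values by their key (the word).
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the grouped values to get a count per word.
    return {word: sum(values) for word, values in groups.items()}

documents = ["big data big insight", "big cluster"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(mapped))
print(counts)  # {'big': 3, 'data': 1, 'insight': 1, 'cluster': 1}
```

On a real Hadoop cluster the map and reduce functions run in parallel across many nodes and HDFS blocks; this single-process version only illustrates the data flow.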
Welcome to the big data use case course. In this course we will talk about what big data is and who is using it, and at the end we will share lessons learnt from the early adopters. Big Data is an umbrella term used to refer to the technology behind collecting and analyzing large volumes of data at high speed. In the last few years, the number of devices and services customers use has increased many-fold. As customers use more of everything, they create more data. By interconnecting these data, you can know your customer better and provide a better service. Big Data helps you store and connect these data.
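The "interconnecting these data" idea in the paragraph above is essentially a join across data sources on a shared customer key. A minimal sketch, with entirely hypothetical source names and records:

```python
# Hypothetical records from two separate systems, keyed by customer id.
web_clicks = {"c1": ["laptops", "phones"], "c2": ["books"]}
store_purchases = {"c1": ["phone case"], "c3": ["garden tools"]}

# Interconnect the sources into one view per customer (a full outer join).
customer_view = {}
for cid in set(web_clicks) | set(store_purchases):
    customer_view[cid] = {
        "browsed": web_clicks.get(cid, []),
        "bought": store_purchases.get(cid, []),
    }

print(customer_view["c1"])
# {'browsed': ['laptops', 'phones'], 'bought': ['phone case']}
```

At Big Data scale the same join is performed by distributed engines (e.g. Hive or Spark SQL) over billions of records, but the principle of enriching one source with another via a common key is unchanged.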
Big Data analysis in supply chain management (Kushal Shah)
Big data refers to larger, more complex data sets, especially from new data sources. These data sets are so voluminous that traditional data processing software just can’t manage them. But these massive volumes of data can be used to address business problems you wouldn’t have been able to tackle before.
The supply chain industry needs this type of data to survive in every situation.
Societal Impact of Applied Data Science on the Big Data Stack (Stealth Project)
Data availability should ideally improve accountability and decision processes. Armed with evidence of data science working across multiple domains, from healthcare analytics to internet advertising, big data is enabling changes in society, one application at a time. This talk will have two parts. We will first present a data scientist's overview of different technologies in use today and their utility.
Then we will do a deep dive on the specific implementations and challenges we addressed while working with multiple partners in the healthcare industry on real-world healthcare data. We will discuss and demonstrate prototypes of our solutions for cost prediction and risk-of-readmission care management, and how we leveraged big data machine learning frameworks. We will end with an open conversation about challenges in verticals other than healthcare and provide an overview of ongoing efforts for social good at the University of Washington Center for Data Science; each a story in its own right.
These practice guidelines are for those who manage big data and big data analytics projects or are responsible for the use of data analytics solutions. They are also intended for business leaders and program leaders responsible for developing agency capability in the area of big data and big data analytics.
For those agencies currently not using big data or big data analytics, this document may assist strategic planners, business teams and data analysts to consider the value of big data to the current and future programs.
This document is also of relevance to those in industry, research and academia who can work as partners with government on big data analytics projects.
Technical APS personnel who manage big data and/or do big data analytics are invited to join the Data Analytics Centre of Excellence Community of Practice to share information on technical aspects of big data and big data analytics, including achieving best practice with modelling and related requirements. To join the community, send an email to the Data Analytics Centre of Excellence
This presentation is entirely about Big Data Analytics, explaining in detail its 3 key characteristics, including why and where it can be used, how it is evaluated, what kinds of tools are used to store data, and how it has impacted the IT industry, with some applications and risk factors.
Enabling data scientists within an enterprise requires a well-thought out approach from an organization, technology, and business results perspective. In this talk, Tim and Hussain will share common pitfalls to data science enablement in the enterprise and provide their recommendations to avoid them. Taking an example, actionable use case from the financial services industry, they will focus on how Anaconda plays a pivotal role in setting up big data infrastructure, integrating data science experimentation and production environments, and deploying insights to production. Along the way, they will highlight opportunities for leveraging open source and unleashing data science teams while meeting regulatory and compliance challenges.
A Technical Introduction to Big Data Analytics (Pethuru Raj PhD)
This presentation gives details about the sources of big data, the value of big data, what to do with big data, and the platforms, infrastructures and architectures for big data analytics.
Forecast to contribute £216 billion to the UK economy via business creation, efficiency and innovation, and generate 360,000 new jobs by 2020, big data is a key area for recruiters.
In this QuickView:
- Big data in numbers
- Top 10 industries hiring big data professionals
- Top 10 qualifications sought by hirers
- Top 10 database and BI skills sought by hirers
- Getting started in big data: popular big data techniques and vendors
Big Data Overview: Table of Contents
– Data Growth
– Definition
– Big Data vs. Relational Data
– Its Value
– Big Data Benefits
– Big Data Usage
– Challenges
Big Data Overview: Data Growth
– Storage capacity increases 23% on average annually, but can no longer keep up with all the available information
– Exponential growth during the decade starting from 2010
[Charts: "Data Storage Growth", exabytes by year]
Big Data Overview: Definition
Gartner definition (2012): "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
Big Data Overview: Big Data vs. Relational Data

Data processing
– Relation-based data: single-computer platform that scales with better CPUs; centralized processing.
– Big Data: cluster platforms that scale to thousands of nodes; distributed processing.

Data management
– Relation-based data: relational databases (SQL); centralized storage.
– Big Data: non-relational databases (NoSQL) that manage varied data types and formats; distributed storage.

Analytics
– Relation-based data: batched, descriptive, centralized.
– Big Data: real-time, predictive and prescriptive; distributed analytics.
Big Data Overview: Its Value 1/3
Several classes of company head the revenue chart ($11.59 billion):
– broad-portfolio tech giants (IBM, HP, Oracle, EMC)
– leading software houses (Teradata, SAP, Microsoft)
– professional services companies (PwC, Accenture)
Source: Wikibon, Big Data Vendor Revenue and Market Forecast 2012-2017
Source: http://www.zdnet.com/big-data-an-overview_p2-7000020785/
Big Data Overview: Its Value 2/3
Pure play: vendors who derive 100 percent of their revenue from this market.
Source: Wikibon, Big Data Vendor Revenue and Market Forecast 2012-2017
Source: http://www.zdnet.com/big-data-an-overview_p2-7000020785/
Big Data Overview: Its Value 3/3
– IDC: Big data will become a $17 billion business by 2015 ($23.8 billion by 2016)
– Big data storage will account for 6.8% of the entire worldwide storage market by 2015
Source: Worldwide Big Data Technologies and Services: 2012-2015 Forecast (IDC, 2012)
Source: http://www.zdnet.com/big-data-an-overview_p2-7000020785/
Big Data Overview: Big Data Benefits
Business benefits received by implementing an effective Big Data methodology. The survey is based on 1153 responses from 325 respondents.
Big Data Overview: Big Data Usage 1/2
E-Commerce and Market Intelligence
– Recommender system
– Social media monitoring and analysis
– Crowd-sourcing systems
– Social and virtual games
E-Government and Politics 2.0
– Ubiquitous government services
– Equal access and public services
– Citizen engagement
Science & Technology
– S&T innovation
– Hypothesis testing
– Knowledge discovery
Smart Health and Wellbeing
– Human and plant genomics
– Healthcare decision support
– Patient community analysis
Security and Public Safety
– Crime analysis
– Computational criminology
– Terrorism informatics
– Open-source intelligence
– Cyber security
Big Data Overview: Big Data Usage 2/2
Survey of European companies from Steria's Business Intelligence Maturity Audit (biMA)
Big Data Overview: Challenges 1/2
Main challenges companies face with Big Data. The survey is based on 1153 responses from 325 respondents.
Big Data Overview: Challenges 2/2
A survey of European companies from Steria's Business Intelligence Maturity Audit (biMA):
Technical
– 38% have data quality problems
– 38% lack data governance and have no master data management system
Organizational
– 72% have no BI strategy; 70% have no BI governance
– Only 7% rate big data as very relevant
Source: http://www.steria.com/uk/media-centre/press-releases/press-releases/article/survey-suggests-only-7-of-european-companies-rate-big-data-as-very-relevant-to-their-business/
Big Data, DW & BI: Table of Contents
– Evolution
– Techniques
– Cost
– Best Practices
BI Evolution

BI&A 1.0 (DBMS-based, structured content)
– Key characteristics: RDBMS & data warehousing; ETL & OLAP; dashboards & scorecards; data mining & statistical analysis
– Gartner BI platforms core capabilities: ad hoc query & search-based BI; reporting, dashboards & scorecards; OLAP; interactive visualization; predictive modeling & data mining
– Gartner Hype Cycle: column-based DBMS; in-memory DBMS; real-time decision; data mining workbenches

BI&A 2.0 (Web-based, unstructured content)
– Key characteristics: information retrieval and extraction; opinion mining; question answering; web analytics and web intelligence; social media analytics; social network analysis; spatial-temporal analysis
– Gartner Hype Cycle: information semantic services; natural language question answering; content & text analytics

BI&A 3.0 (mobile and sensor-based content)
– Key characteristics: location-aware analysis; person-centered analysis; context-relevant analysis; mobile visualization & HCI
– Gartner Hype Cycle: mobile BI

BI and Analytics: evolution and characteristics
Big Data Overview: Techniques 1/2
McKinsey Global Institute in 2011 provided a list of the top 10 common techniques applicable across a range of industries, particularly in response to the need to analyze new amounts of data and their combinations.

List of the top 10 techniques which require Big Data (1/2)

A/B Testing
A technique in which a control group is compared with a variety of test groups in order to determine what treatments will improve a given objective. An example application is determining what copy text, layouts, images, or colors will improve conversion rates on an e-commerce Web site. Big Data enables huge numbers of tests to be executed and analyzed.

Cluster Analysis
A statistical method for classifying a huge data set, in particular to identify common behavior.

Classification
A set of techniques to identify the categories to which new data points belong, based on a training set containing data points that have already been categorized.

Data Mining
A set of techniques and technologies whose purpose is to extract patterns from large datasets by combining statistical methods and algorithms. These techniques include association rule learning, cluster analysis, classification, and regression.
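To make the A/B testing idea above concrete, here is a minimal significance check on two conversion rates, sketched in plain Python with a two-proportion z-test; the visit and conversion counts are invented for the example:

```python
import math

def ab_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: is variant B's conversion rate
    significantly different from control A's?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Control: 200 conversions out of 10,000 visits; new layout: 260 out of 10,000.
z, p = ab_test_p_value(200, 10_000, 260, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")   # a small p suggests the new layout helps
```

At Big Data scale the same test is simply run over huge numbers of simultaneous experiments rather than one pair of groups.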
Big Data Overview: Techniques 2/2
List of the top 10 techniques which require Big Data (2/2)
Network analysis
A set of techniques used to characterize relationships among discrete nodes in a graph or a network. In social network analysis, connections between individuals in a community or organization are analyzed.

Predictive modeling
A set of techniques in which a mathematical model is created or chosen to best predict the probability of an outcome.

Sentiment analysis
Application of natural language processing and other analytic techniques to identify and extract subjective information from source text material.

Statistics
The science of the collection, organization, and interpretation of data, including the design of surveys and experiments. Statistical techniques are often used to understand the relationships between variables.

Visualization
Techniques used to create images, diagrams or animations, usually integrated into more complex dashboards.
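The network analysis entry above can be illustrated in a few lines of Python. The edge list is a hypothetical friendship network, and degree centrality is one of the simplest measures of how connected each individual is:

```python
from collections import defaultdict

# Hypothetical friendship edges in a small community.
edges = [("ana", "bo"), ("ana", "cy"), ("ana", "dee"), ("bo", "cy"), ("dee", "eli")]

degree = defaultdict(int)
for u, v in edges:          # each undirected edge adds one to both endpoints
    degree[u] += 1
    degree[v] += 1

# Degree centrality: degree / (n - 1), normalized by the maximum possible degree.
n = len(degree)
centrality = {node: degree[node] / (n - 1) for node in degree}
most_connected = max(centrality, key=centrality.get)
print(most_connected, round(centrality[most_connected], 2))
```

Real social network analysis uses the same idea over graphs with millions of nodes, which is where distributed processing becomes necessary.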
Big Data: Cost 1/2
ESG (Enterprise Strategy Group) provides an analysis of the costs of Big Data, in particular a comparison between a "build" and a "buy" solution.

Build Versus Buy Elements (Using Build Pricing)
– Servers: $400,000 (@$22k each; enterprise class with dual power supplies, 36 TB of serial-attached SCSI (SAS) storage, 48-64 GB memory, 1 rack)
– Server support: $60,000 (15% of server cost)
– Switches: $15,000 (3 @ $5k for InfiniBand; older network switches will run at least 3x the cost of InfiniBand)
– Distribution/systems management software: $90,000 (Cloudera: 18 nodes @ $5k each)
– Integration: $100,000 (licenses and dedicated hardware)
– Information management tools: $20,000 (320 hours @ $100/hour human cost)
– Node configuration and implementation: $16,000 (8 hours/node, 20 nodes = 160 hours, $100/hour)
– Build project costs: $733,000 (those project items where a "buy" option exists)
Big Data: Cost 2/2
Build Versus Buy Elements (Using Buy Pricing)
– Build total: $733,000
– Buy (Oracle Big Data Appliance): $450,000 (list cost of the Oracle Big Data Appliance for the same infrastructure and tasks)
– Buy (Oracle Big Data Appliance) savings: $283,000 (not lifecycle costs, just the initial project)
– ESG estimated savings: ~39% (the Oracle Big Data Appliance lowers costs versus do-it-yourself)
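The ~39% figure follows directly from the two totals in the table; a quick arithmetic check:

```python
build_total = 733_000   # ESG "build" project cost
buy_total = 450_000     # Oracle Big Data Appliance list price

savings = build_total - buy_total
savings_pct = savings / build_total
print(f"${savings:,} saved, i.e. {savings_pct:.0%} of the build cost")
```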
Big Data: Best Practices 1/3
First of all, we need to consider when it is suitable to use Big Data technologies:
– Analyzing a huge quantity of data that is not only structured but also semi-structured and unstructured, from a wide variety of sources;
– All of the gathered data must be analyzed rather than a sample, or sampling is not as effective as analysis of the full data set;
– Iterative and exploratory analysis, when business measures on the data are not determined a priori;
– Solving information and business challenges that are not properly addressed by a traditional relational database approach.
Big Data: Best Practices 2/3
The best practices that we are going to describe cover management aspects as well as organizational and technological ones.
– Muting the HiPPOs: the highest-paid person's opinions are those on which the most important decisions about how to retrieve and analyze data depend. Today these people rely too much on intuition and experience rather than on the pure rationality of data, so this behavior needs to change;
– Start with initiatives that lead to customer-centric outcomes. It is very important for customer-oriented organizations to begin with customer analytics that enable better services as a result of a deep understanding of customers' needs and future behaviors;
– Develop an enterprise schema that includes the vision, the strategies and the requirements for Big Data; it is useful to align business users' needs with the implementation roadmap of information technologies;
– In order to achieve near-term results, it is crucial to adopt a pragmatic approach, starting from the most logical and cost-effective place to look for insight: within the enterprise;
Big Data: Best Practices 3/3
– The effectiveness of Big Data analytics strictly depends on analytical skills and analytics tools, so enterprises should invest in acquiring both;
– The Big Data strategy and the business analytics should encompass an evaluation of the organization's decision-making processes, as well as of the groups and types of decision makers;
– Try to uncover new metrics, key performance indicators and new analytics techniques to look at new and existing data in a different way in order to find new opportunities. This could require setting up a separate Big Data team with the purpose of experimenting and innovating;
– The final goal of a Big Data project is not to collect as much data as possible, but to support concrete business needs and provide new, reliable information to decision makers;
– No single technology can meet all Big Data requirements. Different workloads, data types, and user types should be served by the most suitable technology. For example, Hadoop could be the best choice for large-scale Web log analysis, but it is not suitable for real-time streaming at all. Multiple Big Data technologies must coexist, each addressing the use cases for which it is optimized.
Big Data Market Definition
IDC (2012) defines the big data market as an aggregation of storage, server, networking, software, and services market segments, each with several sub-segments.
[Figure: Big Data Technology Stack]
Big Data Market Segments
Services
– Business consulting, business process outsourcing, IT project-based services, IT outsourcing, and IT support and training services related to Big Data implementations
Infrastructure
– External storage systems
– Servers (including internal storage, memory, network cards) and supporting system software, as well as spending on self-built servers by large cloud service providers
– Datacenter networking infrastructure used in support of Big Data server and storage infrastructure
Software
– Data organization and management software, including parallel and distributed file systems and others
– Analytics and discovery software, including search engines used for Big Data applications, data mining, text mining, rich media analysis, data visualization, and others
Big Data Market Analysis
Marketsandmarkets
– Big Data Market By Types (Hardware; Software; Services; BDaaS - HaaS; Analytics; Visualization as Service); By Software (Hadoop, Big Data Analytics and Databases, System Software (IMDB, IMC)): Worldwide Forecasts & Analysis (2013 – 2018)
Hadoop: Overview
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
– Open source
– Scalable
– Distributed
The master node controls everything!
[Diagram: Hadoop overview. A master node coordinates slave nodes 1..N, each providing storage (HDFS) and computing (Map-Reduce).]
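The "simple programming models" are essentially map and reduce. A single-machine Python sketch of the classic word-count job shows the three phases a Hadoop cluster would run in parallel over many nodes (no Hadoop involved here, just the same shape):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Mapper: emit a (word, 1) pair for every word in one input split.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(mapped):
    # Shuffle: group all emitted values by key across mapper outputs.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: aggregate the list of values for each key.
    return {key: sum(values) for key, values in groups.items()}

splits = ["big data big clusters", "data moves to clusters"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(s) for s in splits)))
print(counts)
```

In real Hadoop, each split is processed by a mapper on the node that stores it, and the shuffle moves data over the network between map and reduce tasks.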
Hadoop: HDFS Structure
– The name node controls almost everything about storage
– Large files are partitioned into chunks and stored across multiple data nodes
– File chunks are replicated to mitigate node failure problems
[Diagram: HDFS structure. A name node holds the metadata; file chunks 1, 2, 3 are replicated across data nodes 1..N.]
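The chunking and replication described above can be sketched as a toy name-node table in Python; the chunk size and node names are illustrative (real HDFS blocks default to 64 or 128 MB):

```python
# Toy sketch: split a "file" into fixed-size chunks and assign each chunk
# to several data nodes, the way the name node tracks placement in HDFS.
CHUNK_SIZE = 4        # bytes per chunk (illustrative; HDFS uses 64/128 MB)
REPLICATION = 3       # copies kept of each chunk

data_nodes = ["node1", "node2", "node3", "node4", "node5"]
file_bytes = b"0123456789ABCDEF"

chunks = [file_bytes[i:i + CHUNK_SIZE] for i in range(0, len(file_bytes), CHUNK_SIZE)]

# Metadata the "name node" keeps: chunk index -> list of hosting nodes.
placement = {
    idx: [data_nodes[(idx + r) % len(data_nodes)] for r in range(REPLICATION)]
    for idx in range(len(chunks))
}
print(len(chunks), placement[0])   # any chunk survives up to two node failures
```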
Hadoop Ecosystem: HBase
HDFS
– Structured/semi-structured/unstructured data
– Write once, read many
HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable.
As a column-based database, it supports
– Insert
– Delete
– Update
Hadoop Ecosystem: Pig
Hadoop
– Analysis requires a lot of Java code
– No scripting
Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.
Pig generates and compiles Map/Reduce program(s) on the fly.
Hadoop Ecosystem: Pig Sample Script

RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml');
input = FOREACH RawInput GENERATE ContextCategoryId AS Category, DefLevelId, TagId, URL, Impressions;
defFilter = FILTER input BY (DefLevelId == 8) OR (DefLevelId == 12);
GroupedInput = GROUP defFilter BY (Category, TagId, URL);
result = FOREACH GroupedInput GENERATE group, SUM(defFilter.Impressions) AS Impressions;
STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();
Hadoop Ecosystem: Hive
– Hive is a data warehouse infrastructure built on top of Hadoop
– Supports analysis of large datasets stored in Hadoop-compatible file systems such as HDFS and the Amazon S3 file system
– Provides an SQL-like query language called HiveQL
– Provides indexes to accelerate queries
Hadoop Ecosystem: HiveQL
DML
– SELECT
DDL
– SHOW TABLES
– CREATE TABLE
– ALTER TABLE
– DROP TABLE
Mahout: Overview
– A scalable machine learning library built on Hadoop, written in Java
– Driven by the paper "Map-Reduce for Machine Learning on Multicore" (Chu et al., NIPS 2006, with Andrew Ng as co-author)
Mobility Analyzer: A Show Case
[Diagram: data flow. HANA DB → CSV files → sequence files → Mahout → clusterdump → cluster info → HANA DB]
– Modules: CSVConverter, ImportClusterInfo, ExportTweetsInfoLocal
– Sites: Hadoop cluster and local machine, driven by Run.sh