Introduction To Big Data & Hadoop

Blackvard Management Consulting
Copyright © Blackvard Management Consulting – All rights reserved www.blackvard.com

Agenda
What Will Be Covered:
1. What Is Big Data?
2. Business Intelligence
3. Big Data Analytics
4. Existing Database Technology
5. What is Hadoop?
6. Data Warehouse Appliances vs. Hadoop
7. Hadoop & SAP HANA
8. Business Use Cases

Is “Big Data” Simply “Too Much Data?”
• Is the term “Big Data” just about “big”?
• Big Data is often called the “new black gold,” holding many undiscovered insights.
http://dilbert.com/strip/2006-11-11

3 Vs of Big Data
Volume
- Tera- and petabytes
- Transactions
- Tables, files
Variety
- Structured
- Semi-structured
- Unstructured
Velocity
- Batch
- (Near) real-time
- Streams
Source: Philip Russom: BIG DATA ANALYTICS – TDWI Best Practice Report
The term “Big Data” is defined by the steadily increasing VOLUME, VARIETY, and processing VELOCITY of available data.
Big Data is about 3 “Vs”:
Volume: massive amounts of data to process
Velocity: the speed at which data comes into the system
Variety: the growing range of data structures, from structured to unstructured
Big Data Defined

VARIETY:
Most data is unstructured.
[Figure: data sources arranged by structure (structured vs. unstructured) and origin (internal vs. external)]
Structured data: partner data, reference data, CRM, ERP, production, finance, HR, procurement, machine sensor data, etc.
Unstructured data: documents, email, contact center calls, presentations, security images, medical scans, social media content, channel content
Tools: traditional BI; BI + data connections; search, ECM; social media monitoring tools
Business Intelligence & Variety
In Business Intelligence (BI) systems, data is mostly internal and structured.
Including social media content, digitalization, and global supply chains shifts the requirements toward supporting a broader variety of data structures.
Business Intelligence is the set of techniques and tools required for the transformation of raw data into meaningful and useful information for business analysis purposes.

Analytical appliances
• Tightly integrated hardware-software combinations
• Analytical bundles: standalone software + hardware combinations
Analytical services
• Systems are hosted in an off-site environment or public cloud
File-based analytical systems
• Hadoop
• NoSQL (although it is not file-based in the common sense)
Analytical databases
• Software-only analytical platforms
• Mostly massively parallel processing (MPP), columnar, and in-memory databases
Big Data Analytics
Big Data Analytics Platforms can be classified into four major categories:
1) Analytical Databases
2) Analytical Appliances
3) Analytical Services
4) File-based analytical systems (main focus)

Several platforms embrace existing database technologies in order to optimize
analytical applications on large data volumes.
Technology | Description | Vendor / Product
Massively parallel processing (MPP) | Row-based databases designed to scale out on a cluster of commodity servers. Also known as “shared-nothing” architecture. | Teradata Active Data Warehouse, Greenplum (EMC), Microsoft Parallel Data Warehouse, Aster Data (Teradata), Kognitio
Columnar databases | DBMS that store data in columns, not rows. Support high data compression and analytical query performance. | Sybase IQ (SAP), ParAccel, Infobright, Vertica (HP), 1010data
Analytical appliances | Pre-configured hardware-software systems | Netezza (IBM), Teradata Appliances, Oracle Exadata, Greenplum Data Computing Appliance (EMC)
In-memory databases | Systems load data into memory to execute complex queries | SAP HANA, Cognos TM1 (IBM), QlikView, Membase
Distributed file-based systems | Systems designed for storing, manipulating, and querying large volumes of unstructured and semi-structured data | Hadoop (Apache, Cloudera, MapR, IBM, Hortonworks), Apache Hive, Apache Pig
Analytical services (cloud) | Analytical platforms delivered as hosted or public-cloud-based services | 1010data, Kognitio
Nonrelational (NoSQL) | Nonrelational databases optimized for querying unstructured and structured data | MongoDB, Apache Cassandra, Apache HBase
Complex event processing (CEP) | Systems optimized for calculation and correlation of large volumes of discrete events and application of conditions | IBM, Tibco, Streambase, Sybase (Aleri), Informatica
Source: Wayne Eckerson: BIG DATA ANALYTICS: PROFILING THE USE OF ANALYTICAL PLATFORMS IN USER ORGANIZATIONS
Existing Database Technology
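The columnar entry in the table (data stored in columns, not rows, with high compression and fast analytical queries) can be illustrated with a small sketch. It is a toy model and not how any listed product works internally: the sample records, the helper names `to_columns` and `run_length_encode`, and the choice of run-length encoding as the compression scheme are all assumptions made for illustration.

```python
from itertools import groupby

# Hypothetical sample records, invented for this illustration.
rows = [
    {"region": "EU", "product": "A", "revenue": 120},
    {"region": "EU", "product": "B", "revenue": 80},
    {"region": "US", "product": "A", "revenue": 200},
    {"region": "US", "product": "A", "revenue": 50},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [record[name] for record in records] for name in records[0]}


def run_length_encode(values):
    """Compress consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# An analytical query such as "total revenue" touches a single column
# instead of scanning every field of every row.
total_revenue = sum(columns["revenue"])

# Columns with few distinct values compress well once stored contiguously.
encoded_region = run_length_encode(columns["region"])

if __name__ == "__main__":
    print(total_revenue)
    print(encoded_region)
```

The row layout remains the better fit when whole records are read or written one at a time, which is why the table lists row-based MPP systems and columnar databases as separate technologies.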

History
• Google published a paper describing the MapReduce algorithm for processing large amounts of data
• Doug Cutting, who later worked at Yahoo, read that paper and initiated Hadoop
• Hadoop was the name of his son's yellow toy elephant
• Hadoop became an Apache top-level project, supported by Facebook, IBM, and Yahoo, among others
Facts
• Open source project
• Written in Java
• Optimized to handle:
  - Massive amounts of data through parallelism
  - Inexpensive commodity hardware
  - A variety of data (structured, unstructured, semi-structured)
• Great performance (on large data volumes)
• Reliability provided through replication
• Not for OLTP, not for OLAP, good for Big Data (1)
(1) OLTP: Online Transaction Processing (CRM, ERP)
    OLAP: Online Analytical Processing (data mining, complex queries over multidimensional data)
What is Hadoop?
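The point about reliability through replication can be sketched in a few lines. This is a simplified model under stated assumptions, not HDFS's actual placement policy: the node names, the tiny block size, the round-robin placement, and the function names are hypothetical. A replication factor of 3 is used as the assumed default.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Real HDFS placement also considers racks and free space; this sketch
    only shows that every block lives on several machines.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a healthy node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


# Hypothetical cluster of four data nodes and a 10-byte "file".
nodes = ["node1", "node2", "node3", "node4"]
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), nodes)

if __name__ == "__main__":
    for block, replicas in placement.items():
        print(block, replicas)
    print(all(readable_after_failure(placement, node) for node in nodes))
```

Because each block is held by three different nodes, the loss of any single node leaves the whole file readable, which is the property the slide summarizes as reliability through replication.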

Hadoop consists mainly of two components:
Hadoop Distributed File System (HDFS)
• HDFS stores data on several nodes in the cluster, with the goal of providing greater bandwidth across the cluster as well as higher reliability.
Hadoop MapReduce
• MapReduce is a computational paradigm that takes an application and divides it into multiple fragments of work, each of which can be executed on any node in the cluster.
http://mohamednabeel.blogspot.de/2011/03/starting-sub-sandwitch-business.html
[Figure: File1.txt is split into Block A, Block B, and Block C; each block is stored three times across Data Nodes 1 to 4]
[Figure: MapReduce example with shapes. MAP: give every shape the value of 1. SORT: sort the shapes. REDUCE: for each shape type, count the values.]
Hadoop Core
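The map, sort, and reduce steps from the shape-counting example can be sketched in plain Python. This is a single-process illustration of the idea, not Hadoop's Java API; the function names and the sample list of shapes are assumptions made here, and a real job would run the map and reduce steps in parallel on the nodes that hold the data.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(shapes):
    """MAP: give every shape the value of 1."""
    return [(shape, 1) for shape in shapes]


def sort_phase(pairs):
    """SORT: bring identical shapes next to each other."""
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs):
    """REDUCE: for each shape type, add up the values."""
    return {
        shape: sum(value for _, value in group)
        for shape, group in groupby(sorted_pairs, key=itemgetter(0))
    }


def count_shapes(shapes):
    """Run the three phases in sequence."""
    return reduce_phase(sort_phase(map_phase(shapes)))


if __name__ == "__main__":
    # Hypothetical input, invented for this illustration.
    sample = ["circle", "square", "triangle", "circle",
              "square", "circle", "triangle"]
    print(count_shapes(sample))
```

The sort step matters because the reduce step only groups neighbouring items; sorting guarantees that all values for one shape type arrive together.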

Data Warehouse Appliances
▪ Expensive dedicated HW
▪ Built for performance
▪ Designed for high volumes (e.g., tens of TB)
▪ High availability
▪ Initially developed using relational database systems like Oracle, IBM DB2
▪ Designed for modeled and structured data
▪ Business-as-usual ways to design, build, and deliver
▪ Examples: Teradata, Exadata, Netezza, HANA, ...
Hadoop Infrastructure
▪ Uses commodity PCs
▪ Built for extreme scalability
▪ Designed for extreme volumes (tens of PB and more)
▪ Very high availability
▪ Initially developed for web ranking
▪ Hadoop = Data is distributed over many machines
▪ MapReduce = Computing is distributed and executed where the data is (grid solution)
Data Warehouse Appliances vs. Hadoop
“Classical” Data Warehouse (DWH) appliances differ from a Hadoop infrastructure in both their technical basis and their use. This does not make DWH appliances irrelevant; rather, a combination of both is the basis for being future-ready.

• Data import/export (Flume, Sqoop)
• Libraries, algorithms (Mahout, LZO compression)
• Tools: monitoring, user experience (Hue, Ambari, White Elephant)
• Data stores (HBase, HCatalog)
• Workflow management, job scheduling (Oozie, Cascading)
• Data querying (Hive, Pig, Impala, Drill)
• Cluster provisioning & management (Whirr)
• … many more
The Hadoop ecosystem uses several tools to solve individual tasks. For example, Sqoop and Flume are used to import data into and export data from Hadoop, while Hive serves as a data querying tool.
Most of these tools are bundled into distributions such as Cloudera, Pivotal, or Hortonworks to reduce the management overhead for customers. Again, a combination of DWH appliances and Hadoop is the basis for being future-ready.
Hadoop Provides Rich Ecosystems For Tasks

Data Exploration: Which data describes my business?
Reporting, Dashboarding: How did our business run?
Ad-hoc Analysis: Why did our business run this way?
Predictive Analytics: What opportunities and risks do we see in business?
Customers Get In Touch w/ Big Data
Customers get in touch with Big Data through: visualization

Visualization
The three types of visualization are as follows:
Reporting, Dashboarding
• Only historical data
• Answers the questions:
  - How did our business run over the last X periods?
  - How well did it run?
• Dashboards focus on management visualization
Ad-hoc Analysis
• Answers the questions:
  - Why did our business run this way?
  - What were the key points?
  - Can we find obvious “gaps” in our business?
• No or fewer predefined reports
• Visualization of data and correlation is important
Predictive Analytics
• Answers the questions:
  - What opportunities and risks do we see in business?
  - How can we classify our customers?
  - How will sales be in the next two weeks?
• Based on predictive algorithms

Leverage The Power Of Hadoop w/ HANA®
SAP promotes (1) Hadoop as the solution to improve business performance in real time and to leverage the power of Big Data.
HANA® (High-Performance Analytic Appliance) is an SAP product that allows rapid analysis of large amounts of data in real time.
Using Hadoop with HANA® allows users to take advantage of powerful in-memory analysis, gain insights from previously unexplored data (machine sensors, geo-information, social media, etc.), and mine the new black gold (2).
1) http://www.sap.com/solution/big-data/software/platform.html
2) http://www.wired.com/2013/02/is-big-data-the-new-black-gold/

[Figure: three layers. Sources: existing sources (ERP, CRM, logs) and emerging sources (sensors, geo, unstructured). Data System: HANA. Applications: non-SAP, enterprise applications, mobile.]
SAP HANA® & Hadoop Integration
Hadoop can be integrated into an SAP HANA® system to extend the power of in-memory computing and the flexibility of SAP HANA® with easy-to-use, cost-efficient storage.

4 Main Uses For Hadoop With SAP HANA®
[Figure: the architecture diagram of sources, data system, and applications, with markers 1 to 4 for the four uses]
1. Data Analytics: Mining data held in Hadoop for business intelligence & analytics.
2. Flexible Data Store: Using Hadoop as a flexible store of data captured from multiple sources, including SAP and non-SAP software, enterprise software & externally sourced data.
3. Simple Database: Using Hadoop as a simple database for storing & retrieving data in very large data sets.
4. Processing Engine: Using the computation engine in Hadoop to execute business logic or other business processes.

Telecommunications: Data traffic, retail patterns, geo-location data...
Utilities: Smart meter, consumer behavior, network loads.
Cities: People movement, emissions, produce flows, demographics.
Transportation: Product flow, route optimization, hazard location.
Business Use Cases Across All Sectors

Have Additional Questions?
Want To Set Up A Consultation?
Email: info@blackvard.com
Require A Consultation?

• Technical project lead and ABAP architect responsible for quality in technical scope and budget in a global roll-out of SAP Logistics applications (SAP LE / LO)
• Conducting multiple SAP ABAP and SAP HANA® trainings for various US companies
• Implementation of a standard SAP software solution for Spend Management within SAP AG & ARIBA (annual spend volume EUR 3 billion) which can be used in all SAP systems
• Improved claims management using SAP FS-CM, which is generating annual savings of EUR 15 million for a large German public healthcare organization
• Implemented a global solution for procurement processes at BMW AG using SAP SRM / B2B
• Blueprinting and implementation of SAP software for banking credit cancellations for VOLKSWAGEN
Key Achievements of Blackvard Management Consulting in Previous Projects
What We’ve Accomplished

Blackvard Management Consultants
Short Bio:
Lukas M. Dietzsch is Managing Director at Blackvard Management Consulting, LLC. He holds a Master’s degree in Information Technology and is an experienced IT solution architect and project lead.
His strong background in adapting to requirements and standards in different industries and on various platforms is a valuable asset for Blackvard customers.
He is repeatedly commended by customers for driving efficient solutions for complex problems in globally distributed team environments and meeting tough deadlines.
For further information please visit:
www.blackvard.com
Lukas M. Dietzsch
Managing Director
lukas@blackvard.com

An overview of current and previous customers:
Customers That Recommend Blackvard
