The document discusses big data solutions for an enterprise. It analyzes Cloudera and Hortonworks as potential big data distributions. Cloudera can be deployed on Windows but may not support integrating existing data warehouses long-term; Hortonworks better supports integration with existing infrastructure and treats the data warehouse as an integral component. Both have pros and cons around cost, licensing and proprietary software.
2. BACKGROUND
“The idea of data creating business value is not new; however, the effective use of data is becoming the basis of competition.”
Enterprises have always helped clients derive insights from information in order to make better, smarter, real-time, fact-based decisions: it is this demand for depth of knowledge that has fueled the growth of big data tools and platforms.
What is BIG DATA?
Due to the advent of smart devices, social media and new technologies, the amount of data produced by these devices and technologies is astronomical.
Big data comprises conventional/structured data (EDW, RDBMS) as well as unstructured data from other sources such as sensors, social media (Twitter, Facebook, LinkedIn) and logs, analyzed to reveal patterns, trends, KPIs, dashboards etc.
3. BIG DATA FOUR V’S
• Big data comprises conventional and unconventional sources and is typically characterized by the four Vs:
• Volume: the amount of data being created is vast compared to traditional data sources like RDBMS/EDW
• Variety: data comes from different sources and is created by machines, sensors, logs, humans etc
• Velocity: data is generated extremely fast; it is typically processed in real time but can also be ingested in batches
• Veracity: big data is sourced from many different places, so you need to test the veracity/quality of the data
4. BIG DATA VENDORS
Big data technologies differ from traditional data sources and require different toolsets and technologies to manage and process structured, semi-structured and unstructured data.
Below are a few players in the big data world.
5. TYPICAL BIG DATA PROCESSING
To harness the power of big data, enterprises require an infrastructure that can manage and process huge volumes of structured and unstructured data, in real time and in batch, while keeping data protection, privacy and security at the heart. A typical big data processing flow is illustrated below.
6. NEXT GENERATION ARCHITECTURE
Enterprises' next generation releases will run traditional EDW/RDBMS and big data solutions hand in hand, as neither alone can fulfill all demands and needs.
Traditional EDW:
- Store business-critical data
- Integrate existing data sources
- Integrate with existing reporting/MI solutions
Big Data:
• Leverage new data sources, e.g. P6 project docs and social media discussion about projects
• Parallel processing of unstructured data, e.g. asset sensor data, geolocation etc
7. NEXT GENERATION ARCHITECTURE INTEGRATION
Hadoop is an open source framework based on the MapReduce algorithm, in which data is processed in parallel on different CPU nodes. Hadoop offers excellent integration with existing AH applications (AIM, PIM), ETL (Talend) and reporting tools (TIBCO Spotfire, TIBCO Jaspersoft).
Existing Infrastructure
1. Reporting: existing MI/reporting and EDW tools are easy to integrate with big data
2. ETL/ELT: Apache Hadoop, HDP 2.0 and Cloudera offer integration with Talend and with existing PL/SQL, UNIX cron jobs etc
3. Applications: P6, ERP and SAP APIs can be easily integrated with Hadoop's infrastructure
Reference:
http://hortonworks.com/wp-content/uploads/2013/10/Build-A-Modern-Data-Architecture.pdf
8. NEXT GENERATION ARCHITECTURE - HADOOP
Hadoop runs applications using the open source MapReduce algorithm, with data processed in parallel on different CPU nodes. In short, the Hadoop framework can run applications on clusters of computers and perform complete statistical analysis over huge amounts of data.
The Hadoop framework includes the following four modules:
Hadoop Common: Java libraries and utilities required by other Hadoop modules. These libraries provide filesystem and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.
Hadoop YARN: a framework for job scheduling and cluster resource management.
Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data; a low cost, flexible data reservoir. Hive, on the other hand, is used for SQL access to structured and semi-structured data.
Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
Key Hadoop distributions include Cloudera CDH, Greenplum, MapR and Hortonworks HDP 1.0+.
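To make the MapReduce module concrete, below is a minimal sketch of the canonical word-count job, the standard introductory MapReduce example rather than a project-specific implementation; the input and output HDFS paths are supplied as arguments and are illustrative:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Mapper: runs in parallel on each input split, emitting (word, 1)
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: aggregates the counts for each word across all mappers
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir must not exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The same split-then-aggregate pattern shown here is what YARN schedules across the cluster for the larger statistical workloads mentioned above.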
9. NEXT GENERATION ARCHITECTURE – HADOOP EVOLUTION
• Hadoop was originally created from Google's MapReduce, BigTable and Google File System (GFS) designs
• Over time the Hadoop ecosystem has evolved with enhanced functionality: Hive (query), Pig (scripting), workflow and scheduling (Oozie), a non-relational DB (HBase), data ingestion and log processing (Flume, Sqoop), and management and monitoring (Ambari, ZooKeeper)
• HCatalog enhances interoperability across HDFS, Hive and Pig
10. NEXT GENERATION ARCHITECTURE – HDP/CLOUDERA/OTHER VENDORS
HDP 2.0+:
Hortonworks Data Platform (HDP 2.0) integrates Apache Hadoop into a modern data architecture. This enables enterprises to capture, store and process vast quantities of data in a cost efficient and scalable manner. HDP 2.0 offers excellent gateways and APIs to integrate with existing applications and EDW.
Cloudera/CDH:
Cloudera is another open source big data platform distribution based on Apache Hadoop. CDH offers all the key components out of the box. CDH also offers Hue, which gives developers a web based utility to execute jobs and check progress.
Other big data vendors are listed at the following link:
http://www.bigdatavendors.com/top.php
Basic HDP 2.0 Architecture
Cloudera Basic Architecture
11. NEXT GENERATION ARCHITECTURE – KAFKA
Kafka offers a streaming platform with three key capabilities:
• It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
• It lets you store streams of records in a fault-tolerant way.
• It lets you process streams of records as they occur.
What use is this in Construction/P6? Various types of hardware could use Kafka for processing real-time data (a short sketch follows this list):
• Live streams of asset geolocation
• Application tracking
• Real-time processing of application error logs
• Building real-time streaming applications that transform or react to the streams of data
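A small sketch of the publish/subscribe pattern above, assuming the kafka-python client, a broker on localhost:9092 and an illustrative topic name:

    # publish an asset geolocation event (kafka-python client assumed)
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"))
    producer.send("asset-geolocation",
                  {"asset": "switchgear-17", "lat": 51.51, "lon": -0.13})
    producer.flush()

    # subscribe and react to the stream of records as they occur
    from kafka import KafkaConsumer

    consumer = KafkaConsumer("asset-geolocation",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest")
    for record in consumer:          # loops indefinitely as records arrive
        print(record.value)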
More information on Kafka is available at the following links:
https://kafka.apache.org/intro.html
http://hortonworks.com/apache/kafka/#section_1
12. NEXT GENERATION ARCHITECTURE – R/PYTHON/SAS
R, SAS and Python are programming languages and software environments for statistical computing and graphics; R is supported by the R Foundation for Statistical Computing.
The R language is widely used among statisticians and data miners for developing statistical software and data analysis. R is typically used at the raw source data, EDW or query store layer (see the R links at the end of this document).
Any product currently feeding data into an app for data science and statistical analysis (linear and non-linear modelling, classical statistical tests, time-series analysis etc.) can be easily integrated with HDP or Cloudera. HDP 2.0+ and Cloudera both offer their own version of R for statistical analysis, although similar capability is available in the Hadoop core system in the form of MapReduce. Other options that could be explored under this hood are Pig, Spark, Python etc.
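As a small, hypothetical illustration of the kind of statistical analysis mentioned above (linear modelling over made-up project figures; SciPy assumed):

    # fit a simple linear model to illustrative project effort/cost figures
    from scipy import stats

    hours = [120, 340, 560, 760, 980]        # hypothetical effort (hours)
    cost = [14.1, 38.9, 61.2, 85.0, 109.4]   # hypothetical cost (GBP k)

    result = stats.linregress(hours, cost)
    print("slope=%.4f intercept=%.3f r^2=%.3f"
          % (result.slope, result.intercept, result.rvalue ** 2))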
13. NEXT GENERATION ARCHITECTURE – FLUME
Apache Flume is the standard way to transport log files from source through to target.
• The initial use case was webserver log files, but Flume can transport any file from A to B
• It does not do "data transformation", but can send to multiple targets / target types
• It has mechanisms and checks to ensure successful transport of entries, built around the concepts of "agents", "sinks" and "channels" (a sample configuration follows this list):
• Agents collect and forward log data
• Sinks store it in the final destination
• Channels store log data en route
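A minimal sketch of those concepts in an agent configuration file – the agent name, spool directory and HDFS path are illustrative, not taken from an AH environment:

    # illustrative Flume agent: spool a local log directory into HDFS
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = sink1

    agent1.sources.src1.type = spooldir
    agent1.sources.src1.spoolDir = /var/log/app/spool
    agent1.sources.src1.channels = ch1

    agent1.channels.ch1.type = memory

    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.hdfs.path = /data/raw/logs
    agent1.sinks.sink1.channel = ch1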
More information on Flume is available at the following links:
https://flume.apache.org
http://hortonworks.com/apache/flume/
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.3/bk_installing_manually_book/content/understanding_flume.html
http://www.cloudera.com/products/apache-hadoop/apache-flume.html
Kafka and Flume in action
14. NEXT GENERATION ARCHITECTURE - SOURCE
Data sources for Big Data can be categorized into three main forms:
• Structured data: relational data.
• Semi-structured data: XML data.
• Unstructured data: Word, PDF, text, media logs.
Unstructured Data:
This form of data normally lands in HDFS (with Hive for access):
• Sensor data collected from hardware
• Geolocation data from hardware
• Server logs
• Documents related to projects, e.g. TP500, Gates file, RIIO code classification, EES etc.
• Social media discussion about projects, e.g. LPT (London Power Tunnels) has a high presence on Twitter, the BBC, Facebook, YouTube etc.
• Physical location of assets, e.g. switchgear, cables etc.
• Survey data about projects
Structured/Semi-Structured Data:
Such data is normally loaded into the traditional EDW, either through existing ETL or using Big Data tooling, e.g. CSV, API, P6, ERP, SAP etc.
15. NEXT GENERATION ARCHITECTURE - ETL
Talend/ODI/Informatica provide an excellent framework for running Hadoop ETL jobs with the major Hadoop distributions and existing infrastructure:
• ETL/ELT pushes data/transformation down to Hadoop (Cloudera, Hortonworks)
• Hive, Sqoop and Flume provide native drivers to push data into Hadoop/HDFS or HBase
• Data loading is typically in "raw form" (a loading sketch follows this list):
• Files, events
• Semi-structured data like JSON, XML
• High volume and high velocity are the reasons for using Big Data instead of an RDBMS
• Data quality / error handling
• Metadata driven
• Loading of data into Big Data could be:
• Real-time processing
• Batch processing
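As a sketch of landing a raw extract in HDFS – assuming the Python hdfs (WebHDFS) client; the NameNode URL and paths are illustrative:

    # land a raw CSV extract in HDFS via WebHDFS (hdfs Python package assumed)
    from hdfs import InsecureClient

    client = InsecureClient("http://namenode:50070", user="etl")  # illustrative NameNode
    client.makedirs("/data/raw/p6")                               # raw landing zone
    client.upload("/data/raw/p6/projects.csv",                    # HDFS target
                  "/tmp/exports/projects.csv")                    # local extract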
16. NEXT GENERATION ARCHITECTURE - SPARK
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
Spark runs on Hadoop YARN, on Apache Mesos, in standalone cluster mode, on EC2, or in the cloud, and it can access diverse data sources including HDFS, Cassandra, HBase, Hive, Tachyon, S3 and any Hadoop data source.
Spark and Hadoop are both Big Data frameworks, but there are stark differences between them – refer to the links below to understand what each framework provides.
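A minimal PySpark sketch of the DataFrame API mentioned above (Spark 2.x assumed; the HDFS path and column name are illustrative):

    # count sensor events per asset with Spark SQL/DataFrames
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("asset-sensor-counts")
             .getOrCreate())

    events = spark.read.json("hdfs:///data/raw/sensor_events")  # illustrative path
    events.groupBy("asset_id").count().show()

    spark.stop()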
Reference:
http://spark.apache.org
http://www.infoworld.com/article/3014440/big-data/five-things-you-need-to-know-about-hadoop-v-apache-spark.html
17. NEXT GENERATION ARCHITECTURE – NO SQL
NoSQL refers to non-relational, or at least non-SQL, database solutions such as HBase (also part of the Hadoop ecosystem), Cassandra, MongoDB, Riak and CouchDB.
There are, after all, in excess of 100 NoSQL databases, as the DB-Engines database popularity ranking shows.
The three most popular NoSQL databases for Hadoop are Cassandra, MongoDB and HBase.
NoSQL is gaining popularity – AH could deliver BI/analytics/reporting using NoSQL, which means end users/clients won't have to write SQL to get the desired dataset. An in-depth CTO review is required before making a final decision on NoSQL, though it offers some stark advantages over RDBMS-based analytics. My personal suggestion would be the coexistence of both NoSQL and RDBMS in the Big Data landscape.
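A small sketch of SQL-free access against HBase – the happybase Thrift client is assumed, and the host, table and column names are illustrative:

    # read and write an asset row in HBase without writing SQL (happybase assumed)
    import happybase

    connection = happybase.Connection("hbase-host")  # HBase Thrift server, illustrative
    table = connection.table("assets")

    table.put(b"switchgear-17", {b"loc:lat": b"51.51", b"loc:lon": b"-0.13"})
    print(table.row(b"switchgear-17")[b"loc:lat"])
    connection.close()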
18. Big Data Distributor Option Analysis- Summary Assessment
Summary assessment of each option across Cost (indicative estimate), Deployment, Strategic Fit, Windows Compatibility, Ease of Use, Licenses and Overall:

Cloudera:
• Cost: no clear cost available online
• Deployment: Cloudera offers cloud, on-premise and sandbox VM options
• Strategic fit: Cloudera doesn't support the needs of the EDW in the longer run and sees Hadoop as the enterprise data hub – this contradicts the AH requirement to integrate existing infrastructure
• Windows compatibility: Cloudera can be deployed on the Windows OS
• Ease of use: Cloudera has proprietary management software, Cloudera Manager; an SQL query handling interface, Impala; and Cloudera Search for easy, real-time access to products
• Licenses: Cloudera has a commercial license; it also allows the use of its open-source projects free of cost, but that package does not include the management suite Cloudera Manager or any other proprietary software

Hortonworks:
• Cost: no clear cost available online
• Deployment: HDP only offers cloud-based services
• Strategic fit: Hortonworks sees the EDW as an integral part of the Hadoop ecosystem and has strong ties with Teradata
• Windows compatibility: HDP is available as a native component on Windows Server
• Ease of use: Hortonworks is open source, but the chance of installation errors through the command prompt is very high compared with Cloudera
• Licenses: Hortonworks has no proprietary software; it uses Ambari for management, Stinger for handling queries and Apache Solr for searches of data
The above covers only the key components – more information about other Hadoop projects like Ambari, Avro etc. is available at the links below:
http://searchcloudcomputing.techtarget.com/definition/Hadoop
https://en.wikipedia.org/wiki/Apache_Hadoop
http://hadoop.apache.org
More information on R is available at the following links:
http://hortonworks.com/hadoop-tutorial/using-revolution-r-enterprise-tutorial-hortonworks-sandbox/
http://blog.cloudera.com/blog/2013/12/how-to-do-statistical-analysis-with-impala-and-r/
https://www.r-bloggers.com/hadoop-for-rs-data-scientist/
https://www.r-bloggers.com/search/hadoop/page/3/
More information on NoSQL can be found at the following links:
http://blog.cloudera.com/blog/2014/11/nosql-in-a-hadoop-world-2/
https://www.datastax.com/nosql-databases/nosql-cassandra-and-hadoop
http://www.infoworld.com/article/2848722/nosql/mongodb-cassandra-hbase-three-nosql-databases-to-watch.html
http://blog.couchbase.com/2016/june/why-spark-and-nosql
https://www.datanami.com/2016/06/06/spark-makes-inroads-nosql-ecosystem/
https://www.mongodb.com/scale/nosql-vs-relational-databases