Zafar Ali
BIG DATA
Option Analysis
22/12/2016
IDB Solutions LTD
BACKGROUND
“The idea of data creating business value is not new, however, the effective use
of data is becoming the basis of competition”
Enterprises have always helped clients derive insights from information in order to
make better, smarter, real-time, fact-based decisions. It is this demand for
depth of knowledge that has fueled the growth of big data tools and platforms.
What is BIG DATA?
• With the advent of smart devices, social media and new technologies, the
amount of data these devices and technologies produce is astronomical.
• Big data combines conventional, structured data (EDW, RDBMS) with
other, unstructured sources such as sensors, social media (Twitter,
Facebook, LinkedIn) and logs, to reveal patterns and trends and to feed KPIs and dashboards.
BIG DATA FOUR V’S
• Big data comprises conventional and
unconventional sources and is typically
characterised by the four Vs
• Volume: the amount of data being
created is vast compared with traditional
data sources such as RDBMS/EDW
• Variety: data comes from different
sources and is created by
machines, sensors, logs, humans etc.
• Velocity: data is generated
extremely fast; it is typically processed
in real time but can also be ingested in batches
• Veracity: big data is sourced from
many different places, so you
need to test the veracity/quality of the
data
BIG DATA VENDOR
Big data technologies differ from traditional data sources and require different toolsets and
technologies to manage and process structured, semi-structured and unstructured data.
Below are a few players in the big data world.
TYPICAL BIG DATA PROCESSING
To harness the power of big data, enterprises require an infrastructure that can manage and
process huge volumes of structured and unstructured data, both in real time and in batch,
while keeping data protection, privacy and security at its heart. Typical big data processing looks like
the flow below.
NEXT GENERATION ARCHITECTURE
Enterprises' next-generation releases will have traditional EDW/RDBMS and big data solutions
working hand in hand, as neither alone can meet all demands and needs.
Traditional EDW
- Store business critical data
- Integrate existing data
sources
- Integration with existing
reporting/MI solutions
Big Data
• Leverage new data sources
e.g. P6 projects docs, social
media discussion about
projects
• Parallel processing to process
unstructured data e.g. Asset’s
sensor data, geolocation etc
NEXT GENERATION ARCHITECTURE INTEGRATION
Hadoop is an open-source framework based on the MapReduce algorithm, in which data is processed in parallel
on different CPU nodes. Hadoop offers excellent integration with existing AH applications (AIM, PIM), ETL (Talend)
and reporting tools (TIBCO Spotfire, TIBCO Jaspersoft).
Existing Infrastructure
1- Reporting: existing MI/Reporting and EDW
tools are easy to integrate with Big Data
2- ETL/ELT: Apache, HDP 2.0 and Cloudera offer
integration with Talend and existing PL/SQL,
UNIX cron jobs etc.
3- Applications: P6, ERP and SAP APIs can be
easily integrated with Hadoop's infrastructure
Reference:
http://hortonworks.com/wp-content/uploads/2013/10/Build-A-Modern-Data-Architecture.pdf
NEXT GENERATION ARCHITECTURE - HADOOP
Hadoop is open-source software that runs applications using the MapReduce algorithm,
in which data is processed in parallel on different CPU nodes. In short, the Hadoop
framework lets you develop applications that run on
clusters of computers and perform complete statistical analysis on
huge amounts of data.
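To make the MapReduce idea concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the sample lines are illustrative assumptions, and the phases run one after another here, whereas Hadoop would spread the map and reduce tasks across cluster nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key into one result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Illustrative input only; not data from the deck.
    print(word_count(["asset sensor data", "sensor logs", "asset data data"]))
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both phases can be distributed, which is the property Hadoop exploits.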
The Hadoop framework includes the following four modules:
Hadoop Common: the Java libraries and utilities required by the other
Hadoop modules. These libraries provide filesystem and OS-level abstractions
and contain the Java files and scripts required to start Hadoop.
Hadoop YARN: a framework for job scheduling and cluster resource
management.
Hadoop Distributed File System (HDFS™): a distributed file system that
provides high-throughput access to application data, acting as a low-cost, flexible data
reservoir. Hive, on the other hand, is used for SQL access to structured and
semi-structured data.
Hadoop MapReduce: a YARN-based system for parallel processing of
large data sets.
Key Hadoop distributions include Cloudera CDH, Greenplum, MapR and Hortonworks
HDP 1.0+.
NEXT GENERATION ARCHITECTURE – HADOOP EVOLUTION
• Hadoop was originally modelled on
Google's MapReduce, BigTable
and Google File System (GFS) designs
• Over time the Hadoop
ecosystem has evolved to add
functionality such as
Hive (query), Pig (scripting),
workflow and scheduling
(Oozie), non-relational
DB (HBase), log and data
ingestion (Flume, Sqoop), management
and monitoring (Ambari,
ZooKeeper)
• HCatalog to enhance HDFS,
Hive and Pig
NEXT GENERATION ARCHITECTURE – HDP/CLOUDERA/OTHER VENDORS
HDP 2.0+:
Hortonworks Data Platform (HDP 2.0) integrates Apache
Hadoop into a modern data architecture. This enables
enterprises to capture, store and process vast quantities of data
in a cost-efficient and scalable manner. HDP 2.0 offers excellent
gateways and APIs for integrating with existing applications and the EDW.
Cloudera/CDH:
Cloudera offers another open-source big data platform distribution
based on Apache Hadoop. CDH offers all key components out
of the box. CDH also offers Hue, which gives developers a web-based
utility to execute jobs and check their progress.
Other big data vendors are listed at the following link:
http://www.bigdatavendors.com/top.php
Basic HDP 2.0 Architecture
Cloudera Basic Architecture
NEXT GENERATION ARCHITECTURE – KAFKA
Kafka is a streaming platform with three key capabilities:
• It lets you publish and subscribe to streams of records. In this
respect it is similar to a message queue or enterprise messaging
system.
• It lets you store streams of records in a fault-tolerant way.
• It lets you process streams of records as they occur.
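The publish, store and process capabilities listed above can be pictured with a toy in-memory model. This is a sketch of the concept only, not the Kafka client API: `TopicLog` and `Subscriber` are hypothetical names, the asset records are made-up values, and nothing here is fault tolerant or distributed.

```python
class TopicLog:
    """Toy stand-in for one topic: an append-only list of records."""

    def __init__(self):
        self._records = []

    def publish(self, record):
        """Append a record and return its offset (its position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        """Return every stored record at or after the given offset."""
        return self._records[offset:]


class Subscriber:
    """Keeps its own offset, so subscribers read the same log independently."""

    def __init__(self, log):
        self._log = log
        self._offset = 0

    def poll(self):
        """Return the records published since this subscriber last polled."""
        records = self._log.read_from(self._offset)
        self._offset += len(records)
        return records


if __name__ == "__main__":
    log = TopicLog()
    tracker = Subscriber(log)
    # Hypothetical geolocation record for an asset.
    log.publish({"asset": "A1", "lat": 51.5, "lon": -0.1})
    for record in tracker.poll():
        print("processing", record)
```

Records stay in the log after they are read, so a subscriber that joins later still sees the full history; that retained log is what separates this model from a simple message queue.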
What is the use in Construction/P6?
Various types of hardware could use Kafka to process real-time
data.
• Live stream of asset geo location
• Application tracking
• Applications error log real-time processing
• Building real-time streaming applications that transform or react to
the streams of data
More information on Kafka is available at the following links:
https://kafka.apache.org/intro.html
http://hortonworks.com/apache/kafka/#section_1
NEXT GENERATION ARCHITECTURE – R/PYTHON/SAS
R, SAS and Python are programming languages and software environments used for
statistical computing and graphics; R is supported by the R Foundation for
Statistical Computing.
The R language is widely used among statisticians and data miners for
developing statistical software and data analysis. R is typically used at the raw
source data, EDW or query store – refer to
Any product currently feeding data into an app for data science and statistical
analysis (linear and non-linear modelling, classical statistical tests, time-series
analysis etc.) can be easily integrated with HDP or Cloudera. HDP 2.0+ and
Cloudera both offer their own version of R for statistical analysis,
although the same kind of analysis is available in the Hadoop core system in the form of
MapReduce (MPP). Other options that could be explored under this hood are Pig,
Spark, Python etc.
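As a small illustration of the linear modelling mentioned above, the sketch below fits a straight line by ordinary least squares using only the Python standard library. The function name `fit_line` and the sample readings are assumptions made for this example; real work would more likely use R or a Python statistics package.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept.

    Assumes xs and ys have equal length and xs are not all identical.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


if __name__ == "__main__":
    # Illustrative values only; not measurements from any project.
    hours = [1, 2, 3, 4]
    readings = [3, 5, 7, 9]
    print(fit_line(hours, readings))
```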
NEXT GENERATION ARCHITECTURE – FLUME
Apache Flume is the standard way to transport log files from source
through to target
• Initial use case was web server log files, but it can transport any file from
A to B
• Does not do “data transformation”, but can send to multiple targets /
target types
• Mechanisms and checks to ensure successful transport of
entries
• Has a concept of “agents”, “sinks” and “channels”
• Agents collect and forward log data
• Sinks store it in the final destination
• Channels store log data en route
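The roles of agents, channels and sinks described above can be sketched as three small Python classes. This models the concept only and is not Flume configuration or its API: the class names and log lines are hypothetical, and a single channel feeding every sink is a simplification chosen for brevity.

```python
from collections import deque


class Channel:
    """Buffers events en route between collection and delivery."""

    def __init__(self):
        self._queue = deque()

    def put(self, event):
        self._queue.append(event)

    def take(self):
        """Return the oldest buffered event, or None when the channel is empty."""
        return self._queue.popleft() if self._queue else None


class ListSink:
    """Stores events in their final destination (here, a plain list)."""

    def __init__(self):
        self.stored = []

    def write(self, event):
        self.stored.append(event)


class Agent:
    """Collects log lines into a channel and forwards them to every sink."""

    def __init__(self, channel, sinks):
        self.channel = channel
        self.sinks = sinks

    def collect(self, lines):
        for line in lines:
            self.channel.put(line)  # events pass through untransformed

    def forward(self):
        """Drain the channel into all sinks; return the number of events sent."""
        delivered = 0
        while (event := self.channel.take()) is not None:
            for sink in self.sinks:
                sink.write(event)
            delivered += 1
        return delivered
```

An event leaves the channel only when it is handed to the sinks, which is the buffering role the channel plays in the transport checks mentioned above.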
More information on Flume is available at the following links:
https://flume.apache.org
http://hortonworks.com/apache/flume/
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.3/bk_installing_manually_book/content/understanding_flume.html
http://www.cloudera.com/products/apache-hadoop/apache-flume.html
Kafka and Flume in action
NEXT GENERATION ARCHITECTURE - SOURCE
Data Sources for Big Data can be categorized into three main forms:
• Structured data : Relational data.
• Semi Structured data : XML data.
• Unstructured data : Word, PDF, Text, Media Logs.
Unstructured Data:
Such data normally lands in HDFS (Hive)
• Sensor data collection from HW
• Geolocation data from HW
• Server logs
• Documents related to projects, e.g. TP500, Gates file, RIIO Code classification, EES etc.
• Social media discussion about projects, e.g. LPT (London Power Tunnels) has a high presence on Twitter, BBC,
Facebook, YouTube etc.
• Physical location of assets, e.g. switchgear, cables etc.
• Survey data about projects
Structured/Semi-Structured Data:
Such data is normally loaded into the traditional EDW, either through existing ETL or using big data tools, e.g. CSV, API, P6, ERP, SAP
etc.
NEXT GENERATION ARCHITECTURE - ETL
Talend/ODI/Informatica provide an excellent framework for running Hadoop ETL jobs with major Hadoop
distributions and existing infrastructure
• ETL/ELT pushes data/transformations down to Hadoop, Cloudera, Hortonworks
• Hive, Sqoop and Flume provide native drivers to push data into Hadoop/HDFS or HBase
• Data loading is typically in “raw form”
• Files, events
• Semi-structured data such as JSON, XML
• High volume and high velocity are the reasons for using big data instead of an RDBMS
• Data Quality / error handling
• Metadata driven
• Loading types of data in Big data could be:
• Real Time processing
• Batch Processing
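The points above about raw-form loading, metadata-driven processing and data quality handling can be sketched in a few lines of Python. This is an assumption-laden illustration rather than how Talend, ODI or Informatica work internally: the `REQUIRED_FIELDS` metadata, the field names and the sample records are all hypothetical.

```python
import json

# Hypothetical metadata: field name -> type each raw value is converted to.
REQUIRED_FIELDS = {"asset_id": str, "reading": float}


def load_raw(lines, schema=REQUIRED_FIELDS):
    """Parse raw JSON lines, returning (accepted records, rejected lines).

    A line is rejected when it is not valid JSON, lacks a required field,
    or holds a value that cannot be converted to the type in the schema.
    """
    accepted, rejected = [], []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
            for field, kind in schema.items():
                record[field] = kind(record[field])
        except (ValueError, KeyError, TypeError) as exc:
            # Keep the line number and reason so bad rows can be reviewed.
            rejected.append((number, line, repr(exc)))
            continue
        accepted.append(record)
    return accepted, rejected
```

Driving the checks from a schema dictionary means a new feed needs new metadata rather than new code, which is the appeal of the metadata-driven approach.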
NEXT GENERATION ARCHITECTURE - SPARK
Spark powers a stack of libraries including SQL and DataFrames, MLlib
for machine learning, GraphX, and Spark Streaming. You can combine
these libraries seamlessly in the same application.
Spark runs on Hadoop, Mesos, standalone, or in the cloud: you can run it
using its standalone cluster mode, on EC2, on Hadoop YARN, or on
Apache Mesos. It can access diverse data sources including HDFS,
Cassandra, HBase, Hive, Tachyon, S3 and any Hadoop data source.
Spark and Hadoop are both frameworks for big data, but there are
marked differences between them. Refer to the links below to understand
what each framework provides.
Reference
http://spark.apache.org
http://www.infoworld.com/article/3014440/big-data/five-things-you-need-to-know-about-hadoop-v-apache-spark.html
NEXT GENERATION ARCHITECTURE – NO SQL
NoSQL refers to non-relational, or at least non-SQL,
database solutions such as HBase (also a part of
the Hadoop ecosystem), Cassandra, MongoDB, Riak and
CouchDB.
There are, after all, in excess of 100 NoSQL
databases, as the DB-Engines database popularity
ranking shows.
The three most popular NoSQL databases for
Hadoop are Cassandra, MongoDB and HBase.
“NoSQL” databases are gaining popularity. AH could incorporate
BI/analytics/reporting using NoSQL, which means the
end user/client won't have to write SQL to get the
desired dataset. An in-depth CTO review is required before
making a final decision on “NoSQL”, though it offers
some stark advantages over RDBMS/analytics big
data. My personal suggestion would be the coexistence of
both “NoSQL” and “RDBMS” in the big data landscape.
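To show what retrieving a dataset without writing SQL can look like, here is a toy document store in plain Python. It is a conceptual sketch only and does not reproduce the API of HBase, Cassandra or MongoDB: `DocumentStore`, its methods and the sample documents are hypothetical.

```python
class DocumentStore:
    """Toy schema-less store: each key maps to a free-form dictionary."""

    def __init__(self):
        self._docs = {}

    def put(self, key, document):
        """Store a copy of the document under the given key."""
        self._docs[key] = dict(document)

    def get(self, key):
        """Look a document up directly by its key."""
        return self._docs.get(key)

    def find(self, **criteria):
        """Return documents whose fields match every given criterion."""
        return [
            doc
            for doc in self._docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


if __name__ == "__main__":
    store = DocumentStore()
    # Documents need not share the same fields (illustrative values only).
    store.put("p1", {"project": "LPT", "status": "open"})
    store.put("p2", {"project": "P6 upgrade", "status": "open", "owner": "AH"})
    print(store.find(status="open"))
```

The query is expressed as field values rather than a SQL statement, and documents with different fields sit side by side, which is the flexibility that motivates keeping NoSQL alongside an RDBMS.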
Big Data Distributor Option Analysis - Summary Assessment
Assessment criteria: Cost (indicative estimate), Deployment, Strategic Fit, Windows Compatibility, Ease of use, Licenses, Overall
Cloudera
• Cost (indicative estimate): no clear cost available online
• Deployment: Cloudera offers cloud, on-premise and sandbox (VM) options
• Strategic Fit: Cloudera doesn't support the needs of the EDW in the longer run and sees Hadoop as the enterprise data hub; this contradicts the AH requirement to integrate existing infrastructure
• Windows Compatibility: Cloudera can be deployed on Windows OS
• Ease of use: Cloudera has proprietary management software (Cloudera Manager), the SQL query handling interface Impala, and Cloudera Search for easy, real-time access to products
• Licenses: Cloudera has a commercial license. Cloudera also allows the use of its open-source projects free of cost, but that package does not include the management suite Cloudera Manager or any other proprietary software
Hortonworks
• Cost (indicative estimate): no clear cost available online
• Deployment: HDP only offers cloud-based services
• Strategic Fit: Hortonworks sees the EDW as an integral part of the Hadoop ecosystem and has strong ties with Teradata
• Windows Compatibility: HDP is available as a native component on Windows Server
• Ease of use: Hortonworks has no proprietary software; it uses Ambari for management, Stinger for handling queries, and Apache Solr for searching data
• Licenses: Hortonworks is open source, but the chance of installation errors through the command prompt is very high compared with Cloudera

More Related Content

What's hot

Hadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - JaspersoftHadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - Jaspersoft
Hortonworks
 
Hybrid Cloud Strategy for Big Data and Analytics
Hybrid Cloud Strategy for Big Data and Analytics Hybrid Cloud Strategy for Big Data and Analytics
Hybrid Cloud Strategy for Big Data and Analytics
DataWorks Summit/Hadoop Summit
 
The Modern Data Architecture for Advanced Business Intelligence with Hortonwo...
The Modern Data Architecture for Advanced Business Intelligence with Hortonwo...The Modern Data Architecture for Advanced Business Intelligence with Hortonwo...
The Modern Data Architecture for Advanced Business Intelligence with Hortonwo...
Hortonworks
 
Webinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalWebinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_final
Hortonworks
 
Hadoop and Modern Data Architecture
Hadoop and Modern Data Architecture Hadoop and Modern Data Architecture
Hadoop and Modern Data Architecture
Bilot
 
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Innovative Management Services
 
10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake
VMware Tanzu
 
Eliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside HadoopEliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside Hadoop
Hortonworks
 
Big Data Expo 2015 - Hortonworks Common Hadoop Use Cases
Big Data Expo 2015 - Hortonworks Common Hadoop Use CasesBig Data Expo 2015 - Hortonworks Common Hadoop Use Cases
Big Data Expo 2015 - Hortonworks Common Hadoop Use Cases
BigDataExpo
 
Powering the Future of Data  
Powering the Future of Data	   Powering the Future of Data	   
Powering the Future of Data  
Bilot
 
Apache Hadoop and its role in Big Data architecture - Himanshu Bari
Apache Hadoop and its role in Big Data architecture - Himanshu BariApache Hadoop and its role in Big Data architecture - Himanshu Bari
Apache Hadoop and its role in Big Data architecture - Himanshu Bari
jaxconf
 
Oncrawl elasticsearch meetup france #12
Oncrawl elasticsearch meetup france #12Oncrawl elasticsearch meetup france #12
Oncrawl elasticsearch meetup france #12
Tanguy MOAL
 
Ambari Meetup: 2nd April 2013: Teradata Viewpoint Hadoop Integration with Ambari
Ambari Meetup: 2nd April 2013: Teradata Viewpoint Hadoop Integration with AmbariAmbari Meetup: 2nd April 2013: Teradata Viewpoint Hadoop Integration with Ambari
Ambari Meetup: 2nd April 2013: Teradata Viewpoint Hadoop Integration with Ambari
Hortonworks
 
Getting to What Matters: Accelerating Your Path Through the Big Data Lifecycl...
Getting to What Matters: Accelerating Your Path Through the Big Data Lifecycl...Getting to What Matters: Accelerating Your Path Through the Big Data Lifecycl...
Getting to What Matters: Accelerating Your Path Through the Big Data Lifecycl...
Hortonworks
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache Hadoop
Brock Noland
 
Filling the Data Lake
Filling the Data LakeFilling the Data Lake
Filling the Data Lake
DataWorks Summit/Hadoop Summit
 
Partners 2013 LinkedIn Use Cases for Teradata Connectors for Hadoop
Partners 2013 LinkedIn Use Cases for Teradata Connectors for HadoopPartners 2013 LinkedIn Use Cases for Teradata Connectors for Hadoop
Partners 2013 LinkedIn Use Cases for Teradata Connectors for Hadoop
Eric Sun
 
The Keys to Digital Transformation
The Keys to Digital TransformationThe Keys to Digital Transformation
The Keys to Digital Transformation
MapR Technologies
 
BDM39: HP Vertica BI: Sub-second big data analytics your users and developers...
BDM39: HP Vertica BI: Sub-second big data analytics your users and developers...BDM39: HP Vertica BI: Sub-second big data analytics your users and developers...
BDM39: HP Vertica BI: Sub-second big data analytics your users and developers...
Big Data Montreal
 
C* Summit EU 2013: Leveraging the Power of Cassandra: Operational Reporting a...
C* Summit EU 2013: Leveraging the Power of Cassandra: Operational Reporting a...C* Summit EU 2013: Leveraging the Power of Cassandra: Operational Reporting a...
C* Summit EU 2013: Leveraging the Power of Cassandra: Operational Reporting a...
DataStax Academy
 

What's hot (20)

Hadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - JaspersoftHadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - Jaspersoft
 
Hybrid Cloud Strategy for Big Data and Analytics
Hybrid Cloud Strategy for Big Data and Analytics Hybrid Cloud Strategy for Big Data and Analytics
Hybrid Cloud Strategy for Big Data and Analytics
 
The Modern Data Architecture for Advanced Business Intelligence with Hortonwo...
The Modern Data Architecture for Advanced Business Intelligence with Hortonwo...The Modern Data Architecture for Advanced Business Intelligence with Hortonwo...
The Modern Data Architecture for Advanced Business Intelligence with Hortonwo...
 
Webinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalWebinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_final
 
Hadoop and Modern Data Architecture
Hadoop and Modern Data Architecture Hadoop and Modern Data Architecture
Hadoop and Modern Data Architecture
 
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
 
10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake
 
Eliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside HadoopEliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside Hadoop
 
Big Data Expo 2015 - Hortonworks Common Hadoop Use Cases
Big Data Expo 2015 - Hortonworks Common Hadoop Use CasesBig Data Expo 2015 - Hortonworks Common Hadoop Use Cases
Big Data Expo 2015 - Hortonworks Common Hadoop Use Cases
 
Powering the Future of Data  
Powering the Future of Data	   Powering the Future of Data	   
Powering the Future of Data  
 
Apache Hadoop and its role in Big Data architecture - Himanshu Bari
Apache Hadoop and its role in Big Data architecture - Himanshu BariApache Hadoop and its role in Big Data architecture - Himanshu Bari
Apache Hadoop and its role in Big Data architecture - Himanshu Bari
 
Oncrawl elasticsearch meetup france #12
Oncrawl elasticsearch meetup france #12Oncrawl elasticsearch meetup france #12
Oncrawl elasticsearch meetup france #12
 
Ambari Meetup: 2nd April 2013: Teradata Viewpoint Hadoop Integration with Ambari
Ambari Meetup: 2nd April 2013: Teradata Viewpoint Hadoop Integration with AmbariAmbari Meetup: 2nd April 2013: Teradata Viewpoint Hadoop Integration with Ambari
Ambari Meetup: 2nd April 2013: Teradata Viewpoint Hadoop Integration with Ambari
 
Getting to What Matters: Accelerating Your Path Through the Big Data Lifecycl...
Getting to What Matters: Accelerating Your Path Through the Big Data Lifecycl...Getting to What Matters: Accelerating Your Path Through the Big Data Lifecycl...
Getting to What Matters: Accelerating Your Path Through the Big Data Lifecycl...
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache Hadoop
 
Filling the Data Lake
Filling the Data LakeFilling the Data Lake
Filling the Data Lake
 
Partners 2013 LinkedIn Use Cases for Teradata Connectors for Hadoop
Partners 2013 LinkedIn Use Cases for Teradata Connectors for HadoopPartners 2013 LinkedIn Use Cases for Teradata Connectors for Hadoop
Partners 2013 LinkedIn Use Cases for Teradata Connectors for Hadoop
 
The Keys to Digital Transformation
The Keys to Digital TransformationThe Keys to Digital Transformation
The Keys to Digital Transformation
 
BDM39: HP Vertica BI: Sub-second big data analytics your users and developers...
BDM39: HP Vertica BI: Sub-second big data analytics your users and developers...BDM39: HP Vertica BI: Sub-second big data analytics your users and developers...
BDM39: HP Vertica BI: Sub-second big data analytics your users and developers...
 
C* Summit EU 2013: Leveraging the Power of Cassandra: Operational Reporting a...
C* Summit EU 2013: Leveraging the Power of Cassandra: Operational Reporting a...C* Summit EU 2013: Leveraging the Power of Cassandra: Operational Reporting a...
C* Summit EU 2013: Leveraging the Power of Cassandra: Operational Reporting a...
 

Viewers also liked

How could I automate log gathering in the distributed system
How could I automate log gathering in the distributed systemHow could I automate log gathering in the distributed system
How could I automate log gathering in the distributed system
Jun Hong Kim
 
Maintainable cloud architecture_of_hadoop
Maintainable cloud architecture_of_hadoopMaintainable cloud architecture_of_hadoop
Maintainable cloud architecture_of_hadoop
Kai Sasaki
 
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
Edureka!
 
Apache hadoop and cdh(cloudera distribution) introduction 基本介紹
Apache hadoop and cdh(cloudera distribution) introduction 基本介紹Apache hadoop and cdh(cloudera distribution) introduction 基本介紹
Apache hadoop and cdh(cloudera distribution) introduction 基本介紹
Anna Yen
 
Arte
ArteArte
OOPS-"INHERITANCE"
OOPS-"INHERITANCE"OOPS-"INHERITANCE"
OOPS-"INHERITANCE"
Ankit Surti
 
ADVANCE SQL-"Sub queries"
ADVANCE SQL-"Sub queries"ADVANCE SQL-"Sub queries"
ADVANCE SQL-"Sub queries"
Ankit Surti
 
The growth of the sustainability agenda and the role of facilities management
The growth of the sustainability agenda and the role of facilities managementThe growth of the sustainability agenda and the role of facilities management
The growth of the sustainability agenda and the role of facilities management
FM EXPO
 
Launch of Sealed Air's Intellibot (Robotic Cleaning Technology)
Launch of Sealed Air's Intellibot (Robotic Cleaning Technology)Launch of Sealed Air's Intellibot (Robotic Cleaning Technology)
Launch of Sealed Air's Intellibot (Robotic Cleaning Technology)
FM EXPO
 
Facebook as networking tool
Facebook as networking toolFacebook as networking tool
Facebook as networking tool
Julie Huddle
 
Data entry india bpo - Outsource Data Entry India
Data entry india bpo - Outsource Data Entry IndiaData entry india bpo - Outsource Data Entry India
Data entry india bpo - Outsource Data Entry India
Camila Anderson
 
Parco Isola Carolina, Comune di Lodi - Secondo incontro
Parco Isola Carolina, Comune di Lodi - Secondo incontro Parco Isola Carolina, Comune di Lodi - Secondo incontro
Parco Isola Carolina, Comune di Lodi - Secondo incontro isola_carolina
 
Bogotá
Bogotá Bogotá
Bogotá
LAURARABU
 
CV - Luttrell, Mary Beth
CV - Luttrell, Mary BethCV - Luttrell, Mary Beth
CV - Luttrell, Mary Beth
Mary Beth Luttrell, FNP-C
 
LIST OF PUBLICATIONS
LIST OF PUBLICATIONSLIST OF PUBLICATIONS
LIST OF PUBLICATIONS
Dr. Suresh P M
 
Learning technologies: Developing new standards in the GCC region
Learning technologies: Developing new standards in the GCC regionLearning technologies: Developing new standards in the GCC region
Learning technologies: Developing new standards in the GCC region
FM EXPO
 
Presentazione pubblica 12 Settembre 2015
Presentazione pubblica 12 Settembre 2015Presentazione pubblica 12 Settembre 2015
Presentazione pubblica 12 Settembre 2015
isola_carolina
 

Viewers also liked (19)

How could I automate log gathering in the distributed system
How could I automate log gathering in the distributed systemHow could I automate log gathering in the distributed system
How could I automate log gathering in the distributed system
 
Maintainable cloud architecture_of_hadoop
Maintainable cloud architecture_of_hadoopMaintainable cloud architecture_of_hadoop
Maintainable cloud architecture_of_hadoop
 
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
 
Apache hadoop and cdh(cloudera distribution) introduction 基本介紹
Apache hadoop and cdh(cloudera distribution) introduction 基本介紹Apache hadoop and cdh(cloudera distribution) introduction 基本介紹
Apache hadoop and cdh(cloudera distribution) introduction 基本介紹
 
Arte
ArteArte
Arte
 
OOPS-"INHERITANCE"
OOPS-"INHERITANCE"OOPS-"INHERITANCE"
OOPS-"INHERITANCE"
 
ADVANCE SQL-"Sub queries"
ADVANCE SQL-"Sub queries"ADVANCE SQL-"Sub queries"
ADVANCE SQL-"Sub queries"
 
The growth of the sustainability agenda and the role of facilities management
The growth of the sustainability agenda and the role of facilities managementThe growth of the sustainability agenda and the role of facilities management
The growth of the sustainability agenda and the role of facilities management
 
Launch of Sealed Air's Intellibot (Robotic Cleaning Technology)
Launch of Sealed Air's Intellibot (Robotic Cleaning Technology)Launch of Sealed Air's Intellibot (Robotic Cleaning Technology)
Launch of Sealed Air's Intellibot (Robotic Cleaning Technology)
 
Facebook as networking tool
Facebook as networking toolFacebook as networking tool
Facebook as networking tool
 
belt
beltbelt
belt
 
Data entry india bpo - Outsource Data Entry India
Data entry india bpo - Outsource Data Entry IndiaData entry india bpo - Outsource Data Entry India
Data entry india bpo - Outsource Data Entry India
 
Parco Isola Carolina, Comune di Lodi - Secondo incontro
Parco Isola Carolina, Comune di Lodi - Secondo incontro Parco Isola Carolina, Comune di Lodi - Secondo incontro
Parco Isola Carolina, Comune di Lodi - Secondo incontro
 
Bogotá
Bogotá Bogotá
Bogotá
 
CV - Luttrell, Mary Beth
CV - Luttrell, Mary BethCV - Luttrell, Mary Beth
CV - Luttrell, Mary Beth
 
LIST OF PUBLICATIONS
LIST OF PUBLICATIONSLIST OF PUBLICATIONS
LIST OF PUBLICATIONS
 
Learning technologies: Developing new standards in the GCC region
Learning technologies: Developing new standards in the GCC regionLearning technologies: Developing new standards in the GCC region
Learning technologies: Developing new standards in the GCC region
 
Presentazione pubblica 12 Settembre 2015
Presentazione pubblica 12 Settembre 2015Presentazione pubblica 12 Settembre 2015
Presentazione pubblica 12 Settembre 2015
 
Ad projects
Ad projectsAd projects
Ad projects
 

Similar to Big Data/Hadoop Option Analysis

Hadoop Overview
Hadoop OverviewHadoop Overview
Hadoop Overview
Gregg Barrett
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
Asis Mohanty
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
AshishRathore72
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical Workloads
Cognizant
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paper
Supratim Ray
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
Ajay Ohri
 
Big data and apache hadoop adoption
Big data and apache hadoop adoptionBig data and apache hadoop adoption
Big data and apache hadoop adoption
faizrashid1995
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
Stephen Alex
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
Stephen Alex
 
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
Sascha Dittmann
 
Hadoop
HadoopHadoop
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
Hitendra Kumar
 
IJSRED-V2I3P43
IJSRED-V2I3P43IJSRED-V2I3P43
IJSRED-V2I3P43
IJSRED
 
Big data processing with apache spark
Big data processing with apache sparkBig data processing with apache spark
Big data processing with apache spark
sarith divakar
 
Big Data
Big DataBig Data
Big Data
Kirubaburi R
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
Rajesh Jayarman
 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
Mishika Bharadwaj
 
Big data ppt
Big data pptBig data ppt
Big data ppt
Shweta Sahu
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus Model
Editor IJCATR
 
Big Data , Big Problem?
Big Data , Big Problem?Big Data , Big Problem?
Big Data , Big Problem?
Mohammadhasan Farazmand
 

Big Data/Hadoop Option Analysis

  • 1. BIG DATA Option Analysis
      Zafar Ali, IDB Solutions LTD, 22/12/2016
  • 2. BACKGROUND
      “The idea of data creating business value is not new, however, the effective use of data is becoming the basis of competition”
      Enterprises have always helped clients derive insights from information in order to make better, smarter, real-time, fact-based decisions; it is this demand for depth of knowledge that has fueled the growth of big data tools and platforms.
      What is BIG DATA?
      - Due to the advent of smart devices, social media and new technologies, the amount of data produced by these devices and technologies is astronomical.
      - Big data comprises conventional/structured data (EDW, RDBMS) as well as unstructured data from other sources such as sensors, social media (Twitter, Facebook, LinkedIn) and logs. It is analysed to reveal patterns and trends and to feed KPIs, dashboards etc.
  • 3. BIG DATA FOUR V’S
      Big data comprises conventional and unconventional sources and is typically characterised by the four Vs:
      - Volume: the amount of data being created is vast compared with traditional data sources such as RDBMS/EDW.
      - Variety: data comes from different sources and is created by machines, sensors, logs, humans etc.
      - Velocity: data is generated extremely fast; it is typically processed in real time but can also be ingested in batches.
      - Veracity: big data is sourced from many different places, so the veracity/quality of the data needs to be tested.
  • 4. BIG DATA VENDOR
      Big data technologies differ from traditional data sources and require different toolsets and technologies to manage and process structured, semi-structured and unstructured data. Below are a few players in the big data world.
  • 5. TYPICAL BIG DATA PROCESSING
      To harness the power of big data, enterprises require an infrastructure that can manage and process huge volumes of structured and unstructured data, both in real time and in batch, while keeping data protection, privacy and security at its heart. A typical big data processing flow is shown below.
  • 6. NEXT GENERATION ARCHITECTURE
      Enterprises' next-generation releases will have traditional EDW/RDBMS and big data solutions working hand in hand, as neither alone can fulfil all demands and needs.
      Traditional EDW
      - Store business-critical data
      - Integrate existing data sources
      - Integrate with existing reporting/MI solutions
      Big Data
      - Leverage new data sources, e.g. P6 project documents, social media discussion about projects
      - Use parallel processing to handle unstructured data, e.g. asset sensor data, geolocation etc.
  • 7. NEXT GENERATION ARCHITECTURE INTEGRATION
      Hadoop is an open-source framework based on the MapReduce algorithm, in which data is processed in parallel on different CPU nodes. Hadoop integrates well with existing AH applications (AIM, PIM), ETL (Talend) and reporting tools (TIBCO Spotfire, TIBCO Jaspersoft).
      Existing Infrastructure
      1- Reporting: existing MI/reporting and EDW tools are easy to integrate with big data.
      2- ETL/ELT: Apache Hadoop, HDP 2.0 and Cloudera offer integration with Talend and with existing PL/SQL, UNIX cron jobs etc.
      3- Applications: P6, ERP and SAP APIs can be easily integrated with Hadoop's infrastructure.
      Reference: http://hortonworks.com/wp-content/uploads/2013/10/Build-A-Modern-Data-Architecture.pdf
  • 8. NEXT GENERATION ARCHITECTURE - HADOOP
      Hadoop is open-source software that runs applications using the MapReduce algorithm, in which data is processed in parallel on different CPU nodes. In short, the Hadoop framework makes it possible to develop applications that run on clusters of computers and perform complete statistical analysis on huge amounts of data.
      The Hadoop framework includes the following four modules:
      - Hadoop Common: the Java libraries and utilities required by the other Hadoop modules. These libraries provide filesystem and OS-level abstractions and contain the Java files and scripts required to start Hadoop.
      - Hadoop YARN: a framework for job scheduling and cluster resource management.
      - Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data; a low-cost, flexible data reservoir. Hive, on the other hand, is used for SQL access to structured and semi-structured data.
      - Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
      Key Hadoop distributions are Cloudera CDH, Greenplum, MapR, Hortonworks HDP1.0+ etc.
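To make the MapReduce idea on this slide concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions. On a real cluster the map and reduce calls would run in parallel on different nodes, with the framework handling the shuffle.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence (illustrative, single process)."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `word_count(["sensor log sensor", "log error"])` counts each word across both lines. Because every map call depends only on its own line and every reduce call only on its own key, the work can be split across nodes, which is the property the slide refers to.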
  • 9. NEXT GENERATION ARCHITECTURE – HADOOP EVOLUTION
      - Hadoop was originally modelled on Google's MapReduce and Google File System (GFS) designs; BigTable later inspired HBase.
      - Over time the Hadoop ecosystem has evolved to add functionality such as Hive (query), Pig (scripting), Oozie (workflow and scheduling), HBase (non-relational database), Flume (log ingestion), Sqoop (bulk transfer to and from relational databases), and Ambari and ZooKeeper (management, monitoring and coordination).
      - HCatalog provides a shared table and metadata layer across HDFS, Hive and Pig.
  • 10. NEXT GENERATION ARCHITECTURE – HDP/CLOUDERA/OTHER VENDORS
      HDP 2.0+: Hortonworks Data Platform (HDP 2.0) integrates Apache Hadoop into a modern data architecture. This enables enterprises to capture, store and process vast quantities of data in a cost-efficient and scalable manner. HDP 2.0 offers gateways and APIs to integrate with existing applications and the EDW.
      Cloudera/CDH: Cloudera is another open-source big data platform distribution based on Apache Hadoop. CDH offers all key components out of the box. CDH also offers Hue, which gives developers a web-based utility to execute jobs and check their progress.
      Other big data vendors are listed at the following link: http://www.bigdatavendors.com/top.php
      [Figures: Basic HDP 2.0 Architecture; Cloudera Basic Architecture]
  • 11. NEXT GENERATION ARCHITECTURE – KAFKA
      Kafka is a streaming platform with three key capabilities:
      - It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
      - It lets you store streams of records in a fault-tolerant way.
      - It lets you process streams of records as they occur.
      What is its use in Construction/P6? Various types of hardware could use Kafka for processing real-time data:
      - Live stream of asset geolocation
      - Application tracking
      - Real-time processing of application error logs
      - Building real-time streaming applications that transform or react to the streams of data
      More information on Kafka is available at the following links:
      https://kafka.apache.org/intro.html
      http://hortonworks.com/apache/kafka/#section_1
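The three capabilities listed on this slide (publish/subscribe, durable storage, stream processing) can be illustrated with a toy in-memory model. This is a sketch only: it does not use the Kafka client API, offers no fault tolerance, and the names `TopicLog`, `publish`, `poll` and `latest_positions` are hypothetical. The asset geolocation records follow the slide's construction use case.

```python
from collections import defaultdict


class TopicLog:
    """Toy, in-memory stand-in for one topic: an append-only list of records."""

    def __init__(self):
        self._records = []
        # Next offset to read, tracked separately for each consumer name.
        self._offsets = defaultdict(int)

    def publish(self, record):
        """Append a record and return its offset in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Return unread records for this consumer and advance its offset."""
        start = self._offsets[consumer]
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


def latest_positions(records):
    """Process a stream of (asset_id, lat, lon): keep the last position per asset."""
    latest = {}
    for asset_id, lat, lon in records:
        latest[asset_id] = (lat, lon)
    return latest
```

Because records stay in the log after being read, a second consumer (for example an audit job) can poll the same topic later and still see every record, which is the main difference from a classic queue that deletes messages on delivery.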
  • 12. NEXT GENERATION ARCHITECTURE – R/PYTHON/SAS
      R, SAS and Python are programming languages and software environments used for statistical computing and graphics; R is supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and for data analysis. R is typically used at the raw source data, EDW or query store layer.
      Any product currently feeding data into an app for data science and statistical analysis (linear and non-linear modelling, classical statistical tests, time-series analysis etc.) can be easily integrated with HDP or Cloudera.
      HDP 2.0+ and Cloudera both offer their own R integration for statistical analysis, although comparable analysis can also be built on the Hadoop core system using MapReduce. Other options that could be explored under this heading are Pig, Spark, Python etc.
  • 13. NEXT GENERATION ARCHITECTURE – FLUME
      Apache Flume is the standard way to transport log files from source through to target.
      - The initial use case was web server log files, but it can transport any file from A to B.
      - It does not do “data transformation”, but it can send to multiple targets / target types.
      - It has mechanisms and checks to ensure successful transport of entries.
      - It has a concept of “agents”, “sinks” and “channels”:
        - Agents collect and forward log data.
        - Sinks store it in the final destination.
        - Channels store log data en route.
      More information on Flume is available at the following links:
      https://flume.apache.org
      http://hortonworks.com/apache/flume/
      https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.3/bk_installing_manually_book/content/understanding_flume.html
      http://www.cloudera.com/products/apache-hadoop/apache-flume.html
      [Figure: Kafka and Flume in action]
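The agent, channel and sink roles described on this slide can be sketched as a small in-memory pipeline. This is an illustrative model in plain Python, not Flume configuration or Flume's API; the class names `Channel`, `ListSink` and `Agent` and their methods are assumptions made for the example. It shows events buffered en route and then fanned out unchanged to more than one target.

```python
from collections import deque


class Channel:
    """Buffers events en route between collection and the sinks."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        """Return the oldest buffered event, or None when the channel is empty."""
        return self._events.popleft() if self._events else None


class ListSink:
    """Hypothetical sink that keeps events in a list (stand-in for HDFS etc.)."""

    def __init__(self):
        self.stored = []

    def write(self, event):
        self.stored.append(event)


class Agent:
    """Collects log lines into a channel and forwards them to every sink."""

    def __init__(self, channel, sinks):
        self.channel = channel
        self.sinks = sinks

    def collect(self, lines):
        # No transformation beyond stripping the trailing newline.
        for line in lines:
            self.channel.put(line.rstrip("\n"))

    def forward(self):
        """Drain the channel, writing each event to all sinks; return the count."""
        delivered = 0
        while (event := self.channel.take()) is not None:
            for sink in self.sinks:
                sink.write(event)
            delivered += 1
        return delivered
```

Separating `collect` from `forward` mirrors the point of a channel: collection can continue while a destination is slow or unavailable, and the buffered events are delivered once forwarding resumes.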
  • 14. NEXT GENERATION ARCHITECTURE - SOURCE
      Data sources for big data can be categorized into three main forms:
      - Structured data: relational data.
      - Semi-structured data: XML data.
      - Unstructured data: Word, PDF, text, media, logs.
      Unstructured data: this form of data normally lands in HDFS (Hive).
      - Sensor data collected from hardware
      - Geolocation data from hardware
      - Server logs
      - Documents related to projects, e.g. TP500, Gates file, RIIO Code classification, EES etc.
      - Social media discussion about a project, e.g. LPT (London Power Tunnels) has a high presence on Twitter, BBC, Facebook, YouTube etc.
      - Physical location of assets, e.g. switchgear, cables etc.
      - Survey data about projects
      Structured/semi-structured data: this data is normally loaded into the traditional EDW, either through existing ETL or using big data tools, e.g. CSV, API, P6, ERP, SAP etc.
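The routing rule on this slide (unstructured data lands in HDFS, structured and semi-structured data goes to the EDW) can be expressed as a small lookup. The sketch below is an assumption-laden illustration: the file-extension mapping, the default for unknown types, and the name `route` are all hypothetical, and a real pipeline would inspect content and source system rather than trust a file suffix.

```python
from pathlib import PurePath

# Hypothetical mapping from file suffix to the slide's three categories.
CATEGORY_BY_SUFFIX = {
    ".csv": "structured",
    ".xml": "semi-structured",
    ".json": "semi-structured",
    ".docx": "unstructured",
    ".pdf": "unstructured",
    ".txt": "unstructured",
    ".log": "unstructured",
}

# Landing zone per category, following the slide's description.
LANDING_ZONE = {
    "structured": "EDW",
    "semi-structured": "EDW",
    "unstructured": "HDFS",
}


def route(filename):
    """Return (category, landing zone) for a file name.

    Unknown suffixes are treated as unstructured; that default is an
    assumption of this sketch, not something stated in the slides.
    """
    suffix = PurePath(filename).suffix.lower()
    category = CATEGORY_BY_SUFFIX.get(suffix, "unstructured")
    return category, LANDING_ZONE[category]
```

For example, `route("tunnel_sensor.log")` sends a server log to HDFS, while `route("p6_export.xml")` sends a semi-structured export to the EDW.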
  • 15. NEXT GENERATION ARCHITECTURE - ETL
      Talend/ODI/Informatica provide a framework for running Hadoop ETL jobs with the major Hadoop distributions and existing infrastructure.
      - ETL/ELT pushes data and transformations down to Hadoop, Cloudera, Hortonworks.
      - Hive, Sqoop and Flume provide native drivers to push data into Hadoop/HDFS or HBase.
      - Data loading is typically in “raw form”:
        - Files, events
        - Semi-structured data such as JSON, XML
      - High volume and high velocity are the reasons for using big data instead of an RDBMS.
      - Data quality / error handling
      - Metadata driven
      - Types of data loading in big data:
        - Real-time processing
        - Batch processing
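The slide's points about raw-form loading, data quality / error handling and metadata-driven behaviour can be combined in one small batch loader. This is a sketch in plain Python, not Talend, Sqoop or Hive code; the field names in `REQUIRED_FIELDS` and the function name `load_batch` are hypothetical. It accepts raw JSON lines, keeps valid records, and sets aside rejected lines with a reason instead of failing the whole batch.

```python
import json

# Hypothetical metadata: the fields every record must carry.
REQUIRED_FIELDS = ("asset_id", "timestamp", "value")


def load_batch(raw_lines, required=REQUIRED_FIELDS):
    """Parse raw JSON lines; return (loaded records, rejected (line_no, reason))."""
    loaded, rejected = [], []
    for line_no, line in enumerate(raw_lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            rejected.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            rejected.append((line_no, "not a JSON object"))
            continue
        missing = [field for field in required if field not in record]
        if missing:
            rejected.append((line_no, "missing fields: " + ", ".join(missing)))
            continue
        loaded.append(record)
    return loaded, rejected
```

Passing a different `required` tuple changes the quality rule without touching the code, which is the sense in which the load is metadata driven. The same function could be called once per file for batch processing or once per small group of events for near-real-time loading.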
  • 16. NEXT GENERATION ARCHITECTURE - SPARK
      Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
      Spark runs in its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. It can access diverse data sources including HDFS, Cassandra, HBase, Hive, Tachyon, S3 and any Hadoop data source.
      Spark and Hadoop are both big data frameworks, but they differ markedly; refer to the links below to understand what each framework provides.
      Reference
      http://spark.apache.org
      http://www.infoworld.com/article/3014440/big-data/five-things-you-need-to-know-about-hadoop-v-apache-spark.html
  • 17. NEXT GENERATION ARCHITECTURE – NO SQL
      NoSQL refers to non-relational, or at least non-SQL, database solutions such as HBase (also a part of the Hadoop ecosystem), Cassandra, MongoDB, Riak and CouchDB. There are, after all, in excess of 100 NoSQL databases, as the DB-Engines database popularity ranking shows.
      The three most popular NoSQL databases used with Hadoop are Cassandra, MongoDB and HBase.
      NoSQL databases are gaining popularity. AH could incorporate BI/analytics/reporting using NoSQL, which means the end user/client won't have to write SQL to get the desired dataset.
      An in-depth CTO review is required before making a final decision on NoSQL, though it offers some stark advantages over RDBMS/analytics in big data. My personal suggestion would be the coexistence of both NoSQL and RDBMS in the big data landscape.
  • 18. Big Data Distributor Option Analysis - Summary Assessment
      Criteria: Cost (indicative estimate), Deployment, Strategic Fit, Windows Compatibility, Ease of use, Licenses, Overall
      Cloudera
      - Cost: no clear cost available online.
      - Deployment: Cloudera offers cloud, on-premise and sandbox (VM) options.
      - Strategic fit: Cloudera does not support the needs of the EDW in the longer run and sees Hadoop as the enterprise data hub; this contradicts the AH requirement to integrate existing infrastructure.
      - Windows compatibility: Cloudera can be deployed on Windows OS.
      - Ease of use: Cloudera has proprietary management software (Cloudera Manager), a SQL query handling interface (Impala), and Cloudera Search for easy and real-time access to products.
      - Licenses: Cloudera has a commercial license. Cloudera also allows the use of its open-source projects free of cost, but the package does not include the management suite Cloudera Manager or any other proprietary software.
      Hortonworks
      - Cost: no clear cost available online.
      - Deployment: HDP only offers cloud-based services.
      - Strategic fit: Hortonworks sees the EDW as an integral part of the Hadoop ecosystem and has a strong tie with Teradata.
      - Windows compatibility: HDP is available as a native component on Windows Server.
      - Ease of use: Hortonworks has no proprietary software; it uses Ambari for management, Stinger for handling queries, and Apache Solr for searching data.
      - Licenses / ease of use: Hortonworks is open source, but the chance of installation errors through the command prompt is very high compared with Cloudera.

Editor's Notes

  1. Source https://www.capgemini.com/sites/default/files/technology-blog/files/2012/09/big-data-vendors.jpg
  2. Source https://www.capgemini.com/sites/default/files/technology-blog/files/2012/09/big-data-vendors.jpg
  3. Only the key components are shown above; more information about other Hadoop projects such as Ambari, Avro etc. is available at the links below: http://searchcloudcomputing.techtarget.com/definition/Hadoop https://en.wikipedia.org/wiki/Apache_Hadoop http://hadoop.apache.org
  4. Only the key components are shown above; more information about other Hadoop projects such as Ambari, Avro etc. is available at the links below: http://searchcloudcomputing.techtarget.com/definition/Hadoop https://en.wikipedia.org/wiki/Apache_Hadoop http://hadoop.apache.org
  5. Reference http://hortonworks.com/wp-content/uploads/2013/10/Build-A-Modern-Data-Architecture.pdf http://www.cloudera.com/products/apache-hadoop/key-cdh-components.html
  6. More information on R is available at the following links: http://hortonworks.com/hadoop-tutorial/using-revolution-r-enterprise-tutorial-hortonworks-sandbox/ http://blog.cloudera.com/blog/2013/12/how-to-do-statistical-analysis-with-impala-and-r/ https://www.r-bloggers.com/hadoop-for-rs-data-scientist/ https://www.r-bloggers.com/search/hadoop/page/3/
  7. More information can be found at http://blog.cloudera.com/blog/2014/11/nosql-in-a-hadoop-world-2/ https://www.datastax.com/nosql-databases/nosql-cassandra-and-hadoop http://www.infoworld.com/article/2848722/nosql/mongodb-cassandra-hbase-three-nosql-databases-to-watch.html http://blog.couchbase.com/2016/june/why-spark-and-nosql https://www.datanami.com/2016/06/06/spark-makes-inroads-nosql-ecosystem/ https://www.mongodb.com/scale/nosql-vs-relational-databases
  8. Reference http://www.ymc.ch/de/blog/hadoop-overview-of-top-3-distributions/ https://www.cloudera.com/content/dam/www/static/documents/analyst-reports/forrester-wave-big-data-hadoop-distributions.pdf