Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizing Analytical Workloads

As Hadoop's ecosystem evolves, new open source tools and capabilities are emerging that can help IT leaders build more capable and business-relevant big data analytics environments.
Executive Summary
Asserting that data is vital to business is an understatement. Organizations have generated more and more data for years, but struggle to use it effectively. Clearly, data has more important uses than ensuring compliance with regulatory requirements. In addition, data is being generated with greater velocity, due to the advent of pervasive devices (e.g., smartphones and tablets), social Web sites (e.g., Facebook, Twitter and LinkedIn) and other sources such as GPS, Google Maps and heat/pressure sensors.
Moreover, a plethora of bits and bytes is being created by data warehouses, repositories, backups and archives. IT leaders need to work with the lines of business to take advantage of the wealth of insight that can be mined from these repositories. This search for data patterns is similar to the oil and gas industry's search for precious resources; similar business value can be uncovered if organizations improve how "dirty" data is processed to refine key business decisions.

According to IDC1 and CenturyLink,2 our planet now has 2.75 zettabytes (ZB) of data, a number that is likely to reach 8ZB by 2015. This situation has given rise to the buzzword "big data."
Today, many organizations are applying big data analytics to challenges such as fraud detection, risk/threat analysis, sentiment analysis, equities analysis, weather forecasting, product recommendations and classifications. All of these activities require a platform that can process this data and deliver useful (often predictive) insights. This white paper discusses how to leverage a big data analytics platform, dissecting its capabilities, limitations and vendor choices.
Big Data’s Ground Zero
Apache Hadoop3 is a Java-based open source platform that makes processing of large data possible over thousands of distributed nodes. Hadoop's development resulted from the publication of two Google-authored white papers (i.e., Google File System and Google MapReduce). There are numerous vendor-specific distributions available
based on Apache Hadoop, from companies such as Cloudera (CDH3/4), Hortonworks (1.x/2.x) and MapR (M3/M5). In addition, appliance-based solutions are offered by many vendors, such as IBM, EMC, Oracle, etc.
Figure 1 illustrates Apache Hadoop's two base components.
The Hadoop distributed file system (HDFS) is a base component of the Hadoop framework that manages data storage. It stores data in the form of data blocks (default size: 64MB) on the local hard disks of the data nodes. The input data size determines the block size to be used in the cluster; a block size of 128MB is a good choice for a large file set. Larger block sizes mean fewer blocks need to be tracked, thereby minimizing the memory required on the master node (commonly known as the NameNode) for storing the metadata. Block size also affects job execution time, and thereby cluster performance. A file can be stored in HDFS with a non-default block size specified during the upload process. For files smaller than the block size, the Hadoop cluster may not perform optimally.
Example: A 512MB file requires eight blocks at the default 64MB block size, so the NameNode needs memory to track 25 items (1 file name + 8 blocks x 3 replicated copies). The same file with a 128MB block size produces only four blocks, requiring memory for just 13 items (1 file name + 4 blocks x 3 copies).
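Because the block size is a per-file property, it can be chosen at write time. The sketch below is only an illustration of that point, not part of the original paper: it uses the Hadoop FileSystem API's create() overload that accepts a block size, and the NameNode URI, path and sizes are placeholder values.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // "fs.defaultFS" is the Hadoop 2.x key (1.x used "fs.default.name");
            // the URI below is a placeholder for your own NameNode.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);

            long blockSize = 128L * 1024 * 1024;   // 128MB blocks for a large file
            short replication = 3;                 // default replication factor
            int bufferSize = 4096;

            // create(path, overwrite, bufferSize, replication, blockSize)
            Path path = new Path("/data/large-input.txt");   // illustrative path
            try (FSDataOutputStream out =
                     fs.create(path, true, bufferSize, replication, blockSize)) {
                out.writeBytes("sample record\n");
            }
        }
    }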
HDFS has built-in fault tolerance, thereby eliminating the need to set up a redundant array of inexpensive disks (RAID) on the data nodes. Data blocks stored in HDFS are automatically replicated across multiple data nodes based on the replication factor (default: 3), which determines the number of copies of each data block to be created and distributed across other data nodes. The larger the replication factor, the better the read performance of the cluster; on the other hand, a larger factor requires more time to replicate the data blocks and more storage for the replicated copies.

For example, storing a 1GB file requires around 4GB of storage space, due to the replication factor of 3 plus an additional 20% to 30% of space consumed during data processing.
Both block size and replication factor impact Hadoop cluster performance. The right value for both parameters depends on the input data type and the application workload that will run on the cluster. In a development environment, the replication factor can be set to a smaller number, say 1 or 2, but it is advisable to set this value higher if the data on the cluster is crucial. Keeping the default replication factor of 3 is a good choice from a data redundancy and cluster performance perspective.
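Replication is likewise tunable per file. A minimal sketch, assuming a client whose core-site.xml already points at the cluster and an illustrative path, that lowers the replication factor of a non-critical development file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            // Assumes core-site.xml on the classpath points at the cluster.
            FileSystem fs = FileSystem.get(new Configuration());

            Path path = new Path("/data/dev-dataset.csv");   // illustrative path
            // Lower the replication factor to 2 for a non-critical dev file;
            // production data normally keeps the default of 3.
            fs.setReplication(path, (short) 2);

            FileStatus status = fs.getFileStatus(path);
            System.out.println("Replication now: " + status.getReplication());
        }
    }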
Figure 2 depicts the NameNode and DataNode that constitute the HDFS service.
MapReduce is the second base component of the Hadoop framework; it manages the data processing on the Hadoop cluster. A cluster runs MapReduce jobs written in Java, or in other languages (Python, Ruby, R, etc.) via streaming. A single job may consist of multiple tasks, and the number of tasks that can run on a data node is governed by the amount of memory installed. It is recommended to install more memory on each data node for better cluster performance. Based on the application workload that will run on the cluster, a data node may require high compute power, high disk I/O or additional memory.
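To make the map/reduce split concrete, the following is a minimal word-count job using the org.apache.hadoop.mapreduce API; it is the canonical Hadoop example rather than anything specific to this paper. The map function emits (word, 1) pairs and the reduce function sums them per word.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: split each input line into words and emit (word, 1).
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: sum the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }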
Figure 1. Hadoop base components: MapReduce (distributed programming framework) and HDFS (Hadoop Distributed File System).
The job execution process is handled by the JobTracker and TaskTracker services (see Figure 2). The type of job scheduler chosen for a cluster depends on the application workload. If the cluster will mostly run short-running jobs, the default FIFO scheduler will serve the purpose; for more balanced sharing of the cluster, one can choose the Fair Scheduler. In MRv1.0 there is no high availability (HA) for the JobTracker service: if the node running the JobTracker fails, all jobs submitted to the cluster are discontinued and must be resubmitted once the node is recovered or another node is made available. This issue has been fixed in the newer release, MRv2.0, also known as yet another resource negotiator (YARN).4

Figure 2 also depicts the interaction among these five service components.
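Scheduler selection is a cluster-side setting that normally lives in mapred-site.xml on the JobTracker node. The fragment below only sketches the commonly documented MRv1 property names in Java form, with an illustrative allocation file path; verify the names against the distribution and version in use.

    import org.apache.hadoop.conf.Configuration;

    public class SchedulerConfigSketch {
        public static void main(String[] args) {
            // Normally these properties are set in mapred-site.xml on the
            // JobTracker node; they are shown programmatically only for illustration.
            Configuration conf = new Configuration();

            // MRv1: replace the default FIFO scheduler with the Fair Scheduler.
            conf.set("mapred.jobtracker.taskScheduler",
                     "org.apache.hadoop.mapred.FairScheduler");

            // Optional: pool definitions (per-group shares) in an allocation file.
            conf.set("mapred.fairscheduler.allocation.file",
                     "/etc/hadoop/conf/fair-scheduler.xml");   // illustrative path

            System.out.println(conf.get("mapred.jobtracker.taskScheduler"));
        }
    }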
The Secondary NameNode plays the vital role of checkpointing: it off-loads the work of creating and regularly updating the metadata image. It is recommended to keep this machine's configuration the same as the NameNode's so that, in an emergency, it can be promoted to NameNode.
The Hadoop Ecosystem
Hadoop's base components are surrounded by numerous ecosystem5 projects. Many of them are already top-level Apache (open source) projects; some, however, remain in the incubation stage. Figure 3 highlights a few of the ecosystem components. The most common ones include:
• Pig provides a high-level procedural language known as Pig Latin. Programs written in Pig Latin are automatically converted into map-reduce jobs. Pig is very suitable in situations where developers are not familiar with high-level languages such as Java but have strong scripting knowledge and expertise; they can also create their own customized functions. It is predominantly recommended for processing large amounts of semi-structured data.
• Hive provides a SQL-like language known as HiveQL (Hive query language). Analysts can run SQL-like queries against the data stored in HDFS, and those queries are automatically converted into map-reduce jobs. Hive is recommended where there are large amounts of unstructured data and particular details need to be fetched without writing map-reduce jobs (a minimal client sketch appears after this list). Companies like Revolution Analytics6 can leverage Hadoop clusters running Hive via RHadoop connectors; RHadoop provides connectors for HDFS, MapReduce and HBase.
• HBase is a NoSQL database that offers a scalable, distributed database service. However, it is not optimized for transactional applications or relational analytics; be aware that HBase is not a replacement for a relational database. A cluster with HBase installed requires an additional 24GB to 32GB of memory on each data node, and it is recommended that data nodes running the HBase RegionServer run a smaller number of MapReduce tasks.
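As a small illustration of the Hive bullet above (the host, table and column names are invented for the example), an analyst-style aggregation can be submitted through the HiveServer2 JDBC driver; Hive compiles the HiveQL into map-reduce jobs behind the scenes.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuerySketch {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver (ships with the Hive client libraries).
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Host, port, database, table and columns below are illustrative.
            try (Connection con = DriverManager.getConnection(
                         "jdbc:hive2://hive-server:10000/default", "analyst", "");
                 Statement stmt = con.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT page, COUNT(*) AS hits " +
                         "FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10")) {
                // Hive turns this HiveQL into one or more MapReduce jobs.
                while (rs.next()) {
                    System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }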
Figure 2. Hadoop services schematic: on the MapReduce (processing) side, a JobTracker coordinates multiple TaskTrackers; on the HDFS (storage) side, a NameNode and Secondary NameNode manage multiple DataNodes.
Hadoop clusters are mainly used in R&D-related activities, though many enterprises are now using Hadoop with production workloads. The type of ecosystem components required depends upon the enterprise's needs. Enterprises with large data warehouses normally go for production-ready tools from proven vendors such as IBM, Oracle, EMC, etc. Also, not every enterprise requires a Hadoop cluster; a cluster is mainly required by business units that are developing Hadoop-based solutions or running analytics workloads. Business units that are just entering into analytics may start small by building a 5-10 node cluster.
As shown in Figure 4, a Hadoop cluster can scale from just a few data nodes to thousands of data nodes as the need arises or data grows. Hadoop clusters show linear scalability, both in terms of performance and storage capacity, as the count of nodes increases. Apache Hadoop runs primarily on the Intel x86/x86_64-based architecture.

Clusters must be designed with a particular workload in mind (e.g., data-intensive loads, processor-intensive loads, mixed loads, real-time clusters, etc.). As there are many Hadoop distributions available, implementation selection depends upon feature availability, licensing cost, performance reports, hardware requirements, skills availability within the enterprise and the workload requirements.
Per a comScore report,7 MapR performs better than Cloudera and works out to be less costly. Hortonworks, on the other hand, has a strong relationship with Microsoft. Its distribution is also available for the Microsoft platform (HDP v1.3 for Windows), which makes this distribution popular among the Microsoft community.
Hadoop Appliance
An alternate approach to building a Hadoop cluster is to use a Hadoop appliance. These are customizable, ready-to-use complete stacks from vendors such as IBM (IBM Netezza), Oracle (Big Data Appliance), EMC (Data Computing Appliance), HP (AppSystem), Teradata (Aster), etc. Using an appliance-based approach speeds go-to-market and results in an integrated solution from a single (and accountable) vendor like IBM, Teradata, HP or Oracle.
These appliances come in different sizes and configurations, and every vendor uses a different Hadoop distribution. For example, Oracle Big Data is based on a Cloudera distribution, EMC uses MapR, and HP AppSystem can be configured with Cloudera, MapR or Hortonworks. These appliances are designed with a particular type of problem or workload in mind.

Figure 3. Scanning the Hadoop project ecosystem: MapReduce (distributed programming framework) and HDFS (Hadoop Distributed File System) sit at the core, surrounded by Pig (data flow), Hive (SQL), HBase (NoSQL store), HCatalog (table/schema management), Zookeeper (coordination), Ambari (system management), Avro (serialization) and Sqoop (CLI for data movement), all running on an OS layer over an infrastructure layer of physical machines, virtual machines, storage and network.
• IBM Netezza 1000 is designed for high-performance business intelligence and advanced analytics. It is a data warehouse appliance that can run complex and mixed workloads. In addition, it supports multiple languages (C/C++, Java, Python, FORTRAN), frameworks (Hadoop) and tools (IBM SPSS, R and SAS).
• The Teradata Aster appliance comes with 50 pre-built MapReduce functions optimized for graph, text and behavior/marketing analysis. It also comes with the Aster Database and Apache Hadoop to process structured, unstructured and semi-structured data. A full Hadoop cabinet includes two masters and 16 data nodes, which can deliver up to 152TB of user space.
• The HP AppSystem appliance comes with pre-configured HP ProLiant DL380e Gen8 servers running Red Hat Enterprise Linux 6.1 and a Cloudera distribution. Enterprises can use this appliance with HP Vertica Community Edition to run real-time analysis of structured data.
HDFS and Hadoop
HDFS is not the only platform for performing big data tasks, and vendors have emerged with solutions that offer a replacement for HDFS. For example, DataStax8 uses Cassandra; Amazon Web Services uses S3; IBM has the General Parallel File System (GPFS); EMC relies on Isilon; MapR uses Direct NFS; etc. (To learn more, read about alternate technologies.)9 An EMC appliance uses Isilon scale-out NAS technology along with HDFS, a combination that eliminates the need to export/import the data into HDFS (see Figure 5). IBM uses GPFS instead of HDFS in its solution after finding that GPFS performed better than HDFS (see Figure 6).
Many IT shops are scrambling to build Hadoop clusters in their enterprise. The question is: what are they going to do with these clusters? Not every enterprise needs its own cluster; it all depends on the nature of the business and whether the cluster serves a legitimate and compelling business use case. For example, enterprises handling financial transactions would definitely require an analytics platform for fraud detection, portfolio analysis, churn analysis, etc. Similarly, enterprises engaged in research work require large clusters for faster processing and large-scale data analysis.

Figure 4. Hadoop cluster, an anatomical view: client nodes (edge nodes) connect through a network switch to the master nodes, comprising NameNodes with manual/automatic failover and the JobTracker/Resource Manager (scheduler), and to the data nodes (slave nodes), with Kerberos providing security and a cluster manager handling management and monitoring.
Hadoop-as-a-Service
There are a couple of ways in which Hadoop clusters can be leveraged beyond building a new cluster or taking an appliance-based approach. A more cost-effective approach is to use the Amazon10 Elastic MapReduce (EMR) service. Enterprises can build a cluster on demand and pay only for the duration they use it. Amazon offers compute instances running open source Apache Hadoop and the MapR (M3/M5/M7) distribution.
Other vendors that offer "Hadoop-as-a-Service" include:
• Joyent, which offers Hadoop services based on the Hortonworks distribution.
• Skytap, which provides Hadoop services based on the Cloudera (CDH4) distribution.
• Mortar, which delivers Hadoop services based on the MapR distribution.
All of these vendors charge on a pay-per-use basis. New vendors will eventually emerge as market demand increases and more enterprises require additional analytics, business intelligence (BI) and data warehousing options. As the competitive stakes rise, we recommend comparing prices and capabilities to get the best price/performance for your big data environment.
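As a rough sketch of the pay-per-use model using the AWS SDK for Java (the instance types, counts and log bucket are placeholders, and depending on the SDK and EMR versions additional settings such as an AMI/release label and IAM roles may also be required), a transient EMR cluster can be requested programmatically:

    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
    import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;

    public class EmrClusterSketch {
        public static void main(String[] args) {
            // Credentials are resolved from the default provider chain
            // (environment variables, ~/.aws/credentials, instance profile).
            AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient();

            JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
                    .withInstanceCount(5)                     // 1 master + 4 core nodes
                    .withMasterInstanceType("m1.large")       // placeholder instance types
                    .withSlaveInstanceType("m1.large")
                    .withKeepJobFlowAliveWhenNoSteps(false);  // tear down when steps finish

            RunJobFlowRequest request = new RunJobFlowRequest()
                    .withName("on-demand-analytics-cluster")  // illustrative name
                    .withLogUri("s3://my-log-bucket/emr/")    // placeholder S3 bucket
                    .withInstances(instances);

            RunJobFlowResult result = emr.runJobFlow(request);
            System.out.println("Started cluster: " + result.getJobFlowId());
        }
    }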
Figure 5. EMC substitutes Isilon for HDFS: the JobTracker, TaskTrackers, NameNode and DataNodes, together with tools such as R (RHIPE), Pig, Mahout, Hive and HBase, access Isilon over Ethernet through an HDFS-compatible (HCFS) interface.

Figure 6. Benchmarking HDFS vs. GPFS: execution times for the Postmark, Teragen, Grep, CacheTest and Terasort workloads.
The Hadoop platform by itself can't do anything; enterprises need applications that can leverage it. Developing an analytics application that offers business value requires data scientists and data analysts. These applications can perform many different types of jobs, from pure analytical, data and BI jobs through data processing. Hadoop can run all types of workloads, so it is becoming increasingly important to design an effective and efficient cluster using the latest technologies and frameworks.
Hadoop Competitors
The Hadoop framework provides a batch processing environment that works well with historical data in data warehouses and archives. What if there is a need to perform real-time processing, as in the case of fraud detection, bank transactions, etc.? Enterprises have begun assessing alternative approaches and technologies to solve this issue.
Storm is an open source project that helps in setting up an environment for real-time (stream) processing. Besides Storm, there are other options to consider, such as massively parallel processing (MPP) databases, the high performance computing cluster (HPCC) platform and Hadapt. All of these technologies require a strong understanding of the workload that will be run. If the workload is processing intensive, it is always recommended to go with HPCC. HPCC uses a declarative programming language that is extensible, homogeneous and complete, and it uses an Open Data Model that is not constrained by key-value pairs, with data processing happening in parallel. To learn more, read HPCC vs. Hadoop.11 Massively parallel processing databases have been around since before MapReduce hit the market; they use SQL, a language commonly used by data scientists and analysts, as the interface to interact with the system. MPP systems are mainly used in high-end data warehouse implementations.
Looking Ahead
Hadoop is a strong framework for running data-intensive jobs, but it is still evolving as the ecosystem projects around it mature and expand. Market demand is spurring new technological developments daily. With these innovations come various options to choose from, such as building a new cluster, leveraging an appliance or making use of Hadoop-as-a-Service offerings.
Moreover, the market is filling with new companies, which is creating more competitive technologies and options to consider. For starters, enterprises need to make the right decision in choosing their Hadoop distribution. Enterprises can start small by picking any distribution and building a 5-10 node cluster to test its features, functionality and performance. Most reputable vendors provide a free version of their distribution for evaluation purposes. We advise enterprises to test two or three vendor distributions against their business requirements before making any final decisions. IT decision-makers should also pay close attention to emerging competitors to position their organizations for impending technology enhancements.
Footnotes

1 IDC report: Worldwide Big Data Technology and Services 2012-2015 Forecast.
2 www.centurylink.com/business/artifacts/pdf/resources/big-data-defining-the-digital-deluge.pdf.
3 http://en.wikipedia.org/wiki/Hadoop.
4 http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html.
5 http://pig.apache.org/; http://hive.apache.org/; http://hbase.apache.org; "Comparing Pig Latin and SQL for Constructing Data Processing Pipelines," developer.yahoo.com/blogs/hadoop/comparing-pig-latin-sql-constructing-data-processing-pipelines-444.html; "Process a Million Songs with Apache Pig," blog.cloudera.com/blog/2012/08/process-a-million-songs-with-apache-pig/.
6 www.revolutionanalytics.com/products/revolution-enterprise.php.
7 comScore report: Switching to MapR for Reliable Insights into Global Internet Consumer Behavior.
8 www.datastax.com/wp-content/uploads/2012/09/WP-DataStax-HDFSvsCFS.pdf.
9 gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/.
10 aws.amazon.com/elasticmapreduce/pricing/.
11 hpccsystems.com/Why-HPCC/HPCC-vs-Hadoop.