3. Security Classification: Internal
Objectives
Big data and Hadoop introduction 3
• Big data overview.
• Apache Hadoop common architecture:
– Read/write a file in the Hadoop Distributed File System (HDFS)
– How Hadoop MapReduce tasks work
– Differences between Hadoop 1 and 2
• Develop a MapReduce job using Hadoop
• Apply Hadoop in the real world
6. Big data – Definition
“Big data is high-volume, high-velocity
and/or high-variety information assets that
demand cost-effective, innovative forms of
information processing that enable
enhanced insight, decision making, and
process automation”
- Gartner
7. Big data – The 3Vs
• Volume :
– Google receives over 2 million search queries every minute
– Transactional and sensor data are stored every fraction of a second
• Variety :
– YouTube, Facebook generate video, audio, image and text data
– Over 200 million emails are sent every minute
• Velocity:
– Experiments at CERN generate colossal amounts of data.
– Particles collide 600 million times per second.
– CERN's data center processes about one petabyte of data every day.
8. Big data – Challenges
• Difficulty identifying the right data and determining how best to use it.
• Struggling to find the right talent.
• Obstacles to data access and connectivity.
• The data technology landscape is evolving extremely fast.
• Finding new ways of collaborating across functions and
businesses.
• Security concerns.
12. Apache Hadoop – What?
• It is a software platform:
– allows us to easily write and run data-related applications
– facilitates processing and manipulating massive amounts of data
– makes those processes conveniently scalable
14. Apache Hadoop – Characteristics
• Reliable shared storage (HDFS) and analysis system
(MapReduce).
• Highly scalable
• Cost effective as it can work with commodity hardware.
• Highly flexible and can process both structured as well as
unstructured data.
• Built-in fault tolerance.
• Write once and read multiple times.
• Optimized for large and very large data sets.
15. Apache Hadoop – Design principles
• Moving computation is cheaper than moving data
• Hardware will fail, manage it
• Hide execution details from the user
• Use streaming data access
• Use a simple file system coherency model
29. Apache Hadoop – Usage
• When to use Hadoop:
– Hadoop can be used in various scenarios, including:
– Analytics
– Search
– Data Retention
– Log file processing
– Analysis of Text, Image, Audio, & Video content
– Recommendation systems like in E-Commerce Websites
• When Not to Use Hadoop:
– Low-latency or near real-time data access.
– Having a large number of small files to be processed.
– Scenarios with multiple writers, or requiring arbitrary writes or modifications within
files
38. References
- http://hadoop.apache.org
- Hadoop in Action – Chuck Lam
- Hadoop: The Definitive Guide – Tom White
- http://www.bigdatanews.com/
- http://stackoverflow.com
- http://codeproject.com
- Hadoop 2 Fundamentals – LiveLessons
Editor's Notes
This definition consists of three parts:
Part One: 3Vs (Variety – Velocity – Volume)
Part Two: Cost-Effective, Innovative Forms of Information Processing
Part Three: Enhanced Insight and Decision Making
Data scientists at some companies break big data into 4 Vs: Volume, Variety, Velocity, Veracity. Others add one more V to the characteristics of big data: Value.
Information about Big data Ecosystem can be found at URL: http://hadoopilluminated.com/hadoop_illuminated/Bigdata_Ecosystem.html
Here are few highlights of the Hadoop Architecture:
- Hadoop works in a master-worker / master-slave fashion.
- Hadoop has two core components: HDFS and MapReduce.
HDFS (Hadoop Distributed File System) offers highly reliable, distributed storage, and ensures reliability, even on commodity hardware, by replicating the data across multiple nodes. Unlike a regular file system, when data is pushed to HDFS, it is automatically split into multiple blocks (the block size is a configurable parameter), which are stored and replicated across the datanodes. This ensures high availability and fault tolerance.
MapReduce offers an analysis system that can perform complex computations on large datasets. This component is responsible for performing all the computations. It works by breaking a large, complex computation into multiple tasks, assigning those to individual worker/slave nodes, and taking care of coordination and consolidation of results.
- The master contains the Namenode and Job Tracker components.
Namenode holds the information about all the other nodes in the Hadoop Cluster, files present in the cluster, constituent blocks of files and their locations in the cluster, and other information useful for the operation of the Hadoop Cluster.
Job Tracker keeps track of the individual tasks/jobs assigned to each of the nodes and coordinates the exchange of information and results.
- Each Worker / Slave contains the Task Tracker and a Datanode components.
Task Tracker is responsible for running the task / computation assigned to it.
Datanode is responsible for holding the data.
The computers in the cluster can be located anywhere; there is no dependency on the physical location of any server.
The differences between Hadoop 1 & 2 are:
Hadoop 1 is limited to about 4,000 nodes per cluster; Hadoop 2 scales to about 10,000 nodes per cluster.
Hadoop 1 has a JobTracker bottleneck; Hadoop 2 achieves efficient cluster utilization through YARN.
Hadoop 1 only supports MapReduce jobs, but Hadoop 2 supports more job types.
Part of the core Hadoop project, YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform, unlocking an entirely new approach to analytics.
YARN is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters.
YARN also extends the power of Hadoop to incumbent and new technologies found within the data center so that they can take advantage of cost effective, linear-scale storage and processing.
Namenode
The namenode is software that runs on commodity hardware with the GNU/Linux operating system. The system hosting the namenode acts as the master server and performs the following tasks:
Manages the file system namespace.
Regulates clients' access to files.
Executes file system operations such as renaming, closing, and opening files and directories.
Datanode
A datanode runs on commodity hardware with the GNU/Linux operating system and the datanode software. Every node (commodity hardware/system) in a cluster runs a datanode, which manages the data storage of its system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
Block
Generally, user data is stored in the files of HDFS. A file is divided into one or more segments, which are stored on individual datanodes. These file segments are called blocks; in other words, a block is the minimum amount of data that HDFS can read or write. The default block size is 64 MB, and it can be increased as needed via the HDFS configuration.
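The file-to-blocks arithmetic above can be sketched in a few lines. This is an illustrative Python model, not Hadoop's actual code, and it assumes the 64 MB default block size:

```python
# Sketch: how HDFS slices a file into fixed-size blocks
# (64 MB default; a per-file, configurable parameter).
BLOCK_SIZE = 64 * 1024 * 1024  # bytes

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs for each block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The final block holds only the bytes that remain.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 200 MB file occupies three full 64 MB blocks plus one 8 MB block;
# note the last block stores only the bytes it needs.
blocks = split_into_blocks(200 * 1024 * 1024)
```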
Hadoop commands reference:
The syntax is: hadoop fs -command, where command is one of:
1.ls <path>
Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry.
2.lsr <path>
Behaves like -ls, but recursively displays entries in all subdirectories of path.
3.du <path>
Shows disk usage, in bytes, for all the files which match path; filenames are reported with the full HDFS protocol prefix.
4.dus <path>
Like -du, but prints a summary of disk usage of all files/directories in the path.
5.mv <src><dest>
Moves the file or directory indicated by src to dest, within HDFS.
6.cp <src> <dest>
Copies the file or directory identified by src to dest, within HDFS.
7.rm <path>
Removes the file or empty directory identified by path.
8.rmr <path>
Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or subdirectories of path).
9.put <localSrc> <dest>
Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
10.copyFromLocal <localSrc> <dest>
Identical to -put
11.moveFromLocal <localSrc> <dest>
Copies the file or directory from the local file system identified by localSrc to dest within HDFS, and then deletes the local copy on success.
12.get [-crc] <src> <localDest>
Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
13.getmerge <src> <localDest>
Retrieves all files that match the path src in HDFS, and copies them to a single, merged file in the local file system identified by localDest.
14.cat <filename>
Displays the contents of filename on stdout.
15.copyToLocal <src> <localDest>
Identical to -get
16.moveToLocal <src> <localDest>
Works like -get, but deletes the HDFS copy on success.
17.mkdir <path>
Creates a directory named path in HDFS.
Creates any parent directories in path that are missing (e.g., mkdir -p in Linux).
18.setrep [-R] [-w] rep <path>
Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time)
19.touchz <path>
Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0.
20.test -[ezd] <path>
Returns 1 if path exists (-e), has zero length (-z), or is a directory (-d); returns 0 otherwise.
21.stat [format] <path>
Prints information about path. Format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
22.tail [-f] <filename>
Shows the last 1 KB of the file on stdout.
23.chmod [-R] mode,mode,... <path>...
Changes the file permissions associated with one or more objects identified by path.... Performs changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes a (all scopes) if no scope is specified; does not apply an umask.
24.chown [-R] [owner][:[group]] <path>...
Sets the owning user and/or group for files or directories identified by path.... Sets owner recursively if -R is specified.
25.chgrp [-R] group <path>...
Sets the owning group for files or directories identified by path.... Sets group recursively if -R is specified.
26.help <cmd-name>
Returns usage information for one of the commands listed above. You must omit the leading '-' character in cmd.
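To illustrate what two of the commands above compute, here is a hedged Python sketch that mimics the -du and -dus semantics against the local filesystem (standing in for HDFS; the function names are ours, not Hadoop's, and real -du also reports directory entries):

```python
import os

def du(path):
    """Per-file disk usage in bytes, like `hadoop fs -du <path>`."""
    entries = []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isfile(full):  # simplification: files only
            entries.append((full, os.path.getsize(full)))
    return entries

def dus(path):
    """Summary of total usage under path, like `hadoop fs -dus <path>`."""
    return sum(size for _, size in du(path))
```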
When a client is writing data to an HDFS file, its data is first written to a local file as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode. This list contains the DataNodes that will host a replica of that block. The client then flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository and transfers that portion to the second DataNode in the list. The second DataNode, in turn starts receiving each portion of the data block, writes that portion to its repository and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. Thus, the data is pipelined from one DataNode to the next.
* Why the default number of replications is 3?
Hadoop runs in clustered environments: each cluster has multiple racks, and each rack has multiple datanodes. To make HDFS fault tolerant within the cluster, we need to consider the following failures:
- DataNode failure
- Rack failure
The chance of a whole-cluster failure is fairly low, so let's not consider it. For the above cases, we need to make sure that:
- If one DataNode fails, the same data can be read from another DataNode
- If an entire rack fails, the same data can be read from another rack
So now it is clear why the default replication factor is set to 3: no two replicas go to the same DataNode, and at least one replica goes to a different rack, fulfilling the fault-tolerance criteria above.
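The placement policy implied above can be sketched as an illustrative Python model (the rack/node names and the `place_replicas` helper are hypothetical; real HDFS also weighs node load and topology distance):

```python
import random

def place_replicas(writer, topology):
    """Sketch of HDFS's default 3-replica placement.
    writer:   (rack, node) where the writing client runs.
    topology: {rack: [node, ...]} for the cluster.
    Replica 1 goes on the writer's own node, replica 2 on a node in a
    different rack, replica 3 on a different node in replica 2's rack,
    so no DataNode holds two replicas and two racks are covered."""
    w_rack, w_node = writer
    first = (w_rack, w_node)
    other_rack = random.choice([r for r in topology if r != w_rack])
    second_node = random.choice(topology[other_rack])
    third_node = random.choice(
        [n for n in topology[other_rack] if n != second_node])
    return [first, (other_rack, second_node), (other_rack, third_node)]

# Two racks, two nodes each; the writer sits on rack1/n1.
topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
replicas = place_replicas(("rack1", "n1"), topology)
```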
The term "secondary name-node" is somewhat misleading. It is not a name-node in the sense that data-nodes cannot connect to it, and in no event can it replace the primary name-node if the primary fails.
The only purpose of the secondary name-node is to perform periodic checkpoints. The secondary name-node periodically downloads current name-node image and edits log files, joins them into new image and uploads the new image back to the (primary and the only) name-node.
So if the name-node fails and you can restart it on the same physical node, there is no need to shut down the data-nodes; only the name-node needs to be restarted. If you cannot use the old node anymore, you will need to copy the latest image somewhere else. The latest image can be found either on the node that was the primary before the failure, if available, or on the secondary name-node. The latter will be the latest checkpoint without subsequent edit logs, so the most recent namespace modifications may be missing there. You will also need to restart the whole cluster in this case.
Step 1: First the Client will open the file by giving a call to open() method on FileSystem object, which for HDFS is an instance of DistributedFileSystem class.
Step 2: DistributedFileSystem calls the namenode, using RPC (Remote Procedure Call), to determine the locations of the first few blocks of the file. For each block, the namenode returns the addresses of all the datanodes that have a copy of that block. The client will interact with the respective datanodes to read the file. The namenode also provides a token to the client, which the client shows to the datanodes for authentication.
The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, then connects to the first closest datanode for the first block in the file.
Step 4: Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream will close the connection to the datanode, then find the best datanode for the next block. This happens transparently to the client, which from its point of view is just reading a continuous stream.
Step 6: Blocks are read in order, with the DFSInputStream opening new connections to datanodes as the client reads through the stream. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream.
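Steps 3 through 6 amount to a simple loop over blocks. Here is a minimal Python sketch (the `fetch` callback and node names are hypothetical stand-ins for the datanode I/O that DFSInputStream manages):

```python
def read_file(block_locations, fetch):
    """block_locations: one list of datanodes per block, in block order,
    each list sorted by proximity to the client (as the namenode
    returns them). fetch(datanode, block_index) returns that block's
    bytes, standing in for the datanode connection."""
    data = b""
    for index, datanodes in enumerate(block_locations):
        closest = datanodes[0]           # step 3: pick the closest node
        data += fetch(closest, index)    # steps 4-5: stream, then move on
    return data                          # step 6: blocks read in order

# Toy cluster: block 0 lives on dn1 and dn2, block 1 only on dn3.
blocks = {0: b"hello ", 1: b"world"}
locations = [["dn1", "dn2"], ["dn3"]]
content = read_file(locations, lambda dn, i: blocks[i])
```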
Step 1: The client creates the file by calling create() method on DistributedFileSystem.
Step 2: DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem’s namespace, with no blocks associated with it.
The namenode performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create the file. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an IOException. The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to.
Step 3: As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline, and here we'll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode in the pipeline.
Step 4: Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline.
Step 5: DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline.
Step 6: When the client has finished writing data, it calls close() on the stream.
Step 7: This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete. The namenode already knows which blocks the file is made up of, so it only has to wait for the blocks to be minimally replicated before returning successfully.
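The data queue / ack queue mechanics of steps 3 through 5 can be sketched as follows. This is a synchronous Python model for clarity; the real pipeline streams 4 KB portions and runs the stages concurrently:

```python
from collections import deque

def write_through_pipeline(packets, pipeline):
    """packets:  byte strings taken from the data queue, in order.
    pipeline: one dict per datanode, standing in for local storage.
    Each packet is stored and forwarded node-to-node, then removed
    from the ack queue once every datanode has acknowledged it."""
    ack_queue = deque()
    for packet in packets:
        ack_queue.append(packet)
        for node in pipeline:                  # store and forward
            node.setdefault("stored", []).append(packet)
        ack_queue.popleft()                    # all replicas acked
    return ack_queue  # empty: close() can now signal the namenode

pipeline = [{}, {}, {}]                        # replication factor 3
leftover = write_through_pipeline([b"pkt1", b"pkt2"], pipeline)
```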
- There are two procedures:
+ map filters and sorts the data
+ reduce summarizes the data
+ reduce is not strictly necessary; you can have a map-only process
This facilitates scalability and parallelization.
- Each task in a MapReduce job is processed on a datanode:
+ tasks are simple, and nodes perform similar tasks
+ combined together, the operation can be powerful and even complex
+ MapReduce programs must therefore be written with great care
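The two procedures can be illustrated with the classic word-count example. This is a self-contained Python sketch of the map, shuffle/sort, and reduce phases, not Hadoop's Java API:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # map: turn each input line into (key, value) pairs
    return [(word.lower(), 1) for word in line.split()]

def reduce_fn(key, values):
    # reduce: summarize all values that share a key
    return (key, sum(values))

def run_job(lines):
    pairs = [kv for line in lines for kv in map_fn(line)]  # map phase
    pairs.sort(key=itemgetter(0))                          # shuffle/sort
    return [reduce_fn(key, [v for _, v in group])          # reduce phase
            for key, group in groupby(pairs, key=itemgetter(0))]

counts = dict(run_job(["the quick fox", "the lazy dog"]))
# "the" appears twice; every other word appears once
```

In Hadoop, the framework performs the shuffle/sort between nodes and runs many map and reduce tasks in parallel; the per-record logic is what the developer writes.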
HBase - Inspired by Google BigTable. A non-relational distributed database offering random, real-time read/write operations on column-oriented, very large tables. It is the Hadoop database, and a common backing store for the outputs of MapReduce jobs.
Hive - Data warehouse infrastructure developed by Facebook for data summarization, query, and analysis. It provides an SQL-like language (not SQL-92 compliant): HiveQL.
Pig - Provides an engine for executing data flows in parallel on Hadoop, including a language, Pig Latin, for expressing those flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data. Pig runs on Hadoop and makes use of both HDFS and MapReduce: it compiles the Pig Latin scripts that users write into a series of one or more MapReduce jobs, which it then executes. Pig Latin looks different from many programming languages: there are no if statements or for loops. Traditional procedural and object-oriented languages describe control flow, with data flow as a side effect of the program; Pig Latin instead focuses on data flow.
ZooKeeper - A coordination service that provides the tools you need to write correct distributed applications. ZooKeeper was developed at Yahoo! Research, and several Hadoop projects already use it to coordinate clusters and provide highly available distributed services; the most famous are Apache HBase, Storm, and Kafka. ZooKeeper is an application library with two principal implementations of the APIs (Java and C) and a service component, implemented in Java, that runs on an ensemble of dedicated servers. It simplifies the development of distributed systems, making it more agile and enabling more robust implementations. Back in 2006, Google published a paper on "Chubby", a distributed lock service that gained wide adoption within their data centers; ZooKeeper, not surprisingly, is a close clone of Chubby designed to fulfill many of the same roles for HDFS and other Hadoop infrastructure.
Mahout - Machine learning and math library, on top of MapReduce.
Sqoop - Transfers data from relational databases (RDBMS) to HDFS.
Flume - Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into HDFS
Oozie - Apache Oozie is a Java Web application used to schedule Apache Hadoop jobs
Ambari – Monitoring & management of Hadoop clusters and nodes
A9.com – Amazon: To build Amazon's product search indices; process millions of sessions daily for analytics, using both the Java and streaming APIs; clusters vary from 1 to 100 nodes.
Yahoo! : More than 100,000 CPUs in ~20,000 computers running Hadoop; biggest cluster: 2000 nodes (2*4cpu boxes with 4TB disk each); used to support research for Ad Systems and Web Search
AOL : Used for a variety of things ranging from statistics generation to running advanced algorithms for doing behavioral analysis and targeting; cluster size is 50 machines, Intel Xeon, dual processors, dual core, each with 16GB Ram and 800 GB hard-disk giving us a total of 37 TB HDFS capacity.
Facebook: To store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning; 320 machine cluster with 2,560 cores and about 1.3 PB raw storage;
FOX Interactive Media : 3 X 20 machine cluster (8 cores/machine, 2TB/machine storage) ; 10 machine cluster (8 cores/machine, 1TB/machine storage); Used for log analysis, data mining and machine learning
University of Nebraska-Lincoln: one medium-sized Hadoop cluster (200 TB) to store and serve physics data.