Big Data
By Amin Badirzadeh and Mina Soltani Siapoosh
Big Data is a phrase used to describe a volume of structured and
unstructured data so massive that it is difficult to process using
traditional database and software techniques.
Sources of big data:
• Social media: Twitter, LinkedIn, Facebook, Tumblr, blogs, SlideShare,
YouTube, Google+, Instagram, Flickr, Pinterest, Vimeo, WordPress, IM,
RSS, reviews, Chatter, Jive, Yammer, etc.
• Sensor data: medical devices, smart electric meters, car sensors, road
cameras, satellites, traffic recording devices, processors found within
vehicles, video games, cable boxes, assembly lines, office buildings,
cell towers, jet engines, air conditioning units, refrigerators
• Docs: XLS, PDF, CSV, email, Word, PPT, HTML, HTML5, plain text, XML,
JSON, etc.
• Media: images, videos, audio, Flash, live streams, podcasts, etc.
• Public web: government, weather, competitive, traffic, regulatory,
compliance, health care services, economic, census, public finance,
stock, OSINT, the World Bank, SEC/EDGAR, Wikipedia, IMDb, etc.
• Archive: archives of scanned documents, statements, insurance forms,
medical records and customer correspondence, paper archives, and
print-stream files that contain original systems of record between
organizations and their customers
• Log data: event logs, server data, application logs, business process
logs, audit logs, call detail records (CDRs), mobile location, mobile
app usage
• Business apps: project management, marketing automation, productivity,
CRM, ERP, content management systems, HR, storage, talent management,
procurement, expense management, Google Docs, intranets, portals, etc.
Hadoop:
 An open-source framework from the Apache Software Foundation
 Created by Doug Cutting
 Java-based
 Distributed processing
 Reliability
 High availability
Hadoop is a good fit for:
 Analytics
 Search
 Data retention
 Log file processing
 Analysis of text, image, audio, and video content
 Recommendation systems, such as those in e-commerce websites
Hadoop is not a good fit:
 For real-time data analysis
 As a relational database system
 As a general network file system
 For non-parallel data processing
• Storage and processing capacity. Hadoop can store and process huge
amounts of any kind of data, quickly. With data volumes and varieties
constantly increasing, especially from social media and the Internet of
Things (IoT), that's a key consideration.
• Computing power. Hadoop's distributed computing
model processes big data fast. The more computing
nodes you use, the more processing power you have.
• Fault tolerance. Data and application processing are
protected against hardware failure. If a node goes down,
jobs are automatically redirected to other nodes to make
sure the distributed computing does not fail. Multiple
copies of all data are stored automatically.
• Flexibility. Unlike traditional relational databases, you
don’t have to preprocess data before storing it. You can
store as much data as you want and decide how to use it
later. That includes unstructured data like text, images
and videos.
• Low cost. The open-source framework is free and uses
commodity hardware to store large quantities of data.
• Scalability. You can easily grow your system to handle
more data simply by adding nodes. Little administration
is required.
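The fault-tolerance and scalability points above come down to block replication and adding DataNodes. As a minimal sketch, the replication factor and block size are set in hdfs-site.xml; the values below (3 replicas, 64 MB blocks) are illustrative common defaults, not requirements of any particular cluster.

<!-- hdfs-site.xml: illustrative values only -->
<configuration>
  <!-- How many copies HDFS keeps of each block (3 is a common default). -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- Block size in bytes; 67108864 = 64 MB, the old Hadoop 1.x default. -->
  <property>
    <name>dfs.blocksize</name>
    <value>67108864</value>
  </property>
</configuration>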
 HDFS (the Hadoop Distributed File System) is the storage layer that MR
reads from and writes to: a distributed file system that runs on
commodity hardware and was designed along the lines of Google's GFS.
 HDFS is the primary data store for Hadoop applications. It splits files
into data blocks (64 MB by default in early Hadoop releases) and
distributes those blocks across the nodes of a cluster, enabling
parallel computation for MR.
 An HDFS cluster consists of a single NameNode, which manages the file
system metadata, and multiple DataNodes, which store the actual data. A
file is divided into one or more blocks, the blocks are stored on
DataNodes, and copies of each block are placed on different DataNodes to
prevent data loss.
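To make the NameNode/DataNode split concrete, here is a minimal sketch of writing and reading a file through the HDFS Java client (org.apache.hadoop.fs.FileSystem). The class name HdfsHello and the path /tmp/hello.txt are illustrative; the cluster address is assumed to come from core-site.xml on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hello.txt");

    // Write: the client streams bytes to DataNodes; the NameNode only
    // records metadata (block locations, replication).
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back through the same FileSystem abstraction.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}

The client only asks the NameNode where blocks live; the data itself flows directly between the client and the DataNodes.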
 Hadoop MapReduce (MR) was developed along the lines of Google's
MapReduce.
 The MR framework consists of one JobTracker node and multiple
TaskTracker nodes. The JobTracker handles task distribution and
scheduling; the TaskTrackers receive Map or Reduce tasks from the
JobTracker, execute them, and report task status back to the JobTracker.
 The MR framework and HDFS run on the same set of nodes, so tasks can be
scheduled on the nodes where the data already resides (data locality).
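To make the Map and Reduce task roles concrete, below is a minimal word-count sketch against the org.apache.hadoop.mapreduce API; the class names WordCountMapper and WordCountReducer are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: runs where the input split lives, emits (word, 1) per word.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reduce task: receives every count emitted for one word and sums them.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

The driver that packages these two classes into a job and submits it to the cluster is sketched after the next list.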
 The Hadoop YARN framework handles job scheduling and cluster resource
management; users can submit and kill applications through the Hadoop
REST API.
 In Hadoop, the combination of all of the Java JAR files and classes
needed to run a MapReduce program is called a job.
 You can submit jobs to a JobTracker from the command line or by POSTing
them to the REST API over HTTP.
 These jobs contain the "tasks" that execute the individual map and
reduce steps.
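Continuing the word-count sketch above, a job in this sense is just a driver like the one below plus the JAR that contains the mapper and reducer classes; the class and JAR names are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");

    // Everything the cluster needs to run the program: the JAR and its classes.
    job.setJarByClass(WordCountJob.class);
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class); // optional map-side pre-aggregation
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Submits the job to the cluster and waits for completion.
    // From a shell this is typically wrapped as:
    //   hadoop jar wordcount.jar WordCountJob /input /output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}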
 Ambari: A web-based tool for provisioning, managing, and
monitoring Apache Hadoop clusters, Ambari includes support
for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase,
ZooKeeper, Oozie, Pig, and Sqoop.
 Avro: Avro is a data serialization system.
 Cassandra: Cassandra is a scalable multi-master database with
no single points of failure.
 Chukwa: A data collection system, Chukwa is used to manage
large distributed systems.
 HBase: A scalable, distributed database, HBase supports
structured data storage for large tables.
 Hive: Hive is a data warehouse infrastructure that provides
data summaries and ad-hoc querying.
 Mahout: Mahout is a scalable machine learning and data
mining library.
 Pig: This is a high-level data flow language and execution
framework for parallel computation.
 Spark: A fast and general compute engine for Hadoop data,
Spark provides a simple and expressive programming model
that supports a wide range of applications, including ETL,
machine learning, stream processing, and graph computation
(see the sketch after this list).
 Tez: Tez is a generalized data flow programming framework
built on Hadoop YARN that provides a powerful and flexible
engine to execute an arbitrary DAG of tasks to process data for
both batch and interactive use-cases.
 ZooKeeper: This is a high-performance coordination service for
distributed applications.
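To give a feel for Spark's "simple and expressive programming model," here is a minimal word count in Spark's Java API. It assumes Spark 2.x or later; the application name and HDFS paths are illustrative.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("SparkWordCount");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // Read lines from HDFS, split into words, and count occurrences.
      JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input.txt");
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum);
      counts.saveAsTextFile("hdfs:///tmp/wordcount-output");
    }
  }
}

Compared with the MapReduce version above, the same map and reduce logic fits in a few lines because Spark handles the job wiring itself.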
