Big Data
By Amin Badirzadeh and Mina Soltani Siapoosh
Big Data is a phrase used to describe a volume of structured and
unstructured data so massive that it is difficult to process using
traditional database and software techniques.
Sources of big data:
• Social media: Twitter, LinkedIn, Facebook, Tumblr, blogs, SlideShare,
YouTube, Google+, Instagram, Flickr, Pinterest, Vimeo, WordPress, IM,
RSS, reviews, Chatter, Jive, Yammer, etc.
• Sensor data: medical devices, smart electric meters, car sensors, road
cameras, satellites, traffic recording devices, processors found within
vehicles, video games, cable boxes, assembly lines, office buildings,
cell towers, jet engines, air conditioning units, refrigerators
• Docs: XLS, PDF, CSV, email, Word, PPT, HTML, HTML5, plain text, XML,
JSON, etc.
• Media: images, videos, audio, Flash, live streams, podcasts, etc.
• Public web: government, weather, competitive, traffic, regulatory,
compliance, health care services, economic, census, public finance,
stock, OSINT, the World Bank, SEC/EDGAR, Wikipedia, IMDb, etc.
• Archive: archives of scanned documents, statements, insurance forms,
medical records and customer correspondence, paper archives, and
print-stream files that contain original systems of record between
organizations and their customers
• Log data: event logs, server data, application logs, business process
logs, audit logs, call detail records (CDRs), mobile location, mobile
app usage
• Business apps: project management, marketing automation, productivity,
CRM, ERP, content management systems, HR, storage, talent management,
procurement, expense management, Google Docs, intranets, portals, etc.
Hadoop:
 An open-source framework from the Apache Software Foundation
 Created by Doug Cutting
 Java-based
 Distributed processing
 Reliability
 High availability
Hadoop is a good fit for:
 Analytics
 Search
 Data retention
 Log file processing
 Analysis of text, image, audio, and video content
 Recommendation systems, such as those in e-commerce websites
Hadoop is not a good fit:
 For real-time data analysis
 As a relational database system
 As a general network file system
 For non-parallel data processing
• Storage and processing capacity. Hadoop can store and process huge
amounts of any kind of data, quickly. With data volumes and varieties
constantly increasing, especially from social media and the Internet of
Things (IoT), that's a key consideration.
• Computing power. Hadoop's distributed computing
model processes big data fast. The more computing
nodes you use, the more processing power you have.
• Fault tolerance. Data and application processing are
protected against hardware failure. If a node goes down,
jobs are automatically redirected to other nodes to make
sure the distributed computing does not fail. Multiple
copies of all data are stored automatically.
• Flexibility. Unlike traditional relational databases, you
don’t have to preprocess data before storing it. You can
store as much data as you want and decide how to use it
later. That includes unstructured data like text, images
and videos.
• Low cost. The open-source framework is free and uses
commodity hardware to store large quantities of data.
• Scalability. You can easily grow your system to handle
more data simply by adding nodes. Little administration
is required.
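The fault-tolerance and scalability points above come down to block replication and adding DataNodes. As a minimal sketch, the replication factor and block size are set in hdfs-site.xml; the values below (3 replicas, 64 MB blocks) are illustrative common defaults, not requirements of any particular cluster.

<!-- hdfs-site.xml: illustrative values only -->
<configuration>
  <!-- How many copies HDFS keeps of each block (3 is a common default). -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- Block size in bytes; 67108864 = 64 MB, the old Hadoop 1.x default. -->
  <property>
    <name>dfs.blocksize</name>
    <value>67108864</value>
  </property>
</configuration>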
 HDFS (the Hadoop Distributed File System) is the storage layer that MR
reads from and writes to: a distributed file system that runs on
commodity hardware and was designed along the lines of Google's GFS.
 HDFS is the primary data store for Hadoop applications. It splits files
into data blocks (64 MB by default in early Hadoop releases) and
distributes those blocks across the nodes of a cluster, enabling
parallel computation for MR.
 An HDFS cluster consists of a single NameNode, which manages the file
system metadata, and multiple DataNodes, which store the actual data. A
file is divided into one or more blocks, the blocks are stored on
DataNodes, and copies of each block are placed on different DataNodes to
prevent data loss.
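To make the NameNode/DataNode split concrete, here is a minimal sketch of writing and reading a file through the HDFS Java client (org.apache.hadoop.fs.FileSystem). The class name HdfsHello and the path /tmp/hello.txt are illustrative; the cluster address is assumed to come from core-site.xml on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hello.txt");

    // Write: the client streams bytes to DataNodes; the NameNode only
    // records metadata (block locations, replication).
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back through the same FileSystem abstraction.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}

The client only asks the NameNode where blocks live; the data itself flows directly between the client and the DataNodes.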
 Hadoop MapReduce (MR) was developed along the lines of Google's
MapReduce.
 The MR framework consists of one JobTracker node and multiple
TaskTracker nodes. The JobTracker handles task distribution and
scheduling; the TaskTrackers receive Map or Reduce tasks from the
JobTracker, execute them, and report task status back to the JobTracker.
 The MR framework and HDFS run on the same set of nodes, so tasks can be
scheduled on the nodes where the data already resides (data locality).
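To make the Map and Reduce task roles concrete, below is a minimal word-count sketch against the org.apache.hadoop.mapreduce API; the class names WordCountMapper and WordCountReducer are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: runs where the input split lives, emits (word, 1) per word.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reduce task: receives every count emitted for one word and sums them.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

The driver that packages these two classes into a job and submits it to the cluster is sketched after the next list.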
 The Hadoop YARN framework handles job scheduling and cluster resource
management; users can submit and kill applications through the Hadoop
REST API.
 In Hadoop, the combination of all of the Java JAR files and classes
needed to run a MapReduce program is called a job.
 You can submit jobs to a JobTracker from the command line or by POSTing
them to the REST API over HTTP.
 These jobs contain the "tasks" that execute the individual map and
reduce steps.
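Continuing the word-count sketch above, a job in this sense is just a driver like the one below plus the JAR that contains the mapper and reducer classes; the class and JAR names are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");

    // Everything the cluster needs to run the program: the JAR and its classes.
    job.setJarByClass(WordCountJob.class);
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class); // optional map-side pre-aggregation
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Submits the job to the cluster and waits for completion.
    // From a shell this is typically wrapped as:
    //   hadoop jar wordcount.jar WordCountJob /input /output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}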
 Ambari: A web-based tool for provisioning, managing, and
monitoring Apache Hadoop clusters, Ambari includes support
for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase,
ZooKeeper, Oozie, Pig, and Sqoop.
 Avro: Avro is a data serialization system.
 Cassandra: Cassandra is a scalable multi-master database with
no single points of failure.
 Chukwa: A data collection system, Chukwa is used to manage
large distributed systems.
 HBase: A scalable, distributed database, HBase supports
structured data storage for large tables.
 Hive: Hive is a data warehouse infrastructure that provides
data summaries and ad-hoc querying.
 Mahout: Mahout is a scalable machine learning and data
mining library.
 Pig: This is a high-level data flow language and execution
framework for parallel computation.
 Spark: A fast and general compute engine for Hadoop data,
Spark provides a simple and expressive programming model
that supports a wide range of applications, including ETL,
machine learning, stream processing, and graph computation
(see the sketch after this list).
 Tez: Tez is a generalized data flow programming framework
built on Hadoop YARN that provides a powerful and flexible
engine to execute an arbitrary DAG of tasks to process data for
both batch and interactive use-cases.
 ZooKeeper: This is a high-performance coordination service for
distributed applications.
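To give a feel for Spark's "simple and expressive programming model," here is a minimal word count in Spark's Java API. It assumes Spark 2.x or later; the application name and HDFS paths are illustrative.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("SparkWordCount");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // Read lines from HDFS, split into words, and count occurrences.
      JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input.txt");
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum);
      counts.saveAsTextFile("hdfs:///tmp/wordcount-output");
    }
  }
}

Compared with the MapReduce version above, the same map and reduce logic fits in a few lines because Spark handles the job wiring itself.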
