Big data
By Amin Badirzadeh and Mina Soltani Siapoosh
“Big Data is a phrase used to mean a massive
volume of both structured and unstructured
data that is so large it is difficult to process using
traditional database and software techniques.”
• Social Media: Twitter, LinkedIn, Facebook, Tumblr, Blog, SlideShare, YouTube, Google+, Instagram, Flickr, Pinterest, Vimeo, WordPress, IM, RSS, Review, Chatter, Jive, Yammer, etc.
• Sensor data: Medical devices, smart electric meters, car sensors, road cameras, satellites, traffic recording devices, processors found within vehicles, video games, cable boxes, assembly lines, office buildings, cell towers, jet engines, air conditioning units, refrigerators
• Docs: XLS, PDF, CSV, email, Word, PPT, HTML, HTML 5, plain text, XML, JSON, etc.
• Media: Images, videos, audio, Flash, live streams, podcasts, etc.
• Public Web: Government, weather, competitive, traffic, regulatory, compliance, health care services, economic, census, public finance, stock, OSINT, the World Bank, SEC/Edgar, Wikipedia, IMDb, etc.
• Archive: Archives of scanned documents, statements, insurance forms, medical records and customer correspondence, paper archives, and print stream files that contain original systems of record between organizations and their customers
• Log Data: Event logs, server data, application logs, business process logs, audit logs, call detail records (CDRs), mobile location, mobile app usage
• Business Apps: Project management, marketing automation, productivity, CRM, ERP, content management system, HR, storage, talent management, procurement, expense management, Google Docs, intranets, portals, etc.
Hadoop
• An open-source framework from the Apache Software Foundation
• Java-based
• Distributed processing
• Reliability
• High availability
• Created by Doug Cutting
• Analytics
• Search
• Data retention
• Log file processing
• Analysis of text, image, audio, and video content
• Recommendation systems, as on e-commerce websites
Hadoop is generally a poor fit:
• For real-time data analysis
• As a relational database system
• As a general network file system
• For non-parallel data processing
• Ability to store and process huge amounts of any kind
of data, quickly. With data volumes and varieties
constantly increasing, especially from social media and
the Internet of Things (IoT), that's a key consideration.
• Computing power. Hadoop's distributed computing
model processes big data fast. The more computing
nodes you use, the more processing power you have.
• Fault tolerance. Data and application processing are
protected against hardware failure. If a node goes down,
jobs are automatically redirected to other nodes to make
sure the distributed computing does not fail. Multiple
copies of all data are stored automatically.
• Flexibility. Unlike traditional relational databases, you
don’t have to preprocess data before storing it. You can
store as much data as you want and decide how to use it
later. That includes unstructured data like text, images
and videos.
• Low cost. The open-source framework is free and uses
commodity hardware to store large quantities of data.
• Scalability. You can easily grow your system to handle
more data simply by adding nodes. Little administration
is required.
• HDFS (the Hadoop Distributed File System) is the storage layer for
MapReduce (MR). It is a distributed file system that runs on commodity
hardware and was modeled on the Google File System (GFS).
• HDFS is the main data store for Hadoop applications. It splits files
into blocks (64 MB by default in Hadoop 1.x, 128 MB in later versions)
and stores those blocks on different nodes of a cluster, which enables
parallel computation by MR.
• An HDFS cluster has a single NameNode, which manages the file system
metadata, and multiple DataNodes, which store the actual data. A file is
divided into one or more blocks stored on DataNodes, and copies of each
block are placed on different DataNodes to prevent data loss.
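The block splitting and replication described above can be sketched in a few lines of Python. This is a toy model, not HDFS code: the function names, the replication factor of 3, and the round-robin placement are assumptions made for illustration (real HDFS placement is rack-aware).

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted on the slide
REPLICATION = 3                # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file is divided into."""
    if file_size <= 0:
        return []
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement: block index -> list of distinct DataNodes."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one replica after one node fails."""
    return {
        block
        for block, nodes in placement.items()
        if any(node != failed_node for node in nodes)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
    placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
    print(len(blocks), "blocks;", placement)
    print("readable after dn1 fails:", sorted(surviving_blocks(placement, "dn1")))
```

Because every block lives on several DataNodes, losing one node leaves every block readable in this model.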
• Hadoop MapReduce (MR) is modeled on Google's MapReduce.
• The MR framework (in Hadoop 1.x) consists of one JobTracker node and
multiple TaskTracker nodes. The JobTracker distributes and schedules
tasks; TaskTrackers receive Map or Reduce tasks from the JobTracker,
execute them, and report task status back to it.
• The MR framework and HDFS run on the same set of nodes, so tasks can
be scheduled on the nodes that already hold the data.
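The Map and Reduce tasks mentioned above can be illustrated with a single-process word count. This is only a sketch of the programming model, not Hadoop's Java API: all function names are hypothetical, and a real cluster would run the map and reduce calls as distributed tasks.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in-process over a list of input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data", "Big Hadoop data"]))
```

Each map call depends only on its own line and each reduce call only on its own key, which is what lets the framework spread them across TaskTracker nodes.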
• Hadoop YARN handles job scheduling and cluster resource management;
users can submit and kill applications through its REST API.
• In Hadoop, the combination of all the Java JAR files and classes
needed to run a MapReduce program is called a job.
• You can submit jobs from the command line or by posting them over
HTTP to the REST API. Under YARN, the ResourceManager takes over the
scheduling role that the JobTracker had in Hadoop 1.x.
• A job contains the tasks that execute the individual map and reduce
steps.
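As a rough illustration of the REST interaction described above, the sketch below builds (but does not send) two HTTP requests for the YARN ResourceManager. The host name and application ID are hypothetical, and the endpoint paths follow the ResourceManager REST API as documented for recent Hadoop releases; check them against the version you run.

```python
import json
import urllib.request

# Hypothetical ResourceManager address; 8088 is the usual default web port.
RM_URL = "http://resourcemanager.example.com:8088"


def new_application_request(rm_url=RM_URL):
    """POST that asks the ResourceManager for a new application ID."""
    return urllib.request.Request(
        f"{rm_url}/ws/v1/cluster/apps/new-application", method="POST"
    )


def kill_request(app_id, rm_url=RM_URL):
    """PUT that asks the ResourceManager to move an application to KILLED."""
    body = json.dumps({"state": "KILLED"}).encode("utf-8")
    return urllib.request.Request(
        f"{rm_url}/ws/v1/cluster/apps/{app_id}/state",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    # The application ID below is made up for illustration.
    for req in (new_application_request(),
                kill_request("application_1400000000000_0001")):
        print(req.get_method(), req.full_url)
```

Passing either request to `urllib.request.urlopen` would send it to a live cluster; that step is omitted so the sketch runs without one.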
• Ambari: a web-based tool for provisioning, managing, and
monitoring Apache Hadoop clusters. It includes support for
Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase,
ZooKeeper, Oozie, Pig, and Sqoop.
• Avro: a data serialization system.
• Cassandra: a scalable multi-master database with no single
point of failure.
• Chukwa: a data collection system for managing large
distributed systems.
• HBase: a scalable, distributed database that supports
structured data storage for large tables.
• Hive: a data warehouse infrastructure that provides data
summarization and ad hoc querying.
• Mahout: a scalable machine learning and data mining library.
• Pig: a high-level data flow language and execution framework
for parallel computation.
• Spark: a fast, general compute engine for Hadoop data. It
provides a simple and expressive programming model that
supports a wide range of applications, including ETL, machine
learning, stream processing, and graph computation.
• Tez: a generalized data flow programming framework built on
Hadoop YARN. It provides a powerful and flexible engine that
executes an arbitrary DAG of tasks to process data for both
batch and interactive use cases.
• ZooKeeper: a high-performance coordination service for
distributed applications.
