Faculty Name: Namrata Sharma / Arjun S. Parihar
Year/Branch: 3rd / CSE
Subject Code: CS-503(A)
Subject Name: Data Analytics
Learning Objectives
In this session you will learn about:
• Big data sources and challenges
• Big data technology
• Hadoop’s history and advantages
• Hadoop Ecosystem
Sources of Big Data
Big Data Challenges
The major challenges associated with big data are as follows:
• Capturing data
• Storage
• Searching
• Sharing
• Transfer
• Analysis
• Presentation
Different sources generate huge amounts of data. More data means more storage space, and more storage space means more money to spend. So what does storage cost?
Price required to store data on a cloud platform, per terabyte:

STORAGE PLATFORM    COST (PER TERABYTE)
IBM                 $10,000
ORACLE              $14,000
HADOOP              $333
TERADATA            $16,500

Prices may vary, since they depend on which storage tier the data is put in and for how many years it is kept.
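As a rough worked example using the per-terabyte figures from the table above (real prices vary by tier and retention period, so treat this as arithmetic on the quoted numbers, not a pricing tool):

```python
# Approximate cost per terabyte, taken from the table above.
COST_PER_TB = {"IBM": 10_000, "ORACLE": 14_000, "HADOOP": 333, "TERADATA": 16_500}

def storage_cost(platform: str, terabytes: int) -> int:
    """Rough total cost of storing `terabytes` TB on a given platform."""
    return COST_PER_TB[platform] * terabytes

# Compare the cost of storing 50 TB on each platform.
for platform in COST_PER_TB:
    print(platform, storage_cost(platform, 50))
```

The gap is striking: at these rates, 50 TB on Hadoop commodity storage costs less than 2 TB on any of the proprietary platforms.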
Traditional Approach
Data Storage - Traditional vs. New
DBMS
• Stores data as files
• Single-user support
• Low data volume (MB)

RDBMS
• Stores data in the form of tables
• Multi-user support
• High data volume (GB)

Big data technology
• Stores data of different formats
• Very high data volume (PB, EB, YB)
Big Data technologies
Before starting with the list of technologies, let us first see the broad classification of all these technologies. They can mainly be classified into 4 domains:
1. Data storage
2. Analytics
3. Data mining
4. Visualization
Types of Big Data Technologies
1. Data storage and Management
MongoDB is a document database with the scalability and flexibility that you want, and the querying and indexing that you need. It is used to store, process, and analyze Big Data.
Apache Cassandra is a distributed database that provides high availability and scalability without compromising performance.
Apache HBase is a column-oriented database management system that runs on top of HDFS, a main component of Apache Hadoop.
Apache ZooKeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination.
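To illustrate why a document database such as MongoDB suits data of varying shape, here is a minimal pure-Python sketch: plain dicts stand in for documents, and the `find` helper is hypothetical (it is not the real MongoDB API). The point is that documents in one collection need not share a schema, unlike rows in an RDBMS table.

```python
# A "collection" is just a list of documents (dicts); unlike rows in an
# RDBMS table, documents need not share the same fields.
collection = [
    {"_id": 1, "name": "Alice", "email": "alice@example.com"},
    {"_id": 2, "name": "Bob", "tags": ["admin", "ops"]},  # different fields: fine
]

def find(coll, **criteria):
    """Return all documents whose fields match every given criterion."""
    return [doc for doc in coll
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(collection, name="Bob"))  # matches only the second document
```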
2. Data cleaning
Data needs to be cleaned up and well-structured. Several tools help in defining and reshaping the data into usable data sets.
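As a minimal sketch of what "reshaping data into usable data sets" looks like in practice (pure Python; the field names and records are made up for illustration):

```python
# Raw records as they might arrive: inconsistent case, stray whitespace,
# missing values encoded as empty strings.
raw = [
    {"name": "  alice ", "age": "34"},
    {"name": "BOB", "age": ""},
]

def clean(record):
    """Normalize one record: trim and case-fold the name, parse age or use None."""
    return {
        "name": record["name"].strip().title(),
        "age": int(record["age"]) if record["age"].strip() else None,
    }

cleaned = [clean(r) for r in raw]
print(cleaned)  # [{'name': 'Alice', 'age': 34}, {'name': 'Bob', 'age': None}]
```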
3. Data Mining
Data mining is the process of discovering insights within a database. Several popular tools are used for data mining.
4. Data visualization
Visualization tools are a useful way of conveying complex data insights in a pictorial form that is easy to understand.
5. Data Reporting
6. Data ingestion
Data ingestion is the transportation of data from assorted sources to a storage medium where it can be accessed, used, and analyzed by an organization. Here, it means getting the data into Hadoop.
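A toy sketch of the ingestion idea: pulling records from assorted sources into one storage target. The sources here are simulated in-memory strings; a real Hadoop pipeline would use a dedicated tool such as Flume or Sqoop instead.

```python
# Two "sources" in different formats, simulated as in-memory data.
csv_source = "id,value\n1,10\n2,20"
log_source = ["id=3 value=30"]

def ingest_csv(text):
    """Parse CSV text into one dict per row."""
    header, *rows = text.splitlines()
    keys = header.split(",")
    for row in rows:
        yield dict(zip(keys, row.split(",")))

def ingest_log(lines):
    """Parse key=value log lines into dicts."""
    for line in lines:
        yield dict(pair.split("=") for pair in line.split())

# The "storage medium": one list that downstream analysis reads from.
store = list(ingest_csv(csv_source)) + list(ingest_log(log_source))
print(len(store))  # 3 records ingested from two different sources
```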
7. Data Analysis
Data analysis requires asking questions of the data and finding the answers. The popular tools used for data analysis are:
 Hive
 MapReduce
 Spark
 Pig
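The MapReduce model listed above can be sketched in a few lines of pure Python: a map step emits (key, value) pairs, a shuffle groups them by key, and a reduce step aggregates each group. This is a toy single-machine version of the classic word count, not Hadoop's distributed implementation.

```python
from collections import defaultdict

def map_phase(text):
    """Map: emit a (word, 1) pair for every word in the input."""
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    """Shuffle: group values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase("big data big hadoop big")))
print(counts)  # {'big': 3, 'data': 1, 'hadoop': 1}
```

In real Hadoop, the map and reduce functions run in parallel on many nodes, and the shuffle moves data between them over the network.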
Hadoop: A Parallel World
What is Hadoop?
 Apache top-level project; an open-source implementation of frameworks for
• Reliable, scalable, parallel and distributed computing
• Data storage
 It is a flexible and highly available architecture for large-scale computation.
 It abstracts and facilitates the storage and processing of large and/or rapidly growing data sets.
• Structured and unstructured data
• Simple programming models
 Uses commodity (cheap!) hardware with little redundancy
 Fault tolerance
 Moves computation rather than data
Brief History of Hadoop
Hadoop was designed to answer the question: “How can big data be processed at reasonable cost and in reasonable time?”
Hadoop Architecture
At its core, Hadoop has two major layers:
• MapReduce - the processing/computation layer
• Hadoop Distributed File System (HDFS) - the storage layer
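To make the storage layer concrete, here is a toy sketch of the HDFS idea: files are split into fixed-size blocks, and each block is replicated across several nodes for fault tolerance. The 3x replication factor matches the real HDFS default; the tiny 4-byte block size (HDFS uses 128 MB by default) and the simulated node list are purely for illustration.

```python
BLOCK_SIZE = 4    # toy block size; the real HDFS default is 128 MB
REPLICATION = 3   # HDFS default replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks, as HDFS does."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello big data")
print(len(blocks))  # 4 blocks of at most 4 bytes each
print(place_replicas(blocks, ["node1", "node2", "node3", "node4"]))
```

Because every block lives on several nodes, losing one machine costs no data, and MapReduce tasks can be scheduled on whichever node already holds a block ("move computation rather than data").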
Hadoop
 Hadoop is in use at most organizations that handle big data:
• Yahoo!
• Facebook
• Amazon
• Netflix, etc.
 Some examples of scale:
• Yahoo!’s Search Webmap runs on a 10,000-core Linux cluster and powers Yahoo! web search
• Facebook’s Hadoop cluster hosted 100+ PB of data (July 2012), growing at roughly ½ PB/day (November 2012)
Hadoop Ecosystem
Summary of the different technologies
Thank you
