Faculty Name: Namrata Sharma / Arjun S. Parihar
Year/Branch: 3rd / CSE
Subject Code: CS-503(A)
Subject Name: Data Analytics
Learning Objectives
In this session you will learn about:
• Big data sources and challenges
• Big data technology
• Hadoop’s history and advantages
• Hadoop Ecosystem
Sources of Big Data
Big Data Challenges
The major challenges associated with big data are as follows:
• Capturing data
• Storage
• Searching
• Sharing
• Transfer
• Analysis
• Presentation
Different sources generate huge amounts of data. More data means more storage space, and more storage space means more money to spend. So what does storage cost?
Price required to store data on a cloud platform, per terabyte:

STORAGE PLATFORM    COST (PER TERABYTE)
IBM                 $10,000
ORACLE              $14,000
HADOOP              $333
TERADATA            $16,500

Prices may vary, since they depend on which storage tier the data is put in and for how many years it is kept.
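As a rough worked example using the per-terabyte figures from the table above (real prices vary by tier and retention period, so treat this as arithmetic on the quoted numbers, not a pricing tool):

```python
# Approximate cost per terabyte, taken from the table above.
COST_PER_TB = {"IBM": 10_000, "ORACLE": 14_000, "HADOOP": 333, "TERADATA": 16_500}

def storage_cost(platform: str, terabytes: int) -> int:
    """Rough total cost of storing `terabytes` TB on a given platform."""
    return COST_PER_TB[platform] * terabytes

# Compare the cost of storing 50 TB on each platform.
for platform in COST_PER_TB:
    print(platform, storage_cost(platform, 50))
```

The gap is striking: at these rates, 50 TB on Hadoop commodity storage costs less than 2 TB on any of the proprietary platforms.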
Traditional Approach
Data Storage - Traditional vs. New
DBMS
• Stores data as files
• Single-user support
• Low data volume (MB)

RDBMS
• Stores data in the form of tables
• Multi-user support
• High data volume (GB)

Big data technology
• Stores data of different formats
• Very high data volume (PB, EB, YB)
Big Data technologies
Before starting with the list of technologies, let us first see the broad classification of all these technologies. They can mainly be classified into 4 domains:
1. Data storage
2. Analytics
3. Data mining
4. Visualization
Types of Big Data Technologies
1. Data storage and Management
MongoDB is a document database with the scalability and flexibility that you want, and the querying and indexing that you need. It is used to store, process, and analyze Big Data.
Apache Cassandra is a distributed database that provides high availability and scalability without compromising performance.
Apache HBase is a column-oriented database management system that runs on top of HDFS, a main component of Apache Hadoop.
Apache ZooKeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination.
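To illustrate why a document database such as MongoDB suits data of varying shape, here is a minimal pure-Python sketch: plain dicts stand in for documents, and the `find` helper is hypothetical (it is not the real MongoDB API). The point is that documents in one collection need not share a schema, unlike rows in an RDBMS table.

```python
# A "collection" is just a list of documents (dicts); unlike rows in an
# RDBMS table, documents need not share the same fields.
collection = [
    {"_id": 1, "name": "Alice", "email": "alice@example.com"},
    {"_id": 2, "name": "Bob", "tags": ["admin", "ops"]},  # different fields: fine
]

def find(coll, **criteria):
    """Return all documents whose fields match every given criterion."""
    return [doc for doc in coll
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(collection, name="Bob"))  # matches only the second document
```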
2. Data cleaning
Data needs to be cleaned up and well-structured. Several tools help in defining and reshaping the data into usable data sets.
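As a minimal sketch of what "reshaping data into usable data sets" looks like in practice (pure Python; the field names and records are made up for illustration):

```python
# Raw records as they might arrive: inconsistent case, stray whitespace,
# missing values encoded as empty strings.
raw = [
    {"name": "  alice ", "age": "34"},
    {"name": "BOB", "age": ""},
]

def clean(record):
    """Normalize one record: trim and case-fold the name, parse age or use None."""
    return {
        "name": record["name"].strip().title(),
        "age": int(record["age"]) if record["age"].strip() else None,
    }

cleaned = [clean(r) for r in raw]
print(cleaned)  # [{'name': 'Alice', 'age': 34}, {'name': 'Bob', 'age': None}]
```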
3. Data Mining
Data mining is the process of discovering insights within a database. Several popular tools are used for data mining.
4. Data visualization
Visualization tools are a useful way of conveying complex data insights in a pictorial form that is easy to understand.
5. Data Reporting
6. Data ingestion
Data ingestion is the transportation of data from assorted sources to a storage medium where it can be accessed, used, and analyzed by an organization. Here, it means getting the data into Hadoop.
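A toy sketch of the ingestion idea: pulling records from assorted sources into one storage target. The sources here are simulated in-memory strings; a real Hadoop pipeline would use a dedicated tool such as Flume or Sqoop instead.

```python
# Two "sources" in different formats, simulated as in-memory data.
csv_source = "id,value\n1,10\n2,20"
log_source = ["id=3 value=30"]

def ingest_csv(text):
    """Parse CSV text into one dict per row."""
    header, *rows = text.splitlines()
    keys = header.split(",")
    for row in rows:
        yield dict(zip(keys, row.split(",")))

def ingest_log(lines):
    """Parse key=value log lines into dicts."""
    for line in lines:
        yield dict(pair.split("=") for pair in line.split())

# The "storage medium": one list that downstream analysis reads from.
store = list(ingest_csv(csv_source)) + list(ingest_log(log_source))
print(len(store))  # 3 records ingested from two different sources
```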
7. Data Analysis
Data analysis requires asking questions of the data and finding the answers. The popular tools used for data analysis are:
 Hive
 MapReduce
 Spark
 Pig
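The MapReduce model listed above can be sketched in a few lines of pure Python: a map step emits (key, value) pairs, a shuffle groups them by key, and a reduce step aggregates each group. This is a toy single-machine version of the classic word count, not Hadoop's distributed implementation.

```python
from collections import defaultdict

def map_phase(text):
    """Map: emit a (word, 1) pair for every word in the input."""
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    """Shuffle: group values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase("big data big hadoop big")))
print(counts)  # {'big': 3, 'data': 1, 'hadoop': 1}
```

In real Hadoop, the map and reduce functions run in parallel on many nodes, and the shuffle moves data between them over the network.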
Hadoop: A Parallel World
What is Hadoop?
 Apache top-level project; an open-source implementation of frameworks for
• Reliable, scalable, parallel and distributed computing
• Data storage
 It is a flexible and highly available architecture for large-scale computation.
 It abstracts and facilitates the storage and processing of large and/or rapidly growing data sets.
• Structured and unstructured data
• Simple programming models
 Uses commodity (cheap!) hardware with little redundancy
 Fault tolerance
 Moves computation rather than data
Brief History of Hadoop
Hadoop was designed to answer the question: “How can big data be processed at reasonable cost and in reasonable time?”
Hadoop Architecture
At its core, Hadoop has two major layers:
• MapReduce - the processing/computation layer
• Hadoop Distributed File System (HDFS) - the storage layer
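To make the storage layer concrete, here is a toy sketch of the HDFS idea: files are split into fixed-size blocks, and each block is replicated across several nodes for fault tolerance. The 3x replication factor matches the real HDFS default; the tiny 4-byte block size (HDFS uses 128 MB by default) and the simulated node list are purely for illustration.

```python
BLOCK_SIZE = 4    # toy block size; the real HDFS default is 128 MB
REPLICATION = 3   # HDFS default replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks, as HDFS does."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello big data")
print(len(blocks))  # 4 blocks of at most 4 bytes each
print(place_replicas(blocks, ["node1", "node2", "node3", "node4"]))
```

Because every block lives on several nodes, losing one machine costs no data, and MapReduce tasks can be scheduled on whichever node already holds a block ("move computation rather than data").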
Hadoop
 Hadoop is in use at most organizations that handle big data:
• Yahoo!
• Facebook
• Amazon
• Netflix, etc.
 Some examples of scale:
• Yahoo!’s Search Webmap runs on a 10,000-core Linux cluster and powers Yahoo! web search
• Facebook’s Hadoop cluster hosted 100+ PB of data (July 2012), growing at roughly ½ PB/day (November 2012)
Hadoop Ecosystem
Summary of the different technologies
Thank you
