Lecture 5: Big Data Storage and Infrastructure
Big Data
Characteristics of Big Data: The four Vs
● A crucial part of the rise of data science is the steep increase in the amount and availability of data
● According to IBM scientists, big data can be analyzed along four dimensions
Data Types
Analysis of structured data
• Tools
• OLAP, SQLite, MySQL, PostgreSQL
• Use cases
• Customer Relationship Management (CRM)
• Online bookings
• Accounting
Analysis of unstructured data
● ML algorithms + NLP
● Tools
○ MongoDB, DynamoDB, Hadoop
● Use cases
○ Sentiment analysis, topic analysis, language detection, intent detection
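As a toy illustration of one of these use cases, a sentiment analyzer can be sketched with simple keyword matching (real pipelines use trained ML/NLP models; the word lists here are made up):

```python
# Minimal keyword-based sentiment sketch; illustrative only.
# Real systems use trained ML/NLP models, and these word lists are invented.
POSITIVE = {"great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "hate", "terrible", "sad"}

def sentiment(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this great product"))   # positive
print(sentiment("terrible service, I hate it")) # negative
```

The point is only that unstructured text has no fixed schema: the input is raw strings, and structure (a sentiment label) is extracted by the algorithm.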
Hardware and Storage
Big Data Hardware
● Need to think of:
○ Data collection hardware
○ Data storage hardware
○ Data processing hardware
Data Collection Hardware
● Smartphones, cameras, cars, watches, security systems, motion sensors,
credit card terminals etc.
● Capture Requirements
○ Data accuracy
○ Real time transmission
○ Compatibility with analytical systems
○ Support for standard protocols, e.g., IEEE 802.11, Z-Wave, ZigBee, Bluetooth
Data Storage Hardware
● Big Data requires big hardware
○ Powerful hardware optimized for processing lots of information
● Even small applications generate huge amounts of information
● A traditional single server is insufficient
● We need massive data stored on multiple optimized nodes
Data Science Supportive Hardware Trends
● Cloud technology
● Solid State Drives (SSDs)
● AI-focused chips
Cloud Technology
● No need to buy physical servers; hardware can be rented on the cloud
● Benefits include:
○ access to specialized resources,
○ quick deployment,
○ easily expanded capacity,
○ the ability to discontinue a cloud
service when it is no longer needed,
○ cost savings, and
○ good backup and recovery.
Cloud Technology
Software as a Service (SaaS): the vendor provides the hardware, application software, operating system, and storage.
Platform as a Service (PaaS): differs from SaaS in that the vendor does not provide the software for building or running specific applications; this is up to the company. Only the basic platform is provided.
Infrastructure as a Service (IaaS): the vendor provides raw computing power and storage; neither operating system nor application software is included. Customers upload an image that includes the application and operating system.
Solid State Drive (SSD)
● Faster
● No moving parts
● Smaller
● Best for storing frequently accessed data
Processors?
● Usual processor: the Central Processing Unit (CPU)
● Can scale by adding more cores (multi-core)
● However, scaling is limited.
Processors for Big Data Analytics
Chips that have been specially designed for use in the field of Artificial Intelligence.
Examples:
● Graphics Processing Units (GPUs)
● Application-Specific Integrated Circuits (ASICs), such as TPUs (Tensor Processing Units)
Databases
Relational Databases
● Queries are issued using Structured Query Language (SQL)
● Used for storing structured data
● Examples:
○ MySQL
○ MariaDB
○ Oracle
○ PostgreSQL
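A minimal example of querying structured data with SQL, using SQLite (one of the tools listed earlier) embedded in Python; the customers table and its values are hypothetical:

```python
import sqlite3

# In-memory SQLite database; the table and data are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, last_purchase REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Alice", 120.0), (2, "Bob", 80.0), (3, "Carol", 40.0)],
)

# Declarative SQL query over structured (tabular) data
avg = conn.execute("SELECT AVG(last_purchase) FROM customers").fetchone()[0]
print(avg)  # 80.0
```

The schema (fixed columns with types) is exactly what makes this data "structured": the query engine can rely on it to plan and execute the aggregation.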
Databases
● Traditional databases: relational databases
● Consist of tables (rows and columns)
● Two types:
○ Row-oriented databases
○ Column-oriented databases
Row-Oriented Databases
● Scenario: Updating Data
Use Case: Update the Last Purchase Amount for a specific customer.
Efficiency: Highly efficient. It can quickly locate the row and update the single entry.
● Scenario: Aggregating a Single Column
Use Case: Calculate the average Last Purchase Amount.
Efficiency: Less efficient. The database has to read through all rows, picking out the Last Purchase Amount from each, which can be slow if the dataset is large.
Column-Oriented Databases
● Scenario: Updating Data
Use Case: Update the Last Purchase Amount for a specific customer.
Efficiency: Less efficient compared to row-oriented. It needs to locate the right column and then find the specific customer within that column.
● Scenario: Aggregating a Single Column
Use Case: Calculate the average Last Purchase Amount.
Efficiency: Highly efficient. The database can quickly aggregate this single column as it doesn't need to read through the entire dataset, only the relevant column.
Row-Oriented vs. Column-Oriented Databases
• Row-Oriented Database: Best for transactional operations or scenarios where entire records are frequently accessed or updated together.
• Column-Oriented Database: Ideal for analytical queries and operations that require fast read access to specific columns for aggregation, like in data warehousing.
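The two layouts can be sketched in plain Python to make the trade-off concrete; the customer data is made up:

```python
# Row-oriented layout: one record (dict) per entry
rows = [
    {"id": 1, "name": "Alice", "last_purchase": 120.0},
    {"id": 2, "name": "Bob", "last_purchase": 80.0},
    {"id": 3, "name": "Carol", "last_purchase": 40.0},
]

# Column-oriented layout: one list per attribute
columns = {
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "last_purchase": [120.0, 80.0, 40.0],
}

# Updating one customer's record is natural in the row layout:
# the whole record sits in one place.
rows[1]["last_purchase"] = 95.0

# Aggregating one attribute: the column layout touches a single list,
# while the row layout must visit every record.
avg_col = sum(columns["last_purchase"]) / len(columns["last_purchase"])
avg_row = sum(r["last_purchase"] for r in rows) / len(rows)
print(avg_col)  # 80.0
```

Real engines add indexing, compression, and disk layout on top, but the access-pattern asymmetry is the same one the slides describe.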
Big data databases
● Remember the 4 Vs (Volume, Velocity, Variety, Veracity)?
● Databases need to handle all these characteristics
● Commonly known as NoSQL (Not Only SQL)
NoSQL Databases
● Can accommodate unstructured data
● No need to store data in rows and columns; several data models are acceptable (files, graphs, etc.)
● Do not rely on SQL to retrieve data (though some do support SQL)
● Data is stored and retrieved “as is” through key-value pairs that use keys to provide links to where files are stored on disk
● Examples:
○ Apache Hadoop
○ Apache Cassandra
○ MongoDB
○ Couchbase
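The key-value idea can be sketched in a few lines; in a real NoSQL system the values would live in files on disk (or spread across nodes), with the key mapping to their location, and all names here are illustrative:

```python
# Minimal in-memory key-value store sketch (illustrative only).
store = {}

def put(key, value):
    # The value can be any shape: JSON-like dict, binary blob, graph, ...
    store[key] = value

def get(key):
    return store.get(key)  # None if the key is absent

put("user:42", {"name": "Alice", "tags": ["premium"], "visits": 17})
put("session:9", b"\x00\x01binary-blob")

print(get("user:42")["name"])  # Alice
```

Note there is no schema: `user:42` and `session:9` hold completely different shapes of data, which is precisely the flexibility NoSQL trades for relational guarantees.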
OLTP vs. OLAP
OLTP – Online Transaction Processing systems
OLAP – Online Analytical Processing systems
OLTP and OLAP
https://www.geeksforgeeks.org/difference-between-olap-and-oltp-in-dbms/
Data warehousing
Typical setup
Big data analytics pipeline
● Data sources
● Data storage
● Data applications
Typical Data Storage Implementation
Solution
Move the algorithms to the data instead of the data
to the algorithms
Advantages
● No data movement
● Faster performance
● High security
● Scalability
● Real-time deployment and environments
● Production deployment
Hadoop
● Most widely used technology for Big Data
● Apache top-level project: an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.
● It is a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
Goals / Requirements:
• Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
• Structured, semi-structured, and unstructured data
• High scalability and availability
• Use commodity (cheap!) hardware with little redundancy
• Fault-tolerance
• Move computation rather than data
● https://youtu.be/aReuLtY0YMI?si=mjbjZ6Hpyd3S4n5c
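The "move computation rather than data" goal is the core of the MapReduce model behind Hadoop. A pure-Python sketch of the idea (not Hadoop's actual API; the shards and their contents are hypothetical):

```python
from collections import Counter

# Each "node" runs map + a local reduce (combine) on its own data shard,
# so only small partial counts ever move across the network.
shards = [  # data assumed already distributed across nodes
    "big data needs big hardware",
    "move computation to the data",
]

def map_reduce_local(shard: str) -> Counter:
    # map: split into words; combine: count locally on the node
    return Counter(shard.split())

partials = [map_reduce_local(s) for s in shards]  # runs where the data lives
totals = sum(partials, Counter())                 # final reduce of small results
print(totals["data"])  # 2
```

In real Hadoop the shards are HDFS blocks and the scheduler tries to place each map task on a node that already holds its block, which is exactly the "computation to the data" principle in the list above.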
Questions?
Something to research
● What do you think the ChatGPT hardware infrastructure looks like?
● Amazon data centers?
● Google data centers?
How was the test?
Paper presentations – Next week (25th Jan 2023)
● Group 6: Biswas, S., Wardat, M., & Rajan, H. (2022, May). The art and practice of data science pipelines: A comprehensive study of data science pipelines in theory, in-the-small, and in-the-large. In Proceedings of the 44th International Conference on Software Engineering (pp. 2091-2103).
● Group 7: Talib, M. A., Majzoub, S., Nasir, Q., & Jamal, D. (2021). A systematic literature review on hardware implementation of artificial intelligence algorithms. The Journal of Supercomputing, 77(2), 1897-1938.
● Group 8: Ngo, V. M., Le-Khac, N. A., & Kechadi, M. (2019, June). Designing and implementing data warehouse for agricultural big data. In International Conference on Big Data (pp. 1-17). Springer, Cham.
