Lecture 5: Big Data Storage and Infrastructure
Big Data
Characteristics of Big Data: The four Vs
● A crucial part of the rise of data science is the steep increase in the amount and availability of data
● According to IBM scientists, big data can be analyzed along four dimensions
Data Types
Analysis of structured data
• Tools
• OLAP, SQLite, MySQL, PostgreSQL
• Use cases
• Customer Relationship Management (CRM)
• Online bookings
• Accounting
Analysis of unstructured data
● ML algorithms + NLP
● Tools
○ MongoDB, DynamoDB, Hadoop
● Use cases
○ Sentiment analysis, topic analysis, language detection, intent detection
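As a toy illustration of one of these use cases, a sentiment analyzer can be sketched with simple keyword matching (real pipelines use trained ML/NLP models; the word lists here are made up):

```python
# Minimal keyword-based sentiment sketch; illustrative only.
# Real systems use trained ML/NLP models, and these word lists are invented.
POSITIVE = {"great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "hate", "terrible", "sad"}

def sentiment(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this great product"))   # positive
print(sentiment("terrible service, I hate it")) # negative
```

The point is only that unstructured text has no fixed schema: the input is raw strings, and structure (a sentiment label) is extracted by the algorithm.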
Hardware and Storage
Big Data Hardware
● Need to think of:
○ Data collection hardware
○ Data storage hardware
○ Data processing hardware
Data Collection Hardware
● Smartphones, cameras, cars, watches, security systems, motion sensors,
credit card terminals etc.
● Capture Requirements
○ Data accuracy
○ Real time transmission
○ Compatibility with analytical systems
○ Support for standard protocols, e.g., IEEE 802.11, Z-Wave, ZigBee, Bluetooth
Data Storage Hardware
● Big Data requires big hardware
○ Powerful hardware optimized for processing lots of information
● Even small applications generate huge amounts of information
● A traditional single server is insufficient
● We need massive data stored on multiple optimized nodes
Data Science Supportive Hardware Trends
● Cloud technology
● Solid State Drives (SSDs)
● AI-focused chips
Cloud Technology
● No need to buy physical servers; hardware can be rented on the cloud
● Benefits include:
○ access to specialized resources,
○ quick deployment,
○ easily expanded capacity,
○ the ability to discontinue a cloud
service when it is no longer needed,
○ cost savings, and
○ good backup and recovery.
Cloud Technology
Software as a Service (SaaS): the vendor provides the hardware, application software, operating system, and storage.
Platform as a Service (PaaS): differs from SaaS in that the vendor does not provide the software for building or running specific applications; this is up to the company. Only the basic platform is provided.
Infrastructure as a Service (IaaS): the vendor provides raw computing power and storage; neither operating system nor application software is included. Customers upload an image that includes the application and operating system.
Solid State Drive (SSD)
● Faster
● No moving parts
● Smaller
● Best for storing frequently accessed data
Processors?
● Usual processor: the Central Processing Unit (CPU)
● Can scale by adding more cores (multi-core)
● However, scaling is limited.
Processors for Big Data Analytics
Chips that have been specially designed for use in the field of Artificial Intelligence.
Examples:
● Graphics Processing Units (GPUs)
● Application-Specific Integrated Circuits (ASICs), such as TPUs (Tensor Processing Units)
Databases
Relational Databases
● Queries are issued using Structured Query Language (SQL)
● Used for storing structured data
● Examples:
○ MySQL
○ MariaDB
○ Oracle
○ PostgreSQL
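A minimal example of querying structured data with SQL, using SQLite (one of the tools listed earlier) embedded in Python; the customers table and its values are hypothetical:

```python
import sqlite3

# In-memory SQLite database; the table and data are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, last_purchase REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Alice", 120.0), (2, "Bob", 80.0), (3, "Carol", 40.0)],
)

# Declarative SQL query over structured (tabular) data
avg = conn.execute("SELECT AVG(last_purchase) FROM customers").fetchone()[0]
print(avg)  # 80.0
```

The schema (fixed columns with types) is exactly what makes this data "structured": the query engine can rely on it to plan and execute the aggregation.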
Databases
● Traditional databases: relational databases
● Consist of tables (rows and columns)
● Two types:
○ Row-oriented databases
○ Column-oriented databases
Row-Oriented Databases
● Scenario: Updating Data
Use Case: Update the Last Purchase Amount for a specific customer.
Efficiency: Highly efficient. It can quickly locate the row and update the single entry.
● Scenario: Aggregating a Single Column
Use Case: Calculate the average Last Purchase Amount.
Efficiency: Less efficient. The database has to read through all rows, picking out the Last Purchase Amount from each, which can be slow if the dataset is large.
Column-Oriented Databases
● Scenario: Updating Data
Use Case: Update the Last Purchase Amount for a specific customer.
Efficiency: Less efficient compared to row-oriented. It needs to locate the right column and then find the specific customer within that column.
● Scenario: Aggregating a Single Column
Use Case: Calculate the average Last Purchase Amount.
Efficiency: Highly efficient. The database can quickly aggregate this single column as it doesn't need to read through the entire dataset, only the relevant column.
Row-Oriented vs. Column-Oriented Databases
• Row-Oriented Database: Best for transactional operations or scenarios where entire records are frequently accessed or updated together.
• Column-Oriented Database: Ideal for analytical queries and operations that require fast read access to specific columns for aggregation, like in data warehousing.
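The two layouts can be sketched in plain Python to make the trade-off concrete; the customer data is made up:

```python
# Row-oriented layout: one record (dict) per entry
rows = [
    {"id": 1, "name": "Alice", "last_purchase": 120.0},
    {"id": 2, "name": "Bob", "last_purchase": 80.0},
    {"id": 3, "name": "Carol", "last_purchase": 40.0},
]

# Column-oriented layout: one list per attribute
columns = {
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "last_purchase": [120.0, 80.0, 40.0],
}

# Updating one customer's record is natural in the row layout:
# the whole record sits in one place.
rows[1]["last_purchase"] = 95.0

# Aggregating one attribute: the column layout touches a single list,
# while the row layout must visit every record.
avg_col = sum(columns["last_purchase"]) / len(columns["last_purchase"])
avg_row = sum(r["last_purchase"] for r in rows) / len(rows)
print(avg_col)  # 80.0
```

Real engines add indexing, compression, and disk layout on top, but the access-pattern asymmetry is the same one the slides describe.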
Big data databases
● Remember the 4 Vs (Volume, Velocity, Variety, Veracity)?
● Databases need to handle all these characteristics
● Commonly known as NoSQL (Not Only SQL)
NoSQL Databases
● Can accommodate unstructured data
● No need to store data in rows and columns; several data models are acceptable (files, graphs, etc.)
● Do not rely on SQL to retrieve data (though some do support SQL)
● Data is stored and retrieved “as is” through key-value pairs that use keys to provide links to where files are stored on disk
● Examples:
○ Apache Hadoop
○ Apache Cassandra
○ MongoDB
○ Couchbase
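The key-value idea can be sketched in a few lines; in a real NoSQL system the values would live in files on disk (or spread across nodes), with the key mapping to their location, and all names here are illustrative:

```python
# Minimal in-memory key-value store sketch (illustrative only).
store = {}

def put(key, value):
    # The value can be any shape: JSON-like dict, binary blob, graph, ...
    store[key] = value

def get(key):
    return store.get(key)  # None if the key is absent

put("user:42", {"name": "Alice", "tags": ["premium"], "visits": 17})
put("session:9", b"\x00\x01binary-blob")

print(get("user:42")["name"])  # Alice
```

Note there is no schema: `user:42` and `session:9` hold completely different shapes of data, which is precisely the flexibility NoSQL trades for relational guarantees.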
OLTP vs. OLAP
OLTP – Online Transaction Processing systems
OLAP – Online Analytical Processing systems
OLTP and OLAP
https://www.geeksforgeeks.org/difference-between-olap-and-oltp-in-dbms/
Data warehousing
Typical setup
Big data analytics pipeline
● Data sources
● Data storage
● Data applications
Typical Data Storage Implementation
Solution
Move the algorithms to the data instead of the data
to the algorithms
Advantages
● No data movement
● Faster performance
● High security
● Scalability
● Real-time deployment and environments
● Production deployment
Hadoop
● Most widely used technology for Big Data
● Apache top-level project: an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.
● It is a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
Goals / Requirements:
• Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
• Structured, semi-structured, and unstructured data
• High scalability and availability
• Use commodity (cheap!) hardware with little redundancy
• Fault-tolerance
• Move computation rather than data
● https://youtu.be/aReuLtY0YMI?si=mjbjZ6Hpyd3S4n5c
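The "move computation rather than data" goal is the core of the MapReduce model behind Hadoop. A pure-Python sketch of the idea (not Hadoop's actual API; the shards and their contents are hypothetical):

```python
from collections import Counter

# Each "node" runs map + a local reduce (combine) on its own data shard,
# so only small partial counts ever move across the network.
shards = [  # data assumed already distributed across nodes
    "big data needs big hardware",
    "move computation to the data",
]

def map_reduce_local(shard: str) -> Counter:
    # map: split into words; combine: count locally on the node
    return Counter(shard.split())

partials = [map_reduce_local(s) for s in shards]  # runs where the data lives
totals = sum(partials, Counter())                 # final reduce of small results
print(totals["data"])  # 2
```

In real Hadoop the shards are HDFS blocks and the scheduler tries to place each map task on a node that already holds its block, which is exactly the "computation to the data" principle in the list above.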
Questions?
Something to research
● What do you think the ChatGPT hardware infrastructure looks like?
● Amazon data centers?
● Google data centers?
How was the test?
Paper presentations – Next week (25th Jan 2023)
● Group 6: Biswas, S., Wardat, M., & Rajan, H. (2022, May). The art and practice of data science pipelines: A comprehensive study of data science pipelines in theory, in-the-small, and in-the-large. In Proceedings of the 44th International Conference on Software Engineering (pp. 2091-2103).
● Group 7: Talib, M. A., Majzoub, S., Nasir, Q., & Jamal, D. (2021). A systematic literature review on hardware implementation of artificial intelligence algorithms. The Journal of Supercomputing, 77(2), 1897-1938.
● Group 8: Ngo, V. M., Le-Khac, N. A., & Kechadi, M. (2019, June). Designing and implementing data warehouse for agricultural big data. In International Conference on Big Data (pp. 1-17). Springer, Cham.
