2. What is RDBMS?
Stands for “Relational Database Management System”.
The data in RDBMS is stored in database objects called tables.
A table is a collection of related data entries and it consists of columns and rows.
Each RDBMS hosts one or more databases.
Each database consists of one or more tables.
3. Sample Employee Table on RDBMS
Emp_no  DOB         First Name  Last Name     Gender  Date Joined
499990  11/3/1963   Khaled      Kohling       M       10/10/1985
499991  2/26/1962   Pohua       Sichman       F       1/12/1989
499992  10/12/1960  Siamak      Salverda      F       5/10/1987
499993  6/4/1963    DeForest    Mullainathan  M       4/7/1997
499994  2/26/1952   Navin       Argence       F       4/24/1990
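As a rough illustration of how such a table could be defined and loaded, here is a minimal sketch in Python. It assumes an in-memory SQLite database as a stand-in for a full RDBMS, and the column names (dob, date_joined, etc.) and ISO date format are illustrative choices, not taken from the original system.

```python
import sqlite3

# In-memory SQLite stands in for the RDBMS here (an assumption for illustration).
conn = sqlite3.connect(":memory:")

# The schema is declared up front: every column has a name and a type.
conn.execute(
    """
    CREATE TABLE employees (
        emp_no      INTEGER PRIMARY KEY,
        dob         TEXT NOT NULL,   -- stored as ISO 8601 text (YYYY-MM-DD)
        first_name  TEXT NOT NULL,
        last_name   TEXT NOT NULL,
        gender      TEXT NOT NULL CHECK (gender IN ('M', 'F')),
        date_joined TEXT NOT NULL
    )
    """
)

# The five sample rows from the slide, with dates rewritten as YYYY-MM-DD.
rows = [
    (499990, "1963-11-03", "Khaled", "Kohling", "M", "1985-10-10"),
    (499991, "1962-02-26", "Pohua", "Sichman", "F", "1989-01-12"),
    (499992, "1960-10-12", "Siamak", "Salverda", "F", "1987-05-10"),
    (499993, "1963-06-04", "DeForest", "Mullainathan", "M", "1997-04-07"),
    (499994, "1952-02-26", "Navin", "Argence", "F", "1990-04-24"),
]
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

# Inspect the table: its columns and how many rows it holds.
column_names = [info[1] for info in conn.execute("PRAGMA table_info(employees)")]
row_count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
print(column_names)
print(row_count)
```

Each row is one employee and each column is one attribute, which is the "collection of related data entries" described earlier.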
4. What is SQL?
Stands for “Structured Query Language”.
SQL lets you access and manipulate databases.
SQL is an ANSI (American National Standards Institute) standard.
Major commands include SELECT, INSERT, UPDATE and DELETE, usually combined with a WHERE clause to filter rows.
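The four commands can be sketched in a few lines of Python. This is only an illustration: it assumes SQLite (from the Python standard library) rather than MySQL, uses a cut-down employees table, and the replacement surname "Example" is a made-up value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "emp_no INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, gender TEXT)"
)

# INSERT: add rows to the table.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (499990, "Khaled", "Kohling", "M"),
        (499991, "Pohua", "Sichman", "F"),
        (499992, "Siamak", "Salverda", "F"),
    ],
)

# SELECT ... WHERE: read only the rows that match a condition.
female = conn.execute(
    "SELECT first_name FROM employees WHERE gender = ? ORDER BY emp_no", ("F",)
).fetchall()

# UPDATE ... WHERE: change a value in matching rows ("Example" is hypothetical).
conn.execute(
    "UPDATE employees SET last_name = ? WHERE emp_no = ?", ("Example", 499991)
)

# DELETE ... WHERE: remove matching rows.
conn.execute("DELETE FROM employees WHERE emp_no = ?", (499992,))

remaining = conn.execute(
    "SELECT emp_no, last_name FROM employees ORDER BY emp_no"
).fetchall()
print(female)
print(remaining)
```

Note that WHERE appears inside SELECT, UPDATE and DELETE as a filter rather than as a command of its own.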
5. Demo on MySQL
Querying a relational database of more than 300K employees, hosted on the AWS cloud (free tier).
6. Data Warehouse
An OLAP database used mainly for analytical purposes, such as analyzing historical trends and patterns, rather than for daily operational transactions.
Imports data from various sources, typically different databases and ERP systems.
The process of importing transactional data into the warehouse and reshaping it along the way is referred to as Extraction, Transformation and Loading (ETL).
Provides summarized, multi-dimensional views of consolidated data, i.e. a data cube. Each dimension gives context from a different perspective: the time dimension shows the breakdown of sales by year, quarter, month, day and hour, while the product dimension shows which products bring in the most revenue.
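To make the ETL and data cube ideas concrete, here is a small sketch in plain Python. Everything in it is hypothetical: the two source systems (pos_rows, erp_rows), their field names and the sales figures are invented for illustration, and a real warehouse would use a dedicated ETL tool and OLAP engine instead.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw transactions from two source systems with different formats.
pos_rows = [
    {"sold_at": "2023-01-15 09:30", "product": "laptop", "amount": "1200.00"},
    {"sold_at": "2023-04-02 14:05", "product": "phone", "amount": "800.00"},
]
erp_rows = [
    {"date": "15/07/2023", "item": "LAPTOP", "total_cents": 110000},
    {"date": "03/02/2024", "item": "PHONE", "total_cents": 75000},
]


def to_fact(timestamp, product, amount):
    """Build one warehouse row with explicit time and product dimensions."""
    return {
        "year": timestamp.year,
        "quarter": (timestamp.month - 1) // 3 + 1,
        "product": product.lower(),
        "amount": amount,
    }


def extract_transform():
    """Extract rows from both sources and transform them to one common shape."""
    for row in pos_rows:
        ts = datetime.strptime(row["sold_at"], "%Y-%m-%d %H:%M")
        yield to_fact(ts, row["product"], float(row["amount"]))
    for row in erp_rows:
        ts = datetime.strptime(row["date"], "%d/%m/%Y")
        yield to_fact(ts, row["item"], row["total_cents"] / 100)


# Load: here the "warehouse" is just a list of consolidated fact rows.
warehouse = list(extract_transform())


def roll_up(facts, *dimensions):
    """Sum the sales amount grouped by the chosen dimensions."""
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(fact[d] for d in dimensions)
        totals[key] += fact["amount"]
    return dict(totals)


by_year = roll_up(warehouse, "year")
by_product = roll_up(warehouse, "product")
by_year_quarter = roll_up(warehouse, "year", "quarter")
print(by_year)
print(by_product)
print(by_year_quarter)
```

The same fact rows answer different questions depending on which dimensions are passed to roll_up, which is the essence of viewing a cube from different perspectives.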
7. Problems of Relational Database & Warehouse
Require schemas or data models planned up front, i.e. strict data types that are difficult to change later.
Not suited for unstructured data, e.g. social media posts, story books, news, journals and photos.
The large majority of real-world data is unstructured (industry estimates commonly cite 80% or more).
Expensive, hard to adapt and hard to scale (capacity is typically measured in gigabytes, or terabytes at best), and often require specialized hardware and licensed proprietary software.
Almost all Big Tech companies (Facebook, Google, Yahoo! etc.) decided long ago that a traditional RDBMS is a poor fit for their data business models, which change frequently and are measured in petabytes, exabytes, zettabytes and beyond.
However, many data analysts are familiar with traditional ETL and BI concepts but not with
Hadoop programming.
8. Hadoop + NoSQL
This led to the development of the Hadoop Distributed File System (HDFS) and associated NoSQL databases such as Cassandra and HBase.
NoSQL stands for “Not Only SQL”.
Ability to store data in raw formats and decide what to do with it later, i.e. data lakes.
NoSQL stores can enforce a schema or be schema-less, which makes them flexible and adaptable to change.
Well suited for both structured and unstructured big data.
Ability to dynamically expand using cheap commodity hardware and free open-source
software.
The cost of storing data in a Hadoop solution grows roughly linearly with the volume of data, and there is no hard upper limit.
The Hadoop + NoSQL ecosystem brings familiar data warehouse and BI concepts back to Hadoop.
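The "store raw now, decide later" idea can be sketched without any Hadoop or NoSQL software at all. The example below is an assumption-laden illustration in plain Python: the JSON records, their field names and the "tweet"/"photo" types are invented, and a Python list stands in for the data lake.

```python
import json

# Hypothetical raw records landed as JSON lines; no schema is enforced on write,
# so each record may carry a different set of fields.
raw_lines = [
    '{"id": 1, "type": "tweet", "text": "Hadoop rocks", "likes": 12}',
    '{"id": 2, "type": "photo", "url": "photo-2.jpg", "tags": ["cat", "sofa"]}',
    '{"id": 3, "type": "tweet", "text": "NoSQL is flexible"}',
]
documents = [json.loads(line) for line in raw_lines]

# Schema-on-read: the structure is decided only when a question is asked.
# Here we project tweets onto (id, text, likes), defaulting missing likes to 0.
tweets = [
    {"id": doc["id"], "text": doc["text"], "likes": doc.get("likes", 0)}
    for doc in documents
    if doc["type"] == "tweet"
]

# The union of all field names shows how much the records differ.
all_fields = sorted({key for doc in documents for key in doc})
print(tweets)
print(all_fields)
```

A relational table would have required all six fields to be declared in advance; here a new record type can be added without altering anything that is already stored.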
14. Pig and Hive
Not everyone can code in Java for Hadoop apps
Introducing Pig Latin
High-level abstraction over Java MapReduce programming
Introducing Hive
An early data warehouse layer on the Hadoop file system, queried with the SQL-like HiveQL
Data-warehousing activities on Hadoop e.g. Extract, Transform and Load (ETL)
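To show what Pig and Hive abstract away, here is the classic word count written as explicit map, shuffle and reduce steps. This is a single-machine sketch in Python, not Hadoop code and not the demo from these slides: the function names are illustrative, and on a real cluster the framework would distribute each phase across many nodes.

```python
import re
from collections import defaultdict

# Two lines of unstructured text standing in for a much larger corpus.
lines = [
    "To be, or not to be, that is the question",
    "To sleep, perchance to dream",
]


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: collapse the list of values for one key into a single total."""
    return word, sum(counts)


pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())
print(word_counts)
```

In Pig Latin or HiveQL the same job is typically expressed as a handful of declarative statements (load, split into words, group, count), with the map and reduce plumbing generated for you.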
15. Demo
Pig and Hive (ETL) demo on the complete works of Shakespeare (unstructured data)!
16. Many SQL Processing Engines and NoSQL DBs on Hadoop Ecosystem
Apache Impala by Cloudera
Apache Drill by MapR
HAWQ (HDFS) and GemFire (in-memory) by Pivotal
Presto by Facebook
Apache Spark SQL (successor to Shark), distributed by almost all Hadoop vendors
There are many more.
Check out https://hadoopecosystemtable.github.io/