Introduction to SQL on Hadoop
Samuel Yee
What is RDBMS?
 Stands for “Relational Database Management System”.
 The data in RDBMS is stored in database objects called tables.
 A table is a collection of related data entries and it consists of columns and rows.
 Each RDBMS hosts one or more databases.
 Each database consists of one or more tables.
Sample Employee Table on RDBMS
Emp_no  DOB         First Name  Last Name     Gender  Date Joined
499990  11/3/1963   Khaled      Kohling       M       10/10/1985
499991  2/26/1962   Pohua       Sichman       F       1/12/1989
499992  10/12/1960  Siamak      Salverda      F       5/10/1987
499993  6/4/1963    DeForest    Mullainathan  M       4/7/1997
499994  2/26/1952   Navin       Argence       F       4/24/1990
What is SQL?
 Stands for “Structured Query Language”.
 SQL lets you access and manipulate databases.
 SQL is an ANSI (American National Standards Institute) standard.
 Major commands include SELECT, INSERT, UPDATE and DELETE; the WHERE clause filters the rows that SELECT, UPDATE and DELETE act on.
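The commands listed above can be tried without a database server. The following is a minimal sketch, assuming Python's built-in `sqlite3` module as a stand-in for MySQL. The `employees` table, its column names and the ISO date strings are assumptions modelled on the sample employee table, not the schema used in the demo.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE employees (
        emp_no     INTEGER PRIMARY KEY,
        birth_date TEXT,
        first_name TEXT,
        last_name  TEXT,
        gender     TEXT,
        hire_date  TEXT
    )
    """
)

# INSERT: load the five rows of the sample employee table (dates as ISO strings).
rows = [
    (499990, "1963-11-03", "Khaled", "Kohling", "M", "1985-10-10"),
    (499991, "1962-02-26", "Pohua", "Sichman", "F", "1989-01-12"),
    (499992, "1960-10-12", "Siamak", "Salverda", "F", "1987-05-10"),
    (499993, "1963-06-04", "DeForest", "Mullainathan", "M", "1997-04-07"),
    (499994, "1952-02-26", "Navin", "Argence", "F", "1990-04-24"),
]
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?)", rows)

# SELECT with a WHERE clause: employees hired before 1990, oldest hire first.
early_hires = conn.execute(
    "SELECT first_name, last_name FROM employees "
    "WHERE hire_date < '1990-01-01' ORDER BY hire_date"
).fetchall()

# UPDATE and DELETE, each restricted to one row by a WHERE clause.
conn.execute("UPDATE employees SET last_name = 'Smith' WHERE emp_no = 499991")
conn.execute("DELETE FROM employees WHERE emp_no = 499994")

remaining = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
updated_name = conn.execute(
    "SELECT last_name FROM employees WHERE emp_no = 499991"
).fetchone()[0]
```

Without the WHERE clause, the UPDATE and DELETE statements would apply to every row in the table.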
Demo on MySQL
 Querying a relational database with >300K employees hosted on AWS cloud (free-tier)
Data Warehouse
 An OLAP database used mainly for analytical purposes, such as analyzing historical trends and patterns, rather than for daily operational transactions.
 Imports data from various sources, typically different databases and ERP systems.
 The process of importing transactional data into the warehouse and reshaping it is referred to as Extraction, Transformation and Loading (ETL).
 Provides summarized, multi-dimensional views of consolidated data, i.e. a data cube. Each dimension gives a different perspective: the time dimension shows the breakdown of sales by year, quarter, month, day and hour; the product dimension shows which products bring in the most revenue.
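The time and product dimensions described above can be illustrated with plain GROUP BY queries. This is a rough sketch only, assuming Python's built-in `sqlite3` module and a made-up `sales` table with invented figures. A real warehouse would use dedicated fact and dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical fact table: one row per sale (all values invented for illustration).
conn.execute("CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2015-01-15", "Book", 20.0),
        ("2015-02-03", "Bike", 300.0),
        ("2015-07-21", "Book", 35.0),
        ("2016-03-09", "Bike", 450.0),
        ("2016-03-30", "Book", 15.0),
    ],
)

# Time dimension: roll sales up by year and quarter.
# (month + 2) / 3 uses integer division, so months 1-3 map to quarter 1, and so on.
by_quarter = conn.execute(
    """
    SELECT strftime('%Y', sale_date) AS year,
           (CAST(strftime('%m', sale_date) AS INTEGER) + 2) / 3 AS quarter,
           SUM(amount) AS total
    FROM sales
    GROUP BY year, quarter
    ORDER BY year, quarter
    """
).fetchall()

# Product dimension: which products bring in the most revenue.
by_product = conn.execute(
    "SELECT product, SUM(amount) AS total FROM sales "
    "GROUP BY product ORDER BY total DESC"
).fetchall()
```

Each query is one "slice" of the cube: the same facts summarized along a different dimension.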
Problems of Relational Databases & Warehouses
 Must have schemas or planned data models, i.e. strict data types that are difficult to change.
 Not suited to unstructured data, e.g. social media, story books, news, journals and photos.
 By most estimates, the large majority of real-world data is unstructured.
 Expensive, inflexible and hard to scale (capacity is often measured in gigabytes, or terabytes at best), and often requires specialized hardware and licensed proprietary software.
 Almost all Big Tech companies (Facebook, Google, Yahoo! etc.) decided long ago that a traditional RDBMS is a poor fit for their data business models, which change frequently and are measured in petabytes, exabytes, zettabytes and beyond.
 However, many data analysts are familiar with traditional ETL and BI concepts but not with
Hadoop programming.
Hadoop + NoSQL
 Development of the Hadoop Distributed File System (HDFS) and associated NoSQL databases such as Cassandra and HBase.
 NoSQL stands for “Not Only SQL”.
 Ability to store data in raw formats and decide what to do with it later, i.e. data lakes.
 NoSQL databases can be schema-based or schema-less, and are flexible and adaptable to change.
 Well suited to both structured and unstructured big data.
 Ability to expand dynamically using cheap commodity hardware and free open-source software.
 The cost of storing data in a Hadoop solution grows roughly linearly with the volume of data, and there is no hard upper limit.
 The Hadoop + NoSQL ecosystem brings familiar data warehouse and BI concepts back to Hadoop.
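The "store raw, decide later" idea behind data lakes is often called schema-on-read. A minimal sketch in plain Python follows, assuming raw events kept as JSON lines. The `raw_lake` list, the event fields and the `read_with_schema` helper are all hypothetical names invented for this illustration.

```python
import json

# Raw events stored exactly as they arrived; no schema was enforced on write.
raw_lake = [
    '{"user": "alice", "action": "view", "product_id": 101}',
    '{"user": "bob", "action": "buy", "product_id": 201, "price": 299.0}',
    '{"user": "carol", "comment": "Love this bike!"}',
]


def read_with_schema(lines, fields):
    """Parse raw JSON lines and project each record onto the requested fields.

    A field missing from a record comes back as None instead of raising,
    so the "schema" is chosen by the reader at query time.
    """
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}


# Two different readers apply two different schemas to the same raw data.
purchases = [
    row for row in read_with_schema(raw_lake, ["user", "price"])
    if row["price"] is not None
]
comments = [
    row for row in read_with_schema(raw_lake, ["user", "comment"])
    if row["comment"] is not None
]
```

A relational table would have rejected the third record, or forced a schema change, at load time. Here the decision is deferred until someone asks a question of the data.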
Data Models
Column-Based Model
Unique Name  Value
101          ProductName = "Book 101 Title"
             ISBN = "111-1111111111"
             Authors = [ "Author 1", "Author 2" ]
201          ProductName = "18-Bicycle 201"
             Brand = "Brand-Company A"
             Color = [ "Red", "Black" ]
             ProductCategory = "Bike"
Examples: Cassandra, HBase, Vertica
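The logical shape of the column-based table above can be sketched as a nested mapping in plain Python: a row key maps to whichever named columns that row happens to store. This shows only the data model, not how Cassandra or HBase lay data out on disk, and the `read_column` helper is a hypothetical name.

```python
# Row key -> {column name -> value}; each row carries only the columns it needs.
products = {
    101: {
        "ProductName": "Book 101 Title",
        "ISBN": "111-1111111111",
        "Authors": ["Author 1", "Author 2"],
    },
    201: {
        "ProductName": "18-Bicycle 201",
        "Brand": "Brand-Company A",
        "Color": ["Red", "Black"],
        "ProductCategory": "Bike",
    },
}


def read_column(table, column):
    """Return {row key: value} for every row that actually stores `column`."""
    return {key: cols[column] for key, cols in table.items() if column in cols}


names = read_column(products, "ProductName")   # present in both rows
brands = read_column(products, "Brand")        # present only in row 201
```

Row 101 has no `Brand` column and row 201 has no `ISBN`. Neither row stores a placeholder for the column it lacks, which is what makes the model sparse.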
Key-Value Model
{
Id = 101
ProductName = "Book 101 Title"
ISBN = "111-1111111111"
Authors = [ "Author 1", "Author 2" ]
ProductCategory = "Book"
}
{
Id = 201
ProductName = "18-Bicycle 201"
Color = [ "Red", "Black" ]
ProductCategory = "Bike"
}
Examples: Amazon DynamoDB, MemcacheDB, Redis
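In a key-value model the store sees each value as an opaque blob and offers lookup by key only. The toy class below is a sketch in plain Python. `KeyValueStore` and the `product:<id>` key format are hypothetical and do not mirror the client API of DynamoDB, Redis or any other product.

```python
import json


class KeyValueStore:
    """Toy in-memory key-value store (hypothetical; not a real client API)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The value is serialized to an opaque string; the store never inspects it.
        self._data[key] = json.dumps(value)

    def get(self, key, default=None):
        raw = self._data.get(key)
        return default if raw is None else json.loads(raw)


store = KeyValueStore()
store.put("product:101", {
    "Id": 101,
    "ProductName": "Book 101 Title",
    "ISBN": "111-1111111111",
    "Authors": ["Author 1", "Author 2"],
    "ProductCategory": "Book",
})
store.put("product:201", {
    "Id": 201,
    "ProductName": "18-Bicycle 201",
    "Color": ["Red", "Black"],
    "ProductCategory": "Bike",
})

bike = store.get("product:201")      # lookup by key is the only access path
missing = store.get("product:999")   # unknown key falls back to the default
```

There is no way to ask this store for "all bikes" without reading every value. That trade-off is what lets key-value stores stay simple and fast.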
Document-Based Model
Examples: Apache CouchDB, MongoDB
Graph-Based Model
Examples: AllegroGraph, Neo4j, InfiniteGraph, OrientDB, Virtuoso, Stardog
Pig and Hive
 Not everyone can code Hadoop applications in Java.
 Introducing Pig Latin
 A high-level abstraction over Java MapReduce programming
 Introducing Hive
 An early data warehouse on the Hadoop file system, queried with an SQL-like language (HiveQL)
 Supports data-warehousing activities on Hadoop, e.g. Extract, Transform and Load (ETL)
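To show what Pig Latin and Hive abstract away, here is a rough sketch of the map, shuffle and reduce steps of a word count in plain Python. It is neither Pig, Hive nor the Hadoop API. The function names are hypothetical, and the two input lines merely stand in for a large text corpus.

```python
import re
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


lines = ["To be, or not to be,", "that is the question."]

pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

On a real cluster the map and reduce steps run in parallel across many machines. A Pig Latin script or a Hive query expresses the same grouping and counting in a few declarative lines.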
Demo
 Pig and Hive (ETL) demo on the complete works of Shakespeare (unstructured data)!
Many SQL Processing Engines and NoSQL DBs in the Hadoop Ecosystem
 Apache Impala by Cloudera
 Apache Drill by MapR
 HAWQ (HDFS) and GemFire (in-memory) by Pivotal
 Presto by Facebook
 Apache Spark SQL (successor to Shark), distributed by almost all Hadoop vendors
 There are many more...
 Check out https://hadoopecosystemtable.github.io/
