From Hadoop to Enterprise Data Warehouse

Bui Ha, Cloud Solution Architect at SoftBank
From Hadoop to Data Warehouse
Bui Hong Ha
2018/3/31
For Vietnamese AI Community in Japan
2018/4/1 1
Agenda
1. Hadoop Technologies
2. Data Warehouse
3. From Data Warehouse to Big Data
4. Observations
Goals
1. Understand the technologies behind Hadoop, Big Data, and the Data Warehouse, and how they relate to each other
2. Learn the vocabulary needed to present on Big Data and the Data Warehouse
Raise your hand whenever you are in doubt.
Self-Introduction
Profile
• Name: Bui Hong Ha
• Company: SBCloud (SoftBank + Alibaba Cloud JV)
• Role: Cloud Architect
• Internet: telescreen
Skills
• Video Delivery System
• Big Data
  • I built one cluster (100 nodes, 1.5 PB)
  • CDH 4.3, CDH 5.4
• AWS Certified Solution Architect
• Alibaba Cloud Professional / MVP
Interests: taking photos with famous people
Quiz
Software Positions
1. Hadoop technologies
1. Hadoop
2. Query methods
3. UI
Big Data Timeline
• 2004: Google publishes the MapReduce paper
• 2006: Hadoop initial release; Google publishes the BigTable paper
• 2008: HBase release; Yahoo launches its Hadoop cluster; Pig and Hive development
• 2009: Google Flu Trends; "Statisticians will be the next sexy job in the next decade"
• 2012: YARN; Impala (MPP SQL on Hadoop)
• 2014: Spark
• 2017: Kudu; Beam
The timeline also marks the period of the Big Data hype.
Big Data technologies, and the hype around them, originated from the innovations of Google engineers and analysts and the hard work of open-source hackers.
Hadoop: a MapReduce framework
MapReduce first splits the data into several parts (splitting), processes those parts on different computers (mapping and shuffling), and then aggregates the results (reducing).
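The four phases named above can be sketched in a few lines of plain Python. This is only a single-process illustration of the idea, not Hadoop's implementation: the function names (`split_input`, `map_part`, `shuffle`, `reduce_groups`) and the word-count task are assumptions chosen for the example, and on a real cluster each part would be mapped on a different machine.

```python
from collections import defaultdict


def split_input(text, num_parts):
    """Splitting: cut the input lines into num_parts roughly equal parts."""
    lines = text.splitlines()
    return [lines[i::num_parts] for i in range(num_parts)]


def map_part(lines):
    """Mapping: emit a (word, 1) pair for every word in one part."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(mapped_parts):
    """Shuffling: gather the values that share a key, across all parts."""
    groups = defaultdict(list)
    for pairs in mapped_parts:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_groups(groups):
    """Reducing: aggregate the grouped values into one result per key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(text, num_parts=3):
    parts = split_input(text, num_parts)
    # On a cluster, each part would be mapped on a different computer.
    mapped = [map_part(part) for part in parts]
    return reduce_groups(shuffle(mapped))


if __name__ == "__main__":
    sample = "big data\nhadoop and big data\ndata warehouse"
    print(word_count(sample))
```

The result does not depend on how many parts the input is split into, which is what lets the framework spread the work over as many machines as are available.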
Hadoop Architecture
Each Hadoop worker node runs two components: the NodeManager and the DataNode.
• NodeManager: manages tasks and computing resources (CPU and memory)
• DataNode: manages the data stored on local disks
Features of Hadoop
• Fault tolerance
• Scalability (economical)
• Data locality
  • Move computation to the data
Hardware
• Many cores, with moderate clock speed CPUs (to reduce energy consumption)
• Lots of memory (32 GB to 128 GB)
• Lots of HDDs (10 HDDs + 2 HDDs)
  • SATA (not SAS or SSD)
  • No RAID (RAID 0), excluding system areas
• Produces a huge amount of heat
Hadoop uses commodity servers. Using specialized hardware goes against Hadoop's design philosophy.
Network and Rack Designs
• Hadoop tasks involve moving a lot of data around
• Moving data around produces heavy network traffic
  • 10 HDDs × 100 MB/s ≈ 8 Gbps, far more than 1 Gbps Ethernet can carry
Design Strategy
• 10G switches for top-of-rack switches
• 40G switches for core switches
• Enable "rack awareness" in Hadoop
Hadoop performance comes not only from the power of the machines in the cluster but also from how the cluster network is designed.
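The disk-versus-network arithmetic on this slide can be checked with a short calculation. The helper name `disk_throughput_gbps` is made up for this sketch, and the figures (10 disks, 100 MB/s each) are the slide's round numbers rather than measurements.

```python
def disk_throughput_gbps(num_disks, mb_per_sec_per_disk):
    """Aggregate sequential disk throughput of one node, in gigabits per second."""
    megabytes_per_sec = num_disks * mb_per_sec_per_disk
    return megabytes_per_sec * 8 / 1000  # 8 bits per byte, 1000 Mb per Gb


if __name__ == "__main__":
    node_gbps = disk_throughput_gbps(num_disks=10, mb_per_sec_per_disk=100)
    for link_gbps in (1, 10):
        ratio = node_gbps / link_gbps
        print(f"{link_gbps} Gbps link: disks can supply {ratio:.1f}x the link capacity")
```

With these numbers a single node's disks can supply about eight times what a 1 Gbps link carries, but fit within a 10 Gbps link, which is consistent with the 10G top-of-rack recommendation.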
Pig
- High-level platform for creating programs that run on Hadoop
- Jobs run on:
  - MapReduce
  - Spark
  - Apache Tez
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';

-- create a group for each word
word_groups = GROUP filtered_words BY word;

-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
The ideas behind Pig come from Sawzall, a language developed at Google by the legendary programmer Rob Pike.
https://static.googleusercontent.com/media/research.google.com/en//archive/sawzall-sciprog.pdf
Hive
- Supports SQL-like queries: HiveQL
- Compatible with several processing frameworks:
  - MapReduce
  - Apache Tez
  - Spark
Traditional data analysis and reporting tools require SQL-like query languages, hence the need for SQL on Hadoop.
Hue
2. Data Warehouse Technologies
1. OLAP vs OLTP
2. Column vs Row storage
Data Warehouse vs Transactional Database

Suitable workloads
• Data Warehouse: analytics, Big Data
• Transactional Database: transaction processing

Types of operations
• Data Warehouse: optimized for batched write operations and for reading high volumes of data, to minimize I/O and maximize data throughput
• Transactional Database: optimized for continuous write operations and high volumes of small read operations, to maximize transaction throughput

Data normalization
• Data Warehouse: employs denormalized schemas such as the star schema and the snowflake schema
• Transactional Database: employs highly normalized schemas, which are better suited to high transaction throughput requirements

Storage
• Data Warehouse: requires columnar or other specialized storage
• Transactional Database: row-oriented databases that store whole rows in a physical block
Analytical vs Transactional (OLAP vs OLTP)
Source: Understanding Analytic Workloads (IBM)
OLTP: Forms of Data Normalization
First Normal Form (1NF)
“An entity type is in 1NF when it contains no repeating groups of data.”
Second Normal Form (2NF)
“An entity type is in 2NF when it is in 1NF and when all of its non-key attributes are fully dependent on its Primary Key”
Third Normal Form (3NF)
“An entity type is in 3NF when it is in 2NF and when all of its attributes are directly dependent on the Primary Key”
OLAP: Data Modeling
The fact table holds a key referencing the primary key of each dimension table. Analytical queries JOIN the fact table with the dimension tables.
Figures: abstract star schema; detailed example of a star schema
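A tiny star schema makes the fact/dimension relationship concrete. The sketch below uses Python's built-in sqlite3 module as a stand-in for a real warehouse; the table names (`fact_sales`, `dim_product`, `dim_store`) and the sample rows are invented for illustration and are not taken from the slide's figure.

```python
import sqlite3


def build_star_schema():
    """Create one fact table that references two dimension tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            store_id   INTEGER REFERENCES dim_store(store_id),
            amount     INTEGER
        );
        INSERT INTO dim_product VALUES (1, 'laptop'), (2, 'phone');
        INSERT INTO dim_store   VALUES (1, 'Tokyo'), (2, 'Osaka');
        INSERT INTO fact_sales  VALUES (1, 1, 100), (2, 1, 50), (1, 2, 70), (2, 2, 30);
    """)
    return conn


def sales_by_city(conn):
    """Analytical query: JOIN the fact table with a dimension, then aggregate."""
    rows = conn.execute("""
        SELECT s.city, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_store AS s ON s.store_id = f.store_id
        GROUP BY s.city
        ORDER BY s.city
    """)
    return rows.fetchall()


if __name__ == "__main__":
    print(sales_by_city(build_star_schema()))
```

Swapping `dim_store` for `dim_product` in the JOIN gives sales by product from the same fact table, which is the point of keeping the measures in one place and the descriptive attributes in the dimensions.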
Columnar vs Row Storage
Columnar storage is used when only some fields are queried
• Values in the same column share the same data type
• Only the queried columns are read
Row storage is used when all fields of a table are queried
• All fields of a row can be fetched by primary key
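The difference between the two layouts can be illustrated with ordinary Python containers. This is a toy model under stated assumptions: "fields read" simply counts the values a query has to touch, and the sample records and function names are hypothetical; real engines work with disk blocks and compression.

```python
ROWS = [
    {"id": 1, "name": "laptop", "city": "Tokyo", "amount": 100},
    {"id": 2, "name": "phone", "city": "Osaka", "amount": 50},
    {"id": 3, "name": "tablet", "city": "Tokyo", "amount": 70},
]


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


def total_amount_row_store(rows):
    """Row layout: whole rows are read even though only 'amount' is needed."""
    fields_read = sum(len(row) for row in rows)
    return sum(row["amount"] for row in rows), fields_read


def total_amount_column_store(columns):
    """Column layout: only the queried column is read."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


def lookup_by_key(rows, key):
    """Row layout: every field of one record comes back from a single lookup."""
    return next(row for row in rows if row["id"] == key)


if __name__ == "__main__":
    print(total_amount_row_store(ROWS))
    print(total_amount_column_store(to_columns(ROWS)))
    print(lookup_by_key(ROWS, 2))
```

Both aggregations return the same total, but the row layout touches every field of every record while the column layout touches only the `amount` values.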
3. Big Data Hype
The 3 Vs of Big Data
• Volume
• Velocity
• Variety
Often extended with two more:
• Value
• Veracity
Hype Cycle 2011: On the radar (nobody even knows what Big Data is)
Hype Cycle 2012: Rising
Hype Cycle 2013: Peak of Inflated Expectations
Hype Cycle 2014: Trough of Disillusionment (false claims of simplicity, promises beyond reason)
Hype Cycle 2015: Big Data disappeared (adoption > 20% of the market)
"But what's happening is that big data has quickly moved over the Peak of Inflated Expectations, and has become prevalent in our lives across many hype cycles. So big data has become a part of many hype cycles."
Betsy Burton
4. Personal Observations and Suggestions
Observation and Suggestion 1: mrjob is good for learning
• https://github.com/Yelp/mrjob
• Python
• Runs on a local machine or on clusters
• Hadoop Streaming
http://calcite.apache.org/docs/stream.html
https://hadoop.apache.org/docs/current/hadoop-streaming/HadoopStreaming.html
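To get a feel for what runs underneath such tools, the Hadoop Streaming contract can be imitated in plain Python: the mapper emits tab-separated key/value lines, the framework sorts them by key, and the reducer consumes the sorted lines. The sketch below does not use mrjob's API; `mapper`, `reducer`, and `run_locally` are names made up for this example, and `run_locally` stands in for the cluster's sort-and-shuffle step.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Stand-in for the cluster: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for output_line in run_locally(["big data", "Hadoop and big data"]):
        print(output_line)
```

Because the reducer only relies on its input being sorted by key, the same two functions could be split into separate scripts that read standard input and write standard output, which is the form Hadoop Streaming expects.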
Observation and Suggestion 2: Moving to the Cloud
On-premises → cloud-based Big Data
Observation and Suggestion 3: Data Scientists use SQL
• Hadoop is solely a data processing framework
  • MapReduce is primitive
  • Sometimes it is overkill
• SQL is great
  • Mature analysis tools: BI, UI
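As a rough illustration of why SQL is attractive for analysis, a word count becomes a single GROUP BY query. The sketch uses Python's built-in sqlite3 module as a stand-in for a SQL-on-Hadoop engine; the `words` table and the `word_count_sql` helper are assumptions made for this example.

```python
import sqlite3


def word_count_sql(lines):
    """Load one word per row, then let a GROUP BY query do the counting."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT)")
    conn.executemany(
        "INSERT INTO words VALUES (?)",
        [(word.lower(),) for line in lines for word in line.split()],
    )
    query = """
        SELECT word, COUNT(*) AS n
        FROM words
        GROUP BY word
        ORDER BY n DESC, word
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for word, n in word_count_sql(["big data", "hadoop and big data", "data warehouse"]):
        print(word, n)
```

The grouping, counting, and ordering are all declared in the query, so the same statement can be issued from BI tools without writing any map or reduce functions.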
The End