BUILD YOUR BI SYSTEM
PRACTICE IN DATA LAKE ECOSYSTEM
Bryan@Vpon Data
• Experience
Vpon Data Engineer
TWM, Keywear, Nielsen
• Bryan’s notes for data analysis
http://bryannotes.blogspot.tw
• Spark.TW
• LinkedIn
https://tw.linkedin.com/pub/bryan-yang/7b/763/a79
ABOUT ME
AGENDA
• User Story
• Data Lake
• Framework of BI
DEAL WITH BIG DATA
SMALL RETAILER
MORE COMPLEX AND
BIG…
http://www.slideteam.net/technology-powerpoint-templates/mobile-phones.html
3 KINDS OF PROBLEMS
https://kavyamuthanna.wordpress.com/category/big-data/
BIG DATA BIG PROBLEM
http://www.mn.uio.no/ifi/studier/masteroppgaver/nd/masteroppgave_cloud_bigdata_hpc.html
BIG DATA BIG COST
• The cost of data storage
What data do we keep?
How long?
• The cost of data management
Are the machines and infrastructure easy to maintain?
Data flow (ETL)?
• The time cost of data processing
How long can users wait?
Accessibility of the data
Hidden human costs
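A back-of-the-envelope sketch of the storage question above ("what do we keep, and for how long?"). Every figure here (daily volume, retention period, price per GB) is a made-up placeholder, not a Vpon number; the replication factor of 3 is only the usual HDFS default.

```python
def retained_storage_gb(daily_gb, retention_days, replication=3):
    """Raw disk needed to keep `retention_days` of data at a given replication factor."""
    return daily_gb * retention_days * replication


def monthly_storage_cost(daily_gb, retention_days, price_per_gb_month, replication=3):
    """Monthly bill for the retained volume at a flat price per GB-month."""
    return retained_storage_gb(daily_gb, retention_days, replication) * price_per_gb_month


# Hypothetical inputs: 100 GB/day, keep 90 days, 3 replicas, 0.02 per GB-month.
print(retained_storage_gb(100, 90))
print(monthly_storage_cost(100, 90, 0.02))

# Halving the retention period halves both numbers.
print(retained_storage_gb(100, 45))
```

Retention and replication multiply, so "how long?" is usually the cheapest lever to pull.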
A REAL CASE
SO MANY AD HOC QUERIES
SALES
MARKETING
FINANCE
BUSINESS
EVEN A SIMPLE QUERY
Q: HI, PLEASE TELL ME HOW MANY
USERS WE HAVE HAD SINCE THE BEGINNING.
A: SELECT COUNT(1)
FROM LOG
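A small sketch of why this "simple" query is not so simple. COUNT(1) counts log rows; if the question is really about distinct users, the query needs a distinct count over some user identifier, and either way it scans the whole log. SQLite stands in for Hive here, and the `user_id` and `event` columns are assumptions, not the real LOG schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (user_id TEXT, event TEXT)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?)",
    [("u1", "view"), ("u1", "click"), ("u2", "view"), ("u3", "view"), ("u2", "view")],
)

# The answer on the slide: number of log rows.
row_count = conn.execute("SELECT COUNT(1) FROM log").fetchone()[0]

# What "how many users" may actually mean: number of distinct users.
user_count = conn.execute("SELECT COUNT(DISTINCT user_id) FROM log").fetchone()[0]

print(row_count, user_count)
```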
ttps://myreelpov.wordpress.com/2012/12/23/which-story-do-you-prefer-life-of-pi/life-of-pi-2
Your Life
Boss
Family and Lover
Customers
Data Ocean
Overviews
Business intelligence (BI) is the set of techniques and tools for
the transformation of raw data into meaningful and useful
information for business analysis purposes. —Wikipedia
DIFFERENT FEATURES
                 Price         Performance   Accessibility
Hadoop           Low           Medium        Low
SQL Server       Low-Medium    Depends on    Medium
Data Warehouse   High          High          Medium
BI System        High          Depends on    High
http://www.datalytyx.com/big-data-data-lakes/
WHY DATA LAKE
tp://thesologuide.com/332/the-seesaw-of-success-when-taking-a-rest-is-bes
HIVE
• Created at Facebook
• Data warehouse in the Hadoop ecosystem
• HiveQL (SQL-like interface)
• Metastore (stores the data schema; schema on read)
• UDFs (user-defined functions)
http://www.stratapps.net/intro-hive.php
http://hortonworks.com/blog/hive-cheat-sheet-for-sql-users/
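To make "schema on read" concrete, here is a minimal plain-Python sketch, not Hive code. The raw file is stored untouched, the schema lives separately (the role the metastore plays), and types are applied only when the data is read. The column names and sample rows are invented for illustration.

```python
import csv
import io

# Raw data is stored as-is; nothing validates it at write time.
RAW_LOG = "u1,2015-09-01,3\nu2,2015-09-01,5\nu3,2015-09-02,oops\n"

# The "metastore" entry: column names and types, kept apart from the data.
SCHEMA = [("user_id", str), ("day", str), ("clicks", int)]


def read_with_schema(raw, schema):
    """Apply the schema while reading; values that do not fit become None."""
    rows = []
    for fields in csv.reader(io.StringIO(raw)):
        row = {}
        for (name, cast), value in zip(schema, fields):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LOG, SCHEMA)
print(rows[0])
print(rows[2]["clicks"])
```

The bad value in the third row only surfaces at query time, which is the trade-off of schema on read: loading is cheap, and data quality problems show up later.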
ONE MORE THING
TERADATA
• Massively Parallel Processing
• Each processor handles different threads
of the program, and each processor
has its own disk
• Teradata SQL is fully certified for SQL-92
http://www.slideshare.net/alam7/module-02-teradata-basics
https://www.safaribooksonline.com/library/view/teradata-architecture-for
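The massively parallel idea can be sketched in a few lines: rows are spread over independent units by hashing a key, each unit aggregates only its own slice, and the partial results are merged at the end. This illustrates the general pattern only; it is not Teradata's implementation. The unit count and sample rows are assumptions, and Python threads here show the structure rather than real hardware parallelism.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_UNITS = 4  # hypothetical number of processing units


def unit_for(key):
    """Pick the unit that owns a row by hashing its key (stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_UNITS


def distribute(rows):
    """Spread rows over units; each unit only ever sees its own slice."""
    slices = [[] for _ in range(NUM_UNITS)]
    for key, value in rows:
        slices[unit_for(key)].append((key, value))
    return slices


def local_sum(slice_rows):
    """Work done independently on one unit: sum values per key."""
    totals = Counter()
    for key, value in slice_rows:
        totals[key] += value
    return totals


def parallel_sum(rows):
    """Run the per-unit work concurrently, then merge the partial totals."""
    with ThreadPoolExecutor(max_workers=NUM_UNITS) as pool:
        partials = list(pool.map(local_sum, distribute(rows)))
    merged = Counter()
    for part in partials:
        merged.update(part)
    return dict(merged)


rows = [("tw", 1), ("jp", 2), ("tw", 3), ("hk", 4), ("jp", 5)]
print(parallel_sum(rows))
```

Because all rows with the same key land on the same unit, no unit needs another unit's data to finish its part of the aggregation.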
TABLEAU
• Visualization Tool
• Connects to many kinds of databases
• VizQL
• Tableau Server
http://www.clearpeaks.com/blog/tableau/tableau-8-2-new-features
https://www.youtube.com/watch?v=fYpy04vmG_o
m/services/business-intelligence-services/tableau-consulting/table
JENKINS
• Manages ETL processes
• Free, with many plugins
• Monitors job status and dependencies
• Integrates with Git and SVN
• Email alerts
User Interface
Back to home page
Management menu
Builds in progress
Job list: build information
Build status
Next scheduled builds and times
ip:port
Job Name List
Job Name
List
Build Steps
call python script
call the remote shell
call local shell script
Build Graph
Job Name Job Name Job Name
Job Name Job Name Job Name
Job Name Job Name Job Name
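The build graph boils down to a dependency ordering: a job runs only after its upstream jobs have finished. A minimal sketch of that ordering with Python's standard library follows. This is not the Jenkins API, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical ETL jobs: each key runs only after the jobs in its set succeed.
DEPENDENCIES = {
    "extract_logs": set(),
    "clean_logs": {"extract_logs"},
    "build_daily_stats": {"clean_logs"},
    "load_teradata": {"build_daily_stats"},
    "refresh_tableau_extract": {"load_teradata"},
}


def run_order(dependencies):
    """Return one valid execution order, upstream jobs first."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(DEPENDENCIES)
print(order)
```

If two jobs ever depend on each other, `TopologicalSorter` raises a `CycleError`, which is a cheap way to catch a broken pipeline definition before anything runs.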
LET’S PUT IT ALL
TOGETHER
Diagram components: Hadoop Cluster 1, Hadoop Cluster 2, Teradata, Tableau Server, User
Flow labels: Data Transfer, Request, ETL, Data Slicing
Live Query: Too Slow
Diagram components: Hadoop Cluster 1, Hadoop Cluster 2, Teradata, Tableau Server, User
Flow labels: Data Transfer, Request, ETL, Data Slicing
Extract Data: Insufficient Space
Diagram components: Hadoop Cluster 1, Hadoop Cluster 2, Teradata, Tableau Server, User
Flow labels: Data Transfer, Request, ETL, Data Slicing
Extract Data: Every Day
Also shown: Table, View, Statistical Tables
USER EXPERIENCE
TUNING
[Bar chart comparing HIVE, TERADATA, and BI on a 0 to 150 axis; annotation: 120X Faster]
HOW TO CHOOSE THE
COMPONENT IN YOUR BI
FRAMEWORK ?
• The cost of data storage
• The cost of data management
• The time cost of data processing
CONSIDERATIONS AND
SUGGESTIONS
• Time is money
• Trade HDD space or money for time
• Understand the components and
their relationships
• Balance needs against costs
• A good framework helps business growth
COST CURVE
Business Growth
Cost of Business Growth
Hardware
*More Nodes
*More Memory
*Graphics Card
…
Software
*Spark
*Tez
*Tachyon
*Algorithm
…
IN THE FUTURE
Cloud
*EC2
*BigQuery
*Bluemix
*SAP
…
THANK YOU FOR
LISTENING
Special Thanks
Vpon
Hood, Meiyen, Gil and OPS Team
Q & A

Building Your BI System - HadoopCon Taiwan 2015

Editor's Notes

  • #14 Big data brings problems in three ways. Variety: many data types, data sources, and databases. Volume: log data, transaction data, crawler data. Velocity: real time, near real time, batch.
  • #15 Vpon is a big data advertising company. We receive and produce a large amount of data every day.
  • #16 Business requirements dictate how long a wait for processing is acceptable.
  • #18 We receive many ad hoc queries a day. They come from every department: business development, sales, account services, R&D, and so on. For example: how many users a day, how many requests a day, click rate, etc.
  • #57 Business requirements dictate how long a wait for processing is acceptable.