How to Plan a Hadoop Cluster for Testing and Production Environments (Anna Yen)
Athemaster shares our experience in planning hardware specifications, server initialization, and role deployment with new Hadoop users, using two testing environments and three production environments as case studies.
Opening Keynote for HadoopCon 2014
We are surrounded, in daily life and online, by a flood of Big Data narratives and technologies. The Hadoopers gathered here today are already Big Data stakeholders, yet most of what we understand about Big Data comes from our own experience. The Hadoop ecosystem is vast, and different use cases call for different OSS projects to get the job done, so which of us has ever seen the whole Big Data landscape?
This talk shows us how to open more windows onto that landscape, so we can see the different worlds of Big Data more broadly and more clearly.
Etu, a leading Hadoop product and solution provider in Asia, announced its "Five Trend Predictions for Taiwan's Big Data Market in 2014" at its annual Etu Solution Day (ESD). Etu also presented, for the first time, 21 proven Hadoop Big Data applications across 10 industries on both sides of the strait, such as operational analytics and customer-service lookup in telecom, precision recommendation in e-commerce, content recommendation in digital media, user behavior analysis in retail, data warehouse workload offloading and process yield analysis in high-tech manufacturing, sentiment analysis for government and real estate, energy management for utilities, and management of massive numbers of small image files in insurance. Taiwan's Big Data market is expected to mature further in 2014, and the number of enterprises moving past the validation stage into final adoption is likely to multiply.
Etu head Fred Chiang (蔣居裕) said: "UDN's adoption shows that Taiwanese enterprises' demand for Big Data applications is clearly rising in certain industries, and the 'Five Trend Predictions for Taiwan's Big Data Market in 2014' echo this view." His five predictions: First, those who crossed the river early will start to challenge the ocean of data value; the earlier the investment, the deeper the usage, and the deeper it goes, the broader it spreads. Second, Total Data BI will drive enterprises to adopt multi-structured data warehouses, with customer behavior analysis, precision marketing, and customer experience as the target applications. Third, the market will move from integrating old and new systems to end-to-end solutions; most enterprises expect vendors to deliver complete Big Data applications together with professional technical consulting, and "Ease" is the keyword for Big Data products entering the enterprise. Fourth, data exploration tools will prevail, helping business users dig out the value of Big Data even better than IT staff; "Discovery" is the essence of Big Data analysis: discovering correlations, discovering intent, discovering what is missing. Fifth, Big Data training courses will expand rapidly from processing technology to data analysis, all under the umbrella of "data science"; since true individual data scientists are one in ten thousand, data science teams built on specialized division of labor are where the hope of realizing data value lies.
ESD 2013 also showcased the Etu Ecosystem built around the Etu Appliance, demonstrating end-to-end solutions developed by Etu and its ISV partners. Etu Recommender, beyond its original personalized precision recommendation, can now integrate with third-party tools for visual data exploration and for building user-behavior-analysis data warehouses. Partner solutions included 堂朝數位整合's cloud e-publication value-added platform, PilotTV's audience measurement system, 樺鼎商業資訊's visual analytics tool, and 衛信科技's complete SDN network management solution, which use the Etu Appliance respectively for massive, scalable file-format conversion, real-time facial-recognition data processing and analysis, multi-structured data warehousing, and network packet preprocessing. What these solutions share is that all of them were developed on, or integrated with, the repeatedly award-winning Etu Appliance.
Summary of Insights Learned from the Data Science Program Team Training (Fred Chiang)
Who really has the skills and talent to extract the most value from data? The Data Science Program (DSP) was co-founded by Code for Tomorrow and Etu. We believe that building and deploying a data science team whose members bring and apply different skill sets from a variety of industries is more practical and realistic than hoping to find an individual data scientist who is an expert in technical fields ranging from math and statistics to visualization, and who also has a solid background in business, communication, and more. The DSP has identified four pertinent roles for its members: Campaigner, Data Analyst, Data Hygienist, and Designer, and every team fills all four. During the training, each team learns data processing, data analysis, and visualization together, with the sole purpose of using these skills to solve a common problem. After four weeks of intensive study, each team delivers an enterprise-grade project demonstrating data-driven business innovation.
After two rounds of DSP Team Training, the program has accumulated 10 team projects and graduated more than 60 alumni who are passionate about data science. The most valuable things we took away from developing and deploying these teams were watching members grow in confidence through learning and experience, the teamwork that formed, and each individual's overall growth. At the end of the day, our hope as members of DSP, myself included, is to inspire and motivate more people to devote themselves to exploring data science. Now think about how you can do the same.
HadoopCon 2015: Hadoop Enables the Enterprise Data Lake (James Chen)
The growth of the mobile Internet, social media, and smart devices has triggered an information explosion, producing huge volumes of unstructured and semi-structured data. The formats are diverse and the data arrives fast, posing unprecedented challenges to enterprise information architecture. Facing diverse data structures and diverse analysis tools, what architecture should we adopt to integrate them, manage the data lifecycle effectively, and extract the data's value? In this big picture, the Hadoop ecosystem will undoubtedly play the role of the foundational data platform, realizing the enterprise Data Lake.
3. Hive
Hive Introduction
• Hive is a data warehouse infrastructure built on top of Hadoop
  – Compiles SQL-like queries into MapReduce jobs and runs them on Hadoop
  – Uses HDFS for storage
• A data warehouse (DW) is a database intended specifically for analysis and reporting purposes
[Architecture diagram: HiveQL enters through the web UI, CLI, or JDBC/ODBC via the Thrift Server; the Driver compiles and executes queries, backed by the MetaStore.]
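To make the first bullet concrete, here is a minimal sketch (the table name, columns, and file path are hypothetical): the aggregation below is compiled into a MapReduce job whose map phase emits (host, bytes) pairs and whose reduce phase sums them per host.

CREATE TABLE logs (host STRING, bytes INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- hypothetical HDFS path
LOAD DATA INPATH '/user/data/access.tsv' INTO TABLE logs;

-- this GROUP BY runs on the cluster as a MapReduce job
SELECT host, SUM(bytes) FROM logs GROUP BY host;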
8. Programming even a bee can do
$ hive
hive> create table A (x int, y int, z int);
hive> load data local inpath 'file1' into table A;
hive> select * from A where y > 10000;
hive> insert into table B select * from A where y > 10000;
Figure source: http://hortonworks.com/blog/stinger-phase-2-the-journey-to-100x-faster-hive/
9. Hive vs. SQL comparison

                   Hive                        RDBMS
Query language     HQL                         SQL
Storage            HDFS                        raw device or local FS
Execution          MapReduce                   Executor
Latency            very high                   low
Data scale         large                       small
Updating data      no                          yes
Indexing           limited (index, bitmap      mature, full-featured
                   index, ...)                 indexing

Source: http://sishuok.com/forum/blogPost/list/6220.html
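One practical consequence of the "updating data: no" row: in classic Hive a change is made by rewriting the whole table (or a partition) in bulk rather than updating rows in place. A minimal sketch, with a hypothetical users table and columns:

-- rewrite the table, "updating" one row along the way;
-- Hive materializes the result before overwriting, so reading
-- from and overwriting the same table is allowed
INSERT OVERWRITE TABLE users
SELECT id,
       name,
       CASE WHEN id = 42 THEN 'new@mail.com' ELSE email END
FROM users;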
11. After reshaping with Hive

Input tables:
Table A              Table B
nm  dp  id           id  dt   hr
劉   北  A1           A1  7/7  13
李   中  B1           A1  7/8  12
王   中  B2           A1  7/9   4

HiveQL
> create table A (nm String, dp String, id String);
> create table B (id String, dt Date, hr int);
> create table final (dp String, id String, nm String, avg float);
> load data inpath 'file1' into table A;
> load data inpath 'file2' into table B;
> insert into table final
    select collect_set(a.dp)[0], a.id, collect_set(a.nm)[0], avg(b.hr)
    from A a join B b on (b.id = a.id)
    where b.hr > 8
    group by a.id;

Result in final:
dp  id  nm  avg
北   A1  劉  12.5

Tips:
• No LOCAL keyword is used here, so LOAD DATA INPATH reads from HDFS;
• after a Hive load, the input file disappears from its original path (Hive moves it into the warehouse);
• the data format must strictly match the table definition;
• the field delimiter is a single byte only;
• for CREATE TABLE and LOAD DATA, importing through a tool is recommended to avoid mistakes.
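Regarding the single-byte delimiter tip, the delimiter is fixed at table-creation time; a minimal sketch reusing the columns of table A above (the table name is hypothetical):

CREATE TABLE a_tsv (nm STRING, dp STRING, id STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';  -- one byte; Hive's default is '\001'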
12. Exercise 1: hands-on
cd ~;
git clone https://github.com/waue0920/hadoop_example.git
cd ~/hadoop_example/hive/ex1
hadoop fs -put *.txt ./
hive -f exc1.hive

Exercise: run each line of exc1.hive individually, checking the result after each step with select * from table_name, e.g.:
hive> select * from final
Q: Is there room to improve the data?
Q: How would you improve it?
13. More: tables

Create Table
CREATE TABLE page_view(
  viewTime INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING
)
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view'
TBLPROPERTIES ("skip.header.line.count"="1");

Drop Table
DROP TABLE pv_users;

Alter Table
ALTER TABLE old_table_name REPLACE COLUMNS (col1 TYPE, ...);
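After the DDL above, the table layout can be verified from the Hive shell; all three are standard HiveQL commands:

SHOW TABLES;
DESCRIBE FORMATTED page_view;   -- columns, partition keys, location, TBLPROPERTIES
SHOW PARTITIONS page_view;      -- lists dt/country partitions once data is loaded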
14. More: data

Import Data
LOAD DATA
INPATH '/user/data/pv_2008-06-08_us.txt'
INTO TABLE page_view PARTITION(dt='2008-06-08', country='US');

Insert Table
INSERT OVERWRITE TABLE xyz_com_page_views
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND
      page_views.date <= '2008-03-31' AND
      page_views.referrer_url like '%xyz.com';
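Hive can also fan a single scan of the source out to several targets (a multi-table insert). A minimal sketch reusing page_views and xyz_com_page_views; the second target table is hypothetical and assumed to exist with a matching schema:

FROM page_views pv
INSERT OVERWRITE TABLE xyz_com_page_views
  SELECT pv.* WHERE pv.referrer_url like '%xyz.com'
INSERT OVERWRITE TABLE other_page_views      -- hypothetical target
  SELECT pv.* WHERE pv.referrer_url not like '%xyz.com';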
15. More: queries

Query
INSERT OVERWRITE TABLE user_active
SELECT user.*
FROM user
WHERE user.active = 1;

Partition Based Query
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';
16. More: aggregates

Joins
INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u
JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

Aggregations
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender,
       count(DISTINCT pv_users.userid),
       collect_set(pv_users.name)
FROM pv_users
GROUP BY pv_users.gender;
17. More: union

• Tips:
  – UNION: the columns produced by the two SELECT statements must have matching data types
  – JOIN: merges two tables into a wider one through a shared key

UNION ALL
INSERT OVERWRITE TABLE actions_users
SELECT u.id, actions.date
FROM (
  SELECT av.uid AS uid
  FROM action_video av
  WHERE av.date = '2008-06-03'
  UNION ALL
  SELECT ac.uid AS uid
  FROM action_comment ac
  WHERE ac.date = '2008-06-03'
) actions JOIN users u ON (u.id = actions.uid);
18. More: a brain teaser

Input table employee (sex and age are stored together in a struct column sex_age):
name     sex     age
michael  male    30
will     male    35
shelley  female  27
lucy     female  57
steven   male    30

Expected result (the oldest person of each sex):
sex_age.sex  age  name
female       57   lucy
male         35   will

See: http://willddy.github.io/2014/12/23/Hive-Get-MAX-MIN-Value-Rows.html

Solution
SELECT employee.sex_age.sex, employee.sex_age.age, name
FROM employee
JOIN (
  SELECT max(sex_age.age) as max_age, sex_age.sex as sex
  FROM employee
  GROUP BY sex_age.sex
) maxage
ON employee.sex_age.age = maxage.max_age
AND employee.sex_age.sex = maxage.sex;
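The linked post also solves this max-per-group problem with window functions; a sketch assuming Hive 0.11 or later and the same employee table (like the join-based solution, ties within a sex are all returned):

SELECT sex, age, name
FROM (
  SELECT name,
         sex_age.sex AS sex,
         sex_age.age AS age,
         RANK() OVER (PARTITION BY sex_age.sex
                      ORDER BY sex_age.age DESC) AS r
  FROM employee
) ranked
WHERE r = 1;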
mkdir source; cd source
wget "http://plvr.land.moi.gov.tw//Download?type=zip&fileName=lvr_landcsv.zip" -O lvr_landcsv.zip
unzip lvr_landcsv.zip
mkdir ../input;
for i in $(ls ./*.CSV); do iconv -c -f big5 -t utf8 "$i" -o "$i.utf8"; done
mv *.utf8 ../input/
cd ../input/
rm *_BUILD.CSV.utf8
rm *_LAND.CSV.utf8
rm *_PARK.CSV.utf8
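After the cleanup, the UTF-8 files can be queried in place from Hive. A minimal sketch, assuming the files have been uploaded to HDFS (for example with hadoop fs -put) under /user/data/lvr, and reusing the header-skip property from slide 13; the column names here are hypothetical placeholders for the real CSV header:

CREATE EXTERNAL TABLE lvr_land (
  district STRING,      -- hypothetical column; use the real header names
  total_price BIGINT    -- hypothetical column
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/data/lvr'
TBLPROPERTIES ("skip.header.line.count"="1");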
A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1,f2,f3);
B = LOAD 'B.txt' ;
Y = FILTER A BY f1 == '8';
Y = FILTER A BY (f1 == '8') OR (NOT (f2+f3 > f1));
X = GROUP A BY f1;
====
(1,{(1,2,3)})
(4,{(4,3,3),(4,2,1)})
(7,{(7,2,5)})
(8,{(8,4,3),(8,3,4)})
====
Projection
X = FOREACH A GENERATE f1, f2;
====
(1,2)
(4,2)
(8,3)
(4,3)
(7,2)
(8,4)
====
X = FOREACH A GENERATE f1+f2 as sumf1f2;
Y = FILTER X by sumf1f2 > 5.0;
=====
(6.0)
(11.0)
(7.0)
(9.0)
(12.0)
=====
C = COGROUP A BY $0 INNER, B BY $0 INNER;
====
(1,{(1,2,3)},{(1,3)})
(4,{(4,3,3),(4,2,1)},{(4,9),(4,6)})
(8,{(8,4,3),(8,3,4)},{(8,9)})
====