How to plan a hadoop cluster for testing and production environmentAnna Yen
Athemaster wants to share our experience to plan Hardware Spec, server initial and role deployment with new Hadoop Users. There are 2 testing environments and 3 production environments for case study.
How to plan a hadoop cluster for testing and production environmentAnna Yen
Athemaster wants to share our experience to plan Hardware Spec, server initial and role deployment with new Hadoop Users. There are 2 testing environments and 3 production environments for case study.
Big Data Taiwan 2014 Track1-3: Big Data, Big Challenge — Splunk 幫你解決 Big Data...Etu Solution
講者:SYSTEX 數據加值應用發展部產品經理 | 陶靖霖
議題簡介:認清現實吧! Big Data 是個熱門詞彙、熱門議題,但是問題的核心仍然圍繞在資料處理的流程、架構與技術,要踏入 Big Data 的領域,使用者會遭遇哪些挑戰? Splunk 被譽為「全球最佳的 Big Data Company」,究竟在資料處理的流程中擁有什麼獨特的技術優勢,能夠幫助使用者克服這些挑戰?又有哪些成功幫助使用者從資料中萃取出價值的應用案例?歡迎來認識 Splunk 以及全球 Big Data 成功案例。
Big Data Taiwan 2014 Track1-3: Big Data, Big Challenge — Splunk 幫你解決 Big Data...Etu Solution
講者:SYSTEX 數據加值應用發展部產品經理 | 陶靖霖
議題簡介:認清現實吧! Big Data 是個熱門詞彙、熱門議題,但是問題的核心仍然圍繞在資料處理的流程、架構與技術,要踏入 Big Data 的領域,使用者會遭遇哪些挑戰? Splunk 被譽為「全球最佳的 Big Data Company」,究竟在資料處理的流程中擁有什麼獨特的技術優勢,能夠幫助使用者克服這些挑戰?又有哪些成功幫助使用者從資料中萃取出價值的應用案例?歡迎來認識 Splunk 以及全球 Big Data 成功案例。
Yarn Resource Management Using Machine Learningojavajava
HadoopCon 2016 In Taiwan - How to maximum the utilization of Hadoop computing power is the biggest challenge for Hadoop administer. In this talk I will explain how we use Machine Learning to build the prediction model for the computing power requirements and setting up the MapReduce scheduler parameters dynamically, to fully utilize our Hadoop cluster computing power.
Hadoop con 2016_9_10_王經篤(Jing-Doo Wang)Jing-Doo Wang
With the state-of-the-art computation mode, Map&Reduce,
This talk will present a novel approach to speed up the computation of extracting maximal repeats from tagged sequences and meanwhile computing the class frequency distribution of these repeats. An USA patent based on above approach is being applied as "Wang, Ching-Tu. Method for Extracting Maximal Repeat Patterns and Computing Frequency Distribution Tables. Patent Application Serial Number 15/208,994. 13 July 2016."
SparkR - Play Spark Using R (20160909 HadoopCon)wqchen
1. Introduction to SparkR
2. Demo
Starting to use SparkR
DataFrames: dplyr style, SQL style
RDD v.s. DataFrames
SparkR on MLlib: GLM, K-means
3. User Case
Median: approxQuantile()
ID Match: dplyr style, SQL style, SparkR function
SparkR + Shiny
4. The Future of SparkR
Logs are one of the most important sources to monitor and reveal some significant events of interest. In this presentation, we introduced an implementation of log streams processing architecture based on Apache Flink. With fluentd, different kinds of emitted logs are collected and sent to Kafka. After having processed by Flink, we try to build a dash board utilizing elasticsearch and kibana for visualization.
Hadoop con 2015 hadoop enables enterprise data lakeJames Chen
Mobile Internet, Social Media 以及 Smart Device 的發展促成資訊的大爆炸,伴隨產生大量的非結構化及半結構化的資料,不但資料的格式多樣,產生的速度極快,對企業的資訊架構帶來了前所未有的挑戰,面對多樣的資料結構及多樣的分析工具,我們應該採用什麼樣的架構互相整合,才能有效的管理資料生命週期,提取資料價值,Hadoop 生態系統,無疑的在這個大架構裡,將扮演最基礎的資料平台的角色,實現企業的 Data Lake。
Pegasus: Designing a Distributed Key Value System (Arch summit beijing-2016)涛 吴
This slide delivered by Zuoyan Qin, Chief engineer from XiaoMi Cloud Storage Team, was for talk at Arch summit Beijing-2016 regarding how Pegasus was designed.
4. Gartner Hype Cycle 2014
2016/7/12 CSA Summit 2016 4
萌芽期 夢幻期 幻滅期 平原期 高原期
Cloud Computing
Big Data
Internet of Things
5. Gartner Hype Cycle 2015
2016/7/12 CSA Summit 2016 5
萌芽期 夢幻期 幻滅期 平原期 高原期
“ Hybrid ’’
Cloud Computing
Internet of Things
VR
AR
6. Big Data 退燒 畢業了!!
隱身進入以下領域:
• Internet of Things ( 物聯網 )
• Business Intelligence and Analytics ( 商業智慧 )
• Enterprise Architecture
• Web-Scale IT
• Digital Banking Transformation ( 數位金融 )
• Utility Industry IT
• CRM Customer Service and Customer Engagement
• CRM Marketing Applications
• Digital Commerce ( 電子商務 )
6
14. 高可用性 High Availability (HA)
CDH 4.7 CDH 5.2 CDH 5.3 CDH 5.4 CDH 5.7
文件日期 2014/12 2015/9 2015/10 2015/11 2016/06
管理者介面 ClouderaManager V V
金鑰管理 Key Trustee KMS V
稽核者介面
Cloudera Navigator
Key Trustee Server
V V
使用者介面 Hue V V
查詢引擎 Llama / Impala V V V 沒寫?
ODBC 接口 HiveServer2 V
Schema Hive Metastore V V V V
工作流程 Oozie V V V V
索引引擎 Solr ( Search ) V V V V
運算引擎 MRv1 / YARN V V V V V
快速查表 HBase V V V V
儲存層 HDFS V V V V V
2016/7/12 CSA Summit 2016 14
• 架一座 Big Data Platform 的叢集,其實同時買了很多不同功能的元件!
• 國際大廠對於各元件高可用性的支援還持續隨著時間,正在慢慢增加中
20. 身分認證 Authentication
• 現況:支援度最廣的還是 Kerberos
2016/7/12 CSA Summit 2016 20
CM CN HDFS MRv1 YARN Flume HBase
HCat
olog
Hive
Server
2
HiveS
erver
Hue Impala Llama Oozie
ZooK
eeper
一個
以上
擇一 擇一 擇一 擇一 擇一 擇一 擇一 擇一 擇一 擇一 雙重* 擇一 擇一 擇一
simple V V V V V V V V V V V V
Data
base
V V V
Open
LDAP
V V V V V*
AD V V V V V*
LDAPS V V V V*
Kerberos V V V V V V V V V V V V V V
External
Program V CLASS
SAML V V
OpenID V
Oauth V
26. 現在進行式:Data Governance
• 全球頂尖的極極少數創新者已經走到「數據品管」的階段!!
• Data Governance
• Apache Atlas
2016/7/12 CSA Summit 2016 26
專案規劃
大數據
平台建置
大數據
平台資安
大數據
品質管制
Data governance is a control that ensures that the data
entry by an operations team member or by an automated
process meets precise standards, such as a business rule,
a data definition and data integrity constraints in the
data model.
http://atlas.incubator.apache.org/