Apache CarbonData & Spark Meetup
Build 1 trillion warehouse based on CarbonData
Huawei
Apache Spark™ is a unified analytics engine for large-scale data processing.
CarbonData is a high-performance data solution that supports various data analytics scenarios, including BI analysis, ad-hoc SQL queries, fast filter lookups on detail records, and streaming analytics. CarbonData has been deployed in many enterprise production environments; in one of the largest deployments it serves queries on a single table holding 3 PB of data (more than 5 trillion records) with response times under 3 seconds.
22. Cluster: Basic Architecture (Introduction to Distributed Computing)
[Diagram: a Client submits a job to the Cluster Manager, which schedules four Worker Nodes; each worker runs an Executor and holds its Data cached in memory. Example workload: Logistic Regression.]
1. The cluster manager schedules the worker nodes (Scheduling).
2. Data is staged in memory (cached) on the workers.
23. Cluster: Basic Architecture (Introduction to Distributed Computing)
[Diagram: same topology; each Executor now runs Tasks over its cached Data.]
1. The worker nodes execute their assigned tasks.
2. Data exchange between nodes may occur during this stage.
24. Cluster: Basic Architecture (Introduction to Distributed Computing)
[Diagram: an additional Executor runs a Reduce task that collects the workers' outputs.]
1. The computation results of the worker nodes are combined (Reduce).
25. Cluster: Basic Architecture (Introduction to Distributed Computing)
[Diagram: the reduced result flows from the Executor back through the Cluster Manager to the Client.]
1. The final computation result is returned to the user.
2. Depending on the requirements, decide whether to perform the next iteration.
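Slides 22 through 25 describe one pass of an iterative job: the manager schedules workers, each worker runs tasks over its cached partition, the partial results are reduced, and the client decides whether to iterate again. A minimal pure-Python sketch of that control flow, using the slides' logistic-regression example; the dataset, learning rate, and helper names here are illustrative, not from the deck:

```python
import math
from functools import reduce

# Toy 1-D dataset split into three "partitions", one per simulated worker node.
partitions = [
    [(0.5, 1), (1.5, 1)],
    [(-1.0, 0), (-0.3, 0)],
    [(2.0, 1), (-2.0, 0)],
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def partial_gradient(partition, w):
    # Task stage: each "worker" computes the gradient over its cached partition.
    return sum((sigmoid(w * x) - y) * x for x, y in partition)

w, lr = 0.0, 0.5
for step in range(100):                                    # client-driven iterations
    grads = [partial_gradient(p, w) for p in partitions]   # tasks on worker nodes
    grad = reduce(lambda a, b: a + b, grads)               # Reduce: combine results
    w -= lr * grad
    if abs(grad) < 1e-6:                                   # decide whether to iterate
        break

print(w)
```

In a real cluster the per-partition work runs in parallel on the executors and only the small partial gradients travel over the network, which is why caching the partitions in memory (slide 22) pays off across iterations.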
26. Basic Architecture: RDD (Resilient Distributed Dataset)
[Diagram: a client-side RDD object drives the Worker Nodes, each holding cached Data.]
On the client side, an RDD is manipulated through four kinds of operations:
1. Create
2. Transformation
3. Action
4. Cache
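The four operations above can be illustrated with a toy stand-in for the RDD abstraction. This is not Spark's API, only a sketch of the key idea that transformations are lazy and nothing is computed until an action runs:

```python
class ToyRDD:
    """Toy stand-in for Spark's RDD, illustrating the four operations on
    slide 26. Not Spark's API; just a sketch of its lazy evaluation."""

    def __init__(self, data_fn):
        self._data_fn = data_fn      # deferred computation, not yet materialized
        self._cached = None

    @classmethod
    def create(cls, records):        # 1. Create: from an existing collection
        return cls(lambda: list(records))

    def map(self, f):                # 2. Transformation: lazy, returns a new RDD
        return ToyRDD(lambda: [f(x) for x in self._materialize()])

    def cache(self):                 # 4. Cache: keep the materialized data in memory
        self._cached = self._materialize()
        return self

    def _materialize(self):
        return self._cached if self._cached is not None else self._data_fn()

    def collect(self):               # 3. Action: triggers the actual computation
        return self._materialize()

rdd = ToyRDD.create(range(5)).map(lambda x: x * x).cache()
print(rdd.collect())                 # [0, 1, 4, 9, 16]
```

In real Spark the equivalent chain would be `sc.parallelize(range(5)).map(lambda x: x * x).cache().collect()`, with the map tasks distributed across the executors shown on the earlier slides.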