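Hadoop con 2015 hadoop enables enterprise data lake (James Chen)
The growth of mobile Internet, social media, and smart devices has set off an information explosion, producing large volumes of unstructured and semi-structured data. The data arrives in many formats and at very high speed, which poses unprecedented challenges to enterprise information architecture. Faced with diverse data structures and diverse analytics tools, what architecture should we adopt to integrate them, so that we can manage the data lifecycle effectively and extract value from the data? The Hadoop ecosystem will undoubtedly serve as the foundational data platform within this larger architecture and make the enterprise Data Lake a reality.
Build 1 trillion warehouse based on carbon data (boxu42)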
Apache CarbonData & Spark Meetup
Build 1 trillion warehouse based on CarbonData
Huawei
Apache Spark™ is a unified analytics engine for large-scale data processing.
CarbonData is a high-performance data solution that supports a range of analytic scenarios, including BI analysis, ad-hoc SQL queries, fast filter lookups on detail records, and streaming analytics. CarbonData has been deployed in many enterprise production environments. In one of the largest deployments, it supports queries on a single table holding 3 PB of data (more than 5 trillion records) with response times under 3 seconds.
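Big Data Taiwan 2014 Track1-3: Big Data, Big Challenge - Splunk helps you solve Big Data... (Etu Solution)
Speaker: 陶靖霖, Product Manager, Data Value-Added Application Development Department, SYSTEX
Session overview: Face reality! Big Data is a hot term and a hot topic, but the core of the problem still revolves around data processing workflows, architecture, and technology. What challenges do users face when they step into the Big Data field? Splunk has been called "the world's best Big Data company." What unique technical advantages does it have in the data processing workflow that help users overcome these challenges? And which use cases have successfully helped users extract value from their data? Come get to know Splunk and Big Data success stories from around the world.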
How to plan a hadoop cluster for testing and production environment (Anna Yen)
Athemaster shares its experience with planning hardware specs, server initialization, and role deployment for new Hadoop users. The case study covers 2 testing environments and 3 production environments.
Title:
Architecture and Practice for DAL (5) Data Sharding
Architecture and Practice for Data Access Layer (5) Data Sharding
联动优势 Data Access Layer (DAL) Architecture and Practice, Part 5: Data Sharding
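Description:
How to implement a dalet to access sharded databases.
Unlike existing DAL software (such as 许超前's DAL at 手机之家, 陈思儒's Amoeba, and 贺贤懋's Cobar), the front end drops JDBC as the access method. Instead, each dalet data service exposes two interfaces at the same time: a custom protocol over persistent TCP connections, and persistent HTTP connections.
Dropping JDBC brings several benefits:
1) It reduces the server-side cost of protocol parsing and query analysis.
2) It simplifies client-side programming.
3) Back-end storage is no longer limited to relational databases; it can be any NoSQL store, file, cache, or even an online service such as Tuxedo.
4) The service can be stateless, which makes horizontal scaling easier.
5) The interface itself rules out misuse of keywords such as join, which avoids overloading the server.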
2. About Me
• Tech Lead and Engineering Manager at Databricks
• Apache Spark Committer and PMC Member
• Previously, IBM Master Inventor
• Spark, Database Replication, Information Integration
• Ph.D. from University of Florida
• Github: gatorsmile
7. Process data continuously and incrementally as new data arrive in a cost-efficient way without having to choose between batch or streaming
So says the project manager,
the data engineer
8. [Diagram labels: Data Lake; CSV, JSON, TXT…; Kinesis; AI & Reporting]
Process data continuously and incrementally as new data arrive in a cost-efficient way without having to choose between batch or streaming
So says the project manager.
The data engineer's first architecture sketch:
20. [Diagram labels: Data Lake; CSV, JSON, TXT…; Kinesis; AI & Reporting]
Process data continuously and incrementally as new data arrive in a cost-efficient way without having to choose between batch or streaming
So says the project manager.
The data engineer's first architecture sketch:
?
What exactly was wrong with the original design?
Why choose the complicated (read: painful) Lambda architecture?
24. Delta On Disk
my_table/
    _delta_log/              Transaction Log
        00000.json           Table Versions
        00001.json
    date=2019-01-01/         (Optional) Partition Directories
        file-1.parquet       Data Files
25. Action Types
• Change Metadata – name, schema, partitioning, etc.
• Add File – adds a file (with optional statistics)
• Remove File – removes a file
Table = result of a set of actions
Result: Current Metadata, List of Files, List of Txns, Version
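The idea that a table is the result of a set of actions can be illustrated with a small replay loop. This is a minimal sketch, not the Delta Lake implementation: the `TableState` class, the dictionary shape of each action, and the `replay` helper are all hypothetical names chosen for illustration, and the list of transactions from the slide is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class TableState:
    """Hypothetical in-memory view of a table after replaying its log."""
    metadata: dict = field(default_factory=dict)
    files: set = field(default_factory=set)
    version: int = -1  # -1 means no commit has been applied yet


def apply_commit(state, actions):
    """Apply one commit (an ordered list of actions) and bump the version."""
    for action in actions:
        kind = action["type"]
        if kind == "metadata":      # Change Metadata
            state.metadata = dict(action["value"])
        elif kind == "add":         # Add File
            state.files.add(action["path"])
        elif kind == "remove":      # Remove File
            state.files.discard(action["path"])
        else:
            raise ValueError(f"unknown action type: {kind}")
    state.version += 1
    return state


def replay(commits):
    """Rebuild the current table state by replaying commits in order."""
    state = TableState()
    for actions in commits:
        apply_commit(state, actions)
    return state


# Assumed example log: two commits, loosely modeled on the slides.
commits = [
    [{"type": "metadata", "value": {"name": "my_table"}},
     {"type": "add", "path": "1.parquet"},
     {"type": "add", "path": "2.parquet"}],
    [{"type": "remove", "path": "1.parquet"},
     {"type": "remove", "path": "2.parquet"},
     {"type": "add", "path": "3.parquet"}],
]
state = replay(commits)
print(state.version, sorted(state.files))
```

Because the state is derived purely from the ordered log, replaying only a prefix of the commits would give the table as of an earlier version.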
26. Changes to the table are stored as ordered, atomic units called commits
000000.json: Add 1.parquet, Add 2.parquet
000001.json: Remove 1.parquet, Remove 2.parquet, Add 3.parquet
...
How atomicity is implemented
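One way to picture ordered, atomic commits is to let each writer try to create the file for the next version and allow only one creation to succeed. The sketch below is an assumption-laden illustration rather than how Delta Lake is actually implemented: it relies on exclusive file creation on a local filesystem (other storage systems may need a different mechanism), the helper names `commit_path` and `try_commit` are hypothetical, and it ignores partially written files and crash recovery.

```python
import json
import os
import tempfile


def commit_path(log_dir, version):
    """Zero-padded file name so that commits sort in version order."""
    return os.path.join(log_dir, f"{version:06d}.json")


def try_commit(log_dir, version, actions):
    """Return True if this writer created the commit file for `version`."""
    try:
        # Mode "x" raises FileExistsError if the file already exists,
        # so at most one writer can create a given version.
        with open(commit_path(log_dir, version), "x", encoding="utf-8") as f:
            for action in actions:
                f.write(json.dumps(action) + "\n")
        return True
    except FileExistsError:
        return False


# Two writers race for version 0; only the first one wins.
log_dir = tempfile.mkdtemp()
first = try_commit(log_dir, 0, [{"add": "1.parquet"}, {"add": "2.parquet"}])
second = try_commit(log_dir, 0, [{"add": "9.parquet"}])
print(first, second, sorted(os.listdir(log_dir)))
```

The losing writer sees `False` and has to decide whether to retry at the next version, which is where conflict checking comes in.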
27. Optimistic Concurrency Control
1. Record start version
2. Record reads/writes
3. Attempt commit, check for conflicts among transactions
4. If someone else wins, check if anything you read has changed.
5. Try again.
[Diagram: User 1 and User 2 each do "Read: Schema" and "Write: Append"; log files 000000.json, 000001.json, 000002.json]
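The five steps of optimistic concurrency control can be sketched with an in-memory log. This is a simplified model under stated assumptions, not Delta Lake's conflict-detection logic: the `Log` class, the `commit` function, and the use of plain string labels such as "schema" and "data" for what a transaction read or wrote are all hypothetical.

```python
class ConflictError(Exception):
    """Raised when something the transaction read was changed by a winner."""


class Log:
    """Hypothetical in-memory commit log; list index equals version."""

    def __init__(self):
        self.commits = []

    def latest_version(self):
        return len(self.commits) - 1

    def try_append(self, version, entry):
        # Only the writer that targets exactly the next version succeeds.
        if version != len(self.commits):
            return False
        self.commits.append(entry)
        return True


def commit(log, start_version, reads, writes, max_retries=10):
    """Commit `writes`, given that `reads` were observed at `start_version`."""
    checked = start_version                      # step 1: start version
    for _ in range(max_retries):
        latest = log.latest_version()
        # Step 4: inspect commits that landed since we last looked.
        for v in range(checked + 1, latest + 1):
            if log.commits[v]["writes"] & reads:
                raise ConflictError(f"version {v} changed {reads}")
        checked = latest
        # Step 3: attempt to commit as the next version.
        if log.try_append(latest + 1, {"writes": set(writes)}):
            return latest + 1
        # Step 5: someone else won the race, so loop and try again.
    raise RuntimeError("too many retries")


log = Log()
log.try_append(0, {"writes": {"schema"}})        # 000000.json: table created

# User 1 and User 2 both start at version 0, read the schema, and append.
v_user1 = commit(log, 0, reads={"schema"}, writes={"data"})
v_user2 = commit(log, 0, reads={"schema"}, writes={"data"})
print(v_user1, v_user2)

# A later schema change conflicts with a reader that started before it.
log.try_append(3, {"writes": {"schema"}})
try:
    commit(log, 2, reads={"schema"}, writes={"data"})
    conflicted = False
except ConflictError:
    conflicted = True
print(conflicted)
```

In this model, two appends that only read the schema both succeed, one after the other, while a transaction whose read set was overwritten by a winner is rejected.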
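41. [Diagram labels: Data Lake; CSV, JSON, TXT…; Kinesis; AI & Reporting]
3) Bad writes can be rolled back (rollback), and data can be modified in place (update/delete/merge)
Can update/delete/merge be offered in standard SQL syntax?
Working on it! Spark 3.0 is coming!
To support this on Spark 2.4, Delta would need to add its own SQL parser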
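42. 4) Reprocess historical data (replay historical data) without taking the online service offline
Stream the backfilled historical data through the same pipeline
Thanks to ACID support, you can delete the affected results, change the business logic, and reprocess the historical data as a batch job, while the stream keeps processing the newest data.
[Diagram labels: Data Lake; CSV, JSON, TXT…; Kinesis; AI & Reporting]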
43. 5) Handle late-arriving data without delaying the next stage of data processing
Stream any late arriving data added to the table as they get added
Thanks to ACID support, late-arriving data can also be handled with MERGE/UPSERT
[Diagram labels: Data Lake; CSV, JSON, TXT…; Kinesis; AI & Reporting]
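The MERGE/UPSERT semantics mentioned for late-arriving data can be shown with plain Python dictionaries. This is only a sketch of the matching rule (update when the key matches, insert when it does not); it is not the Delta Lake API, and the `merge_upsert` function, the `event_id` key, and the sample rows are hypothetical.

```python
def merge_upsert(target, updates, key):
    """Return a new list of rows with `updates` merged into `target` by `key`."""
    by_key = {row[key]: dict(row) for row in target}
    for row in updates:
        if row[key] in by_key:
            by_key[row[key]].update(row)    # matched: update the existing row
        else:
            by_key[row[key]] = dict(row)    # not matched: insert a new row
    return list(by_key.values())


# Rows already in the table.
target = [
    {"event_id": 1, "clicks": 3},
    {"event_id": 2, "clicks": 5},
]
# Late-arriving rows: one correction and one event that was never seen.
late = [
    {"event_id": 2, "clicks": 7},
    {"event_id": 3, "clicks": 1},
]
merged = merge_upsert(target, late, key="event_id")
print(merged)
```

The function returns a new list and leaves `target` untouched, which loosely mirrors the idea that a merge produces a new table version rather than editing rows in place.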
45. [Diagram labels: Data Lake; CSV, JSON, TXT…; Kinesis; AI & Reporting]
Process data continuously and incrementally as new data arrive in a cost-efficient way without having to choose between batch or streaming
So says the project manager.
The data engineer's first architecture sketch: the Delta architecture
69. Delta Lake Roadmap
Releases Features
0.2.0 • Cloud storage support
• Improved concurrency
0.3.0 • Scala/Java APIs for DML commands
• Scala/Java APIs for query commit history
• Scala/Java APIs for vacuuming old files
0.4.0 • Python APIs for DML and utility operations
• In-place Conversion of Parquet to Delta Lake table
Q4 • Enable Hive support reading Delta tables
• SQL DML support with Spark 3.0
• And more
70. Delta Lake Community
2+ exabytes of Delta reads/writes per month
3700+ orgs using Delta
[Chart: monthly values, March through September; y-axis from 0 to 20,000]
74. Unified data analytics platform for accelerating innovation across
data science, data engineering, and business analytics
Original creators of popular data and machine learning open source projects
Global company with 5,000 customers and 450+ partners