SlideShare a Scribd company logo
1 of 20
Download to read offline
VIPSHOP Alluxio Cache System
唯品会缓存系统
聂凡博/Gus
2022.01
议程
Agenda
1
● 问题&挑战
● 解决方案
● 展望未来
● Q&A
离线数据的数据特征:
● 数据访问量大需求复杂
● 热点数据导致高频IO
● 冷热趋势明显
● 数据刷新不固定
唯品会离线数据平台架构
热数据场景分析
1
2
3
4
极热数据:有极热的维度表,也有数据产
品的热点数据
这样的数据访问每天能达到 Day>1K
窗口热数据:
● DataPipline 上刚刚生成最
新版本数据
● 数据是后继任务的热点依赖
数据访问次数的范围很广 Day > 50
冷数据:需要一直保留的
重要数据访问次数很低
,Day Avg = 0
历史的DataPipline数据:这样的数据往往
访问很少,但用户还希望保留作为备份或
者月度统计,Day > 10
数据步长
10
W1 W2 W3 W4
09
W1 W2 W3 W4
08
W1 W2 W3 W4
07
W1 W2 W3 W4
07 Data Date
08 Data Date
10 Data Date
09 Data Date
LOREM
Audit Date
-1
0 -2 -3
0
-1 -2
0 -1
0
数据步长=访问日期和数据日期的相 对值
数据步长
数据步长
基于数据步长的数据冷热数据
HDFS Audit日志分析
收集所有集群的数据访问
日志。
分区级别数据访问
清洗日志数据到分区粒度
的真实数据访问。
表步长数据访问分析
聚合到步长级别,进而分
析表的不同步长,在长时
间访问波动情况。
表步长级别冷热分析
能够预估到每个表,那些
分区在什么时候是热的,
在什么时候会冷掉。
表细粒度数据冷热,生命周期画像
冷热数据的处理设计思路
Table A
HDFS
Distribute Cache
Distribute FS
(SSD)
HDFS EC
步长:0-1 步长:2-4 步长:5-6 步长: > 6
表级生命周期画像
数据管理
缓存系统设计
Table A
HDFS
步长:0-1 步长:2-4 步长:5-6 步长: > 6
Cache/SSD
MetaStore
Auto Load
OverWrite
Data
计算引擎
缓存系统设计
Data Consitence Pertection
缓存写入
Also write TimeStamp to
Metastore for Data
Consitence
缓存读取 Compare TimeStamp
from Metastore for Data
Consitence
元数据接入
元数据系统对接:
● 展示表缓存需求
● 展示缓存分区
● 申请接入缓存
● 申请审核
元数据接入
元数据系统对接:
● 展示表缓存需求
● 展示缓存分区
● 申请接入缓存
● 申请审核
缓存效果
缓存效果
加速系统-展望:
● Alluxio 加载数据的性能
● 全面使用Alluxio 接入SSD温热数据数据
● Alluxio Federation分流
● 接入其他的加速引擎方案 : CK/Doris
● 直接对接数据服务路由层,导流所有客户端SQL
关于我们:
大数据组件研发: Hadoop, Alluxio, Spark, Flink, Presto, Hudi, CK, Doris…
议程
Agenda
Q&A

More Related Content

What's hot

Google LevelDB Study Discuss
Google LevelDB Study DiscussGoogle LevelDB Study Discuss
Google LevelDB Study Discuss
everestsun
 
海量统计数据的分布式MySQL集群——MyFOX
海量统计数据的分布式MySQL集群——MyFOX海量统计数据的分布式MySQL集群——MyFOX
海量统计数据的分布式MySQL集群——MyFOX
aleafs
 
“云存储系统”赏析系列分享三:Sql与nosql
“云存储系统”赏析系列分享三:Sql与nosql“云存储系统”赏析系列分享三:Sql与nosql
“云存储系统”赏析系列分享三:Sql与nosql
knuthocean
 
腾讯大讲堂45 解剖ttc
腾讯大讲堂45 解剖ttc腾讯大讲堂45 解剖ttc
腾讯大讲堂45 解剖ttc
areyouok
 

What's hot (18)

Use Alluxio to Unify Storage Systems in Suning
Use Alluxio to Unify Storage Systems in SuningUse Alluxio to Unify Storage Systems in Suning
Use Alluxio to Unify Storage Systems in Suning
 
京东实时消息队列JDQ技术实践与探索
京东实时消息队列JDQ技术实践与探索京东实时消息队列JDQ技术实践与探索
京东实时消息队列JDQ技术实践与探索
 
数据分片方法的分析和比较
数据分片方法的分析和比较数据分片方法的分析和比较
数据分片方法的分析和比较
 
Leveldb background
Leveldb backgroundLeveldb background
Leveldb background
 
分布式存储的元数据设计
分布式存储的元数据设计分布式存储的元数据设计
分布式存储的元数据设计
 
Google LevelDB Study Discuss
Google LevelDB Study DiscussGoogle LevelDB Study Discuss
Google LevelDB Study Discuss
 
用Python实现hadoop任务调度管理
用Python实现hadoop任务调度管理用Python实现hadoop任务调度管理
用Python实现hadoop任务调度管理
 
The Evolution of Data Systems
The Evolution of Data SystemsThe Evolution of Data Systems
The Evolution of Data Systems
 
Hadoop大数据实践经验
Hadoop大数据实践经验Hadoop大数据实践经验
Hadoop大数据实践经验
 
翟艳堂:腾讯大规模Hadoop集群实践
翟艳堂:腾讯大规模Hadoop集群实践翟艳堂:腾讯大规模Hadoop集群实践
翟艳堂:腾讯大规模Hadoop集群实践
 
Spark sql培训
Spark sql培训Spark sql培训
Spark sql培训
 
Tachyon 2015 08 China
Tachyon 2015 08 ChinaTachyon 2015 08 China
Tachyon 2015 08 China
 
海量统计数据的分布式MySQL集群——MyFOX
海量统计数据的分布式MySQL集群——MyFOX海量统计数据的分布式MySQL集群——MyFOX
海量统计数据的分布式MySQL集群——MyFOX
 
“云存储系统”赏析系列分享三:Sql与nosql
“云存储系统”赏析系列分享三:Sql与nosql“云存储系统”赏析系列分享三:Sql与nosql
“云存储系统”赏析系列分享三:Sql与nosql
 
腾讯大讲堂45 解剖ttc
腾讯大讲堂45 解剖ttc腾讯大讲堂45 解剖ttc
腾讯大讲堂45 解剖ttc
 
淘宝数据魔方的系统架构 -长林
淘宝数据魔方的系统架构 -长林淘宝数据魔方的系统架构 -长林
淘宝数据魔方的系统架构 -长林
 
第三届阿里中间件性能挑战赛冠军队伍答辩
第三届阿里中间件性能挑战赛冠军队伍答辩第三届阿里中间件性能挑战赛冠军队伍答辩
第三届阿里中间件性能挑战赛冠军队伍答辩
 
罗李:构建一个跨机房的Hadoop集群
罗李:构建一个跨机房的Hadoop集群罗李:构建一个跨机房的Hadoop集群
罗李:构建一个跨机房的Hadoop集群
 

Similar to Vipshop Offline Data Cache Acceleration System - Alluxio Integration

Taobao casestudy-yufeng-qcon
Taobao casestudy-yufeng-qconTaobao casestudy-yufeng-qcon
Taobao casestudy-yufeng-qcon
Yiwei Ma
 
Greenplum技术
Greenplum技术Greenplum技术
Greenplum技术
锐 张
 
2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法
2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法
2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法
Jazz Yao-Tsung Wang
 

Similar to Vipshop Offline Data Cache Acceleration System - Alluxio Integration (20)

大型电商的数据服务的要点和难点
大型电商的数据服务的要点和难点 大型电商的数据服务的要点和难点
大型电商的数据服务的要点和难点
 
揭开数据虚拟化的神秘面纱
揭开数据虚拟化的神秘面纱揭开数据虚拟化的神秘面纱
揭开数据虚拟化的神秘面纱
 
逻辑数据编织 – 构建先进的现代企业数据架构
逻辑数据编织 – 构建先进的现代企业数据架构逻辑数据编织 – 构建先进的现代企业数据架构
逻辑数据编织 – 构建先进的现代企业数据架构
 
ssdc-移动互联网技术挑战
ssdc-移动互联网技术挑战ssdc-移动互联网技术挑战
ssdc-移动互联网技术挑战
 
构建现代数据架构的基础
构建现代数据架构的基础构建现代数据架构的基础
构建现代数据架构的基础
 
2014 Hpocon 姚仁捷 唯品会 - data driven ops
2014 Hpocon 姚仁捷   唯品会 - data driven ops2014 Hpocon 姚仁捷   唯品会 - data driven ops
2014 Hpocon 姚仁捷 唯品会 - data driven ops
 
Taobao casestudy-yufeng-qcon
Taobao casestudy-yufeng-qconTaobao casestudy-yufeng-qcon
Taobao casestudy-yufeng-qcon
 
Qcon2013 罗李 - hadoop在阿里
Qcon2013 罗李 - hadoop在阿里Qcon2013 罗李 - hadoop在阿里
Qcon2013 罗李 - hadoop在阿里
 
Hadoop con 2015 hadoop enables enterprise data lake
Hadoop con 2015   hadoop enables enterprise data lakeHadoop con 2015   hadoop enables enterprise data lake
Hadoop con 2015 hadoop enables enterprise data lake
 
Greenplum技术
Greenplum技术Greenplum技术
Greenplum技术
 
How Enterprises Leverage Data to Overcome Business Challenges During Coronavirus
How Enterprises Leverage Data to Overcome Business Challenges During CoronavirusHow Enterprises Leverage Data to Overcome Business Challenges During Coronavirus
How Enterprises Leverage Data to Overcome Business Challenges During Coronavirus
 
产品演示:Denodo平台如何加速您获取洞察的时间
产品演示:Denodo平台如何加速您获取洞察的时间产品演示:Denodo平台如何加速您获取洞察的时间
产品演示:Denodo平台如何加速您获取洞察的时间
 
Flash存储设备在淘宝的应用实践
Flash存储设备在淘宝的应用实践Flash存储设备在淘宝的应用实践
Flash存储设备在淘宝的应用实践
 
淺談物聯網巨量資料挑戰 - Jazz 王耀聰 (2016/3/17 於鴻海內湖) 免費講座
淺談物聯網巨量資料挑戰 - Jazz 王耀聰 (2016/3/17 於鴻海內湖) 免費講座淺談物聯網巨量資料挑戰 - Jazz 王耀聰 (2016/3/17 於鴻海內湖) 免費講座
淺談物聯網巨量資料挑戰 - Jazz 王耀聰 (2016/3/17 於鴻海內湖) 免費講座
 
产品信息收集系统Infoc的演变
产品信息收集系统Infoc的演变产品信息收集系统Infoc的演变
产品信息收集系统Infoc的演变
 
Performance Data Analyze
Performance Data AnalyzePerformance Data Analyze
Performance Data Analyze
 
Track A-3 Enterprise Data Lake in Action - 搭建「活」的企業 Big Data 生態架構
Track A-3 Enterprise Data Lake in Action - 搭建「活」的企業 Big Data 生態架構Track A-3 Enterprise Data Lake in Action - 搭建「活」的企業 Big Data 生態架構
Track A-3 Enterprise Data Lake in Action - 搭建「活」的企業 Big Data 生態架構
 
Can data virtualization uphold performance with complex queries? (Chinese)
Can data virtualization uphold performance with complex queries? (Chinese)Can data virtualization uphold performance with complex queries? (Chinese)
Can data virtualization uphold performance with complex queries? (Chinese)
 
2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法
2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法
2015-05-20 製造業生產歷程全方位整合查詢與探勘的規劃心法
 
大规模数据库存储方案
大规模数据库存储方案大规模数据库存储方案
大规模数据库存储方案
 

More from Alluxio, Inc.

More from Alluxio, Inc. (20)

Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Optimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioOptimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with Alluxio
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio Caching
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
 
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
 
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionData Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
 
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeData Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
 
Data Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet ReaderData Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet Reader
 
Data Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage EvolutionData Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage Evolution
 
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
 
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
 
AI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI EraAI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI Era
 
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
 
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...
 
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ MetaAI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
 
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber ScaleAI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
 
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWSAlluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
 

Vipshop Offline Data Cache Acceleration System - Alluxio Integration