Engineering practices in big data storage and processing


Engineering practices in big data storage and processing. A summary of that experience.

Published in: Technology

  1. Engineering Practices in Big Data Storage and Processing. Nov. 20, 2013. Schubert (Songbo) Zhang
  2. About me
     • 张松波 (Schubert Zhang)
     • Background
       • Senior Engineer, Tech Lead and Architect, Infrastructure Data Team, @Baidu
       • VP Engineering, Cloud & Big Data R&D, @Hanborq
       • Senior Engineering Manager, @UTStarcom
     • 10 years of Telecom, 5 years of Cloud Storage & Big Data, 1 year of Internet
  3. Categories of (Big) Data
     • Rows / Records
       • Logs, User Profiles, Shopping Orders, …
       • Presentation: Tables with Schema; Data Types; Database, Data-Warehouse
     • Files / Objects
       • Documents, Photos, Videos, …
       • Presentation: Files in File-System; Objects in Object-Storage-System; with metadata …
     • For both: a mess -> organizing, indexing -> fast to retrieve …; batch and sequential processing …
     • Over the common underlying storage and IO system: Hardware, Disk, Network …
  4. Products and Engineering Projects: Object Storage, Data Warehouse, Cluster Management, etc. For enterprise!
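  5. Product Lines (Hanborq Products)
     • Big Data (大数据工程): The HB-CDW product line is a big data warehouse system, built on cloud computing technology, for storing, querying, analyzing, and mining big data at PB scale. Core products include a big data warehouse based on the Hadoop ecosystem and HugeTable, a management system for massive structured data. Built on Hadoop, HBase, Hive, Pig, and other big data infrastructure software enhanced and extended by Hanborq, it implements its own data model, system architecture, and standard SQL/API. It provides fast loading and real-time indexed queries over big data, plus in-depth statistics, analysis, and mining based on parallel computing technologies such as MapReduce and MPP. The system offers flexible scalability, security, and reliability, and is widely used in big data industries such as telecom, electric power, transportation, and large Internet companies.
     • Cloud Storage (云存储): The HB-CSS product line provides cloud storage solutions and services for enterprises and individuals. oNest is a scalable, secure, and fast cloud object storage system with a service-layer API and user experience similar to Amazon AWS S3. On top of oNest, a Storage Gateway gives enterprises and individuals access to cloud storage services, and uDrop/eDrop provide a Dropbox-like online cloud storage service. It is widely used in large Internet companies, education, telecom, media, transportation, and other industries.
     • Management (管理系统): HB-ClusterMaster is a software system for large-scale data center cluster planning, automated installation and deployment of operating systems and applications, configuration management, monitoring, and operations and maintenance. It enables efficient deployment and operation of large-scale cloud computing clusters. The largest single system deployed and managed so far exceeds 2,000 physical server nodes.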
  6. Cloud Object Storage System: oNest
     • Web Service and API
       • Amazon AWS S3 RESTful API
       • S3 Data Model (User -> Buckets -> Objects)
     • Backend Distributed Object Storage System
       • Google GFS + Facebook Haystack
       • Triple copy of data trunks
       • Write-through, strong consistency
       • Append-only and compaction
       • Highly efficient local index
       • …
     • Backend Distributed Metadata Layer
       • Flexible data model
       • NoSQL
     • [Diagram, top to bottom: SDK (C++/Java/Python/PHP/Go…); Web Service (RESTful API over HTTP); Metadata Layer; Object/Trunk Storage Layer.]
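The "append-only and compaction" and "local index" bullets on the oNest slide can be illustrated with a toy, single-process store. This is only a sketch of the general Haystack-style idea, not oNest's implementation: the class name, the 4-byte length header, and the in-memory `BytesIO` standing in for a large on-disk file are all assumptions made for the example.

```python
import io
import struct


class AppendOnlyStore:
    """Toy append-only blob store with an in-memory index (hypothetical sketch)."""

    HEADER = struct.Struct(">I")  # 4-byte big-endian payload length

    def __init__(self):
        self.log = io.BytesIO()  # stands in for one large on-disk file
        self.index = {}          # key -> (payload offset, payload length)

    def put(self, key, data):
        # Writes never modify existing bytes; they only append.
        self.log.seek(0, io.SEEK_END)
        offset = self.log.tell()
        self.log.write(self.HEADER.pack(len(data)) + data)
        self.index[key] = (offset + self.HEADER.size, len(data))

    def get(self, key):
        # One index lookup, then one positioned read.
        offset, length = self.index[key]
        self.log.seek(offset)
        return self.log.read(length)

    def delete(self, key):
        # Only the index entry goes away; the bytes are reclaimed by compact().
        del self.index[key]

    def compact(self):
        # Copy live records into a fresh log, dropping overwritten and deleted data.
        fresh = AppendOnlyStore()
        for key in self.index:
            fresh.put(key, self.get(key))
        self.log, self.index = fresh.log, fresh.index

    def size(self):
        return len(self.log.getvalue())
```

Overwriting a key simply appends a new record and repoints the index, so stale bytes accumulate until `compact()` rewrites the live data. A real system would also persist the index and replicate the log.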
  7. Cloud Object Storage System: oNest. Rock & Chunks: Data Model and Data Organization. [Diagram with a logical view and a physical view; labels: User, Bucket (Bucket1 to Bucket4), Object/Pebble, Part, Rock, Chunk.]
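  8. Cloud Object Storage System: RockStor -> oNest
     • [Architecture diagram of the distributed cloud object storage system]
     • Clients: application systems 1 … N; SDK (Java) for developers; HTTP interfaces
     • Interface layer: RESTful API (Cloud Service); RockStor Service load balancers (request load balancing, multi-point deployment, LVS); web services
     • Service layer: RockMaster (AAA, CAS, metering information, load balancing, system management, management interface); RockServers (object functions: object access, object attributes; container functions: container access, container attributes; user functions: authentication, authorization, user control)
     • Management Console / resource management platform over the internal RESTful API: log management, statistical reports, operations and maintenance
     • Storage layer (distributed storage system): Hadoop distributed storage cluster (stores and manages Rock files); HBase distributed database cluster (stores and manages metadata)
     • Fast/Simple Prototype. Leverage Open Source. To be a Product and Service.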
  9. Cloud Object Storage System: oNest
     • [Deployment diagram: a Region with data center A (机房A) and data center B (机房B). Labels: Console, WebServer, ClusterMaster, AAA Cluster (Master/Slave), Stats Cluster (Master/Slave), Proxy, Web Service, Discovery Service, and per-AZ OAS Cluster, DataStorage Cluster (DataNodes), MetaNode Cluster (Master/Slave), Healer Cluster.]
     • The oNest object cloud storage platform stores data as objects and provides cloud storage services of up to hundreds of PB for Internet services and enterprise users.
     • Main features of the oNest object cloud storage service:
       (1) Highly reliable, multi-replica data storage, with automatic repair of data replicas in a dynamic environment
       (2) Large-scale storage (capacity at the 100 PB level and above), with linear scaling of object count and capacity
       (3) Data backup within one data center and across data centers
       (4) Large-scale concurrent access
       (5) Secure data access
     • To be a more Complete Product and Service.
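  10. Cloud Object Storage System: oNest. [Screenshot labels: Create Bucket; New folder; Upload object; Refresh list; View properties; Operation log; User name; Right-click menu; Bucket list; Object list; Basic object attributes; Click to open the detailed attributes, including the object download URL; Click to open ACL permission management.]
  11. Cloud Object Storage System: oNest
     • [Diagram: education cloud users -> education cloud application services (App-1, App-2) -> SDK / REST -> oNest cloud storage service; registration and login through the Console.]
     • oNest provides a unified, standard cloud storage interface through which education cloud applications store, read, and manipulate data objects.
     • The education cloud applications are themselves the users of oNest cloud storage.
     • BC-oNest object cloud storage service: oNest is an elastic object cloud storage system, comparable to Amazon AWS S3. It stores video, audio, images, documents, and other data for the education cloud.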
  12. Dropbox-Like NetDisk Service: uDrop / eDrop
     • Hack Dropbox
       • [Diagram: Client <-> Dropbox Web Server in the Softlayer datacenter (208.43.202.5 ...; 67.228.78.114, 67.228.78.116, 67.228.78.117 ...): keep alive (http); login (https); list, delete, rename and sync (https). Client <-> Amazon S3 & EC2 (75.101.145.128, 75.101.138.84 ...): download and upload data (https).]
     • Keep-alive mechanism
     • Delta update
     • Mechanism of shared file blocks
     • Dropbox client database: SQLite
     • Data/file splitting and fingerprinting
     • Incremental upload algorithm
     • The so-called "instant upload" (秒传)
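The splitting, fingerprinting, incremental upload, and "instant upload" ideas listed on this slide can be sketched together. The code below is a hypothetical illustration, not the Dropbox or uDrop protocol: `ChunkServer`, the fixed 4-byte chunk size, and the use of SHA-256 as the fingerprint are assumptions chosen to keep the example short.

```python
import hashlib

CHUNK_SIZE = 4  # tiny for the demo; real systems use chunks of megabytes


def fingerprints(data, chunk_size=CHUNK_SIZE):
    """Split data into fixed-size chunks and return (fingerprint, chunk) pairs."""
    return [
        (hashlib.sha256(data[i:i + chunk_size]).hexdigest(), data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


class ChunkServer:
    """Hypothetical server that stores each distinct chunk exactly once."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes (shared across files)
        self.files = {}   # file name -> ordered list of fingerprints

    def upload(self, name, data):
        """Store a file and return the number of chunk bytes actually transferred."""
        sent = 0
        recipe = []
        for fp, chunk in fingerprints(data):
            if fp not in self.chunks:  # only unknown chunks cross the network
                self.chunks[fp] = chunk
                sent += len(chunk)
            recipe.append(fp)
        self.files[name] = recipe
        return sent

    def download(self, name):
        return b"".join(self.chunks[fp] for fp in self.files[name])
```

Uploading a file whose chunks are all already known transfers zero bytes, which is the "instant upload" effect. Changing one chunk of an existing file transfers only that chunk, which is the delta update.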
  13. Dropbox-Like NetDisk Service: uDrop / eDrop. [Architecture diagram labels: PC Client, Mobile Client, Browser; REST; AccessServers and Web Server exposing MetaAPI and DataAPI; Register; Meta Servers; Matcher; oNest; ZooKeeper; HBase.]
  14. Big Data Platform. [Architecture diagram labels: Users, Applications (SQL/Scripts/Java/Web); Smart SQL and Execution Engine; Hive, Pig, HugeTable; MapReduce/Impala; HCatalog; HBase (Bigtable); HDFS files; Oozie; ETL; Data Mining; Backup; Big Data Sources loaded through BulkLoad (Flume, Flive); Ganglia, Nagios, ClusterMaster (Deployment); Shared Cluster of Servers.]
  15. Big Data Warehouse: HugeTable -> Horizon
     • HDFS as the base storage platform, supporting multiple, extensible storage formats
       • HBase/HFile
       • Row storage: TextFile, SequenceFile
       • Columnar storage: RCFile/ORCFile, Parquet, …
     • Multiple data access models
       • HBase
       • MapReduce
       • MPP: Impala
     • HugeTable-specific data storage model
       • Encoding/Decoding
       • Indexing
       • Partitioning
       • …
     • Unified Data Schema Metadata management
     • Smart SQL Engine and Server
       • High performance, high concurrency, high stability, distributed
       • Chooses among the different data access model paths
     • Compatible with Hive and Pig
     • Standard JDBC client interface and client tools
     • Engineering support tools
       • Fast bulk loading (BulkLoad) and export, with an SQL interface
       • Fast deployment tools
     • [Diagram: SQuirreL SQL Client (GUI), SQLLine (CLI), Web SQL Client, and Apps (Programming) connect through JDBC Drivers to the Smart SQL Engine; below it Hive, Pig, Impala (MPP), MapReduce, and HBase; the HugeTable Data Model (data modeling) and Unified Schema (unified metadata); file formats HFile (SSTables), TextFile (Recorded), SequenceFile (Key-Value Rows), RCFile/ORCFile (Columnar), Parquet (ColumnIO), User-Defined Formats ...; all on HDFS.]
  16. Big Data Warehouse: HugeTable -> Horizon
     • JDBC and ODBC, REST API, Management ...
     • SQL Engine (Standard, Familiar, Low Learning Curve, ...)
     • Data Warehouse Utilities / Tools (SpeedLoader, SpeedScan, Data LifeCycle, ...)
     • Data Model (Data Organization, Indexing, Partitioning, Encoding, Compressing, ...)
     • Bigtable (HBase)
     • DFS (Hadoop HDFS)
     • Connectors integrating into the Hadoop ecosystem: Oozie, HCatalog, Pig, Hive, MapReduce
  17. NoSQL vs. SQL
     • NoSQL, BigTable, Cassandra, etc., are just the "Storage Engine Layer" of a DBMS.
     • Users like SQL and are familiar with it as the way to touch their data.
     • MySQL Server vs. Horizon
       • SQL Engine Layer vs. Distributed SQL Engine
       • Storage Engine Layer (MyISAM, InnoDB, etc.) vs. Distributed Storage Engine (NoSQL, HBase)
     • How about building a distributed DBMS? Megastore, Greenplum/Pivotal/CitusDB, etc.
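  18. Business Analysis Big Data Platform (经分大数据平台)
     • Plan & Design
       • Data storage model definition (Schema, Types, Indexes, StorageEngine, etc.)
       • Data processing operations and workflow definition (SQL, Scripts, Java, WorkFlow, etc.)
     • Big data sources (variety): BOSS billing-detail CDR data; network CDR data (Gn/Gb/IuPS ...); signaling data (Iub/Iucs/mmsc ...); log data (WAP, WLAN ...); DPI-collected data; CRM user profiles; other data
     • Data loading and preprocessing: bulk loading tools (Files, BulkLoad, etc.); real-time loading tools (Flume, Flive, etc.); database data transfer tools (Sqoop, etc.); ETL processing logic; other engineering tools. Developed and ported according to the actual business data; offline interfaces generally need no modification.
     • Data storage, organization, and processing platform (unified big data storage and analysis platform): Hadoop HDFS base storage layer; HBase; MapReduce; Impala; Hive; Pig; Horizon
     • Data processing and access: Client; SQL, Scripts, Java ... Developed and ported according to the actual business data; offline interfaces generally need no modification.
     • Business functions: statistics, aggregation, analysis, and reporting; ad-hoc queries; data mining; other OLAP services
     • Principle: mainly offline batch analysis, while also supporting data query and management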
  19. Big Data Service Platform (大数据服务平台)
     • Access: JDBC for Local Deployment; RESTful for Remote Deployment
     • Load Balancer (LVS, with HA) in front of the HugeTable Web Services and SQL Engine Servers
     • HugeTable Data Model and LifeCycle over HBase and Hadoop (with SpeedScan)
     • Input: Online Generated Data (CDR) (On/Offline, DataDrop) as files, through Connector, Flive, and BulkLoad
     • Analysis and ETL: Hive/Pig, MapReduce
     • Principle: mainly real-time, low-latency data query, while also supporting data analysis
  20. Cluster Management: ClusterMaster
  21. Cluster Management: ClusterMaster
  22. Hadoop and Open Source Ecosystem
     • MapReduce
       • Runtime Job/Task Schedule & Latency
         • Worker Pool
         • Transfer of job description information
         • …
       • Processing Engine Improvements
         • Shuffle: sendfile, Netty Server, Batch Fetch
         • Sort Avoidance: Spilling and Partitioning, Hash Aggregation
     • HBase (to be a Data Warehouse backend)
       • Low-level HFile management
       • Speed Bulk Load
       • Speed Scan for analysis
       • Flexible control of Flush, Compaction, Split, Balance
       • Coprocessor for parallel processing
     • Flume
       • Support more Data Sources and Data Storages
       • More flexible command line tool
     • Hive
       • Faster SQL Engine
       • Support more Storage Engines
       • More UDFs for database functions (such as NVL and DECODE from Oracle)
       • More UDFs for OLAP (such as Roll-Up, Cube, Efficient Aggregations, etc.)
       • More algorithms for efficient statistics and estimation (such as LogLog-Counter for estimated DISTINCT values)
     • Pig
       • Support more Data Storages
       • More UDFs for analysis, statistics and data mining (such as K-Means, ID3 for Decision Tree, etc.)
     • Tools
       • Deployment: Hdeploy, HTCfg, ClusterMaster
       • Management: Integrate Ganglia, Nagios, Puppet, etc.
       • Light and handy command line: Hman, etc.
       • Benchmark Tools: Hbench, etc.
  23. Know the Details of Hadoop …
  24. MapReduce Runtime Optimization
     • Job/Task Schedule & Latency
     • Worker Pool
     • [Diagram: MapReduce Client -> RPC (JobConf) -> JobTracker -> TaskTrackers, each with a Worker Pool of Child Workers.]
     • [Chart: Job Latency (in seconds, lower is better); total tasks: 96 maps, 4 reduces. Bars: CDH3u2 (Cloudera, reuse.jvm disabled), CDH3u2 (Cloudera, reuse.jvm enabled), HDH3u2 (Hanborq). Data labels: 43, 24, 1.]
  25. MapReduce Processing Engine Optimization
     • Shuffle: Use sendfile to reduce data copy and context switch.
     • Shuffle: Netty Shuffle Server (map side) and Batch Fetch (reduce side).
     • Sort Avoidance.
       • Spilling and Partitioning, Counting Sort, Bytes Merge, Early Reduce, etc.
       • Hash Aggregation in job implementation.
     • [Chart: Sort Avoidance and Aggregation, time in seconds (lower is better); series HDH (Hanborq) and CDH3u2 (Cloudera). Data labels: Case1 197 and 216; Case2 175 and 198; Case3 615 and 2186.]
     • [Chart: Real Aggregation Jobs, time in seconds (lower is better).]
       • CDH3u2 (Cloudera): Case1-1 238; Case2-1 603; Case1-2 136; Case2-2 206
       • HDH (Hanborq): Case1-1 233; Case2-1 578; Case1-2 96; Case2-2 151
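The "sort avoidance" and "hash aggregation" bullets rest on a simple observation: a sum or count per key does not need its input sorted. The sketch below contrasts the two strategies in plain Python. It illustrates the idea only, not Hanborq's MapReduce changes, and both function names are made up for the example.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def sort_based_sum(pairs):
    """Classic shuffle style: sort by key, then reduce each run of equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))  # O(n log n)
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(ordered, key=itemgetter(0))
    }


def hash_based_sum(pairs):
    """Sort avoidance: aggregate in a hash table in a single pass."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)
```

Both functions return the same totals. The hash-based version skips the sort entirely, but needs the set of distinct keys to fit in memory (or to be partitioned and spilled).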
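  26. China Mobile BigCloud (中国移动BigCloud)
     • Since 2008, in cooperation with the China Mobile Research Institute, we have defined, designed, and developed the "BigCloud" 1.0 architecture and product family. R&D of "BigCloud" 2.0 is now complete.
     • We have supported wide deployment of BigCloud systems at China Mobile and at users in other industries, providing software and hardware system solutions and services. For the cloud storage and data warehouse products and services, a single data center deployment has exceeded 2,000 nodes and manages more than 20 PB of storage. The systems provide storage, analysis, and retrieval for telecom call detail records, logs, signaling, documents, video, images, and Internet web page data.
     • BC-HugeTable (massive structured data management system)
       • Big data warehouse (analysis and query)
       • Big database (analysis and query)
     • BC-Hadoop (massive data storage and analysis platform)
       • Research Institute distribution
       • Hanborq distribution, HDH
     • BC-oNest (distributed object storage system)
     • BC-NAS (distributed file system middleware)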
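  27. CDR Billing-Detail Warehouse and Query (CDR帐详单仓库和查询)
     • [Diagram: telecom operator network (mobile core network, network switching equipment, OSS servers, BSS, collection devices feeding data in real time and in batches) -> HB-CDW cluster system (data storage and analysis server cluster: storage, indexing, analysis; cluster monitoring and management server; RDBMS and web server) -> terminals over Internet/Intranet (PC browser and smartphone queries, report queries, analysis reports, PC browser and smartphone monitoring).]
     • [Charts: record volume (in units of 100 million records) and query volume (count) by month, 200906 to 200912.]
     • Solution designed: 2009-10
     • Requirements
       • CDR real-time availability delay < 1 minute
       • Query response latency < 3 seconds (average < 0.5 seconds)
       • Query throughput: 200 million per month, 1,000 per second in busy hours
       • Data safety: data is redundantly stored on 3 nodes
       • Data analysis: KPI reports generated daily or monthly
     • User scale: about 100 million users
     • CDR data volume
       • Per month: 50 billion records, 20 TB (more than 20,000 records per second)
       • Total for 6 months of storage: 300 billion records, 120 TB
       • Mobile Internet service records are more than 5 times the volume of ordinary service CDRs
     • Data storage and processing cluster
       • 32 DELL PE C2100 servers
       • Each with 12 x 1 TB data disks and 64 GB memory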
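The sizing figures on this slide can be cross-checked with a few lines of arithmetic. The record counts, data volumes, retention period, replica count, and cluster size come from the slide. The 30-day month, decimal terabytes, and the reading of "20 TB per month" as the volume before replication are assumptions made here.

```python
RECORDS_PER_MONTH = 50_000_000_000  # 50 billion records per month (slide)
BYTES_PER_MONTH = 20 * 10**12       # 20 TB per month (slide); decimal TB assumed
QUERIES_PER_MONTH = 200_000_000     # 200 million queries per month (slide)
RETENTION_MONTHS = 6                # slide
REPLICAS = 3                        # "redundantly stored on 3 nodes" (slide)
SERVERS, DISKS_PER_SERVER, DISK_TB = 32, 12, 1  # slide
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day month

ingest_rate = RECORDS_PER_MONTH / SECONDS_PER_MONTH   # average records per second
bytes_per_record = BYTES_PER_MONTH / RECORDS_PER_MONTH
logical_tb = BYTES_PER_MONTH * RETENTION_MONTHS / 10**12
replicated_tb = logical_tb * REPLICAS  # assumption: 20 TB is the pre-replication size
raw_tb = SERVERS * DISKS_PER_SERVER * DISK_TB
avg_queries_per_second = QUERIES_PER_MONTH / SECONDS_PER_MONTH

print(f"average ingest: {ingest_rate:,.0f} records/s")
print(f"average record size: {bytes_per_record:.0f} bytes")
print(f"6-month data: {logical_tb:.0f} TB, x{REPLICAS} replicas: {replicated_tb:.0f} TB")
print(f"raw disk capacity: {raw_tb} TB")
print(f"average query rate: {avg_queries_per_second:.0f} queries/s")
```

Under these assumptions the average ingest rate is a little over 19,000 records per second, in line with the slide's figure of roughly 20,000, and each record averages 400 bytes. Three replicas of 120 TB need 360 TB against 384 TB of raw disk, so the cluster would be nearly full without compression. The busy-hour target of 1,000 queries per second is about 13 times the monthly average of roughly 77.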
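  28. China Mobile: Business Analysis ETL (移动 – 经分ETL)
     • WorkFlow/Pipeline controller
     • Every hour, Pig scripts run on the interface machine and drive MapReduce jobs that read data from the interface machine in parallel, perform format conversion, encoding, compression, and cleansing, and write SequenceFiles to HDFS. This saves storage space, makes later processing more efficient, and makes it easy to add new ETL functions.
     • Intermediate (fine-grained) summary data is output at 180 GB per month and kept in HDFS for 31 days, pending monthly aggregation.
     • Input: WAP log files from the Huawei WAP log servers (FTP Server) #1, #2, …… The interface machine pulls files every hour: 400 GB per day, about 46,000 small files.
     • Big data platform interface machine (FTP Server): the platform's overall external data interface (input/output), behind a firewall; high performance, high concurrency, large storage.
     • Big data platform (Hadoop/Hive/Pig/HugeTable) running on Hadoop nodes
       • Daily summary jobs (Hive SQL), then normalization of the daily summary to the tier-1 business analysis (一经) specification (Pig/Scripts)
       • Over 31 days of daily summaries, monthly summary jobs (Hive SQL), then normalization of the monthly summary to the 一经 specification (Pig/Scripts)
       • 5 GB of normalized data is output to the interface machine every day; normalized monthly data is output every month
       • The number-segment dimension table is updated daily; the user-information dimension table is updated monthly
     • AsiaInfo-Linkage (亚联) system: fetches the previous day's summary data on a daily schedule and the previous month's summary data on a monthly schedule. The data must conform to the 一经 specification.
  29. [No text on this slide]
  30. Lessons Learned: many lessons and many feelings.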
  31. Lesson 1: Right Design Comes from Basic Knowledge of Computer Systems / Computer Science
     • Computer Architecture and How Computers Work
       • Representing and Manipulating Information and Programs
       • Processor Architecture (Pipeline, Parallel …)
       • Storage Architecture
       • IO System, etc.
     • Memory/Storage Hierarchy
     • Modern Operating Systems
     • Networking
     • Languages
     • …
     • The core issues of databases.
     • File systems …
     • Now to be distributed.
  32. Basic Knowledge of CS
     • Sequential vs. Random Access …
     • Long latency of Disk Seek …
     • Throughput
     • All database and big data processing solutions rest on the characteristics of computer architecture, especially disk, network ...
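A back-of-the-envelope model makes the sequential versus random access point concrete. The numbers below are assumptions for a single spinning disk (about 10 ms per seek including rotational delay, about 100 MB/s sequential throughput), not measurements from the talk, and the function names are made up for the example.

```python
SEEK_S = 0.010           # assumed average seek + rotational delay per random read
SEQ_BYTES_PER_S = 100e6  # assumed sequential throughput of one disk
RECORD_BYTES = 4096      # assumed size of each randomly read record
TOTAL_BYTES = 1e9        # read 1 GB in total


def sequential_seconds(total_bytes):
    """One seek, then a single streaming read."""
    return SEEK_S + total_bytes / SEQ_BYTES_PER_S


def random_seconds(total_bytes, record_bytes):
    """One seek per record, plus the transfer time of that record."""
    reads = total_bytes / record_bytes
    return reads * (SEEK_S + record_bytes / SEQ_BYTES_PER_S)


seq = sequential_seconds(TOTAL_BYTES)
rnd = random_seconds(TOTAL_BYTES, RECORD_BYTES)
print(f"sequential: {seq:.2f} s, random: {rnd:.0f} s, ratio: {rnd / seq:.0f}x")
```

With these assumed figures, streaming 1 GB takes about 10 seconds, while fetching the same data as 4 KB random reads takes about 41 minutes, because almost all of the time goes to seeks. This is why designs that turn random writes into sequential ones, such as append-only logs, bulk loads, and compaction, recur throughout the deck.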
  33. Basic Knowledge of CS, by Jeff Dean
  34. Basic Knowledge of CS
     • What every data engineer needs to know about disks
     • Basic Algorithms (Sorting, Searching, Strings, Bitmap, …)
     • Linux Virtual Memory, Exceptions, Concurrency, etc.
     • …
  35. Lesson 2: Keep Simple and Straightforward
     • Master-Slave vs. Decentralized (DHT, Consistent Hash)
       • Almost all Google products follow the Master-Slave pattern: GFS/BigTable/MapReduce/ZooKeeper, etc.
     • MapReduce: Simplified Data Processing on Large Clusters
       • A simple programming model that applies to many large-scale computing problems
       • Hide messy details
     • Bigtable provides a simple data model, distributed B+ tree …
     • Shards and Replicas
     • Simple and clean API design
  36. Keep Simple and Straightforward
     • Example: Bigtable vs. Cassandra
     • [Diagram: Bigtable (Master, Tablet Servers, Tablets, GFS) next to Cassandra.]
  37. Keep Simple and Straightforward
     • Bigtable (++)
       • Master – Tablet Servers
       • Dynamic Tablet Splits
       • WAL + MemTable + SSTable
       • Three-Level Distributed B+Tree
       • Replication in GFS
       • …
     • Cassandra (--)
       • Identical Data Nodes, Gossip
       • Consistent Hash, Virtual Nodes
       • WAL + MemTable + SSTable
       • Hinted Handoff
       • DHT Ring (neighbor nodes)
       • Eventual consistency
       • Read Repair
       • Merkle Tree
       • Vector Clock
       • Anti-entropy protocol
       • …
     • Bigtable's architecture and data model make more sense.
     • So complex: a mistake in the architecture makes the system more and more complex …
     • http://www.slideshare.net/schubertzhang/cassandra-dynamo-paper
     • http://www.slideshare.net/schubertzhang/dastorcassandra-report-for-cdr-solution
  38. Lesson 3: There is no "one-size-fits-all" solution
     • There are too many contradictory requirements in the structured data world.
     • The contradiction of data processing:
       • Real-time or near-real-time data availability.
       • Batch processing of large data volumes, such as aggregation.
     • The contradiction of data access:
       • Low-latency fast query response, like Lookup.
       • High-latency ad-hoc analytic queries over historical data.
     • But there is no one-size-fits-all answer to the contradictory requirements above.
     • Identify common problems, and build systems to address them in a general way.
     • "Important not to try to be all things to all people!" – Jeff Dean, Keynote at LADIS'09
  39. There is no "one-size-fits-all" solution
     • MapReduce
     • Dremel (MPP)
     • Tez/Stinger
     • NoSQL/Bigtable (and with Coprocessor)
     • DBMS
     • …
     • Lambda Architecture: new data is sent to both layers, and queries merge views from both layers.
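The one-line description of the Lambda Architecture on this slide (new data goes to both layers, queries merge both views) can be sketched for the simplest possible case, per-key event counts. The class below is a toy under that assumption. The name `LambdaCounts` and its methods are hypothetical, and a real deployment would run the batch layer on something like MapReduce rather than in memory.

```python
from collections import Counter


class LambdaCounts:
    """Toy Lambda Architecture for per-key event counts (hypothetical sketch)."""

    def __init__(self):
        self.master = []             # append-only master data set of all events
        self.batch_view = Counter()  # recomputed periodically from the master set
        self.speed_view = Counter()  # covers only events since the last batch run

    def ingest(self, key):
        # New data is sent to both layers.
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # Batch layer: recompute the view from scratch, then reset the speed layer.
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, key):
        # Queries merge the views of both layers.
        return self.batch_view[key] + self.speed_view[key]
```

The batch view is always somewhat stale but easy to recompute correctly. The speed view is small and covers only the gap since the last batch run, so a query sees complete results at every moment.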
  40. There is no "one-size-fits-all" solution
     • [Diagram labels: SQL, Scripts, Java, etc.; Hive, Pig, MapReduce Java, Impala, GoldenOrb; Dremel, Pregel.]
     • Different query and analysis requests use different parallel execution engines to operate on the data.
  41. Lesson 4: Monitorable and Metrizable at any time
     • Sufficient statistics, monitoring …
     • Add sufficient monitoring/status/debugging hooks
     • If your system is slow or misbehaving, can you figure out why?
     • Don't rely on logs too much; logging is too costly and inefficient.
     • Use real-time statistics/metrics.
     • Use tools: jmxetric, JMX, Ganglia, Nagios, Noah …
  42. Monitorable and Metrizable at any time. The magic matrix ??! Captured from UTStarcom mSwitch R5 system, Guangxi Site, 2004.
  43. Monitorable and Metrizable at any time. [Charts: Write/Insert Operation Benchmark; Read/Query Operation Benchmark.]
  44. Monitorable and Metrizable at any time
     • SLA Metrics:
       • Latency
         • tAvgLat: Total Average Latency (ms)
         • dAvgLat: Delta Average Latency (ms)
         • dMaxLat: Delta Maximum Latency (ms)
         • dMinLat: Delta Minimum Latency (ms)
       • Throughput
         • tThrou: Total Throughput (operation count)
         • dThrou: Delta Throughput (operation count)
     • Total: from benchmark start to present.
     • Delta: between each statistical interval (2 minutes here).
     • [Chart: Quantile %, percentage of read ops per 100 ms latency bucket; y-axis 0% to 25%.]
     • Read throughput: average ~140 ops/s
     • Latency: average ~500 ms, 97% < 2 s (SLA)
     • Bottleneck: disk IO (random seek); CPU load is very low
  45. Monitorable and Metrizable at any time
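The total versus delta metrics defined on this slide can be kept with a handful of counters and no logging. The sketch below is one possible in-process collector, not the benchmark tool used in the talk. The class name and the `snapshot` method are assumptions, while the metric names follow the slide.

```python
class LatencyStats:
    """Tracks total (since start) and delta (current interval) latency metrics."""

    def __init__(self):
        self.t_count = 0     # tThrou: operations since benchmark start
        self.t_sum_ms = 0.0  # running sum for tAvgLat
        self._reset_delta()

    def _reset_delta(self):
        self.d_count = 0     # dThrou: operations in the current interval
        self.d_sum_ms = 0.0
        self.d_max_ms = float("-inf")
        self.d_min_ms = float("inf")

    def record(self, latency_ms):
        self.t_count += 1
        self.t_sum_ms += latency_ms
        self.d_count += 1
        self.d_sum_ms += latency_ms
        self.d_max_ms = max(self.d_max_ms, latency_ms)
        self.d_min_ms = min(self.d_min_ms, latency_ms)

    def snapshot(self):
        """Return the current metrics and start a new delta interval."""
        report = {
            "tThrou": self.t_count,
            "tAvgLat": self.t_sum_ms / self.t_count if self.t_count else None,
            "dThrou": self.d_count,
            "dAvgLat": self.d_sum_ms / self.d_count if self.d_count else None,
            "dMaxLat": self.d_max_ms if self.d_count else None,
            "dMinLat": self.d_min_ms if self.d_count else None,
        }
        self._reset_delta()
        return report
```

Calling `snapshot()` on a timer (every 2 minutes in the slide's benchmark) yields one row per interval. The delta columns show what is happening now, while the total columns smooth over the whole run.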
  46. Lesson 5: Try to make data in-situ
     • The ability to access data 'in place'.
     • ProtocolBuffers/Parquet encoding
     • Example: Horizon over HDFS + HBase
     • [Diagram: Real-Time Data Service with Writes (Puts) and Reads (Get/Scan) through the Real-Time API into HBase; Schema Meta; Bulk Load (Batch Input) writing HFiles; HBase Flush/Compaction producing HFiles; Coprocessor and MapReduce/Impala (Batch Processing) reading the HFiles in place on HDFS.]
  47. Lesson 6: Approximated vs. Precise
     • For large data sets, it can be prohibitively expensive to find the precise result, but there are efficient estimating methods.
     • Example Queries:
       • How many distinct elements are in the data set (i.e. what is the cardinality of the data set)?
       • What are the most frequent elements (the terms "heavy hitters" and "top-k elements" are also used)?
       • What are the frequencies of the most frequent elements?
       • How many elements belong to the specified range (range query, in SQL it looks like SELECT count(v) WHERE v >= c1 AND v < c2)?
       • Does the data set contain a particular element (membership query)?
       • …
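The last example query, membership, is the classic use case for a Bloom filter. The sketch below is a minimal version for illustration. The default sizes and the trick of salting SHA-256 with the hash index are assumptions chosen for brevity, not tuned choices.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting one hash function with the index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every one of the k bits is set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

An item that was added always tests as present. An item that was never added usually tests as absent, but may occasionally collide with bits set by other items. More bits per stored item lowers that false-positive rate.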
  48. Approximated vs. Precise
     • The algorithms are approximate: with high probability they return approximately the correct result (e.g. ±2%).
       • select count(distinct userid) from userlogs;
       • select top(100) of count(*) from orders group by itemname;
       • …
     • Statistical and Probabilistic Analysis. Very interesting!
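The `count(distinct userid)` query is a natural fit for Linear Counting: hash every value into a bitmap of m bits and estimate the cardinality as -m · ln(V), where V is the fraction of bits still zero. The sketch below assumes a 65,536-slot bitmap and SHA-256 as the hash. The function name is made up, and one byte is used per bit for clarity.

```python
import hashlib
import math


def linear_count(items, num_bits=1 << 16):
    """Estimate the number of distinct items with Linear Counting."""
    bitmap = bytearray(num_bits)  # one byte per bit, for clarity
    for item in items:
        digest = hashlib.sha256(str(item).encode()).digest()
        bitmap[int.from_bytes(digest[:8], "big") % num_bits] = 1
    zeros = num_bits - sum(bitmap)
    if zeros == 0:
        raise ValueError("bitmap saturated; use more bits")
    return -num_bits * math.log(zeros / num_bits)


# 50,000 log lines produced by 10,000 distinct users
userlogs = [f"user{i % 10000}" for i in range(50000)]
estimate = linear_count(userlogs)
```

The memory needed is fixed by the bitmap size rather than by the number of distinct values, and duplicates cost nothing because they set the same bit. Accuracy degrades as the bitmap fills up, which is where LogLog-style counters take over.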
  49. Approximated vs. Precise
     • Usually Sample/Hash/Bitmap …
     • Cardinality Estimation
       • Linear Counting
       • LogLog Counting
       • …
     • Frequency Estimation / Heavy Hitters
       • Count-Min Sketch
       • Count-Mean-Min Sketch
       • Stream-Summary
       • …
     • Range Query
       • Array of Count-Min Sketches
       • …
     • Membership Query
       • Bloom Filter
       • …
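Of the frequency estimators listed here, the Count-Min Sketch is the shortest to write down. The version below is a minimal sketch for illustration. The width, depth, and salted SHA-256 hashing are assumptions for the example, not recommended parameters.

```python
import hashlib


class CountMinSketch:
    """Approximate per-item counts in fixed memory; never underestimates."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One column per row, from a hash salted with the row number.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._columns(item):
            self.rows[row][col] += count

    def estimate(self, item):
        # Collisions only ever inflate a counter, so the minimum is the best bound.
        return min(self.rows[row][col] for row, col in self._columns(item))
```

Because every counter an item touches includes its true count, `estimate` can only be too high, never too low. Heavy hitters stand out clearly, while rare items carry the largest relative error.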
  50. Lesson 7: Open Source and Open Spirit
     • Choose your building blocks from an engineering point of view
     • Know your basic building blocks: not just their interfaces, but also their implementations (at least at a high level)
     • Make good use of open source, give back to open source, and make open source better and stronger
  51. Lesson 8: And more …
     • Description and Documents
     • Avoid inventing new interfaces for users
     • From simple to complete, from prototype to product
     • Make the architecture robust, try it, and then improve and complete it.
     • Product vs. Tech. vs. Trick
     • …
  52. Lesson 9: Read Books – Read English Books
  53. Thank You!
  54. Find me outside
     • SlideShare: http://www.slideshare.net/schubertzhang , http://www.slideshare.net/hanborq
     • GitHub: https://github.com/schubertzhang , https://github.com/hanborq
     • Email & Gtalk: schubert.zhang@gmail.com
     • Weibo: @schubertzh
     • LinkedIn: http://cn.linkedin.com/pub/schubertzhang/6/b51/b5b/
     • Blog: http://cloudepr.blogspot.com
     • WeChat: schubertzh
     • Facebook: https://www.facebook.com/schubertzhang
